The nowcasting package provides the tools to make forecasts of monthly or quarterly economic variables using dynamic factor models. The objective is to help the user at each step of the forecasting process, starting with the construction of a database, all the way to the interpretation of the forecasts. The dynamic factor model adopted in this package is based on the articles from Giannone et al. (2008) and Banbura et al. (2011). Although there exist several other dynamic factor model packages available for R, ours provides an environment to easily forecast economic variables and interpret results.
Important economic decisions are made based on current and future conditions. Oftentimes, the variables used to measure such conditions are not available even for the recent past. This is, for instance, the case with US GDP that is published 45 days after the end of the quarter. Similarly, Brazilian GDP is published with a 60-day lag. There is therefore a need for forecasting the current value of given variables. To this end, (Giannone et al. 2008) proposed a statistical model that allows quarterly variables, such as US GDP, to be forecast using a large set of monthly variables released with different lags. GDP forecasts for the current quarter are, furthermore, updated whenever new information is available. Different central banks have shown interest in this methodology, among them the European Central Bank (Angelini et al. 2008; Van Nieuwenhuyze et al. 2008; Bańbura and Rünstler 2011), and the central banks of Ireland (D’Agostino et al. 2008), New Zealand (Matheson 2010) and Norway (Aastveit and Trovik 2012).
Factor models are designed to summarize the variation contained in a large dataset into only a few variables (Stock and Watson 2006). In (Giannone et al. 2008), the authors show how to reduce the information contained in dozens of monthly time series into only two dynamic factors. These two estimated factors, which are initially monthly, are then transformed into quarterly factors and used in a regression against GDP. Various other authors, such as (Chauvet 2001; Marcellino et al. 2003; Forni et al. 2004; Boivin and Ng 2006; D’Agostino et al. 2006; Banbura et al. 2011; Dahlhaus et al. 2015; Stock and Watson 2016), have explored Dynamic Factor Models (DFMs) in time series forecasting and found promising results.
Given the publication lag of many variables, such as GDP, we can either forecast past, current or future values. In order to differentiate between those types of forecasts we adopt the terminology used in (Giannone et al. 2008) and (Banbura et al. 2011). Backcasting refers to forecasting the value of a yet unpublished variable for a past period, while nowcasting will be with respect to the current period. By way of illustration, suppose we want to forecast the GDP for the \(2^{nd}\) quarter of 2018. If the exercise is made during the \(2^{nd}\) quarter of 2018, then the forecast is classified as nowcasting. However, if the current date is before the \(2^{nd}\) quarter of 2018, then the term used is forecasting. Finally, if the date is after the \(2^{nd}\) quarter of 2018 and the GDP has not yet been released, then the forecast is classified as backcasting.
The aim of the package nowcasting is to offer the tools for the R user to implement dynamic factor models. The different steps in the forecasting process and the associated functions within the package are based on the literature. We have chosen to divide the process into 4 main steps: 1) constructing a dataset; 2) defining the model’s initiation parameters; 3) forecasting; 4) presenting results. This particular division will be maintained in most sections.
This brings us to the article’s sections that are organized as follows: 1) the theoretical framework is introduced; 2) the functions of our package are presented; 3) working examples of how to nowcast Brazilian GDP and of the New York FED nowcasting are given; 4) and finally the last section concludes with some considerations.
Let \(x_t = (x_{1,t},x_{2,t}, ..., x_{N,t})^{'}\) be the vector representing \(N\) monthly time series transformed to satisfy the weak stationarity assumption. The general specification of the dynamic factor model is given by:
\[\begin{aligned} x_t &= \mu + \Lambda f_t + \varepsilon_t \label{eq_xt} \\ \end{aligned} \tag{1}\]
\[\begin{aligned} f_{t} &= \sum_{i=1}^{p} A_i f_{t-i} + B u_t, \quad u_t \sim i.i.d. N(0,I_q) \label{eq_ft} \end{aligned} \tag{2}\]
In equation (1), the variables \(x_t\) are expressed as a function of an intercept \(\mu\) and \(r\) unobserved common factors \(f_t\). Since all variables \(x\) will later be demeaned, one may drop the unconditional means \(\mu\). The variables \(x_t\) will be loaded into the unobserved factors \(f_t\) through \(\Lambda\). Equation (2) imposes the structure of a VAR(p) process on the factors \(f_t\). Both \(\varepsilon_t\) and \(u_t\) are normal, allowing the use of the Kalman Filter. Furthermore, the vector of idiosyncratic component \(\varepsilon_t\) is unrelated to \(u_t\) at all lags, i.e., \(E[\epsilon_t u_{t-k}'] = 0\) for any k. An interesting feature of equation (2) is that the number of shocks \(q\) to the factors need not be equal to the number of factors \(r\). Structural breaks or lead/lag relationships of the \(r\) factors with \(q\) common shocks may motivate such a modeling choice (see (Stock and Watson 2016) for more information).
In the so-called exact dynamic factor model, the error components from equation ((1)) are assumed to be mutually uncorrelated at all lags, i.e., \(E[\varepsilon_{i,t} \varepsilon_{j,s}] = 0\) for \(i \neq j\). However, following (Banbura et al. 2011), the error term could be modeled as an AR(p’) process:
\[\label{eq_arq} \varepsilon_{i,t} = \sum_{j=1}^{p'} \alpha_{i,j} \varepsilon_{i,t-j} + e_{i,t}, \quad e_{i,t} \sim i.i.d. N(0,\sigma^2_i) \tag{3}\]
where \(E[e_{i,t} e_{j,s}] = 0\) for \(i \neq j\).
Following is an example, in matrix form, of equation ((2)) of the model for orders \(r = 2\), \(p = 2\) and \(q = 2\).
\[\begin{bmatrix} f_{1,t}\\ f_{2,t}\\ f_{1,t-1}\\ f_{2,t-1} \end{bmatrix} = \begin{bmatrix} a^1_{1,1} & a^1_{1,2} & a^2_{1,1} & a^2_{1,2} \\ a^1_{2,1} & a^1_{2,2} & a^2_{2,1} & a^2_{2,2} \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} f_{1,t-1}\\ f_{2,t-1}\\ f_{1,t-2}\\ f_{2,t-2} \end{bmatrix} + \begin{bmatrix} b_{1,1} & b_{1,2}\\ b_{2,1} & b_{2,2}\\ 0 & 0\\ 0 & 0 \end{bmatrix} \begin{bmatrix} u_{1,t}\\ u_{2,t} \end{bmatrix}\]
\[F_t = \begin{bmatrix} A_1 & A_2 \\ I_2 & 0 \end{bmatrix} F_{t-1} + B u_t\]
In order to predict a quarterly variable using monthly data, we construct a partially observed monthly counterpart of the quarterly variable as proposed in (Mariano and Murasawa 2003). This allows, for instance, quarterly GDP to be explained by monthly variables. Continuing with this example, let \(Y_t^M\) be the level of the unobservable monthly GDP level and \(Y_t^Q\) the quarterly value of GDP for the partially observable monthly series. As is usual in the literature, we let quarterly GDP be observable in the third month of the quarter.
\[\label{eq::partially observable quarterly variable} Y_t^Q = \begin{cases} Y_t^M + Y_{t-1}^M + Y_{t-2}^M, & t = 3,6,9, \dots\\ \text{unobserved} & \text{ otherwise } \end{cases} \tag{4}\]
The above accounting rule states that the quarterly GDP flow is equal to the sum of the monthly flows. Looking at the quarterly change, \(y_t^Q = Y_t^Q - Y_{t-3}^Q\) ^{1}, it is easy to show that it can be expressed as a function of the differences of the monthly variable, \(y_t = Y_t^M - Y_{t-1}^M\), by using equation (4):
\[\begin{split} y_t^Q &= Y_t^Q - Y_{t-3}^Q \\ &= Y_t^Q + Y_{t-1}^Q - Y_{t-1}^Q + Y_{t-2}^Q - Y_{t-2}^Q - Y_{t-3}^Q\\ &= y_t + 2y_{t-1} + 3y_{t-2} + 2y_{t-3} + y_{t-4}, \quad t = 6,9, \dots \end{split}\]
Suppose that the variable of interest is a quarterly rate of change, \(x_t^Q\), defined as:
\[\label{eq::quarterly growth} x_t^Q \equiv \log(Y_t^Q) - \log(Y_{t-3}^Q) \tag{5}\]
Stating the approximation between the arithmetic and geometric means we have: \[\label{eq::arithmetic geometric approximation} \frac{1}{3}[Y_t^M+Y_{t-1}^M+Y_{t-2}^M] \approx \sqrt[3]{Y_t^MY_{t-1}^MY_{t-2}^M} \tag{6}\]
Combining equations (5) and (6) we obtain the approximation from (Mariano and Murasawa 2003) that expresses the quarterly growth rate of GDP as a function of the unobservable monthly growth rates \(x_t^M\):
\[\label{eq:Mariano&Murusawa} x_t^Q \approx \frac{1}{3} \bigg[x_t^M + 2x_{t-1}^M + 3x_{t-2}^M + 2x_{t-3}^M + x_{t-4}^M\bigg] (\#eq:Mariano&Murusawa)\]
Suppose that the unobserved monthly growth rate \(x_t^M\) also admits the same factor representation as in equation (1) with loadings \(\Lambda_Q\), then the quarterly GDP growth rate, \(x_t^Q\), can be expressed as a function of monthly factors.
\[\label{eq::quarterly loadings} x_t^Q = \overline{\Lambda_{Q}} \ \bigg[f'_t \ \dots \ f_{t-4}'\bigg]' + \bigg[1 \ 2 \ 3 \ 2 \ 1 \bigg] \bigg[\varepsilon_t^M \ \dots \ \varepsilon_{t-4}^M\bigg]' \tag{7}\]
where \(\overline{\Lambda_{Q}} = [\Lambda_Q \ 2\Lambda_Q \ 3\Lambda_Q \ 2\Lambda_Q \ \Lambda_Q]\) is a restricted matrix of loadings on the factors and their lags. Note that the errors are normal in the exact dynamic factor model or have an AR(1) structure as in (Banbura et al. 2011).
We follow the papers by Bai and Ng (2002) and Bai and Ng (2007) to respectively define 1) the number \(r\) of factors in equation ((1)) and 2) the number of shocks \(q\) to the factors in equation ((2)).
Let \(V(r,\hat{F}^r)\) be the sum of squared residuals when \(r\) factors are estimated using principal components. The the information criteria can then be written as follows:
\[\label{eq_ICP1} IC_{r1}(r) = ln(V(r,\widehat{F}^r)) + r\bigg(\frac{N+T}{NT}\bigg)ln\bigg(\frac{NT}{N+T}\bigg) \tag{8}\]
\[\label{eq_ICP2} IC_{r2}(r) = ln(V(r,\widehat{F}^r)) + r\bigg(\frac{N+T}{NT}\bigg)ln\big(\min\{N,T\}\big) \tag{9}\]
\[\label{eq_ICP3} IC_{r3}(r) = ln(V(r,\widehat{F}^r)) + r\bigg(\frac{ln(\min\{N,T\})}{min\{N,T\}}\bigg) \tag{10}\]
The chosen number of factors \(r^*\) will then correspond to \(\arg \min_r IC_{ri}(r)\), for \(i \in \{1,2,3\}\). Equations ((8)), ((9)), and ((10)) are asymptotically equivalent, but may nevertheless give significantly different results for finite samples. To this effect, observe that the penalty in equation ((9)) is highest when considering finite samples.
The number of shocks q
can be lower than the number of factors r
.
Once the number of factors is determined, we use an information
criterion from Bai and Ng (2007) to estimate the number of shocks \(q\) in
equation ((2)). Let \(\widehat{F}_t\) be the \(r\) factors
estimated using principal components and let \(\widehat{u}_t\) be the
residuals from the VAR \(A(l) \widehat{F}_t = \widehat{u}_t\). The idea is
to check whether the eigenvalues of the variance-covariance matrix
\(\widehat{\sum}_u\) are different from 0. Numerically, we will therefore
want to test whether a given eigenvalue is below a predefined tolerance
level.To this end, define the eigenvalues
\(c_1 > c_2 \geq ... \geq c_r \geq 0\) of \(\widehat{\sum}_u\) and define
the k\(^{th}\) normalization of the k+1\(^{th}\) eigenvalue
\[\widehat{D}_k = \bigg(\frac{c^2_{(k+1)}}{\sum^r_{j=1}c^2_j}\bigg)^{1/2}\]
Then for some \(0<m<\infty\) and \(0<\delta<1/2\) that set the tolerance level, define the vector K
\[\label{ICQ3} \textit{K} = \{k: \widehat{D}_k < m/ \min[N^{1/2-\delta},T^{1/2-\delta}] \} \tag{11}\]
where the estimated number of shocks to the factors will be \(\hat{q} = \min \{ k \in \textit{K}\}\). This estimator will converge in probability towards the real number of shocks given that \(r\) is the real number of factors.
We will describe two methodologies for estimating dynamic factors: Two-Stage and Expectation-Maximization.
Two-Stage: This approach is described in (Giannone et al. 2008) and refers to the exact DFM. In the first stage, the parameters of the matrices \(\Lambda\) and \(f_t\) are estimated by Principal Components Analysis (PCA) using a standardized, balanced panel (\(\overline{X}_t\)), in which there are no missing values and outliers. Standardization is important as PCA is not scale invariant. The estimators \(\widehat\Lambda\) and \(\widehat{f}_t\) can be obtained by solving the following optimization problem:
\[\label{otimiza_ft} \min_{f_1,...,f_T,\Lambda} \frac{1}{NT} \sum_{t=1}^T (\overline{X}_t -\Lambda f_t)'(\overline{X}_t -\Lambda f_t) \quad s.t. \quad N^{-1} \Lambda'\Lambda = I_r \tag{12}\]
The estimator for the variance and covariance matrix for \(\varepsilon_t\) is then given by
\[\widehat{\Psi} = diag\Bigg(\frac{1}{T} \sum_{t=1}^T (\overline{X}_t -\widehat{\Lambda} \widehat{f}_t)(\overline{X}_t -\widehat{\Lambda} \widehat{f}_t)'\Bigg)\]
According to (Stock and Watson 2011), the solution to ((12)) is to set \(\widehat{\Lambda}\) equal to the eigenvectors of the variance and covariance matrix of \(\overline{X}_t\) associated with the \(r\) largest eigenvalues, from which it follows that the vector \(\widehat{f_t}\) is the \(r\) first principal components of \(\overline{X}_t\). The coefficients of the matrix \(A_i\), \(i = 1,2,...,p,\) from equation (2), are estimated by OLS regression of \(f_t\) on \(f_{t-1},...,f_{t-p}\). Finally, \(BB'\) is estimated as the covariance matrix of the residuals of this regression.
In the second stage, Kalman smoothing (Durbin and Koopman 2012) is used to re-estimate the factors for the unbalanced panel \(x_t\) considering the parameters obtained in the previous step. There are some R packages that implemented the Kalman smoothing (Tusell 2011). However, for convenience, in the nowcasting package, we used the routine provided by (Giannone et al. 2008). Furthermore, two options are provided when estimating the factors:
y
of lower frequency than the explanatory
variables. Factors are estimated using the monthly explanatory
variables \(x\), after which the transformation from
(Mariano and Murasawa 2003) is applied in order to obtain factors
representing quarterly quantities. Those will be used to
forecast the dependent variable in the bridge equation
(13).\[\label{reg} y_t = \beta_0 + \beta' \widehat{f}_t + e_t \tag{13}\]
The parameters of equation (13) are estimated by OLS, and the forecast for \(y_{t+h}\) is given by
\[\widehat{y}_{t+h} = \widehat{\beta_0} + \widehat{\beta}' \widehat{f}_{t+h}\]
Expectation-Maximization: This estimation method is able to deal
with arbitrary patterns of missing values as shown in
(Bańbura and Modugno 2014). It is therefore less restrictive than the
Two-Stage method with regards to the frequencies of the variables
and allows for a mixed frequency database. Following
(Banbura et al. 2011), factors can be defined for different subgroups
of variables and no longer all need to be global as in the
Two-Stage estimation method. Below, we illustrate a case where
three factors are partitioned into three groups (global, real and
nominal) as in (Banbura et al. 2011). Rewriting equation
(1) accordingly gives equation
(14). As opposed to the Two-Stage estimation
method that builds on an exact dynamic factor model, the error term
is defined as an AR(1) process. A more restrictive assumption than
the Two-Stage method is that the number of shocks to the factors
q
is set equal to the number of factors r
.
\[\label{eq::block_structure} x_t = \mu + \begin{pmatrix} \Lambda_{N,G} & \Lambda_{N,N} & 0\\ \Lambda_{R,G} & 0 & \Lambda_{R,R} \end{pmatrix} \begin{pmatrix} f_t^G \\z f_t^N \\ f_t^R \end{pmatrix} + \varepsilon_t\\ \tag{14}\]
where
\[\begin{pmatrix} \Lambda_{N,G} & \Lambda_{N,N} & 0\\ \Lambda_{R,G} & 0 & \Lambda_{R,R} \end{pmatrix} = \Lambda\\\]
\[\begin{pmatrix} f_t^G \\ f_t^N \\ f_t^R \end{pmatrix} = f_t\\\]
The global factor is estimated considering all the explanatory variables, while the estimates of the nominal and real factors only consider variable classified, respectively, as nominal and real. The parameter \(\mu\) is a vector of constants of dimension N. As previously mentioned, the alternative proposed by (Banbura et al. 2011) to the exact DFM, allows for serial autocorrelation among the error of equation (1) along an \(AR(1)\) process:
\[\varepsilon_{i,t} = \alpha_i \epsilon_{i,t-1} + e_{i,t}, \quad e_{i,t} \sim i.i.d. N(0,\sigma_i^2)\] where \(E[e_{i,t} e_{j,s}] = 0\) for \(i \neq j\).
In this model, the parameters, the unobserved common factors and the missing values are estimated through the Expectation-Maximization algorithm, which uses the following recursive structure:
Convergence is achieved when the absolute change in the value of the log-likelihood function is less than \(10^{-4}\), the tolerance level used for this algorithm. The recursive process starts with the PCA estimates given in (Giannone et al. 2008) (first stage of the Two-Stage method).
The first step in the nowcasting process is to prepare the data in a way
that is compatible with the proposed models and estimation methods. One
of the motivations of the presented models is the forecasting
improvements that can be achieved by using higher frequency variables.
More specifically, the gains that can be obtained in using monthly
variables to forecast quarterly series. Hence, all functions require
monthly mts
objects. In practice, the quarterly variables are usually
represented as monthly variables for which the last month of the quarter
is observed. As illustrated in the working examples, such
straightforward transformations from one frequency representation to
another can be achieved by using the functions qtr2month()
or
month2qtr()
.
With regards to the estimation methods, different inputs may have to be
provided. As a matter of fact, the Two-Stage method is more
restrictive on the format of the variables as it depends on principal
components in the first stage. This requires a strategy to deal with
missing values, which are not part of the jagged edge, beforehand.
(Giannone et al. 2008) propose to replace such missing values with the
median of the series that are then smoothed with a moving average. Since
such a strategy assigns a value that is independent of the information
contained in other contemporaneous variables, it is advisable to exclude
series with many missing values. The EM algorithm, however, is able to
deal with missing values in a way that uses the information contained in
other variables and might therefore not require discarding such
variables. Finally, independently of the estimation method, stationary
series are required. The usual transformations for making time series
stationary and the different strategies to deal with missing values have
been included in the function Bpanel()
that prepares the database for
the nowcasting function. Since these choices require careful attention,
the function Bpanel()
is explained in further detail.
Bpanel(base, trans, NA.replace = TRUE, aggregate = FALSE, k.ma = 3, na.prop = 1/3, h = 12)
trans
is a vector indicating the transformations to be applied to the
variables. For most cases, the available transformations are
sufficient to make economic variables stationary. The transformation
must be specified by using one of the following values for the
argument trans
:
trans = 0
: the observed series is preserved;
trans = 1
: monthly rate of change:
\(\displaystyle\frac{x_{i,t} - x_{i,t-1}}{x_{i,t-1}}\);
trans = 2
: monthly difference: \(x_{i,t} - x_{i,t-1}\);
trans = 3
: monthly difference in year-over-year rate of
change:
\[\displaystyle\frac{x_{i,t} - x_{i,t-12}}{x_{i,t-12}} - \frac{x_{i,t-1} - x_{i,t-13}}{x_{i,t-13}};\]
trans = 4
: monthly difference in year-over-year difference:
\[(x_{i,t} - x_{i,t-12}) - (x_{i,t-1} - x_{i,t-13}).\]
trans = 5
: year difference:
\[(x_{i,t} - x_{i,t-12})\]
trans = 6
: year-over-year rate of change:
\[\frac{x_{i,t} - x_{i,t-12}}{x_{i,t-12}}\]
trans = 7
: quarterly rate of change
\[\frac{x_{i,t} - x_{i,t-3}}{x_{i,t-3}}\]
NA.replace
is a boolean to determine whether missing values should be replaced
(NA.replace = TRUE
) or not (NA.replace = FALSE
).
aggregate
is a boolean to indicate whether to aggregate the monthly variables
to represent quarterly quantities. If TRUE
the aggregation is made
following the approximation of (Mariano and Murasawa 2003).
k.ma
is a numeric representing the degree of the moving average
correction if NA.replace = TRUE
.
na.prop
is a number between 0 and 1 indicating the ratio of missing observations to the total number of observations beyond which series will be discarded. The default is 1/3, meaning that if more than 1/3 of the observations are missing the series will be discarded from the database.
h
indicates how many periods should be added to the database. Default
is 12. Those missing values will be predicted with the function
nowcast()
.
As explained in the section on parameter estimation, the package offers different functions to estimate the number of factors \(r\) and of idiosyncratic shocks \(q\) of equations ((1)) and ((2)) respectively.
Function ICfactors()
estimates the number of factors \(r^*\)
according to an information criterion. The argument x
is a
balanced panel and rmax
is an integer representing the maximum
number of factors for which the information criterion should be
calculated. The default value is 20. type
indicates which of the
information criterion from Bai and Ng (2002) to use. type
\(\in \{1,2,3\}\)
with the default being 2 as explained in the methodological section.
If x
is not a balanced panel, the function will delete rows with
missing values in order to use principal components.
ICfactors(x, rmax = 20, type = 2)
Function ICshocks()
estimates the number of idiosyncratic shocks
given a number \(r\) of factors according to the information criterion
introduced in the previous section. The argument x
is a balanced
panel. delta
and m
are parameters of the information criterion,
where \(0<m<\infty\) and \(0<\delta<1/2\). The default values are those
from Bai and Ng (2007): \(m=1\) and \(\delta = 0.1\). If the number of factors
r
is not specified it will be defined according to
ICfactors(x, rmax = 20, type = 2)
. p
is the number of lags in
the VAR of equation ((2)). If not specified, the default
is the lowest most occurring value from the information criteria
used within the function VARselect()
from the package vars.
ICshocks(x, r = NULL, p = NULL, delta = 0.1, m = 1)
An important feature of factor models is the dimensionality reduction of
(many) original variables into a few common factors. Hence, the target
variable y
will be expressed as a function of a few factors extracted
from the explanatory variables. This motivated the choice of the inputs
for the nowcast()
function. The formula format, which is well known to
R users, captures this idea as formula = y
\(\sim\).
can
be understood as the projection of y
on the information contained in
the dataset. The model’s parameters are estimated according to the
selected method (2s
, 2s_agg
and EM
, which correspond,
respectively, to “two-stage”, “two-stage with factor aggregation” and
“Expectation-Maximization algorithm”) described in the section on
estimation. The number r
of dynamic factors, the number q
of shocks
to the factors, and the lag order p
of the factors are determined
beforehand as shown in the previous subsection. The argument blocks
can be used with the EM
method to estimate factors for different
subgroups of variables. Finally, the argument frequency
is necessary
for all methods in order to identify the frequency of the variables.
nowcast(formula, data, q = NULL, r = NULL, p = NULL, method = 'EM', blocks = NULL,
frequency = NULL)
In the first two methods (2s
and 2s_agg
), the factors are calculated
based on the monthly variables, on which the dependent variable y
will
be regressed. The difference between 2s
and 2s_agg
is that for the
latter the monthly factors are transformed into quarterly quantities
while in the former no such aggregation is used. A linear regression
(bridge equation if y
is quarterly) of y
on the factors allows the
former to be forecast.
In the third method (EM
) no bridge equation is needed, as opposed to
the Two-Stage method. In practice, the algorithm will estimate all the
missing values respecting the restrictions imposed by equation
(7). The forecasts of quarterly time series are
defined as the estimated values of the third month of the out of sample
quarters. As opposed to the Two-Stage method, the number of common
shocks q
can not be specified and is assumed to be equal to r
, the
number of factors in each block.
The function nowcast.plot()
allows to plot several outputs from the
function nowcast()
.
nowcast.plot(out, type = "fcst")
The argument out
is the output from the function nowcast()
. The
argument type
can be chosen from the list
{"fcst","factors","eigenvalues","eigenvectors"}:
"fcst"
: shows the y
variable and its forecasts in sample and out
of sample."factors"
: shows all the estimated factors."eigenvalues"
: indicates how much of the variability in the
dataset is explained by each factor."eigenvectors"
: shows the importance of each variable in the first
factor.In this example we showcase how to nowcast Brazilian GDP using the
Two-Stage estimation method. Most of the variables of interest can be
downloaded from the Brazilian central bank using the function
BETSget()
from the package BETS. The variables and the associated
codes can be found on the Brazilian central bank’s website ^{2}. For the
sake of simplicity we have included the database, and all relevant
information within the package^{3}.
> library(nowcasting)
> data(BRGDP)
For this example we will construct a pseudo real-time dataset, using the
function PRTDB()
. Some variables, such as GDP, suffer revisions over
time. Since we do not take revisions into account, we refer to such
datasets as pseudo real-time (as opposed to vintages). The (approximate)
delays in days are included in the BRGDP
object and will be used to
define if observations were available at a specific moment in time. The
dataset is then treated for outliers and missing values that are not
part of the jagged edges of the data, i.e., that are not due to the
different publication lags of the variables. This is achieved through
the function Bpanel()
. Unless otherwise specified by the user, the
function will also discard series with over 1/3 missing values.
> vintage <- PRTDB(mts = BRGDP$base, delay = BRGDP$delay, vintage = "2015-06-01")
> base <- window(vintage, start = c(2005,06), frequency = 12)
> x <- Bpanel(base = base, trans = BRGDP$trans)
The function month2qtr()
transforms monthly time series into quarterly
ones. In this case we want to use the value of the third month as the
quarterly value.
> GDP <- base[,which(colnames(base) == "PIB")]
> window(GDP, start = c(2015,1))
Jan Feb Mar Apr May Jun2015 NA NA 170.68 NA NA NA
> GDP_qtr <- month2qtr(x = GDP, reference_month = 3)
> window(GDP_qtr, start = c(2015,1))
Qtr1 Qtr22015 170.68 NA
The quarterly GDP indicator, in this example, is an index representing the seasonal quarterly product. \(\Delta_4Y_t\) deals with seasonality, while \(\Delta\Delta_4Y_t\) is necessary to obtain a stationary time series. To test the latter, one could look at tests for unit roots or serial auto correlation that are included in many R packages.
> y <- diff(diff(GDP_qtr,4))
> y <- qtr2month(y)
The dataset x
, which now only posses jagged edges, is well suited for
the information criteria that make use of principal components. The
estimated number of factors is given by the function ICfactors()
. As
explained in the previous section, the information criteria might give
different results for finite samples.
> ICR1 <- ICfactors(x = x, type = 1)
> ICR2 <- ICfactors(x = x, type = 2)
Finally, given the chosen number of factors for our model, we can use an information criterion for determining the number of shocks to the factors.
> ICQ1 <- ICshocks(x = x, r = 2, p = 2)
> ICQ1$q_star
1] 2 [
Let the object data
be a monthly mts
object where the first column
is a partially observable stationary GDP series (y
) and the remaining
columns a balanced panel of stationary time series (x
). The
frequency
vector will be determined by the quarterly GDP series and
the remaining monthly series. In this example the factors will be
aggregated to obtain quarterly quantities by setting
method = "2s_agg"
.
> data <- cbind(y,x)
> frequency <- c(4,rep(12,ncol(x)))
> now <- nowcast(formula = y~., data = data, r = 2, q = 2 , p = 2, method = "2s_agg",
frequency = frequency)
> summary(now$reg)
:
Call::lm(formula = Y ~ ., data = Balanced_panel)
stats
:
Residuals1Q Median 3Q Max
Min -3.0248 -0.5679 0.1094 0.5835 1.8912
:
CoefficientsPr(>|t|)
Estimate Std. Error t value -0.19526 0.16940 -1.153 0.258
(Intercept) 0.22610 0.01456 15.528 < 2e-16 ***
Factor1 0.06135 0.01174 5.228 1.02e-05 ***
Factor2 ---
: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Signif. codes
: 1.002 on 32 degrees of freedom
Residual standard error-squared: 0.8995, Adjusted R-squared: 0.8932
Multiple R-statistic: 143.1 on 2 and 32 DF, p-value: < 2.2e-16 F
The function nowcast.plot()
enables the user to visualize some of the
results. Say, for instance, that we want to look at fitted values and
out-of-sample forecasts. This can be achieved by setting the type to
"fcst"
. We might also want to look at the eigenvalues of the
normalized variance-covariance matrix of our balanced panel or at how
variables enter the first factor.
> nowcast.plot(now, type = "fcst")
> nowcast.plot(now, type = "eigenvalues")
> nowcast.plot(now, type = "eigenvectors")
Up until now, we have been forecasting GDP after transforming it into a stationary variable. We might want to transform the former back into a level variable in order to forecast the actual growth rate. Remember that we transformed GDP according to
\[\begin{split} diff(diff(GDP_t,4)) & = (GDP_t - GDP_{t-4}) - (GDP_{t-1} - GDP_{t-5}) \\ & = GDP_t + GDP_{t-5} - GDP_{t-1} - GDP_{t-4} \end{split}\]
that can be rewritten as
\[\label{y_fcst} GDP_t = diff(diff(GDP_t,4)) - GDP_{t-5} + GDP_{t-1} + GDP_{t-4} \tag{15}\]
Equation ((15)) gives us the forecast of the new quarter GDP
level. The variable BRGDP$GDP
is the non-stationary GDP.
> level_forecast <- na.omit(now$yfcst[,3])[1] - tail(na.omit(GDP_qtr),5)[1] +
+ + tail(na.omit(GDP_qtr),5)[5] + tail(na.omit(GDP_qtr),5)[2]
> level_forecast
1] 170.4783 [
> position_q2_2015 <- which(time(BRGDP$GDP) == 2015.25)
> BRGDP$GDP[position_q2_2015]
1] 169.24 [
In this example we work with the data the Federal Reserve of New York made available to reproduce its weekly nowcasting report^{4}. The explanatory variables are mixed frequencies including both monthly and quarterly series.
> library(nowcasting)
> data(NYFED)
> NYFED$legend$SeriesName
1] "Payroll Employment" "Job Openings"
[3] "Consumer Price Index" "Durable Goods Orders"
[5] "Retail Sales" "Unemployment Rate"
[7] "Housing Starts" "Industrial Production"
[9] "Personal Income" "Exports"
[11] "Imports" "Construction Spending"
[13] "Import Price Index" "Core Consumer Price Index"
[15] "Core PCE Price Index" "PCE Price Index"
[17] "Building Permits" "Capacity Utilization Rate"
[19] "Business Inventories" "Unit Labor Cost"
[21] "Export Price Index" "Empire State Mfg Index"
[23] "Philadelphia Fed Mfg Index" "Real Consumption Spending"
[25] "Real Gross Domestic Product" [
Similarly to the previous working example, the object NYFED
contains
all the necessary information to run the nowcast()
function. The time
series, the block structure, the transformations to make the variables
stationary and the variables’ frequencies can be loaded as illustrated
below.
> base <- NYFED$base
> blocks <- NYFED$blocks$blocks
> trans <- NYFED$legend$Transformation
> frequency <- NYFED$legend$Frequency
> delay <- NYFED$legend$delay
The dataset data
can be prepared by using the function Bpanel()
.
Using the EM algorithm, there is no need to replace missing values that
are not part of the jagged edges, as was the case with the Two-Stage
method. This can be achieved by setting NA.replace
to FALSE
. In this
case we do not want to discard series based on a particular ratio of
missing values to total observations as was the case in the Two-Stage
method. This is done by setting na.prop = 1
, where 1 indicates that
only series with more than 100% missing values will be discarded.
> data <- Bpanel(base = base, trans = trans, NA.replace = FALSE, na.prop = 1)
The model´s specifications are the same as those used by the NY FED. We
therefore limit the number of factors, r
, per block to one and define
the factor process as a VAR(1), i.e., p = 1
. The convergence of the
log-likelihood function is displayed every 5 iterations.
> nowEM <- nowcast(formula = GDPC1~., data = data, r = 1, p = 1, method = "EM",
blocks = blocks, frequency = frequency)
5th iteration:
-2418.5983 to -2406.1482
The loglikelihood went from
...65th iteration:
-2354.084 to -2353.8435 The loglikelihood went from
Combining the functions nowcast()
and PRTB()
within a loop, we
illustrate how a pseudo out-of-sample end-of-quarter nowcast can be
made. The vector fcst_dates
defines the last month of the quarters for
which quarterly GDP growth will be nowcast. The vector delay
contains
approximate delays, in days, with which variables are published. This
enables us to construct a pseudo real-time dataset for a given day.
> fcst_dates <- seq.Date(from = as.Date("2013-03-01"),to = as.Date("2017-12-01"),
by = "quarter")
> fcst_results <- NULL
> for(date in fcst_dates)\{
+
+ vintage <- PRTDB(data, delay = delay, vintage = date)
+ nowEM <- nowcast(formula = GDPC1~., data = vintage, r = 1, p = 1, method = "EM",
blocks = blocks, frequency = frequency)
+ fcst_results <- c(fcst_results,tail(nowEM$yfcst[,3],1))
+
+ \}
The results of this out-of-sample nowcast example, as well as the results of an out-of-sample ARIMA, are displayed below.
The root mean square prediction error can easily be calculated for the 2013-2016 period. For this given example, when compared to one-period-ahead projections given by an ARIMA model, a Theil’s U statistic of 0.70 is obtained, signaling a 30% improvement over the benchmark.
The package nowcasting was developed in order to facilitate the use of
dynamic factor models for large datasets as set out in Giannone et al. (2008)
and Banbura et al. (2011). The package offers functions at each step of the
forecasting process to help the user treat data, choose and estimate the
value of parameters, as well as interpret results. We provided a working
example for nowcasting Brazilian GDP, illustrating each step and showing
how to implement the various functions available. We also used the New
York FED nowcasting exercise to illustrate the EM algorithm. We will, in
the future, work on adding new tools for the user to better leverage the
EM
method by identifying the source of forecast revisions. As shown by
the New York FED nowcasting report, this is an interesting policy
instrument that helps contextualizing forecast updates.
We thank Daniel Mesquita for revising some of the codes and our colleagues from FGV-IBRE for helpful inputs. We also thank an anonymous referee and the R journal editor Olivia Lau for constructive comments. The authors are responsible for any errors in this paper. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Valk, et al., "Nowcasting: An R Package for Predicting Economic Variables Using Dynamic Factor Models", The R Journal, 2019
BibTeX citation
@article{RJ-2019-020, author = {Valk, Serge de and Mattos, Daiane Marcolino de and Ferreira, Pedro Guilherme Costa}, title = {Nowcasting: An R Package for Predicting Economic Variables Using Dynamic Factor Models}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {230-244} }