This paper introduces the R package ForecastComb. The aim is to provide researchers and practitioners with a comprehensive implementation of the most common ways in which forecasts can be combined. The package in its current version covers 15 popular estimation methods for creating a combined forecast, including simple methods, regression-based methods, and eigenvector-based methods. It also includes useful tools to deal with common challenges of forecast combination (e.g., missing values in component forecasts, or multicollinearity), and to rationalize and visualize the combination results.
Model uncertainties and/or model instabilities are deeply rooted in real-life forecasting challenges. More often than not, and particularly in the soft sciences including econometrics, the underlying data-generating process is not well understood.
In the business of time series forecasting, accuracy is key. The advent of “black-box” statistical learning methods (such as artificial neural networks) testifies to the fact that forecasting practitioners and researchers alike tend to favor accuracy over explainability of a forecast model. There is a strong consensus in the social sciences that the observed processes are too complex to be modelled perfectly. Excluding some natural sciences, it is generally undisputed that the underlying data-generating process is unknown. Even in the unrealistic case where the structural form of a data-generating process is more or less known, except for the parameter values that have to be estimated, a misspecified model can produce better forecasts than a correctly specified model due to parameter estimation uncertainty.
Hence, statistical models are habitually too simple, misspecified, and/or incomplete. This fact is widely accepted in theory, but less widely applied in practice in the field of econometrics, where researchers often still hang on to the conceptual error of assuming one true data-generating process and put too much focus on model selection to find the one true model. (Hansen 2005) takes note of this and related misconceptions of econometric forecasting practice in his essay on the challenges for econometric model selection: “models should be viewed as approximations, and econometric theory should take this seriously”. An obvious alternative to choosing a single ‘best’ forecasting method is to combine forecasts from different models. A combined forecast can also be thought of as one originating from a very complex, overarching model that nests different simpler models in order to better approximate the data-generating process. By now it is well established, empirically and in many different settings, that combining different forecasts delivers results which are “better than the best”.
Following the advice in (Hansen 2005), we abandon the quest for the one true correctly specified model, so that we are free to include information from different models. In the context of forecasting, this means combining forecasts from different sources (models, experts). While this freedom, as seen empirically, improves forecast accuracy, a forecast combined from different models is generally more intricate to interpret than a forecast generated from a single model. This emphasizes the aforementioned shift in approaching practical econometric forecasting problems: model misspecification cannot be fully rectified, but strategies can be found to mitigate its adverse effects on forecast quality. Selecting a single “best” forecasting model bears the risk of ending up with a model which is only accurate when evaluated using some validation sample, yet might prove unreliable when applied to new data. In the past decades, ample empirical evidence on the merits of combining forecasts has piled up; it is generally accepted that the (mostly linear) combination of forecasts from different models is an appealing strategy to hedge against forecast risk. Even though reduction of forecast risk is the main argument for using combined forecasts, it should also be noted that there are cases when combined forecasts are more accurate than even their best component (e.g. Graefe et al. 2014). While this result lacks a firm theoretical foundation, it is somewhat intuitive that in a continuously changing environment different forecasting models deliver different results at different time periods. For example, it is reasonable for one method to perform well in a low-volatility regime, and for another method to perform poorly in that regime, but well in a high-volatility regime. Strong empirical support for this argument of change in relative performance of different methods over time is found, among others, by (Elliott and Timmermann 2005).
“Hedging” against model risk by combining different forecasts is intuitive and appealing. Joined with accuracy gains, the idea is widely adopted in macroeconomics and finance, and applications abound. (Magnus and Wang 2014) explore growth determinants, (Stock and Watson 2004) use forecast combination for output forecasting, (Wright 2009) and (Kapetanios et al. 2008) for inflation forecasting, (Avramov 2002), (Rapach et al. 2010) and (Ravazzolo et al. 2007) for forecasting stock returns, (Wright 2008) for exchange rate forecasts, (Andrawis et al. 2011) for forecasting inbound tourism figures, (Nowotarski et al. 2014) and (Raviv et al. 2015) for electricity price forecasting, and (Weiss 2017) for healthcare demand forecasting. More recently, forecast combinations have been applied not only in “first moment” applications, but for higher moments as well, for example by (Christiansen et al. 2012) for volatility forecasting and (Opschoor et al. 2014) for Value-at-Risk forecasting.
The theoretical foundations of forecast combination date back five decades to the seminal papers of (Crane and Crotty 1967) and (Bates and Granger 1969). Yet even in the recent past, papers discussing new combination techniques continue to be published in reputable journals and to stimulate further research (Hansen 2007, 2008; Hansen and Racine 2012; Elliott et al. 2013; Cheng and Yang 2015; Morana 2015). Two extensive reviews of the literature, techniques and applications of forecast combinations are (Clemen 1989) and (Timmermann 2006).
Given the topic’s popularity in both theoretical and empirical research, it is somewhat surprising that very few combination techniques are readily available in R: There are some packages that cover specific topics related to combination methods, e.g. the BMA package by (Raftery and Painter 2005) for Bayesian model averaging, as well as the opera package by (Gaillard and Goude 2016) and the forecastHybrid package by (Shaub and Ellis 2017). However, as of now the only two R packages that are entirely devoted to forecast combination are the ForecastCombinations package by (Raviv 2015), which focuses on regression-based combination methods, and the GeomComb package by (Weiss and Roetzer 2016), which focuses on geometric, eigenvector-based combination methods.
Aiming to improve user experience, we have merged these two packages, providing one unified package for the widest range of forecast combination approaches available today in R. The package is flexible and provides enough guidance for users familiar or unfamiliar with the world of forecasting. We have made both regression-based and eigenvector-based combination methods available to users in a single standardized framework based on S3 classes and methods. The logic behind this choice is that comparing regression-based and eigenvector-based combination methods is often insightful – as pointed out by (Hsiao and Wan 2014), the conditions under which these two approaches perform well differ from each other: regression-based methods tend to produce more accurate forecasts when one or a few of the individual forecasts are considerably better than the rest, while eigenvector-based methods perform better when the individual forecasts are in the same ballpark. This paper presents the functionalities made available in the package and demonstrates how to implement them in an empirical exercise.
In the following section we review the forecast combination methods that are available in the ForecastComb package. We then provide a detailed implementation description using the package, with a specific empirical example – we combine univariate time series forecasts for UK energy supply.
Since the seminal paper by (Bates and Granger 1969), myriad combination strategies have been put forward in the theoretical and empirical literature. The best way to combine different forecasts has no theoretical underpinnings; much depends on the specifics of the data at hand. (Andrawis et al. 2011) even suggest using hierarchical forecast combinations, i.e. combining combined forecasts. The purpose here is not to investigate which combination method is suitable in which context. Rather, we provide a broad range of options for the user to explore and experiment with, owing to the lack of a theoretical foundation and of empirical evidence for one dominant way in which forecasts should be combined.
To fix notation, denote \(F_{T \times P}\) as the matrix of forecasts with dimension \(T \times P\), where \(T\) is the number of rows and \(P\) is the number of columns (so we have \(P\) forecasts at each point in time). Denote \(f_i\) as the forecast obtained using model \(i\), \(i \in \{1 \dots P\}\). When there is no danger of confusion, we omit the additional subscript \(t\) which denotes the time dimension of the forecast. Some combination methods require an ordering of the component forecasts. When this is the case, \(f_{(i)}\) denotes the \(i\)th order statistic of the cross-section of component forecasts. Finally, the weight given to that forecast in the overall combined forecast is denoted as \(w_i\), and the combined forecast as \(f^c\).
We now present a few simple ways in which forecasts can be combined. They are simple in that there is no need to explicitly estimate the weight each forecast should be given in the overall combination. We just use some location measure (for example the average) of the cross-sectional distribution of the individual forecasts. In theory some location measures coincide if the cross-sectional distribution is symmetric; but if the distribution is asymmetric, and/or if one method produces extreme forecasts, then the following few combination methods become pertinent.
Simple Average. The most intuitive approach to combine forecasts is using the average of all those forecasts. Over the years this innocent approach has established itself as an excellent benchmark, despite or perhaps because of its simplicity (e.g. Genre et al. 2013). The combined forecast is straightforwardly given by
\[\label{eq:simple} f^c = \frac{1}{P} \sum_{i = 1}^P f_i. \tag{1}\]
(Clemen 1989) argues that this equal weighting of component forecasts is often the best strategy in this context. This finding still holds almost thirty years later and is called the “forecast combination puzzle”, a term coined by (Stock and Watson 2004). A rigorous attempt to explain why simple average weights often outperform more sophisticated forecast combination techniques is provided in a simulation study by (Smith and Wallis 2009), who ascribe this surprising empirical finding to the effect of finite-sample error in estimating the combination weights. Recently, (Claeskens et al. 2016) provide a theoretical argument for these empirical findings. The authors make the case that lower estimation noise, when the weights are determined rather than estimated, goes a long way in explaining the puzzle. A more detailed overview of the empirical support for the “forecast combination puzzle” can be found in (Graefe et al. 2014).
Median. Another fairly simple and appealing combination method is using the median of the component forecasts. The median is insensitive to outliers, which can be relevant for some applications. (Palm and Zellner 1992) suggest that simple averaging may not be a suitable combination method when some of the component forecasts are biased. This calls for the use of another location measure that is robust to outliers. The median method is an appealing, rank-based combination method that has been used in a wide range of empirical studies (e.g. Armstrong 1989; McNees 1992; Hendry and Clements 2004; Stock and Watson 2004; Timmermann 2006).
For the median method, the combined forecast is given by:
For odd \(P\):
\[\label{eq:median_odd} f^c = f_{(\frac{P}{2}+0.5)} \tag{2}\]
For even \(P\):
\[\label{eq:median_even} f^c = \frac{1}{2} \left(f_{(\frac{P}{2})}+f_{(\frac{P}{2}+1)}\right) \tag{3}\]
Trimmed Mean. Another outlier-robust location measure that is commonly used is the trimmed mean (e.g. Armstrong 2001; Stock and Watson 2004; Jose and Winkler 2008).
Using a trim factor \(\lambda\) (i.e. the top/bottom \(100 \times \lambda \%\) are trimmed) the combined forecast is calculated as:
\[\label{eq:trimmed_mean} f^c = \frac{1}{P(1-2\lambda)} \sum_{i = \lambda P + 1}^{(1-\lambda)P} f_{(i)} \tag{4}\]
Typically, we use \(\lambda = 0.1\) indicating we trim the top and bottom 10% of the most extreme component forecasts, excluding those from the computation of the combined forecast. The trimmed mean is an interpolation between the simple average (\(\lambda = 0\)) and the median (\(\lambda = 0.5\)).
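These location-based combinations map directly onto base R. The following is an illustrative sketch, assuming `forecasts` is the \(T \times P\) matrix of component forecasts (the package provides comb_SA, comb_MED, and comb_TA for this purpose):

```r
# Location-based combinations of eqs. (1)-(4) for a T x P forecast matrix:
f_sa   <- rowMeans(forecasts)                    # simple average, eq. (1)
f_med  <- apply(forecasts, 1, median)            # median, eqs. (2)-(3)
f_trim <- apply(forecasts, 1, mean, trim = 0.1)  # trimmed mean, eq. (4), lambda = 0.1
```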
Winsorized Mean. Like the trimmed mean, the winsorized mean is a robust statistic that is less sensitive to outliers than the simple average. It takes a softer line when handling outliers: Instead of altogether removing them as in the trimmed mean approach, it caps outliers at a certain level. By capping outliers rather than removing them, we allow for at least some degree of influence. For this reason, the measure is sometimes preferred, for example by (Jose and Winkler 2008).
Let \(\lambda\) be the trim factor (i.e. the top/bottom \(100 \times \lambda \%\) are winsorized) and \(K= \lambda P\). The combined forecast is then calculated as:
\[\label{eq:wins_mean} f^c = \frac{1}{P} \left[ K f_{(K+1)} + \sum_{i = K+1}^{P-K} f_{(i)} + K f_{(P-K)}\right] \tag{5}\]
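To make equation (5) concrete, a hypothetical helper for a single cross-section of forecasts could look as follows; the package function comb_WA implements this logic internally:

```r
# Winsorized mean of one cross-section of forecasts f, trim factor lambda
# (hypothetical helper, analogous to what comb_WA computes):
winsorized_mean <- function(f, lambda = 0.1) {
  P <- length(f)
  K <- floor(lambda * P)
  f <- sort(f)
  if (K > 0) {
    f[1:K] <- f[K + 1]            # cap the bottom K forecasts at f_(K+1)
    f[(P - K + 1):P] <- f[P - K]  # cap the top K forecasts at f_(P-K)
  }
  mean(f)
}
```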
Bates/Granger (1969). In their seminal paper, (Bates and Granger 1969) introduced the idea of combining forecasts. Their approach builds on portfolio diversification theory and uses the diagonal elements of the estimated mean squared prediction error matrix in order to compute combination weights:
\[\label{eq:BG} f^c = \sum_{i=1}^{P} f_i' \times \frac{\hat{\sigma}^{-2}(i)}{\sum_{j=1}^{P} \hat{\sigma}^{-2}(j)} \tag{6}\]
where \(\hat{\sigma}^{2}(i)\) is the estimated mean squared prediction error of model \(i\).
The approach ignores correlation between component forecasts due to difficulties in precisely estimating the covariance matrix.
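A minimal base-R sketch of equation (6), assuming `errors` is the \(T \times P\) matrix of training-set forecast errors (comb_BG is the package implementation):

```r
# Inverse-MSPE weights of eq. (6):
sigma2 <- colMeans(errors^2)             # diagonal of the estimated MSPE matrix
w_bg   <- (1 / sigma2) / sum(1 / sigma2)
f_bg   <- as.vector(forecasts %*% w_bg)  # combined forecast
```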
Newbold/Granger (1974). Building on the earlier research by (Bates and Granger 1969), the methodology of (Newbold and Granger 1974) also extracts the combination weights from the estimated mean squared prediction error matrix.
Let \(\Sigma\) be the mean squared prediction error matrix of \(F_{T \times P}\) and \(e\) be a \(P \times 1\) vector of \((1,1,...,1)'\). (Newbold and Granger 1974)’s method is a constrained minimization of the mean squared prediction error using the normalization condition \(e'w=1\). This yields the following combined forecast:
\[\label{eq:NG} f^c = F_{T \times P} \times \frac{\Sigma^{-1} e}{e'\Sigma^{-1}e} \tag{7}\]
While the method dates back to (Newbold and Granger 1974), the variant of the method we use in the ForecastComb package does not impose the prior restriction that \(\Sigma\) is diagonal. This approach, used by (Hsiao and Wan 2014), is a generalization of the original method.
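Under the same assumptions as in the previous sketch, the weights of equation (7) can be computed in base R as follows (comb_NG is the package implementation):

```r
# Newbold/Granger weights of eq. (7) from the full (non-diagonal) MSPE matrix:
Sigma <- crossprod(errors) / nrow(errors)    # estimated MSPE matrix
si    <- solve(Sigma, rep(1, ncol(errors)))  # Sigma^{-1} e
w_ng  <- si / sum(si)                        # weights sum to one by construction
f_ng  <- as.vector(forecasts %*% w_ng)
```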
Inverse Rank. The inverse rank approach, suggested by (Aiolfi and Timmermann 2006), ranks the forecast models based on their performance up to time \(T\). The model with the lowest mean squared prediction error is assigned rank 1, the model with the second lowest mean squared prediction error is assigned rank 2, and so on. The combined forecast is then calculated as follows:
\[\label{eq:InvR} f^c = \sum_{i = 1}^{P} f_i' \times \frac{Rank_i^{-1}}{\sum_{j=1}^{P} Rank_j^{-1}} \tag{8}\]
(Timmermann 2006) points out that this method, just like (Bates and Granger 1969), ignores correlations across forecast errors. However, the method is more robust to outliers, since the overall ranking is unlikely to change dramatically in the presence of extreme forecasts.
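A two-line sketch of equation (8) under the same assumptions (the package function is comb_InvW):

```r
# Inverse-rank weights of eq. (8); rank 1 = lowest mean squared prediction error
r      <- rank(colMeans(errors^2))
w_rank <- (1 / r) / sum(1 / r)
f_rank <- as.vector(forecasts %*% w_rank)
```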
Ordinary Least Squares (OLS) regression. The idea to use regression for combining forecasts was put forward by (Crane and Crotty 1967) and successfully driven to the forefront by (Granger and Ramanathan 1984). Using this approach, the combined forecast is a linear function of the individual forecasts where the weights are determined using a regression of the individual forecasts on the target itself:
\[\label{eq:olstrain} y = {\alpha} + \sum_{i = 1}^P {w_i} f_i +\varepsilon, \tag{9}\]
Using a portion of the forecasts to train the regression model, the OLS coefficients can be estimated by minimizing the sum of squared errors in equation (9). The combined forecast is then given by:
\[\label{eq:ools} f^c = \widehat{\alpha} + \sum_{i = 1}^P \widehat{w}_i f_i. \tag{10}\]
An advantage of the OLS forecast combination is that the combined forecast is unbiased due to the intercept in the equation, even if one of the individual forecasts is biased. A disadvantage is that the method places no restriction on the combination weights (i.e., they need not add up to 1 and can be negative), which complicates interpretation, especially if the coefficients are non-convex.
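Equations (9) and (10) amount to a standard linear regression of the target on the component forecasts; a minimal sketch with lm(), assuming `y` holds the training-set actuals (comb_OLS is the package implementation):

```r
# OLS combination of eqs. (9)-(10):
fit_ols <- lm(y ~ ., data = data.frame(y = y, forecasts))
coef(fit_ols)               # estimated intercept (alpha-hat) and weights (w-hat)
f_ols   <- fitted(fit_ols)  # in-sample combined forecast, eq. (10)
```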
Least Absolute Deviation (LAD) regression. While the OLS regression estimates the coefficients in equation (9) by minimizing the sum of squared errors, we may want to estimate those coefficients by minimizing a different loss function, for example the sum of absolute errors. The reason is best explained using an example: Assume we have a model that performs well in general, yet every now and then misses the target by a very large margin. Such a model would be weighted more heavily under the LAD scheme than under the OLS scheme, since those large but infrequent errors are penalized more heavily by OLS. Whether this is beneficial depends on the user’s preference and/or the cost of missing the target given the problem at hand. It should be noted that this lower sensitivity to outliers has another advantage: OLS weights can be unstable when predictors are highly correlated, which is the norm in forecast combination. Minor fluctuations in the sample can cause major shifts in the coefficient vector (‘bouncing betas’), often leading to poor out-of-sample performance. This suggests that LAD combination should be favored in the presence of highly correlated component forecasts (Nowotarski et al. 2014).
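Outside the package, one way to estimate LAD weights is median regression, sketched here with quantreg::rq under the same assumptions as above; comb_LAD is the package’s own implementation:

```r
# LAD combination via median regression (a sketch using quantreg):
library(quantreg)
fit_lad <- rq(y ~ ., tau = 0.5, data = data.frame(y = y, forecasts))
f_lad   <- predict(fit_lad)  # combination minimizing the sum of absolute errors
```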
Constrained Least Squares (CLS) regression. Like the LAD approach, CLS addresses the issue of ‘bouncing betas’. It does so by minimizing the sum of squared errors under additional constraints: the estimated coefficients \(\{w_i\}\) are restricted to be non-negative, \(w_i \geq 0 \;\forall i\), and to sum up to one, \(\sum_{i = 1}^P w_i = 1\). The solution requires numerical minimization, but good optimization algorithms are readily available: the ForecastComb package relies on the function solve.QP available from the quadprog package (Turlach and Weingessel 2013). To tackle problems with high (but imperfect) collinearity that can cause errors in the CLS estimation, we also implement a revised Cholesky decomposition based on ridge regression, which has been proposed by (Babaie-Kafaki and Roozbeh 2017) and can mitigate issues with multicollinearity.
Theoretically, the additional constraints make CLS sub-optimal compared to OLS; it lacks the good asymptotic properties of OLS. However, in practice it is often found to perform better, especially so when the individual forecasts are highly correlated. In addition, the CLS weights are more easily interpretable. It is hard to justify a non-convex linear combination of two forecasts, while CLS weights can be conveniently interpreted, for example, as percentages devoted to each of the individual forecasts.
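The CLS problem maps directly onto the quadratic-programming form expected by solve.QP. A minimal sketch, which omits the ridge-based safeguard against near-collinearity mentioned above:

```r
# CLS weights: minimize ||y - Fw||^2 subject to sum(w) = 1 and w >= 0
library(quadprog)
P   <- ncol(forecasts)
sol <- solve.QP(Dmat = crossprod(forecasts),               # F'F (must be positive definite)
                dvec = as.vector(crossprod(forecasts, y)), # F'y
                Amat = cbind(rep(1, P), diag(P)),          # sum-to-one, then w_i >= 0
                bvec = c(1, rep(0, P)),
                meq  = 1)                                  # first constraint is an equality
w_cls <- sol$solution
```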
Complete subset regression. The ForecastComb package implements the relatively new idea of computing forecast combination weights using complete subset regression. The underlying idea is straightforward: With \(P\) component forecasts that can serve as predictors in the regression model, we can form \(n\) regression models, each with a unique subset of predictors. The total number \(n\) of combined forecasts from these regression models is given by
\[n = \sum_{i = 1}^P {{P}\choose{i}} = \sum_{i = 1}^P \frac{P!}{i! (P-i)!} = 2^P -1\]
In the most basic variant, the final combined forecast is obtained by taking the simple average over the cross-section of these \(n\) combined forecasts from complete subset regressions. The method is proposed by (Elliott et al. 2013), who develop the theory behind this estimator and present favourable results from simulations and an empirical application to US stock returns. Admittedly, the scheme is computationally expensive, and thus additional computational resources may be required if \(P\) is in the dozens (Elliott et al. 2013, Section 2.5.1, propose a workaround based on random sampling from the \(P\) predictors). Additionally, since all \(n\) forecasts are returned, the user can freely refine the technique further – for example by choosing not to average over all \(n\) forecasts, but over some partial subset. Using the median instead of the mean is another option that comes to mind.
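As a standalone sketch of the basic variant (the package function is comb_CSR), the \(2^P - 1\) subset regressions can be enumerated with combn and averaged:

```r
# Complete subset regression, simple-average variant (illustrative sketch):
P       <- ncol(forecasts)
subsets <- unlist(lapply(1:P, function(k) combn(P, k, simplify = FALSE)),
                  recursive = FALSE)                # all 2^P - 1 predictor subsets
models  <- lapply(subsets, function(s)
  lm(y ~ ., data = data.frame(y = y, forecasts[, s, drop = FALSE])))
csr_fit <- sapply(models, fitted)                   # T x n matrix of subset forecasts
f_csr   <- rowMeans(csr_fit)                        # equal weight on each subset model
```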
Obtaining \(n\) combined forecasts from the complete subset regression also allows us to use a frequentist approach to forecast combination, also known as information-theoretic forecast combination. In the ForecastComb package, several information criteria are available in the complete subset regression method, each with its own merits and weaknesses: By far the most common are the AIC (Akaike’s information criterion) and the BIC (Bayesian information criterion, also known as the Schwarz information criterion). Both are supplied, in addition to the corrected AIC (Hurvich and Tsai 1989) and the Hannan-Quinn information criterion (Hannan and Quinn 1979).
Formally, the weight given to each forecast based on the information-theoretic forecast combinations is the following:
\[\label{eq:IT} w_i = \frac{\exp(-1/2 \varrho_i) }{\sum_{j = 1}^{n} \exp(-1/2 \varrho_j) }, \tag{11}\]
where \(\varrho_i\) is the information criterion for forecast \(i\) obtained using a regression with a specific combination of forecasts. The value of \(n\) is fixed as the number of possible combinations, and the combined forecast is given by:
\[f^c = \sum_{i = 1}^{n} w_i \tilde{f}_i.\]
It is worth noting that this is a two-step combination method. The first step is the computation of \(n\) combined forecasts \(\tilde{f}_i\) using the complete subset regression method with the original \(P\) forecasts as predictors; the second step is the combination of these combined forecasts using the weights based on information criteria.
One advantage of this frequentist approach to model averaging is that the amount of shrinkage enforced on each individual forecast is data driven. The user is thus spared the specification of a shrinkage hyper-parameter, which is required in the corresponding Bayesian framework (e.g. Raftery and Painter 2005).
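Continuing the subset-regression sketch from above, equation (11) amounts to exponentiating the scaled information criteria and normalizing. Subtracting the minimum criterion before exponentiating is a standard numerical-stability step, not part of the formula itself:

```r
# Information-theoretic weights of eq. (11) over the n subset models:
ic   <- sapply(models, AIC)          # or BIC() for Schwarz weights
w_it <- exp(-0.5 * (ic - min(ic)))   # shift by min(ic) for numerical stability
w_it <- w_it / sum(w_it)
f_it <- as.vector(csr_fit %*% w_it)  # second-step combined forecast
```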
The eigenvector-based forecast combination methods, proposed by (Hsiao and Wan 2014), are based on the idea of minimizing the mean squared prediction error subject to a normalization condition.
The most commonly used normalization condition for this purpose is to require the combination weights to add up to one, i.e. \(\sum_{i=1}^P w_i = 1\) (e.g. Newbold and Granger 1974; Timmermann 2006). (Hsiao and Wan 2014) show that this normalization condition leads to a constrained minimum of the mean squared prediction error (MSPE), and propose a normalization condition that leads to an unconstrained minimum of the MSPE:
\[\label{eq:unconstr} \sum_{i=1}^{P} w_i^2 = 1 \tag{12}\]
This unconstrained minimum of the MSPE is the basis of the four eigenvector-based approaches in the ForecastComb package.
Standard Eigenvector Approach. The standard eigenvector approach retrieves combination weights from the estimated MSPE matrix as follows: The \(P\) positive eigenvalues of the MSPE matrix are arranged in increasing order \((\Phi_1 = \Phi_{min},\Phi_2,...,\Phi_P)\), and \(\kappa_i\) denotes the eigenvector corresponding to \(\Phi_i\). Let \(d_i = e'\kappa_i\) with \(e\) being a \(P \times 1\) vector of \((1,1,...1)'\). The combination weights \(w\) are then chosen corresponding to the minimum of \((\frac{\Phi_1}{d_1^2}, \frac{\Phi_2}{d_2^2},...,\frac{\Phi_P}{d_P^2})\), denoted as \(\kappa_l\), as:
\[w = \frac{1}{d_l} \kappa_l\]
The combined forecast is then obtained as usual:
\[f^c = \sum_{i=1}^{P} f_i' w_i\]
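A base-R sketch of these steps, again assuming `errors` holds the training-set forecast errors (comb_EIG1 is the package implementation):

```r
# Standard eigenvector weights (illustrative sketch):
Sigma <- crossprod(errors) / nrow(errors)  # estimated MSPE matrix
eig   <- eigen(Sigma)
d     <- colSums(eig$vectors)              # d_i = e' kappa_i
l     <- which.min(eig$values / d^2)       # index minimizing Phi_i / d_i^2
w_eig <- eig$vectors[, l] / d[l]           # weights satisfy e'w = 1
f_eig <- as.vector(forecasts %*% w_eig)
```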
Bias-Corrected Eigenvector Approach. The bias-corrected eigenvector approach builds on the idea that if one or more of the component models yield biased forecasts, the accuracy of the standard eigenvector approach can be improved by eliminating the bias. It modifies the standard approach by decomposing forecast errors into three parts: model-specific bias, omitted common factors of all component models, as well as an idiosyncratic part that is uncorrelated across the component models.
The optimization procedure to obtain combination weights coincides with the standard approach, except that we use as an input the centered MSPE matrix, i.e. after extracting the bias by subtracting the column means of the MSPE:
\[w = \frac{1}{\tilde{d}_l} \tilde{\kappa}_l\]
where \(\tilde{d}_i\) and \(\tilde{\kappa}_i\) are defined analogously to \(d_i\) and \(\kappa_i\) in the standard eigenvector approach, with the only difference that they correspond to the spectral decomposition of the centered MSPE matrix rather than the original (uncentered) MSPE matrix.
The combined forecast is then obtained by:
\[f^c = \alpha + \sum_{i=1}^P f_i' w_i\]
where the intercept \(\alpha\) corrects for the potential bias.
Trimmed Eigenvector Approach. The standard approach is highly sensitive to disparities in the performance of different predictive models, i.e. the standard eigenvector approach’s performance can be severely impaired by one or more component models that produce poor forecasts. This is due to treating the uncertainties in the actual series, \(y\), and the uncertainties of the component models, \(F_{T \times P}\), symmetrically. For a detailed discussion of this so-called orthogonality principle, see Section 3 in (Hsiao and Wan 2014). The trimmed eigenvector approach takes note of this issue.
The idea of trimming the pool of input forecasts has been used by (Aiolfi and Timmermann 2006) and is picked up by (Hsiao and Wan 2014) using the eigenvector framework – the weights are computed exactly as in the standard eigenvector approach, but based on the MSPE matrix of the trimmed forecasts, after discarding particularly bad component models.
Trimmed Bias-Corrected Eigenvector Approach. The underlying methodology of the trimmed bias-corrected eigenvector approach is the same as the bias-corrected eigenvector approach: The weights are retrieved through the spectral decomposition of the centered MSPE matrix.
The only difference to the bias-corrected eigenvector approach is that this method, like the trimmed eigenvector approach, pre-selects component models that serve as input for the forecast combination; only a subset of the available forecast models is retained, while the models with the worst performance are discarded, thereby combining the favourable modifications of the previous two methods.
There are three more related functions. The function auto_combine computes the fit for all the available forecast combination methods based on a grid-search optimization and returns the combined forecast with the best in-sample accuracy. The default accuracy metric is the RMSE; MAE and MAPE are also available. The function rolling_combine is simply a dynamic variant, where the computation is done in an expanding-window fashion rather than with a single in- and out-of-sample split. Finally, the function predict.foreccomb_res allows the user to obtain forecasts from a previously trained forecast combination model. It is a simple utility function for producing forecasts quickly.
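A hedged usage sketch of these three helpers, assuming input_data is a foreccomb object as created in the examples below; the argument passed to predict for new component forecasts is hypothetical here, so consult the package manual for the exact signature:

```r
# Automated method selection, dynamic re-estimation, and prediction (sketch):
best_fit <- auto_combine(input_data, criterion = "MAE")  # best in-sample method
dyn_fit  <- rolling_combine(input_data, "comb_OLS")      # expanding-window weights
new_fc   <- predict(best_fit, new_forecast_matrix)       # new_forecast_matrix: a
                                                         # hypothetical P-column matrix
```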
The main functions provided in the ForecastComb package can be classified into three categories:
Data Preparation
Estimation of Forecast Combination
Post-Fit Presentation of Results
In addition, some auxiliary functions are provided. Table 1 gives the list of the main functions.
Function | Description
---|---
Data Preparation Functions | 
`foreccomb` | Transform raw input data for forecast combination
`cs_dispersion` | Compute cross-sectional dispersion
Forecast Combination Functions | 
Simple Forecast Combination Functions | 
`comb_BG` | Bates/Granger (1969) forecast combination
`comb_InvW` | Inverse rank forecast combination
`comb_MED` | Median forecast combination
`comb_NG` | Newbold/Granger (1974) forecast combination
`comb_SA` | Simple average forecast combination
`comb_TA` | Trimmed mean forecast combination
`comb_WA` | Winsorized mean forecast combination
Regression-Based Forecast Combination Functions | 
`comb_CLS` | Constrained least squares (CLS) forecast combination
`comb_CSR` | Complete subset regression forecast combination
`comb_LAD` | Least absolute deviation (LAD) forecast combination
`comb_OLS` | Ordinary least squares (OLS) forecast combination
Eigenvector-Based Forecast Combination Functions | 
`comb_EIG1` | Standard eigenvector forecast combination
`comb_EIG2` | Bias-corrected eigenvector forecast combination
`comb_EIG3` | Trimmed eigenvector forecast combination
`comb_EIG4` | Trimmed bias-corrected eigenvector forecast combination
Other Forecast Combination Functions | 
`auto_combine` | Automated grid-search forecast combination
`rolling_combine` | Rolling forecast combination (time-varying combination weights)
Post-Fit Functions | 
`plot.foreccomb_res` | S3 method to plot results from a forecast combination model
`summary.foreccomb_res` | S3 method to summarize the forecast combination estimation
`predict.foreccomb_res` | S3 method to create forecasts using weights from a previously trained forecast combination model
The ForecastComb package takes as its starting point that the user
has already obtained a set of component forecasts, either from survey
data or using statistical techniques, and now seeks to improve accuracy
by combining those component forecasts into one. If the user only has
the actual time series data, other packages in R can be used to create a
set of component models. For instance, the forecastHybrid package by (Shaub and Ellis 2017) creates several univariate forecasts using methods available in the mature and popular forecast package (Hyndman 2017).
The function foreccomb is the workhorse of the data preparation step. It helps the user transform the raw input data to make sure that the estimation of the forecast combinations will run smoothly.
The call of foreccomb
is:
foreccomb(observed_vector, prediction_matrix, newobs = NULL,
newpreds = NULL, byrow = FALSE, na.impute = TRUE, criterion = "RMSE")
The function requires user input for the parameters observed_vector
(a
vector, the actual data) and prediction_matrix
(a matrix, the set of
component forecasts to be combined). The format of the input matrix is
as follows: Each column contains the forecasts from one of the \(P\)
component models. Each row corresponds to the cross-section of component
forecasts at a specific point in time. A situation where the format of
the data is reversed, meaning that rows correspond to forecast models
and columns correspond to the time index, is handled by setting the
argument byrow = TRUE
.
The foreccomb function includes some convenient features that take note of the fact that combination methods are often applied to survey forecasts, with the challenges that come along with this:
Split into Training Set and Test Set. The function allows the user
to specify a training set (observed_vector and prediction_matrix) and a test set (newobs and newpreds)
separately. This is useful since most combination functions have to
estimate the weights (requiring part of the sample to be dedicated
to that task), while it is recommended that a test set is available
to evaluate the model’s performance on “new” data.
Missing Value Imputation. Survey forecasts usually include missing values, either because some of the survey participants did not respond or because the set of survey participants changes over time. The foreccomb function provides two alternatives to deal with missing values: The default option (na.impute = TRUE) uses the missing value imputation algorithm from the mtsdi package by (Junger and de Leon 2012). It is a modified version of the EM algorithm for imputation that is specifically adjusted for multivariate time series data, accounting for correlation between the forecasting models and the time structure of the series itself. A smoothing spline is fitted to each of the time series at every iteration and the degrees of freedom of each spline are chosen by cross-validation. Alternatively, the argument can be set to na.impute = FALSE, in which case component forecast models containing any missing values are dropped prior to estimating the forecast combinations, and the user is notified in the console when this happens.
Handling Multicollinearity. More often than not component
forecasts that are used in the forecast combination are highly
correlated. This can trouble the estimation process, which does not handle perfect collinearity well. The foreccomb function has a built-in algorithm that checks the set of component forecasts for
perfect multicollinearity, and if detected, drops one of the
component models from the input data. The algorithm is designed to
minimize the cost of dropping one or more models from the input
data, in the sense that out of the models that cause perfect
multicollinearity, it drops the least accurate forecast model. Only
perfect multicollinearity is handled. By default, Root Mean Squared
Error (RMSE) is used as the accuracy metric, but alternatively the
user may choose the Mean Absolute Error (MAE) or the Mean Absolute
Percentage Error (MAPE) by changing the argument criterion
.
The output of the foreccomb
function is an object of S3
class
foreccomb
that can be passed on to the estimation functions or the
other auxiliary functions, for instance the function cs_dispersion
which computes the cross-sectional dispersion of the set of component
forecasts.
This is often helpful for selecting a suitable combination method: One
of the main findings of (Hsiao and Wan 2014) is that regression-based methods
produce more accurate forecasts when one or a few of the component
forecasts are much better than the rest, while eigenvector-based methods
perform better when there is low dispersion among the component
forecasts. The cs_dispersion
function can be used to compute and plot
this cross-sectional dispersion using standard deviation (default),
interquartile range, or range.
The package provides the user with functions for the 15 estimation
techniques for combined forecasts, which were described in the previous section. The estimation functions require an object of S3
class
foreccomb
as input, which is obtained using the methods from the
previous subsection.
Four of the methods include trimming, i.e. they use only a pre-selected subset of the full set of component models in the estimation of combination weights. These are:
Trimmed Mean (comb_TA
)
Winsorized Mean (comb_WA
)
Trimmed Eigenvector Approach (comb_EIG3
)
Trimmed Bias-Corrected Eigenvector Approach (comb_EIG4
)
For these methods, the user has some flexibility. The package provides
the option to set the trimming factor (or, for the eigenvector methods,
the number of retained component models) manually. Otherwise, an inbuilt
optimization algorithm is used for choosing the trimming factor such
that the combined forecast has the best possible fit. Again, this
optimization can be based on either RMSE, MAE, or MAPE, as controlled by the argument criterion.
A simple simulation example:
R> actual <- rnorm(100)
R> forecasts <- matrix(rnorm(1000, 1), 100, 10)
R> input_data <- foreccomb(actual, forecasts)
R> # Manual Selection of Trimming Factor:
R> model1 <- comb_TA(input_data, trim_factor = 0.3)
R> # Assess accuracy of the combined forecast:
R> model1$Accuracy_Train
                   ME     RMSE      MAE     MPE     MAPE       ACF1 Theil's U
Training Set -1.07312 1.520668 1.274116 146.411 486.3501 -0.0480562  1.698456
R> # Algorithm-Optimized Selection of Trimming Factor:
R> model2 <- comb_TA(input_data, criterion = "RMSE")
Optimization algorithm chooses trim factor for trimmed mean approach...
Algorithm finished. Optimized trim factor: 0.1
R> # Assess accuracy of the combined forecast:
R> model2$Accuracy_Train
                    ME     RMSE     MAE      MPE     MAPE        ACF1 Theil's U
Training Set -1.063363 1.515091 1.27304 134.7211 489.4353 -0.03678255  1.692617
As can be seen, the automated selection of the trimming factor leads to an improved accuracy of the combined forecast.
The 15 methods included in the package all produce static combination weights, i.e. the models use the training set data to estimate combination weights, which will in turn be applied to all periods of the test set. The research community in the forecasting field is strongly divided in the assessment of the value of time-varying combination weights, since putting higher weights on more recent data tends to increase the parameter variance. Section 4.1 in (Timmermann 2006) reviews the advantages and challenges of allowing for time-varying weights.
While the ForecastComb package mainly uses time-invariant combination weights,
the user is provided with some flexibility. The rolling_combine
function allows for the estimation of each of the methods with
time-varying weights. The approach builds on the idea of time series
cross-validation (Bergmeir et al. 2015), using the provided training set as a
departure point to estimate starting weights, and then increasing the
training set one step at a time and re-estimating the weights for the
remaining test set. However, this approach requires that the user
provides a full test set, i.e. also providing observed values for the
test set.
The function auto_combine
provides a quick and painless alternative.
The function is based on a grid-search optimization that returns the
combined forecast with the best in-sample accuracy (using RMSE as
accuracy metric in the default setting).
All of the estimation methods return an object of S3
class
foreccomb_res
with the components presented in Table
2. The object can subsequently be passed on to
post-fit functions.
Return Values | Description
---|---
For all methods | 
`Method` | Returns the forecast combination method used
`Models` | Returns the individual input models that were used for the forecast combination
`Weights` | Returns the combination weights obtained by applying the combination method to the training set
`Fitted` | Returns the fitted values of the combination method for the training set
`Accuracy_Train` | Returns a range of accuracy measures for the training set
`Forecasts_Test` | Returns forecasts produced by the combination method for the test set; only returned if the input included a forecast matrix for the test set
`Accuracy_Test` | Returns a range of accuracy measures for the test set; only returned if the input included a forecast matrix and a vector of actual values for the test set
`Input_Data` | Returns the data forwarded to the method
Only for `comb_TA` and `comb_WA` | 
`Trim_Factor` | Returns the trim factor, \(\lambda\)
Only for `comb_OLS`, `comb_LAD`, `comb_EIG2` and `comb_EIG4` | 
`Intercept` | Returns the intercept (bias correction)
Only for `comb_EIG3` and `comb_EIG4` | 
`Top_Predictors` | Returns the number of retained predictors
`Ranking` | Returns the ranking of predictors that determines which models are removed in the trimming step
Results from the estimation of combined forecasts can be passed on to
two post-fit convenience functions: summary
and plot
, which are S3
methods specific to the class foreccomb_res
.
The summary function displays the output of the respective forecast combination in concise form: it shows the estimated combination weights (and the intercept, if the combination method includes one), as well as accuracy statistics for the training set and the test set.
The plot function produces different plots based on the input data. If only a training set was provided, it plots the actual versus the fitted values; if a test set was also provided, it plots the combined forecasts as well. Another option for the user is a plot of the combination weights, obtained by setting which = 2 in the function call.
The ForecastComb package includes the dataset electricity
, which is
a multivariate time series of monthly UK electricity supply (in GWh)
from January 2007 to March 2017, and 5 univariate time series forecasts
for the same series and period. The observed data series is sourced from
the International Energy Agency (IEA 2017). The component models to be
combined are the following cross-validated one-month univariate
forecasts in the dataset:
ARIMA (produced using the auto.arima
function in the forecast
package),
ETS (produced using the ets
function in the forecast package),
Neural Network (produced using the nnetar
function in the
forecast package),
Damped Trend (produced using the ets
function in the forecast
package),
Dynamic Optimized Theta Model (produced using the dotm
function in
the forecTheta
package by (Fiorucci et al. 2016)).
To illustrate the functionalities of the package, we apply 4 combination techniques: the simple average (comb_SA), the OLS regression (comb_OLS), the standard eigenvector approach (comb_EIG1) and the trimmed bias-corrected eigenvector approach (comb_EIG4). The selected methods span all three categories of combination techniques (statistics-based, regression-based, and eigenvector-based) and include trimmed and bias-corrected methods; they are therefore suitable to show the full functionality of the ForecastComb package in this empirical context. For the selected combination methods, we produce both static and dynamic forecasts, which gives us a total of 7 different time series of combined forecasts (not 8, since the static and dynamic versions of the simple average combination are identical).
For the purpose of this exercise, we use the first 84 months as training
set, which leaves us with a test set size of 39. Figure 1
plots the actual series and the univariate forecasts. The forecasts
(which are 1-month forecasts obtained via time series cross-validation)
track the actual series very well. None of them performs exceptionally
poorly compared with the rest, which are conditions that tend to favour
eigenvector approaches (Hsiao and Wan 2014). As with most electricity data, the main difference between the individual models is in their
ability to quickly recognize and adjust for turning points. For example,
the neural nets model handles turning points well, but sometimes also
overshoots, while the ARIMA model has a smoother behaviour around
turning points.
First, we format the data correctly for the estimation of combination weights. This step ensures that later operations will proceed smoothly.
Since there are no missing values and no perfectly collinear columns in
our dataset, this is relatively straightforward:
R> data(electricity)
R> train.obs <- electricity[1:84, 6]
R> train.pred <- electricity[1:84, 1:5]
R> test.obs <- electricity[85:123, 6]
R> test.pred <- electricity[85:123, 1:5]
R> input_data <- foreccomb(train.obs, train.pred, test.obs, test.pred)
Once the object of S3
class foreccomb
is created, it can be fed into
the estimation functions. We can look at the cross-sectional dispersion,
to get a better idea of variability in the univariate forecasts.
R> cs_dispersion(input_data, measure = "SD", plot = TRUE)
Figure 2 shows that, apart from a brief period of increased dispersion around the end of 2009, the cross-sectional standard deviation of the component forecasts is rather stable and low given the level of around 25,000 to 35,000 GWh. This raises the question of whether conditions have fluctuated enough during the test period for the estimation of time-varying weights to be beneficial; we nevertheless proceed with it for this demonstration.
R> ####### ESTIMATION OF STATIC FORECAST COMBINATIONS ########
R> SA <- comb_SA(input_data)
R> OLS_static <- comb_OLS(input_data)
R> EIG1_static <- comb_EIG1(input_data)
R> EIG4_static <- comb_EIG4(input_data, criterion = "MAE")

R> ###### ESTIMATION OF DYNAMIC FORECAST COMBINATIONS ######
R> OLS_dyn <- rolling_combine(input_data, "comb_OLS")
R> EIG1_dyn <- rolling_combine(input_data, "comb_EIG1")
R> EIG4_dyn <- rolling_combine(input_data, "comb_EIG4", criterion = "MAE")
The 7 combined forecasts can be evaluated separately by looking at their summary measures, which we present here for the static Ordinary Least Squares approach:
R> summary(OLS_static)

Summary of Forecast Combination
-------------------------------
Method: Ordinary Least Squares Regression

Individual Forecasts & Combination Weights:

        Combination Weight
arima           0.02152869
ets            -0.20646266
nnet            0.20992792
dampedt        -1.04349858
dotm            1.97991049

Intercept (Bias-Correction): 962.3229

Accuracy of Combined Forecast:

                        ME     RMSE      MAE         MPE     MAPE
Training Set -4.417560e-12 888.1433 697.8645 -0.08262472 2.254470
Test Set     -4.007742e+01 671.5214 536.0331 -0.24705122 1.841961

Additional information can be extracted from the combination object:
For fitted values (training set): OLS_static$Fitted
For forecasts (test set): OLS_static$Forecasts_Test
See str(OLS_static) for full list.
The output shows that the OLS combination puts an extremely high relative weight on the forecast from the Dynamic Optimized Theta model, which seems to be the best component forecast. This is rather surprising, given that seasonality is an important feature of the analyzed series and Theta models cannot, so far, incorporate seasonality directly into the estimation, relying instead on pre-estimation deseasonalizing and post-estimation reseasonalizing. Table 3 shows a comparison of the accuracies achieved by the combined forecasts. Since all forecasts are for the same series, it is reasonable to use the MAE as accuracy metric.
Method | MAE Training Set | MAE Test Set |
---|---|---|
Simple Average | 819.28 | 573.39 |
Ordinary Least Squares (static) | 697.86 | 536.03 |
Ordinary Least Squares (dynamic) | 697.86 | 533.47 |
Standard Eigenvector (static) | 821.60 | 573.84 |
Standard Eigenvector (dynamic) | 821.60 | 572.99 |
Trimmed Bias-Corrected Eigenvector (static) | 785.30 | 540.18 |
Trimmed Bias-Corrected Eigenvector (dynamic) | 785.30 | 541.98 |
The evaluation of accuracy delivers some interesting insight: All of the combination models perform better in the test set than in the training set, which is counter-intuitive, but is likely due to the increased dispersion among the component forecasts in the early period. The results also clearly suggest that one or more of the component forecasts were biased. The two methods with an intercept (i.e. bias correction) perform best. Finally, allowing for time-varying combination weights does not seem to change test-set accuracy much compared with the models’ static counterparts, suggesting that the estimated combination weights did not fluctuate a lot over time. There are some potential explanations why the OLS method performed extremely well in this case:
With such stable conditions the risk of ‘bouncing betas’ (described in the previous section) is low,
The OLS method produces unbiased forecasts even if one or more of the component forecasts are biased (which is why the trimmed bias-corrected eigenvector approach performed reasonably well too),
One of the component forecasts is much better than the rest, a situation that is favorable for regression-based approaches, as pointed out by (Hsiao and Wan 2014). These are also conditions under which sophisticated methods actually can largely improve upon a simple average combination.
Now that we have seen that some of the combination models produced more
accurate forecasts than the simple average, we address the next, very
natural question: How well did the combination methods perform compared
to the univariate component forecasts themselves? To shed more light on
this question, Table 4 shows the MAE values for the
univariate models during the test set period.
Method | MAE Test Set |
---|---|
ARIMA | 770.32 |
ETS | 615.88 |
Neural Network | 730.35 |
Damped Trend | 660.08 |
DOTM | 540.24 |
The results of the accuracy evaluation speak for themselves. Not only did two of the combined forecasts perform as well as or better than the best univariate forecast over the test set period, but the forecast risk is also considerably lower: in the test set, the range of MAEs of the different combined forecast methods is only 40, while the corresponding value for the univariate forecasting methods is 230. It is worth noting that all of the combined forecasts perform considerably better than even the second-best univariate forecast, emphasizing the appeal of forecast combination: In the ideal case, it is possible to end up with a forecast that is better than the best univariate forecast. Even if this is not the case, using forecast combination considerably decreases the risk of ending up with a poorly performing model.
Finally, we take a closer look at the results from the best combined
forecast, which is the dynamic OLS combination method in this exercise.
R> ##### ACTUAL VS FITTED PLOT #####
R> plot(OLS_dyn)

R> ##### COMBINATION WEIGHTS #####
R> OLS_static$Weights
 fhatarima     fhatets    fhatnnet fhatdampedt    fhatdotm
0.02152869 -0.20646266  0.20992792 -1.04349858  1.97991049
R> colMeans(OLS_dyn$Weights)
  fhatarima      fhatets     fhatnnet  fhatdampedt     fhatdotm
0.002750948 -0.123811218  0.184846358 -1.098381593  2.001988985
Figure 3 shows how well the combined forecast obtained from the dynamic OLS method predicts the monthly electricity supply series. The results confirm the conjecture that the stable conditions (low cross-sectional dispersion and one very dominant univariate component forecast) do not cause a lot of fluctuation in the combination weights, even when allowing for time-varying weights.
The presented weights also confirm that weights obtained using the OLS combination method can be hard to interpret. It seems obvious that the method should put a high weight on the DOTM forecast, since it is by far the best univariate forecast. However, the reason why it assigns negative weights to the ETS and Damped Trend forecasts (the second- and third-best univariate forecasts) is not very intuitive. A possible explanation might be that all three of these are exponential smoothing-type models, suggesting that the information obtained from the ETS and Damped Trend models is already better captured by the DOTM model. ARIMA and Neural Networks, by contrast, are not closely related to the Theta model: even though these models perform far worse on average, they capture information differently and might outperform the DOTM forecast in some periods. This justifies their (however small) positive weights and explains how the combined forecast can slightly outperform the best component forecast.
Forecast combination is a useful strategy to hedge against model risk.
Even if combined forecasts do not win over the most accurate component
forecast, they generally avoid poor performance by circumventing the
choice between individual methods (Timmermann 2006). Instead of putting
all eggs into one basket using model selection, these model averaging
techniques are motivated by portfolio theory and diversify across
component forecasts.
The ForecastComb package categorizes some of the most popular
approaches into three groups: (a) simple statistics-based methods, (b) regression-based methods, and (c) eigenvector-based methods. Providing both regression-based and eigenvector-based combinations is useful, since the former tend to perform
relatively better when one or a few component forecasts are much better
than the rest, while the latter perform relatively better when all
forecasts are in the same ballpark (Hsiao and Wan 2014).
The package is designed to support users along the entire modelling
process: data preparation, model estimation, and interpretation of
results using summaries and plotting functionalities. It includes tools
for data transformation that are designed to deal with two common issues
in forecast combination prior to estimation – missing values and
multicollinearity. The 15 combination methods are available in both
static and dynamic variants, and users have the option to automate the
selection algorithm so that a good combination method is found based on
training set fit. Finally, the package offers specialized functions for
summarizing and visualizing the combination results.
While the current version of the package already provides a
comprehensive set of tools for forecast combination, there is scope for
further extensions in future updates. First, additional combination
methods that showed promising results recently can be added, for
instance the factor-augmented regression approach by (Cheng and Hansen 2015) and
the AdaBoost algorithms reviewed by (Barrow and Crone 2016). Second, additional
algorithms for adaptive combination weights (cf. Timmermann 2006) can be
implemented to provide even more flexibility with dynamic estimation.
There are a few other possible extensions which are not handled here. One is of course the combination of forecast densities, while in this paper the scope was limited to point forecasts only. Another is the adaptation to a forecast combination context of the mean absolute scaled error (Hyndman and Koehler 2006), the current gold standard for accuracy evaluation, using the in-sample MAE of the best component forecast as scaling factor.
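To sketch the scaled-error idea (our own illustration, not current package functionality): with training actuals y_train, the training forecast matrix F_train, test actuals y_test, and a combined test-set forecast fc_test, the suggested MASE variant would be:

```r
# MASE-style evaluation, scaling by the in-sample MAE of the best component:
scale_mae <- min(colMeans(abs(y_train - F_train)))  # best component's in-sample MAE
mase      <- mean(abs(y_test - fc_test)) / scale_mae
```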
We thank four anonymous referees for their insightful comments and remarks which helped to improve both the package and this paper.