Forecasting US Stock Returns

We forecast quarterly US stock returns using eighteen predictor variables both individually and in multivariate regressions, with the former also used in forecast combinations. Using rolling and recursive approaches, we consider a range of statistical and economic evaluation measures. We consider linear and non-linear regressions as well as forecast evaluations over both market and economic regimes, calculated on a rolling and recursive basis. The results reveal that the term structure of interest rates consistently provides the preferred forecast performance, especially when evaluated using the Sharpe ratio. Additionally, the purchasing managers index consistently provides a strong forecast performance. A broad view over the full set of predictive variables tends to indicate that such models are unable to beat the historical mean model. However, nuances to these results reveal that forecast success varies over time and according to how the forecasts are evaluated. The success of the term structure (and the purchasing managers index) reveals that investor (and firm) expectations of future economic performance provide valuable stock return forecasts. This is consistent with asset pricing models that indicate movements in returns are conditioned by such expectations.


Introduction.
Stock return predictability remains elusive and much sought after. It ties together several strands within the asset pricing literature and so remains a key empirical research question. Evidence of predictability linked to specific financial or economic variables would advance our understanding of an underlying asset pricing model that argues current stock returns are linked to future movements in economic conditions. Moreover, it would improve our knowledge of the links between real and financial markets. Equally, evidence of predictability arising from the movement of past returns or characteristics not related to economic conditions would suggest that a reassessment of our asset pricing models is required.
Nonetheless, regardless of the source of predictability, supportive evidence would be of interest to investors in building portfolios and making market timing decisions.
While empirical research geared towards stock return predictability has been a recurring theme, recent impetus to this area was given by the work of Campbell and Shiller (1988) and Fama and French (1988), both of whom argue that financial ratios exhibit predictive power for subsequent stock returns. Pesaran and Timmermann (1995, 2000) consider a wider range of economic variables and report supportive evidence of predictability. However, consistent evidence in favour of predictability is lacking. Notably, Ang and Bekaert (2007) and Welch and Goyal (2008) undertake comprehensive exercises that suggest limited evidence of predictability. An explanation for the lack of consistent evidence is provided by work that suggests the presence of regime shifts or non-linear dynamics within the predictive relation or that predictability itself is a temporary phenomenon. For example, McMillan (2003) argues that a non-linear model is required to uncover more supportive evidence of predictability. Paye and Timmermann (2006) suggest that breaks occur within the coefficient of the predictive regression, while Lettau and van Nieuwerburgh (2008) suggest breaks in the predictor variable.
Timmermann (2008) argues that predictability only exists over short-lived periods of time, while Campbell and Thompson (2008), Park (2010) and McMillan and Wohar (2013) equally argue that predictability is not constant over time. Henkel et al (2011) suggest that predictability only arises during economic downturns. More recently, Hammerschmid and Lohre (2018) provide evidence of predictability based on economic regimes, while Baltas and Karyampas (2018) highlight that forecast success is dependent upon identifying market regimes.
Electronic copy available at: https://ssrn.com/abstract=3264105
This paper focuses on the out-of-sample forecast ability of a range of eighteen variables that include financial ratios, firm specific variables, macroeconomic variables and series that correspond to confidence and recent market behaviour. Thus, we seek to cover variables that can be regarded as indicators of fundamental economic conditions as well as market indicators.
We consider these variables individually and in a multivariate regression setting and consider forecast combinations of the former. The modelling approach includes linear and non-linear models conducted using rolling and recursive approaches. The forecast evaluation utilises statistical and economic based measures and equally allows for regimes of behaviour to be identified according to both economic and market conditions and, again, in a rolling and recursive manner.
Our results reveal several key features. Statistical based forecast results tend to support the historical mean baseline model. However, this broad view disguises several nuances to these results. An examination of mean squared error components reveals that the failure of predictive models arises from large unsystematic errors. Economic based forecast evaluations reveal better performance for the predictive models, as does the imposition of short-selling constraints. An evaluation of threshold model based forecasts as well as market and economic regimes also indicates the potential to identify periods where explicit forecast models outperform the historical mean. Equally, time-variation in calculating the forecast evaluation measures reveals periods of time where the predictor variables perform relatively better or worse compared to the historical mean. Notwithstanding these results, the term structure of interest rates (especially) and the purchasing managers index consistently exhibit a strong forecast performance. The term structure achieves the highest Sharpe ratio seven times across ten different exercises and comes second in the remaining three. The purchasing managers index, with two exceptions, achieves a top three performance (and is first twice) on the Sharpe ratio across the different approaches.
The term structure is an indicator of investor expectations of future economic conditions, notably, whether future output is expected to grow, leading to higher future inflation and interest rates. The purchasing managers index is an indicator of firm expectations of future economic performance and whether firms are seeking to expand supply. The success of these measures highlights the view that movements in stock returns are determined by expectations of future economic performance. This is supportive of the general asset pricing principle advocated, for example, by Campbell and Shiller (1988) and Lamont (1998), where movements in stocks depend upon expectations of future economic conditions. Further, the nature of these results is similar to that of Ang and Bekaert (2007), Welch and Goyal (2008) and Hjalmarsson (2010) in providing evidence that the term structure provides a superior forecast performance compared to, for example, the dividend-price ratio often preferred in the literature.
This paper contributes to our knowledge by emphasising the forecast ability of the term structure for stock returns and, to a lesser extent, the purchasing managers index. Equally, it shows that the other predictor variables do not exhibit similar forecast power. The results also emphasise the different conclusions that can be reached depending on whether statistical or economic forecast evaluation measures are used. Further, the results support greater evidence of predictability when we impose short-selling (non-negative forecast) constraints, across different market and economic regimes and different time periods. These latter points indicate a key result, namely that forecast power is not constant but varies over time.
The basic forecast equation is given by:

$$ r_t = \alpha + \beta x_{t-1} + \varepsilon_t \tag{1} $$

where $r_t$ is the stock return, $x_t$ the predictor variable and $\varepsilon_t$ a white noise error term. In order to conduct the forecasts, we consider two schemes, a rolling and a recursive approach. These approaches are designed to mimic an investor acting in real time and thus updating all the available information, including the data and the parameter estimates. They differ only in how they treat older observations, either retaining them through the recursive scheme or dropping them in the rolling approach. In both cases, we begin by estimating the initial model over an in-sample ten-year window and then obtain the forecast for the first out-of-sample observation. To obtain the second forecast, the end of the in-sample period is then rolled forward by one observation. Under the recursive scheme the starting point of the in-sample period remains fixed such that the in-sample period expands, while under the rolling scheme, the starting point of the in-sample period also moves forward by one observation such that the number of in-sample observations remains fixed. These respective processes continue through the rest of the sample period and we generate two forecast time series.
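As an illustration of the two schemes, the following is a minimal sketch in Python. It assumes the data are held in NumPy arrays and that the ten-year initial window corresponds to 40 quarterly observations. The function name `one_step_forecasts` and its arguments are hypothetical and are not taken from the original analysis; the regression is plain OLS on a constant and the lagged predictor, as in equation (1).

```python
import numpy as np

def one_step_forecasts(r, x, window=40, scheme="recursive"):
    """One-step-ahead forecasts from r_t = alpha + beta * x_{t-1} + e_t.

    r, x   : 1-D arrays of equal length (returns and a single predictor).
    window : size of the initial in-sample period (assumed: 40 quarters).
    scheme : "recursive" keeps the in-sample start fixed (expanding sample);
             "rolling" moves the start forward (fixed-length sample).
    Returns the forecast series and the matching realised returns.
    """
    r, x = np.asarray(r, float), np.asarray(x, float)
    y, lagged = r[1:], x[:-1]            # pair r_t with x_{t-1}
    forecasts = []
    for end in range(window, len(y)):
        start = 0 if scheme == "recursive" else end - window
        X = np.column_stack([np.ones(end - start), lagged[start:end]])
        alpha, beta = np.linalg.lstsq(X, y[start:end], rcond=None)[0]
        # lagged[end] is the latest observed predictor value; it is used to
        # forecast the first return outside the estimation sample.
        forecasts.append(alpha + beta * lagged[end])
    return np.array(forecasts), y[window:]
```

Calling the function once with `scheme="rolling"` and once with `scheme="recursive"` yields the two forecast series; the only difference between them is the `start` index of the estimation sample.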
We obtain individual forecasts for each of the series noted in the next section. In addition, we consider joint forecasts from the predictor variables. We do this in two ways. First, we simply include multiple variables in the regression. Second, we conduct forecast combinations. Specifically, we follow the complete subset regressions (CSR) approach of Elliott et al (2013). This approach equally weights the individual forecasts across different subsets of the models and has the advantage of diversifying across individual forecasts and thus reducing issues relating to model uncertainty and stability. As we consider eighteen predictor variables, an analysis of all subsets is not feasible (it would involve over 250,000 subsets), so we focus on a limited set. We consider an equally weighted combination of all forecasts, both including and excluding the historical mean, a combination of only those variables that are significant over the full sample regression (we recognise that this information would not be available to an investor in real time but believe the results will be illustrative) and a combination of variables that are significant over the initial sample. We also allow the combinations to vary according to statistical significance across the rolling and recursive samples. Thus, at each point in the rolling and recursive step we identify the significant predictor variables and obtain the equally weighted combination of their forecasts for the subsequent period.

To evaluate the forecasts, we consider measures based on the size and the sign of the forecast error and thus provide measures that have statistical and economic content. We first consider the mean squared error (MSE) and decompose this measure to consider different elements of the forecast performance. The MSE is given by:

$$ MSE = \frac{1}{\tau} \sum_{t=1}^{\tau} (r_t - r_t^f)^2 \tag{2} $$

where $\tau$ is the forecast sample size, $r_t$ is the actual return and $r_t^f$ represents the forecast series.
The MSE can also be decomposed into elements that represent the forecast bias, the difference in the variance of the forecast and actual series and a component that represents unsystematic forecast errors. This decomposition is given by:

$$ MSE = (\bar{r}^f - \bar{r})^2 + (\sigma_{r^f} - \sigma_r)^2 + 2(1 - \rho)\sigma_{r^f}\sigma_r \tag{3} $$

where the first component, $(\bar{r}^f - \bar{r})^2$, represents the bias and measures the difference between the mean of the forecast and actual series. The second component, $(\sigma_{r^f} - \sigma_r)^2$, represents the difference in the variance of the forecast and actual series, where $\sigma$ represents the standard deviation. The third component, $2(1 - \rho)\sigma_{r^f}\sigma_r$, captures the covariance proportion of the MSE and measures the unsystematic forecast error, where $\rho$ represents the correlation between the forecast and actual series. This decomposition allows us to identify where any forecast difference between alternative models arises.
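The decomposition can be checked numerically. The sketch below assumes the standard three-way split of the MSE into a squared mean difference (bias), a squared difference in standard deviations (variance) and a remainder equal to 2(1 − ρ) times the product of the two standard deviations (covariance). The function name `mse_decomposition` is hypothetical, and population (divide-by-τ) moments are assumed so that the three parts sum exactly to the MSE.

```python
import numpy as np

def mse_decomposition(actual, forecast):
    """Split the MSE into bias, variance and covariance components."""
    actual = np.asarray(actual, float)
    forecast = np.asarray(forecast, float)
    mse = np.mean((actual - forecast) ** 2)
    bias = (forecast.mean() - actual.mean()) ** 2
    s_f, s_a = forecast.std(), actual.std()      # population std (ddof=0)
    variance = (s_f - s_a) ** 2
    # Population covariance between forecast and actual series.
    cov = np.mean((forecast - forecast.mean()) * (actual - actual.mean()))
    # Equal to 2 * (1 - rho) * s_f * s_a, written without dividing by the
    # standard deviations so that a constant forecast does not produce NaN.
    covariance = 2.0 * (s_f * s_a - cov)
    return {"mse": mse, "bias": bias, "variance": variance,
            "covariance": covariance}
```

Dividing each component by the MSE gives the proportions; a model whose MSE is dominated by the covariance term is losing mainly through unsystematic errors rather than through a biased or mis-scaled forecast.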
While the MSE produces a single value for each forecast model, we also use the out-of-sample R-squared approach of Campbell and Thompson (2008) and Welch and Goyal (2008), which provides a single value to compare a baseline forecast with an alternative.
Moreover, the use of this measure has become increasingly popular in the stock return forecasting literature (for two recent examples, see Baltas and Karyampas, 2018; Hammerschmid and Lohre, 2018). Additionally, we use the test of Clark and West (2007) to provide a measure of statistical significance for these values.
The out-of-sample R-squared measure is given by:

$$ R^2_{OOS} = 1 - \frac{\sum_{t=1}^{\tau} (r_t - r_t^{f_2})^2}{\sum_{t=1}^{\tau} (r_t - r_t^{f_1})^2} \tag{4} $$

where $r_t^{f_1}$ is the forecast from the baseline model and $r_t^{f_2}$ is the forecast from the predictive model. A positive value indicates that the predictive model has a lower MSE than the baseline.

To provide some statistical robustness to this measure, we use the test of Clark and West (2007). This test considers whether the mean squared errors of two competing forecasts are statistically different. The Clark and West test adds a simple adjustment to the difference in the MSE to account for additional parameter estimation error in the larger model. Clark and West thus suggest generating the following time series:

$$ CW_t = FE_{1,t}^2 - (FE_{2,t}^2 - FE_{3,t}^2) \tag{5} $$

where $FE_1$ represents the forecast error for the forecast series generated from the baseline model, $FE_2$ is the forecast error for the forecast series generated from the predictive model and $FE_3$ is the difference between the baseline and predictive model forecasts. The generated CW series is then regressed on a constant, with the associated t-statistic providing the measure of significance.
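Both measures are straightforward to compute from the forecast series. The sketch below assumes the out-of-sample R-squared is one minus the ratio of the predictive model's sum of squared errors to that of the baseline, and that the Clark and West series is the baseline squared error minus the predictive squared error adjusted by the squared difference between the two forecasts. The function names are hypothetical, and the t-statistic uses a plain OLS standard error on the constant; a heteroskedasticity and autocorrelation robust standard error could be substituted.

```python
import numpy as np

def oos_r2(actual, f_base, f_pred):
    """Out-of-sample R-squared of a predictive model against a baseline."""
    actual, f_base, f_pred = (np.asarray(v, float)
                              for v in (actual, f_base, f_pred))
    sse_pred = np.sum((actual - f_pred) ** 2)
    sse_base = np.sum((actual - f_base) ** 2)
    return 1.0 - sse_pred / sse_base

def clark_west_t(actual, f_base, f_pred):
    """t-statistic from regressing the Clark-West series on a constant."""
    actual, f_base, f_pred = (np.asarray(v, float)
                              for v in (actual, f_base, f_pred))
    fe1 = actual - f_base          # baseline forecast error
    fe2 = actual - f_pred          # predictive model forecast error
    fe3 = f_base - f_pred          # difference between the two forecasts
    cw = fe1 ** 2 - (fe2 ** 2 - fe3 ** 2)
    # Regressing on a constant gives the sample mean as the coefficient;
    # the OLS standard error is the sample std divided by sqrt(n).
    return cw.mean() / (cw.std(ddof=1) / np.sqrt(cw.size))
```

A positive `oos_r2` means the predictive model has the lower MSE, while the Clark and West statistic is usually read as a one-sided test in favour of the larger model.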
The above tests measure the size of the forecast error. However, it is equally important to measure the sign of the forecast error as this provides market trading signals. In this circumstance, it is preferable to accurately predict a rise or fall in subsequent stock returns than to have a forecast value that is close to the realised value. Therefore, we calculate the success ratio (SR), which measures the proportion of correctly forecast signs:

$$ SR = \frac{1}{\tau} \sum_{t=1}^{\tau} s_t, \qquad s_t = I(r_t r_t^f > 0) \tag{6} $$

where $I(\cdot)$ is an indicator that equals one when the forecast and actual return have the same sign and zero otherwise. A SR value of one would indicate perfect sign predictability and a value of zero would indicate no sign predictability. Hence, in assessing the performance of each forecast model, we consider which model produces the highest SR value.
To complement this measure, we also provide a trading-based forecast. While the SR measure provides some trading information with respect to buy and sell signals, we expand this by considering a simple trading rule. Here, if the forecast of the next period return is positive then an investor buys the stock, while if the forecast of the next period return is negative, then the investor (short) sells the stock. From this process, we obtain a time series for trading returns and denote that $\pi$. To provide market relevant information, we then use this series to generate the Sharpe ratio as follows:

$$ SHARPE = \frac{\bar{\pi} - r_f}{\sigma} \tag{7} $$

where the Sharpe ratio is calculated as the ratio of the mean trading profit ($\bar{\pi}$) minus the risk-free rate ($r_f$), for which we use a short-term (3-month) Treasury bill, to the trading return standard deviation ($\sigma$).
A model that produces a higher Sharpe ratio therefore has superior risk-adjusted returns.
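The sign-based measures can be sketched as follows. The function names are hypothetical, the risk-free rate is assumed to be expressed per period in the same units as the returns, and a forecast of exactly zero is treated as a sell signal, which is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def success_ratio(actual, forecast):
    """Proportion of periods in which forecast and actual share a sign."""
    actual = np.asarray(actual, float)
    forecast = np.asarray(forecast, float)
    return np.mean(actual * forecast > 0)

def trading_returns(actual, forecast):
    """Long when the forecast is positive, short otherwise."""
    actual = np.asarray(actual, float)
    forecast = np.asarray(forecast, float)
    position = np.where(forecast > 0, 1.0, -1.0)
    return position * actual

def sharpe_ratio(pi, risk_free):
    """Mean trading return minus the mean risk-free rate, over the std."""
    pi = np.asarray(pi, float)
    return (pi.mean() - np.mean(risk_free)) / pi.std(ddof=1)
```

For example, `sharpe_ratio(trading_returns(actual, forecast), risk_free)` gives the Sharpe ratio of the rule for one forecast model, and the same call with the historical mean forecast gives the benchmark value.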

Data and Empirical Results.
Our variable of interest to be forecast is the S&P 500 composite index. Our analysis primarily focuses on the price index of this series; we also consider the total return index, but the results are highly similar. The data is sampled quarterly over the period 1960:1 to 2017:2.
The predictor variables are selected from a list of commonly used variables (see, for example, Welch and Goyal, 2008; Hammerschmid and Lohre, 2018). We include the log dividend-price ratio, the log price-earnings ratio, the cyclically adjusted price-earnings ratio, the payout ratio, the Fed model, the size premium, the value premium, the momentum premium, the quarterly change in GDP, consumption, investment and the CPI, the 10-year minus 3-month government treasuries term structure, Tobin's q-ratio, the purchasing managers index (PMI), equity allocation, central government consumption and investment and a short moving average. While no set of predictor variables can be exhaustive, the above selection is motivated by an attempt to cover a wide range of variable types, including financial price ratio variables, firm characteristic variables, macroeconomic variables and measures of confidence and market behaviour, subject to the restriction of data availability. The data is sourced from Datastream and the St Louis Federal Reserve FRED database.

Table 1 presents the estimates of equation (1) over the full sample using both the price only and the total return indices to form the stock return series. Each predictor variable is estimated individually, and we report the coefficient value, with the Newey-West t-statistic and R-squared value. As these equations cover the full sample, they are not used in the forecast exercise but are intended to provide information with regard to any variables that exhibit such full (in-)sample predictive power.

The results here show that only a limited number of variables exhibit statistical significance (including up to the 10% level). The variables that are significant across both the price and total return series are the Fed model, the value premium, PMI and equity allocation, while the dividend-price ratio and the q-ratio are additionally significant for the total return series only. The fact that there is limited full sample significance is not surprising. Indeed, there is much evidence that stock return predictability is characterised by regimes of predictability, perhaps due to breaks or non-linearities. For example, Paye and Timmermann (2006) suggest that breaks may exist in the predictive relation, while Lettau and van Nieuwerburgh (2008) suggest breaks in the predictor variable. McMillan (2014, 2015) seeks to explicitly model time-variation within the predictive series (dividend-price ratio), while Timmermann (2008) and McMillan and Wohar (2013) argue that returns predictability may only occur over short periods of time. This, therefore, further motivates the use of the rolling and recursive forecast schemes that can accommodate such patterns of behaviour. Given the broad similarity in the outcomes for the price only and total return series, the results below focus only on the former, but results for the latter are available upon request (and, again, are highly similar in nature).

Table 2 presents the rolling regression based forecast results for the MSE measure and its component parts. The historical mean (HM) forecast acts as the baseline measure. As with the forecast models, the historical mean forecast is obtained using a rolling and recursive scheme, which thus allows the value of the constant term to change. The results for the overall MSE measure show that the values (multiplied by 100 in the table) obtained by the historical mean forecast and the eighteen individual predictor variables are very similar. However, only the PMI series achieves a lower value than the HM.
Of note, the forecasts obtained using multivariate models perform poorly in comparison, especially the forecasts from the ratios and all groups. The combined forecasts are similar to those of the individual predictor variables and the HM, having a similar but slightly higher value than the HM.

Forecast Results
Examining the components of the MSE, in terms of the bias, i.e., how close on average the model's forecasts are to the actual series, we see that seven individual predictors outperform the HM and eleven are worse although, again, the values are similar. For the multivariate regressions, the macro and all models perform better, while for the combined forecasts the full and initial sample significance-based approaches also perform better.
The results based on the variance and covariance proportions of the MSE provide an interesting dichotomy. All the predictor models achieve a smaller difference between the variance of the forecast and actual return series compared to the HM series. In contrast, the HM series outperforms all the predictor models on the basis of the covariance component. This latter component captures unsystematic forecast errors and suggests that the failure of the predictor models to consistently outperform the HM does not lie in a systematic failure of the predictor but in (large) errors that arise from unexpected movements in returns.

Table 3 presents the same set of results for the forecasts obtained by the recursive modelling approach. The results here are broadly similar to those obtained under the rolling modelling scheme. The overall MSE values are very similar between the HM and predictor series. Again, of the individual forecasts, only the PMI series outperforms the HM, while for the combined forecasts, the models based on full sample significance of the predictor variables and t=1.90 significance at each recursion also outperform the HM. In terms of the MSE components, nine of the individual predictors have a lower mean difference compared to the HM between the forecasts and actual series. Likewise, the macro multivariate regression and all the combined forecasts (except the model based on initial sample significance) achieve a lower average bias. As with the rolling forecasts, all the predictor models achieve a lower variance forecast error component and a higher covariance forecast error component compared to the HM. The only exceptions are the combined forecasts that use all variables (including and excluding the HM) and that based on full sample significance, which have a higher variance and lower covariance compared to the HM.
The results of the MSE forecasts in Tables 2 and 3 suggest that, looking at the overall MSE values, there is little difference between the HM and the predictor models, although with very few exceptions, the HM performs better. However, this general result masks the view that several predictor variables achieve a better forecast based on a lower average forecast error and a lower variance forecast error value using either or both of the rolling and recursive techniques.
Notably, this includes the dividend-price ratio (rolling), the cyclically adjusted price-earnings ratio (both), the Fed model (rolling), the size premium (rolling), term structure (recursive), q-ratio (recursive), PMI (both), equity allocation (both) and government consumption and investment (rolling). In addition, several multivariate and combined forecast models also outperform the HM approach based on the mean and variance components. However, where the predictor models perform poorly in comparison to the HM is with respect to the unsystematic covariance component and thus the large unexpected movements in returns, for which the last period's value produces the current period's best forecast.
While the results in the above two tables examine the MSE value for each forecast model individually, Table 4 presents the forecast results using the rolling and recursive schemes based on the out-of-sample R-squared value (OOS R²) and the Clark and West test.
The OOS R² is essentially a comparison of the MSE values between the forecasts based on the predictor models and the HM, as the baseline model. Given the MSE values in Tables 2 and 3, it is unsurprising that only the PMI across the individual predictor variable forecasts achieves a positive OOS R² value, i.e., that its MSE is lower than the value for the HM, while the combination forecasts based on the t=1.90 approach do likewise for the recursive technique.
Moreover, what is revealed in Table 4

The above results are based on the size of the forecast error. However, in the context of financial returns data, sign forecasting is, at least, of equal importance as it implies market timing signals. Therefore, Table 5 presents the results of the success ratio and the Sharpe ratio obtained using the above trading rule based on the forecasts for both the rolling and recursive approaches. In comparison to the above results based around the MSE values, we find more supportive evidence in favour of the predictive variables. That said, the sum of the evidence still suggests that, with a few key exceptions, the relative difference in performance between the predictive models and the HM is small.
For the rolling regression based forecasts, of the eighteen individual predictor series, eight have a higher success ratio than the HM. The term structure and PMI are the highest at 66%, in comparison to a value of 60% for the HM. All the multivariate model forecasts achieve a value less than 60%, while the forecast combinations are all higher (with one exception, where the value is the same). Moreover, and notwithstanding this, the general range in value for the success ratio is from 55% to 66% (although there is one value at 52%), thus the value for the HM lies in the middle. For the Sharpe ratio, fourteen of the eighteen individual predictor variables have a higher value than the HM, two of the four multivariate models also do, while all the combination forecasts achieve a higher value. Moreover, the difference in value is now noticeably larger for several of the forecast models. Indeed, across the 28 different models, ten have a Sharpe ratio that is greater than double the value of 0.074, which is the value obtained by the HM. These ten are: the price-earnings ratio; the dividend-earnings ratio; the value premium; the term structure; PMI; and all combination models except that based on the initial sample significance. Additionally, a further twelve models outperform the HM.
The results from the recursive regression approach differ and suggest relatively less success for the predictive models over the HM. For the success ratio, only the term structure, PMI, the other group multivariate model (all 67%) and the forecast combination based on a t-value of 2.50 (66%) achieve a higher value than that of 65% for the HM. In addition, eight forecast models achieve the same success ratio as the HM. For the Sharpe ratio, the same four forecast models that achieve a higher success ratio also achieve a higher Sharpe ratio. Of note, the term structure and PMI achieve a clearly higher Sharpe ratio value. Additionally, equity allocation and the combination forecast based on a t-value of 1.90 also outperform the HM.
In comparing the rolling and recursive values across the MSE, success ratio and Sharpe ratio, we can see that the recursive approach appears to be largely preferred. The recursive approach, across the full range of predictor variables, typically achieves a lower MSE and higher success and Sharpe ratios. The difference between the rolling and recursive approaches lies in the treatment of older observations, which the former drops and the latter keeps. The result of this is that rolling estimates and forecasts tend to be more volatile than equivalent recursive ones. Examining the components of the MSE, we can see that the rolling forecasts achieve a lower average forecast error and a lower volatility forecast error but a larger unsystematic component. This indicates that the rolling approach tends to achieve a more accurate forecast but with very large errors such that the overall value indicates a worse performance. Notwithstanding this, the overall preferred model on the basis of the Sharpe ratio, which is perhaps the most important measure for investors, is the rolling term structure model.
In conducting the rolling and recursive regressions we examine the significance of the predictive variables, which we use in the forecast combinations as discussed above. We can also use this information to examine whether the predictive variables are significant over the full sample period or whether there is a significant predictive effect only over sub-sample periods (for example, Timmermann, 2008, argues that predictability only exists in small sub-periods). Figure 1 presents a set of graphs that shows the number of significant variables across each of the sample periods for both the rolling and recursive approaches and for t-values equal to 1.90 and 2.50. Specifically, we have eighteen individual predictive regressions, and the line in each graph shows the number of significant predictive variables across each period and significance level.
Across the four scenarios, we can see, as expected, that there is more evidence of significance using the t=1.90 level and using the recursive approach. At the t=1.90 level, the average number of significant variables in any sample period is three for the rolling approach and four for the recursive approach. For the rolling method, the maximum number of predictive variables is nine, while there are no predictive variables in 2009Q3 and for a small number of periods in the early 2010s. For the recursive method, the maximum number of significant predictive variables is seven, while at least one variable is significant in each sample period.
Using the t=2.50 cut-off, for the rolling approach, the average number of significant predictor variables is just one, while it is only two for the recursive approach. For the rolling method, we see several periods where there is no predictability, and this is notably concentrated around the late 2000s and early 2010s. For the recursive approach, over the same period, there is evidence of largely only one significant predictor.
Across the sample period, we see evidence of a greater number of predictor variables (particularly examining the recursive plots) over the periods of the first half of the 1970s, the second half of the 1980s and first half of the 1990s and possibly towards the end of the sample.
Across individual series, while there are too many graphs to consider, notable variables that exhibit significance across the sample include the dividend-price ratio, the price-earnings ratio, the cyclically adjusted price-earnings ratio, the term structure of interest rates, the q-ratio and PMI. Nonetheless, all of these variables exhibit periods of significance and insignificance, supporting the view that predictability only occurs over sub-sample periods but that periods of significance occur more regularly for some variables than others.
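The significance counts and the significance-based forecast combinations can be sketched together. The code below assumes a matrix of t-statistics with one row per estimation sample and one column per predictor, and treats a predictor as significant when the absolute t-statistic exceeds the cut-off (1.90 or 2.50). The function names are hypothetical, as is the use of a fallback forecast (for instance the historical mean) in periods where no predictor is significant; the text does not state how such periods are handled.

```python
import numpy as np

def count_significant(t_stats, threshold=1.90):
    """Number of significant predictors in each estimation sample."""
    return (np.abs(np.asarray(t_stats, float)) > threshold).sum(axis=1)

def significant_combination(forecasts, t_stats, fallback, threshold=1.90):
    """Equal-weighted average of the forecasts from significant predictors.

    forecasts, t_stats : (T, K) arrays, one column per predictor.
    fallback           : (T,) array used where no predictor is significant
                         (an assumption of this sketch).
    """
    forecasts = np.asarray(forecasts, float)
    mask = np.abs(np.asarray(t_stats, float)) > threshold
    counts = mask.sum(axis=1)
    sums = np.where(mask, forecasts, 0.0).sum(axis=1)
    means = sums / np.maximum(counts, 1)     # guard against division by zero
    return np.where(counts > 0, means, np.asarray(fallback, float))
```

Plotting `count_significant` against time for each scheme and cut-off would give lines of the kind described for Figure 1.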

Short-Selling (Campbell and Thompson) Restrictions
In the above analysis, and perhaps of notable economic relevance when generating the trading rules, we allow for short-selling. When the forecast for stock returns is negative, this is included in the calculation of the MSE values and generates a sell signal in trading. Campbell and Thompson (2008) argue that negative forecasts do not make economic sense as the forecasts are intended to model expected returns. Following this view, Tables 6 and 7 repeat the OOS R², Clark and West test, success ratio and Sharpe ratio for our rolling and recursive forecasts where any negative return forecast is replaced with a zero value. In essence, in respect of the trading rule, we are imposing a short-selling constraint.

The results in Table 6 reveal that the HM is still typically preferred as the OOS R² values are predominantly negative and, while they are small in value, they are often statistically significant, particularly for the rolling forecasts. Specifically, of the eighteen individual variable rolling forecasts, eleven significantly (including up to the 10% level) indicate preference for the HM. For the recursive forecasts only five of the individual forecasts are significantly worse than the HM approach. For the four multivariate regressions, the HM is statistically preferred regardless of the modelling approach, while for the combination forecasts, the approaches based on combining all individual forecasts are not significantly different from the HM, while the other combining approaches are worse.
In Table 7, we can see that the success ratios obtained by the HM and forecast models are very similar in value, although generally the HM achieves a slightly higher value. For the rolling forecasts, the term structure, government consumption and investment and three combination forecasts (the two all and the full sample significance combinations) achieve a higher success ratio than the HM. For the recursive forecasts, only the combination forecast based on a t=2.50 approach achieves a higher success ratio. For the Sharpe ratio a different view can be observed. For the rolling forecasts, fourteen of the eighteen individual variable forecasts, two of the four multivariate regressions and all the combination forecasts outperform the HM. For the recursive forecasts, the supportive evidence for the predictive models is weaker, with only five individual, one multivariate and two combination forecasts outperforming the HM. Notably, the five individual forecasts are for the term structure (which achieves the highest Sharpe ratio), PMI, equity allocation, inflation and the payout ratio.
As before, we can compare the performance of the forecasts obtained from the rolling and recursive approaches. The OOS R², the success ratio and the Sharpe ratio are generally better for the recursive technique. For example, just taking the Sharpe ratio, this value is higher for the recursive approach for thirteen of the eighteen individual forecasts and all the multivariate forecasts. Of note, however, for the combination forecasts, the rolling technique produces a higher Sharpe ratio. Again, however, the rolling term structure forecast (and now also the combination forecast based on a t-value of 1.90) produces the highest value. In comparing Tables 5 and 7, we can see that imposing a short-selling constraint leads to an improved Sharpe ratio.
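The effect of the short-selling constraint on the trading rule can be sketched as follows. This is an illustrative reading rather than the paper's exact rule: it assumes a unit long position when the forecast is positive, a unit short position when it is negative (if short-selling is allowed), no position otherwise, and a Sharpe ratio computed as the mean over the standard deviation of the strategy's excess returns. All names and numbers are hypothetical.

```python
from statistics import mean, stdev


def positions(forecasts, allow_short=True):
    """Map return forecasts to positions: +1 long, -1 short, 0 out of the market."""
    result = []
    for f in forecasts:
        if f > 0:
            result.append(1.0)
        elif f < 0 and allow_short:
            result.append(-1.0)
        else:
            # Assumption: with the constraint, a non-positive forecast means no position.
            result.append(0.0)
    return result


def strategy_sharpe(forecasts, realised_excess, allow_short=True):
    """Sharpe ratio (mean / sample standard deviation) of the strategy's excess returns."""
    pos = positions(forecasts, allow_short)
    strategy_returns = [p * r for p, r in zip(pos, realised_excess)]
    return mean(strategy_returns) / stdev(strategy_returns)


# Hypothetical forecasts and realised excess returns (illustration only).
forecasts = [1.0, -0.5, 2.0, -1.0]
realised = [2.0, 1.0, 3.0, -2.0]

sharpe_unrestricted = strategy_sharpe(forecasts, realised, allow_short=True)
sharpe_restricted = strategy_sharpe(forecasts, realised, allow_short=False)
print(sharpe_unrestricted, sharpe_restricted)
```

Whether the constraint raises or lowers the Sharpe ratio depends on how often negative forecasts are followed by negative returns, which is an empirical matter.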
Overall, the above results suggest several pertinent points. If we consider the headline statistical measure, then the general conclusion would be that the HM is preferred. There are exceptions but no consistent evidence in favour of predictability. However, a deeper analysis suggests that the predictive models produce a forecast that on average is closer to the mean of the series compared to the HM. Equally, the variance of the forecasts produced by the predictive models is closer to the variance of the actual return series than is the case for the HM. The low overall ranking of the predictive models arises from large unsystematic forecast errors. The economic forecast measures provide greater support for the predictive models, especially with regard to the Sharpe ratio. In comparing the rolling and recursive forecasting approaches, while the latter appears to generally provide a superior forecast performance, again there is a distinction between the mean and variance on the one hand and the covariance (unsystematic) components on the other. The rolling forecasts, which tend to be more volatile, achieve lower systematic forecast errors but higher unsystematic forecast errors. The use of short-selling restrictions improves the forecast performance and suggests that forecasts of negative returns lead to inferior performance on both statistical and economic measures. Across the different approaches and variables, picking out one forecast model, a rolling regression using the term structure achieves the best performance if we focus on the Sharpe ratio as being the most relevant measure for investors.
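The distinction between the mean, variance and covariance components of the forecast error can be made concrete with a short sketch. It assumes the standard Theil decomposition of the MSE into bias, variance and covariance proportions; the paper's own equation is not reproduced in this section, so the exact form used there may differ. The function name and the numbers are hypothetical.

```python
from math import sqrt


def mse_decomposition(actual, forecast):
    """Split the MSE into bias, variance and covariance proportions (assumed Theil form).

    MSE = (mean_f - mean_a)^2 + (sd_f - sd_a)^2 + 2 * (sd_f * sd_a - cov_fa),
    using population (divide-by-n) moments. Requires a non-zero MSE.
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_f = sum(forecast) / n
    sd_a = sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    sd_f = sqrt(sum((f - mean_f) ** 2 for f in forecast) / n)
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, forecast)) / n
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n
    return {
        "mse": mse,
        "bias": (mean_f - mean_a) ** 2 / mse,           # systematic: mean
        "variance": (sd_f - sd_a) ** 2 / mse,           # systematic: variance
        "covariance": 2.0 * (sd_f * sd_a - cov) / mse,  # unsystematic
    }


# Hypothetical returns and forecasts (illustration only).
parts = mse_decomposition([2.0, -1.0, 3.0, 0.5], [1.0, 0.5, 1.5, 1.0])
print(parts)
```

The three proportions sum to one, so a model can have small bias and variance proportions and still rank poorly overall if its MSE is dominated by the covariance component.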

Time-Varying Forecast Models
The above analysis measures the performance of the forecasts obtained from the predictive models both individually and as a group, either through multivariate forecasts or forecast combinations. The use of rolling and recursive modelling approaches allows for time-variation to exist in the parameter values and the statistical significance of the regressions. However, the estimated model is nonetheless a linear one. As noted in the Introduction, there is evidence that forecasts may be improved through considering differing regimes of behaviour. For example, Hammerschmid and Lohre (2018) consider forecasts according to macroeconomic conditions using a Markov-switching approach; threshold regressions are considered by McMillan (2001, 2003); and Henkel et al. (2011) argue that predictability only arises during recessionary periods.
We consider the importance of regimes of behaviour in predictability and forecasting in two different ways. First, we examine the forecast ability of the predictor variables according to whether the market is in a bull or bear phase and whether the economy is in a contractionary or expansionary state. Second, we estimate an explicit threshold regression (TR) model for each predictor variable. In the TR models, the predictor variable is also used as the threshold variable, although for the multivariate models, where this becomes impractical, we use a lag of stock returns themselves as the threshold. For the forecast combinations we now only consider a combination of all variable forecasts and those corresponding to significance across each individual step in the rolling and recursive exercises.

Table 8 presents the OOS R² and Sharpe ratio results when we separate the forecast sample between bull and bear markets and expansionary and contractionary periods. For space considerations, we only report these two statistics and only for the rolling regression, but these results highlight the key conclusions from this approach. To define bull/bear market periods, we follow Cooper et al. (2004) and use the three-year moving average of the stock index. Specifically, if the change in the three-year moving average is positive then the market is characterised as a bull market, while if the change is negative, the market is in a bear phase. To define expansionary and contractionary regimes, we use output (GDP) growth over two consecutive quarters. Where this value is positive, we ascribe that to be an expansionary regime, and where it is negative, a contractionary regime.
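The regime definitions above can be sketched in a few lines. This is an illustrative reading only: it assumes quarterly data, so that a three-year moving average spans twelve observations, and it interprets growth over two consecutive quarters as the sum of the current and previous quarterly growth rates. A zero change, which the text does not cover, is assigned here to the bear or contractionary regime. Function names and numbers are hypothetical.

```python
def moving_average(series, window):
    """Trailing moving average; the first value uses observations 0..window-1."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def market_regimes(index_levels, window=12):
    """Label each period bull/bear by the sign of the change in the moving average."""
    ma = moving_average(index_levels, window)
    # Assumption: a zero change is treated as bear.
    return ["bull" if curr > prev else "bear" for prev, curr in zip(ma, ma[1:])]


def economic_regimes(gdp_growth):
    """Label each period by the sign of growth summed over two consecutive quarters."""
    # Assumption: two-quarter growth is the sum of the two quarterly growth rates.
    return [
        "expansion" if prev + curr > 0 else "contraction"
        for prev, curr in zip(gdp_growth, gdp_growth[1:])
    ]


# Hypothetical data with a short window to keep the example small.
market = market_regimes([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0], window=3)
economy = economic_regimes([0.5, 0.2, -0.4, -0.3, 0.6])
print(market)
print(economy)
```

The resulting labels can then be used to split the forecast errors and strategy returns before computing the OOS R² and Sharpe ratio within each regime.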
In the bear market regime, we can see that the HM again outperforms the predictive models on the basis of the OOS R², although the values are close to zero. The exceptions are the term structure predictive variable and the combination forecast based on t=1.90, for which both OOS R² values are positive. A similar picture is seen in the bull market regime, although there is some further evidence of positive OOS R² values, but again with the values being small in magnitude. With regard to the Sharpe ratios, we see more of a distinction between the two regimes. In the bear market regime, the HM achieves the lowest Sharpe ratio and, while most of the values are negative (given the bear market), for five individual series, two multivariate regressions and one forecast combination, we observe a positive Sharpe ratio. Moreover, the Sharpe ratio for the term structure is the largest value. In the bull market regime, however, only five models (three individual forecasts and two forecast combinations) outperform the HM. Again, the term structure forecast model performs well, although the PMI value is slightly higher. This suggests a clear distinction in the ability of the predictive models (or the HM) across market regimes.
Examining the results for expansionary and contractionary regimes, we see a similar dichotomy to that for the market regimes. Across both regimes, on the basis of the OOS R², the HM is typically preferred but all the values are small (with the exception of the multivariate models). In the expansionary regime, the HM achieves a higher Sharpe ratio than the majority of the predictive models, although there is evidence of some success for the predictive models. Notably, eight of the eighteen individual predictor variables achieve a higher Sharpe ratio, as do all the forecast combination models. The term structure predictive variable achieves the highest Sharpe ratio. As with the bear market regime, in the contractionary regime, all the predictive models achieve a higher Sharpe ratio compared to the HM. Moreover, while for many of the forecast models the Sharpe ratio is negative (given the state of the economy), for eight individual predictors, two multivariate models and five forecast combinations, the Sharpe ratio is positive. The term structure variable again produces a relatively high Sharpe ratio, although the value for the PMI is slightly higher.
Overall, these results suggest that the HM approach typically outperforms the predictive models in bull markets and economic expansions, while the predictive models perform well during bear market conditions and economic contractions. Indeed, in both these regimes, on the basis of the Sharpe ratio, the HM performs the worst. Furthermore, across the different regimes, the term structure predictive model consistently achieves a strong performance.
The results of the TR regressions are reported in Tables 9 and 10. In Table 9, which presents the OOS R² values and the Clark and West test, the results present a similar picture to that revealed earlier for the linear models in Table 4. Specifically, the OOS R² values are small but nearly all negative, indicating preference for the HM. In terms of statistical significance, under the rolling scheme, the Clark and West test is significant for nearly all the predictive models, whereas significance is much lower under the recursive approach. Table 10 presents the success ratio and Sharpe ratio for the TR models. For the rolling approach, we can see that several predictor variables achieve a higher success ratio than the historical mean, notably the dividend-price ratio, size and value premium, investment growth, the term structure, PMI and government consumption and investment. In addition, these variables, plus several others (price-earnings ratio, cyclically adjusted price-earnings ratio, dividend-earnings ratio, Fed model, consumption growth, the moving average and all forecast combination models), achieve a higher Sharpe ratio than the historical mean. For the recursive estimation approach, the results are less supportive across the full sweep of the forecast models. For the success ratio, no model outperforms the historical mean, although the values are similar and, for the PMI variable, the same. For the Sharpe ratio, only two predictor variables achieve a higher value, the term structure and PMI. Thus, although only a small number of variables outperform the HM, it is the same two variables that notably perform well. Moreover, the Sharpe ratio for the term structure here is the highest value achieved across all models. The term structure represents investor perceptions of the future economy and whether the economy is likely to grow, leading to higher future inflation and long-term interest rates. The PMI represents firm expectations of future economic performance and whether they believe future output will be higher and thus are seeking to expand their own production.
As with the linear regression results above, we can also examine the significance of the predictive variables in the threshold regressions. Figure 2 presents the equivalent graphs to Figure 1 for the threshold regression parameters. That is, Figure 2 shows the number of significant predictive variables across the rolling and recursive approaches for the t=1.90 and 2.50 values. In comparing the two figures, we can see a higher level of significance with the threshold regressions. Here the average number of significant parameters is six for the rolling and recursive approaches when the t value is equal to 1.90 (in comparison to three and four respectively for the linear models). For t=2.50, the average number of significant variables is five and four for the rolling and recursive approaches respectively (in comparison to one and two respectively for the linear approach). In addition to the higher average values, we can also see that the maximum number of significant variables is higher for the threshold than for the linear approach. Across the rolling and recursive approaches and the t=1.90 and 2.50 values, the maximum number of significant predictive variables is between seven and twelve compared to four to nine for the linear models. Equally, the lowest number of significant variables is one (for the higher t-value) or two, compared to zero and one for the linear model.
These plots again demonstrate that predictability is not constant over the full sample period. We again find evidence that predictability occurs across more variables notably at the beginning of the sample, between the mid-1980s and mid-1990s and towards the end of the sample. In contrast, predictability is limited to a small number of variables in the early 1980s and early 2000s. Without reporting the individual graphs, the variables that exhibit more periods of significance include the dividend-price ratio, price-earnings ratio, cyclically adjusted price-earnings ratio, the term structure, q-ratio and PMI.
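For readers unfamiliar with threshold regressions, the following minimal sketch shows one way a two-regime model of the kind discussed in this section can be estimated for a single predictor that also serves as the threshold variable. It is not the paper's estimation code: the grid search over observed predictor values, the minimum number of observations per regime and all function names are assumptions made for illustration, and the data are constructed so that the true threshold is known.

```python
def ols(x, y):
    """Simple OLS of y on a constant and x; returns (alpha, beta, ssr) or None."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None  # slope not identified in this regime
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    alpha = mean_y - beta * mean_x
    ssr = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    return alpha, beta, ssr


def fit_threshold(x, y, min_obs=3):
    """Grid search (assumed) for the threshold c minimising the combined SSR.

    Regime 'low' uses observations with x <= c, regime 'high' those with x > c.
    Returns None if no candidate leaves min_obs observations in each regime.
    """
    best = None
    for c in sorted(set(x)):
        low = [(xi, yi) for xi, yi in zip(x, y) if xi <= c]
        high = [(xi, yi) for xi, yi in zip(x, y) if xi > c]
        if len(low) < min_obs or len(high) < min_obs:
            continue
        fit_low, fit_high = ols(*zip(*low)), ols(*zip(*high))
        if fit_low is None or fit_high is None:
            continue
        total_ssr = fit_low[2] + fit_high[2]
        if best is None or total_ssr < best["ssr"]:
            best = {"threshold": c, "low": fit_low[:2],
                    "high": fit_high[:2], "ssr": total_ssr}
    return best


def forecast(model, x_new):
    """One-step forecast using the regime that x_new falls into."""
    alpha, beta = model["low"] if x_new <= model["threshold"] else model["high"]
    return alpha + beta * x_new


# Constructed data: y = 1 + 2x when x <= 0 and y = 5 - x when x > 0.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
y = [-5.0, -3.0, -1.0, 1.0, 4.0, 3.0, 2.0, 1.0]
model = fit_threshold(x, y)
print(model["threshold"], forecast(model, -1.0), forecast(model, 2.0))
```

In a forecasting exercise, x would hold the lagged predictor and y the subsequent return, with the model re-estimated at each rolling or recursive step.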

Time-Varying Forecast Evaluation
The literature highlights the view that forecast success may only occur in pockets of time. The evidence reported in Figures 1 and 2 illustrates that the predictive power of the variables varies over time. We would therefore expect the forecast success of the predictive variables to change over time as well.
Accordingly, using the first set of linear based results reported in Tables 4 and 5, we calculate the OOS R² and Sharpe ratio on both a recursive and rolling basis to consider how these values, and thus the relative forecast success, vary over time.⁵ We only present the plots for four predictive models: the PMI and term structure, as the above results support their forecasting ability, and the forecast combination models based on using all predictor variables and those that are significant at each recursive/rolling step using a t-value of 1.90. We present both the recursive and rolling plots as the latter will highlight more temporary periods of forecast success, while the former will illustrate longer periods of such behaviour.
Taking both figures, we can see evidence where the forecast models are preferred on both the OOS R² and Sharpe ratio measures, although it is noticeable that the periods of success across these two measures do not coincide. Looking at Figure 3 for the OOS R² plots, we can see a positive value, indicating preference over the HM, occurring during the early to mid-1980s, the first half of the 1990s, the early to mid-2000s and the mid-2010s. We can also see that this pattern can be more clearly observed for the forecast combination models and the term structure, while for the PMI model the periods of success are more transient. For the Sharpe ratio plots in Figure 4, the periods of greater success occur slightly after those indicated for the OOS R². Looking particularly at the rolling plots, the higher Sharpe ratios are seen during the mid to late 1980s, the mid to late 1990s, the mid to late 2000s and towards the end of the sample. As such, it appears that a higher OOS R² value precedes a higher Sharpe ratio.
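The time-varying evaluation described above can be illustrated with a short sketch that recomputes the OOS R² over expanding (recursive) and fixed-length (rolling) evaluation windows. It assumes the conventional definition of the OOS R² as one minus the ratio of the model MSE to the HM MSE; the window lengths, function names and numbers are hypothetical and do not correspond to the paper's figures.

```python
def oos_r2(actual, model_forecasts, hm_forecasts):
    """Assumed form of the OOS R^2: 1 - MSE(model) / MSE(historical mean)."""
    sse_model = sum((a - f) ** 2 for a, f in zip(actual, model_forecasts))
    sse_hm = sum((a - h) ** 2 for a, h in zip(actual, hm_forecasts))
    return 1.0 - sse_model / sse_hm


def recursive_oos_r2(actual, model_forecasts, hm_forecasts, min_obs):
    """OOS R^2 over an expanding window that always starts at the first forecast."""
    return [
        oos_r2(actual[:t], model_forecasts[:t], hm_forecasts[:t])
        for t in range(min_obs, len(actual) + 1)
    ]


def rolling_oos_r2(actual, model_forecasts, hm_forecasts, window):
    """OOS R^2 over a fixed-length window that moves through the forecast sample."""
    return [
        oos_r2(actual[t - window:t], model_forecasts[t - window:t],
               hm_forecasts[t - window:t])
        for t in range(window, len(actual) + 1)
    ]


# Hypothetical returns and forecasts (illustration only).
actual = [2.0, -1.0, 3.0, 0.5, 1.0, -2.0]
model = [1.0, -2.0, 2.0, -0.5, 3.0, 1.0]
hm = [1.0] * 6

recursive_path = recursive_oos_r2(actual, model, hm, min_obs=4)
rolling_path = rolling_oos_r2(actual, model, hm, window=4)
print(recursive_path)
print(rolling_path)
```

The rolling path reacts more quickly to short-lived episodes of forecast success, whereas the recursive path smooths them into the cumulative record, which is why both are informative. The same windowing can be applied to the Sharpe ratio of the strategy returns.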
Overall, the results from the full set of empirical tests above suggest that the term structure of interest rates provides the best set of out-of-sample forecasts. The purchasing managers index provides the second best set of forecasts. The term structure reveals investor expectations of the future course of the economy. A steepening term structure indicates that investors expect higher future interest rates that will arise from higher future inflation and thus an expanding economy. The result that the term structure achieves the best forecast performance is similar to that reported by Welch and Goyal (2008) and Hjalmarsson (2010). While one strand of research seeks to emphasise the ability of fundamental to price ratio series (beginning with Campbell and Shiller, 1988) as proxies for expected returns, the results here suggest that a more explicit predictor of future economic conditions provides a better forecast performance.

Summary and Conclusions.
Using quarterly US data from 1960 to the end of 2017, we conduct ten-year rolling and recursive forecasts for a range of eighteen financial and economic predictor variables. The forecasts are generated from individual regressions, multivariate regressions and forecast combinations. We use both statistical and economic evaluations of the forecasts that are based on linear and threshold models and are considered over economic and market cycles and calculated over the out-of-sample period as an average and on a rolling and recursive basis.

The results present several interesting conclusions; however, the overriding takeaway point is that the term structure of interest rates (10-year Treasury bond minus 3-month Treasury bill) and (to a lesser extent) the purchasing managers index provide consistent forecast performance that is superior to the HM across different forecasting approaches and regimes of behaviour. Using linear or non-linear models, rolling or recursive approaches, with or without short-selling restrictions, and allowing for regimes according to economic or market conditions, these two variables are consistently the best two forecast models when using the economic based Sharpe ratio as the forecast measure, which is the measure relevant for investors. The historical mean model outperforms the vast majority of the predictor variables and models. This is more noticeably true using mean squared error based measures of forecasting ability; however, the numerical difference in values is typically small. One point of interest is that a decomposition of the mean squared error reveals that the forecast models typically outperform the historical mean model in terms of forecast bias and the volatility of forecasts but are subject to large unsystematic forecast errors, resulting in an overall poorer performance.
The Sharpe ratio values for rolling forecasts show improvement over the historical mean, although that does not hold as strongly for the recursive forecasts.
Building upon these results, short-selling restrictions improve the forecasts, while separating the forecast evaluations between periods of bull and bear market behaviour and economic expansion and contraction indicates the forecast models are more accurate during bear markets and economic contractions.
The use of rolling and recursive approaches for both the linear and non-linear models as well as the forecast evaluations reveals the existence of notable periods of time where the predictive models outperform the historical mean that are otherwise masked by examining statistics over the whole period. Time periods during each of the 1980s, 1990s, 2000s and 2010s reveal evidence of in-sample predictability and out-of-sample forecast power, which highlights the temporary nature of predictability. Interestingly, there is evidence that periods of relative statistical forecast success precede periods of successful economic forecast performance.
Cutting across these alternative modelling and forecasting approaches, and alternative forecast evaluations, the term structure of interest rates and the purchasing managers index achieve consistently strong forecast performance, especially (but not only) when assessed according to the Sharpe ratio measure. These two variables are based on either investor or firm expectations of future economic performance, i.e. whether investors expect higher future inflation and interest rates as the economy expands, or whether firms expect an increase in orders, again as the economy expands. The concluding point of this paper is that quarterly US stock returns can be forecast and that, while forecast performance is variable, these two series provide as consistent a performance as is likely to occur. It remains to be seen whether a similar result will be repeated across alternative markets.

McMillan, D.G. (2015). Time-varying predictability for stock returns, dividend growth and consumption growth. International Journal of Finance and Economics, 20, 362-373.
McMillan, D.G. and Wohar, M.E. (2013). A panel analysis of the stock return dividend yield relation: predicting returns and dividend growth. Manchester School, 81, 386-400.
Park, C. (2010). When does the dividend-price ratio predict stock returns? Journal of Empirical Finance, 17, 81-101.
Paye, B. and Timmermann, A. (2006). Instability of return prediction models. Journal of Empirical Finance, 13, 274-315.
Pesaran, M.H. and Timmermann, A. (1995). Predictability of stock returns: Robustness and economic significance. Journal of Finance, 50, 1201-1228.
Pesaran, M.H. and Timmermann, A. (2000). A recursive modelling approach to predicting UK stock returns. Economic Journal, 110, 159-191.
Timmermann, A. (2008). Elusive return predictability. International Journal of Forecasting, 24, 1-18.
Welch, I. and Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies, 21, 1455-1508.
(1). The increasing number of asterisks refers to statistical significance at the 10%, 5% and 1% levels. The dependent variable is the stock return (log difference) multiplied by 100. The explanatory variables are DP (log dividend-price ratio), PE (log price-earnings ratio), CAPE (cyclically adjusted PE ratio), DE (log dividend-earnings ratio), FED (earnings yield divided by the 10-year Treasury bond), SMB (the return premium to small firms over large firms), HML (the return premium to value firms over growth firms), MOM (the return premium to past winner firms over loser firms), GDP (the period growth rate of real GDP), Cons (the period growth rate of personal consumption), Inv (the period growth rate of investment), Infl (the period change in CPI), TS (the difference between the yield on a 10-year Treasury bond and 3-month Treasury bill), Q-ratio (Tobin's Q), PMI (the purchasing managers index), Eq Alloc (market value of stocks divided by the sum of market value of stocks and investor holdings of cash and bonds), Gov C&I (central government consumption and investment), MA1Yr (a one-year lagged moving average of stock returns).
(GDP, Cons, Inv, Infl, TS, Gov C&I), Other (SMB, HML, MOM, Q-Ratio, PMI, Eq Alloc, MA1Yr), All (all variables). Forecast combinations are also considered: CSR-All (all forecasts), CSR-All excl HM (all forecasts excluding the historical mean forecast), CSR-Full Smp Sig (forecast combination includes all variables that are statistically significant at the 5% level in Table 1), CSR-Fst Smpl Sig (forecast combination includes all variables that are statistically significant at the 5% level at the first in-sample period), CSR-t=1.90 (forecast combination that includes all variables that have a t-statistic in absolute value equal to or greater than 1.90 at each rolling step), CSR-t=2.50 (forecast combination that includes all variables that have a t-statistic in absolute value equal to or greater than 2.50 at each rolling step).
(4) and the Clark and West (2007) test of equation (5).
(6) and Sharpe Ratio of equation (7).
Table 5, values differ as a zero value is imposed where a negative return forecast is obtained.
Table 4, with the forecasts now obtained from a threshold model.