MathType

Monday, 25 July 2016

Regression and Forecasting with Stata (Time Series)







One of the reasons we estimate a regression model is to generate forecasts of the dependent variable.
Before we do any forecasting, the first thing we need is a concrete model that we can refer to.

Some students of econometrics are unclear about certain issues when it comes to using regression models for forecasting.

Keep in mind that forecasting does not necessarily require time series data, although time series data are the most common setting for forecasting.

Let's begin with what we call a static linear regression model, which means that no lagged values of the dependent variable enter as regressors in our model.

\({{y}_{t}}={{\beta }_{1}}+{{\beta }_{2}}{{x}_{2t}}+{{\beta }_{3}}{{x}_{3t}}+...+{{\beta }_{k}}{{x}_{kt}}+{{\varepsilon }_{t}}\)                                                                  (1)

We assume that the error term in Eq(1) is white noise, or “well-behaved”, and that our model is estimated with a sample of \(T\) observations.

Now, let \({{b}_{i}}\) denote the OLS estimator of \({{\beta }_{i}}\), for \(i=1,2,...,k\).

That means the “fitted values” from our estimated model become:

\(y_{t}^{*}={{b}_{1}}+{{b}_{2}}{{x}_{2t}}+{{b}_{3}}{{x}_{3t}}+...+{{b}_{k}}{{x}_{kt}}\)                                                                          (2)

Eq(2) gives just the within-sample predictions of the estimated model.

These predictions are constructed using the point estimates of the regression coefficients and the actual observed values of the regressors, for \(t=1,2,...,T\).

In Eq(2), we also notice that in obtaining the fitted values, the error term has been set to its assumed mean value of zero.
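In Stata, these fitted values can be obtained with predict after estimation. A minimal sketch (the variable names y, x2, and x3 are placeholders for illustration, not from the dataset used below):

 regress y x2 x3
 predict yhat, xb

The xb option makes predict return the linear prediction \({{b}_{1}}+{{b}_{2}}{{x}_{2t}}+{{b}_{3}}{{x}_{3t}}\), which is exactly Eq(2).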

Suppose now that, as time passes, we obtain \(n\) additional observations (for periods \(T+1\) to \(T+n\)) on y and all of the \(x\) variables. However, we still use our OLS parameter estimates based on the original \(T\) observations.

Now we can generate what we call an ex post forecast of the \(n\) additional observations.

We know the values of these data, but they haven’t been used in the estimation of the model. Because we know exactly what actually happened, we can see how well our model performs when forecasting these \(n\) values.

We use the dataset consdpi.dta, which contains two time series, namely realcons and realdpi.

Before we estimate the long-run relationship, the two series must be tested for a unit root to make sure that each is stationary in first differences.

varsoc realcons
dfuller realcons, trend lags(#)
varsoc realdpi
dfuller realdpi, trend lags(#)

varsoc D.realcons
dfuller D.realcons, lags(#)
varsoc D.realdpi
dfuller D.realdpi, lags(#)

where # is the number of lags suggested by the information criteria reported by varsoc.

The unit root tests show that the two series are \(I\left( 1 \right)\), and they are cointegrated.
In this case, we can now estimate the long-run relationship by regressing one series on the other, without any differencing, using the data from 1950Q1 to 1983Q4.
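The cointegration claim can be checked with an Engle-Granger residual-based test. A sketch (the lag choice here is illustrative, and note that the proper critical values for this test differ from the standard Dickey-Fuller ones reported by dfuller):

 quietly regress realcons realdpi if tin(1950q1,1983q4)
 predict ehat if tin(1950q1,1983q4), residuals
 dfuller ehat, noconstant lags(4)
 drop ehat

If the residuals are stationary, the two \(I\left( 1 \right)\) series are cointegrated.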

reg realcons realdpi if tin(1950q1,1983q4)


Right now we have estimated the model with the sample ending in 1983Q4 even though we also have data for 1984Q1 to 1985Q4.

Let's use these other eight observations for ex post forecasting.

regress realcons realdpi if tin(1950q1,1983q4)
estimates store consmodel
forecast create expostforecast, replace
forecast estimates consmodel
set seed 1
forecast solve, prefix(f_) begin(tq(1984q1)) end(tq(1985q4)) static ///
    simulate(betas, statistic(stddev, prefix(sd_)) reps(100))

Let’s now compare realcons and f_realcons (the forecast of realcons) over the forecast period.

list realcons f_realcons if tin(1984q1,1985q4)

Before we plot the data, let’s compute the upper and lower bounds of a 95% prediction interval for our forecast of realcons.
gen f_y_up = f_realcons + invnormal(0.975)*sd_realcons
gen f_y_dn = f_realcons + invnormal(0.025)*sd_realcons

Let’s plot the data:

     
     twoway (line realcons year) (line f_realcons year) ///
         (line f_y_up year, lpattern(dash)) ///
         (line f_y_dn year, lpattern(dash)) if tin(1984q1,1985q4)


Now, think about forecasting in “real time”, and suppose that we estimate our model using a sample of \(T\) observations.

Then we want to forecast for another \(n\) observations. At this point, we do not know the actual values of \({{y}_{t}}\) for these data points. This is called ex ante forecasting.

This type of forecasting raises a practical problem. If we want to apply Eq(2) for period \(\left( T+1 \right)\), we need a value for each of the regressors in period \(\left( T+1 \right)\).

In practice, there are two ways to supply these values: either we insert “educated guesses”, or we use some other forecasting method, such as an ARIMA model, to predict the future values of the regressors.
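For example, a univariate ARIMA model could supply future values of realdpi. A sketch (the (1,1,1) order is an assumption for illustration, not the result of any model selection):

 arima realdpi if tin(1950q1,1983q4), arima(1,1,1)
 predict dpihat, y dynamic(tq(1984q1))
 list realdpi dpihat if tin(1984q1,1985q4)

The y and dynamic() options make predict return dynamic forecasts in the levels of the series from 1984Q1 onwards.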

The discussion so far has been based on what we usually call static forecasting, meaning that our regression model in Eq(1) is in “static” form (rather than dynamic), because none of the variables on the right-hand side are lagged values of \({{y}_{t}}\).

Now, let us modify Eq(1) to include a lagged value of the dependent variable among the regressors:

\({{y}_{t}}={{\beta }_{1}}+{{\beta }_{2}}{{y}_{t-1}}+{{\beta }_{3}}{{x}_{3t}}+...+{{\beta }_{k}}{{x}_{kt}}+{{\varepsilon }_{t}}\)                                                                (3)

Eq(3) can be used to obtain either an ex post or an ex ante forecast for observation \(\left( T+1 \right)\) as follows:

\(y_{T+1}^{*}={{b}_{1}}+{{b}_{2}}{{y}_{T}}+{{b}_{3}}{{x}_{3,T+1}}+...+{{b}_{k}}{{x}_{k,T+1}}\)                                                     (4)

and in Eq(4), at time \(\left( T+1 \right)\) we already know the value of the lagged dependent variable (the observed value of \({{y}_{T}}\)).

However, when we forecast \({{y}_{t}}\) for period \(\left( T+2 \right)\), there are actually two options open to us in the case of ex post forecasting:

a)      We could insert the known value of \({{y}_{T+1}}\) for the lagged dependent variable in the forecasting equation (together with values for \({{x}_{3,T+2}},{{x}_{4,T+2}}\), etc.).
b)      Alternatively, we could insert the previously predicted value of \({{y}_{T+1}}\), namely \(y_{T+1}^{*}\) from Eq(4), together with the appropriate \(x\) values.

The same options remain for forecasting in period \(\left( T+3 \right)\) and so on.

The first option above is called static forecasting, while the second option is called dynamic forecasting.

When we undertake ex ante forecasting for two or more periods ahead, we must use dynamic forecasting, because we do not know the true values of the dependent variable outside the sample.

Once again, future values of the \(x\) variables will have to be obtained in some way or other, and this can be a major exercise in itself.

Let's now estimate Eq(3) as a simple dynamic model, with only one lag of the dependent variable as a regressor:

      reg realcons L.realcons realdpi if tin(1950q1,1983q4)

The results suggest that our model is not robust, and the residuals may exhibit some autocorrelation.
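One way to check for residual autocorrelation here is the Breusch-Godfrey test, which (unlike the Durbin-Watson test) remains valid when a lagged dependent variable appears among the regressors. A sketch, testing up to four lags:

 quietly regress realcons L.realcons realdpi if tin(1950q1,1983q4)
 estat bgodfrey, lags(1/4)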

But to illustrate the points discussed above, let’s just ignore any problems that might exist.
Now, let’s generate some ex post static forecasts:

 reg realcons L.realcons realdpi if tin(1950q1,1983q4)
 estimates store consmodel
 forecast create forecaststat, replace
 forecast estimates consmodel
 set seed 1
 forecast solve, prefix(sf_) begin(tq(1984q1)) end(tq(1985q4)) static ///
     simulate(betas, statistic(stddev, prefix(ssd_)) reps(100))

and some ex post dynamic forecasts:

 forecast create forecastdy, replace
 forecast estimates consmodel
 set seed 1
 forecast solve, prefix(df_) begin(tq(1984q1)) end(tq(1985q4)) ///
     simulate(betas, statistic(stddev, prefix(dsd_)) reps(100))

Let’s now compare realcons with the static and dynamic forecasts of realcons (sf_realcons and df_realcons) over the forecast period.

 list realcons sf_realcons df_realcons if tin(1984q1,1985q4)


 
Notice that the static and dynamic forecasts are identical in the first forecast period, as expected, but after that the values differ.

Before we plot the data, let’s compute the upper and lower bounds of 95% prediction intervals for the static and dynamic forecasts of realcons.

gen sf_y_up = sf_realcons + invnormal(0.975)*ssd_realcons
gen sf_y_dn = sf_realcons + invnormal(0.025)*ssd_realcons
gen df_y_up = df_realcons + invnormal(0.975)*dsd_realcons
gen df_y_dn = df_realcons + invnormal(0.025)*dsd_realcons

Let’s plot the data:

 twoway (line realcons year) (line sf_realcons year) ///
     (line sf_y_up year, lpattern(dash)) ///
     (line sf_y_dn year, lpattern(dash)) if tin(1984q1,1985q4)


twoway (line realcons year) (line df_realcons year) ///
    (line df_y_up year, lpattern(dash)) ///
    (line df_y_dn year, lpattern(dash)) if tin(1984q1,1985q4)


The graphs suggest that the actual realcons values fall within the 95% bounds of the dynamic forecast more consistently than within those of the static forecast. On that basis, the dynamic forecast appears preferable to the static forecast for realcons.
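A more formal comparison is to compute the root mean squared error (RMSE) of each forecast over the eight ex post periods. A sketch (the variable names sf_sqerr and df_sqerr are new here, introduced only for this calculation):

 gen sf_sqerr = (realcons - sf_realcons)^2 if tin(1984q1,1985q4)
 gen df_sqerr = (realcons - df_realcons)^2 if tin(1984q1,1985q4)
 quietly summarize sf_sqerr
 display "Static RMSE:  " sqrt(r(mean))
 quietly summarize df_sqerr
 display "Dynamic RMSE: " sqrt(r(mean))

The forecast with the smaller RMSE is the more accurate one over this period.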