
Tuesday, 24 January 2017

Random Effects (RE) Model with Stata (Panel)



If the individual effect \({{u}_{i}}\) (cross-sectional or time-specific effect) does not exist \(\left( {{u}_{i}}=0 \right)\), OLS produces efficient and consistent parameter estimates;

\({{y}_{it}}={{\beta }_{0}}+{{\beta }_{1}}{{x}_{it}}+{{u}_{i}}+{{v}_{it}}\)  (1)                                                               

and we assume that \({{u}_{i}}=0\).

OLS relies on five core assumptions (Greene, 2008; Kennedy, 2008):
o   Linearity – the model is a linear function of the parameters.
o   Exogeneity – the expected value of the disturbance is zero, or the disturbances are not correlated with any regressor.
o   Homoscedasticity & no autocorrelation.
o   Non-stochastic regressors – the independent variables are fixed in repeated samples.
o   Full rank – there is no exact linear relationship among the independent variables.

There are several strategies for estimating a fixed effects model: the least squares dummy variable (LSDV) model, within estimation, and between estimation.

Random Effects (RE) Model

In the FE model we discussed here, the estimation goal is to eliminate \({{u}_{i}}\) because it is thought to be correlated with one or more of the \({{x}_{it}}\).
 
But suppose we assume \({{u}_{i}}\) is uncorrelated with each explanatory variable in all time periods. Then using a transformation to eliminate \({{u}_{i}}\) results in inefficient estimators.

Eq (1) becomes an RE model when we assume that the unobserved effect \({{u}_{i}}\) is uncorrelated with each explanatory variable;

\(Cov\left( {{x}_{it}},{{u}_{i}} \right)=0\) for all \(t\)        (2)

The ideal RE assumptions include all the FE assumptions plus the additional requirement that \({{u}_{i}}\) is independent of all explanatory variables in all time periods.

If we think the unobserved effect \({{u}_{i}}\) is correlated with any explanatory variable, we should use first differencing or FE instead.

To estimate RE, we define the composite error term as \({{w}_{it}}={{u}_{i}}+{{v}_{it}}\); then Eq (1) can be written as;

\({{y}_{it}}={{\beta }_{0}}+{{\beta }_{1}}{{x}_{it}}+{{w}_{it}}\)         (3)


Because \({{u}_{i}}\) is in the composite error in each time period, the \({{w}_{it}}\) are serially correlated across time.

Under the RE assumptions;

\(Corr\left( {{w}_{it}},{{w}_{is}} \right)=\sigma _{u}^{2}/\left( \sigma _{u}^{2}+\sigma _{v}^{2} \right),t\ne s\)        (4)

where \(\sigma _{u}^{2}=Var\left( {{u}_{i}} \right)\)  and \(\sigma _{v}^{2}=Var\left( {{v}_{it}} \right)\)  
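
This expression follows directly from the definition of the composite error. Under the RE assumptions the \({{v}_{it}}\) are serially uncorrelated and uncorrelated with \({{u}_{i}}\), so for \(t\ne s\);

\(Cov\left( {{w}_{it}},{{w}_{is}} \right)=Cov\left( {{u}_{i}}+{{v}_{it}},{{u}_{i}}+{{v}_{is}} \right)=Var\left( {{u}_{i}} \right)=\sigma _{u}^{2}\)

while \(Var\left( {{w}_{it}} \right)=\sigma _{u}^{2}+\sigma _{v}^{2}\); dividing the covariance by the variance gives Eq (4).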

The RE model is estimated by GLS when the covariance structure is known, and by feasible GLS (FGLS) or estimated GLS (EGLS) when the covariance structure of the composite error is unknown.

Compared to the FE model, an RE model is relatively difficult to estimate. In FGLS, we first have to estimate \(\theta \) using \(\hat{\sigma }_{u}^{2}\) and \(\hat{\sigma }_{v}^{2}\).

The \(\hat{\sigma }_{u}^{2}\) comes from the between-effect estimation (group mean regression), and \(\hat{\sigma }_{v}^{2}\) is derived from the RSS of the within-effect estimation, i.e., from the deviations of the residuals from their group means;


\(\hat{\theta }=1-\sqrt{\frac{\hat{\sigma }_{v}^{2}}{T\hat{\sigma }_{u}^{2}+\hat{\sigma }_{v}^{2}}}=1-\sqrt{\frac{\hat{\sigma }_{v}^{2}}{T\hat{\sigma }_{between}^{2}}}\)           (5)


where \(\hat{\sigma }_{u}^{2}=\hat{\sigma }_{between}^{2}-\frac{\hat{\sigma }_{v}^{2}}{T}\), with \(\hat{\sigma }_{between}^{2}=\frac{RS{{S}_{between}}}{n-k-1}\) and
\(\hat{\sigma }_{v}^{2}=\frac{RS{{S}_{within}}}{nT-n-k}=\frac{e'{{e}_{within}}}{nT-n-k}=\frac{\sum\nolimits_{i=1}^{n}{\sum\nolimits_{t=1}^{T}{{{\left( {{v}_{it}}-{{{\bar{v}}}_{i}} \right)}^{2}}}}}{nT-n-k}\), where \({{v}_{it}}\) are the residuals of the LSDV.

Then, the dependent variable, independent variables, and the intercept term need to be transformed as follows;

\(y_{it}^{*}={{y}_{it}}-\hat{\theta }{{\bar{y}}_{i}}\)        (6)
\(x_{it}^{*}={{x}_{it}}-\hat{\theta }{{\bar{x}}_{i}}\)        (7)
\(\beta _{0}^{*}=1-\hat{\theta }\)          (8)
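
Note that the quasi-demeaning in Eq (6)-(8) nests two familiar special cases:

\(\hat{\theta }=0\Rightarrow y_{it}^{*}={{y}_{it}}\) (pooled OLS), and \(\hat{\theta }=1\Rightarrow y_{it}^{*}={{y}_{it}}-{{\bar{y}}_{i}}\) (the FE "within" transformation),

so the RE estimator lies between pooled OLS and FE, moving toward FE as \(T\hat{\sigma }_{u}^{2}\) grows relative to \(\hat{\sigma }_{v}^{2}\).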

Finally, run OLS on the transformed variables in Eq (6), (7), and (8) with the traditional intercept suppressed;

\(y_{it}^{*}=\beta _{0}^{*}+{{\beta }_{1}}x_{it}^{*}+\varepsilon _{it}^{*}\)       (9)


Estimation using Stata

For our discussion of RE using Stata, let us use the data airline.dta again, as in the FE model discussed here, where we want to estimate the effects of output, fuel price, and load factor on the cost of airline companies;

\(cos{{t}_{it}}={{\beta }_{0}}+{{\beta }_{1}}outpu{{t}_{it}}+{{\beta }_{2}}fue{{l}_{it}}+{{\beta }_{3}}loa{{d}_{it}}+{{v}_{it}}\)      (10)                


where;
\(cos{{t}_{it}}\)               = cost of airline companies
\(outpu{{t}_{it}}\)           = revenue passenger mile (output index)
\(fue{{l}_{it}}\)               = fuel prices
\(loa{{d}_{it}}\)              = loading factor (average capacity utilization of the fleet)


Now, let us regress Eq (10) by pooled OLS;

reg cost output fuel load

 

Now, let us estimate the RE model. Estimating the RE model in Eq (3) requires that we first obtain the value of \(\theta \) manually as in Eq (5). After that, we need to transform the data based on the value of \(\theta \) as in Eq (6), Eq (7), and Eq (8), also manually, and then run OLS as in Eq (9); a sketch of these manual steps is shown below.
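
For illustration only, here is a minimal sketch of the manual FGLS route (it is not part of the original post; it assumes the panel has already been declared with xtset airline year as shown below, and that airline.dta is balanced with 15 yearly observations per airline):

* illustrative sketch of the manual FGLS steps in Eq (5)-(9); names are illustrative
local T = 15                                     // time periods per airline (assumption)
quiet xtreg cost output fuel load, fe            // within estimation
local sig2_v = e(sigma_e)^2                      // sigma_v^2 from the within residuals
preserve
collapse (mean) cost output fuel load, by(airline)
quiet reg cost output fuel load                  // between (group mean) regression
local sig2_b = e(rss)/e(df_r)                    // sigma_between^2 = RSS_between/(n-k-1)
restore
local theta = 1 - sqrt(`sig2_v'/(`T'*`sig2_b'))  // Eq (5)
foreach v in cost output fuel load {
    bysort airline: egen `v'_bar = mean(`v')     // group means
    gen `v'_star = `v' - `theta'*`v'_bar         // Eq (6)-(7)
}
gen cons_star = 1 - `theta'                      // Eq (8)
reg cost_star output_star fuel_star load_star cons_star, noconstant   // Eq (9); the coefficient on cons_star estimates beta_0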

In Stata, we can skip this manual calculation and estimation procedure from Eq (5) through Eq (9): the command xtreg, re estimates Eq (9) automatically and gives the RE output directly.

Before we run the xtreg command, we first need to declare the cross-sectional and time-series variables;

xtset airline year
 

To estimate the RE model as in Eq (9);

xtreg cost output fuel load,re theta

 


The sigma_u and sigma_e are the square roots of the variance components for groups and errors, respectively \(\left( 0.0156={{0.1249}^{2}},\ 0.0036={{0.0601}^{2}} \right)\).

Note that sigma_e (0.0601) is the standard deviation of the idiosyncratic error \({{v}_{it}}\), not the RSS itself.

The rho represents the ratio of individual specific error variance to the composite (entire) error variance, \(0.8119={{0.1249}^{2}}/\left( {{0.1249}^{2}}+{{0.0601}^{2}} \right)\).
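
Using the values reported in the output, these quantities can be checked directly in Stata;

display 0.1249^2                          // sigma_u^2 ≈ 0.0156
display 0.0601^2                          // sigma_v^2 ≈ 0.0036
display 0.1249^2/(0.1249^2 + 0.0601^2)    // rho ≈ 0.812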

A large ratio means the individual-specific error accounts for a large proportion of the composite error variance.
For this RE estimation, the individual-specific error explains about 81% of the composite error variance.

This ratio may be interpreted as a goodness-of-fit measure of the RE model.







Tuesday, 17 January 2017

Fixed Effects (FE) Model with Stata (Panel)




If the individual effect \({{u}_{i}}\) (cross-sectional or time-specific effect) does not exist \(\left( {{u}_{i}}=0 \right)\), OLS produces efficient and consistent parameter estimates;

\({{y}_{it}}={{\beta }_{0}}+{{\beta }_{1}}{{x}_{it}}+{{u}_{i}}+{{v}_{it}}\)          (1)

and we assume that \({{u}_{i}}=0\).

OLS relies on five core assumptions (Greene, 2008; Kennedy, 2008):
o   Linearity – the model is a linear function of the parameters.
o   Exogeneity – the expected value of the disturbance is zero, or the disturbances are not correlated with any regressor.
o   Homoscedasticity & no autocorrelation.
o   Non-stochastic regressors – the independent variables are fixed in repeated samples.
o   Full rank – there is no exact linear relationship among the independent variables.

There are several strategies for estimating a fixed effects model: the least squares dummy variable (LSDV) model, within estimation, and between estimation.

LSDV

The least squares dummy variable (LSDV) model is widely used because it is relatively easy to estimate and interpret substantively. However, the LSDV model becomes problematic when there are many individuals (or groups) in the panel data.

If \(T\) is fixed and \(n\to \infty \) (\(n\) is the number of groups or firms, and \(T\) is the number of time periods), the parameter estimates of the regressors are consistent but the coefficients of the individual effects, \({{\beta }_{0}}+{{u}_{i}}\), are not (Baltagi, 2001). Because the LSDV includes a large number of dummy variables, the number of parameters increases as \(n\) increases.

Therefore, LSDV loses \(n\) degrees of freedom and returns less efficient estimators. Under this circumstance, LSDV is not useful and calls for another strategy. The model with individual-specific intercepts can be written as;

\({{y}_{it}}={{\beta }_{1i}}+{{\beta }_{2}}{{x}_{it}}+{{v}_{it}}\)       (2)

We put the subscript \(i\) on the intercept term to indicate that the intercepts of the individuals may differ; the differences may be due to special features of each individual.

Within Estimation

Unlike LSDV, the "within" estimation does not need dummy variables; instead, it uses deviations from group (or time period) means. That is, "within" estimation uses the variation within each individual or entity instead of a large number of dummies.
To get the FE "within" estimation, for each \(i\) we average Eq (1) over time,

\({{\bar{y}}_{i}}={{\beta }_{0}}+{{\beta }_{1}}{{\bar{x}}_{i}}+{{u}_{i}}+{{\bar{v}}_{i}}\)   (3)

where \({{\bar{y}}_{i}}={{T}^{-1}}\sum\nolimits_{t=1}^{T}{{{y}_{it}}}\) , \({{\bar{x}}_{i}}={{T}^{-1}}\sum\nolimits_{t=1}^{T}{{{x}_{it}}}\) and \({{\bar{v}}_{i}}={{T}^{-1}}\sum\nolimits_{t=1}^{T}{{{v}_{it}}}\)

Because \({{u}_{i}}\) is fixed over time, it still appears in Eq (3).
Subtracting Eq (3) from Eq (1) for each \(t\) gives;

\({{y}_{it}}-{{\bar{y}}_{i}}={{\beta }_{1}}\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)+{{v}_{it}}-{{\bar{v}}_{i}}\)

or

\({{\ddot{y}}_{it}}={{\beta }_{1}}{{\ddot{x}}_{it}}+{{\ddot{v}}_{it}}\)     (4)

where \({{\ddot{y}}_{it}}={{y}_{it}}-{{\bar{y}}_{i}}\) is the time-demeaned data on \(y\), and similarly for \({{\ddot{x}}_{it}}\) and \({{\ddot{v}}_{it}}\).

The parameter estimates of the regressors from the "within" estimation are identical to those of LSDV, and it reports the correct RSS. The FE "within" estimator allows for arbitrary correlation between \({{u}_{i}}\) and the explanatory variables in any time period, just as with first differencing.

Because of this, any explanatory variable that is constant over time for all \(i\) gets swept away by the fixed effects transformation (Kennedy, 2008; Wooldridge, 2009).
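
As a minimal sketch (with hypothetical variable names y and x and panel identifier id, which are not from the original post), the within transformation of Eq (3)-(4) can be done by hand in Stata;

* hand-rolled within transformation (hypothetical variables y, x and panel id "id")
bysort id: egen y_bar = mean(y)     // group means, as in Eq (3)
bysort id: egen x_bar = mean(x)
gen y_dd = y - y_bar                // time-demeaned data, as in Eq (4)
gen x_dd = x - x_bar
reg y_dd x_dd, noconstant           // "within" estimate of beta_1

Note that the standard errors from this manual regression are not adjusted for the \(n\) group means used in the demeaning; the xtreg, fe command used below makes that correction automatically.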

Between Estimation

Eq (3) is also called the "between group" estimation, or the group mean regression, which uses the variation between individual entities (groups).

Specifically, this estimation uses the group means of the dependent and independent variables and thus reduces the number of observations down to \(n\).

Because only the cross-section variation in the data is used, the coefficients of any individual-invariant regressors, such as time dummies, cannot be identified.
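
As a side note (not in the original post), Stata computes this between (group-mean) regression directly with the be option of xtreg; using the airline variables introduced below, after the data have been xtset:

* between (group mean) regression; the effective number of observations is n groups
xtreg cost output fuel load, be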

Estimation using Stata

For our discussion of FE using Stata, let us use the data airline.dta, where we want to estimate the effects of output, fuel price, and load factor on the cost of airline companies;

\(cos{{t}_{it}}={{\beta }_{0}}+{{\beta }_{1}}outpu{{t}_{it}}+{{\beta }_{2}}fue{{l}_{it}}+{{\beta }_{3}}loa{{d}_{it}}+{{v}_{it}}\)      (5)

where;

\(cos{{t}_{it}}\)               = cost of airline companies
\(outpu{{t}_{it}}\)           = revenue passenger mile (output index)
\(fue{{l}_{it}}\)               = fuel prices
\(loa{{d}_{it}}\)               = loading factor (average capacity utilization of the fleet)


Now, let us regress Eq (5) by pooled OLS;

reg cost output fuel load

 


The results show that the pooled OLS model fits the data well, with a high \({{R}^{2}}\), and all the variables are significant even at the 1% level.

To estimate the LSDV model, let us examine the fixed group effects by introducing group (airline) dummy variables.

Let us set the dummy variables as;

g1 = 1 for airline 1; 0 otherwise.
g2 = 1 for airline 2; 0 otherwise.
.
.
g6 = 1 for airline 6; 0 otherwise.

Now we generate the new dummy variables for each group (airline);

gen g1=(airline==1)
gen g2=(airline==2)
gen g3=(airline==3)
gen g4=(airline==4)
gen g5=(airline==5)
gen g6=(airline==6)

list airline year g1-g6 if year<=2,noobs
 


The LSDV model from Eq(5) will become;

\(cos{{t}_{it}}={{\beta }_{0}}+{{\beta }_{1}}outpu{{t}_{it}}+{{\beta }_{2}}fue{{l}_{it}}+{{\beta }_{3}}loa{{d}_{it}}+{{u}_{1}}{{g}_{1}}+{{u}_{2}}{{g}_{2}}+{{u}_{3}}{{g}_{3}}+{{u}_{4}}{{g}_{4}}+{{u}_{5}}{{g}_{5}}+{{v}_{it}}\)      (6)

Five group dummies \(\left( {{g}_{1}}-{{g}_{5}} \right)\) are added to the pooled OLS equation. We exclude \({{g}_{6}}\) from the regression equation in order to avoid perfect multicollinearity (the so-called dummy variable trap).

The \(\left( {{u}_{1}}-{{u}_{5}} \right)\) are the parameter estimates of the group dummy variables \(\left( {{g}_{1}}-{{g}_{5}} \right)\), respectively.

Now, let us regress Eq (6).

reg cost output fuel load g1 g2 g3 g4 g5
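
As a side note (not in the original post), the same LSDV regression can be run without creating the dummies by hand, using Stata's factor-variable notation with airline 6 as the omitted benchmark;

* equivalent LSDV via factor variables; ib6. makes airline 6 the base (omitted) category
reg cost output fuel load ib6.airline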

 


The LSDV model seems to fit better than the pooled OLS: the F-statistic increased from 2419.34 to 3935.79, the RSS decreased from 1.335 to 0.293, and the \({{R}^{2}}\) increased from 0.988 to 0.997.

Because we included the dummy variables, the model loses five degrees of freedom. The parameter estimates from the LSDV model also differ from those of the pooled OLS model, but the signs are still consistent.

The LSDV model posits that each airline has its own intercept but shares the same regression slopes.

The parameter estimate of \({{g}_{6}}\) (the dropped dummy for Airline 6) is represented in the LSDV model by the intercept (9.793), which is the benchmark intercept (reference point).

The values of \(\left( {{u}_{1}}-{{u}_{5}} \right)\) represent the deviations (or differences) of each group-specific intercept from the benchmark intercept (Airline 6). E.g., \({{u}_{1}}=-0.087\) means the intercept of Airline 1 is smaller by 0.087 than that of Airline 6, and the intercept for Airline 1 is \({{\beta }_{0}}+{{u}_{1}}=9.793-0.087=9.706\).

The equations for each airline will become;

Airline 1: \(cos\hat{t}=9.706+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 2: \(cos\hat{t}=9.665+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 3: \(cos\hat{t}=9.497+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 4: \(cos\hat{t}=9.890+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 5: \(cos\hat{t}=9.730+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 6: \(cos\hat{t}=9.793+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
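
After the LSDV regression, these airline-specific intercepts can also be recovered directly with lincom; for example, for Airline 1:

* intercept of Airline 1 = benchmark intercept + u1 (9.793 - 0.087 = 9.706)
lincom _cons + g1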

Let us compare the pooled OLS and LSDV side by side with the Stata command estout. If it is not available, install it by typing ssc install estout.

* pooled OLS
quiet reg cost output fuel load     
estimates store pooled

* LSDV
quiet reg cost output fuel load g1-g5
estimates store LSDV

* create table
estout pooled LSDV,cells(b(star fmt(3)) se(par fmt(3))) stats(F df_r rss rmse r2 r2_a N)

 

Note:
F = F-statistic
df_r = residual degrees of freedom
rss = residual sum of squares
rmse = root mean square error
r2 = R-squared
r2_a = adjusted R-squared
N = number of observations

The parameter estimates of the regressors show some differences between the pooled OLS and LSDV, but all of them are statistically significant at the 1% level.

The pooled OLS reports the overall intercept. The LSDV reports the intercept of the dropped (benchmark) group and the deviations of the other five intercepts from that benchmark.

Another way to estimate the FE model is the "within" estimation. The Stata xtreg command computes the "within group" estimator without creating dummy variables.

Before we run the xtreg command, we first need to declare the cross-sectional and time-series variables;

xtset airline year
 


To estimate the FE model by “within” estimation as in Eq(4);

xtreg cost output fuel load,fe

 

The F-test in the last line examines the null hypothesis that the five dummy parameters in the LSDV are jointly zero \(\left( {{u}_{1}}={{u}_{2}}={{u}_{3}}={{u}_{4}}={{u}_{5}}=0 \right)\).
 
The large F-statistic rejects the null hypothesis in favor of the fixed group effects. The intercept of 9.713 is the average intercept.

The xtreg command does not display an analysis of variance (ANOVA) table, including the SSE. Since many related statistics are stored as returned results, we need to run ereturn list or display to get them.

ereturn list

 


To display the values of the model sum of squares (MSS), also called the explained sum of squares (ESS), and the residual sum of squares (RSS);

display e(mss) e(rss)

To get the value of the Root MSE, whose formula is \(\sqrt{RSS/\left( n-k \right)}\);

display sqrt(e(rss)/e(df_r))
 



To display the value of \({{R}^{2}}\)  and Adjusted-\({{R}^{2}}\);


display e(r2) e(r2_a)
 


Let us compare the pooled OLS, LSDV, and the "within" estimation;

reg cost output fuel load
estimates store OLS

reg cost output fuel load g1-g5  
estimates store LSDV

xtreg cost output fuel load,fe
estimates store xtreg

estout OLS LSDV xtreg,cells(b(star fmt(3)) se(par fmt(3))) stats(F df_r mss rss rmse r2 r2_a F_f F_absorb N)

 


Note:
F = F-statistic
df_r = residual degrees of freedom
rss = residual sum of squares
mss = model (explained) sum of squares
rmse = root mean square error
r2 = R-squared
r2_a = adjusted R-squared
F_f = F-test for fixed effects
F_absorb = F-test for fixed effects
N = number of observations

The results contrast the output of the pooled OLS with the fixed effects estimations (LSDV and xtreg).

Apart from the pooled OLS, the FE estimations produce the same RMSE, parameter estimates, and standard errors, but report slightly different goodness-of-fit measures.

Which estimation is best for us?

LSDV is generally preferred because of its correct estimation, goodness-of-fit, and group/time-specific intercepts. But if the number of entities and/or time periods is large, say over 100 groups, xtreg will provide a less painful and more elegant solution, including the F-test for fixed effects.