
Tuesday 17 January 2017

Fixed Effects (FE) Model with Stata (Panel)




If the individual effect \({{u}_{i}}\) (a cross-sectional or time-specific effect) does not exist \(\left( {{u}_{i}}=0 \right)\), OLS produces efficient and consistent parameter estimates. The model is

\({{y}_{it}}={{\beta }_{0}}+{{\beta }_{1}}{{x}_{it}}+{{u}_{i}}+{{v}_{it}}\)          (1)

and pooled OLS assumes that \({{u}_{i}}=0\).

OLS rests on five core assumptions (Greene, 2008; Kennedy, 2008):
o   Linearity – the model is a linear function of the parameters.
o   Exogeneity – the expected value of the disturbance is zero, or the disturbances are not correlated with any regressor.
o   Homoscedasticity and no autocorrelation – the disturbances have constant variance and are not correlated with one another.
o   Non-stochastic regressors – the independent variables are not stochastic but fixed in repeated samples.
o   Full rank – there is no exact linear relationship among the independent variables.

There are several strategies for estimating a fixed effect model: the least squares dummy variable (LSDV) model, within estimation, and between estimation.

LSDV

The least squares dummy variable (LSDV) model is widely used because it is relatively easy to estimate and interpret substantively. However, LSDV becomes problematic when there are many individuals (or groups) in the panel data.

If \(T\) is fixed and \(n\to \infty \) (\(n\) is the number of groups or firms, and \(T\) is the number of time periods), the parameter estimates of the regressors are consistent but the estimates of the individual effects, \({{\beta }_{0}}+{{u}_{i}}\), are not (Baltagi, 2001). If the LSDV model includes a large number of dummy variables, the number of parameters increases as \(n\) increases.

Therefore, LSDV loses \(n\) degrees of freedom and returns less efficient estimators. Under this circumstance, LSDV is useless and calls for another strategy, such as the “within” estimation.

The LSDV model can be written as

\({{y}_{it}}={{\beta }_{0i}}+{{\beta }_{1}}{{x}_{it}}+{{v}_{it}}\)       (2)

where \({{\beta }_{0i}}={{\beta }_{0}}+{{u}_{i}}\). We put the subscript \(i\) on the intercept term to indicate that the intercepts of the individuals may differ, and that the differences may be due to special features of each individual.

Within Estimation

Unlike LSDV, the “within” estimation does not need dummy variables; it uses deviations from group (or time period) means. That is, the “within” estimation uses variation within each individual or entity instead of a large number of dummies.
To get the FE estimator by the “within” estimation, we average Eq (1) over time for each \(i\):

\({{\bar{y}}_{i}}={{\beta }_{0}}+{{\beta }_{1}}{{\bar{x}}_{i}}+{{u}_{i}}+{{\bar{v}}_{i}}\)   (3)

where \({{\bar{y}}_{i}}={{T}^{-1}}\sum\nolimits_{t=1}^{T}{{{y}_{it}}}\) , \({{\bar{x}}_{i}}={{T}^{-1}}\sum\nolimits_{t=1}^{T}{{{x}_{it}}}\) and \({{\bar{v}}_{i}}={{T}^{-1}}\sum\nolimits_{t=1}^{T}{{{v}_{it}}}\)

Because \({{u}_{i}}\) is fixed over time, it still appears in Eq (3).
Subtracting Eq (3) from Eq (1) for each \(t\) removes it:

\({{y}_{it}}-{{\bar{y}}_{i}}={{\beta }_{1}}\left( {{x}_{it}}-{{{\bar{x}}}_{i}} \right)+{{v}_{it}}-{{\bar{v}}_{i}}\)

or

\({{\ddot{y}}_{it}}={{\beta }_{1}}{{\ddot{x}}_{it}}+{{\ddot{v}}_{it}}\)     (4)

where \({{\ddot{y}}_{it}}={{y}_{it}}-{{\bar{y}}_{i}}\) is the time-demeaned data on \(y\), and similarly for \({{\ddot{x}}_{it}}\) and \({{\ddot{v}}_{it}}\).

The parameter estimates of the regressors in the “within” estimation are identical to those of LSDV, and the “within” estimation reports the correct RSS. The FE “within” estimator allows for arbitrary correlation between \({{u}_{i}}\) and the explanatory variables in any time period, just as with first differencing.

Because of this, any explanatory variable that is constant over time for all \(i\) gets swept away by the fixed effects transformation (Kennedy, 2008; Wooldridge, 2009).
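
The equivalence between the “within” transformation in Eq (4) and LSDV can be checked on a small simulated panel. This is only a sketch: the data are generated (not the airline data), and the names id, t, x, y and u are hypothetical.

```stata
* Simulated panel: 6 groups x 10 periods (all names and values are hypothetical)
clear
set seed 12345
set obs 60
gen id = ceil(_n/10)
bysort id: gen t = _n
gen double u = id/2                      // group effect, fixed over time
gen double x = rnormal() + u             // regressor correlated with u
gen double y = 1 + 2*x + u + rnormal()

* "Within" transformation by hand: subtract the group means, as in Eq (4)
bysort id: egen double ybar = mean(y)
bysort id: egen double xbar = mean(x)
gen double ydd = y - ybar
gen double xdd = x - xbar
regress ydd xdd, noconstant
scalar b_within = _b[xdd]

* LSDV with one dummy per group (factor-variable notation)
regress y x i.id
scalar b_lsdv = _b[x]

* Built-in fixed effects ("within") estimator
xtset id t
xtreg y x, fe
scalar b_fe = _b[x]

display "within: " b_within "   LSDV: " b_lsdv "   xtreg, fe: " b_fe
```

The three slopes should coincide. The standard errors of the by-hand regression are not comparable, because regress does not subtract the degrees of freedom used up by the group means.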

Between Estimation

Eq (3) is also called the “between group” estimation, or the group mean regression, which uses variation between individual entities (groups).

Specifically, this estimation calculates the group means of the dependent and independent variables and thus reduces the number of observations down to \(n\).

Because only cross-section variation in the data is used, the coefficients of any individual-invariant regressors, such as time dummies, cannot be identified.
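
As a rough illustration of the group mean regression, the sketch below compares xtreg with the be option against OLS on the collapsed group means. The panel is simulated and balanced, and the names id, t, x, y and u are hypothetical.

```stata
* Simulated balanced panel: 6 groups x 10 periods (hypothetical names and values)
clear
set seed 12345
set obs 60
gen id = ceil(_n/10)
bysort id: gen t = _n
gen double u = id/2
gen double x = rnormal() + u
gen double y = 1 + 2*x + u + rnormal()

* Between estimator as provided by xtreg
xtset id t
xtreg y x, be
scalar b_be = _b[x]

* The same regression by hand: keep one row of group means per id, then OLS
collapse (mean) y x, by(id)
regress y x
scalar b_means = _b[x]

display "xtreg, be: " b_be "   OLS on group means: " b_means
```

After collapse the data set has only one observation per group, which is the reduction to \(n\) observations described above.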

Estimation using Stata

For our discussion of the FE model using Stata, let us use the data airline.dta. We want to estimate the effects of output, fuel price and load factor on the cost of airline companies:

\(cos{{t}_{it}}={{\beta }_{0}}+{{\beta }_{1}}outpu{{t}_{it}}+{{\beta }_{2}}fue{{l}_{it}}+{{\beta }_{3}}loa{{d}_{it}}+{{v}_{it}}\)      (5)

where;

\(cos{{t}_{it}}\)               = cost of airline companies
\(outpu{{t}_{it}}\)           = revenue passenger miles (output index)
\(fue{{l}_{it}}\)               = fuel price
\(loa{{d}_{it}}\)               = load factor (average capacity utilization of the fleet)


Now, let us estimate Eq (5) by pooled OLS:

reg cost output fuel load

 


The results show that the pooled OLS model fits the data well, with a high \({{R}^{2}}\), and all the variables are significant even at the 1% level.

To estimate the LSDV model, let us examine fixed group effects by introducing group (airline) dummy variables.

Let us define the dummy variables as:

g1 = 1 for airline 1; 0 otherwise.
g2 = 1 for airline 2; 0 otherwise.
.
.
g6 = 1 for airline 6; 0 otherwise.

Now we generate the new dummy variables, one for each group (airline):

gen g1=(airline==1)
gen g2=(airline==2)
gen g3=(airline==3)
gen g4=(airline==4)
gen g5=(airline==5)
gen g6=(airline==6)

list airline year g1-g6 if year<=2,noobs
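
The six gen commands can also be replaced by a single tabulate command or by factor-variable notation. The sketch below shows both on simulated data, since the airline data are not reproduced here; the names id, x and y are hypothetical stand-ins for airline, a regressor and cost.

```stata
* Simulated panel: 6 groups x 10 periods (hypothetical names and values)
clear
set seed 12345
set obs 60
gen id = ceil(_n/10)
gen double x = rnormal() + id/2
gen double y = 1 + 2*x + id/2 + rnormal()

* Option 1: let tabulate create one dummy per group (g1 ... g6)
tabulate id, generate(g)
regress y x g1-g5                 // g6 is left out as the benchmark
scalar b_manual  = _b[x]
scalar u1_manual = _b[g1]

* Option 2: factor-variable notation, with group 6 as the base level
regress y x ib6.id
scalar b_factor  = _b[x]
scalar u1_factor = _b[1.id]

display "slope: " b_manual " vs " b_factor
display "u1:    " u1_manual " vs " u1_factor
```

Both regressions fit the same model, so the slope and the dummy coefficients should agree; the factor-variable form avoids adding dummy variables to the data set.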
 


With the dummy variables added, Eq (5) becomes the LSDV model:

\(cos{{t}_{it}}={{\beta }_{0}}+{{\beta }_{1}}outpu{{t}_{it}}+{{\beta }_{2}}fue{{l}_{it}}+{{\beta }_{3}}loa{{d}_{it}}+{{u}_{1}}{{g}_{1}}+{{u}_{2}}{{g}_{2}}+{{u}_{3}}{{g}_{3}}+{{u}_{4}}{{g}_{4}}+{{u}_{5}}{{g}_{5}}+{{v}_{it}}\)      (6)

Five group dummies \({{g}_{1}},\ldots ,{{g}_{5}}\) are added to the pooled OLS equation. We exclude \({{g}_{6}}\) from the regression equation to avoid perfect multicollinearity, the so-called dummy variable trap.

The coefficients \({{u}_{1}},\ldots ,{{u}_{5}}\) are the parameters of the group dummy variables \({{g}_{1}},\ldots ,{{g}_{5}}\), respectively.

Now, let us estimate Eq (6):

reg cost output fuel load g1 g2 g3 g4 g5

 


The LSDV model seems to fit better than the pooled OLS. The F-statistic increased from 2419.34 to 3935.79, the RSS decreased from 1.335 to 0.293, and the \({{R}^{2}}\) increased from 0.988 to 0.997.

Because we included the dummy variables, the model loses five degrees of freedom. The parameter estimates from the LSDV model also differ from those of the pooled OLS model, but their signs remain the same.

The LSDV model posits that each airline has its own intercept but shares the same regression slopes.

The intercept of the LSDV model (9.793) is the intercept of Airline 6, whose dummy \({{g}_{6}}\) was dropped; it is the benchmark intercept (reference point).

Each of \({{u}_{1}},\ldots ,{{u}_{5}}\) represents the deviation of a group-specific intercept from the benchmark intercept (Airline 6). E.g., \({{u}_{1}}=-0.087\) means that the intercept of Airline 1 is smaller than that of Airline 6 by 0.087, so the intercept for Airline 1 is \({{\beta }_{0}}+{{u}_{1}}=9.793-0.087=9.706\).

The equations for each airline are:

Airline 1: \(cos\hat{t}=9.706+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 2: \(cos\hat{t}=9.665+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 3: \(cos\hat{t}=9.497+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 4: \(cos\hat{t}=9.890+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 5: \(cos\hat{t}=9.730+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
Airline 6: \(cos\hat{t}=9.793+0.919*outpu{{t}_{it}}+0.417*fue{{l}_{it}}-1.070*loa{{d}_{it}}\)
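
Rather than adding the benchmark intercept and a dummy coefficient by hand, lincom can compute a group intercept together with its standard error. The following is a sketch on simulated data; the names id, x, y and g1 to g5 are hypothetical stand-ins for the airline variables.

```stata
* Simulated panel: 6 groups x 10 periods (hypothetical names and values)
clear
set seed 12345
set obs 60
gen id = ceil(_n/10)
gen double x = rnormal() + id/2
gen double y = 1 + 2*x + id/2 + rnormal()

* Dummies for groups 1-5; group 6 is the benchmark
forvalues j = 1/5 {
    gen g`j' = (id == `j')
}
regress y x g1-g5

* Intercept of group 1 = benchmark intercept + deviation of group 1
lincom _cons + g1
scalar a1 = r(estimate)
display "intercept of group 1: " a1
```

The same command with g2, ..., g5 gives the remaining group intercepts; the benchmark group's intercept is _cons itself.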

Let us compare the pooled OLS and LSDV side by side with the Stata command estout. If it is not available, install it by typing ssc install estout.

* pooled OLS
quiet reg cost output fuel load     
estimates store pooled

* LSDV
quiet reg cost output fuel load g1-g5
estimates store LSDV

* create table
estout pooled LSDV,cells(b(star fmt(3)) se(par fmt(3))) stats(F df_r rss rmse r2 r2_a N)

 

Note:
F      = F-statistic
df_r   = residual degrees of freedom
rss    = residual sum of squares
rmse   = root mean square error
r2     = R-squared
r2_a   = adjusted R-squared
N      = number of observations

The parameter estimates of the regressors show some differences between the pooled OLS and LSDV, but all of them are statistically significant at the 1% level.

The pooled OLS reports the overall intercept. The LSDV reports the intercept of the dropped (benchmark) group and the deviations of the other five intercepts from the benchmark.

Another way to estimate the FE model is the “within” estimation. The Stata xtreg command computes the “within group” estimator without creating dummy variables.

Before we run the xtreg command, we first need to specify the cross-sectional and time series variables:

xtset airline year
 


To estimate the FE model by the “within” estimation, as in Eq (4):

xtreg cost output fuel load,fe

 

The F-test in the last line of the output examines the null hypothesis that the five dummy parameters in LSDV are zero \(\left( {{u}_{1}}={{u}_{2}}={{u}_{3}}={{u}_{4}}={{u}_{5}}=0 \right)\).

The large F-statistic rejects the null hypothesis in favor of the fixed group effect. The intercept of 9.713 is the average of the six group intercepts.
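
The same F-test can be obtained from the LSDV regression by testing the dummies jointly. The sketch below compares it with the statistic that xtreg stores in e(F_f), using simulated data; the names id, t, x, y and g1 to g5 are hypothetical.

```stata
* Simulated panel: 6 groups x 10 periods (hypothetical names and values)
clear
set seed 12345
set obs 60
gen id = ceil(_n/10)
bysort id: gen t = _n
gen double x = rnormal() + id/2
gen double y = 1 + 2*x + id/2 + rnormal()

forvalues j = 1/5 {
    gen g`j' = (id == `j')
}

* LSDV: joint test that all five dummy coefficients are zero
regress y x g1-g5
testparm g1-g5
scalar F_lsdv = r(F)

* "Within" estimation: F-test for the fixed effects reported by xtreg
xtset id t
xtreg y x, fe
scalar F_fe = e(F_f)

display "testparm after LSDV: " F_lsdv "   e(F_f) from xtreg: " F_fe
```

The two statistics should be equal, since both test the restriction that all group intercepts are the same.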

The xtreg command does not display an analysis of variance (ANOVA) table including the RSS. Because many related statistics are stored in e() results, we can run ereturn list or display to get them.

ereturn list

 


To display the values of the model sum of squares (MSS), also called the explained sum of squares (ESS), and the residual sum of squares (RSS):

display e(mss) " " e(rss)

To get the value of the Root MSE, whose formula is \(\sqrt{RSS/\left( n-k \right)}\), where \(n-k\) stands for the residual degrees of freedom stored in e(df_r):

display sqrt(e(rss)/e(df_r))
 



To display the values of \({{R}^{2}}\) and adjusted \({{R}^{2}}\):

display e(r2) " " e(r2_a)
 


Let us compare the pooled OLS, LSDV and the “within” estimation:

reg cost output fuel load
estimates store OLS

reg cost output fuel load g1-g5  
estimates store LSDV

xtreg cost output fuel load,fe
estimates store xtreg

estout OLS LSDV xtreg,cells(b(star fmt(3)) se(par fmt(3))) stats(F df_r mss rss rmse r2 r2_a F_f F_absorb N)

 


Note:
F        = F-statistic
df_r     = residual degrees of freedom
rss      = residual sum of squares
mss      = model (explained) sum of squares
rmse     = root mean square error
r2       = R-squared
r2_a     = adjusted R-squared
F_f      = F-test for fixed effects (stored by xtreg with the fe option)
F_absorb = F-test for absorbed fixed effects (stored by areg, so it is blank for the models above)
N        = number of observations

The results contrast the output of the pooled OLS with that of the fixed effect estimations (LSDV and xtreg).

Apart from the pooled OLS, the FE estimations produce the same parameter estimates, standard errors and RMSE, but report slightly different goodness-of-fit measures.

Which estimation is best for us?

LSDV is generally preferred because it gives correct estimates and goodness-of-fit measures, and reports group- or time-specific intercepts. However, if the number of entities and/or time periods is large, say over 100 groups, xtreg provides a less painful and more elegant solution, including the F-test for fixed effects.







