How to do clustering for panel data models in R
Clustering standard errors in R
An introduction to clustering in panel data models
In my last post, I briefly introduced standard error clustering in panel data settings. In this post, I continue that topic and show how to do the clustering in R.
Before we move to the coding part, I'd like to clarify a few things. First, the panel model I focus on in this post is the fixed effect panel data model, which is distinct from the random effect model. The two models differ in their assumptions about the unobserved individual effects: the random effect model assumes the individual effects are uncorrelated with the independent variables, whereas the fixed effect model allows arbitrary correlation between the two. You are encouraged to go over the textbook by Wooldridge (2010) if you want to learn more.
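To fix notation, both models can be written in the standard form

\[ y_{it} = \mathbf{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \qquad i = 1, \dots, N, \quad t = 1, \dots, T, \]

where \(c_i\) is the unobserved individual effect. The random effect model assumes \(\mathrm{Cov}(\mathbf{x}_{it}, c_i) = 0\); the fixed effect model leaves that covariance unrestricted.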
In practice, the fixed effect model is used far more often than the random effect model: we typically suspect there exist unobserved individual effects that are correlated with the covariates, and we use the fixed effect model to control for those unobservables and alleviate omitted variable bias.
Back to our topic: clustering in the fixed effect model is straightforward. We can cluster at either \(i\) (the individual level) or \(t\) (the time level). The former is robust to heteroskedasticity and serial correlation within each individual, whereas the latter allows correlation across individuals within a period yet cannot account for serial correlation.
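Concretely, the cluster-robust variance estimator takes the familiar sandwich form (the standard textbook expression, not anything specific to this post):

\[ \widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}) = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1}, \]

where \(g\) indexes the clusters (individuals or periods), \(X_g\) stacks the regressors of cluster \(g\), and \(\hat{u}_g\) collects its residuals. Errors may be arbitrarily correlated within a cluster but are assumed independent across clusters.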
You may ask how to determine which level to cluster at. The rule of thumb is that in panels with a large number of individuals observed over a short period, you should cluster at the individual level. Analogously, if you have a few individuals observed over a long period, you should cluster at the time level.
Alternatively, you can also “cluster” at both the individual and time levels (also known as two-way clustering). See here for more details; a quick sketch appears after the time-level comparison below.
Clustering in the random effect model is slightly different from that in the fixed effect counterpart. With a few more assumptions (again, refer to Wooldridge (2010) for details), we can use GLS to estimate the model. We gain efficiency because there are fewer unknowns to estimate in the variance-covariance matrix; the price is that the standard errors are less robust, since we imposed additional assumptions in the first place. Thus, in some circumstances, regular clustering is also recommended in random effect model settings.
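For illustration, here is a minimal sketch of that workflow: a random effect model estimated by GLS, with regular clustered standard errors as a robustness check. It uses the Grunfeld data shipped with the plm package rather than this post's data, and the object names are my own.

# A minimal sketch: random effect (GLS) model with clustered standard errors.
# Uses the Grunfeld data shipped with plm, not this post's data.
library(plm)
library(lmtest)
data('Grunfeld', package = 'plm')
reg_re <- plm(inv ~ value + capital, data = Grunfeld,
              index = c('firm', 'year'), model = 'random')
summary(reg_re)  # default GLS standard errors
# Regular clustering (by firm) for comparison
coeftest(reg_re, vcov = function(x)
  plm::vcovHC(x, cluster = 'group', type = 'HC3'))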
Clustering in R
Before we perform clustering, we need to run the panel data model first. You can either use the `lm` function or the `plm` function from the `plm` package. I personally prefer the latter over the former, so in this post I am going to stick with the `plm` package.
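For reference, here is a minimal sketch of the `lm` alternative, which absorbs the county fixed effects with dummy variables. It assumes the `df` data frame imported in the next section; the slope estimates match the within estimator, but the fit is slower on large panels because every county gets its own coefficient.

# A sketch of the lm alternative: county fixed effects as dummy variables.
# Assumes the df data frame imported in the next section.
reg_lm <- lm(log(wheat) ~ tavg + I(tavg^2) + pc + I(pc^2) + factor(new_code),
             data = df)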
Importing the data
I am going to run a panel data model to examine the effects of weather conditions on crop yields. The data can be downloaded from here.
library(tidyverse)
library(plm)
library(lmtest)
library(clubSandwich)
library(sandwich)
df <- read_csv('dataforclustering.csv',
col_types = cols(new_code = col_character()))
head(df)
## # A tibble: 6 × 5
## new_code Year wheat tavg pc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 130402 1987 1.64 8.15 103.
## 2 130402 1988 1.62 8.29 177.
## 3 130402 1989 5.37 9.17 95.4
## 4 130402 1999 4.75 10.3 41.4
## 5 130402 2000 4.75 9.29 88.9
## 6 130402 2001 4.75 8.83 151.
As you can see,

- `new_code` represents the counties (individuals).
- `Year` indicates the time index.
- `wheat` refers to winter wheat yield in tons/hectare.
- `tavg` stands for the average temperature over the growing season (October to May).
- `pc` is the accumulation of precipitation over the growing season.
Running the fixed effect model
# Note I only included the county fixed effect in the model.
reg <- df %>%
pdata.frame(index = c('new_code','Year')) %>% # specify the individual and year index
plm(formula = log(wheat) ~ tavg + I(tavg^2) + pc + I(pc^2),
data = ., effect = 'individual', model = 'within')
summary(reg)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = log(wheat) ~ tavg + I(tavg^2) + pc + I(pc^2), data = .,
## effect = "individual", model = "within")
##
## Unbalanced Panel: n = 352, T = 10-34, N = 8867
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -7.158354 -0.132313 0.081255 0.255566 1.144955
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## tavg 4.4441e-01 5.0411e-02 8.8156 < 2.2e-16 ***
## I(tavg^2) -8.5789e-03 2.6044e-03 -3.2940 0.0009917 ***
## pc 1.8988e-03 2.0575e-04 9.2285 < 2.2e-16 ***
## I(pc^2) -2.6568e-06 3.3927e-07 -7.8309 5.425e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2107.6
## Residual Sum of Squares: 1726.4
## R-Squared: 0.18087
## Adj. R-Squared: 0.1467
## F-statistic: 469.818 on 4 and 8511 DF, p-value: < 2.22e-16
As can be seen, with conventional (non-clustered) standard errors, we get statistically significant coefficients on temperature and precipitation. We now move forward to clustering.
Clustering the standard errors
There are three functions available to do the clustering.

- You can use the `vcovHC` function in the `plm` package to construct the variance-covariance matrix.
- Alternatively, you can use the `vcovHC` function from the `sandwich` package to do the task.
- Or you can use the `vcovCR` function from the `clubSandwich` package.
# The first approach: vcovHC from the plm package
# 'arellano' is the recommended method as it does not
# impose restrictions on the structure of the variance-covariance matrix.
# cluster = 'group' indicates we cluster at county level
# type = 'HC3' is for small-sample adjustment.
method1 <- coeftest(reg, vcov = function(x)
plm::vcovHC(x, method = 'arellano', cluster = 'group', type = 'HC3'))
# The second approach: vcovHC from the sandwich package
# You may have noticed that there does not exist a parameter for us
# to specify the clustering level.
method2 <- coeftest(reg, vcov = function(x)
sandwich::vcovHC(x, type = 'HC3'))
# The third approach: vcovCR from the clubSandwich package
# type = 'CR2' is recommended for small-sample adjustment.
# cluster = df$new_code indicates we cluster at the individual level.
method3 <- coeftest(reg, vcov = function(x)
clubSandwich::vcovCR(x, type = 'CR2', cluster = df$new_code))
# Compare the estimated standard errors
comse <- bind_cols(method1[,2], method2[,2], method3[,2]) %>%
rename('plm' = '...1',
'sandwich' = '...2',
'clubsandwich' = '...3') %>% print(.)
## # A tibble: 4 × 3
## plm sandwich clubsandwich
## <dbl> <dbl> <dbl>
## 1 0.0721 0.0721 0.0724
## 2 0.00360 0.00360 0.00361
## 3 0.000233 0.000233 0.000234
## 4 0.000000375 0.000000375 0.000000377
As we can see, `plm` and `sandwich` gave us identical clustered standard errors, whereas `clubSandwich` returned slightly larger ones. The first two likely coincide because `vcovHC` is an S3 generic: calling `sandwich::vcovHC` on a `plm` object dispatches to `plm`'s own method, which clusters by group by default. Note that in the analysis above, we clustered at the county (individual) level.
Things are different if we cluster at the year (time) level. Simply change `cluster = 'group'` to `cluster = 'time'` in method one, and adjust `cluster = df$new_code` to `cluster = df$Year` in method three; see below. The `vcovHC` function from the `sandwich` package does not allow us to choose which level to cluster the standard errors at.
method1_time <- coeftest(reg, vcov = function(x)
plm::vcovHC(x, method = 'arellano', cluster = 'time', type = 'HC3'))
method3_time <- coeftest(reg, vcov = function(x)
clubSandwich::vcovCR(x, type = 'CR2', cluster = df$Year))
comse_time <- bind_cols(method1_time[,2], method2[,2], method3_time[,2]) %>%
rename('plm' = '...1',
'sandwich' = '...2',
'clubsandwich' = '...3') %>% print(.)
## # A tibble: 4 × 3
## plm sandwich clubsandwich
## <dbl> <dbl> <dbl>
## 1 0.189 0.0721 0.208
## 2 0.00865 0.00360 0.00996
## 3 0.00114 0.000233 0.00124
## 4 0.00000138 0.000000375 0.00000150
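As for the two-way clustering mentioned at the beginning, the `plm` package also ships a double-clustering estimator, `vcovDC`, which clusters on the individual and time dimensions simultaneously. Here is a minimal sketch applied to the same model (the object name is my own):

# Two-way (double) clustering via plm's vcovDC: clusters on both the
# individual and the time dimension simultaneously.
method_twoway <- coeftest(reg, vcov = function(x)
  plm::vcovDC(x, type = 'HC3'))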
Takeaways
In applications where you cluster standard errors at the individual level, all three methods should work just fine. However, if you want to cluster at the time level (or some other level), you should turn to the `vcovHC` function in the `plm` package or the `vcovCR` function from the `clubSandwich` package.
Well, actually I would say it's better to just stick with the `plm` and/or `clubSandwich` approaches regardless of the level at which your standard errors are clustered. The `sandwich` package is largely designed for clustering in `lm` or `glm` regressions.
Reference
Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data, 2nd edition. Cambridge, MA: The MIT Press.