lfe: Linear Group Fixed Effects

Linear models with fixed effects and many dummy variables are common in some fields. Such models are straightforward to estimate unless the factors have too many levels. The R package lfe solves this problem by implementing a generalization of the within transformation to multiple factors, tailored for large problems.


Introduction
A typical linear model looks like:

> y ~x1+x2+x3 + f1+f2+f3
where f1,f2,f3 are arbitrary factors, and x1,x2,x3 are other covariates.Coefficients may easily be estimated by lm(): > lm(y ~x1+x2+x3 + f1+f2+f3) However, in some applications, in particular in econometrics, the factors have too many levels and are still needed as fixed effects, as in Abowd et al. (1999).The use of fixed effects as opposed to random effects is typically necessitated by the possibility that the factors and the other regressors are correlated.They study log wage (logwage) as an outcome, where fixed effects for individuals (id) and firms (firm) are included, with some ordinary covariates (exemplified by x), i.e. a model of the type:

> logwage ~x + id + firm
where id is a factor with one level for each employee, and firm is a factor with one level for each firm.Employees change firm now and then, so it is not a nested model.The model is used to account for arbitrarily distributed time constant individual and firm hetereogeneity, and is used to study the correlation between the firm effect and the employee effect.There are also similar cases with 3 factors, e.g. in Torres et al. (2013), where job title is also considered.They have a dataset of 27 million observations, with 5.5 million, 568,000, and 96,000 levels in the three factors.In other applications the factors are primarily used as controls, as in Markussen and Røed (2012), where a variety of models are used, controlling for various combinations of interactions between factors.These datasets are typically sourced from large public registries, and can contain tens of millions of observations, with hundreds of thousands or millions of factor levels.This far exceeds the capabilities of lm(), and can also be too demanding for the sparse methods in package Matrix (Bates and Maechler, 2013).
When estimating such models, the case with a single factor is special.It can be estimated by using the within groups transformation, where the mean of the groups are subtracted from the covariates, resulting in a system without the factor, via the Frisch-Waugh-Lovell theorem.See e.g.Wooldridge (2002, Section 10.5).The coefficients for the factor levels can easily be recovered as the group means of the residuals.This estimation method can be found in package plm (Croissant and Millo, 2008).The method can also be used with more than one factor, by using the within transformation to project out the factor with the highest number of levels, coding the others as dummy variables.However, if all of the factors have many levels this can still result in a too large system which is non-sparse.Moreover, having such sets of dummies in datasets which are not balanced may lead to non-trivial identification problems.We will illustrate how to use the package lfe (Gaure, 2013b) to solve some of these problems.For clarity we construct quite small datasets by drawing random covariates, rather than to refer to large real datasets which are not publicly available.

Enter lfe
The package lfe is designed to handle the above estimation problem.It also contains some methods for solving identification problems, though not completely in all cases.Since lfe handles quite ordinary linear models which conceptually could be handled by lm(), we do not go into detail about the feasibility of fixed effect linear models as such, only the problems which are more or less peculiar to models with a large number of dummies.
We write our model in matrix form where D is a matrix of dummies coding the factor levels, X is a matrix with the other covariates, y is the response vector, and is a normally distributed error term.Performing an ordinary least squares regression (OLS) on this system yields the OLS estimates β for the X covariates and α for the factor levels.The Frisch-Waugh-Lovell theorem states that if P is the projection onto the orthogonal complement of the range of D, then the projected system Py = PXβ + P , yields the same β when estimated with OLS.Moreover, the matrix (X t PX) −1 which is used to find the covariance matrix of β, is identical to the β-part of the corresponding matrix in the full system (1).
Similarly, the residuals are identical.The projected system does not contain the dummies for the factor levels, and may therefore be manageable with conventional methods.
Unfortunately, P is an n × n matrix, where n is the number of observations, so it is impractical to compute P when n ≈ 10 7 .However, it is easier to compute Px for a vector x, i.e. y and the columns of X.In the special case when D encodes a single factor, the transformation x → Px is the within transformation, i.e. centring of the groups on their means.It is shown in Gaure (2013a) that when there is more than one factor, say e > 1 factors, we may consider the centring transformation for each of them, a projection P i for each i = 1 . . .e, and compute Px as This approach is known as the method of alternating projections and is based on a result by Halperin (1962, Theorem 1).The procedure can easily be implemented with an R function taking as input a vector x and the list of factors flist: This algorithm was also arrived at by Guimarães and Portugal (2010, p. 637) as a technical simplification of their iterated estimation approach to the same problem.
For efficiency reasons, this linear transformation has been written in C, made threaded to centre vectors in parallel, and is available as the function demeanlist() in lfe, though the function felm() wraps it in an lm() like function.We create a simple example to illustrate the usage.We have 100,000 observations of a covariate x, and two factors f1, f2, each with 10,000 randomly drawn levels.We create an outcome variable y and estimate the x coefficient.The G() syntax is used to specify which factors should be projected out of the system, a similar syntax as for the Error() term in aov().The G() is not an R function in itself, though it translates to as.factor() inside felm() after the G() terms have been removed from the model for special handling.
> library(lfe) > set.seed(42)> x <-rnorm(100000) > f1 <-sample(10000, length(x), replace=TRUE) > f2 <-sample(10000, length(x), replace=TRUE) > y <-2.13*x + cos(f1) + log(f2+1) + rnorm(length(x), sd=0.5)> est <-felm(y ~x + G(f1) + G(f2)) > summary(est) Call: felm(formula = y ~x + G(f1) + G(f2)) Residuals: The R Journal Vol.The result of felm() is a 'felm' object which is quite similar to an 'lm' object.However, it is not fully compatible with it, so some methods for 'lm' objects may work on a 'felm' object, others may not.In an earlier version, the 'felm' object inherited from the class 'lm', but there are important differences in the structure, so this inheritance has been removed.In particular there is no qr-decomposition in the 'felm' object, so methods for 'lm' objects which depend on the qr-decomposition, such as anova(), can not be used.Also, the 'felm'-object does not contain a copy of the dataset.This has been removed to conserve memory for very large datasets.Simple extractors like coef() and residuals() work.Extracting the covariance matrix with vcov() also works, via a separate S3 method, albeit only for the βs.Also, a summary() S3-method, and an accompanying print() method have been included.It is possible to try 'lm'-methods explicitly as in: getS3method( vcov , lm )(est), though the author does not guarantee any success with this approach.
The careful reader has noticed that the behaviour of summary() on a 'felm' object with respect to degrees of freedom and R 2 is the same as that of on an 'lm' object when including an intercept.There is no explicit intercept in the result of felm(), but the factor structure includes one implicitly.

The coefficients for the factor levels
If the factors are used as controls only, the above procedure is all we have to worry about, but if we also need the group coefficients α, what econometricians often refer to as the fixed effects, we can solve the equation for α.The right hand side is easily computed when we have β, PX and Py.The equation is solved by the Kaczmarz method (Kaczmarz, 1937), as described in Gaure (2013a).The method is available as a function kaczmarz(), but the wrapper getfe() is more useful.A solution of equation ( 2) is not unique, the matrix D dummy-encodes all the factor levels, hence there will be multicollinearities (unless there is only one factor), both the obvious ones, but there can also be spurious ones.The multicollinearities are resolved by applying an estimable function.lfe contains a function efactory() for creating estimable functions, but users can also supply their own.getfe() returns a 'data.frame'with some information in addition to the coefficients.The reason for this is that the intended use of lfe is for factors with so many levels that they probably anyway must be analyzed according to the researcher's needs prior to being presented.We continue the example: effect obs comp fe idx f1.9998 -0.2431720 9 1 f1 9998 f1.9999 -0.9733257 5 1 f1 9999 f1.10000 -0.8456289 9 1 f1 10000 f2.1 0.4800013 9 1 f2 1 f2.2 1.4868744 14 1 f2 2 f2.3 1.5002583 11 1 f2 3 The obs column is the number of observations of this level.The fe and idx columns are factors containing the same information as the row name, but split into the name of the factor and the level for convenience.The comp column is important in that it is used for identification purposes, and here is a main deviation from how lm() treats identification in the presence of factors.The The R Journal Vol.5/2, December 2013 ISSN 2073-4859 default method when using lm() is taken from the global contrasts option which defaults to using treatment contrasts for each factor.It introduces a reference level for each factor, this is equivalent to forcing the coefficient for the reference level to zero, and the other coefficients are only meaningful when compared to this zero coefficient.It is also possible to specify different constraints for the lm() coefficients, such as forcing the sum of the coefficients to zero, or an arbitrary one via the contrasts option, or the contrasts argument to lm(), but typically a single constraint is used for each factor.
Identification when the factors have many levels can be a more complicated matter.The standard example in the econometrics literature is the one found in Abowd et al. (1999), elaborated in Abowd et al. (2002).In this case there are two factors, one for employees and one for firms.It may happen that one set of employees move between one set of firms, whereas another disjoint set of employees move between some other firms.There are no movements between these mobility groups, hence coefficients from different groups can not be compared.A useful construction for analysis of this problem is the undirected, bipartite graph which has the factor levels as vertices, and an edge between levels that occur in the same observation.The mobility groups are then the connected components of this graph.It is shown in Abowd et al. (2002, Appendix 1) that for identification of the coefficients, it is sufficient to introduce a single reference level in each of the disjoint mobility groups, either a firm or an employee.This was previously established by Eccleston and Hedayat (1974) in a different context, and there is another argument in Gaure (2013a) using spectral graph theory.The comp column in the return value from getfe() when using estimable functions from efactory() is a factor enumerating these groups, i.e. the connected components.efactory() chooses by default the level with the highest number of observations as a reference level, and sets the coefficient to 0. When interpreting the coefficients, they should never be compared between the components; a coefficient is only meaningful relative to the reference level in the component it belongs to.For this reason, one may choose to restrict attention to the largest component only, the one with comp == 1, if this is feasible for the problem at hand.
To the author's knowledge, the identification problem when there are more than two factors has not been solved in general.efactory()'s behaviour with more than two factors is to assume that the connected components of the two first factors are sufficient for identification, and a single reference is used in each of the remaining factors.lfe contains a probabilistic test for estimability in this case, the one described in Gaure (2013a, Remark 6.2), and getfe() will issue a warning if it finds an identification problem.The test is also available as the function is.estimable().In theory, the test can fail only on a set of measure zero.Specifically, the test uses the fact that an estimable function must evaluate to the same value on all solutions of equation ( 2).Different solutions can be obtained by running the Kaczmarz algorithm with different initial vectors.By starting with the zero vector we obtain the solution with least Euclidean norm.The test works by comparing the value of the candidate estimable function on the least norm solution and a solution obtained by drawing the initial vector η at random.The test will only fail to find an identification problem if η happens to lie in a particular, but unknown, subspace of positive codimension, i.e. in a set of Lebesgue measure zero.However, due to numerical inaccuracies, finite arithmetic, and bad luck, the test may in practice fail to find an identification problem even if there is one, with some very small positive probability.It does however not report identification problems when there are none.If there are identification problems which are unaccounted for, then some, or all of the resulting coefficients are non-estimable, i.e. they are meaningless.
Sometimes, an identification problem can be alleviated by changing the order of the G() terms in the formula supplied to felm().To illustrate, consider a situation where we add a third factor to the example in the Introduction, nkids, the number of an individual's offspring, with only 5 levels.If our formula contains G(firm) + G(nkids) + G(id), it is likely that the graph associated with the two first factors, firm and nkids, is connected, and there can be many mobility groups which are unaccounted for.getfe() issues a warning, the returned coefficients are non-estimable; there is no correct indication of which mobility groups the coefficients belong to, and therefore, we do not know which coefficients can be meaningfully compared.If we change the model specification to G(id) + G(firm) + G(nkids), or G(id) + G(firm) + nkids, the mobility groups are found, and accounted for.
Another approach to the identification problem is found in Carneiro et al. (2012).They restrict attention to a subset of the dataset in which, by construction, all elementary contrasts, i.e. differences between coefficients, are estimable, as described by Weeks and Williams (1964).Such a partitioning of the dataset can be found by constructing a graph with the observations as vertices, and with an edge between two observations if they differ in at most a single factor.The partitions of the dataset correspond to connected components of the graph.They can be found by the function compfactor(...,WW=TRUE), which returns a factor enumerating the partitions.It can be used to select a subset of the dataset prior to calling felm().When there are only two factors, this partitioning structure coincides with the mobility groups discussed above.In some datasets, like the one in Torres et al. (2013), the largest partition comprises almost the entire dataset, but if it does not, such a selection may introduce bias in the coefficients.An extreme example: The R Journal Vol.The largest partition has 29 observations, fewer than the number of levels, even though far more estimable functions exist.This can be seen by computing the rank deficiency of the matrix D with the function rankMatrix() from package Matrix.Indeed, it is sufficient with two references in the 3 factors: > D <-t(do.call(rBind , lapply(list(f1,f2,f3), as, sparseMatrix ))) > ncol(D) -as.integer(rankMatrix(D)) [1] 2 The standard method of efactory(), to analyze the connected components of the first two factors only, works well in this particular case.It will find a reference among the two first factors, and a reference in the last, and differences between coefficients within each factor will be estimable: This can happen because the construction in Weeks and Williams (1964, Theorem 1) does not find a maximal subset, only a subset which is guaranteed to have the desired property.
Yet another algorithm for finding estimable functions is described by Godolphin and Godolphin (2001), but lfe does not implement it.

Specifying an estimable function
The function efactory() in lfe is by default called by getfe() to create an estimable function for a factor structure.The default is to use reference levels as described above, some other estimable functions are also available.However, researchers may have other needs, so it is possible to supply a user written estimable function to getfe().We describe this interface with an example, for a small dataset.The function takes as an argument a vector of length the sum of the number of levels of the factors, i.e. a solution to equation (2), applies an estimable function of choice, and returns the result.It is not necessary to include a full set of estimable functions; if so desired, our function may return just a single difference between two specific factor levels, or even some nonlinear transformation thereof.It should, however, evaluate to the same value on all the different solutions of equation ( 2).
We make a small dataset with 100 observations, a covariate x and 3 factors with a handful of levels.

Standard errors
The standard errors for β are easily computed, and are identical to the the standard errors from lm() if we were to run it with all the dummies.However, there is a minor difference.The degrees of freedom The R Journal Vol.5/2, December 2013 ISSN 2073-4859 depends on the rank of the matrix D, or D t D, or the trace of P. The rank is easily computed for the cases with one or two factors, but requires a much more time consuming approach for 3 or more factors.The default approach of felm() is to assume that the column rank deficiency of D is e − 1 when there are e > 2 factors.This may result in a too high value for the rank, hence a too low value for the degrees of freedom, which will yield too high standard errors.In most real cases the author has witnessed, the error is negligible.felm() has an argument exactDOF which may be set to TRUE to activate a more accurate computation of the rank.It does a sparse pivoted Cholesky factorization of D t D + I for some small and finds the rank deficiency by counting small pivots.This is not a perfect method, as noted by Higham (1990), but it seems to work well for the matrices the author has encountered, and it is much faster than rankMatrix() from package Matrix.To use rankMatrix() instead, one may specify exactDOF= rM .If, for some reason, one happens to know the degrees of freedom in advance, they can be specified as a numeric, like exactDOF=12536819.
The standard errors for α, the coefficients for the factor levels, can be estimated by bootstrapping.This requires resampling the residuals with replacement, and do the whole estimation over and over again, an operation which can be very time consuming.It can be requested, as in the above example, by the se argument to getfe(), the number of samples can be specified in the bN argument.The common practice of resampling observations rather than residuals is not useful for this kind of model, it results in a different factor structure, hence possibly a different set of identified coefficients in each sample.
Here is an example.We first create some covariates: Then we create some factors, and their effects: > id <-factor(sample (2000, length(x), replace=TRUE)) > firm <-factor(sample(1300, length(x), replace=TRUE)) > id.eff <-rnorm(nlevels(id)) > firm.eff<-rnorm(nlevels(firm)) and a normally distributed error term u, and an outcome y: We create a covariate Q which is correlated with the other covariates, the instrument x3, and the error term, and add it to the outcome: Estimation is now carried out by specifying the instrumented variable equation with the iv argument of felm.Only the instrument variable, in this case x3, is needed, the other covariates are added by felm().Residual standard error: 1.668 on 6717 degrees of freedom Multiple R-squared: 0.8112 Adjusted R-squared: 0.7189 F-statistic: 8.791 on 3282 and 6717 DF, p-value: < 2.2e-16 felm() supports more than one instrumented variable, in this case the iv argument must be a list of formulas, one for each instrumented variable, with all the instruments on the right hand side.
The author has been made aware that both the G() syntax and the handling of instrumental variables equations could be done in a smoother fashion with multi-part formulas.Such a syntax is not presently available in lfe.

Estimation time
It is a fact of life that both felm() and getfe() may take quite a while to complete.What is more unfortunate is that time to completion does not only depend on the size of the dataset, but also on its structure.In particular, the dependencies between the factors, beyond the pure identification problems, have a huge impact.The rate of convergence for the general method of alternating projections has been analyzed by Aronszajn (1950); Kayalar and Weinert (1988); Gearhart and Koshy (1989); Deutsch and Hundal (1997); Bauschke et al. (2003); Badea et al. (2012), among others.Their results are in terms of the cosine c of generalized angles between the subspaces corresponding to the projections P i .For two factors, the convergence rate in operator norm is known to be where 0 ≤ c < 1.The convergence is linear in the case with two factors, but the rate depends heavily on the structure of the factors.With 3 or more factors, the convergence rate depends on both the structure of the factors and their order in the iterations, in a quite complicated way (Deutsch and Hundal, 1997, Theorem 2.7).The theoretical convergence results are also valid for the Kaczmarz method used by getfe(), as this is a special case of the method of alternating projections (Deutsch and Hundal, 1997, Section 4).Though, for our Kaczmarz step the number of projections is the number of observations, not the number of factors.
We do not offer general results which relate intuitive properties of the factors to the convergence rate, but here are some small examples to illustrate the complexity.
But then we introduce a dependence which makes convergence slow.We let the second factor be closely related to the first, with some stochasticity, but we do not increase the number of levels.In this example, the second factor f3 can only have 5 different values for each value of f1.
The author has not succeeded in finding general intuitive guidelines for the convergence rate.However, with two factors, a relevant concept seems to be the bipartite graph where the vertices are factor levels and there is an edge between levels which are observed together.It is the connected components of this graph which determines estimability.The graph can be analyzed with the tools in package igraph (Csardi and Nepusz, 2006)  We make a list of the associated graphs: > glist <-lapply(list(f2,f3,f4,f5,f6), + function(f) mkgraph(lapply(list(f1,f), factor))) The graphs, except for the last, are connected: > sapply(glist, no.clusters) [1] 1 1 1 1 50 The average degrees, i.e. the number of edges in the graph, show no convincing signs of being strongly related to the estimation time: > sapply(glist, function(g) mean(degree(g))) [1] 19.100301 8.404311 8.410719 8.400039 8.423925 A property of the graph which shows some promise of correlating with the convergence rate is the diameter.To get some intuituion of what this is, consider the example with firms and employees.The associated graph has as vertices the firms and employees, and there is an edge between a firm and an employee if the employee has worked for the firm.A path between e.g. two employees Roy and Floyd can be constructed by finding a firm that Roy has worked for, then another employee Warshall of this firm, then another firm for Warshall, zigzagging between firms and their employees until we reach the employee Floyd.For each pair of vertices (i.e. firms and employees) in a mobility group, there is a shortest path.The length of the longest of the shortest paths among all vertex pairs is the graph's diameter.
In these particular examples, it turns out that in the cases with fast convergence (the first, third, and fifth), the shortest paths between pairs of vertices are typically much shorter than in the slowly converging cases.We do not compute the diameter as a typical fast algorithm for this, the Roy-Warshall-Floyd algorithm, has complexity of order O(n 3 ), where n is the number of vertices, i.e. 10300 in our example.Below, for simplicity, we exclude the last graph, it is disconnected, hence of infinite diameter, though the diameters of its (comparably small) components are relevant.A component of this graph would typically contain only about 300/50 = 6 different levels of the second factor f6, so its diameter can't possibly be very large.
> for(gr in glist[1:4]) + print(fivenum(shortest.paths(gr, v=sample(V(gr),10), to=sample(V(gr),10)))) Also, the many variants of factor structures in Markussen and Røed (2012) (described in Gaure (2013a)) and ongoing subsequent works, use factors which are interactions of other factors, in such a way that there are a few collinearities by design.This is not a conceptual problem when the factors are used as controls, but manually removing these known collinearities by merging some levels into a common "reference" level, yields performance improvements of up to two orders of magnitude (Markussen, 2013), a phenomenon which was not noted in Gaure (2013a).Incidentally, this also leads to shorter paths via the common reference level.This suggests that, loosely speaking, datasets with six degrees of separation converge fast, whereas less well connected datasets converge slower.However, we cannot rule out other structural differences which may have an impact on the convergence rate; there is no such thing as "larger diameter, ceteris paribus" for graphs.It is possible that a diameter path could be used to construct a lower bound for the c in equation ( 3), but the author has not found a proof for such an assertion.
The convergence rate with more than two factors is more complicated, the theoretical results of Deutsch and Hundal (1997) are less complete, the rate is not a function of the cosines of pairs of subspaces, and even the order of the projections matters.In cases where only some of the factors have many levels, the remaining factors may be specified without G(), treating them as ordinary dummy-encoded covariates.lfe utilizes two acceleration techniques for the alternating projections methods.In demeanlist(), which is used by felm() to centre the covariates, the line search method in Bauschke et al. (2003, Theorem 3.16) is used, even though their Example 3.24 shows that in some special cases with more than two factors, this method is in fact slower.For the Kaczmarz method, random shuffling of the equations is used, as suggested by Gaure (2013a).

Parallelism
felm() is thread parallelized over the vectors to be centred, i.e. all variables in the model except those enclosed in G().The number of processors is fetched with getOptions( lfe.threads ) which is initialized upon loading of lfe from the environment variable LFE_THREADS or OMP_NUM_THREADS, or otherwise from an heuristic snatched from package multicore (Urbanek, 2011).It may of course be set manually after loading the package.There is no benefit from specifying more threads than there are variables to centre.getfe() is not parallelized as such, but bootstrapping with getfe(...,se=TRUE) is.

Summary
The package lfe contains methods for estimating ordinary least square models with multiple factors with too many levels for conventional methods like lm().Such models may exhibit non-trivial identification problems.These are satisfactorily solved for the two factor case, and the package also includes some partial solutions for the case with more factors.Instrumental variable regression via the two step OLS is also supported.Convergence rate can be an issue for some datasets, and the present paper suggests some intuitive structural reasons for this.
The method employed by lfe is known as the method of alternating projections, a somewhat old and well studied method which to the author's knowledge has not been applied to the particular problem of linear regression with dummy variables before.
used in bootstrapping, adding names to very long vectors would only contribute to exercising R's memory management; hence names should only be added when required.It is also possible to add an attribute called extra , which is a named list of vectors of the same length as the names, for adding additional information to the result of getfe().This attribute is added by the estimable functions returned by efactory() to provide the obs , fe , comp and idx columns, but the content is not used for anything inside lfe.The estimable function is supplied to getfe() via the argument ef.In this example we also request bootstrapped standard errors with the arguments se and bN, we elaborate on this in the next section. is as follows: