glmmPen: High Dimensional Penalized Generalized Linear Mixed Models

Generalized linear mixed models (GLMMs) are widely used in research for their ability to model correlated outcomes with non-Gaussian conditional distributions. The proper selection of fixed and random effects is a critical part of the modeling process, since model misspecification may lead to significant bias. However, the joint selection of fixed and random effects has historically been limited to lower-dimensional GLMMs, largely due to the use of criterion-based model selection strategies. Here we present the R package glmmPen, one of the first packages to select fixed and random effects in higher dimensions using a penalized GLMM modeling framework. Model parameters are estimated using a Monte Carlo Expectation Conditional Minimization (MCECM) algorithm, which leverages Stan and RcppArmadillo for increased computational efficiency. Our package supports the Binomial, Gaussian, and Poisson families and multiple penalty functions. In this manuscript we discuss the modeling procedure, estimation scheme, and software implementation through application to a pancreatic cancer subtyping study. Simulation results show that our method performs well in selecting both the fixed and random effects in high-dimensional GLMMs.

Hillary M. Heiling (University of North Carolina Chapel Hill), Naim U. Rashid (University of North Carolina Chapel Hill), Quefeng Li (University of North Carolina Chapel Hill), Joseph G. Ibrahim (University of North Carolina Chapel Hill)

0.1 Supplementary materials

Supplementary materials are available for download in addition to this article.

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".


For attribution, please cite this work as

Heiling, et al., "glmmPen: High Dimensional Penalized Generalized Linear Mixed Models", The R Journal, 2024

BibTeX citation

@article{RJ-2023-086,
  author = {Heiling, Hillary M. and Rashid, Naim U. and Li, Quefeng and Ibrahim, Joseph G.},
  title = {glmmPen: High Dimensional Penalized Generalized Linear Mixed Models},
  journal = {The R Journal},
  year = {2024},
  note = {},
  doi = {10.32614/RJ-2023-086},
  volume = {15},
  issue = {4},
  issn = {2073-4859},
  pages = {106-128}
}