glmmPen: High Dimensional Penalized Generalized Linear Mixed Models

Generalized linear mixed models (GLMMs) are widely used in research for their ability to model correlated outcomes with non-Gaussian conditional distributions. The proper selection of fixed and random effects is a critical part of the modeling process since model misspecification may lead to significant bias. However, the joint selection of fixed and random effects has historically been limited to lower-dimensional GLMMs, largely due to the use of criterion-based model selection strategies. Here we present the R package glmmPen, one of the first to select fixed and random effects in higher dimensions using a penalized GLMM modeling framework. Model parameters are estimated using a Monte Carlo Expectation Conditional Minimization (MCECM) algorithm, which leverages Stan and RcppArmadillo for increased computational efficiency. Our package supports the Binomial, Gaussian, and Poisson families and multiple penalty functions. In this manuscript we discuss the modeling procedure, estimation scheme, and software implementation through application to a pancreatic cancer subtyping study. Simulation results show that our method performs well in selecting both the fixed and random effects in high-dimensional GLMMs.

Hillary M. Heiling (University of North Carolina Chapel Hill), Naim U. Rashid (University of North Carolina Chapel Hill), Quefeng Li (University of North Carolina Chapel Hill), Joseph G. Ibrahim (University of North Carolina Chapel Hill)
2024-04-11

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-086.zip

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Heiling, et al., "glmmPen: High Dimensional Penalized Generalized Linear Mixed Models", The R Journal, 2024

BibTeX citation

@article{RJ-2023-086,
  author = {Heiling, Hillary M. and Rashid, Naim U. and Li, Quefeng and Ibrahim, Joseph G.},
  title = {glmmPen: High Dimensional Penalized Generalized Linear Mixed Models},
  journal = {The R Journal},
  year = {2024},
  note = {https://doi.org/10.32614/RJ-2023-086},
  doi = {10.32614/RJ-2023-086},
  volume = {15},
  issue = {4},
  issn = {2073-4859},
  pages = {106-128}
}