SIMEXBoost: An R package for Analysis of High-Dimensional Error-Prone Data Based on Boosting Method

Boosting is a powerful statistical learning method. Its key feature is the ability to build a strong learner from simple yet weak learners by iteratively updating the learning results. Boosting algorithms have also been employed for variable selection and estimation in regression models. In practice, however, covariates are often subject to measurement error, and ignoring it can lead to biased estimates and invalid inferences. To the best of our knowledge, few packages have been developed to address measurement error and variable selection simultaneously using boosting algorithms. In this paper, we introduce the R package SIMEXBoost, which covers several widely used regression models and applies the simulation and extrapolation (SIMEX) method to correct for measurement error effects. Moreover, SIMEXBoost enables variable selection and estimation for high-dimensional data under various regression models. We conduct numerical studies to assess the performance and illustrate the features of the package.
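To sketch the simulation and extrapolation (SIMEX) idea that underlies the package, the following base-R toy example corrects the attenuation bias of a naive linear-model slope under a mismeasured covariate. This is an illustrative sketch only, not the SIMEXBoost API: the variable names, noise levels, and extrapolation function are assumptions chosen for exposition.

```r
# Toy SIMEX for a linear model with one error-prone covariate (base R only).
set.seed(1)
n <- 500
x <- rnorm(n)                        # true (unobserved) covariate
y <- 1 + 2 * x + rnorm(n)            # outcome; true slope is 2
sigma_e <- 0.5
xstar <- x + rnorm(n, sd = sigma_e)  # observed, error-prone covariate

zeta <- c(0, 0.5, 1, 1.5, 2)         # levels of additional simulated noise
B <- 200                             # remeasurements per level

# Simulation step: at each zeta, add extra noise and record the naive slope.
est <- sapply(zeta, function(z) {
  mean(replicate(B, {
    xz <- xstar + rnorm(n, sd = sqrt(z) * sigma_e)
    coef(lm(y ~ xz))[2]
  }))
})

# Extrapolation step: model the bias trend in zeta and extrapolate to
# zeta = -1, which corresponds to the error-free covariate.
fit <- lm(est ~ zeta + I(zeta^2))
beta_simex <- predict(fit, newdata = data.frame(zeta = -1))
```

The extrapolated estimate `beta_simex` should lie noticeably closer to the true slope than the naive estimate `est[1]`, which is attenuated toward zero by the measurement error.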

Li-Pang Chen (Department of Statistics, National Chengchi University), Bangxu Qiu (Department of Statistics, National Chengchi University)
2024-04-11

Supplementary materials

Supplementary materials are available in addition to this article. They can be downloaded at RJ-2023-080.zip.

References

A. Agresti. Categorical data analysis. New York: Wiley, 2012.
E. Alfaro, M. Gamez, L. Garcia, N. Guo, A. Albano, M. Sciandra and A. Plaia. Adabag: Applies multiclass AdaBoost.M1, SAMME and bagging. 2023. URL https://cran.r-project.org/package=adabag. R package version 5.0.
K. Bartoszek. GLSME: Generalized least squares with measurement error. 2019. URL https://cran.r-project.org/package=GLSME. R package version 1.0.5.
S. Boyd and L. Vandenberghe. Convex optimization. New York: Cambridge University Press, 2004.
B. Brown, C. J. Miller and J. Wolfson. ThrEEBoost: Thresholded boosting for variable selection and prediction via estimating equations. Journal of Computational and Graphical Statistics, 26: 579–588, 2017.
R. J. Carroll, D. Ruppert, L. A. Stefanski and C. M. Crainiceanu. Measurement error in nonlinear models. New York: CRC Press, 2006.
L.-P. Chen. A note of feature screening via rank-based coefficient of correlation. Biometrical Journal, 65: 2100373, 2023a.
L.-P. Chen. BOOME: A Python package for handling misclassified disease and ultrahigh-dimensional error-prone gene expression data. PLOS ONE, 17: e0276664, 2023b.
L.-P. Chen. De-noising boosting methods for variable selection and estimation subject to error-prone variables. Statistics and Computing, 33:38: 1–13, 2023c.
L.-P. Chen. Feature screening based on distance correlation for ultrahigh-dimensional censored data with covariate measurement error. Computational Statistics, 36: 857–884, 2021.
L.-P. Chen. Variable selection and estimation for misclassified binary responses and multivariate error-prone predictors. Journal of Computational and Graphical Statistics, 2023d. URL https://doi.org/10.1080/10618600.2023.2218428.
L.-P. Chen and B. Qiu. Analysis of length-biased and partly interval-censored survival data with mismeasured covariates. Biometrics, 79: 3929–3940, 2023.
L.-P. Chen and G. Y. Yi. Analysis of noisy survival data with graphical proportional hazards measurement error models. Biometrics, 77: 956–969, 2021.
T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, K. Chen, R. Mitchell, I. Cano, T. Zhou, et al. Xgboost: Extreme gradient boosting. 2023. URL https://cran.r-project.org/package=xgboost. R package version 1.7.5.1.
Y. Feng, J. Fan, D. F. Saldana, Y. Wu and R. Samworth. SIS: Sure independence screening. 2020. URL https://cran.r-project.org/package=SIS. R package version 0.8-8.
J. Friedman, T. Hastie, R. Tibshirani, B. Narasimhan, K. Tay, N. Simon, J. Qian and J. Yang. Glmnet: Lasso and elastic-net regularized generalized linear models. 2023. URL https://cran.r-project.org/package=glmnet. R package version 4.1-7.
B. Greenwell, B. Boehmke, J. Cunningham and G. Developers. Gbm: Generalized boosted regression models. 2022. URL https://cran.r-project.org/package=gbm. R package version 2.1.8.1.
A. Groll. GMMBoost: Likelihood-based boosting for generalized mixed models. 2020. URL https://cran.r-project.org/package=GMMBoost. R package version 1.1.3.
T. Hastie, R. Tibshirani and J. Friedman. The elements of statistical learning: Data mining, inference, and prediction. New York: Springer, 2008.
B. Hofner, A. Mayr, N. Fenske, J. Thomas and M. Schmid. gamboostLSS: Boosting methods for 'GAMLSS'. 2023. URL https://cran.r-project.org/package=gamboostLSS.
J. F. Lawless. Statistical models and methods for lifetime data. New York: Wiley, 2003.
W. Lederer, H. Seibold, H. Küchenhoff, C. Lawrence and R. F. Brøndum. Simex: SIMEX- and MCSIMEX-algorithm for measurement error models. 2019. URL https://cran.r-project.org/package=simex. R package version 1.8.
L. Nab. Mecor: Measurement error correction in linear models with a continuous outcome. 2021. URL https://cran.r-project.org/package=mecor. R package version 1.0.0.
B. Qiu and L.-P. Chen. SIMEXBoost: Boosting method for high-dimensional error-prone data. 2023. URL https://cran.r-project.org/package=SIMEXBoost. R package version 0.2.0.
Y. Shi, G. Ke, D. Soukhavong, J. Lamb, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, et al. Lightgbm: Light gradient boosting machine. 2023. URL https://cran.r-project.org/package=lightgbm. R package version 3.3.5.
R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58: 267–288, 1996.
Z. Wang and T. Hothorn. Bst: Gradient boosting. 2023. URL https://cran.r-project.org/package=bst. R package version 0.3-24.
J. Wolfson. EEBoost: A general method for prediction and variable selection based on estimating equations. Journal of the American Statistical Association, 106: 296–305, 2011.
J. Xiong, W. He and G. Y. Yi. Simexaft: simexaft. 2019. URL https://cran.r-project.org/package=simexaft. R package version 1.0.7.1.
G. Y. Yi. Statistical analysis with measurement error or misclassification: Strategy, method and application. New York: Springer, 2017.
Q. Zhang and G. Y. Yi. augSIMEX: Analysis of data with mixed measurement error and misclassification in covariates. 2020. URL https://cran.r-project.org/package=augSIMEX. R package version 3.7.4.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101: 1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67: 301–320, 2005.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Chen & Qiu, "SIMEXBoost: An R package for Analysis of High-Dimensional Error-Prone Data Based on Boosting Method", The R Journal, 2024

BibTeX citation

@article{RJ-2023-080,
  author = {Chen, Li-Pang and Qiu, Bangxu},
  title = {SIMEXBoost: An R package for Analysis of High-Dimensional Error-Prone Data Based on Boosting Method},
  journal = {The R Journal},
  year = {2024},
  note = {https://doi.org/10.32614/RJ-2023-080},
  doi = {10.32614/RJ-2023-080},
  volume = {15},
  issue = {4},
  issn = {2073-4859},
  pages = {5-20}
}