Effects of stopping criterion on the growth of trees in regression random forests
Volume 1, Issue 1 (2023), pp. 46–61

Pub. online: 31 August 2022
Type: Methodology Article
Open Access
Area: Biomedical Research

Accepted
18 July 2022
Published
31 August 2022

Abstract

Random forests are a powerful machine learning tool that captures complex relationships between independent variables and an outcome of interest. Trees built in a random forest depend on several hyperparameters, one of the more critical being the node size. The original algorithm of Breiman controls node size by limiting the size of the parent node, so that a node cannot be split if it has fewer than a specified number of observations. We propose that this hyperparameter should instead be defined as the minimum number of observations in each terminal node. The two existing random forest approaches are compared in the regression context based on the estimated generalization error, squared bias, and variance of the resulting predictions in a number of simulated datasets. Additionally, the two approaches are applied to type 2 diabetes data obtained from the National Health and Nutrition Examination Survey. We have developed a straightforward method for incorporating weights into the random forest analysis of survey data. Our results demonstrate that generalization error under the proposed approach is competitive with that attained from the original random forest approach when data have large random error variability. The R code created from this work is available and includes an illustration.
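
The difference between the two stopping criteria can be made concrete with a minimal sketch in Python. This is not the authors' R code. It grows a single regression tree on one predictor under each rule. The function names `sse` and `grow`, the `rule` labels, and the simulated data are illustrative assumptions, and a real random forest would additionally bootstrap the sample and randomize the candidate predictors at each split.

```python
import random
from statistics import fmean


def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = fmean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(points, rule, node_size):
    """Return the terminal-node sizes of a one-predictor regression tree.

    rule="parent":   a node is split only if it holds >= node_size points.
    rule="terminal": a split is allowed only if both children
                     hold >= node_size points.
    """
    points = sorted(points)
    n = len(points)
    if rule == "parent" and n < node_size:
        return [n]
    # Smallest child size that a candidate split may produce.
    min_child = node_size if rule == "terminal" else 1
    ys = [y for _, y in points]
    best = None
    for i in range(min_child, n - min_child + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # tied x values cannot be separated
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return [n]  # no admissible split: this node is terminal
    i = best[1]
    return grow(points[:i], rule, node_size) + grow(points[i:], rule, node_size)


# Hypothetical simulated data: linear signal plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

parent_leaves = grow(data, "parent", 5)
terminal_leaves = grow(data, "terminal", 5)
print("parent rule:   smallest leaf =", min(parent_leaves))
print("terminal rule: smallest leaf =", min(terminal_leaves))
```

Under the parent-node rule, every terminal node ends up with fewer than `node_size` observations and may hold only one. Under the terminal-node rule, every terminal node holds at least `node_size` observations, so each prediction averages over more data.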

Supplementary material

The supplementary material provides code in R software for implementing the algorithms developed in this work. The National Health and Nutrition Examination Survey data utilized in the paper are also provided.

References

[1] 
Alexopoulos, A., Qamar, A., Hutchins, K., Crowley, M. J., Batch, B. C. and R., G. J. (2019). Triglycerides: Emerging Targets in Diabetes Care? Review of Moderate Hypertriglyceridemia in Diabetes. Current diabetes reports 19 13.
[2] 
Biau, G. and Devroye, L. (2010). On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis 101 2499–2518. https://doi.org/10.1016/j.jmva.2010.06.019. MR2719877
[3] 
Biau, G. and Scornet, E. (2016). A random forest guided tour. TEST 25 197–227. https://doi.org/10.1007/s11749-016-0481-7. MR3493512
[4] 
Breiman, L. (2001). Random Forests. Machine Learning 45 5–32. MR3874153
[5] 
Chen, T. C., Clark, J., Riddles, M. K., Mohadjer, L. K. and Fakhouri, T. H. I. (2020). National Health and Nutrition Examination Survey, 2015–2018: Sample Design and Estimation Procedures. Vital Health Stat 2 184 1–35.
[6] 
Hastie, T., Tibshirani, R. and Friedman, J. H. (2017). The elements of statistical learning: Data mining, inference, and prediction. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7. MR2722294
[7] 
Ishwaran, H. and Kogalur, U. B. (2021). Package ‘randomForestSRC’. https://doi.org/cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf.
[8] 
Liaw, A. and Wiener, M. (2018). Package ‘randomForest’. https://doi.org/cran.r-project.org/web/packages/randomForest/randomForest.pdf.
[9] 
Little, M., Rosenberg, P. and Arsham, A. (2022). Alternative stopping rules to limit tree expansion for random forest models. Scientific Reports (Resubmitted).
[10] 
Makris, K. and Spanou, L. (2011). Is there a relationship between mean blood glucose and glycated hemoglobin? J Diabetes Sci Technol 5(6) 1572–1583.
[11] 
Mendola, N. D., Chen, T. q. C., Gu, Q., Eberhardt, M. S. and Saydah, S. (2018). Prevalence of Total, Diagnosed, and Undiagnosed Diabetes Among Adults: United States, 2013–2016. NCHS Data Brief 1–8.
[12] 
Nicolo, M. L., Shewokis, P. A., Boullata, J., Sukumar, D., Smith, S., Compher, C. and Volpe, S. L. (2019). Sedentary behavior time as a predictor of hemoglobin A1c among adults, 40 to 59 years of age, living in the United States: National Health and Nutrition Examination Survey 2003 to 2004 and 2013 to 2014. Nutrition and Health 25 275–279. https://doi.org/10.1002/sim.7049. MR3569919
[13] 
Ninh, T., Nguyen, X. T., Lane, J. and Wang, P. (2011). Relationship between obesity and diabetes in a US adult population: findings from the National Health and Nutrition Examination Survey, 1999–2006. Obesity Surgery 21 351–355.
[14] 
Probst, P., Wright, M. N. and Boulesteix, A. (2019). Hyperparameters and tuning strategies for random forest. WIREs Data Mining and Knowledge Discovery 9 1301.
[15] 
Rohrmann, S., Smit, E., Giovannucci, E. and Platz, E. A. (2005). Association between markers of the metabolic syndrome and lower urinary tract symptoms in the Third National Health and Nutrition Examination Survey (NHANES III). International journal of obesity 29 310–316.
[16] 
Scornet, E. (2018). Tuning parameters in random forests. ESAIM: Proceedings and Surveys 60 144–162. https://doi.org/10.1051/proc/201760144. MR3772478
[17] 
van Rijn, J. N. and Hutter, F. (2018). Hyperparameter importance across datasets. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[18] 
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2017–2018]. http://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2015.
[19] 
Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, [2015–2016]. http://wwwn.cdc.gov/Nchs/Nhanes/continuousnhanes/default.aspx?BeginYear=2017.


Copyright
© 2023 New England Statistical Society
Open access article under the CC BY license.

Keywords
Regression random forest, Node size, Generalization error

Funding
This work was supported by the Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics, and utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). The Intramural Research Program of the National Institutes of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics supported the work of all authors.
