
Chinese Journal of Antituberculosis ›› 2025, Vol. 47 ›› Issue (10): 1311-1317. doi: 10.19982/j.issn.1000-6621.20250166

• Original Articles •

Building and evaluating a predictive model for secondary pulmonary tuberculosis based on Random Forest model

Chu Guangyan1, Li Ting1, Yu Jianing1, He Di1, Zhang Kun2, Hou Shaoying3, Yan Shichun4

  1. Heilongjiang Provincial Center for Infectious Disease Prevention and Treatment, Harbin 150500, China
  2. Department of Clinical Nutrition, Heilongjiang Provincial Hospital, Harbin 150036, China
  3. School of Public Health, Harbin Medical University, Harbin 150081, China
  4. Heilongjiang Provincial Center for Disease Control and Prevention, Harbin 150030, China
  • Received: 2025-04-25 Online: 2025-10-10 Published: 2025-09-29
  • Contact: Hou Shaoying, Email: hsy3982@163.com; Yan Shichun, Email: yan208@163.com
  • Supported by:
    Scientific Research Project of Heilongjiang Provincial Health Commission (20230303060013)

Abstract:

Objective: To investigate the screening value of the Random Forest algorithm for secondary pulmonary tuberculosis and to provide a basis for its early clinical identification. Methods: A total of 1208 healthy individuals who underwent physical examination at the Health Management Center of Heilongjiang Provincial Hospital from March to September 2021 and 876 patients with secondary pulmonary tuberculosis who were initially diagnosed and treated at Heilongjiang Provincial Infectious Disease Hospital were identified. After the inclusion and exclusion criteria were applied, 367 patients with secondary pulmonary tuberculosis were assigned to the observation group and 376 healthy individuals to the control group. Basic demographic data were collected. The dataset was divided into a training set (495 cases) and a testing set (248 cases) at a 2:1 ratio. The model incorporated 38 predictive variables. A Random Forest model was constructed with the training set, and its performance was validated and evaluated with the testing set. The top ten important variables identified by the model were then compared between the two groups. Results: The optimal parameters for the Random Forest model were 5 nodes and 300 trees. The model achieved an accuracy of 99.60%, precision of 99.92%, sensitivity of 99.20%, specificity of 99.87%, and an area under the receiver operating characteristic curve (AUC) of 0.986 (95%CI: 0.978-0.995).
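As a rough illustration of the evaluation steps described in the Methods, the Python sketch below splits 743 case indices at a 2:1 ratio and derives accuracy, precision, sensitivity, and specificity from a 2x2 confusion matrix. The function names, the random seed, and the confusion-matrix counts are hypothetical and chosen for illustration only; they are not the study's data, and the abstract does not state which software or splitting procedure the authors used.

```python
import random


def split_two_to_one(n_cases, seed=0):
    """Shuffle case indices and split them 2:1 into training and testing sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    n_train = round(n_cases * 2 / 3)
    return indices[:n_train], indices[n_train:]


def classification_metrics(tp, fp, tn, fn):
    """Return the four reported metric types from 2x2 confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),      # positive predictive value
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }


# 367 patients + 376 controls = 743 cases in total.
train_idx, test_idx = split_two_to_one(743)
print(len(train_idx), len(test_idx))  # 495 248

# Hypothetical counts that only sum to the 248-case testing set;
# they are NOT the confusion matrix from the study.
metrics = classification_metrics(tp=120, fp=2, tn=124, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounding 743 × 2/3 to the nearest integer gives the 495/248 split reported in the abstract; whether the authors stratified the split by group is not stated.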
The top ten variables ranked by mean decrease in Gini index were: platelet distribution width (30.02), albumin-to-globulin ratio (20.70), indirect bilirubin (19.32), albumin (17.97), mean corpuscular hemoglobin concentration (12.24), urine specific gravity (11.26), total bilirubin (10.09), total bile acids (7.43), lymphocyte percentage (6.92), and aspartate aminotransferase/alanine aminotransferase (AST/ALT) ratio (6.50). Compared with the control group, the observation group had significantly higher platelet distribution width (15.90 (15.50, 16.20) fL vs. 12.00 (11.10, 13.30) fL; Z=-16.907, P<0.001), a higher AST/ALT ratio (1.41 (1.02, 1.79) vs. 0.95 (0.77, 1.15); Z=-11.951, P<0.001), and a higher proportion of abnormal urine specific gravity (abnormal: 340 cases (92.64%), normal: 27 cases (7.36%) vs. abnormal: 290 cases (77.13%), normal: 86 cases (22.87%); χ²=34.670, P<0.001). Conversely, the observation group had significantly lower albumin ((38.42±6.47) g/L vs. (45.14±2.13) g/L; t=-18.891, P<0.001), total bilirubin (12.80 (9.10, 19.00) μmol/L vs. 22.25 (18.40, 26.90) μmol/L; Z=-14.313, P<0.001), indirect bilirubin (7.60 (5.05, 11.80) μmol/L vs. 17.40 (14.30, 19.10) μmol/L; Z=-16.994, P<0.001), albumin-to-globulin ratio (1.17 (0.97, 1.40) vs. 1.60 (1.50, 1.70); Z=-17.030, P<0.001), total bile acids (2.80 (1.79, 5.12) μmol/L vs. 4.60 (3.70, 5.50) μmol/L; Z=-9.675, P<0.001), lymphocyte percentage (22.90 (15.50, 32.55) % vs. 34.60 (29.07, 39.90) %; Z=-12.684, P<0.001), and mean corpuscular hemoglobin concentration (322.00 (317.00, 328.00) g/L vs. 336.00 (331.00, 343.00) g/L; Z=-16.843, P<0.001). Conclusion: The Random Forest prediction model constructed in this study for secondary pulmonary tuberculosis exhibited excellent performance.
Early screening for tuberculosis should pay attention to changes in certain nutritional indicators (such as indirect bilirubin, total bilirubin, and albumin) to optimize the screening strategy. Furthermore, nutritional therapy should be emphasized during the treatment of patients with secondary pulmonary tuberculosis.
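The variable ranking in the Results relies on the mean decrease in Gini index. A minimal Python sketch of the underlying quantity, the drop in Gini impurity produced by a single split, is given below. The labels and the split are hypothetical toy data, not study data. In a Random Forest, this decrease is accumulated over every node that splits on a given variable and averaged across trees; the exact scaling depends on the software, which the abstract does not specify.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical node with 5 patients ("TB") and 5 controls, split on an
# assumed threshold of some laboratory variable.
parent = ["TB"] * 5 + ["control"] * 5
left = ["TB"] * 4 + ["control"] * 1    # cases above the assumed threshold
right = ["TB"] * 1 + ["control"] * 4   # cases at or below the assumed threshold

decrease = gini_decrease(parent, left, right)
print(f"Gini decrease for this split: {decrease:.2f}")
```

A variable that separates the two groups cleanly yields large decreases at the nodes where it is used, which is why it ranks high in the importance list.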

Key words: Tuberculosis, pulmonary; Forecasting; Models, statistical
