
Chinese Journal of Antituberculosis ›› 2025, Vol. 47 ›› Issue (6): 708-718. doi: 10.19982/j.issn.1000-6621.20250011

• Original Articles •

Establishing and validating a prediction model for HIV-associated nontuberculous mycobacterial disease based on machine learning

Li Longfen1, Shi Chunjing1, Luo Yun1, Zhang Huajie1, Liu Jun2, Wang Ge1, Zhao Yanhong1, Yuan Lijuan1, Li Shan1, Li Wenming1, Shen Lingjun1

  1. Department of Integrated Pulmonary Tuberculosis, Kunming Third People’s Hospital/Yunnan Clinical Medical Center for Infectious Diseases/Kunming Yunnan Diagnosis and Treatment Technology Center for Nontuberculous Mycobacterial Diseases (2024-SW (Technology)-11)/The Sixth Affiliated Hospital of Dali University, Kunming 650041, China
  2. Department of Infectious Disease I, Kunming Third People’s Hospital, Kunming 650041, China
  • Received: 2025-01-11 Online: 2025-06-10 Published: 2025-06-11
  • Contact: Li Wenming, Email: 328202492@qq.com; Shen Lingjun, Email: m18608770202@163.com
  • Supported by:
    National Natural Science Foundation of China (82460001); The Basic Research Foundation of Yunnan Province Local Universities (202401BA070001-063); The Basic Research Foundation of Yunnan Province Local Universities (202401BA070001-082); Kunming Municipal Health Commission Health Science Research Project (2023-03-08-012); Kunming Municipal Health Commission Health Science Research Project (2023-03-02-019)

Abstract:

Objective: To establish a machine learning-based prediction model for HIV-associated nontuberculous mycobacterial (NTM) disease, in order to provide a basis for the early clinical identification of HIV co-infection with NTM. Methods: A retrospective analysis was conducted on 4475 patients hospitalized at the Third People’s Hospital of Kunming from August 2021 to August 2024. According to the inclusion, exclusion, and grouping criteria, 77 patients with HIV complicated by NTM disease were assigned to the observation group and 262 patients with HIV without NTM disease were assigned to the control group, and their clinical data were collected. Borderline SMOTE was applied to address the class imbalance between the two groups. Feature selection was then performed using support vector machine recursive feature elimination (SVM-RFE), Lasso regression, and random forest. Multicollinearity among the variables was assessed using the variance inflation factor (VIF) and tolerance. Predictive models were fitted by logistic regression and presented as mathematical equations. The models were evaluated using ROC curves, calibration curves, clinical decision curves, clinical impact curves, and external validation. Results: The 339 patients were randomly divided in an 8∶2 ratio into a training set of 272 cases and a validation set of 67 cases. The training set contained 208 control cases and 64 observation cases. After Borderline SMOTE, the control group remained at 208 cases, while the observation group increased to 202 cases. In the SVM-RFE importance ranking, the top 5 indicators were HIV-RNA, CD45+, CRP, PCT, and HB.
Model 1 was established with the following logistic equation: Logit(P): Y = 3.22 + 2.4×HIV-RNA (1 or 0) - 0.002×CD45+ + 0.021×CRP + 0.908×PCT - 0.037×HB, P = 1/(1 + e^(-Y)) (Y: predictive index; P: predictive probability). Lasso regression identified the top 5 indicators as L, HB, CD45+, CRP, and HIV-RNA, and model 2 was established with the following logistic equation: Logit(P): Y = 2.940 + 2.57×HIV-RNA (1 or 0) - 0.002×CD45+ + 0.0240×CRP - 0.823×L - 0.034×HB, P = 1/(1 + e^(-Y)). In the random forest importance ranking, the top 5 indicators were CD45+, L, HIV-RNA, MLR, and PNI, and model 3 was established with the following logistic equation: Logit(P): Y = 2.214 + 2.350×HIV-RNA (1 or 0) - 0.002×CD45+ + 0.702×MLR - 0.681×L - 0.080×PNI, P = 1/(1 + e^(-Y)). ROC curve analysis gave the following results: model 1 (AUC: 0.944, 95%CI: 0.923-0.965), model 2 (AUC: 0.944, 95%CI: 0.922-0.965), and model 3 (AUC: 0.929, 95%CI: 0.904-0.954). For models 1, 2, and 3 respectively, the sensitivities were 87.1%, 90.6%, and 94.6%; the specificities were 91.3%, 89.4%, and 81.2%; the Youden’s indices were 0.784, 0.800, and 0.758; the positive likelihood ratios (+LR) were 10.010, 8.547, and 5.028; and the negative likelihood ratios (-LR) were 0.141, 0.105, and 0.066. There was no statistically significant difference among the AUCs of the three models, and their calibration curves all indicated that the predictions were consistent with the actual outcomes. The clinical decision curves and clinical impact curves demonstrated that, using the optimal cutoff value as the probability threshold, all three models could result in patient benefit. External validation showed that all three models had good predictive value in the validation set, indicating that they were stable. Conclusion: The three models established in this study all have high predictive value, with good discrimination, calibration, clinical applicability, and stability.
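
The three logistic equations and the reported diagnostic indices can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' code: the coefficients are transcribed from the abstract, while the dictionary keys, function names, and the example patient values are hypothetical. The abstract does not state measurement units or how HIV-RNA is coded as 1 or 0, so inputs must follow whatever conventions the full paper defines.

```python
import math

# Coefficients transcribed from the abstract's three logistic equations.
# Key names are illustrative labels, not identifiers from the paper.
MODELS = {
    "model1": {"intercept": 3.22, "HIV_RNA": 2.4, "CD45": -0.002,
               "CRP": 0.021, "PCT": 0.908, "HB": -0.037},
    "model2": {"intercept": 2.940, "HIV_RNA": 2.57, "CD45": -0.002,
               "CRP": 0.0240, "L": -0.823, "HB": -0.034},
    "model3": {"intercept": 2.214, "HIV_RNA": 2.350, "CD45": -0.002,
               "MLR": 0.702, "L": -0.681, "PNI": -0.080},
}


def predictive_index(model, values):
    """Return Y, the linear predictor (intercept plus weighted indicators)."""
    coefs = MODELS[model]
    return coefs["intercept"] + sum(
        weight * values[name]
        for name, weight in coefs.items()
        if name != "intercept"
    )


def predicted_probability(model, values):
    """Return P = 1 / (1 + e^(-Y))."""
    return 1.0 / (1.0 + math.exp(-predictive_index(model, values)))


def diagnostic_summary(sensitivity, specificity):
    """Derive Youden's index and likelihood ratios from sensitivity/specificity."""
    return {
        "youden": sensitivity + specificity - 1.0,
        "positive_lr": sensitivity / (1.0 - specificity),
        "negative_lr": (1.0 - sensitivity) / specificity,
    }


if __name__ == "__main__":
    # Hypothetical patient; values and units are assumptions for illustration.
    patient = {"HIV_RNA": 1, "CD45": 1000, "CRP": 10, "PCT": 0, "HB": 100}
    print(predicted_probability("model1", patient))
    # Sensitivity and specificity reported for model 1 in the abstract.
    print(diagnostic_summary(0.871, 0.913))
```

Applied to the reported sensitivity and specificity pairs, `diagnostic_summary` gives Youden's indices and likelihood ratios that agree with the abstract's values to within rounding of the published percentages.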

Key words: Mycobacterium infections, Acquired immunodeficiency syndrome, Models, statistical, Diagnosis, computer-assisted
