
Chinese Journal of Antituberculosis (中国防痨杂志) ›› 2025, Vol. 47 ›› Issue (6): 708-718. doi: 10.19982/j.issn.1000-6621.20250011

• Original Articles •

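Development and validation of a machine learning-based prediction model for nontuberculous mycobacterial disease complicating HIV infection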

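Li Longfen (李龙芬)¹, Shi Chunjing (施春晶)¹, Luo Yun (罗云)¹, Zhang Huajie (张华杰)¹, Liu Jun (刘俊)², Wang Ge (王戈)¹, Zhao Yanhong (赵雁红)¹, Yuan Lijuan (袁丽娟)¹, Li Shan (李珊)¹, Li Wenming (李文明)¹, Shen Lingjun (沈凌筠)¹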

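  1. Department of Integrated Pulmonary Tuberculosis, Kunming Third People's Hospital/Yunnan Clinical Medical Center for Infectious Diseases/Yunnan Diagnosis and Treatment Technology Center for Nontuberculous Mycobacterial Diseases (2024-SW (Technology)-11)/The Sixth Affiliated Hospital of Dali University, Kunming 650041, China
  2. Department of Infectious Disease I, Kunming Third People's Hospital, Kunming 650041, China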
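  • Received: 2025-01-11; Publication date: 2025-06-10; Release date: 2025-06-11
  • Corresponding authors: Li Wenming, Email: 328202492@qq.com; Shen Lingjun, Email: m18608770202@163.com
  • Funding:
    National Natural Science Foundation of China (82460001); Joint Special Fund for Local Universities of the Yunnan Provincial Department of Science and Technology (202401BA070001-063); Joint Special Fund for Local Universities of the Yunnan Provincial Department of Science and Technology (202401BA070001-082); Kunming Municipal Health Commission Health Research Project (2023-03-08-012); Kunming Municipal Health Commission Health Research Project (2023-03-02-019)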

Establishing and validating a prediction model for HIV-associated nontuberculous mycobacterial disease based on machine learning

Li Longfen¹, Shi Chunjing¹, Luo Yun¹, Zhang Huajie¹, Liu Jun², Wang Ge¹, Zhao Yanhong¹, Yuan Lijuan¹, Li Shan¹, Li Wenming¹, Shen Lingjun¹

  1. Department of Integrated Pulmonary Tuberculosis, Kunming Third People's Hospital/Yunnan Clinical Medical Center for Infectious Diseases/Yunnan Diagnosis and Treatment Technology Center for Nontuberculous Mycobacterial Diseases (2024-SW (Technology)-11)/The Sixth Affiliated Hospital of Dali University, Kunming 650041, China
  2. Department of Infectious Disease I, Kunming Third People's Hospital, Kunming 650041, China
  • Received: 2025-01-11; Online: 2025-06-10; Published: 2025-06-11
  • Contact: Li Wenming, Email: 328202492@qq.com; Shen Lingjun, Email: m18608770202@163.com
  • Supported by:
    National Natural Science Foundation of China (82460001); The Basic Research Foundation of Yunnan Province Local Universities (202401BA070001-063); The Basic Research Foundation of Yunnan Province Local Universities (202401BA070001-082); Kunming Municipal Health Commission Health Science Research Project (2023-03-08-012); Kunming Municipal Health Commission Health Science Research Project (2023-03-02-019)

Abstract:

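Objective: To develop a machine learning-based prediction model for nontuberculous mycobacterial (NTM) disease complicating human immunodeficiency virus (HIV) infection, in order to support the early clinical identification of HIV infection complicated by NTM disease.

Methods: A retrospective analysis was performed on 4475 HIV-infected patients hospitalized at Kunming Third People's Hospital from August 2021 to August 2024. According to the inclusion and exclusion criteria, 77 patients with HIV infection complicated by NTM disease formed the observation group and 262 patients with HIV infection without NTM disease formed the control group. Clinical data were collected, and Borderline SMOTE was applied to correct the sample-size imbalance between the groups. Factors were screened separately with support vector machine recursive feature elimination (SVM-RFE), Lasso regression, and random forest. Multicollinearity among the variables was tested and expressed as the variance inflation factor (VIF) and tolerance. Prediction models were fitted by logistic regression and presented as mathematical equations. The models were evaluated with receiver operating characteristic (ROC) curves, clinical decision curves, clinical impact curves, calibration curves, and external validation.

Results: The 339 subjects were randomly divided 8:2 into a training set of 272 cases and a validation set of 67 cases. The training set contained 208 control cases and 64 observation cases; after Borderline SMOTE, it contained 208 control cases and 202 observation cases. Factors were ranked by importance with SVM-RFE, and the top 5 [HIV ribonucleic acid (HIV-RNA), T lymphocytes (CD45+), C-reactive protein (CRP), procalcitonin (PCT), and hemoglobin (HB)] were used to build model 1: Logit(P): Y = 3.22 + 2.4 × HIV-RNA (1 or 0) - 0.002 × CD45+ + 0.021 × CRP + 0.908 × PCT - 0.037 × HB, P = 1/(1 + e^(-Y)) (Y: predictive index; P: predicted probability). Lasso regression selected the best 5 indicators, namely lymphocytes (L), HB, CD45+, CRP, and HIV-RNA, which were used to build model 2: Logit(P): Y = 2.940 + 2.57 × HIV-RNA (1 or 0) - 0.002 × CD45+ + 0.0240 × CRP - 0.823 × L - 0.034 × HB, P = 1/(1 + e^(-Y)). The random forest importance ranking placed CD45+, lymphocytes, HIV-RNA, the monocyte-to-lymphocyte ratio (MLR), and the prognostic nutritional index (PNI) in the top 5, which were used to build model 3: Logit(P): Y = 2.214 + 2.350 × HIV-RNA (1 or 0) - 0.002 × CD45+ + 0.702 × MLR - 0.681 × L - 0.080 × PNI, P = 1/(1 + e^(-Y)). For predicting HIV infection complicated by NTM disease, the areas under the curve (AUC) of models 1, 2, and 3 were 0.944 (95%CI: 0.923-0.965), 0.944 (95%CI: 0.922-0.965), and 0.929 (95%CI: 0.904-0.954), respectively; the sensitivities were 87.1%, 90.6%, and 94.6%; the specificities were 91.3%, 89.4%, and 81.2%; the Youden indices were 0.784, 0.800, and 0.758; the positive likelihood ratios were 10.010, 8.547, and 5.028; and the negative likelihood ratios were 0.141, 0.105, and 0.066. The calibration curves showed that the predictions of all 3 models tended to agree with the actual results, with no statistically significant difference (P>0.05). The clinical decision curves and clinical impact curves of all 3 models showed that, with the optimal cutoff value as the threshold probability, all 3 models could benefit patients. External validation showed that the 3 models also had good predictive value in the validation set, indicating good stability.

Conclusion: The 3 models established in this study all have high predictive value, with good discrimination, accuracy, clinical applicability, and stability.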

Key words: Mycobacterium infections; Acquired immunodeficiency syndrome; Models, statistical; Diagnosis, computer-assisted

Abstract:

Objective: To establish a machine learning-based prediction model for HIV-associated nontuberculous mycobacterial (NTM) disease, in order to provide a basis for the early clinical identification of HIV infection complicated by NTM disease.

Methods: A retrospective analysis was conducted on 4475 HIV-infected patients who were hospitalized at Kunming Third People's Hospital from August 2021 to August 2024. According to the inclusion and exclusion criteria, 77 patients with HIV infection complicated by NTM disease were assigned to the observation group, and 262 patients with HIV infection without NTM disease were assigned to the control group. Their clinical data were collected. Borderline SMOTE was applied to address the imbalance between the two groups. Feature selection was then conducted using support vector machine recursive feature elimination (SVM-RFE), Lasso regression, and random forest. A multicollinearity test was conducted among the variables, using the variance inflation factor (VIF) and tolerance as indicators. Predictive models were fitted by logistic regression and presented as mathematical equations. The models were evaluated using ROC curves, calibration curves, clinical decision curves, clinical impact curves, and external validation.

Results: The 339 patients were randomly divided at an 8:2 ratio into a training set of 272 cases and a validation set of 67 cases. The training set contained 208 control cases and 64 observation cases. After processing with Borderline SMOTE, the control group remained at 208 cases, while the observation group increased to 202 cases. In the SVM-RFE importance ranking, the top 5 factors were HIV-RNA, T lymphocytes (CD45+), C-reactive protein (CRP), procalcitonin (PCT), and hemoglobin (HB). Model 1 was established with the following logistic equation: Logit(P): Y = 3.22 + 2.4 × HIV-RNA (1 or 0) - 0.002 × CD45+ + 0.021 × CRP + 0.908 × PCT - 0.037 × HB, P = 1/(1 + e^(-Y)) (Y: predictive index; P: predicted probability). Lasso regression identified the top 5 indicators as lymphocytes (L), HB, CD45+, CRP, and HIV-RNA, and model 2 was established with the following logistic equation: Logit(P): Y = 2.940 + 2.57 × HIV-RNA (1 or 0) - 0.002 × CD45+ + 0.0240 × CRP - 0.823 × L - 0.034 × HB, P = 1/(1 + e^(-Y)). In the random forest importance ranking, the top 5 indicators were CD45+, L, HIV-RNA, the monocyte-to-lymphocyte ratio (MLR), and the prognostic nutritional index (PNI). Model 3 was established with the following logistic equation: Logit(P): Y = 2.214 + 2.350 × HIV-RNA (1 or 0) - 0.002 × CD45+ + 0.702 × MLR - 0.681 × L - 0.080 × PNI, P = 1/(1 + e^(-Y)). The ROC curve analysis gave the following results: model 1 (AUC: 0.944, 95%CI: 0.923-0.965), model 2 (AUC: 0.944, 95%CI: 0.922-0.965), and model 3 (AUC: 0.929, 95%CI: 0.904-0.954). For models 1, 2, and 3 respectively, the sensitivities were 87.1%, 90.6%, and 94.6%; the specificities were 91.3%, 89.4%, and 81.2%; the Youden's indices were 0.784, 0.800, and 0.758; the positive likelihood ratios (+LR) were 10.010, 8.547, and 5.028; and the negative likelihood ratios (-LR) were 0.141, 0.105, and 0.066. There was no statistically significant difference among the AUCs of the three models, and their calibration curves all indicated that the predictions were consistent with the actual outcomes. The clinical decision curves and clinical impact curves for all three models showed that, with the optimal cutoff value as the probability threshold, all three models could benefit patients. External validation showed that all three models had good predictive value in the validation set, indicating that they were stable.

Conclusion: The three models established in this study all have high predictive value, with good discrimination, calibration, clinical applicability, and stability.
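
To make the three logistic equations concrete, the following Python sketch evaluates the predictive index Y and the predicted probability P = 1/(1 + e^(-Y)) with the coefficients reported in the abstract. The dictionary keys, the function names, and the example patient values are illustrative assumptions, not part of the study. The abstract does not state measurement units or what HIV-RNA = 1 versus 0 encodes, so inputs are assumed to follow the units and coding used in the full paper.

```python
import math

# Intercepts and coefficients transcribed from the abstract's three equations.
# Key names (HIV_RNA, CD45, ...) are hypothetical labels chosen for this sketch.
MODELS = {
    "model1": {"intercept": 3.22, "HIV_RNA": 2.4, "CD45": -0.002,
               "CRP": 0.021, "PCT": 0.908, "HB": -0.037},
    "model2": {"intercept": 2.940, "HIV_RNA": 2.57, "CD45": -0.002,
               "CRP": 0.0240, "L": -0.823, "HB": -0.034},
    "model3": {"intercept": 2.214, "HIV_RNA": 2.350, "CD45": -0.002,
               "MLR": 0.702, "L": -0.681, "PNI": -0.080},
}


def predictive_index(model, values):
    """Return Y = intercept + sum(coefficient * value) for the chosen model."""
    coefs = MODELS[model]
    linear_part = sum(c * values[name]
                      for name, c in coefs.items() if name != "intercept")
    return coefs["intercept"] + linear_part


def predicted_probability(model, values):
    """Return P = 1 / (1 + e^(-Y)), the logistic transform of Y."""
    return 1.0 / (1.0 + math.exp(-predictive_index(model, values)))


if __name__ == "__main__":
    # Hypothetical patient; values are made up for illustration only.
    patient = {"HIV_RNA": 1, "CD45": 500.0, "CRP": 20.0, "PCT": 0.5,
               "HB": 110.0, "L": 1.0, "MLR": 0.5, "PNI": 40.0}
    for name in MODELS:
        print(name, round(predicted_probability(name, patient), 3))
```

Each model reads only the variables it needs from the input dictionary, so one patient record can be scored by all three models. Turning P into a yes/no prediction would require the optimal cutoff values, which the abstract does not report.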

Key words: Mycobacterium infections; Acquired immunodeficiency syndrome; Models, statistical; Diagnosis, computer-assisted
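
The Youden's indices and likelihood ratios reported in the abstract follow from each model's sensitivity and specificity by standard definitions: Youden's index = sensitivity + specificity - 1, +LR = sensitivity / (1 - specificity), and -LR = (1 - sensitivity) / specificity. The short Python sketch below recomputes them from the rounded percentages in the abstract; the function and variable names are assumptions made for illustration.

```python
def diagnostic_summary(sensitivity, specificity):
    """Derive Youden's index and likelihood ratios from proportions in (0, 1)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1)")
    return {
        "youden": sensitivity + specificity - 1.0,
        "lr_pos": sensitivity / (1.0 - specificity),
        "lr_neg": (1.0 - sensitivity) / specificity,
    }


# (sensitivity, specificity) pairs as reported in the abstract, as proportions.
REPORTED = {
    "model 1": (0.871, 0.913),
    "model 2": (0.906, 0.894),
    "model 3": (0.946, 0.812),
}

if __name__ == "__main__":
    for name, (sens, spec) in REPORTED.items():
        summary = diagnostic_summary(sens, spec)
        print(name, {key: round(value, 3) for key, value in summary.items()})
```

The recomputed values agree with the abstract to within rounding. For model 3, the +LR derived from the rounded inputs is about 5.03 rather than the reported 5.028, presumably because the authors worked from unrounded sensitivity and specificity.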
