Establishing and validating a prediction model for HIV-associated nontuberculous mycobacterial disease based on machine learning

doi:10.19982/j.issn.1000-6621.20250011

Abstract

Objective: To establish a machine-learning-based prediction model for HIV-associated nontuberculous mycobacterial (NTM) disease, in order to provide a basis for the early clinical identification of HIV and NTM co-infection.

Methods: A retrospective analysis was conducted on 4475 patients hospitalized at the Third People’s Hospital of Kunming from August 2021 to August 2024. After applying the inclusion and exclusion criteria and the grouping standards, 77 patients with HIV complicated by NTM disease formed the observation group and 262 patients with HIV without NTM disease formed the control group, and their clinical data were collected. Borderline SMOTE was applied to correct the class imbalance between the two groups. Features were then selected using support vector machine recursive feature elimination (SVM-RFE), Lasso regression, and random forest. Multicollinearity among the variables was assessed using the variance inflation factor (VIF) and tolerance. Predictive models were fitted by logistic regression and presented as mathematical equations. The models were evaluated using ROC curves, calibration curves, clinical decision curves, clinical impact curves, and external validation.

Results: The 339 patients were randomly divided at an 8:2 ratio into a training set of 272 cases and a validation set of 67 cases. The training set contained 208 control cases and 64 observation cases; after Borderline SMOTE, the control group remained at 208 cases and the observation group increased to 202 cases. In every model, HIV-RNA is coded as 1 or 0, $Y$ is the predictive index, and $P = 1/(1+e^{-Y})$ is the predicted probability. SVM-RFE ranked HIV-RNA, CD45⁺ T lymphocytes (CD45⁺), C-reactive protein (CRP), procalcitonin (PCT), and hemoglobin (HB) as the top 5 indicators, and model 1 was established as $\operatorname{logit}(P) = Y = 3.22 + 2.4 \times \text{HIV-RNA} - 0.002 \times \text{CD45}^{+} + 0.021 \times \text{CRP} + 0.908 \times \text{PCT} - 0.037 \times \text{HB}$. Lasso regression identified lymphocyte count (L), HB, CD45⁺, CRP, and HIV-RNA as the top 5 indicators, and model 2 was established as $\operatorname{logit}(P) = Y = 2.940 + 2.57 \times \text{HIV-RNA} - 0.002 \times \text{CD45}^{+} + 0.0240 \times \text{CRP} - 0.823 \times \text{L} - 0.034 \times \text{HB}$. Random forest ranked CD45⁺, L, HIV-RNA, monocyte-to-lymphocyte ratio (MLR), and prognostic nutritional index (PNI) as the top 5 indicators, and model 3 was established as $\operatorname{logit}(P) = Y = 2.214 + 2.350 \times \text{HIV-RNA} - 0.002 \times \text{CD45}^{+} + 0.702 \times \text{MLR} - 0.681 \times \text{L} - 0.080 \times \text{PNI}$. ROC curve analysis gave an AUC of 0.944 (95%CI: 0.923-0.965) for model 1, 0.944 (95%CI: 0.922-0.965) for model 2, and 0.929 (95%CI: 0.904-0.954) for model 3. For models 1, 2, and 3 respectively, the sensitivities were 87.1%, 90.6%, and 94.6%; the specificities were 91.3%, 89.4%, and 81.2%; the Youden’s indices were 0.784, 0.800, and 0.758; the positive likelihood ratios (+LR) were 10.010, 8.547, and 5.028; and the negative likelihood ratios (-LR) were 0.141, 0.105, and 0.066. There was no statistically significant difference among the AUCs of the three models, and their calibration curves all indicated that the predictions were consistent with the actual outcomes. The clinical decision curves and clinical impact curves showed that, with the optimal cutoff value as the probability threshold, all three models could result in patient benefit. External validation showed that all three models had good predictive value in the validation set, indicating that they were stable.

Conclusion: The three models established in this study all have high predictive value, with good discrimination, calibration, clinical applicability, and stability.
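The three logistic equations and the reported diagnostic indices can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the names `MODELS`, `predict`, and `diagnostics` are hypothetical, HIV-RNA is assumed to be coded 1 for positive and 0 for negative, and all inputs are assumed to use the units of the data table (CD45⁺ in cells/μl, CRP in mg/L, PCT in ng/ml, HB in g/L, L in ×10⁹/L). The two example profiles are assembled from the training-set group medians for illustration only and do not describe real patients.

```python
import math

# Coefficients transcribed from the abstract: (intercept, {predictor: weight}).
MODELS = {
    "model1": (3.22, {"hiv_rna": 2.4, "cd45": -0.002, "crp": 0.021,
                      "pct": 0.908, "hb": -0.037}),
    "model2": (2.940, {"hiv_rna": 2.57, "cd45": -0.002, "crp": 0.0240,
                       "l": -0.823, "hb": -0.034}),
    "model3": (2.214, {"hiv_rna": 2.350, "cd45": -0.002, "mlr": 0.702,
                       "l": -0.681, "pni": -0.080}),
}


def predict(model, patient):
    """Return P = 1 / (1 + exp(-Y)) for the named model."""
    intercept, weights = MODELS[model]
    y = intercept + sum(w * patient[name] for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-y))


def diagnostics(sensitivity, specificity):
    """Youden's index and likelihood ratios from sensitivity and specificity."""
    return {
        "youden": sensitivity + specificity - 1.0,
        "lr_pos": sensitivity / (1.0 - specificity),
        "lr_neg": (1.0 - sensitivity) / specificity,
    }


# Illustrative profiles built from training-set medians (assumption: the
# HIV-RNA coding 1 = positive, 0 = negative; these are not real patients).
observation_like = {"hiv_rna": 1, "cd45": 763.76, "crp": 32.10, "pct": 0.11,
                    "hb": 112.07, "l": 0.70, "mlr": 0.53, "pni": 34.78}
control_like = {"hiv_rna": 0, "cd45": 1659.62, "crp": 2.72, "pct": 0.04,
                "hb": 141.00, "l": 1.63, "mlr": 0.26, "pni": 45.35}

for name in MODELS:
    print(name,
          round(predict(name, observation_like), 3),
          round(predict(name, control_like), 3))

# Model 1: sensitivity 87.1%, specificity 91.3%.
print(diagnostics(0.871, 0.913))
```

Applied to the published sensitivities and specificities, `diagnostics` reproduces the reported Youden's indices and likelihood ratios up to rounding. The abstract does not report the optimal cutoff values, so the sketch returns probabilities only and does not classify.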

Key words: Mycobacterium infections; Acquired immunodeficiency syndrome; Models, statistical; Diagnosis, computer-assisted

CLC Number: R512.91

Li Longfen, Shi Chunjing, Luo Yun, Zhang Huajie, Liu Jun, Wang Ge, Zhao Yanhong, Yuan Lijuan, Li Shan, Li Wenming, Shen Lingjun. Establishing and validating a prediction model for HIV-associated nontuberculous mycobacterial disease based on machine learning[J]. Chinese Journal of Antituberculosis, 2025, 47(6): 708-718. doi: 10.19982/j.issn.1000-6621.20250011

Figures/Tables 13

Indicator	Training set (n=272)				Validation set (n=67)
Indicator	Control group (n=208)	Observation group (n=64)	Test statistic	P value	Control group (n=54)	Observation group (n=13)	Test statistic	P value
White blood cell count [×10⁹/L, M(Q₁,Q₃)]	5.49 (4.48,74.00)	4.75 (2.91,6.04)	U=-3.040	0.002	5.65 (4.57,6.73)	4.33 (2.55,5.31)	U=-2.751	0.006
Neutrophil count [×10⁹/L, M(Q₁,Q₃)]	3.01 (2.29,4.23)	3.34 (1.78,4.94)	U=-0.090	0.928	3.37 (2.54,3.94)	2.68 (1.34,3.83)	U=-1.514	0.130
Neutrophil percentage [%, M(Q₁,Q₃)]	57.90 (49.23,65.88)	72.40 (61.33,82.68)	U=-6.144	<0.001	57.45 (52.83,67.93)	71.70 (48.10,76.10)	U=-0.793	0.428
Lymphocyte count [×10⁹/L, M(Q₁,Q₃)]	1.63 (1.22.145)	0.70 (0.36,1.10)	U=-8.393	<0.001	1.52 (1.02,2.11)	0.77 (0.34,1.38)	U=-3.100	0.002
Lymphocyte percentage [%, M(Q₁,Q₃)]	30.95 (2238.68)	15.25 (7.60,24.00)	U=-7.469	<0.001	30.55 (19.88,37.10)	18.60 (12.80,34.65)	U=-1.364	0.173
Monocyte count [×10⁹/L, M(Q₁,Q₃)]	0.44 (0.35,0.55)	0.36 (0.24,0.57)	U=-2.527	0.011	0.45 (0.33,0.59)	0.34 (0.20,0.47)	U=-2.538	0.011
Neutrophil-to-lymphocyte ratio [M(Q₁,Q₃)]	1.82 (1.26,2.93)	4.43 (2.53,10.41)	U=-7.092	<0.001	1.89 (1.40,3.40)	3.62 (1.35,6.71)	U=-1.046	0.295
Platelet-to-lymphocyte ratio [M(Q₁,Q₃)]	138.89 (100.99,175.86)	326.37 (194.68,452.61)	U=-7.899	<0.001	116.78 (90.86,206.31)	282.86 (116.46,402.98)	U=-2.251	0.024
Monocyte-to-lymphocyte ratio [M(Q₁,Q₃)]	0.26 (0.20,0.39)	0.53 (0.37,0.93)	U=-7.495	<0.001	0.28 (0.19,0.47)	0.43 (0.25,0.57)	U=-1.514	0.130
Systemic immune-inflammation index [M(Q₁,Q₃)]	389.79 (261.53,626.30)	760.43 (475.58,1728.89)	U=-6.017	<0.001	360.22 (252.79,703.41)	324.10 (257.88,1024.47)	U=-0.460	0.646
Prognostic nutritional index [M(Q₁,Q₃)]	45.35 (42.11,50.34)	34.78 (28.01,40.99)	U=-8.660	<0.001	45.96 (41.84,50.94)	36.70 (32.45,43.45)	U=-3.155	0.002
Monocyte percentage [%, M(Q₁,Q₃)]	8.05 (6.53,9.80)	9.10 (6.43,11.58)	U=-0.977	0.329	8.00 (6.58,9.80)	7.90 (7.00,12.20)	U=-0.690	0.490
Red blood cell count [×10¹²/L, M(Q₁,Q₃)]	4.18 (3.67,4.77)	3.62 (2.86,4.25)	U=-4.857	<0.001	3.99 (3.37,4.49)	4.04 (3.38,4.38)	U=-0.174	0.862
Hemoglobin [g/L, M(Q₁,Q₃)]	141.00 (126.00,151.75)	112.07 (89.50,131.00)	U=-7.056	<0.001	140.00 (119.75,147.50)	119.00 (98.50,135.00)	U=-2.791	0.005
Platelet count [×10⁹/L, M(Q₁,Q₃)]	212.00 (169.25,267.00)	207.50 (148.00,63.75)	U=-1.163	0.245	195.50 (168.50,256.00)	180.00 (112.50,227.50)	U=-1.348	0.178
Total bilirubin [μmol/L, M(Q₁,Q₃)]	9.60 (6.513.55)	6.95 (5.15,10.23)	U=-3.460	0.001	9.45 (6.98,12.40)	10.20 (6.15,15.35)	U=-0.127	0.899
Alanine aminotransferase [U/L, M(Q₁,Q₃)]	4.00 (2.00,7.00)	3.85 (1.11,6.75)	U=-1.048	0.295	3.00 (1.00,7.00)	5.00 (3.82,6.50)	U=-1.394	0.163
Aspartate aminotransferase [U/L, M(Q₁,Q₃)]	23.41 (18.00,35.48)	30.87 (20.25,50.12)	U=-2.376	0.018	22.00 (18.75,35.09)	26.00 (18.00,44.50)	U=-0.730	0.465
Albumin [g/L, M(Q₁,Q₃)]	37.70 (34.23,40.90)	30.60 (24.93,36.28)	U=-6.915	<0.001	37.70 (35.18,41.28)	32.30 (29.80,37.60)	U=-2.688	0.007
Prealbumin [mg/L, M(Q₁,Q₃)]	228.50 (187.75,267.75)	162.00 (115.25,246.94)	U=-4.775	<0.001	218.50 (179.50,267.50)	209.00 (115.00,231.50)	U=-1.990	0.047
Creatinine [μmol/L, M(Q₁,Q₃)]	60.00 (51.00,72.75)	58.00 (45.00,66.75)	U=-2.003	0.045	61.50 (49.00,70.00)	53.00 (25.00,65.00)	U=-1.245	0.213
Uric acid [μmol/L, M(Q₁,Q₃)]	307.50 (251.00,372.00)	312.50 (254.50,416.00)	U=-0.954	0.340	294.00 (248.75,395.25)	324.00 (227.00,437.50)	U=-0.420	0.674
Complement 1 [μ/ml, M(Q₁,Q₃)]	196.00 (167.00,222.00)	228.00 (184.00,266.25)	U=-3.645	<0.001	201.00 (176.75,225.25)	260.00 (198.00,310.00)	U=-2.331	0.020
CD45⁺ T lymphocytes [cells/μl, M(Q₁,Q₃)]	1659.62 (1208.14,2189.72)	763.76 (323.38,1098.00)	U=-7.101	<0.001	1623.95 (1099.50,2356.38)	703.00 (476.00,1501.64)	U=-3.314	0.001
C-reactive protein [mg/L, M(Q₁,Q₃)]	2.72 (0.82,7.54)	32.10 (5.33,60.00)	U=-6.369	<0.001	3.52 (0.93,13.18)	8.90 (2.73,36.15)	U=-1.705	0.088
Serum amyloid A [mg/L, M(Q₁,Q₃)]	3.80 (1.40,12.00)	112.50 (5.13,31.60)	U=-6.119	<0.001	3.75 (1.00,33.33)	23.30 (3.65,87.35)	U=-1.324	0.185
Interleukin-6 [pg/ml, M(Q₁,Q₃)]	4.02 (1.95,8.34)	15.55 (5.28,59.78)	U=-5.825	<0.001	3.26 (1.50,6.58)	12.30 (3.50,28.75)	U=-2.883	0.004
Procalcitonin [ng/ml, M(Q₁,Q₃)]	0.04 (0.02,0.07)	0.11 (0.04,0.32)	U=-3.040	0.002	0.03 (0.02,0.06)	0.05 (0.03,0.21)	U=-1.613	0.107
HIV-RNA positive [n (positive rate, %)]^a	62(29.8)	46(71.9)	χ²=37.660	<0.001	13(24.1)	6(46.2)	-^b	0.047
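Several rows of this table are composite indices derived from routine blood counts. The source does not state the formulas it used, so the sketch below relies on commonly used definitions, which are an assumption here: NLR, PLR, and MLR as simple ratios to the lymphocyte count, SII as platelets × neutrophils / lymphocytes, and PNI as albumin (g/L) + 5 × lymphocyte count (×10⁹/L). The function name `derived_indices` is hypothetical.

```python
def derived_indices(neutrophils, lymphocytes, monocytes, platelets, albumin):
    """Composite indices from blood counts (x10^9/L) and albumin (g/L).

    Formulas are the commonly used definitions (assumed, not taken from
    the source document).
    """
    return {
        "NLR": neutrophils / lymphocytes,              # neutrophil-to-lymphocyte ratio
        "PLR": platelets / lymphocytes,                # platelet-to-lymphocyte ratio
        "MLR": monocytes / lymphocytes,                # monocyte-to-lymphocyte ratio
        "SII": platelets * neutrophils / lymphocytes,  # systemic immune-inflammation index
        "PNI": albumin + 5.0 * lymphocytes,            # prognostic nutritional index
    }


# Illustrative input: training-set control-group medians from the table.
example = derived_indices(neutrophils=3.01, lymphocytes=1.63, monocytes=0.44,
                          platelets=212.00, albumin=37.70)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Because the table reports medians, an index computed from the median counts will only approximate the tabulated median of that index; the median of a ratio is not the ratio of the medians.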

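Indicator	Observation group (n=77)	Control group (n=262)	Test statistic	P value
Sex [n (proportion, %)]			χ²=1.249	0.264
Male	51(66.2)	155(59.2)
Female	26(33.8)	107(40.8)
Ethnicity [n (proportion, %)]			χ²=0.029	0.865
Han	71(92.2)	240(91.6)
Ethnic minority	6(7.8)	22(8.4)
Age (years, $\bar{x} \pm s$)	44.34±10.30	54.92±15.17	t=5.744	<0.001
Comorbidities [n (incidence, %)]
Diabetes	5(6.5)	24(9.2)	χ²=0.541	0.462
Hypertension	6(7.8)	39(14.9)	χ²=2.601	0.107
Smoking history	44(57.1)	124(47.3)	χ²=2.293	0.130
Drinking history	31(40.3)	84(32.1)	χ²=1.785	0.182
HIV transmission route [n (proportion, %)]			χ²=2.860	0.239
Intravenous drug injection	4(5.2)	20(7.6)
Sexual transmission	48(62.3)	135(51.6)
Mother-to-child transmission, tooth extraction	25(32.5)	107(40.8)
Marital status [n (proportion, %)]			χ²=5.746	0.057
Unmarried	20(26.0)	38(14.5)
Married	46(59.7)	174(66.4)
Divorced	11(14.3)	50(19.1)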
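The χ² statistics for the two-category rows can be recomputed from the counts in the table. The sketch below assumes an uncorrected Pearson χ² test with one degree of freedom, which is consistent with the tabulated values for sex and ethnicity; the function name `pearson_chi2_2x2` is hypothetical, and the source does not say which software or correction was used.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p) with one degree of freedom. For df = 1 the upper-tail
    probability is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Sex: observation group 51 male / 26 female, control group 155 male / 107 female.
chi2_sex, p_sex = pearson_chi2_2x2(51, 26, 155, 107)
print(f"sex: chi2={chi2_sex:.3f}, p={p_sex:.3f}")

# Ethnicity: observation group 71 Han / 6 minority, control group 240 Han / 22 minority.
chi2_eth, p_eth = pearson_chi2_2x2(71, 6, 240, 22)
print(f"ethnicity: chi2={chi2_eth:.3f}, p={p_eth:.3f}")
```

The three-category rows (transmission route, marital status) have two degrees of freedom and would need the general r × c form of the test, which this 2 × 2 shortcut does not cover.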