Abstract
We aimed to develop and validate machine learning models for early prediction of interstitial lung disease (ILD) in rheumatoid arthritis (RA) patients, and to identify key predictive biomarkers that may facilitate risk stratification and timely intervention.
We conducted a cross-sectional study enrolling 149 RA patients (84 with ILD, 65 without ILD) at the Department of Rheumatology, Xi'an Fifth Hospital, between January 2020 and December 2023. All patients met the 2010 ACR/EULAR classification criteria with disease duration ≥6 months. Patients with other connective tissue diseases, active pulmonary infections, malignancy, pregnancy, or severe organ dysfunction were excluded. All 149 patients underwent high-resolution computed tomography (HRCT) screening regardless of respiratory symptoms. ILD diagnosis was established by two experienced chest radiologists using a standardized scoring system (Cohen's kappa = 0.85). HRCT served exclusively as the reference standard and was not included as a predictive variable.
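The inter-reader agreement statistic reported above can be illustrated with a minimal sketch. It assumes the two radiologists' reads are reduced to a binary ILD / no-ILD label per patient, which the text does not state explicitly; the function name `cohens_kappa` and the example reads are hypothetical and are not the study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical reads (1 = ILD, 0 = no ILD); illustrative only.
reader_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
reader_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the readers agree on 8 of 10 cases while chance agreement is 0.5, so kappa is 0.6; the study's reported value of 0.85 indicates considerably closer agreement.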
We evaluated demographic characteristics, clinical parameters, inflammatory markers, hematological parameters, and four specific biomarkers: Krebs von den Lungen-6 (KL-6), interleukin-6 (IL-6), cytokeratin 19 fragment (CYFRA21-1), and carbohydrate antigen 15-3 (CA15-3). We developed and compared four machine learning models — XGBoost, Random Forest, Support Vector Machine (SVM), and Logistic Regression — for ILD prediction. Feature selection employed a three-stage approach combining univariate analysis with Bonferroni correction, LASSO regression with 10-fold cross-validation, and XGBoost built-in feature importance assessment, reducing the initial 32 variables to a final set of 10 key predictors. Dataset splitting (80% training, 20% testing), 10-fold cross-validation, and bootstrap resampling (1000 iterations) were used to assess model stability and prevent overfitting.
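The first stage of the feature selection described above, univariate screening with Bonferroni correction, can be sketched as follows. This is an illustration under stated assumptions: the family-wise alpha of 0.05 and the use of all 32 initial variables as the number of tests are assumptions, the p-values are illustrative, and `bonferroni_screen` is a hypothetical helper. The LASSO and XGBoost stages are not shown.

```python
def bonferroni_screen(p_values, n_tests, alpha=0.05):
    """Keep variables whose univariate p-value is below alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    threshold = alpha / n_tests
    kept = [name for name, p in p_values.items() if p < threshold]
    return kept, threshold

# Illustrative p-values for a few of the candidate variables (assumed values).
candidate_p = {
    "KL-6": 0.0001,
    "age": 0.0005,
    "smoking": 0.047,
    "LMR": 0.042,
}

# Assumption: the correction is applied across all 32 initial variables.
kept, threshold = bonferroni_screen(candidate_p, n_tests=32)
print(f"threshold = {threshold:.6f}, kept = {kept}")
```

With 32 tests the per-variable threshold falls to about 0.0016, so a variable that is nominally significant at P < 0.05 would not pass this stage on univariate evidence alone.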
The ILD group was significantly older (64.5 ± 9.8 vs 56.3 ± 11.2 years, P < 0.001), had longer disease duration (median 8.5 vs 5.2 years, P = 0.003), and higher smoking prevalence (22.6% vs 12.3%, P = 0.047). ILD patients demonstrated significantly elevated inflammatory markers, including ESR (42.5 vs 28.0 mm/h), CRP (12.8 vs 6.5 mg/L), RF (168.5 vs 89.2 IU/mL), and ACPA (245.8 vs 156.3 U/mL) (all P < 0.01). Hematological analysis revealed distinct immune dysregulation patterns in ILD patients, with elevated neutrophil-to-lymphocyte ratio (3.42 vs 2.58, P = 0.002) and platelet-to-lymphocyte ratio (168.5 vs 142.8, P = 0.015), alongside reduced lymphocyte-to-monocyte ratio (3.85 vs 4.28, P = 0.042).
Among specific biomarkers, KL-6 showed the most pronounced between-group difference (826.4 ± 458.2 vs 285.6 ± 124.8 U/mL, P < 0.001), a 2.9-fold elevation. IL-6 (15.8 vs 8.2 pg/mL), CYFRA21-1 (3.85 vs 2.46 ng/mL), and CA15-3 (18.6 vs 12.8 U/mL) were also significantly elevated in ILD patients (all P < 0.001). Importantly, KL-6 levels showed no meaningful correlation with systemic inflammatory activity (r = −0.074 vs CRP, r = −0.065 vs ESR in the ILD group), and KL-6 remained significantly elevated in ILD patients even in a low-inflammation subgroup (CRP ≤ 5 mg/L and ESR ≤ 20 mm/h; 593.8 vs 299.3 U/mL, P = 0.02), supporting its role as an inflammation-independent marker of pulmonary fibrotic processes.
The XGBoost model demonstrated superior predictive performance (AUC = 0.891, 95% CI: 0.847–0.935), significantly outperforming Random Forest (AUC = 0.876), SVM (AUC = 0.845), and Logistic Regression (AUC = 0.832). The XGBoost model achieved optimal sensitivity (0.867), specificity (0.882), and overall accuracy (0.872). Feature importance analysis identified KL-6 as the strongest predictor (importance score = 0.285, stability score = 0.92), followed by IL-6 (0.156, 0.88) and CYFRA21-1 (0.128, 0.85). Clinical parameters including age and disease duration also demonstrated strong predictive value.
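An AUC with a bootstrap confidence interval, as reported above, can be computed along the following lines. This is a minimal sketch assuming a percentile bootstrap over patients with 1000 resamples; the text does not say which bootstrap variant produced the reported interval. The functions `roc_auc` and `bootstrap_auc_ci` and the scores are hypothetical and are not the study data.

```python
import random

def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class, so AUC is undefined
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[round((alpha / 2) * n_boot)]
    upper = aucs[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical predicted ILD probabilities (1 = ILD, 0 = no ILD).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.7, 0.4, 0.3, 0.2, 0.6]
auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(f"AUC = {auc:.3f}, 95% CI: {ci_low:.3f} to {ci_high:.3f}")
```

The pairwise formulation is quadratic in sample size, which is acceptable for a cohort of this size; a rank-based implementation would be preferable for large datasets.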
Machine learning approaches, particularly XGBoost, show promise for early RA-ILD prediction, exceeding the performance of traditional prediction tools (typical AUC 0.70–0.80). Integrating KL-6 and the other identified biomarkers into clinical screening protocols may facilitate detection before irreversible lung damage occurs. The three key biomarkers, which reflect early inflammation (IL-6), epithelial injury (CYFRA21-1), and alveolar repair response (KL-6), may together capture different pathological stages of RA-ILD. This biomarker-based model is designed to complement HRCT by guiding selective screening toward high-risk individuals, optimizing resource utilization while maintaining HRCT's role as the definitive diagnostic tool. External validation in multi-center, diverse populations and longitudinal studies tracking biomarker trajectories are needed to confirm these findings and to establish the time window over which prediction is possible.
