Predictive divergence in machine learning models for clinical mortality risk: A multicohort study of COVID-19 patients
Background: Machine learning (ML) algorithms are increasingly used in healthcare to support clinical decision-making. Although models with similar overall performance are often treated as interchangeable for deployment, they may produce divergent predictions for the same patients, a phenomenon known as algorithmic multiplicity. In such cases, the choice of algorithm may itself introduce bias. This study investigates the impact of algorithmic multiplicity in mortality prediction and assesses the influence of patient characteristics on model decisions.

Methods: A cohort of 4,337 adult patients (≥18 years) with RT-PCR–confirmed COVID-19 from five tertiary care hospitals in Brazil was followed from March to August 2020. Five popular ML models for structured data were trained on demographic and laboratory data collected at early hospital admission to predict in-hospital mortality. Model performance, feature importance, and the similarity between algorithms' predictions were evaluated. Feature distributions were compared between patients correctly and incorrectly classified by all models using t-tests or Mann–Whitney U tests, as appropriate, at the 5% significance level. Subgroup performance differences were assessed with 10-fold cross-validation within five k-means–derived clusters and compared by one-way ANOVA. Within-cluster predictive divergence was assessed using 95% confidence intervals.

Results: All models achieved high overall predictive performance (mean µ = 0.855, variance σ² = 0.0072). However, comparison of individual-level predictions revealed substantial heterogeneity, with pairwise prediction correlations ranging from R² = 0.56 to R² = 0.80. Unsupervised k-means clustering identified five clinically distinct patient subgroups with mortality rates ranging from 22% to 80%, within which model performance varied significantly (F = 73.18, p < 0.001). Notably, TabPFN and LightGBM showed superior performance in the “Anemia” cluster, whereas TabPFN underperformed in the “Immunodeficient” cluster (based on 95% confidence intervals).

Conclusions: This study demonstrates that ML models with similar overall performance can yield substantially divergent predictions at both the individual and subgroup levels, and that no single algorithm consistently outperforms the others across all patient subgroups. These findings highlight the limitations of relying solely on global performance metrics and underscore the need for context-aware evaluation of ML models in heterogeneous clinical populations.
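To make the analysis pipeline described in Methods concrete, the sketch below reproduces its main steps on synthetic data: train several tabular classifiers, compare their individual-level predictions via pairwise R², and contrast per-subgroup discrimination across k-means clusters. This is a minimal illustration, not the study's code; the dataset dimensions, the scikit-learn models used as stand-ins for the paper's five algorithms (e.g., TabPFN, LightGBM), and the use of AUC as the performance metric are all assumptions.

```python
# Illustrative sketch of the multiplicity analysis, using synthetic data and
# scikit-learn stand-ins for the study's five models (assumptions throughout).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for demographic + laboratory features at admission.
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.7],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}
probs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    probs[name] = model.predict_proba(X_te)[:, 1]
    print(f"{name}: overall AUC = {roc_auc_score(y_te, probs[name]):.3f}")

# Individual-level multiplicity: squared Pearson correlation (R²) between
# each pair of models' predicted mortality probabilities.
names = list(probs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r2 = np.corrcoef(probs[names[i]], probs[names[j]])[0, 1] ** 2
        print(f"R²({names[i]}, {names[j]}) = {r2:.2f}")

# Subgroup analysis: k-means clusters on standardized features, then
# per-cluster AUC for each model (analogous to the paper's five clusters).
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X_te))
for c in np.unique(clusters):
    mask = clusters == c
    if len(np.unique(y_te[mask])) < 2:
        continue  # AUC is undefined when a cluster contains a single class
    aucs = {n: roc_auc_score(y_te[mask], p[mask]) for n, p in probs.items()}
    print(f"cluster {c}: " + ", ".join(f"{n}={a:.3f}" for n, a in aucs.items()))
```

Under the study's design, each per-cluster comparison would additionally be repeated across 10 cross-validation folds and tested with one-way ANOVA and 95% confidence intervals; the sketch omits that layer for brevity.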