Clinical prediction models for sepsis frequently degrade when applied outside the development setting. Electronic health record data encode not only patient physiology but also observation processes such as measurement timing and frequency, which may be predictive within a site but unstable across sites. The contribution of these observation-process features to cross-site performance degradation has not been quantified. In this retrospective cohort study, we developed models for in-hospital mortality in adult intensive care unit (ICU) patients meeting Sepsis-3 criteria using the Medical Information Mart for Intensive Care IV (MIMIC-IV) (n = 30,218; 16.3% mortality) and externally validated them in the eICU Collaborative Research Database (eICU-CRD) (n = 31,403; 13.9% mortality). We compared seven prespecified model specifications representing physiologic summary strategies (a single aggregate severity score, most recent values, extreme values, and within-window variability), each evaluated with and without measurement counts as observation-process features. Models were fit using logistic regression and gradient-boosted trees. Internally, discrimination improved with more detailed physiologic summaries and measurement counts (logistic regression area under the receiver operating characteristic curve [AUROC] from 0.819 to 0.834). In external validation, performance drops were larger for specifications using more complex physiologic representations. Adding measurement counts was associated with a larger external drop in discrimination (AUROC change in logistic regression, −0.047 without counts versus −0.082 with counts). External calibration deteriorated progressively, with calibration slopes decreasing from 1.007 for the simplest model to 0.417 for the most complex specification in logistic regression. Gradient-boosted trees showed smaller incremental degradation from measurement counts but still exhibited domain shift in complex specifications.
Inclusion of observation-process features in sepsis mortality prediction models was associated with improved internal discrimination but worse external calibration and transportability. These findings highlight that feature engineering decisions involve a tradeoff between internal performance and external generalizability, and that calibration assessment provides the most sensitive indicator of reduced transportability.
Author Summary
We asked whether the way physiologic data are summarized and how clinical measurements are recorded in electronic health records can affect how well prediction models perform at new hospitals. Using data from over 60,000 intensive care patients with sepsis across two large databases, we built models to predict in-hospital death using distinct physiologic summary strategies of varying complexity. First, we found that extending physiologic summaries from simple aggregate scores to include extreme values and variability ranges was associated with improved predictions within the hospital where the model was developed, but with poorer calibration and reduced transferability when the model was applied to other hospitals. Second, we found that further incorporating observation-process features, such as measurement frequency, was also associated with improved internal discrimination but with worse external calibration and transferability. These patterns likely arise because richer summaries and measurement patterns capture not only patient physiology but also facility-specific clinical workflows and care processes. Our results suggest that model developers should carefully consider whether the data features they use capture stable biological signals or hospital-specific practices, and that checking whether predicted risks remain well-calibrated across settings is the most important step before deploying a model in a new hospital.
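To make the summary strategies concrete, the following minimal sketch derives the kinds of features described above from one variable's measurements within an observation window. It is illustrative only: the function name, the input layout, and the use of the range (maximum minus minimum) as the variability summary are assumptions, and the window length is left to the caller because it is not stated here.

```python
def summarize_window(measurements):
    """Summarize one variable's measurements within an observation window.

    measurements: list of (time_hours, value) tuples (hypothetical layout).
    Returns the most recent value, the extremes, the range as a simple
    variability summary (an assumption), and the measurement count, which
    is the observation-process feature.
    """
    if not measurements:
        # No measurement taken: physiologic summaries are missing, count is 0.
        return {"last": None, "min": None, "max": None, "range": None, "count": 0}
    ordered = sorted(measurements, key=lambda m: m[0])  # sort by time
    values = [value for _, value in ordered]
    lowest, highest = min(values), max(values)
    return {
        "last": values[-1],         # most recent value
        "min": lowest,              # extreme values
        "max": highest,
        "range": highest - lowest,  # within-window variability (assumed form)
        "count": len(values),       # how often the variable was measured
    }


# Example: three heart-rate readings recorded out of time order.
heart_rate = [(0.5, 88), (6.0, 112), (3.0, 95)]
features = summarize_window(heart_rate)
```

The count depends on how often staff chose to measure, so it reflects local workflow as well as patient state, which is the property the study associates with reduced transferability.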
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Yes
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
All records in MIMIC-IV and eICU-CRD are de-identified in accordance with the Health Insurance Portability and Accountability Act Safe Harbor provisions. Because this study used only publicly available de-identified data, institutional review board approval and informed consent were not required. Access to both databases requires completion of the relevant data use agreements and human subjects research training.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes