Objective To develop a workflow that transforms electronic health record data into machine learning-ready features for molecular endotype assignment and to evaluate whether clinician-informed feature engineering improves model performance and interpretability.
Materials and Methods We developed parallel clinician-informed and clinician-agnostic feature engineering pipelines to prepare raw EHR data from mechanically ventilated patients with respiratory failure. Molecular endotype labels derived from paired deep lung and blood profiling of subjects with acute lung injury were used to train candidate machine learning classifiers. Champion models from each pipeline were compared on predefined performance metrics.
Results Bayesian network classifiers were the top-performing models in both pipelines. The clinician-informed pipeline generated fewer features than the clinician-agnostic pipeline (645 vs 1,127) and produced a lower misclassification rate in the final Bayesian network model (0.047 vs 0.14). In an independent cohort of subjects with acute lung injury, the clinician-informed model better distinguished corticosteroid-responsive from non-responsive subgroups.
Discussion Clinical context improved feature engineering efficiency, model interpretability, and classification performance. These findings support the integration of domain expertise into machine learning workflows intended for critical care implementation.
Conclusions Clinician-informed feature engineering can simplify machine learning models while improving performance and preserving clinical relevance. AI tools developed for healthcare should incorporate subject matter expertise early in the feature engineering and analytic workflow.
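The champion-model comparison summarized in the Results can be illustrated with a minimal sketch. It assumes the misclassification rate is the fraction of subjects whose predicted endotype differs from the assigned label, and that the champion is the candidate with the lowest rate. The function names, model names, and toy labels are hypothetical and are not taken from the study, which does not describe its implementation here.

```python
from typing import Sequence


def misclassification_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of subjects whose predicted endotype differs from the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled subject is required")
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true)


def pick_champion(rates: dict[str, float]) -> str:
    """Return the name of the candidate with the lowest misclassification rate."""
    return min(rates, key=rates.get)


# Toy, made-up endotype labels and predictions (NOT study data).
labels = ["A", "B", "A", "B", "A"]
candidates = {
    "model_x": ["A", "B", "A", "A", "A"],  # one prediction differs from the label
    "model_y": ["B", "B", "B", "A", "A"],  # three predictions differ
}

rates = {
    name: misclassification_rate(labels, preds)
    for name, preds in candidates.items()
}
champion = pick_champion(rates)
```

In practice the same comparison would be run on held-out subjects for each pipeline's candidate classifiers, so that the reported rates reflect performance on data not used for training.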
Competing Interest Statement The authors have declared no competing interest.
Funding Statement This project was supported by the North Carolina Collaboratory at The University of North Carolina at Chapel Hill with funding appropriated by the North Carolina General Assembly (V.V. and M.C.W.) and by the Rapidly Emerging Antiviral Drug Development Initiative at the University of North Carolina at Chapel Hill with funding from the North Carolina Coronavirus State and Local Fiscal Recovery Funds program, appropriated by the North Carolina General Assembly (J.C.S., R.S.H., and M.C.W.).
Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study was approved by the institutional review board at the University of North Carolina at Chapel Hill with a waiver of informed consent (IRB 22-3196, January 11, 2023).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes Conflict of Interest: The authors have declared that no conflict of interest exists. Key Words: Machine Learning, Clinical Prediction, ARDS, Critical Care
Data Availability All data produced in the present study are available upon reasonable request to the authors.