Importance: Researchers commonly use counts of diagnostic codes from electronic health record (EHR)-linked biobanks to infer phenotypic status. However, these approaches overlook temporal changes in EHR data, such as the discontinuation, or “dropout”, of diagnostic codes. Because EHR data quality can be confounded with demographic attributes, ignoring dropout may exacerbate disparities in genomics research.
Objective: To address this gap, we propose modeling diagnostic code dropout in EHR data to inform schizophrenia phenotyping in genomic analyses.
Design: We develop and test our diagnostic dropout model by analyzing EHR data from individuals with prior schizophrenia diagnoses. We further validate model performance on a subset of patients whose diagnoses were established through chart review. We first derive polygenic weights using PRS-CS and existing GWAS summary statistics. We then use the dropout model’s outputs to construct a data-driven filter that defines the target cohort in which polygenic score performance is measured.
Setting: Our analysis uses EHR and genomic data from the Million Veteran Program.
Participants: To model diagnostic dropout in schizophrenia, we use data from 12,739 patients with a history of schizophrenia, after excluding outliers. For polygenic score analyses, we draw on a potential pool of 8,385 European ancestry and 6,806 African ancestry patients with a history of schizophrenia.
Main outcomes and measures: We compare the performance of our diagnostic dropout model with alternative methodologies, both in predicting diagnostic dropout on a holdout set and on chart review-labeled data. Using the top differential diagnosis predictors in our model, we select relevant cases by filtering out patients with a prior history of mood or anxiety disorders. We then test how applying different filters affects measured polygenic score performance.
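As a rough illustration of the case-selection step described above, the sketch below drops patients who have a prior mood or anxiety disorder diagnosis. The record layout, the category labels, and the function name `apply_dropout_informed_filter` are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical category labels; the study's actual code groupings are not given here.
EXCLUDED_CATEGORIES = frozenset({"mood_disorder", "anxiety_disorder"})


@dataclass
class Patient:
    patient_id: str
    # Diagnosis categories recorded in the patient's history (illustrative only).
    prior_dx_categories: set = field(default_factory=set)


def apply_dropout_informed_filter(patients, excluded=EXCLUDED_CATEGORIES):
    """Keep only patients with no prior diagnosis in any excluded category."""
    return [p for p in patients if not (p.prior_dx_categories & excluded)]


# Toy cohort: every patient has a schizophrenia history, as in the source cohort.
cohort = [
    Patient("A", {"schizophrenia"}),
    Patient("B", {"schizophrenia", "mood_disorder"}),
    Patient("C", {"schizophrenia", "anxiety_disorder", "substance_use"}),
    Patient("D", {"schizophrenia", "substance_use"}),
]

kept = apply_dropout_informed_filter(cohort)
kept_ids = [p.patient_id for p in kept]
print(kept_ids)
```

Passing a different `excluded` set (for example, one containing only a substance use category) would give a comparison filter of the kind the abstract contrasts against.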
Results: When evaluated on chart review-labeled data, our model improves the area under the precision-recall curve (AUPRC) by 9.6% compared to competing methods. By applying our data-driven filter for schizophrenia, we achieve a 62% increase in the association effect size when transferring a European ancestry polygenic score to an African ancestry target cohort.
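For readers less familiar with the metric, the following minimal sketch computes a step-wise AUPRC (average precision) from labels and scores, plus a relative change between two models. It is an illustration on toy data only: the function names are hypothetical, ties between scores are not handled specially, and treating the reported improvement as a relative change is an assumption rather than something stated above.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision).

    y_true: list of 0/1 labels; scores: list of predicted scores.
    Tied scores are ranked in input order (a simplification).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            # Precision at the rank where recall increases.
            total += true_positives / rank
    return total / n_pos


def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline` (assumed reading of 'improves by')."""
    return (new - baseline) / baseline


# Toy labels and scores for two hypothetical models.
labels = [1, 0, 1, 1, 0, 0]
baseline_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
model_scores = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1]

ap_baseline = average_precision(labels, baseline_scores)
ap_model = average_precision(labels, model_scores)
print(ap_baseline, ap_model, relative_improvement(ap_model, ap_baseline))
```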
Conclusions and Relevance: These findings highlight the potential of modeling diagnostic code dropout to enhance the phenotypic quality of EHR-linked biobank data, advancing more equitable and accurate genomics research across diverse populations.
Question: Can we leverage temporal changes in electronic health record (EHR) data to improve schizophrenia case selection for genomic studies?
Findings: We trained an XGBoost model on EHR data from 12,739 patients to predict schizophrenia diagnostic code dropout in the Million Veteran Program. By excluding cases with conditions associated with diagnostic dropout, we achieved a 62% increase in association effect size when applying polygenic weights to an African ancestry target cohort. Filtering based on substance use, a common approach, yielded minimal gains.
Meaning: Modeling diagnostic code dropout enhances the phenotypic quality of EHR-linked biobank data and promotes equitable genomics research across diverse populations.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by awards #MVP076 (PR) and #MVP096 (DB). This publication does not represent the views of the Department of Veterans Affairs or the United States Government. This study was also supported by the National Institutes of Health (NIH), Bethesda, MD, under award numbers K08MH122911 (GV), R01AG078657 (GV), R01AG067025 (PR), R01AG082185 (PR), R01AG065582 (PR), and R01MH125246 (PR), and by the Veterans Affairs Merit grants BX006500 (DB) and BX004189 (PR).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Veterans Affairs (VA) Central Institutional Review Board gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability: GWAS summary statistics and polygenic score weights generated in this study will be shared through dbGaP upon publication.