CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Abstract

Abstract Objective To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by the Estonian Research Council (PRG1844). The study was funded by the European Union and co-funded by the Ministry of Education and Research (TEM-TA72). The European Union funded the project under its Horizon Europe research and innovation programme (grant agreement No 101060011, TeamPerMed) and co-funded the research through the European Regional Development Fund (Project No. 2021-2027.1.01.24-0444). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. This work was also supported by the Estonian Centre of Excellence in Artificial Intelligence (EXAI), funded by the Estonian Ministry of Education and Research grant TK213. This work was further supported by the OPTIMA project (grant agreement No. 101034347) through IMI2 Joint Undertaking supported by European Union's Horizon 2020 research and innovation programme and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study was approved by the Research Ethics Committee of the University of Tartu (300/T-23 and No. 330/T-10) and the Estonian Committee on Bioethics and Human Research (1.1-12/653) and the requirement for informed consent was waived.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

There are legal restrictions on sharing de-identified data. According to legislative regulation and data protection law in Estonia, the authors cannot publicly release the data received from the health data registers in Estonia. An interactive browser of the cell-suppressed, aggregate CohortContrast results (excluding the validation outputs) is available at: http://omop-apps.cloud.ut.ee/CohortContrast/ (accessed April 21, 2026).

Comments (0)

No login
gif