In recent years, progress in medical informatics and machine learning has been accelerated by the availability of openly accessible benchmark datasets. However, patient-level electronic medical record (EMR) data are rarely available for teaching or methodological development due to privacy, governance, and re-identification risks. This has limited reproducibility, transparency, and hands-on training in cardiovascular risk modelling. Here we introduce PRIME-CVD, a parametrically rendered informatics medical environment designed explicitly for medical education. PRIME-CVD comprises two openly accessible synthetic data assets representing a cohort of 50,000 adults undergoing primary prevention for cardiovascular disease. The datasets are generated entirely from a user-specified causal directed acyclic graph parameterised using publicly available Australian population statistics and published epidemiologic effect estimates, rather than from patient-level EMR data or trained generative models. Data Asset 1 provides a clean, analysis-ready cohort suitable for exploratory analysis, stratification, and survival modelling, while Data Asset 2 restructures the same cohort into a relational, EMR-style database with realistic structural and lexical heterogeneity. Together, these assets enable instruction in data cleaning, harmonisation, causal reasoning, and policy-relevant risk modelling without exposing sensitive information. Because all individuals and events are generated de novo, PRIME-CVD preserves realistic subgroup imbalance and risk gradients while ensuring negligible disclosure risk. PRIME-CVD is released under a Creative Commons Attribution 4.0 licence to support reproducible research and scalable medical education.
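As an illustration of the generation approach described above, the following minimal sketch samples a small cohort from a toy causal DAG by drawing each variable after its parents (topological order). It is not the PRIME-CVD generator: the node set, the function names (`simulate_person`, `simulate_cohort`), the 5-year follow-up, and every coefficient are hypothetical placeholders chosen for readability, not values taken from ABS, AIHW, or the published effect estimates the paper uses.

```python
import math
import random

FOLLOW_UP_YEARS = 5.0  # hypothetical administrative censoring time


def sigmoid(x):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_person(rng):
    """Draw one synthetic individual, sampling parents before children."""
    # Root nodes: no parents in this toy DAG (placeholder distributions).
    age = rng.uniform(30.0, 79.0)
    irsd_quintile = rng.randint(1, 5)  # 1 = most disadvantaged (assumed coding)

    # Smoking depends on IRSD quintile via a logistic model (made-up coefficients).
    p_smoke = sigmoid(-1.0 - 0.3 * (irsd_quintile - 3))
    smoker = rng.random() < p_smoke

    # Systolic blood pressure depends on age (made-up linear mean, fixed SD).
    sbp = rng.gauss(110.0 + 0.4 * age, 12.0)

    # Time to CVD event: exponential with a log-linear hazard in the parents.
    log_hazard = (
        math.log(0.005)
        + 0.05 * (age - 55.0)
        + 0.5 * float(smoker)
        + 0.015 * (sbp - 130.0)
    )
    event_time = rng.expovariate(math.exp(log_hazard))

    # Apply administrative censoring at the end of follow-up.
    had_event = event_time <= FOLLOW_UP_YEARS
    return {
        "age": age,
        "irsd_quintile": irsd_quintile,
        "smoker": smoker,
        "sbp": sbp,
        "cvd_event": had_event,
        "time_years": event_time if had_event else FOLLOW_UP_YEARS,
    }


def simulate_cohort(n, seed=0):
    """Generate n synthetic individuals reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [simulate_person(rng) for _ in range(n)]


if __name__ == "__main__":
    cohort = simulate_cohort(50_000, seed=1)
    events = sum(person["cvd_event"] for person in cohort)
    print(f"{events} CVD events among {len(cohort)} synthetic adults")
```

Because every value is drawn from specified distributions rather than from patient records, no row in such a cohort corresponds to a real person, which is the property the abstract relies on for negligible disclosure risk.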
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study uses only fully synthetic data generated de novo from publicly available aggregate statistics and published epidemiological studies that were accessible prior to study initiation. No individual-level or restricted-access datasets were used. Source data were obtained from openly accessible resources, including:
[1]: Australian Bureau of Statistics (ABS): https://www.abs.gov.au
[2]: Australian Institute of Health and Welfare (AIHW): https://www.aihw.gov.au
[3]: Peer-reviewed epidemiological studies indexed via PubMed and other open-access repositories
These sources provided population-level distributions and effect estimates (e.g., socioeconomic distributions such as IRSD) used to parameterise the simulation model. For full transparency, all source references are detailed in Table 3 (page 11) of the submitted manuscript.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Abbreviations
ABS: Australian Bureau of Statistics
AF: Atrial Fibrillation
AIHW: Australian Institute of Health and Welfare
BMI: Body Mass Index
CKD: Chronic Kidney Disease
CKM: Cardio–Kidney–Metabolic (domain including CVD, diabetes, and CKD)
CSV: Comma Separated Value
CVD: Cardiovascular Disease
DAG: Directed Acyclic Graph
DDPM: Denoising Diffusion Probabilistic Models
eGFR: Estimated Glomerular Filtration Rate
EMR: Electronic Medical Record
GAN: Generative Adversarial Network
HbA1c: Glycated Haemoglobin A1c
HR: Hazard Ratio
IRSD: Index of Relative Socioeconomic Disadvantage
NHS: National Health Service
NSW: New South Wales
OR: Odds Ratio
PRIME-CVD: Parametrically Rendered Informatics Medical Environment for Cardiovascular Disease
SBP: Systolic Blood Pressure
T2DM: Type 2 Diabetes Mellitus