Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks

Abstract

Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.
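For readers unfamiliar with two of the metrics named above, the following is a minimal sketch of how macro-F1 and field-level exact match are commonly computed. It is illustrative only: the function names, the example labels and fields, and the whitespace and case normalization are assumptions, and the abstract does not state the scoring conventions actually used in the evaluation.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores (each label counts equally)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A label with no gold and no predicted instances contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def field_exact_match(gold_records, pred_records):
    """Fraction of gold fields whose predicted value matches exactly.

    Values are compared after trimming whitespace and lowercasing; this
    normalization is an assumption, not taken from the paper.
    """
    total = 0
    correct = 0
    for gold, pred in zip(gold_records, pred_records):
        for field, value in gold.items():
            total += 1
            predicted = str(pred.get(field, "")).strip().lower()
            if predicted == str(value).strip().lower():
                correct += 1
    return correct / total if total else 0.0


# Hypothetical toy data, not drawn from the study.
gold_labels = ["yes", "no", "maybe", "yes"]
pred_labels = ["yes", "no", "yes", "yes"]
print(macro_f1(gold_labels, pred_labels, ["yes", "no", "maybe"]))

gold_fields = [{"grade": "G2", "margin": "negative"}]
pred_fields = [{"grade": "g2 ", "margin": "positive"}]
print(field_exact_match(gold_fields, pred_fields))
```

Because macro-F1 averages over labels rather than examples, a rare label such as "maybe" weighs as much as a frequent one, which is why it is often preferred for imbalanced question-answering sets.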

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study used only publicly available human data and publicly available derived datasets that were accessible before the study began. No non-public, institution-restricted, or application-gated data sources were used. The sources were located in open public repositories including TCGA, CIViC, MOAlmanac, and openly accessible Hugging Face datasets used for trial screening, clinical reasoning, and pathology-report extraction.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

medRxiv - Oncology
