Adaptive, Privacy-Preserving Small Language Models for Multi-Task Clinical Assistance

Data

Open-i Indiana University Chest X-Ray Dataset

We used the deidentified, publicly available chest X-ray (CXR) data from Indiana University (https://openi.nlm.nih.gov) [18]. This dataset consisted of 3665 CXR reports annotated by three board-certified radiologists for the presence of thirteen disease labels defined in the CheXpert study [19]. Six reports lacked impressions and were removed, resulting in 3659 reports. The 13 disease labels are enlarged cardiomediastinum, cardiomegaly, lung lesion, lung opacity, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, other pleural, fracture, and support devices.

Brain MRI DICOM Metadata Dataset

This dataset consisted of DICOM metadata obtained from 2510 MRI brain examinations performed without contrast from January 2016 to November 2022 for the evaluation of acute stroke [20, 21]. The dataset contained 1395 unique free-text DICOM series descriptions (DICOM tag (0008,103E)), which are free-text labels for the imaging sequence type performed, set at the time of imaging by the manufacturer, imaging protocol, or MRI technologist. The data were classified as non-human subjects research by the IRB. Series descriptions were categorized by a practicing neuroradiologist (PK) into the following MRI sequence classifications: T1, T2, T2/FLAIR, DWI, ADC, and SWI, with an additional classification of “Other” for ambiguous labels or sequences that did not fall into one of these categories.

Tasks

We selected three clinically relevant tasks that reflect common radiology workflows, with the specific goal of evaluating the breadth of SLM capabilities. These tasks were designed to assess performance across distinct categories of functionality: multi-label classification (diagnostic labeling), terminology mapping (metadata standardization), and free-text generation (impression summarization). All tasks were defined in a model-agnostic manner to ensure fair and consistent evaluation across both SLMs and LLMs.

Task 1: Medical Report Labeling

In this classification task, models are provided the full CXR report and asked to identify the presence or absence of thirteen predefined radiographic findings. The model’s output is structured as a JavaScript Object Notation (JSON) dictionary assigning “Yes” or “No” to each label. If no finding is present, the model should also output “No Finding: Yes.” This task evaluates a model’s ability to understand and structure findings within free-text medical narratives. An example is shown in Fig. 2. Because each report may contain multiple findings, this task follows a multi-label classification setup, which is commonly used to extract multiple clinical concepts from a single report.

Fig. 2

Example input of medical report labeling experiment with SLM prompt
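
The structured output described for Task 1 can be checked with a small parser. The following is a minimal sketch, not the authors' implementation: the exact key spelling, the fallback of treating missing or malformed labels as "No", and deriving "No Finding" from the other thirteen labels are all assumptions made for illustration.

```python
import json

# The thirteen labels named in the Data section. Key casing is an assumption.
LABELS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Lesion", "Lung Opacity",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Other Pleural", "Fracture", "Support Devices",
]


def parse_labels(raw_output: str) -> dict:
    """Parse a model's JSON output into a complete Yes/No dictionary.

    Missing or malformed entries default to "No" (an assumption of this sketch).
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    result = {
        label: "Yes" if parsed.get(label) == "Yes" else "No" for label in LABELS
    }
    # "No Finding" is "Yes" only when all thirteen labels are "No".
    no_finding = all(value == "No" for value in result.values())
    result["No Finding"] = "Yes" if no_finding else "No"
    return result


labels = parse_labels('{"Cardiomegaly": "Yes", "Edema": "No"}')
```

Normalizing every output to the same fourteen keys keeps the per-label scoring well defined even when a model omits a label.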

Task 2: DICOM Metadata Harmonization

In this standardization task, models are provided with a free-text DICOM series description string and instructed to map it to a standard MRI sequence label. The output is expected to be a key-value JSON pair mapping the raw input to one of seven classes. An example is shown in Fig. 3.

Fig. 3

Example input of DICOM metadata harmonization experiment with SLM prompt
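
The key-value output of Task 2 can be validated against the seven classes in a similar way. This is a sketch under stated assumptions: the example series description is hypothetical, and falling back to "Other" for unparseable or out-of-vocabulary output is a choice made here, not something the source specifies.

```python
import json

# The seven MRI sequence classes defined in the Data section.
SEQUENCE_CLASSES = {"T1", "T2", "T2/FLAIR", "DWI", "ADC", "SWI", "Other"}


def parse_harmonization(raw_output: str, series_description: str) -> str:
    """Return the predicted class for one series description.

    Falls back to "Other" when the output is not valid JSON, lacks the
    expected key, or names a class outside the seven allowed values.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return "Other"
    if not isinstance(parsed, dict):
        return "Other"
    label = parsed.get(series_description)
    if isinstance(label, str) and label in SEQUENCE_CLASSES:
        return label
    return "Other"


# "AX DWI B1000" is a hypothetical series description used for illustration.
predicted = parse_harmonization('{"AX DWI B1000": "DWI"}', "AX DWI B1000")
```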

Task 3: Impression Generation

This generation task asks models to reconstruct the Impression section of a radiology report using the provided reports without the IMPRESSION section. The model’s output should be free text beginning with “IMPRESSION:” and should summarize the key clinical conclusions. This task evaluates the model’s ability to synthesize and abstract clinical information. An example is shown in Fig. 4.

Fig. 4

Example input of impression generation from findings experiment

Models Trained

SLMs

We tested several SLMs as the backbone of this framework: OPT-350m [17], Phi-4-mini (4B) [22], Llama-3.2-1B [23], Mistral-7B [24], and Qwen3-4B [25]. They vary in size from 350 million to 7 billion parameters. We hypothesized that OPT-350m would be the best of these SLMs for multi-task clinical decision support.

To fine-tune the models efficiently, we used LoRA [16], a parameter-efficient strategy designed to reduce the computational cost of adapting transformer models to new tasks. LoRA freezes the original model parameters and introduces a small number of trainable weights in the form of low-rank matrices, typically inserted into the attention layers. Specifically, we applied LoRA adapters to the query and value projection matrices within the model’s attention mechanism. This approach significantly reduces the number of trainable parameters while maintaining or improving performance across downstream tasks. LoRA has been successfully applied to several domains, including medical question answering, health outcome prediction, and out-of-distribution detection [26,27,28].
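
To make the parameter saving concrete: for a frozen weight matrix of shape d_out × d_in, LoRA learns an update of the form BA, where B is d_out × r and A is r × d_in, so each adapted matrix adds r(d_out + d_in) trainable parameters. The sketch below compares this with fully fine-tuning the same query and value projections. The hidden size, layer count, and rank are illustrative assumptions and are not taken from the source; bias terms are ignored.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only (assumptions, not reported in the text).
HIDDEN_SIZE = 1024      # width of each attention projection
NUM_LAYERS = 24         # number of transformer layers
RANK = 8                # LoRA rank r
ADAPTED_PER_LAYER = 2   # query and value projections, as described above

full_params = HIDDEN_SIZE * HIDDEN_SIZE * ADAPTED_PER_LAYER * NUM_LAYERS
lora_params = (
    lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
    * ADAPTED_PER_LAYER
    * NUM_LAYERS
)

print(f"full fine-tuning of q/v projections: {full_params:,} parameters")
print(f"LoRA adapters on q/v projections:    {lora_params:,} parameters")
print(f"fraction trained with LoRA:          {lora_params / full_params:.4f}")
```

Under these assumed dimensions, the adapters account for well under 2% of the parameters in the projections they modify.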

In this work, we trained the following SLMs:

Single-task SLMs (OPT-350m)

SLM labeling: Fine-tuned to predict 13 diagnostic labels from CXR reports (Task 1).

SLM harmonization: Fine-tuned to map DICOM series descriptions to standardized sequence labels (Task 2).

SLM impression: Fine-tuned to generate radiology impressions from report findings (Task 3).

Multi-task SLM (OPT-350m): To test our hypothesis that a single SLM could perform all tasks, we fine-tuned one model on the combined dataset from all three tasks. Each input was prepended with a task-specific instruction string, enabling instruction-style fine-tuning. No additional task embeddings were used.

Remaining SLMs: To benchmark the multi-task SLM (OPT-350m), we fine-tuned Phi-4-mini (4B), Llama-3.2-1B, Mistral-7B, and Qwen3-4B on the same combined dataset from all three tasks that was used for the multi-task SLM (OPT-350m).

Comparative LLM Configurations with GPT-4o

We evaluated GPT-4o across all tasks without fine-tuning. Instead, we applied the following:

Zero-shot prompting: The model is given only the task instruction without additional information. Prompting was kept consistent with SLM instructions to enable fair evaluation.

Prompt engineering: Task instructions are supplemented with structured guidance, such as output format expectations and label definitions.

Experimental Setup

All SLMs were fine-tuned using supervised learning on an NVIDIA A40 GPU, although the resulting models can run efficiently on typical computers. The same hyperparameters were applied across all experiments: 100 training epochs, a batch size of 4, and a learning rate of 0.0008. These hyperparameters were determined through empirical tuning in preliminary experiments. An 80:20 train–test split was used for all datasets.
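
The reported configuration and split can be sketched as follows. The hyperparameter values come from the text; the random shuffle, the fixed seed, and the absence of stratification are assumptions, since the source does not state how the split was drawn.

```python
import random

# Hyperparameters reported in the text.
HYPERPARAMS = {"epochs": 100, "batch_size": 4, "learning_rate": 8e-4}


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split them into train and test lists.

    The seed and the simple random (non-stratified) split are assumptions.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Indices standing in for the 3659 CXR reports.
train_ids, test_ids = train_test_split(range(3659))
print(len(train_ids), len(test_ids))
```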

Inputs to the models followed an instruction-tuned format, in which a short directive (e.g., “Report CXR diagnosis”) preceded the main input, reflecting current best practices in instruction-following fine-tuning for NLP.
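
A minimal sketch of how such instruction-prefixed examples could be assembled into one combined multi-task dataset is given below. Only the directive "Report CXR diagnosis" appears in the text; the other two directives, the newline separator, and the prompt/completion field names are hypothetical.

```python
# Only the first directive is quoted in the text; the others are hypothetical.
TASK_INSTRUCTIONS = {
    "labeling": "Report CXR diagnosis",
    "harmonization": "Harmonize MRI series description",  # hypothetical
    "impression": "Generate impression",                   # hypothetical
}


def build_example(task: str, model_input: str, target: str) -> dict:
    """Prepend the task directive to the input and pair it with the target text."""
    prompt = f"{TASK_INSTRUCTIONS[task]}\n{model_input}\n"
    return {"prompt": prompt, "completion": target}


def build_multitask_dataset(per_task_examples: dict) -> list:
    """Concatenate the examples of every task into one training list."""
    combined = []
    for task, pairs in per_task_examples.items():
        combined.extend(build_example(task, x, y) for x, y in pairs)
    return combined


# Toy inputs for illustration only.
dataset = build_multitask_dataset({
    "labeling": [("FINDINGS: Heart size is enlarged.", '{"Cardiomegaly": "Yes"}')],
    "harmonization": [("AX DWI B1000", '{"AX DWI B1000": "DWI"}')],
})
```

Because the task is identified only by the directive in the prompt, no separate task embedding is needed, which matches the setup described for the multi-task SLM.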

Evaluation Metrics

Task 1 was evaluated using the F1 score, the harmonic mean of precision and recall, which balances false positives against false negatives. We assessed the pairwise performance difference on this task between the multi-task OPT-350m and the other SLMs using the McNemar test for each disease label.
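
For a single disease label, the F1 score can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below assumes "Yes"/"No" string labels as in the Task 1 output format; how per-label scores were aggregated is not stated in the source, so no averaging is shown.

```python
def f1_score(y_true, y_pred, positive="Yes"):
    """F1 for one label: 2*TP / (2*TP + FP + FN); 0.0 when there are no positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy reference and predicted values for one label across four reports.
reference = ["Yes", "Yes", "No", "No"]
predicted = ["Yes", "No", "Yes", "No"]
score = f1_score(reference, predicted)
```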

Task 2 was evaluated using overall classification accuracy, which reflects the proportion of series descriptions correctly matched to their canonical MRI sequence type. We assessed the pairwise performance difference on this task between the multi-task OPT-350m and the other SLMs using the McNemar test.
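
The McNemar test compares two models on the same test items by looking only at the discordant pairs, that is, items one model classifies correctly and the other does not. The sketch below uses the exact (binomial) form of the test; the source does not state whether the exact or the chi-square variant was used, so this choice is an assumption.

```python
from math import comb


def accuracy(y_true, y_pred):
    """Proportion of items whose prediction matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for paired predictions of two models."""
    triples = list(zip(y_true, pred_a, pred_b))
    # b: model A correct and model B wrong; c: model A wrong and model B correct.
    b = sum(a == t and m != t for t, a, m in triples)
    c = sum(a != t and m == t for t, a, m in triples)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: model A is right on all ten items, model B is wrong on six.
truth = ["DWI"] * 10
model_a = ["DWI"] * 10
model_b = ["ADC"] * 6 + ["DWI"] * 4
p_value = mcnemar_exact(truth, model_a, model_b)
```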

Task 3 was evaluated using both automatic and expert evaluation methods. Automatic evaluation used BLEU (Bilingual Evaluation Understudy) [29] and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [30] scores. BLEU is a precision-based metric that measures n-gram overlap between the generated and reference impressions, whereas ROUGE is a recall-based metric that measures how much of the reference content is reproduced by the model. We assessed the pairwise performance difference between the multi-task OPT-350m and the other SLMs using the t-test. However, BLEU and ROUGE do not always capture clinical correctness or semantic appropriateness in radiology. To address this limitation, we conducted an expert evaluation using a blinded human reader study.
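
The precision/recall distinction between the two metrics can be illustrated with unigram overlap alone. This is a deliberately simplified sketch: full BLEU combines several n-gram orders and applies a brevity penalty, and the ROUGE variant used in the study is not stated, so the function below should not be read as the scoring procedure actually applied. Whitespace tokenization and lowercasing are also assumptions.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str):
    """Return (precision, recall) of clipped unigram overlap.

    Precision is the BLEU-style view (share of generated words found in the
    reference); recall is the ROUGE-style view (share of reference words
    reproduced by the generated text).
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection clips each word to its minimum count in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall


# Toy impressions for illustration only.
precision, recall = unigram_overlap(
    "no acute disease", "no acute cardiopulmonary disease"
)
```

In this toy case every generated word appears in the reference, so precision is perfect, while recall is lower because one reference word is missing. Neither number says whether the omission matters clinically, which is the limitation the reader study addresses.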

We randomly selected 50 impressions generated by the SLM and 50 generated by GPT-4o. The corresponding Impression sections were removed from the original reports, and only the Indication and Findings were retained. Two board-certified radiologists (PK and JP) rated each generated impression on a 5-point Likert scale, where 5 represents a clear and concise IMPRESSIONS section that accurately reflects the findings and 1 represents an unacceptable IMPRESSIONS section with missing, inaccurate, or misleading impressions. The radiologists were blinded to the model source, and the order of impressions was randomized to minimize bias. The mean scores and score distributions per model were recorded to quantify performance. To evaluate whether the models produced significantly different impression quality, we performed an independent two-sample (unpaired) two-tailed t-test comparing the average scores assigned to the SLM- and GPT-4o–generated impressions. We computed Cohen’s kappa between the two radiologists’ scores to assess interrater reliability. Given the ordinal nature of the Likert scale, we used quadratic weighting to appropriately penalize larger rating disagreements.
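
Quadratically weighted Cohen's kappa penalizes a disagreement between ratings i and j in proportion to (i − j)², so a 1-versus-5 disagreement counts far more than a 4-versus-5 disagreement. A minimal sketch for a 1 to 5 Likert scale follows; the toy ratings are invented for illustration, and returning 1.0 when the expected disagreement is zero is a convention chosen here.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating=1, max_rating=5):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    k = max_rating - min_rating + 1
    n = len(rater_a)
    # Observed contingency table of rating pairs.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # chance agreement from marginals
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        return 1.0  # both raters used one identical rating throughout
    return 1.0 - weighted_observed / weighted_expected


# Toy ratings for illustration only.
kappa_same = quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
kappa_reversed = quadratic_weighted_kappa([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
```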

Statistical significance was defined as p < 0.05 for all statistical tests.

Model Fine-Tuning Trajectory Ablation

To empirically validate the optimal number of fine-tuning epochs for Phi, Llama, Mistral, and Qwen, all models were evaluated at 2, 5, 20, and 100 epochs on all three tasks: medical report labeling, DICOM series description harmonization, and impression generation from findings.

Multi-Task Robustness Ablation

To assess the robustness of the multi-task SLM against memorization and task-to-task influence, a data poisoning ablation study was performed. We chose to poison the medical report labeling task by setting the disease labels of a percentage of patients to negative, thereby artificially increasing the false-positive rate. We used poisoning percentages of 25%, 50%, 75%, and 100%. Labels and ground truths for DICOM series description harmonization and impression generation from findings were left untouched. Combining these data, we trained four new multi-task SLMs, one for each level of label poisoning. If the performance for DICOM series description harmonization and impression generation from findings remains the same and only the performance for medical report labeling decreases as the percentage of poisoned labels increases, then we can conclude that there is no task-to-task influence and that memorization is minimal.
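
The poisoning step can be sketched as below, assuming each labeling record is a dictionary of disease labels with "Yes"/"No" values. The random selection of which reports to poison, the fixed seed, and the record layout are assumptions; the source specifies only the percentages and that the selected labels were set to negative.

```python
import random


def poison_labels(records, fraction, seed=0):
    """Return a copy of the records with a fraction of reports set to all-negative.

    Only the labeling-task records are passed in; harmonization and impression
    data are left untouched by construction.
    """
    rng = random.Random(seed)
    n_poison = round(len(records) * fraction)
    poisoned_ids = set(rng.sample(range(len(records)), n_poison))
    poisoned = []
    for index, labels in enumerate(records):
        if index in poisoned_ids:
            poisoned.append({name: "No" for name in labels})
        else:
            poisoned.append(dict(labels))
    return poisoned


# Toy records for illustration only: eight reports, each positive for two labels.
clean = [{"Edema": "Yes", "Pneumonia": "Yes"} for _ in range(8)]
poisoned_sets = {
    fraction: poison_labels(clean, fraction) for fraction in (0.25, 0.5, 0.75, 1.0)
}
```

Each poisoned labeling set would then be combined with the unmodified data of the other two tasks to train one of the four ablation models.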
