A calibration-aware hierarchical CNN-SWIN fusion framework for robust Cross-Dataset brain MRI analysis

Abstract

Introduction:

Deep learning approaches have become central to brain MRI analysis; however, their reliability under dataset shift remains a critical barrier to safe and scalable deployment in neuroscience and clinical research. While convolutional neural networks (CNNs) provide strong locality-driven inductive biases for robust feature extraction, they lack global contextual awareness. Conversely, transformer-based architectures capture long-range dependencies but often exhibit reduced robustness and miscalibrated confidence when applied to heterogeneous medical imaging data, particularly in Cross-Dataset settings.

Methods:

In this work, we propose a calibration-aware hierarchical CNN-Transformer fusion framework designed for robust brain MRI analysis under dataset shift. The architecture integrates a pretrained multi-scale CNN backbone with a hierarchical transformer branch and performs scale-aligned fusion through cross-attention mechanisms. By allowing local convolutional features to selectively query global contextual representations, the proposed design maintains stable feature contributions during fusion and mitigates overconfident reliance on transformer features when generalization degrades across datasets. The framework is evaluated using a strict Cross-Dataset protocol, where models are trained on one dataset and tested on a distinct dataset.

Results:

Experimental results demonstrate that the proposed fusion model achieves competitive classification performance while substantially improving probabilistic calibration relative to both CNN-only and transformer-only baselines. Specifically, the model attains an average accuracy of 99.20% and achieves lower Expected Calibration Error (ECE = 0.0041), Brier score (0.0028), and Negative Log-Likelihood (NLL = 0.0277) compared to a standalone Swin Transformer and a strong ResNet50 baseline.

Discussion:

These findings demonstrate that calibration-aware hierarchical CNN-Transformer fusion enhances both predictive reliability and robustness under Cross-Dataset evaluation. By improving the alignment between predictive confidence and empirical correctness, the proposed method supports safer large-scale analysis of heterogeneous brain MRI data, with important implications for multi-center neuroscience studies and trustworthy clinical decision support.

1 Introduction

Automated analysis of brain magnetic resonance imaging (MRI) plays a central role in modern neuroscience research and clinical workflows, supporting tasks such as disease characterization, population-level studies, and diagnostic decision support. In recent years, deep learning has become the dominant paradigm for brain MRI analysis, largely driven by convolutional neural networks (CNNs) that demonstrate strong inductive biases for capturing local anatomical structures, texture patterns, and spatially coherent features (Litjens et al., 2017; Shen et al., 2017). Architectures such as ResNet (He et al., 2016) have shown notable robustness on limited and heterogeneous medical imaging datasets and continue to serve as strong baselines for a wide range of brain MRI analysis tasks.

Despite their effectiveness, CNNs rely on fixed local receptive fields, which limits their ability to model long-range spatial dependencies and global contextual relationships that may be important for interpreting distributed or diffuse brain abnormalities. To address this limitation, transformer-based architectures have been introduced to computer vision and medical imaging. Vision Transformers and hierarchical variants such as the Swin Transformer (Liu et al., 2021) employ self-attention mechanisms to capture long-range contextual information and have demonstrated promising results across several medical image analysis applications (Zhang et al., 2022). Hybrid CNN-Transformer architectures, including UNETR and SwinUNETR, further explore the integration of convolutional locality with attention-based global modeling, particularly in segmentation-oriented frameworks (Hatamizadeh et al., 2022a,b).

However, accumulating evidence indicates that transformer-based models exhibit important limitations in medical imaging scenarios characterized by limited training data, heterogeneous acquisition protocols, and dataset shift. Prior studies report that transformers may underperform strong CNN baselines in terms of robustness and generalization when evaluated on unseen datasets (Raghu et al., 2021; Li et al., 2022). Moreover, transformer architectures are often prone to overconfident predictions, leading to miscalibrated probability estimates under distribution shift (Ovadia et al., 2019; Müller and Smith, 2022). In brain MRI analysis, such miscalibration is not merely a technical concern: overconfident incorrect predictions can obscure uncertainty and potentially undermine clinical safety by influencing diagnostic interpretation, treatment planning, or downstream research conclusions.

To leverage the complementary strengths of CNNs and transformers, a growing body of work has proposed hybrid fusion architectures that combine convolutional feature extraction with attention-based global modeling (Wang G. et al., 2021; Heidari et al., 2023). While these approaches enhance representational capacity, most are primarily designed to optimize in-domain performance and are typically evaluated under within-dataset validation protocols. In addition, prior CNN-Transformer fusion studies largely focus on segmentation rather than classification and rarely examine how architectural fusion choices affect probabilistic calibration under distribution shift. Consequently, it remains unclear whether existing fusion mechanisms improve confidence reliability when models are deployed across scanners, institutions, and heterogeneous patient populations.

Reliable confidence estimation is particularly critical in safety-aware brain MRI analysis. Under dataset shift, deep neural networks may maintain high accuracy while exhibiting degraded calibration, resulting in a mismatch between predicted confidence and empirical correctness. Such confidence–accuracy gaps can obscure model uncertainty and reduce trustworthiness in clinical or multi-center research settings. Addressing this issue requires not only evaluating accuracy under Cross-Dataset conditions but also explicitly analyzing calibration behavior as a first-class objective.

In this work, we propose a calibration-aware hierarchical CNN-Transformer fusion framework designed explicitly to improve confidence reliability under Cross-Dataset distribution shift. The proposed approach integrates a pretrained multi-scale CNN backbone with a hierarchical transformer branch and performs scale-aligned fusion through cross-attention mechanisms. Unlike naive concatenation or additive aggregation strategies, our design allows convolutional features to act as structured queries over transformer representations, preserving locality-driven robustness while adaptively modulating global contextual contributions through learnable fusion weights. This asymmetric interaction stabilizes feature integration when the transformer branch generalizes less reliably across datasets.

Crucially, we evaluate the proposed framework under a strict train-on-one-dataset, test-on-another protocol, without post-hoc temperature scaling, ensembling, or target-domain adaptation. This setting isolates the architectural effect of hierarchical fusion on probabilistic calibration and provides a realistic assessment of deployment-relevant robustness. To our knowledge, this is among the first studies to explicitly examine hierarchical CNN-Transformer fusion from a calibration perspective under strict Cross-Dataset evaluation for brain MRI classification.

1.1 Contributions

We propose a hierarchical multi-scale CNN-Transformer fusion architecture that integrates local convolutional features with global transformer representations through scale-aligned cross-attention, explicitly designed for robust brain MRI analysis under dataset shift.

We demonstrate that structured cross-attention fusion alone—without post-hoc temperature scaling or ensembling—can materially improve probabilistic reliability, as measured by Expected Calibration Error (ECE), Brier score, and Negative Log-Likelihood (NLL), under strict Cross-Dataset evaluation.

We conduct extensive Cross-Dataset experiments on two independent brain MRI datasets using a train-on-one, test-on-another protocol, showing that the proposed method achieves performance comparable to or better than strong CNN baselines while consistently improving confidence calibration relative to both CNN-only and transformer-only models.

We provide a comprehensive evaluation highlighting the importance of calibration-aware architectural design for reliable brain MRI analysis, with direct implications for multi-center neuroscience research and safety-critical clinical decision support systems.

2 Related work

2.1 CNN-based brain MRI analysis

Convolutional neural networks (CNNs) remain a strong baseline for brain MRI analysis due to their locality-driven inductive biases, which effectively capture texture, boundaries, and anatomical structure. Transfer learning with ImageNet-pretrained backbones such as ResNet (He et al., 2016) is widely used to stabilize optimization and improve generalization, often yielding strong in-domain performance on curated medical datasets (Litjens et al., 2017; Shen et al., 2017).

However, CNN performance can degrade under acquisition variability arising from differences in scanner vendors, imaging protocols, patient populations, and dataset curation strategies. Such distribution shifts expose limitations of within-dataset validation protocols and motivate explicit Cross-Dataset evaluation. Recent systematic reviews emphasize that robustness to dataset shift remains a primary barrier to safe clinical deployment (Matta et al., 2024).

2.2 Vision Transformers and Swin-style models in medical imaging

Vision Transformers (ViT) introduced global self-attention as an alternative to convolutional inductive bias, demonstrating strong scaling behavior in large-data regimes (Dosovitskiy et al., 2021). Data-efficient variants such as DeiT improved performance in limited-data settings through distillation and augmentation (Touvron et al., 2021). The Swin Transformer further extended this paradigm by incorporating hierarchical feature extraction and shifted-window attention, enabling multi-scale contextual modeling with improved computational efficiency (Liu et al., 2021).

In medical imaging, transformer encoders and Swin-style architectures are frequently adopted to capture long-range spatial dependencies, particularly in anatomically complex settings. Hybrid CNN-Transformer designs such as TransUNet and UNETR integrate convolutional feature extraction with attention-based global modeling (Chen et al., 2021; Hatamizadeh et al., 2021). While these architectures demonstrate strong representational capacity, most are developed for segmentation tasks and evaluated primarily under in-domain settings.

2.3 CNN-Transformer fusion mechanisms

Fusion architectures aim to combine CNN locality with transformer-based global context through strategies such as feature concatenation, additive aggregation, learned re-weighting, or cross-attention between streams. Cross-attention is particularly appealing because it enables structured interaction between representations, allowing one pathway to query complementary information from another.

Despite these advances, multi-scale fusion remains challenging. Misaligned spatial resolutions, channel mismatches, and unconstrained aggregation weights can destabilize optimization or degrade strong CNN baselines when transformer features generalize poorly under distribution shift. Consequently, effective fusion for brain MRI analysis requires careful feature alignment and controlled interaction rather than naive combination.

2.4 Calibration and reliability under dataset shift

For safety-critical applications in neuroscience and clinical decision support, predictive confidence must be reliable in addition to accurate. Modern neural networks are often miscalibrated, motivating explicit evaluation using metrics such as Expected Calibration Error (ECE), Brier score, and Negative Log-Likelihood (NLL) (Guo et al., 2017). Importantly, calibration can deteriorate substantially under dataset shift even when accuracy remains high, resulting in overconfident incorrect predictions (Ovadia et al., 2019).

Existing calibration strategies commonly rely on post-hoc techniques such as temperature scaling, Platt scaling, ensembling, or Monte Carlo dropout. Although these methods can improve confidence alignment, they operate independently of architectural design and do not directly address the feature interaction mechanisms that may contribute to miscalibration. Recent work further highlights the need for robustness-aware evaluation protocols that assess reliability under deployment-like conditions (Park et al., 2021).

2.5 Gap analysis

While CNN-Transformer fusion has been widely explored to enhance representational capacity and in-domain performance—particularly in segmentation—its impact on probabilistic calibration under strict Cross-Dataset evaluation remains insufficiently examined for brain MRI classification. Most prior studies report within-dataset accuracy and do not analyze how fusion design influences confidence behavior when models are applied to unseen scanners or institutions.

Addressing this gap requires (i) evaluating fusion mechanisms under train-on-one, test-on-another protocols and (ii) explicitly measuring calibration alongside accuracy. This need motivates the calibration-aware hierarchical CNN-Transformer fusion framework proposed in this work.

3 Proposed method

This section presents the proposed Hierarchical Multi-Scale CNN-Transformer Fusion Framework, designed as a calibration-aware brain MRI analysis method robust to dataset shift. The design is motivated by two complementary observations. First, convolutional neural networks (CNNs) are highly effective at capturing fine-grained local structures such as boundaries, intensity discontinuities, and texture variations that are critical for reliable brain MRI interpretation. Second, transformer-based architectures, and in particular the Swin Transformer, provide powerful mechanisms for modeling long-range spatial dependencies and global anatomical context that are difficult to encode using fixed convolutional receptive fields alone. However, when applied independently, both paradigms exhibit limitations under Cross-Dataset evaluation: CNNs may lack sufficient global contextual awareness, while transformers may underperform on small or subtle regions due to token aggregation and hierarchical downsampling effects.

To address these challenges, we propose a unified framework that explicitly integrates local and global representations through hierarchical cross-attention fusion. The network consists of two parallel feature extraction branches—a multi-scale CNN encoder and a pretrained Swin Transformer encoder—whose outputs are fused at three spatial resolutions. At each scale, CNN features act as queries that selectively attend to transformer-derived contextual features, allowing local evidence to be refined using global anatomical cues while preserving locality-driven robustness. The resulting multi-scale fused representations are aggregated and passed to a lightweight classification head. An overview of the complete pipeline is illustrated in Figure 1.

Flowchart illustrating a five-stage brain MRI classification pipeline. Stage one inputs preprocessed MRI data. Stage two splits into CNN-based local feature extraction with ResNet and Swin-based global feature extraction. Stage three applies cross-attention fusion to combine features. Stage four aggregates using global average pooling and concatenation. Stage five performs classification with linear, dropout, and softmax layers, outputting tumor type.

Figure 1. Overview of the proposed hierarchical CNN–Swin fusion framework. A multi-scale ResNet-based CNN branch extracts local texture and boundary information, while a pretrained Swin Transformer branch models global contextual dependencies. At each resolution, cross-attention fusion aligns CNN queries with Swin keys and values. The fused representations are globally pooled, concatenated, and classified by a lightweight MLP head.

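Formally, given an input MRI slice $x$, the overall mapping implemented by the model can be summarized as:

$$x \;\longmapsto\; \{(\mathbf{C}_i, \mathbf{S}_i)\}_{i=1}^{3} \;\longmapsto\; M_{\mathrm{fuse}} \;\longmapsto\; \hat{y}$$

where $\mathbf{C}_i$ and $\mathbf{S}_i$ represent the $i$-th scale CNN and Swin features, respectively, $M_{\mathrm{fuse}}$ is the fused multi-scale representation, and $\hat{y}$ denotes the predicted class probability distribution.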

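Notation. An input MRI slice is denoted by $x \in \mathbb{R}^{3 \times H \times W}$. For fusion scale $s \in \{1, 2, 3\}$, the CNN feature map is $\mathbf{C}_s \in \mathbb{R}^{C_s \times H_s \times W_s}$, the Swin Transformer feature map is $\mathbf{S}_s \in \mathbb{R}^{C_s \times H_s \times W_s}$, and the fused representation is $\mathbf{F}_s \in \mathbb{R}^{C_s \times H_s \times W_s}$. Bold symbols denote feature maps, while $C_s$, $H_s$, and $W_s$ denote the channel count, height, and width at scale $s$. In our implementation, $(C_1, C_2, C_3) = (96, 192, 384)$ with spatial resolutions $(56, 56)$, $(28, 28)$, and $(14, 14)$, respectively. The mathematical formulation of the proposed framework is presented in Equations 1–22.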

3.1 Multi-scale CNN encoder

The CNN branch is designed to capture local and mid-level tumor characteristics that are essential for reliable diagnosis, including boundary sharpness, texture irregularities, and local contrast variations. We adopt a ResNet50 backbone and extract feature maps from three intermediate stages corresponding to progressively increasing receptive fields. Let Ecnn(·) denote the CNN encoder. Given input x, the encoder produces:

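$$E_{\mathrm{cnn}}(x) = \{\tilde{\mathbf{C}}_1, \tilde{\mathbf{C}}_2, \tilde{\mathbf{C}}_3\}$$

where, for a 224 × 224 input, $\tilde{\mathbf{C}}_1 \in \mathbb{R}^{256 \times 56 \times 56}$, $\tilde{\mathbf{C}}_2 \in \mathbb{R}^{512 \times 28 \times 28}$, and $\tilde{\mathbf{C}}_3 \in \mathbb{R}^{1024 \times 14 \times 14}$.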

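To ensure compatibility with transformer features during fusion, each stage is projected using a 1 × 1 convolution:

$$\mathbf{C}_s = \phi_s(\tilde{\mathbf{C}}_s) = \mathrm{Conv}_{1 \times 1}(\tilde{\mathbf{C}}_s), \quad s \in \{1, 2, 3\},$$

where $\tilde{\mathbf{C}}_s$ denotes the unprojected stage-$s$ ResNet50 feature map and $\phi_s$ the scale-specific projection, resulting in channel dimensions 96, 192, and 384 at the three respective scales. These projections preserve spatial structure while aligning feature dimensions for cross-attention fusion.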

3.2 Pretrained Swin Transformer encoder

To complement convolutional locality, the second branch employs a pretrained Swin Transformer, which provides hierarchical global context modeling through shifted-window self-attention. We use a Swin-T backbone accessed via the timm library with features_only=True, enabling extraction of intermediate representations.

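Given input $x$, the transformer encoder produces:

$$E_{\mathrm{swin}}(x) = \{\tilde{\mathbf{S}}_1, \tilde{\mathbf{S}}_2, \tilde{\mathbf{S}}_3\}$$

corresponding to three hierarchical stages. To ensure dimensional compatibility, we apply optional 1 × 1 projections:

$$\mathbf{S}_s = \psi_s(\tilde{\mathbf{S}}_s), \quad s \in \{1, 2, 3\},$$

where $\psi_s$ is either an identity mapping or a learned projection to (96, 192, 384). These features serve as the global contextual source during fusion.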

3.3 Hierarchical cross-attention fusion

A key component of the proposed method is a hierarchical cross-attention fusion mechanism that integrates local CNN features with global transformer representations in a spatially aligned manner. Rather than naïvely concatenating or summing features, the CNN pathway is used to query transformer features at each spatial location, enabling structured interaction between local tumor evidence and global anatomical context.

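To enforce spatial alignment, the Swin feature maps are resized to match CNN resolutions using bilinear interpolation:

$$\hat{\mathbf{S}}_s = \mathrm{Interp}_{\mathrm{bilinear}}(\mathbf{S}_s;\, H_s, W_s)$$

Queries, keys, and values are generated using 1 × 1 convolutions:

$$\mathbf{Q}_s = W_s^{Q} * \mathbf{C}_s, \quad \mathbf{K}_s = W_s^{K} * \hat{\mathbf{S}}_s, \quad \mathbf{V}_s = W_s^{V} * \hat{\mathbf{S}}_s$$

so that the CNN features $\mathbf{C}_s$ supply the queries and the resized Swin features $\hat{\mathbf{S}}_s$ supply the keys and values.

The tensors are reshaped into $h = 8$ attention heads with per-head dimensionality $d_s = C_s / h$ and flattened into $N_s = H_s W_s$ tokens. Cross-attention is computed via scaled dot-product attention:

$$\mathrm{Attn}(\mathbf{Q}_s, \mathbf{K}_s, \mathbf{V}_s) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_s \mathbf{K}_s^{\top}}{\sqrt{d_s}}\right) \mathbf{V}_s$$

After merging attention heads and applying a 1 × 1 output projection, the resulting feature map is denoted $\mathbf{A}_s \in \mathbb{R}^{C_s \times H_s \times W_s}$.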

Finally, CNN and attention-derived features are combined using a residual-style weighted sum:

$$\mathbf{F}_s = \alpha_s \mathbf{C}_s + \beta_s \mathbf{A}_s$$

where $\mathbf{C}_s$ is the projected CNN feature map, $\mathbf{A}_s$ is the cross-attention output, and $\alpha_s$ and $\beta_s$ are scalar learnable parameters initialized to favor the CNN pathway and optimized jointly with the cross-attention weights. In practice, $\alpha_s$ is initialized to 1.0 and $\beta_s$ to 0.1, biasing early training toward the CNN pathway. Unlike static residual or additive fusion, these parameters are learned in conjunction with scale-specific cross-attention, enabling adaptive modulation of global contextual contribution under dataset shift. This design preserves strong locality-driven CNN representations while selectively incorporating transformer-derived context, particularly when transformer features generalize less reliably across datasets. As a result, the fusion mechanism mitigates overconfident reliance on global context and contributes directly to improved probabilistic calibration.

This initialization encodes an inductive bias favoring local evidence early in training, which we hypothesize contributes to improved calibration under distribution shift (Section 5).
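The fusion step described above can be illustrated with a minimal NumPy sketch for a single scale. It is not the authors' implementation: the function names, the weight layout (1 × 1 convolutions stored as `(C_out, C_in)` matrices), and the assumption that the Swin features have already been resized to the CNN resolution are all illustrative choices. Only the head count (8) and the initial values of the fusion weights (1.0 and 0.1) come from the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def conv1x1(feat, weight):
    """1x1 convolution: mix channels of a (C_in, H, W) map with a (C_out, C_in) matrix."""
    return np.einsum("oc,chw->ohw", weight, feat)


def cross_attention_fusion(cnn, swin, w_q, w_k, w_v, w_o,
                           alpha=1.0, beta=0.1, heads=8):
    """Fuse one scale: CNN features query Swin features (hypothetical helper).

    cnn, swin: arrays of shape (C, H, W); swin is assumed to be already
    resized to the CNN resolution.
    """
    c, h, w = cnn.shape
    if c % heads != 0:
        raise ValueError("channel count must be divisible by the number of heads")
    d = c // heads          # per-head dimensionality
    n = h * w               # number of spatial tokens

    def to_heads(feat):
        # (C, H, W) -> (heads, N, d)
        return feat.reshape(heads, d, n).transpose(0, 2, 1)

    q = to_heads(conv1x1(cnn, w_q))    # queries come from the CNN branch
    k = to_heads(conv1x1(swin, w_k))   # keys come from the Swin branch
    v = to_heads(conv1x1(swin, w_v))   # values come from the Swin branch

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (heads, N, N)
    attended = softmax(scores, axis=-1) @ v          # (heads, N, d)

    # Merge heads back to (C, H, W) and apply the output projection.
    merged = attended.transpose(0, 2, 1).reshape(c, h, w)
    attn_out = conv1x1(merged, w_o)

    # Residual-style weighted sum; in training alpha and beta would be learnable.
    return alpha * cnn + beta * attn_out
```

With `beta=0.0` the function returns the CNN features unchanged, which makes the CNN-favoring initialization easy to see: the attention branch starts as a small correction on top of the convolutional signal.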

3.4 Multi-scale aggregation and classification

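The fused representations capture complementary information at multiple resolutions. Global average pooling is applied to the fused feature map $\mathbf{F}_s$ at each scale:

$$\mathbf{g}_s = \mathrm{GAP}(\mathbf{F}_s) = \frac{1}{H_s W_s} \sum_{i=1}^{H_s} \sum_{j=1}^{W_s} \mathbf{F}_s[:, i, j] \in \mathbb{R}^{C_s}$$

The pooled vectors are concatenated to form a unified embedding:

$$M_{\mathrm{fuse}} = [\mathbf{g}_1; \mathbf{g}_2; \mathbf{g}_3] \in \mathbb{R}^{C_1 + C_2 + C_3}$$

Classification is performed using a lightweight two-layer MLP with ReLU activation and dropout:

$$\hat{y} = \mathrm{softmax}\big(W_2\, \mathrm{Dropout}(\mathrm{ReLU}(W_1 M_{\mathrm{fuse}} + b_1)) + b_2\big)$$

where $\hat{y}$ denotes the predicted probability distribution over tumor classes. The network is trained end-to-end using the standard cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

where $y$ is the one-hot ground-truth label and $K = 4$ is the number of classes.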

By integrating convolutional locality and transformer-based global reasoning through hierarchical cross-attention, the proposed framework constructs a unified representation that is both discriminative and robust under Cross-Dataset evaluation. Unlike prior CNN-Transformer fusion approaches that primarily optimize in-domain accuracy, this design explicitly targets stable feature interaction and confidence reliability under dataset shift, laying the foundation for the improved accuracy and calibration reported in Section 5.
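As a rough illustration of the aggregation and classification stage, the NumPy sketch below pools three fused maps, concatenates them, and applies a two-layer MLP at inference time. The hidden width, the weight shapes, and the function names are assumptions made for this example; the text specifies only the channel counts (96, 192, 384), the two-layer structure, ReLU, dropout, and the four output classes.

```python
import numpy as np


def global_average_pool(feat):
    """(C, H, W) -> (C,) by averaging over all spatial positions."""
    return feat.mean(axis=(1, 2))


def aggregate(fused_maps):
    """Pool each scale and concatenate into one embedding vector."""
    return np.concatenate([global_average_pool(f) for f in fused_maps])


def classify(fused_maps, w1, b1, w2, b2):
    """Two-layer MLP head at inference time (dropout acts as the identity)."""
    z = aggregate(fused_maps)                  # 96 + 192 + 384 = 672 values
    hidden = np.maximum(0.0, w1 @ z + b1)      # ReLU
    logits = w2 @ hidden + b2
    logits = logits - logits.max()             # stabilise the softmax
    probs = np.exp(logits)
    return probs / probs.sum()
```

Because each scale is pooled before concatenation, the embedding length depends only on the channel counts and not on the spatial resolutions, which keeps the head small.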

4 Dataset and experimental setup

To rigorously evaluate Cross-Dataset generalization and confidence reliability, we conduct experiments on two publicly available brain MRI classification datasets collected from different sources, acquisition protocols, and scanner environments. Using heterogeneous datasets provides a deployment-relevant assessment of distribution shift, which is essential for methodological validation in brain imaging studies and for safety-critical clinical decision support. Representative MRI samples across the four diagnostic categories are shown in Figure 2, highlighting visual differences between glioma, meningioma, pituitary, and no-tumor classes.

Medical illustration showing sixteen MRI brain scans in four labeled rows: glioma, meningioma, pituitary, and no-tumor. Each row displays multiple MRI image slices of brains, highlighting differences in tumor types and healthy tissue for comparative analysis.

Figure 2. Representative samples across the four classes.

Specifically, we adopt a strict Cross-Dataset evaluation protocol in which the model is trained on one dataset and evaluated on a separate dataset without any target-dataset validation or data leakage. This setting reflects realistic scenarios in which labeled data from a target hospital or scanner is unavailable, and it directly tests whether predictive confidence remains meaningful under heterogeneity. An overview of the complete training, preprocessing, multi-seed learning, and Cross-Dataset evaluation pipeline is illustrated in Figure 3.

Flowchart diagram showing a machine learning workflow for analyzing source dataset BRISC2025 and target dataset BT-MRI. Steps include input preprocessing with RGB conversion, resizing, and normalization; followed by data augmentation with flip, rotate, affine, and jitter. Multi-seed training with three seeds leads to model architectures combining Swin-T Transformer for global features and ResNet50 for local features. Evaluation involves best checkpoint selection, target testing without fine-tuning, assessment of metrics like accuracy, precision, recall, F1, and calibration scores such as ECE, Brier, and NLL.

Figure 3. End-to-end pipeline for cross-dataset MRI classification, including preprocessing, multi-model feature extraction, prediction, and calibrated evaluation.

4.1 BRISC2025 dataset

The BRISC2025 (Fateh et al., 2025) dataset is a large-scale, curated collection of T1-weighted contrast-enhanced brain MRI slices designed for multi-class classification. It contains four diagnostic categories: glioma, meningioma, pituitary adenoma, and no-tumor. Images originate from multiple clinical sources, introducing natural variability in intensity distributions, tumor morphology, and background composition.

The dataset is provided with predefined training and testing splits. Table 1 summarizes the class-wise distribution used in our experiments, while Figure 4 illustrates the corresponding distribution across training and testing sets.

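| Class | Training | Testing | Total |
|---|---|---|---|
| Glioma | 1,147 | 254 | 1,401 |
| Meningioma | 1,329 | 306 | 1,635 |
| No-tumor | 1,067 | 140 | 1,207 |
| Pituitary | 1,457 | 300 | 1,757 |
| Total | 5,000 | 1,000 | 6,000 |

Table 1. Class-wise distribution of the BRISC2025 dataset.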


Two grouped bar charts showing image counts for four categories: glioma, meningioma, no tumor, and pituitary. The left chart, labeled “Training Set Distribution,” uses blue bars; pituitary has the highest count, followed by meningioma, glioma, and no tumor. The right chart, labeled “Testing Set Distribution,” uses red bars; meningioma and pituitary have the highest counts, followed by glioma and no tumor, with no tumor having the lowest count overall.

Figure 4. Class-wise distribution of the BRISC2025 dataset across training and testing splits.

The BRISC2025 dataset exhibits moderate class imbalance, particularly between no-tumor and pituitary cases, making it suitable for evaluating both discriminative performance and calibration behavior under uneven class priors.

4.2 BT-MRI dataset

The BT-MRI dataset is a widely used benchmark for brain MRI classification, comprising contrast-enhanced MRI slices collected from multiple institutions. It contains the same four diagnostic categories as BRISC2025: glioma, meningioma, pituitary adenoma, and no-tumor. However, the dataset differs substantially in image appearance, scanner characteristics, background composition, and tumor presentation, making it challenging in Cross-Dataset evaluation.

The dataset is released with predefined training and testing partitions. The class-wise statistics are reported in Table 2, while Figure 5 provides a visual representation of the class distribution across training and testing splits. Compared to BRISC2025, BT-MRI contains a higher proportion of no-tumor samples and exhibits different intensity and background characteristics. These discrepancies induce a non-trivial data shift that challenges both model generalization and probability calibration.

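| Class | Training | Testing | Total |
|---|---|---|---|
| Glioma | 1,321 | 300 | 1,621 |
| Meningioma | 1,339 | 306 | 1,645 |
| No-tumor | 1,595 | 405 | 2,000 |
| Pituitary | 1,457 | 300 | 1,757 |
| Total | 5,712 | 1,311 | 7,023 |

Table 2. Class-wise distribution of the BT-MRI dataset.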


Bar chart illustration displays training set distribution on the left, where No tumor category has the highest image count followed by Pituitary, Meningioma, and Glioma, each with over one thousand three hundred images. On the right, the testing set distribution is depicted, with No tumor again having the most images and the other three categories—Pituitary, Meningioma, and Glioma—having similar, slightly lower counts, all with values between three hundred and four hundred images. Each chart is labeled with number of images on the y-axis and tumor types on the x-axis.

Figure 5. Class-wise distribution of the BT-MRI dataset across training and testing splits.

4.3 Cross-Dataset evaluation protocol

To assess robustness under deployment-relevant conditions, we adopt a one-way Cross-Dataset evaluation strategy. Unless otherwise stated, the model is trained exclusively on the BRISC2025 training set and evaluated on the BT-MRI test set:

$$\text{BRISC2025 (train)} \;\longrightarrow\; \text{BT-MRI (test)}$$

No samples from the target dataset are used during training, validation, hyperparameter tuning, or early stopping. This strict separation ensures that all reported results reflect genuine generalization rather than dataset-specific adaptation.

The reverse setting (BT-MRI → BRISC2025) is used only for supplementary analysis and is not included in the main results to maintain clarity of presentation.

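Let $\mathcal{D}_S$ denote the source dataset and $\mathcal{D}_T$ the target dataset, with potentially different joint distributions:

$$P_S(x, y) \neq P_T(x, y)$$

The model parameters $\theta$ are optimized by minimizing the empirical risk on $\mathcal{D}_S$:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{|\mathcal{D}_S|} \sum_{(x, y) \in \mathcal{D}_S} \mathcal{L}_{\mathrm{CE}}\big(f_{\theta}(x), y\big)$$

where $f_{\theta}$ denotes the network and $\mathcal{L}_{\mathrm{CE}}$ the cross-entropy loss, while all reported performance metrics are computed exclusively on $\mathcal{D}_T$. This protocol prevents data leakage and provides a realistic assessment of robustness and confidence reliability under dataset shift.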

This study evaluates methodological robustness using retrospective public datasets. While results are encouraging, prospective clinical validation and multi-site testing are required prior to deployment in real-world diagnostic workflows.

We prioritize the BRISC2025 → BT-MRI direction in the main paper due to its stronger distribution shift characteristics, while reporting the reverse setting in the supplementary material for completeness.

4.4 Dataset curation and quality control

To enhance transparency and reproducibility, we explicitly describe the dataset handling and quality control procedures adopted in this study.

Both BRISC2025 and BT-MRI are publicly released datasets with predefined training and testing splits and expert-provided diagnostic labels. No relabeling, class merging, or modification of diagnostic categories was performed. All labels were used exactly as provided by the original dataset creators.

Prior to training, all image files were subjected to automated integrity checks to remove corrupted or unreadable files. We verified consistency of image dimensions and ensured that class labels were correctly mapped. Duplicate file names and visually identical samples were screened to avoid unintended data leakage.
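One simple way to screen for duplicated files is to group them by a content hash. The sketch below is only an assumption about how such a check could be done, not the procedure used in this study: it detects byte-identical files, whereas visually identical images saved with different encodings would need a perceptual comparison. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return groups of files under `root` that share identical bytes."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Running the check over the union of the source training folder and the target test folder would flag any file that appears in both, which is the leakage case of most concern in a cross-dataset protocol.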

No additional filtering based on tumor size, anatomical location, or visual characteristics was applied. All eligible slices within the official dataset partitions were retained to preserve the natural variability and class imbalance properties of each dataset.

Original dataset splits were strictly preserved. In the Cross-Dataset setting, the target dataset was not used for model selection, hyperparameter tuning, or early stopping, ensuring complete separation between training and evaluation data.

These procedures ensure that the reported results reflect genuine Cross-Dataset generalization rather than artifacts introduced by implicit filtering, relabeling, or partition modification.

4.5 Preprocessing and data augmentation

To ensure consistency with ImageNet-pretrained backbones and reduce scanner-induced variability, all MRI slices are converted to three-channel RGB format and resized to 224 × 224 pixels.

4.5.1 Normalization

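Pixel intensities are normalized using ImageNet mean and standard deviation:

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c}, \quad \mu = (0.485, 0.456, 0.406), \quad \sigma = (0.229, 0.224, 0.225),$$

where $c$ indexes the three channels and the values are the standard ImageNet statistics for intensities scaled to $[0, 1]$, ensuring compatibility with pretrained CNN and Swin Transformer weights.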
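The channel replication and normalization can be sketched in a few lines of NumPy. This is an illustrative assumption about the preprocessing: it takes a single-channel 8-bit slice, copies it to three channels, and applies the standard ImageNet statistics. Resizing to 224 × 224 is omitted, and the function name is hypothetical.

```python
import numpy as np

# Standard ImageNet channel statistics for intensities scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_normalized_rgb(gray_uint8):
    """(H, W) uint8 slice -> (3, H, W) float32 array, ImageNet-normalized."""
    x = gray_uint8.astype(np.float32) / 255.0      # scale to [0, 1]
    rgb = np.repeat(x[None, :, :], 3, axis=0)      # replicate to three channels
    return (rgb - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]
```

After this step the three channels hold the same image content but differ by the per-channel offset and scale that the pretrained weights expect.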

4.5.2 Data augmentation

During training, we apply stochastic augmentations to improve robustness to appearance variations: random horizontal flipping, small-angle rotations (±15°), affine transformations (scaling and translation), and mild color jitter. These augmentations preserve anatomical plausibility while exposing the model to both local texture perturbations and global structural changes. No augmentation is applied during testing.

4.6 Training configuration

All models are implemented in PyTorch and trained using the AdamW optimizer. The CNN branch is initialized with ImageNet-pretrained ResNet50 weights, and the transformer branch uses a pretrained Swin-T backbone obtained via the timm library. Fusion modules and classification heads are trained from scratch.

To account for optimization stochasticity, each experiment is repeated using three independent random seeds:

$$\text{seed} \in \{43, 356, 433\}$$

Results are reported as mean and standard deviation across seeds. Given the limited number of random seeds, results are interpreted primarily through consistency and effect size rather than formal hypothesis testing.
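Aggregation across seeds can be sketched as below. The accuracy values are hypothetical placeholders, not reported results, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def summarize_over_seeds(values_by_seed: dict) -> tuple:
    """Return (mean, sample standard deviation) of one metric across seeds."""
    values = list(values_by_seed.values())
    return mean(values), stdev(values)


# Hypothetical accuracies keyed by the three seeds; illustrative only.
example = {43: 0.991, 356: 0.993, 433: 0.992}
avg, sd = summarize_over_seeds(example)
print(f"{avg:.4f} ± {sd:.4f}")
```

With only three seeds the standard deviation is itself a noisy estimate, which is consistent with interpreting results through consistency and effect size.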

Learning rate scheduling is performed using ReduceLROnPlateau based on validation accuracy on the source dataset. The checkpoint with the highest validation accuracy is retained for final Cross-Dataset evaluation, with the complete training configuration detailed in Table 3.
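The interaction between plateau-based scheduling and checkpoint selection can be sketched in plain Python. This is a simplified stand-in for `torch.optim.lr_scheduler.ReduceLROnPlateau` in `max` mode, without its threshold and cooldown options. The reduction factor of 0.1 is an assumption (it is the PyTorch default and is not given in the text), and the function name `plateau_schedule` is hypothetical.

```python
def plateau_schedule(val_acc, lr=1e-4, patience=5, factor=0.1):
    """Return per-epoch learning rates and the index of the best epoch.

    The learning rate is multiplied by `factor` once source-validation
    accuracy has failed to improve for more than `patience` epochs.
    """
    best_acc, best_epoch, bad_epochs = float("-inf"), -1, 0
    lrs = []
    for epoch, acc in enumerate(val_acc):
        lrs.append(lr)  # learning rate used during this epoch
        if acc > best_acc:
            # New best: this is the checkpoint that would be retained.
            best_acc, best_epoch, bad_epochs = acc, epoch, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
    return lrs, best_epoch
```

Both decisions depend only on source-domain validation accuracy, so the target dataset plays no role in either the schedule or the choice of checkpoint.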

| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Initial learning rate | $1 \times 10^{-4}$ |
| Weight decay | $1 \times 10^{-2}$ |
| Batch size | 16 |
| Number of epochs | 50 |
| LR scheduler | ReduceLROnPlateau |
| Scheduler patience | 5 epochs |
| Dropout (classifier) | 0.5 |
| Image resolution | 224 × 224 |
| Random seeds | 43, 356, 433 |

Table 3. Training hyperparameters used in all experiments.

4.7 Evaluation metrics

Model performance is evaluated using both classification accuracy metrics and probabilistic calibration measures, reflecting the dual requirements of discriminative power and confidence reliability in medical diagnosis.

4.7.1 Classification metrics

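To assess discriminative performance, we report overall accuracy together with macro-averaged precision, recall, and F1-score, which account for potential class imbalance in the MRI datasets. These are defined from the true positives ($TP_c$), false positives ($FP_c$), and false negatives ($FN_c$) of each class $c \in \{1, \dots, C\}$:

$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \mathrm{F1}_c = \frac{2 \cdot \mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}.$$

Macro-averaged values are the unweighted mean of the per-class values over the $C$ classes.

Overall accuracy is defined as:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{c=1}^{C} TP_c,$$

where $N$ is the total number of evaluated samples.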
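The classification metrics can be computed directly from per-class counts. The sketch below is illustrative: it assumes integer class labels in `0..num_classes-1`, takes macro F1 as the mean of per-class F1 scores, and scores a class with an empty denominator as 0. These conventions are assumptions, and the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def safe_div(a, b):
        return a / b if b else 0.0  # convention: empty denominator scores 0

    prec = [safe_div(tp[c], tp[c] + fp[c]) for c in range(num_classes)]
    rec = [safe_div(tp[c], tp[c] + fn[c]) for c in range(num_classes)]
    f1 = [safe_div(2 * p * r, p + r) for p, r in zip(prec, rec)]
    acc = safe_div(sum(tp), len(y_true))
    k = num_classes
    return acc, sum(prec) / k, sum(rec) / k, sum(f1) / k
```

Macro averaging weights every class equally, so a rare class with poor recall lowers the score as much as a frequent one would.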

4.7.2 Expected Calibration Error (ECE)

ECE quantifies the mismatch between a model's predicted confidence and its empirical accuracy by discretizing predictions into M disjoint bins based on confidence (Guo et al., 2017). It is defined as:

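$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$

where $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and average confidence within bin $B_m$, $|B_m|$ is the number of samples in that bin, and $N$ is the total number of samples. Under Cross-Dataset evaluation, distributional shifts frequently cause deep models to become systematically overconfident, leading to low-accuracy, high-confidence failure modes (Lakshminarayanan et al., 2017).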
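The binning procedure behind ECE can be sketched as follows. The sketch assumes equal-width confidence bins and a default of 15 bins. The number of bins is not stated in the text, so that default is an assumption, as is the function name `expected_calibration_error`.

```python
def expected_calibration_error(confidences, correct, num_bins=15):
    """ECE with equal-width bins [b/M, (b+1)/M) over the top-class confidence.

    confidences: predicted probability of the predicted class, per sample.
    correct:     truthy where the prediction matched the label.
    """
    n = len(confidences)
    conf_sum = [0.0] * num_bins
    hit_sum = [0] * num_bins
    count = [0] * num_bins
    for conf, hit in zip(confidences, correct):
        b = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        conf_sum[b] += conf
        hit_sum[b] += int(bool(hit))
        count[b] += 1
    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        acc = hit_sum[b] / count[b]
        avg_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(acc - avg_conf)
    return ece
```

For example, two correct predictions at confidence 0.95 and one correct plus one incorrect prediction at confidence 0.65 give gaps of 0.05 and 0.15 in two equally weighted bins, so the ECE is 0.10.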

4.7.3 Brier score

The Brier score is a strictly proper scoring rule that evaluates the mean squared error between predicted probabilities and one-hot ground-truth labels (Brier, 1950):

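$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( p_{i,c} - y_{i,c} \right)^2,$$

where $p_{i,c}$ is the predicted probability of class $c$ for sample $i$ and $y_{i,c}$ is the corresponding one-hot label entry. Unlike ECE, the Brier score captures both calibration and “sharpness” without requiring binning, making it a robust measure for comparing models under dataset shift (Minderer et al., 2021).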
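A minimal sketch of the multiclass Brier score is given below. It assumes the squared error is summed over classes and averaged over samples, as in Brier (1950); some implementations additionally divide by the number of classes, and the text does not say which convention is used. The function name `brier_score` is hypothetical.

```python
def brier_score(probs, labels):
    """Mean over samples of the squared error summed over classes.

    probs:  list of per-class probability lists, one per sample.
    labels: list of integer class indices.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Compare each predicted probability to the one-hot target entry.
        total += sum((p_c - (1.0 if c == y else 0.0)) ** 2
                     for c, p_c in enumerate(p))
    return total / len(probs)
```

Under this convention a perfect prediction scores 0, and a fully confident wrong prediction scores 2 in the binary case.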

4.7.4 Negative Log-Likelihood (NLL)

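NLL measures the negative average log-probability assigned to the correct class:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i},$$

where $p_{i, y_i}$ is the predicted probability of the true class $y_i$ for sample $i$ and $N$ is the total number of samples.

NLL is highly sensitive to miscalibration, as it assigns an unbounded penalty to overconfident incorrect predictions (Ovadia et al., 2019). A lower NLL therefore indicates fewer overconfident errors of the kind often seen in standalone Transformer architectures.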
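The sensitivity of NLL to overconfident errors can be seen in a small sketch. The clipping constant `eps` is an assumption added to keep the logarithm finite when a probability is exactly zero; it is not part of the metric's definition, and the function name `negative_log_likelihood` is hypothetical.

```python
import math


def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average negative log-probability of the true class.

    probs:  list of per-class probability lists, one per sample.
    labels: list of integer class indices.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))  # clip to avoid log(0)
    return total / len(probs)
```

A prediction that gives the true class probability 0.5 contributes log 2 ≈ 0.69, whereas one that gives it 0.001 contributes about 6.9, so a few confident mistakes can dominate the average.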

Together, ECE, Brier score, and NLL provide complementary perspectives on probabilistic reliability. While ECE offers an interpretable, bin-based calibration diagnostic, the Brier score and NLL provide distribution-level and sample-level assessments, respectively. Reporting all three metrics enables a comprehensive evaluation of calibration behavior under distribution shift, where accuracy alone is insufficient to assess clinical safety.

In this work, these metrics are reported alongside standard classification measures to assess whether the proposed CNN-Transformer fusion improves confidence reliability as well as classification performance when trained on one dataset and tested on another. Although ECE is bin-dependent, reporting it alongside strictly proper scoring rules (Brier score and NLL) helps verify that calibration improvements are not artifacts of binning choices.

4.8 Implementation details

All experiments are conducted on an NVIDIA A100 GPU with 40 GB memory, using an input resolution of 224 × 224 for all models. The fusion model incurs a modest computational overhead relative to standalone CNN and Swin baselines due to multi-scale cross-attention, while remaining tractable for clinical-scale inference.

All baselines (CNN-only and Swin-only) share the same preprocessing, training configuration, and evaluation protocol described above.
