In Silico Digital Breast Tomosynthesis Dataset for the Comparative Analysis of Deep Learning Models in Tumor Segmentation

The first part of this section describes the methodology: in silico DBT dataset creation, mask delineation, reference standard definition, selection of deep learning architectures, training/validation and test procedures, and performance metrics. We also describe the statistical analysis used to test for significance. Our primary objective was to experimentally evaluate well-established deep learning architectures, either trained from scratch or pretrained on a natural image dataset and then fine-tuned, when fed exclusively with in silico DBT data. We thereby investigated the viability of this method as an in silico experimental proof of concept. In the second part, we assessed the performance of a baseline U-Net architecture trained from scratch on a hybrid dataset comprising both in silico and real-world DBT data. We thereby evaluated the utility of in silico DBT as a complementary resource for training and benchmarking DL models, particularly in data-limited environments. For this purpose, we provide a comprehensive step-by-step description of our method (Fig. 1), following the recommendations in reference [42].

Fig. 1

Workflow of this study. The pipeline consists of two parts. (a) In silico DBT volumes were used to extract VOIs, which were then decomposed into 2D ROIs. (b) A dataset of 230 annotated ROIs was curated, each ROI comprising tumor or healthy tissue. (c) These were resized and used as input for model training. (d) Four deep learning architectures (U-Net, FCN, DeepLabV3, DeepLabV3+) were trained from scratch or fine-tuned with ResNet-50/101 backbones. (e) The U-Net model was later retrained using a hybrid dataset (n = 250, in silico + real-world DBT ROIs) and tested on unseen independent real data. (f) Performance was evaluated using standard metrics (F1-score, IoU, precision, recall), and (g) statistical comparisons were conducted using the Wilcoxon Signed-Rank Test with Bonferroni correction.

In Silico Digital Breast Tomosynthesis Dataset

The breast models, tumors, DBT virtual images, volumes of interest (VOIs), and regions of interest (ROIs) were created by leveraging the VICTRE (Virtual Clinical Trial for Regulatory Evaluation) software [31, 43]. The advantages of the software are as follows:

1. The software is open-source and FDA-cleared.

2. The software comprises several in silico tools that allow a full pipeline to be simulated, from breast phantom creation through breast compression and lesion creation and insertion to image generation. It can also mimic real-world DBT device acquisition parameters.

3. Prior studies on data derived from the software have shown that the simulated images are sufficiently realistic when compared with real-world datasets. Image properties such as gray levels, contrast-to-noise ratio (CNR), the first five statistical moments (mean, variance, skewness, kurtosis, and hyperskewness), and radiomic textures have been assessed, indicating a domain-specific distribution very close to that of real-world DBTs [33, 38, 44, 45]. More details on the software and its documentation can be found in references [31, 34, 46].

Breast Phantoms and In Silico DBT

We simulated breast phantoms with diverse glandularity and categorized each phantom based on its fat fraction content. This classification was largely inspired by the four breast parenchymal density categories outlined in the American College of Radiology (ACR) BI-RADS lexicon (Fifth Edition) [47]: fatty, scattered, heterogeneously dense, and dense. While the current BI-RADS lexicon defines these categories as subjective estimates of breast density associated with variations in mammography sensitivity [48], our approach utilized a quantitative descriptor based on the fat fraction simulated for each mathematical phantom. This method aligns closely with the quantitative definitions used in the earlier ACR BI-RADS Fourth Edition lexicon [49]. We simulated a total of 30 breast phantoms, evenly distributed across the four density categories. The breast phantoms have a 0.05 mm3 voxel size. The simulated fat fraction ranged from 0.75 to 0.95 for the fatty breast category, 0.55 to 0.70 for scattered, 0.30 to 0.50 for heterogeneously dense, and 0.05 to 0.25 for the dense breast category.
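The fat-fraction ranges above amount to a simple lookup from a simulated fat fraction to a density category. The following is a minimal sketch of that mapping, not the authors' code; the function name `density_category` and the choice to return `None` for values that fall between the stated ranges are assumptions made for illustration.

```python
from typing import Optional

# Fat-fraction ranges per density category, as stated in the text.
FAT_FRACTION_RANGES = {
    "fatty": (0.75, 0.95),
    "scattered": (0.55, 0.70),
    "heterogeneously dense": (0.30, 0.50),
    "dense": (0.05, 0.25),
}


def density_category(fat_fraction: float) -> Optional[str]:
    """Return the category whose range contains fat_fraction, else None.

    Hypothetical helper: the gaps between ranges (e.g. 0.70 to 0.75)
    are not assigned to any category in the text, so None is returned.
    """
    for name, (low, high) in FAT_FRACTION_RANGES.items():
        if low <= fat_fraction <= high:
            return name
    return None


print(density_category(0.80))  # fatty
print(density_category(0.40))  # heterogeneously dense
print(density_category(0.72))  # None (between the scattered and fatty ranges)
```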

In the next step, we simulated the breast compression thickness between 40 and 65 mm, reflecting the range typically observed in clinical practice [50].

The final step consisted of obtaining the DBT reconstructions derived from the breast phantoms. The VICTRE system employs a 3D filtered back projection (FBP) reconstruction algorithm to generate DBT volumes. The FBP code takes parameters such as system geometry, breast model dimensions, and voxel size. Thus, the reconstructed DBT cases have different dimensions depending on breast density and compression thickness. As an example, a DBT case might have dimensions on the order of 1300 pixels (width) by 450 pixels (height), with depths ranging from 38 to 68 slices depending on the simulated breast compression. A more detailed DBT case example is provided in Appendix Fig. 8.

In total, 30 DBT cases were derived from the 30 simulated breast phantoms. Figure 2 shows the central slices of four in silico DBT cases derived from phantoms of different tissue density categories.

Fig. 2

In silico DBT cases reconstructed with the full pipeline of the VICTRE software. A to D show the central slice of four different DBT cases across the breast tissue ACR categories: A dense breast tissue, B heterogeneously dense breast tissue, C scattered breast tissue, and D fatty breast tissue. All DBTs shown contain an embedded tumor.

Breast Tumors Simulation

We simulated 30 spiculated, irregularly shaped tumors, mimicking characteristics described in the literature as more likely indicators of malignancy. The tumors had density ratios of 1.1, 1.3, and 1.5 relative to the surrounding healthy glandular tissue; tumors with a high density ratio have been described as more likely to be malignant [51]. The mean tumor radius ranged from 0.5 mm to 4 mm. We then embedded the tumors into the breast phantoms, resulting in the DBTs illustrated in Fig. 2.

Volumes of Interest (VOIs) and Regions of Interest (ROIs) Extraction

Due to the high dimensionality of DBT volumes described in previous sections and to ensure anatomical diversity and maximize learning generalization, each 3D volume of interest (VOI) extracted from a simulated DBT volume (size, 109 × 109 × 10) was decomposed into ten 2D regions of interest (ROIs). These ROIs span a spectrum of anatomical contexts, including central tumor slices, peripheral margins, and tumor-free regions composed entirely of healthy tissue. ROIs containing either partial tumor structures or no tumor at all were retained to preserve contextual variability. As illustrated in Fig. 3 (columns 1, 9, and 10), several ROIs represent healthy tissue and are associated with empty binary masks.

Fig. 3

Example of a single volume of interest (VOI) extracted from a DBT volume, containing a 2 mm mean radius tumor. A The 109 × 109 × 10 VOI is decomposed into ten 2D regions of interest (ROIs) with a size of 109 × 109 pixels, each treated as an independent input for training. B Corresponding binary segmentation masks. Note that each ROI captures different anatomical contexts—central, peripheral, or healthy tissue, the latter represented by an empty mask (columns 1, 9, and 10). This ROI-based method is intended to improve model generalization through diverse learning examples. Column 6 corresponds to the central tumor slice
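The decomposition of a VOI into ten 2D ROIs can be sketched with NumPy as follows. This is an illustrative sketch only: it assumes the VOI is stored as a (height, width, depth) array with the slice axis last, and the names `voi_to_rois` and `has_tumor` are hypothetical.

```python
import numpy as np


def voi_to_rois(voi: np.ndarray) -> list:
    """Split a (height, width, depth) VOI into a list of 2D slices.

    Assumption: the slice (depth) axis is the last one.
    """
    if voi.ndim != 3:
        raise ValueError("expected a 3D array")
    return [voi[:, :, k] for k in range(voi.shape[2])]


def has_tumor(mask: np.ndarray) -> bool:
    """True if the binary mask contains at least one foreground pixel."""
    return bool(np.any(mask))


# Placeholder VOI with the dimensions given in the text (109 x 109 x 10).
voi = np.zeros((109, 109, 10), dtype=np.float32)
rois = voi_to_rois(voi)
print(len(rois), rois[0].shape)  # 10 (109, 109)
```

Slices whose mask is empty (for which `has_tumor` returns `False`) correspond to the healthy-tissue ROIs that the text says were retained.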

Although our training strategy is based on 2D ROIs, these were extracted from full-volume DBT simulations to preserve anatomical realism and intra-volume heterogeneity. Simulating the entire breast volume—rather than isolated patches—enables more realistic modeling of background parenchymal patterns, tumor-to-tissue contrast, and acquisition-related variations such as compression and slice thickness. This approach aligns with VICTRE and other virtual clinical trial frameworks designed to replicate real-world imaging conditions [31, 43]. In the context of data-limited or low-resource environments, the use of 2D ROI-based training from high-fidelity 3D simulations provides a computationally efficient compromise, enabling robust generalization while maintaining close resemblance to clinical data.

Each ROI was treated as an independent training instance to expose the model to both lesion-positive and lesion-negative patterns. This strategy aligns with prior research in breast imaging, showing that parenchymal characteristics from different areas of the breast may contribute differently towards the risk for developing breast cancer [52]. Consequently, patch-based and ROI-wise training approaches leverage local tissue heterogeneity. Moreover, by incorporating a range of radiological contexts, this sampling design supports the development of more robust segmentation models that generalize beyond lesion-centered slices.

Mask Delineation and Reference Standard Definition

The ROIs were manually segmented by an expert operator using the ImageJ software [53]. Background pixels belonging to the surrounding healthy breast tissue were labeled 0, and foreground pixels belonging to the tumor were labeled 1. Each resulting binary mask was defined as the “Reference Standard,” commonly known as the ground truth. Note that, unlike in real-world annotation, the software itself creates the ground truth tumor, so the operator knows the exact tumor location and shape beforehand, which facilitates the annotation process.

Digital Breast Tomosynthesis Dataset ROIs Preprocessing

Each 2D ROI derived from the VOIs was used as an individual input. Before the training process, both ROIs and their respective binary masks were resized to 128 × 128 pixels, converted into 8-bit grayscale, and saved in TIFF format. No additional preprocessing steps were performed on the ROIs dataset. Our DBT ROIs dataset can be found under the name “BreasTomo-Synth” and has been made publicly available [54]. 
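The preprocessing described above (resize to 128 × 128, convert to 8-bit) could look roughly like the sketch below. The text does not state the interpolation method or how intensities were mapped to 8 bits, so nearest-neighbor resampling and min-max scaling are assumptions here, and the function names are hypothetical. Writing the TIFF files is omitted.

```python
import numpy as np


def to_uint8(image: np.ndarray) -> np.ndarray:
    """Map intensities to 0..255 (assumed min-max scaling; not stated in the text)."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros(image.shape, dtype=np.uint8)
    return np.round(255.0 * (image - lo) / (hi - lo)).astype(np.uint8)


def resize_nearest(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Resize a 2D array to size x size with nearest-neighbor sampling (assumed)."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[np.ix_(rows, cols)]


rng = np.random.default_rng(0)
roi = rng.random((109, 109))               # placeholder ROI
mask = np.zeros((109, 109), dtype=np.uint8)
mask[40:60, 45:70] = 1                     # placeholder tumor mask

roi_128 = resize_nearest(to_uint8(roi))
mask_128 = resize_nearest(mask)
print(roi_128.shape, roi_128.dtype, np.unique(mask_128))
```

Nearest-neighbor sampling is used in this sketch because it keeps the resized mask strictly binary; a smoother interpolation would be a reasonable alternative for the ROI itself.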

Deep Learning Models

In this study, 13 models were created based on four well-established architectures for semantic segmentation: U-Net [40], Fully Convolutional Network (FCN) [56], DeepLabV3 [57], and DeepLabV3+ [58]. U-Net is one of the most widely adopted segmentation architectures, as it was designed for biomedical image segmentation with limited training data. Its U-shaped architecture consists of an encoder, which reduces spatial resolution while enriching features, and a decoder, which restores resolution through up-convolutions and feature concatenation.

In this study, U-Net is trained from scratch using its standard architecture. We defined this model as the baseline for further performance comparison analysis.

FCN, in turn, adapts a pretrained CNN for image classification into an encoder by replacing fully connected layers with convolutional layers, reusing their weights and biases. A decoder with a transposed convolution layer is added to upsample feature maps.

Finally, DeepLabV3 utilizes dilated convolutions to expand the receptive field and capture richer contextual information. Its Atrous Spatial Pyramid Pooling (ASPP) module captures multi-scale features. DeepLabV3 +, in turn, enhances the architecture by adding a decoder that upsamples feature maps and refines segmentation accuracy by combining them with low-level features.

We adopted ResNet-50 and ResNet-101 as backbones [59] since those types of convolutional neural network architectures are designed to mitigate the degradation problem in deep learning through residual connections, enabling efficient training of very deep models. With 50 and 101 layers respectively, these models are widely adopted for image recognition tasks and often serve as backbones in more complex network implementations. In the present study, we used the backbones pretrained on the COCO dataset [18], a large-scale benchmark for object detection, segmentation, and image captioning, featuring over 330,000 images and 1.5 million object instances across 80 categories. It emphasizes diverse, real-world scenes with rich contextual annotations.

In Silico Models Comparative Analysis

We defined the U-Net as the baseline model. Its performance was compared with that of three other well-established segmentation architectures: FCN, DeepLabV3, and DeepLabV3+. Two approaches were explored: (1) training from scratch using the in silico-generated dataset and (2) fine-tuning pretrained models with ResNet-50 and ResNet-101 as backbones. In total, 13 deep learning models were comparatively analyzed.

In Silico Training Strategy and Code Implementation

An efficient region-of-interest-based (ROI-based) training strategy was implemented to optimize computational resources. By focusing on distinct ROI segments representing both surrounding healthy breast tissue and tumor regions, each treated as a unique sample, the method provides diverse examples for training. This approach enhances the deep learning network’s ability to generalize across various breast regions while minimizing computational costs.

Since the U-Net model takes input sizes that need to be divisible by 32, the 109 × 109 pixel 2D ROIs were resized to 128 × 128 pixels. No additional intensity normalization was applied. The U-Net architecture was trained only from scratch, using a binary cross-entropy (BCE) loss function, which is well suited to binary segmentation tasks and pixel-wise classification.

The fine-tuning process for FCN, DeepLabV3, and DeepLabV3+ (with ResNet-50 or ResNet-101) was applied only to the final convolutional layers, while the early layers remained frozen to retain the feature extraction capabilities learned from the COCO dataset.

The Adam optimizer with a learning rate of 0.001 and a batch size of 20 was used for optimization. Additionally, stochastic weight averaging (SWA) was employed to average weights across multiple epochs, enhancing generalization and improving the model’s robustness. Batch normalization (BN) was employed to stabilize training and improve convergence by ensuring consistent feature distributions across mini-batches.

In Silico Dataset Splitting

The dataset was divided using the stratified shuffle split method [60] to ensure that label distributions were preserved across all subsets. Initially, 80% of the data was allocated to training/validation, and 20% was reserved for testing. This stratification approach maintained consistent label proportions in all subsets, mitigating potential class imbalance.

Five re-shuffling splits were performed to support robust cross-validation, with a fixed random seed of 42, ensuring reproducibility.
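The splitting procedure can be illustrated with scikit-learn's `StratifiedShuffleSplit`, using the parameters stated above (five re-shuffling splits, 20% test, seed 42). This is a sketch under assumptions: the text does not say which label was used for stratification, so a per-ROI tumor-present flag is assumed, and the 160/70 class ratio is a placeholder rather than the dataset's real composition.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels for 230 ROIs: 1 = tumor present, 0 = empty mask.
# The 160/70 ratio is a placeholder, not taken from the paper.
labels = np.array([1] * 160 + [0] * 70)
features = np.zeros((labels.size, 1))  # dummy features; only the labels matter

splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
splits = list(splitter.split(features, labels))

for fold, (train_val_idx, test_idx) in enumerate(splits):
    # Each split keeps the tumor/no-tumor proportion close to the full dataset.
    print(fold, len(train_val_idx), len(test_idx),
          round(float(labels[test_idx].mean()), 3))
```

With 230 samples, each split yields 184 training/validation ROIs and 46 test ROIs.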

During inference, the model produced pixel-wise probability maps via a sigmoid activation function. A threshold of 0.5 was applied to generate binary segmentation masks, classifying pixels as tumor or non-tumor. This threshold was chosen based on its ability to generate binary predictions that align well with the ground truth masks, thus optimizing performance metrics such as precision and recall. The code was implemented using PyTorch Lightning and run on Google Colaboratory with an NVIDIA T4 GPU [61]. Table 1 summarizes the hyperparameter settings used in the code implementation for each segmentation model and both training approaches.
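The inference step (sigmoid followed by a 0.5 threshold) can be sketched in NumPy as below. This is not the authors' PyTorch Lightning code; the function names are hypothetical, and treating a probability of exactly 0.5 as tumor (`>=`) is an assumption.

```python
import numpy as np


def sigmoid(logits: np.ndarray) -> np.ndarray:
    """Element-wise logistic function mapping logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-logits))


def binarize(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn raw model outputs into a binary mask (1 = tumor, 0 = background)."""
    return (sigmoid(logits) >= threshold).astype(np.uint8)


logits = np.array([[-2.0, 0.0], [0.5, 3.0]])  # toy 2 x 2 model output
print(binarize(logits))  # rows: [0 1] and [1 1]
```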

Table 1 Segmentation architectures and hyperparameter settings

Optimization and Performance Metrics

As mentioned in the previous section, we utilized the Binary Cross Entropy (BCE) loss during the in silico models’ training. This optimization metric measures the discrepancy between the predicted and actual probability distributions for classification tasks. It is particularly suitable for pixel-level classification in segmentation [62].
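For reference, the pixel-wise BCE loss can be written out in a few lines of NumPy. This is a sketch of the standard definition, averaged over all pixels; the clipping constant `eps` is an assumption added for numerical stability, not a setting reported in the text.

```python
import numpy as np


def bce_loss(probs: np.ndarray, targets: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(np.mean(per_pixel))


targets = np.array([[1.0, 0.0], [0.0, 1.0]])
uninformative = np.full((2, 2), 0.5)          # predicts 0.5 everywhere
confident = np.array([[0.9, 0.1], [0.1, 0.9]])

print(bce_loss(uninformative, targets))  # equals ln(2), about 0.693
print(bce_loss(confident, targets))      # lower loss for better predictions
```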

The evaluation of semantic segmentation presents inherent complexity, as it requires assessing both classification accuracy and localization precision. The objective is to quantify the degree of similarity between the predicted segmentation output and the annotated reference standard (ground truth). Thus, the performance metrics selected for our study were F1-score, precision, recall, and intersection over union (IoU) [63]. These metrics collectively provide a robust framework for evaluating and optimizing the performance of segmentation models, balancing accuracy and precision in diverse scenarios.
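The four metrics can all be derived from the pixel-level counts of true positives, false positives, and false negatives. The sketch below shows one way to compute them for a single pair of binary masks; returning 0.0 when a denominator is zero (for example, when both masks are empty) is an assumed convention, since the text does not state how such cases were handled.

```python
import numpy as np


def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Precision, recall, F1-score, and IoU for a pair of binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))    # tumor pixels correctly predicted
    fp = int(np.sum(pred & ~truth))   # background predicted as tumor
    fn = int(np.sum(~pred & truth))   # tumor pixels missed

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0  # assumed convention for empty cases

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
    }


truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [1, 0]])
scores = segmentation_metrics(pred, truth)
print(scores)  # precision 0.5, recall 0.5, f1 0.5, iou 1/3
```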

Statistical Analysis

A pairwise statistical analysis was conducted to compare the baseline model, U-Net trained from scratch, with the other 12 model variants, which were either trained from scratch or pretrained and fine-tuned with ResNet-50 or ResNet-101 backbones. Because the performance metrics were not assumed to be normally distributed, the non-parametric Wilcoxon Signed-Rank Test was applied to each metric (F1-score, intersection over union (IoU), precision, and recall). The Bonferroni correction was applied to account for multiple comparisons, setting the adjusted significance threshold at α = 0.0042 (calculated as α = 0.05/12, where 12 is the number of pairwise comparisons).
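The Bonferroni arithmetic is small enough to spell out. The sketch below only reproduces the threshold calculation and the decision rule; the p-values themselves would come from a Wilcoxon signed-rank test (for instance `scipy.stats.wilcoxon`), and the example p-values passed to the hypothetical helper `is_significant` are placeholders, not results from the study.

```python
N_COMPARISONS = 12   # baseline U-Net versus each of the 12 other models
FAMILY_ALPHA = 0.05  # family-wise significance level

# Bonferroni: divide the family-wise level by the number of comparisons.
adjusted_alpha = FAMILY_ALPHA / N_COMPARISONS


def is_significant(p_value: float, alpha: float = adjusted_alpha) -> bool:
    """Decision rule after Bonferroni correction (hypothetical helper)."""
    return p_value < alpha


print(round(adjusted_alpha, 4))  # 0.0042
print(is_significant(0.001))     # True  (placeholder p-value)
print(is_significant(0.01))      # False (below 0.05 but above the adjusted level)
```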

Hybrid Dataset and Training Strategy

To evaluate generalization across domains, we designed a hybrid training experiment that combined in silico-generated ROIs with real-world DBT ROIs. We reused the 230 simulated in silico ROIs from the previous experiment and extracted a subset of real-world tumor ROIs from the publicly available Breast Cancer Screening-Digital Breast Tomosynthesis (BCS-DBT) dataset (version 5) by Buda et al. [13], hosted at The Cancer Imaging Archive [64]. From the volumes of the 38 malignant lesions in the BCS-DBT dataset, we extracted the central ROIs of real tumor cases and combined them with the existing in silico dataset. We excluded cases containing either architectural distortions (AD) or tumoral masses with metallic clip markers in the surrounding tissue. Thus, a total of 20 central-slice tumor ROIs were integrated into the new dataset of 250 hybrid DBT ROIs. Image preprocessing, resizing, and mask delineation procedures were the same as described in the previous section for the in silico ROIs.

The dataset of 250 hybrid DBT ROIs was used to retrain the baseline U-Net model, and its performance was assessed on a separate holdout set to determine the effectiveness of combining both data sources for tumor segmentation. The 250 hybrid ROIs were divided into 200 for training (80%), 25 for validation (10%), and 25 for testing (10%). Training was performed on the 200 training samples, with validation on the 25 validation samples between epochs. Once training was completed, performance metrics were evaluated on the 25 hybrid test samples.
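A 200/25/25 partition of the hybrid ROIs can be sketched with the standard library alone. The text does not state how samples were assigned to each subset, so a seeded random shuffle is assumed here, and reusing the seed 42 from the in silico experiment is likewise an assumption; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples: int = 250, n_train: int = 200,
                  n_val: int = 25, seed: int = 42):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices()
print(len(train_idx), len(val_idx), len(test_idx))  # 200 25 25
```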

Finally, after the hybrid model had been trained and evaluated on the hybrid data, predictions were made on a new, independent subset of 20 real-world ROI samples derived from the BCS-DBT dataset.

The Dice loss function was used in this hybrid dataset experiment, as it is a widely used metric and loss function for biomedical image segmentation due to its robustness to class imbalance [65]. All remaining hyperparameter settings were the same as in the in silico experiment.
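A common soft formulation of the Dice loss is sketched below in NumPy. The exact variant used in the experiment is not given in the text, so the additive smoothing term `smooth` and its value are assumptions, included only to keep the ratio defined when both prediction and mask are empty.

```python
import numpy as np


def dice_loss(probs: np.ndarray, targets: np.ndarray, smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - (2 * overlap + s) / (sum(probs) + sum(targets) + s)."""
    probs = np.asarray(probs, dtype=np.float64).ravel()
    targets = np.asarray(targets, dtype=np.float64).ravel()
    intersection = np.sum(probs * targets)
    dice = (2.0 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)
    return float(1.0 - dice)


targets = np.array([1.0, 1.0, 0.0, 0.0])
print(dice_loss(targets, targets))        # perfect overlap: 0.0
print(dice_loss(1.0 - targets, targets))  # no overlap: about 0.8 with smooth = 1
```

Because the loss depends on overlap relative to the total foreground size rather than on a per-pixel average, small tumors are not swamped by the much larger background, which is the class-imbalance property the text refers to.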
