Multi-center evaluation of radiomics and deep learning to stratify malignancy risk of IPMNs

In this large multi-center study of 359 subjects, we demonstrated the feasibility of radiomics and DL approaches for malignancy risk stratification of IPMN lesions using cyst region-level analysis on T2-weighted MRI. Agreement among the visual scoring raters was minimal to moderate, with weighted kappa scores of 0.33–0.67 [41]. The visual scoring accuracy of the majority rating and of Rater 1 was similar to the accuracies of the radiomics-only algorithms on the testing set, and was higher than the accuracies of the fusion algorithms on the testing set and of Raters 2 and 3. Our DenseNet121-based DL-only model achieved the highest performance, with an AUC of 73.3% and accuracy of 68.0%, effectively balancing parameter efficiency and predictive power. The lightweight models, EfficientNet-B0 and ShuffleNet-V2, exhibited lower AUC values, underscoring the trade-off between model complexity and predictive accuracy across diverse architectures. The radiomics-DL fusion algorithm using 2D radiomic features followed closely, with a weighted average AUC of 69.2% and accuracy of 61.6% in testing, and a weighted average AUC of 74.3% and accuracy of 71.0% in cross-validation. Radiomics-only analyses employing 3D features followed, with an AUC of 66.5% and accuracy of 66.1% on testing. Although algorithms using 2D and 3D radiomic features performed comparably on the independent test set, the 3D radiomics results showed a significant discrepancy between cross-validation and independent external testing, suggesting potential overfitting and highlighting the importance of independent-center performance as the clinically meaningful benchmark.
This gap likely reflects the relatively small sample size, potential information leakage, and distribution shifts in patient population and imaging protocols across centers. Moreover, the 3D radiomics fusion model contains numerous voxel-level features that are sensitive to differences in protocols and devices, which may introduce center-specific rather than disease-invariant patterns. Together, these factors contribute to the underperformance on the independent dataset and emphasize the challenges of model generalization in real-world clinical settings. It is important to note that the lower performance of T1 could be attributed to the lower quality of the sequences from the AHN dataset, which made up a larger proportion of the testing set in this study. Images from AHN have lower resolution and more artefacts, which could explain their distribution outside of the primary clusters on UMAP analysis. T1 nevertheless performed well on testing when using 3D radiomic features in the radiomics-only and fusion algorithms, with AUCs of 59.2% (SD: 2.5) and 59.3% (95% CI: 59.1, 63.2), respectively. Overall, these advanced methods demonstrated performance that matched or exceeded a limited-input expert radiologist assessment, highlighting their potential to augment clinical decision-making in IPMN management.
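For readers unfamiliar with the inter-rater statistic quoted above, the following is a minimal sketch of a weighted Cohen's kappa for two raters. It is an illustration only: the text does not state whether linear or quadratic weights were used, so the `weighting` argument, the function name `weighted_kappa`, and the example ratings are all assumptions rather than the study's actual procedure or data.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    `categories` lists the ordinal labels from lowest to highest.
    `weighting` is "linear" or "quadratic" (assumed options; the
    source text does not say which scheme was used).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    index = {label: i for i, label in enumerate(categories)}
    k = len(categories)
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    def disagreement(i, j):
        # Penalty grows with the distance between the two ordinal ratings.
        return (abs(i - j) / (k - 1)) ** power

    pairs = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marginal_a = Counter(index[a] for a in rater_a)
    marginal_b = Counter(index[b] for b in rater_b)

    # Observed weighted disagreement across all rated cases.
    observed = sum(disagreement(i, j) * c for (i, j), c in pairs.items()) / n
    # Weighted disagreement expected by chance from the raters' marginals.
    expected = sum(
        disagreement(i, j) * marginal_a[i] * marginal_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


# Hypothetical ordinal risk scores (0 = low, 1 = intermediate, 2 = high).
scores_a = [0, 0, 1, 1, 2, 2]
scores_b = [0, 1, 1, 2, 2, 0]
kappa = weighted_kappa(scores_a, scores_b, categories=[0, 1, 2])
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which the reported range of 0.33–0.67 is read.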

In our earlier work (Yao et al. 2023), IPMN malignancy risk was classified using an automatic whole-pancreas segmentation algorithm on 246 T1W and T2W MRI scans from five centers [23]. In that work, three algorithms incorporating clinical features (age, gender, body mass index, diabetes mellitus, and chronic pancreatitis) were developed to accomplish this task: radiomics-only, DL-only, and DL-radiomics fusion, using four CNNs and a Vision Transformer (ViT). The algorithms stratified cases as healthy (n = 70), low-grade risk (n = 85), and high-grade risk (n = 91). Our results are not entirely comparable with those of Yao et al. 2023 [23] because we moved from three-class to two-class classification by focusing only on cystic cases. Other key differences are that our earlier study did not include cyst type and that all of its experiments were conducted on a much smaller cohort.

Cui et al. 2021 developed a nomogram to predict the pathological grade of BD-IPMN [42]. The nomogram incorporated clinical features (sex, symptoms, age, CA19-9, and CEA) and radiomic features derived from manually segmented cysts. Their dataset included T2W, T1W, and contrast-enhanced T1W scans from 202 patients collected at three centers. Cases were classified by dysplasia grade as low or high, and 24.8% of their BD-IPMN cases had high-grade dysplasia. Using radiomic features only, they reported specificity, sensitivity, and AUC of 81.6%, 70.0%, and 81.1%, respectively, in validation. Once radiomic and clinical features were combined, their nomogram achieved specificity, sensitivity, and AUC of 79.0%, 90.0%, and 88.4% in validation. While promising, nomograms may perform poorly when applied to populations different from their development cohort, limiting their generalizability across diverse clinical settings [43]. When comparing our studies, their restriction to BD-IPMN could lead to selection bias because BD-IPMN has a lower risk of malignancy. Additionally, their proportion of high-grade dysplasia cases is lower than ours and may not be representative of a real-world IPMN cohort, which we tried to approximate. Furthermore, the authors included additional contrast-enhanced T1W sequences in their analysis, while we confined ourselves to conventional T1W and T2W. In comparing our results, their radiomics-only analysis outperformed ours in AUC and specificity. This is likely because their analysis included several clinical features that are known to be predictive of higher-risk IPMN [2]. We are aware of the significance of clinical features in predicting IPMN malignancy risk and plan to incorporate them into our future analyses. Despite this, our radiomics-only analysis had similar sensitivity to theirs. This is in spite of our inclusion of multiple centers, but could have been due to our larger data set. Overall, the similarities in performance between our two studies suggest that inclusion of clinical markers could augment our algorithm and make it more robust.
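Because the comparison above rests on sensitivity, specificity, and AUC, a minimal sketch of how these three quantities are computed for a binary low-risk versus high-risk classifier may help. The function names, the 0.5 decision threshold, and the labels and scores below are illustrative assumptions; they are not data from either study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = high risk."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one.

    This is the Mann-Whitney formulation; ties count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present in y_true")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical ground truth (1 = high risk) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

sens, spec = sensitivity_specificity(labels, predictions)
area = auc(labels, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is one reason two studies can agree on sensitivity while differing in AUC and specificity.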

To our knowledge, most studies that have used radiomics to classify IPMN are CT-based [44,45,46,47]. MRI is preferred over CT for IPMN classification and monitoring because it involves no radiation exposure, has higher contrast resolution, and is better at assessing tissue and cysts [3, 48]. Furthermore, ours is the most comprehensive study on IPMN malignancy risk stratification that utilizes cyst masks in MRI [20, 42, 49]. Cheng et al. 2022 found superior performance of an MRI radiomics algorithm when compared to CT in predicting IPMN malignant potential [20]. Among studies that have utilized MRI, two analyzed only BD-IPMN and the remainder did not specify IPMN type [20, 23, 42, 49]. We found that 38.5% of our Mixed/MD-IPMN and 78.2% of our BD-IPMN lesions were Low-Risk. This suggests that many pancreatic resections are performed unnecessarily because a lesion is an MD-IPMN, without any further analysis to stratify lesions that may actually be at risk of malignancy. MD-IPMN is frequently surgically resected in patients who do not have contraindications to surgery, as it has a higher risk of malignant transformation than BD-IPMN [2, 3]. Studies that only include BD-IPMN exclude an important and under-investigated subtype. We included MD- and mixed-IPMN in our advanced algorithm training to address this gap.

Our study has several limitations that should be considered. First, its retrospective design inherently limits causal inference and introduces potential biases from historical data collection. The data were collected over two decades, which contributed to variation, including differences in scan quality and uncertainty regarding the accurate grading of dysplasia. The experience levels of operators and pathologists varied across cases, potentially affecting the reliability of dysplasia assessments. Additionally, there was no standardized protocol for selecting cases for EUS-FNA, which may introduce bias since some patients might have undergone EUS for reasons unrelated to the malignancy risk of cystic lesions. Consequently, cytology might have been obtained from cysts that were not classified as high-risk based on imaging. Furthermore, the appearance of cysts may have changed in MRI images taken after the EUS procedure, potentially complicating image analysis. Despite this risk, we have found our results to be reliable using segmentations of visible cystic lesions. Moreover, our cyst segmentations exhibited low interobserver variability (DSC of 75%) relative to current standards in the field. However, IPMN can exhibit complex morphology, and severe IPMN can cause significant distortion of the normal pancreatic anatomy. Precise and consistent segmentation of these severe cases is challenging. Acknowledging these concerns, we thoroughly reviewed the dataset to ensure its suitability for the study.
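The interobserver figure above is a Dice similarity coefficient (DSC), which measures the overlap between two segmentations of the same lesion. The sketch below shows the standard definition on small binary masks; the helper names and the toy masks are hypothetical, and treating two empty masks as perfect agreement is a convention chosen here, not something stated in the text.

```python
def mask_to_voxels(mask):
    """Collect the (row, column) positions of foreground pixels in a 2D mask."""
    return {
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    }


def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient: 2 * |A and B| / (|A| + |B|)."""
    voxels_a = mask_to_voxels(mask_a)
    voxels_b = mask_to_voxels(mask_b)
    total = len(voxels_a) + len(voxels_b)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * len(voxels_a & voxels_b) / total


# Toy cyst outlines from two hypothetical annotators.
annotator_1 = [
    [1, 1, 0],
    [1, 1, 0],
]
annotator_2 = [
    [0, 1, 0],
    [1, 1, 1],
]
dsc = dice_coefficient(annotator_1, annotator_2)
```

In this toy case each annotator marks four pixels and three of them coincide, so the DSC is 2 × 3 / (4 + 4) = 0.75. The same formula extends to 3D masks by adding a slice index to each position.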

An additional limitation results from our exclusive inclusion of sampled cysts. This intentional selection introduced some selection bias. However, focusing on patients at higher risk for malignancy was crucial. The sensitivity of EUS-FNA is limited, and it cannot definitively rule out HGD or malignancy, which matters because some of our cases were diagnosed by EUS. Nevertheless, our observed rates of malignancy risk for BD-IPMNs are similar to or higher than those reported in the broader literature, which frequently includes milder cases [2, 3]. Our dataset was collected from seven institutions using various brands of MRI scanners and field strengths (1.5 T and 3 T) with differing image acquisition protocols. This variability poses analytical challenges and ultimately affects the algorithm’s performance. At the same time, this diversity and heterogeneity strengthen prediction robustness, support stability across different environments, and enhance applicability in real-world clinical settings where imaging protocols frequently vary. Our image analysis was limited to T2W MRI sequences due to data availability constraints. We plan to include and analyze additional MRI sequences in our future studies.

Lastly, radiologist raters utilized only T1W and T2W sequences for expert risk assessment. These sequences alone are insufficient for a thorough visual evaluation and do not fully reflect real-life assessments. Moreover, the radiologist raters lacked access to previous scans, clinical information, and other critical MRI sequences, such as diffusion sequences, that are valuable for accurately estimating risk. These factors could reduce the accuracy of visual scoring relative to standard comprehensive imaging analyses. Our study’s limitations ultimately point to several promising directions for future research. Prospective validation studies with standardized imaging protocols would strengthen the evidence for clinical translation. Integration of clinical parameters and additional MRI sequences could further improve model performance. Development of ensemble approaches that combine imaging features with other biomarkers (cyst fluid analysis, circulating markers) might provide a more comprehensive risk assessment. Finally, extending these methods to predict long-term outcomes rather than cross-sectional histopathology would better align with the clinical goal of identifying lesions likely to progress to malignancy.

In conclusion, our multi-center, pancreatic cyst-focused study demonstrates the feasibility and potential clinical utility of radiomics and DL for IPMN risk stratification using routinely acquired T2W MRI scans. While predictive performance requires further enhancement, our advanced machine learning models achieved performance comparable to, and in some metrics better than, limited-input expert radiologist evaluations in this challenging cohort, while offering greater objectivity and reproducibility than visual assessment. Given that current international consensus guidelines lack optimal specificity for identifying low-risk IPMNs without invasive procedures, computational tools like ours represent a valuable step toward more precise patient selection for intervention versus surveillance. Hence, our findings have immediate clinical relevance. The fusion model’s comparable performance to expert radiologists suggests potential for integration into clinical workflows as a decision support tool. By providing objective risk stratification of IPMNs, our approach could reduce the high rates of unnecessary surgical resection of low-risk lesions, particularly for MD-IPMNs, which are often resected based solely on morphology. Implementation could take the form of a software plugin for radiology workstations, offering real-time risk assessment during routine reads without disrupting workflow. Cost-effectiveness analyses and prospective validation would be logical next steps toward clinical translation.


ABDOMINAL RADIOLOGY
