Diagnostic performance of artificial intelligence models for pulmonary nodule classification: a multi-model evaluation

This study investigated three commercially available AI models for the detection and characterization of pulmonary nodules against histopathological ground truth. The main findings can be summarized as follows: the models showed moderate sensitivities but low specificities. The diagnostic accuracy of two of the software models was limited, while one model failed to classify the majority of nodules as either benign or malignant, which precluded further analysis.

The increasing detection of pulmonary nodules in routine CT scans presents diagnostic challenges, particularly for nodules < 8 mm, which lack clear image features and have high biopsy failure rates. This often leads to repeated follow-up scans, increasing radiation exposure, financial burden and patient anxiety [13]. Furthermore, false-positive findings result in unnecessary invasive procedures and healthcare strain, while false negatives delay diagnosis, giving patients a false sense of security and potentially postponing treatment [14]. Accurate risk-stratified assessment is therefore crucial to improve early cancer detection while minimizing unnecessary interventions [15].

For lung cancer diagnosis, AI has been shown to enhance diagnostic accuracy and efficiency in early detection on thoracic CT scans and chest X-rays and to reduce false positives [16]. Additionally, it has been shown that the use of AI in lung cancer detection can significantly increase the sensitivity of radiologists and reduce the false-positive rate for pulmonary nodules > 5 mm [17]. Moreover, when AI models are applied as a second reader alongside radiologists, sensitivity can be further improved by identifying nodules missed by radiologists. Studies have shown that AI systems can detect pulmonary nodules not identified by radiologists, while radiologists may also detect nodules overlooked by AI, demonstrating the complementary potential of combining human expertise with AI systems [18]. In lung cancer screening, AI could be integrated to support the detection of pulmonary nodules and risk stratification [19]. Further, AI can characterize different types of pulmonary nodules (spiculated, solid, partially solid) with a performance comparable to that of radiologists [20]. For malignancy prediction, AI software models helped to achieve better diagnostic accuracy [21].

Multiple studies have demonstrated high accuracy in classifying lung cancer types using convolutional neural networks (CNNs) [22]. Quanyang et al compared two automated classification models for identifying the histopathological types of lung cancer on CT scans, with accuracies of 72.0–79.0% and AUC values of 0.79–0.82 [23]. Yoo et al evaluated an AI model specifically designed to detect malignant nodules in chest radiographs. The study reported higher sensitivity and specificity compared to radiologists, demonstrating the potential to improve lung cancer detection in X-ray imaging [24].

Further, the performance of AI in classifying pulmonary nodules as benign or malignant was compared to that of radiologists by Hu et al, who showed a higher sensitivity and specificity for the software models than for the radiologists [25]. Hendrix et al investigated the detection and classification of pulmonary nodules. Their AI model achieved high sensitivity of up to 96.6% for benign nodules, primary lung cancer and lung metastases. However, the results of their AI model were validated by radiologist assessments and the national cancer registry [26]. In contrast, our study directly compared the classification performance of three AI models against the histopathological gold standard, ensuring an objective reference for benign and malignant classification. This approach provided a more definitive evaluation of the diagnostic accuracy of AI models, as it avoids the potential bias introduced by radiologist assessments.

The three software models in our study demonstrated moderate sensitivity, low specificity, a moderate false-negative rate, and high false-positive rates. This indicates that all three software models were relatively successful in identifying malignant nodules but had difficulties with the correct classification of benign pulmonary nodules, resulting in a high rate of false positives. A high false-positive rate could lead to overtreatment, including unnecessary invasive procedures.
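The link between these four measures can be made explicit with a minimal sketch. The counts below are hypothetical and chosen only for illustration; they are not data from this study, and the names `ConfusionCounts` and `diagnostic_metrics` are assumptions introduced here. The sketch shows that the false-negative rate is the complement of sensitivity and the false-positive rate is the complement of specificity, so a low specificity necessarily implies a high false-positive rate.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Nodule counts against the histopathological reference (hypothetical)."""
    tp: int  # malignant nodules classified as malignant
    fp: int  # benign nodules classified as malignant
    tn: int  # benign nodules classified as benign
    fn: int  # malignant nodules classified as benign


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Compute sensitivity, specificity and their complements (FNR, FPR)."""
    malignant = c.tp + c.fn
    benign = c.tn + c.fp
    if malignant == 0 or benign == 0:
        raise ValueError("need at least one malignant and one benign nodule")
    sensitivity = c.tp / malignant
    specificity = c.tn / benign
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fnr": 1 - sensitivity,  # false-negative rate
        "fpr": 1 - specificity,  # false-positive rate
    }


# Hypothetical example: 100 malignant and 40 benign nodules.
example = ConfusionCounts(tp=70, fp=24, tn=16, fn=30)
metrics = diagnostic_metrics(example)
```

With these illustrative counts, a moderate sensitivity of 0.70 corresponds to a false-negative rate of 0.30, while a low specificity of 0.40 corresponds to a false-positive rate of 0.60.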

The diagnostic performance of the software models changed when analyzing subgroups based on the size of pulmonary nodules and histopathological classification. Sensitivity decreased in nearly all subgroups, which indicates that the software models struggled with the accurate classification of pulmonary nodules when divided into smaller, more homogeneous subgroups. This finding suggests that smaller pulmonary nodules may be more challenging to classify for the software models.

In contrast, the specificity increased across all three subgroups, indicating improved performance in correctly identifying benign nodules. The wide range of values highlights inconsistencies in the software’s reliability.

The increase in the false-negative rate (FNR) is clinically concerning, as malignant pulmonary nodules could be overlooked, delaying diagnosis and treatment with subsequent negative downstream effects on patient outcomes [27].

Significant differences were identified regarding the influence of independent variables such as slice thickness and the type of sampling (biopsy vs. surgical resection). A likely explanation lies in the effect of these parameters on image quality. Increased slice thickness leads to greater volume averaging, resulting in blurred nodule margins that may be misinterpreted by the software models. Similarly, differences in the scan acquisition may induce variations in lesion characteristics that affect automated evaluation. Additionally, our analyses showed a significant impact of the type of CT scan (breath hold vs. free breathing) on the diagnostic performance of one software model. A possible explanation for this is that the software models were primarily trained on breath-held CT scans, thus resulting in a reduced performance on CT scans acquired in free breathing. Another contributing factor could be suboptimal breath holding, which may introduce motion artifacts such as slight blurring of anatomical structures, potentially leading to artificially increased density measurements and subsequent misclassifications.

The findings underscore the importance of carefully assessing AI tools when using them outside their original training setting or intended use, which was recently corroborated in a study by Li et al [28]. The study demonstrated that the performance of AI models varies depending on the clinical application and, similar to our results, models had relatively low diagnostic accuracy (AUC up to a maximum of 0.6) in the assessment of biopsied pulmonary nodules. The authors therefore concluded that AI models for pulmonary nodule risk assessment are not yet suitable for use as a standalone diagnostic tool in clinical routine.

However, the absence of an influence of the other analyzed potential confounders in our study suggests a certain level of robustness. While the confounder analyses highlighted factors that may have influenced the performance of the software models, non-classified pulmonary nodules (categorized with an intermediate malignancy risk) represented the most severe limitation of the software models from a clinical viewpoint. Depending on the model, up to half of all pulmonary nodules were not classified as malignant or benign based on the vendors’ dichotomous interpretation guidelines. Of these intermediate-risk nodules, the majority (44.4–88.5%) showed malignant histopathology. These results could lead to misdiagnosis of pulmonary nodules and thus negatively impact downstream management. Therefore, it is important to benchmark these models, which are often heavily dependent on their training dataset and adjusted for prevalence, against real-world data in order to find optimal cutoff values for individualized risk prediction.
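As an illustration of what benchmarking a cutoff against local data could look like, the following sketch selects the threshold that maximizes Youden's J (sensitivity + specificity − 1). This is one common criterion, not the method used in this study or by the vendors. The risk scores and labels are hypothetical, and the function name `youden_cutoff` is an assumption introduced here.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1.

    scores: model-reported malignancy risk per nodule (hypothetical values).
    labels: 1 for malignant, 0 for benign on histopathology.
    A nodule is called malignant when its score is >= cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malignant and one benign nodule")

    best_cutoff, best_j = None, -1.0
    # Every observed score is a candidate cutoff.
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # ties keep the lowest cutoff
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical local validation set, not study data.
scores = [0.10, 0.25, 0.40, 0.55, 0.60, 0.70, 0.85, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
cutoff, j = youden_cutoff(scores, labels)
```

Youden's J weights false negatives and false positives equally. In a clinical setting where a missed malignancy is costlier than an unnecessary follow-up, a criterion that penalizes false negatives more heavily may be more appropriate.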

This study has limitations. It included a consecutive patient cohort, which involved different scanner manufacturers, varying slice thicknesses and different acquisition techniques. Although this reflects real-world conditions, it may have introduced selection bias, thereby reducing the models’ ability to extract distinguishing features between benign and malignant nodules, which likely influenced the AUC values. Additionally, the unequal distribution of benign and malignant nodules in the dataset may have influenced the diagnostic accuracies.

Another limitation of this study is the retrospective design, with patient inclusion being based on available histopathology and the resulting inability to report free-response receiver operating characteristic (FROC) analysis. In cases with multiple pulmonary metastases, only one pulmonary nodule was biopsied or surgically resected for histopathological confirmation, while additional nodules were not systematically evaluated. As a result, the true status (benign or malignant) of these non-sampled pulmonary nodules remains unknown. Assuming all non-sampled nodules to be benign (as would be required to calculate FROC curves) could introduce substantial bias. Therefore, sensitivity and specificity were calculated based on nodules with confirmed histology, which may, in turn, limit the comprehensive performance assessment of the models. Future studies with systematic histological assessment or follow-up of all detected nodules are needed to address this limitation.

One further limitation is that the software models were trained on incidental pulmonary nodules rather than being explicitly developed for high-risk populations such as the study sample. As a result, their performance may not optimally transfer to this specific patient group, potentially contributing to the observed results.

It is important to note that the sample size in this study was relatively small for the image classification task, which may have limited the models’ performance and generalizability. In addition, the sample size in the subgroups was limited, which may have further restricted the generalizability of the findings, as imbalanced distributions could have introduced bias, complicating the interpretation of the results. These limitations emphasize the need for larger datasets and more balanced subgroup distributions in future multicenter studies to enable more robust model training and support more reliable conclusions.

Additionally, in clinical practice, the characterization of pulmonary nodules is based on the evaluation of multiple variables such as patient history and other risk factors, especially for nodules with an intermediate malignancy risk. Clinical predictors are necessary to assess the risk of these nodules using the analyzed AI models. However, in this study, the software models assessed the pulmonary nodules without contextual information, relying solely on imaging-based classification. This may be considered another limitation, as it does not fully reflect real-world clinical decision-making. Future studies may assess the impact of multimodal models incorporating clinical predictors as additional variables during analysis.

EUROPEAN RADIOLOGY
