Transformer-Based RT-DETR Framework for Accurate Chest X-Ray Disease Detection

Chest X-ray disease detection plays a crucial role in early diagnosis and treatment, particularly for conditions such as lung cancer and pneumonia [1], [2], [4], [6]. Detecting these diseases at an early stage significantly improves survival rates. Research indicates that allowing symptomatic patients to self-refer for chest X-rays could accelerate the diagnostic process and facilitate timely treatment. Existing work falls into four themes: classical approaches, CNN-based techniques, Transformer-based techniques, and hybrid and advanced deep learning models. This structure traces the progression from traditional to modern AI-based methods in chest X-ray disease detection.

Early chest X-ray (CXR) disease detection relied heavily on handcrafted features and manual preprocessing. These traditional image analysis methods, although foundational, often struggled to deliver high accuracy and computational efficiency. Divo et al. [4] emphasized that the growing elderly population significantly raises the demand for advanced diagnostic tools, particularly for lung disease detection. Çallı et al. [5] noted that traditional approaches often failed to deliver precise results, leading researchers to explore more advanced techniques. The emergence of artificial intelligence (AI) transformed medical image analysis by allowing models to learn meaningful features directly from large datasets, eliminating the need for manual feature extraction.

The advent of deep learning, particularly convolutional neural networks (CNNs), revolutionized medical imaging by enabling automatic feature extraction and improving diagnostic accuracy. Avramescu et al. [7] highlighted that bounding box supervision has significantly enhanced the classification and localization of lung pathologies in CXR images. As the demand for multi-disease classification grew, researchers such as Yimer et al. [8] developed deep learning models capable of handling multiple diseases, improving both diagnostic accuracy and scalability. This progress has led to increased exploration of human-AI collaboration in CXR analysis, fostering advances in automated and assistive diagnostic systems. AI has substantially advanced CXR analysis, with CNN-based models and attention-based networks demonstrating notable efficacy. CNNs have been foundational in medical image processing because of their ability to learn spatial features automatically, and they have been applied extensively in CXR analysis. Studies have shown that CNN-based architectures, including ResNet and DenseNet, achieve high performance in detecting lung diseases such as pneumonia and tuberculosis [9].

Vision Transformers (ViTs) have become a compelling alternative to CNNs, using self-attention mechanisms to model long-range dependencies in images. Jain et al. [10] compared CNNs, ResNet, and ViTs for multi-class classification of chest diseases and found that pre-trained ViT models outperformed CNN and ResNet architectures in multi-label classification tasks. Their findings suggest that ViTs hold strong potential for accurate and automated diagnosis of lung conditions from CXR images. Attention-based models enhance feature representation by focusing on the most relevant regions of the input data, improving interpretability and performance. Çallı et al. [11] conducted a systematic review, concluding that attention-based models generally outperform traditional CNNs in medical image analysis. Their research highlights that attention mechanisms, such as Transformer-based networks and ensemble methods, provide better transparency and decision-making in clinical applications. Nevertheless, misclassification remains a critical issue, as AI models may diagnose diseases incorrectly because of variations in data quality, imbalanced datasets, or biases in the training data. Furthermore, transparency and explainability in deep learning models still require significant research. Current studies indicate that heat maps and attention mechanisms can aid radiologists in understanding AI-based predictions [3].
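To make the notion of long-range dependencies concrete, the following is a minimal sketch of single-head scaled dot-product self-attention. It is an illustration only, not the implementation used in any of the cited works: it assumes identity projections (queries, keys, and values are the input tokens themselves), whereas real ViTs use learned projection matrices, multiple heads, and patch embeddings. The function names `softmax` and `self_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: Q = K = V = tokens (no learned projections).
    Returns the attended outputs and the attention weight matrix.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, tokens)) for c in range(d)])
    return outputs, weights


# Three toy "patch" embeddings; the first and third are identical but not adjacent.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended, attn = self_attention(patches)
```

In this toy input, the first patch attends to the third patch as strongly as to itself, regardless of the patch lying between them. This position-independent weighting is what lets attention relate distant image regions in a single layer.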

The literature also includes several hybrid and enhanced models that push the boundaries of CXR diagnosis. Deep learning for CXR disease detection is evolving rapidly, with numerous studies introducing novel approaches to improve diagnostic accuracy, efficiency, and explainability. Guo et al. [12] introduced a deep learning-based approach for tuberculosis diagnosis that outperformed traditional techniques by efficiently identifying affected regions in CXR images. Fan et al. [13] applied the YOLOv5 algorithm to abnormality detection, demonstrating its real-time efficiency and computational advantages in clinical settings. Similarly, Bharati et al. [14] developed a hybrid model that integrates CNNs with machine learning classifiers to enhance diagnostic performance.

Recent efforts have also focused on capturing complex anatomical relationships within CXR images. Lian et al. [15] introduced a network that incorporates relational information to improve thoracic disease detection and segmentation. Lin et al. [16] designed a scalable attention residual CNN that refines lesion detection by extracting fine-grained features, while Yuan et al. [17] proposed a multi-scale lesion feature fusion model that significantly improves multi-disease classification by integrating features at different scales. In addition, several studies have sought to improve abnormality localization and detection accuracy. Nguyen et al. [18] combined a bounding coordinate-based lung abstraction approach with object detection algorithms to improve lesion localization. Sheng et al. [19] introduced BarlowTwins-CXR, a cross-domain self-supervised learning model that addresses dataset variability in abnormality localization. Xu and Duan [20] developed DualAttNet, a fusion model that integrates fine-grained and image-level disease attention mechanisms for more precise multi-lesion detection.

To further improve diagnostic efficiency, Lin et al. [21] proposed CXR-RefineDet, which is optimized for computationally efficient detection of multiple lesions. Ngo et al. [22] also explored the potential of deep learning in abnormality detection, demonstrating significant gains in diagnostic accuracy. Beyond traditional CNN architectures, researchers have introduced advanced models to capture complex patterns in medical imaging. Hashmi et al. [23] developed a compound-scaled model for pneumonia detection that surpassed conventional CNNs in sensitivity, specificity, and overall accuracy. Hao et al. [24] addressed the increasing demand for multi-disease detection by introducing the YOLO-CXR network, which incorporates RefConv layers to improve feature extraction and enhance small-lesion detection. Lastly, Zhang et al. [25] presented NSEC-YOLO, an approach aimed at improving both the accuracy and efficiency of CXR detection. This model incorporates a noise filtering mechanism to minimize background distractions, a comprehensive feature aggregation head for improved classification and localization, and a refined AccurEIOU-Loss function to optimize training for precise detection results. These advances reflect continuous progress in deep learning for CXR analysis, with improvements in lesion detection, classification accuracy, computational efficiency, and clinical applicability.

Despite numerous advances in deep learning-based chest X-ray disease detection, existing models continue to face critical limitations. These include suboptimal multi-scale feature representation, limited contextual awareness, and difficulty in accurately localizing small or overlapping lesions, particularly in low-contrast images. Additionally, many frameworks lack the architectural efficiency and interpretability that are essential for clinical deployment. Current convolutional and attention-based methods often fall short in capturing hierarchical spatial and semantic relationships. To bridge these gaps, the proposed model introduces a hybrid architecture that integrates HGBlock and HGStem modules in the backbone to enhance multi-scale spatial feature encoding. The AIFI and REpc3 modules are employed to facilitate rich contextual feature aggregation. Furthermore, the model leverages the RT-DETR decoder to enable efficient query-based object detection, improving both classification accuracy and localization precision.
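As a rough illustration of what query-based detection means at the output stage, the sketch below converts per-query class logits and boxes into final detections by ranking every (query, class) score and keeping the top few, with no anchor boxes and no non-maximum suppression. This is a simplified, assumed view of DETR-style post-processing, not the authors' code: the decoder's attention layers are not reproduced, and the function name `decode_queries` and the parameters `top_k` and `score_threshold` are hypothetical.

```python
import math


def decode_queries(logits, boxes, top_k=2, score_threshold=0.5):
    """Turn per-query outputs into detections without anchors or NMS.

    logits: one list of class logits per object query.
    boxes:  one (cx, cy, w, h) box per object query, normalized to [0, 1].
    Both the box format and the default values are assumptions for illustration.
    """
    # Sigmoid score for every (query, class) pair, flattened into one list.
    scored = [
        (1.0 / (1.0 + math.exp(-logit)), q, c)
        for q, row in enumerate(logits)
        for c, logit in enumerate(row)
    ]
    # Rank all pairs by confidence and keep the best top_k above the threshold.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        {"label": c, "score": s, "box": boxes[q]}
        for s, q, c in scored[:top_k]
        if s >= score_threshold
    ]


# Toy example: three queries, two classes; the third query predicts "no object".
toy_logits = [[2.0, -3.0], [-1.0, 0.5], [-4.0, -4.0]]
toy_boxes = [(0.3, 0.4, 0.2, 0.2), (0.7, 0.5, 0.1, 0.3), (0.5, 0.5, 0.9, 0.9)]
detections = decode_queries(toy_logits, toy_boxes)
```

Because each query is trained to predict at most one object, low-scoring queries simply drop out of the ranking, which is why this style of decoder does not need anchor design or duplicate suppression.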

The contributions of the proposed model are as follows:

1. Integrated HGBlock and HGStem blocks in the backbone to enhance multi-scale spatial feature representation, directly addressing challenges related to low contrast, small lesions, and overlapping anatomical structures commonly encountered in chest X-ray images.
2. Employed the AIFI and REpc3 modules to refine semantic feature aggregation and contextual understanding, mitigating misclassification caused by the limited receptive field and weak contextual encoding of standard CNNs.
3. Incorporated the RT-DETR decoder to enable efficient query-based object detection, which overcomes the limitations of anchor-based approaches and improves the localization of subtle or overlapping abnormalities.
4. Validated the model on the VinBigData CXR dataset, achieving 55.7% precision, 43% recall, and 45.3% mAP, demonstrating the method's superior performance and generalizability in a real-world, multi-label disease detection scenario.
5. Assessed model robustness and reliability through the paired t-test, Kruskal-Wallis test, and Wilcoxon test to establish the statistical significance of the performance improvements over existing methods.
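The paired comparisons named in the last contribution can be illustrated with a minimal sketch that computes the paired t statistic and the Wilcoxon signed-rank sums for two models evaluated on the same runs. The scores below are made-up numbers for illustration only and are not results from this paper. The function names are hypothetical, and p-values are deliberately omitted because they require the reference distributions. In practice a statistics library would normally be used: SciPy, for example, provides `scipy.stats.ttest_rel`, `scipy.stats.wilcoxon`, and `scipy.stats.kruskal`.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus): rank sums of positive and negative differences.

    Zero differences are dropped; tied absolute differences share an average rank.
    """
    ordered = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical mAP scores (in percent) from five matched runs of two models.
proposed = [45, 46, 44, 47, 45]
baseline = [43, 45, 44, 44, 42]
t_stat, dof = paired_t_statistic(proposed, baseline)
w_plus, w_minus = wilcoxon_signed_rank(proposed, baseline)
```

The t-test assumes roughly normal paired differences, while the Wilcoxon test relies only on their ranks, so reporting both guards against a conclusion that depends on the distributional assumption. The Kruskal-Wallis test, which compares more than two independent groups, is not covered by this sketch.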

The paper is organized as follows: Section 1 presents the introduction and related work on chest X-ray detection using traditional and deep learning methods. Section 2 describes the proposed methodology, including the architecture and training process. Section 3 details the experimental setup, presents the results, and provides an in-depth discussion and comparison with baseline models. Finally, Section 4 concludes the paper and outlines future research directions.
