Gliomas are the most common primary intracranial tumors. Patients often experience significant changes in central nervous system function, with typical clinical symptoms including headaches, vomiting, seizures, cognitive impairment, and papilledema[1,2]. Clinically, the diagnosis of glioma relies primarily on MRI examinations. MRI-based medical imaging is one of the most accurate and reliable methods for diagnosing and analyzing gliomas, because MRI can acquire images in multiple sequences, each providing different tissue contrast and biological information that aids in comprehensively assessing the nature and characteristics of the tumor[3]. Together, these MRI sequences allow accurate evaluation of the lesion's size, location, and extent of recurrence.
In MR images of patients with glioma, T1-weighted imaging (T1WI) is sensitive to the fat and protein content of brain tissue. Tumors therefore often present signals distinct from the surrounding normal brain tissue, providing information about the tumor's location, morphology, size, and tissue characteristics, which is crucial for detecting and localizing gliomas. Contrast-enhanced T1WI images obtained after administering contrast agents can show the tumor's blood supply and vascular structure, helping assess the tumor's aggressiveness, boundaries, and relationship with surrounding structures. Diffusion-weighted imaging (DWI) measures the degree of free diffusion of water molecules in tissue and is highly sensitive for evaluating tumor cell density and activity, aiding in differentiating tumors from cystic changes, abscesses, and other lesions. Yet actual clinical diagnosis typically demands a holistic assessment of tissue morphology, structural details, functional dynamics, and metabolic changes, information that no single imaging technique can adequately provide. To reach an accurate diagnosis, radiologists fuse the feature information of different modality images into a single image. Compared with the source images, the fused image has not only better brightness and quality but also richer information. It can provide accurate tumor localization and boundary information, helping surgeons excise the tumor precisely, guiding radiotherapy treatment plans, reducing damage to surrounding normal tissue, and comprehensively evaluating tumor activity and invasiveness so that appropriate treatment strategies can be formulated for each patient[4]. Thus, combining multi-modal medical images to highlight critical tissue and organ characteristics is essential for accurate clinical diagnosis. Image fusion combines data from aligned images while preserving the original information[5]. Within medical imaging, AI-assisted diagnostic systems leverage multimodal image fusion to improve diagnostic accuracy while significantly shortening interpretation time[6].
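As a concrete illustration of what pixel-level fusion means (a simple weighted-average baseline, not the method proposed in this work), the sketch below blends two co-registered, intensity-normalized slices with a fixed weight; the random arrays merely stand in for real registered MRI data:

```python
import numpy as np

def weighted_fusion(t1: np.ndarray, dwi: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Pixel-wise weighted average of two co-registered, normalized slices."""
    assert t1.shape == dwi.shape, "fusion assumes spatially aligned inputs"
    return alpha * t1 + (1.0 - alpha) * dwi

# Toy example: random arrays standing in for registered T1WI and DWI slices.
t1 = np.random.rand(256, 256).astype(np.float32)
dwi = np.random.rand(256, 256).astype(np.float32)
fused = weighted_fusion(t1, dwi, alpha=0.6)  # (256, 256) fused slice
```

Such fixed-weight rules ignore image content entirely, which is precisely the shortcoming that learned fusion methods aim to address.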
Given the diversity and complexity of medical images, the fusion process is prone to losing data in the fused images, and there are nonlinear tissue-contrast differences between modalities[7]. The integration of deep learning into medical image processing has gone a long way toward addressing this challenge. Current deep learning fusion approaches rely mainly on convolutional neural network (CNN) and generative adversarial network (GAN) architectures[8]. CNNs have strong feature extraction capabilities and a local inductive bias, which allow them to represent feature spaces effectively, but they struggle to learn global semantic information. Moreover, because of their small local receptive field, CNNs cannot effectively capture long-distance dependencies between different regions of an image[9]. GANs can adjust adversarial losses according to the features of different modality images, continuously optimizing the generator and discriminator to select appropriate fusion weights and improve fusion quality. However, training GAN-based models often runs into convergence difficulties, which unbalance the adversarial game[10] and yield suboptimal fusion outcomes.
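The contrast between the two inductive biases can be made concrete with a short PyTorch sketch (illustrative only; all layer sizes are arbitrary): a single 3x3 convolution mixes each output pixel with at most nine neighbors, whereas one self-attention layer lets every spatial position attend to all others:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # one single-channel 32x32 feature map

# A 3x3 convolution has a local receptive field: each output pixel depends
# on at most 9 input pixels, so distant regions cannot interact in one layer.
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
local_feats = conv(x)  # (1, 8, 32, 32)

# Self-attention treats every spatial position as a token: each output token
# attends to all 1024 positions, capturing long-range dependencies in one step.
tokens = x.flatten(2).transpose(1, 2)   # (1, 1024, 1) pixel tokens
proj = nn.Linear(1, 16)                 # embed each pixel value
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
emb = proj(tokens)                      # (1, 1024, 16)
global_feats, _ = attn(emb, emb, emb)   # (1, 1024, 16)
```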
To overcome these limitations, we present CvTFuse, a novel dual-branch medical image fusion network that integrates CNN and Vision Transformer architectures. The proposed framework consists of three core components: an encoder, a fusion layer, and a decoder. The encoder incorporates parallel CNN and Transformer branches that extract local features and global representations from the source images, respectively. Leveraging CNNs' powerful feature extraction capabilities and inherent local inductive bias, together with Transformers' ability to model long-range dependencies, our approach processes local and global information separately. The key contributions of the proposed fusion method are as follows (illustrative sketches of each component follow the list):
•A dual-branch medical image fusion network combining a CNN and a vision transformer is proposed. The CNN branch captures the local features of the image, whereas the vision transformer branch captures long-distance dependencies and global features.
•To fully capture contextual information, we propose a global context aggregation module that aggregates the multi-scale features extracted by the transformer branch.
•An energy-aware, gradient-enhanced fusion strategy is designed, which uses gradient information to preserve the edges and detailed features of the source images, thereby improving the quality of the fused images.
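To make the dual-branch idea concrete, the following PyTorch sketch pairs a small convolutional stack with a patch-embedding transformer encoder; the layer widths, patch size, and depths are placeholders rather than the actual CvTFuse configuration:

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Illustrative two-branch encoder: a conv branch for local features and
    a transformer branch for global context. Hyperparameters are placeholders."""

    def __init__(self, in_ch=1, dim=64, patch=8):
        super().__init__()
        # CNN branch: stacked 3x3 convolutions with a local inductive bias.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer branch: non-overlapping patch embedding, then self-attention.
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        local = self.cnn(x)                                    # (B, dim, H, W)
        p = self.patch_embed(x)                                # (B, dim, H/8, W/8)
        b, c, h, w = p.shape
        tokens = p.flatten(2).transpose(1, 2)                  # (B, h*w, dim)
        glob = self.transformer(tokens)                        # (B, h*w, dim)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)        # back to a feature map
        return local, glob

enc = DualBranchEncoder()
local_f, global_f = enc(torch.randn(2, 1, 64, 64))
```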
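The global context aggregation module is described here only at a high level; one plausible reading, sketched below under that assumption, upsamples every transformer scale to the finest resolution and fuses them with a 1x1 convolution (the channel dimensions are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextAggregation(nn.Module):
    """Hypothetical multi-scale aggregation: upsample each scale to the finest
    resolution, concatenate, and fuse with a 1x1 conv. The paper's actual
    module may differ."""

    def __init__(self, dims=(64, 128, 256), out_dim=64):
        super().__init__()
        self.fuse = nn.Conv2d(sum(dims), out_dim, kernel_size=1)

    def forward(self, feats):
        target = feats[0].shape[-2:]  # spatial size of the finest scale
        up = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
              for f in feats]
        return self.fuse(torch.cat(up, dim=1))

gca = GlobalContextAggregation()
feats = [torch.randn(1, 64, 32, 32),   # fine scale
         torch.randn(1, 128, 16, 16),  # middle scale
         torch.randn(1, 256, 8, 8)]    # coarse scale
ctx = gca(feats)  # (1, 64, 32, 32) aggregated context
```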
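Similarly, the energy-aware, gradient-enhanced fusion rule can be read as weighting each source feature map by a saliency score built from local energy plus a Sobel gradient term; the sketch below is one such interpretation, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def energy_gradient_fusion(fa, fb, win=3, lam=0.5):
    """Illustrative rule: local energy (windowed mean of squared activations)
    plus a gradient-magnitude term decide the per-pixel fusion weight."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3).contiguous()  # vertical Sobel kernel

    def saliency(f):
        m = f.mean(dim=1, keepdim=True)   # channel-averaged map, (B, 1, H, W)
        energy = F.avg_pool2d(m ** 2, win, stride=1, padding=win // 2)
        grad = torch.sqrt(F.conv2d(m, kx, padding=1) ** 2 +
                          F.conv2d(m, ky, padding=1) ** 2 + 1e-8)
        return energy + lam * grad        # gradient term boosts edge regions

    sa, sb = saliency(fa), saliency(fb)
    wa = sa / (sa + sb + 1e-8)            # soft weight favoring the more salient source
    return wa * fa + (1.0 - wa) * fb

fused = energy_gradient_fusion(torch.randn(1, 64, 32, 32),
                               torch.randn(1, 64, 32, 32))
```

Because the gradient term raises the weight of whichever source has stronger local structure, edges and fine details are favored in the fused result, which matches the stated goal of the strategy.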