Infrared and Laser Engineering, Volume 54, Issue 8, 20250210 (2025)

Infrared and visible image fusion based on cross-modal feature interaction and multi-scale reconstruction

Rui YAO*, Kai WANG, Haofan GUO, Wentao HU, and Xiangrui TIAN
Author Affiliations
  • School of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Objective: The goal of infrared and visible light image fusion is to amalgamate image data from disparate sensors into a unified representation that preserves the complementary information and salient features inherent to both modalities. The autoencoder framework offers significant advantages for image fusion: its encoder can efficiently extract image features, and its decoder can precisely reconstruct the image. Moreover, by introducing mechanisms such as attention, the issues of detail loss and information retention can be effectively addressed. However, existing fusion methods still have several shortcomings: 1) conventional autoencoder frameworks struggle to learn deep features effectively, leading to gradient vanishing during image reconstruction; 2) in low-light conditions, existing algorithms are frequently impeded by the poor quality of visible light images, which significantly degrades the overall fusion quality; 3) existing deep learning-based algorithms are often computationally intensive and thus unsuitable for real-time applications. Therefore, a novel infrared and visible light image fusion algorithm based on cross-modal feature interaction and multi-scale reconstruction, CFIMRFusion, is proposed.

Methods: The proposed algorithm comprises four key components: a convolutional attention enhancement module, an encoder network, a cross-modal feature interaction fusion module, and a decoder network based on multi-scale reconstruction (Fig.1). The convolutional attention enhancement module extracts features through convolutional operations and enhances the contrast and texture visibility of degraded visible light images, thereby strengthening their detailed features. The encoder employs convolutions to extract deep features from both the infrared images and the enhanced visible light images. To fully leverage the multi-modal features of infrared and visible light images, a cross-modal feature interaction fusion module is developed, which performs complementary fusion of the infrared and visible light features via a channel-spatial attention mechanism. Additionally, considering that an encoder structure with direct connections is prone to feature vanishing during training, a decoder network based on multi-scale reconstruction is developed, in which the fused features extracted by the encoders at different levels are skip-connected to the decoder network.
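To make the channel-spatial attention fusion described above more concrete, the following is a minimal PyTorch sketch of how complementary cross-modal fusion of this kind can be wired. It is an illustrative assumption rather than the paper's implementation: the class name ChannelSpatialFusion, the channel count, the reduction ratio, and the layer choices are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' code) of complementary fusion of
# infrared and visible features via channel attention followed by spatial attention.
import torch
import torch.nn as nn


class ChannelSpatialFusion(nn.Module):
    """Hypothetical cross-modal fusion block: each modality's feature map is
    reweighted by channel and spatial attention, then the two branches are merged."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: global average pooling, then a per-channel bottleneck.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: mean/max pooling over channels, then a 7x7 convolution.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Merge the two attended modalities back to the original channel count.
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def _attend(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_att(x)  # channel reweighting
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial_att(pooled)  # spatial reweighting

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
        return self.merge(torch.cat([self._attend(f_ir), self._attend(f_vis)], dim=1))


# Usage with dummy 64-channel feature maps extracted by the encoder.
f_ir, f_vis = torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128)
print(ChannelSpatialFusion(64)(f_ir, f_vis).shape)  # torch.Size([1, 64, 128, 128])
```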
Results and Discussions: In the objective comparison with GANMcC, SwinFuse, U2Fusion, LapH, MUFusion, CMRFusion, and TUFusion, the proposed algorithm obtains six optimal values and one suboptimal value on both the TNO dataset (Tab.2) and the LLVIP dataset (Tab.3), and three optimal values and four suboptimal values on the MSRS dataset (Tab.4). The subjective evaluations on the TNO, LLVIP, and MSRS datasets are shown in Fig.3-Fig.5, respectively; the fused images exhibit superior detail features and overall visual effects compared with the benchmark algorithms. Moreover, the fusion time is only 24.1%, 23.86%, and 25.2% of that of the fastest comparison algorithm on the three datasets, respectively.

Conclusions: A novel infrared and visible light image fusion algorithm based on cross-modal feature interaction and multi-scale reconstruction, CFIMRFusion, has been developed to enhance information acquisition in low-light conditions and to produce fused images with clear details and improved visual effects. Initially, the degraded visible light image is fed into a convolutional attention enhancement module to obtain enhanced features. Subsequently, the enhanced visible light image and the infrared image are fed into an autoencoder network to extract multi-scale features, which are then complementarily fused via the cross-modal feature interaction fusion module. Finally, a decoder network based on multi-scale reconstruction generates the fused image with rich details and clear structures. Experimental results indicate that, compared with the best comparison algorithm, the fused images of CFIMRFusion achieve a 15.8% and 18.2% increase in average gradient and edge intensity, respectively, on the TNO dataset; an 11.5% and 9.5% increase in mutual information and standard deviation on the LLVIP dataset; and a 10.1% increase in edge intensity on the MSRS dataset. Ablation studies further demonstrate the effectiveness of each module: the complete model outperforms partial module combinations in terms of the AG, EI, SD, SF, and VIF metrics. Regarding computational efficiency, CFIMRFusion achieves the shortest average runtime and can therefore satisfy the real-time requirements of low-light applications with tight temporal constraints.
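Similarly, the multi-scale reconstruction decoder with skip connections outlined above can be pictured with the rough sketch below. The channel configuration, the bilinear upsampling, and the single-channel sigmoid output are assumptions for illustration only, not details taken from the paper.

```python
# Rough sketch (assumed structure) of a decoder that reconstructs the fused image
# from multi-scale fused features, with skip connections from each encoder level.
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # 3x3 convolution + ReLU, used at every decoder level.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))


class MultiScaleDecoder(nn.Module):
    """Hypothetical decoder: start from the deepest fused feature map and, at each
    level, concatenate the skip-connected fused feature of the same resolution."""

    def __init__(self, channels=(256, 128, 64)):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # One block per upsampling step: input = upsampled deep feature + skip feature.
        self.blocks = nn.ModuleList(
            conv_block(c_deep + c_skip, c_skip)
            for c_deep, c_skip in zip(channels[:-1], channels[1:])
        )
        self.out = nn.Conv2d(channels[-1], 1, 1)  # assumed single-channel fused image

    def forward(self, feats):
        # feats: fused features ordered shallow (full resolution) to deep (coarsest).
        x = feats[-1]
        for skip, block in zip(reversed(feats[:-1]), self.blocks):
            x = block(torch.cat([self.up(x), skip], dim=1))
        return torch.sigmoid(self.out(x))


# Dummy multi-scale fused features: 64 ch at full, 128 at 1/2, 256 at 1/4 resolution.
f1 = torch.randn(1, 64, 128, 128)
f2 = torch.randn(1, 128, 64, 64)
f3 = torch.randn(1, 256, 32, 32)
print(MultiScaleDecoder()([f1, f2, f3]).shape)  # torch.Size([1, 1, 128, 128])
```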

    Paper Information

    Category: Optical imaging, display and information processing

    Received: Apr. 3, 2025

    Published Online: Aug. 29, 2025

    The Author Email: Rui YAO (yaorui@nuaa.edu.cn)

    DOI: 10.3788/IRLA20250210
