Objective
The goal of infrared and visible light image fusion is to integrate image data from disparate sensors into a unified representation that preserves the complementary information and salient features inherent to both modalities. The autoencoder framework offers significant advantages for image fusion: its encoder can efficiently extract image features, and its decoder can precisely reconstruct the image. Moreover, by introducing mechanisms such as attention, the issues of detail loss and information retention can be effectively addressed. However, existing fusion methods still have notable shortcomings: 1) conventional autoencoder frameworks struggle to learn deep features effectively, leading to gradient vanishing during image reconstruction; 2) in low-light conditions, existing algorithms are frequently impeded by the poor quality of visible light images, which significantly degrades the overall fusion quality; 3) existing deep learning-based algorithms are often computationally intensive and thus unsuitable for real-time applications. Therefore, a novel infrared and visible light image fusion algorithm based on cross-modal feature interaction and multi-scale reconstruction, CFIMRFusion, has been proposed.
Methods
The proposed algorithm comprises four key components: a convolutional attention enhancement module, an encoder network, a cross-modal feature interaction fusion module, and a decoder network based on multi-scale reconstruction (Fig.1). The convolutional attention enhancement module extracts features through convolutional operations and improves the contrast and texture visibility of degraded visible light images, thereby restoring their detailed features. The encoder module employs convolutions to extract deep features from both the infrared images and the enhanced visible light images. To fully exploit the multi-modal features of infrared and visible light images, a cross-modal feature interaction fusion module has been developed, which performs complementary fusion of the infrared and visible light features via a channel-spatial attention mechanism. Additionally, considering that an encoder structure with direct connections is prone to feature vanishing during training, a decoder network based on multi-scale reconstruction has been developed, in which the fused features extracted by encoders at different levels are skip-connected to the decoder network.
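As a rough illustration of the channel-spatial attention fusion described above, the following PyTorch sketch shows one plausible way to interact and fuse infrared and visible feature maps. The class name, channel widths, and layer choices are illustrative assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn

class ChannelSpatialAttentionFusion(nn.Module):
    """Illustrative channel-spatial attention fusion of infrared and visible features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatial dimensions, then weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, (2 * channels) // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d((2 * channels) // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: weight each location from pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Project the interacted features back to the encoder channel width.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_ir, feat_vis):
        x = torch.cat([feat_ir, feat_vis], dim=1)        # cross-modal concatenation
        x = x * self.channel_mlp(x)                       # channel attention
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        x = x * self.spatial_conv(torch.cat([avg_map, max_map], dim=1))  # spatial attention
        return self.project(x)                            # fused feature map

In this sketch, channel attention re-weights the concatenated infrared and visible channels, while spatial attention emphasizes salient regions, so that complementary information from both modalities is preserved in the fused feature map.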
Results and Discussions
In the objective comparison of the proposed algorithm with GANMcC, SwinFuse, U2Fusion, LapH, MUFusion, CMRFusion, and TUFusion, the proposed algorithm obtained six optimal values and one suboptimal value on the TNO dataset (Tab.2) and the LLVIP dataset (Tab.3), and three optimal values and four suboptimal values on the MSRS dataset (Tab.4). The subjective evaluations on the TNO, LLVIP, and MSRS datasets are shown in Fig.3-Fig.5, respectively. The fused images exhibit superior detail features and overall visual effects compared with the benchmark algorithms. Moreover, the average fusion time is only 24.1%, 23.86%, and 25.2% of that of the fastest comparison algorithm on the three datasets, respectively.
Conclusions
A novel infrared and visible light image fusion algorithm based on cross-modal feature interaction and multi-scale reconstruction, CFIMRFusion, has been developed to enhance information acquisition in low-light conditions and produce fused images with clear details and improved visual effects. Initially, the degraded visible light image is fed into a convolutional attention enhancement module to obtain enhanced features. Subsequently, the enhanced visible light image and the infrared image are fed into an autoencoder network to extract multi-scale features, which are then complementarily fused via a cross-modal feature interaction fusion module. Finally, a decoder network based on multi-scale reconstruction is employed to generate the fused image with rich details and clear structures. Experimental results indicate that, compared with the best-performing comparison algorithm, the fused images of CFIMRFusion achieved increases of 15.8% and 18.2% in average gradient (AG) and edge intensity (EI), respectively, on the TNO dataset; increases of 11.5% and 9.5% in mutual information and standard deviation (SD), respectively, on the LLVIP dataset; and a 10.1% increase in edge intensity on the MSRS dataset. Moreover, ablation studies further demonstrate the effectiveness of each module of the algorithm: the complete model outperforms partial module combinations in terms of the AG, EI, SD, SF, and VIF metrics. Regarding computational efficiency, the proposed CFIMRFusion algorithm achieves the shortest average runtime and is thus capable of satisfying the real-time requirements of low-light applications with strict temporal constraints.
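To make the multi-scale reconstruction step above concrete, the sketch below outlines one plausible skip-connected decoder in PyTorch; the class name, channel widths (256/128/64), and up-sampling choices are assumptions for illustration rather than the paper's actual configuration.

import torch
import torch.nn as nn

class MultiScaleDecoder(nn.Module):
    """Illustrative decoder that reconstructs the fused image from three encoder levels."""
    def __init__(self, channels=(256, 128, 64)):
        super().__init__()
        c3, c2, c1 = channels
        self.up3 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(c3, c2, 3, padding=1), nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(2 * c2, c1, 3, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(2 * c1, 1, 3, padding=1)   # single-channel fused image

    def forward(self, fused_l1, fused_l2, fused_l3):
        # fused_l1..fused_l3: fused features from shallow to deep encoder levels.
        x = self.up3(fused_l3)
        x = self.up2(torch.cat([x, fused_l2], dim=1))    # skip connection, level 2
        x = self.out(torch.cat([x, fused_l1], dim=1))    # skip connection, level 1
        return torch.sigmoid(x)

Concatenating each decoder stage with the fused features of the matching encoder scale, rather than decoding from the deepest features alone, is what keeps shallow detail features from vanishing during reconstruction.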