Acta Optica Sinica, Volume. 45, Issue 15, 1510007(2025)
Mask-Guided Two-Stage Infrared and Visible Image Fusion Network
Owing to current technical limitations and the influence of shooting environments, images captured by a single type of sensor often fail to describe a scene comprehensively. Visible sensors produce images with rich texture details, but the image quality is easily degraded in harsh environments; infrared sensors, on the other hand, produce images with distinct salient targets but few texture details. Infrared and visible image fusion aims to integrate the complementary information of the two modalities into a single image whose targets are more prominent and whose texture details are abundant, thereby facilitating downstream visual tasks. Although most existing methods generate satisfactory fused images in typical scenarios, their performance remains unsatisfactory in strong light scenes. To obtain fused images with prominent targets and rich texture details under strong light, we propose a mask-guided two-stage infrared and visible image fusion network.
Considering the problem of blurred salient targets and texture details under strong light conditions, we design a salient object detection (SOD) network in the first stage to extract salient targets from the infrared image and thereby guide feature extraction and reconstruction. Specifically, we embed the RepVGG block into U2-Net to achieve structural reparameterization, which effectively enhances U2-Net's ability to extract salient targets. In the second stage, to counter the degradation of texture details in strong light scenes, we build a scene segmentation and enhancement module (SSEM) in the encoder stage. This module uses the discrete wavelet transform to extract high-frequency information and injects it into the hierarchical features, thereby supplementing texture details. In addition, to avoid interference from redundant information between the foreground and background, a dual-branch feature fusion module (DFFM) is designed to fuse the foreground and background separately during the fusion stage and then merge them into a unified feature map. Moreover, in the decoder stage, a mask-guided reconstruction module (MGRM) is proposed to fully exploit the mask and address the problem of indistinct salient targets in strong light scenes. This module employs two sets of channel and spatial attention modules to extract critical information: it enhances salient regions while suppressing redundant information in both the channel and spatial dimensions, improving the network's ability to extract crucial features. Regarding the loss functions, in the first training stage the loss comprises the losses of the side-output saliency maps and the final fused saliency map, which constrains the multi-scale saliency maps to focus on key features. In the second training stage, a pixel intensity loss and a structural similarity loss guide the fused image to retain salient targets, texture details, and structural information.
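To make the module descriptions above more concrete, the following PyTorch sketch illustrates two of the ideas: Haar-wavelet extraction of high-frequency detail in the spirit of SSEM, and mask-guided channel and spatial attention in the spirit of MGRM. This is not the authors' implementation; the module names, filter choices, channel sizes, and reduction ratio are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): a Haar-DWT high-frequency
# extractor in the spirit of SSEM, and a mask-guided channel/spatial
# attention block in the spirit of MGRM. All design choices here are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HaarHighFreq(nn.Module):
    """Extract LH/HL/HH high-frequency sub-bands with a single-level Haar DWT."""
    def __init__(self):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        # Four 2x2 analysis filters, applied per channel with stride 2.
        self.register_buffer("filters", torch.stack([ll, lh, hl, hh]).unsqueeze(1))

    def forward(self, x):                      # x: (B, C, H, W), H and W even
        b, c, h, w = x.shape
        subbands = F.conv2d(x.reshape(b * c, 1, h, w), self.filters, stride=2)
        subbands = subbands.reshape(b, c, 4, h // 2, w // 2)
        high = subbands[:, :, 1:].flatten(1, 2)          # drop LL, keep LH/HL/HH
        # Upsample so the high-frequency cue matches the hierarchical feature size.
        return F.interpolate(high, size=(h, w), mode="bilinear", align_corners=False)


class MaskGuidedAttention(nn.Module):
    """Channel + spatial attention modulated by a salient-target mask."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, feat, mask):             # feat: (B, C, H, W), mask: (B, 1, H, W)
        # Emphasize mask-covered salient regions before computing attention weights.
        guided = feat * (1.0 + mask)
        # Channel attention from globally pooled descriptors.
        ca = self.channel_mlp(guided.mean(dim=(2, 3)))[:, :, None, None]
        feat = feat * ca
        # Spatial attention from channel-wise max and mean maps.
        sa = self.spatial_conv(torch.cat(
            [feat.max(dim=1, keepdim=True).values, feat.mean(dim=1, keepdim=True)],
            dim=1))
        return feat * sa
```

In this sketch the mask only reweights the pooled descriptors before attention; how the mask enters the actual MGRM is described only qualitatively in the abstract.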
During the testing stage, comparative experiments are conducted on the TNO, RoadScene, and FLIR datasets. The proposed method is compared with nine other methods, and eight quantitative indicators are used to assess the performance of these ten methods. In the qualitative experiments, the proposed method generates fusion results with clear salient targets and rich texture details under both normal and strong light conditions, whereas the compared methods do not. In the quantitative experiments, the proposed method achieves the best discrete cosine feature mutual information (FMIdct), wavelet feature mutual information (FMIw), and visual information fidelity (VIF) values, and the second-best mutual information (MI) value. This indicates that the proposed method effectively transfers relevant information from the source images to the fused image while maintaining high visual fidelity. Finally, ablation experiments are conducted to validate the effectiveness of the proposed salient object detection network, the scene segmentation and enhancement module, and the mask-guided reconstruction module, confirming the contribution of each module.
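For reference, the MI indicator mentioned above can be estimated from joint grey-level histograms, as in the minimal sketch below; the bin count and the summation over both source images are common conventions and are assumptions here, not necessarily the exact evaluation protocol of the paper.

```python
# Minimal sketch of a histogram-based mutual information (MI) estimate for
# fusion evaluation; parameters (256 bins, uint8 inputs) are assumptions.
import numpy as np


def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """MI between two single-channel uint8 images of equal size."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of image a
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of image b
    nz = p_ab > 0
    return float((p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz])).sum())


def fusion_mi(fused: np.ndarray, ir: np.ndarray, vis: np.ndarray) -> float:
    """Information the fused image shares with the infrared and visible sources."""
    return mutual_information(fused, ir) + mutual_information(fused, vis)
```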
In this study, we propose a mask-guided two-stage infrared and visible image fusion method that consists of two phases. First, the RepVGG block is introduced to reparameterize U2-Net, which enhances its ability to extract salient targets. Second, an autoencoder-based infrared and visible image fusion network is designed: in the encoder phase, the wavelet transform is employed to extract high-frequency information that supplements texture details; in the fusion phase, the foreground and background are fused separately to minimize redundant information; finally, an attention mechanism incorporates the extracted foreground targets to supplement the salient information. The fused images contain richer texture details and more prominent salient targets. Applying this method to strong light scenes can mitigate the information loss caused by intense illumination, providing a novel image processing strategy for preserving salient information under strong light conditions.
Xiaodong Zhang, Dianwei Zhang, Yuanyuan Li, Shanshan Peng, Long Zhang. Mask-Guided Two-Stage Infrared and Visible Image Fusion Network[J]. Acta Optica Sinica, 2025, 45(15): 1510007
Category: Image Processing
Received: Apr. 8, 2025
Accepted: May 12, 2025
Published Online: Aug. 15, 2025
The Author Email: Dianwei Zhang (z23070050@s.upc.edu.cn)
CSTR:32393.14.AOS250859