Acta Optica Sinica, Volume. 45, Issue 15, 1510007(2025)
Mask-Guided Two-Stage Infrared and Visible Image Fusion Network
Owing to current technical limitations and the influence of shooting environments, images captured by a single type of sensor often fail to describe a scene comprehensively. Visible sensors produce images with rich texture details, but the image quality is easily degraded in harsh environments; infrared sensors, on the other hand, produce images with distinct salient targets but few texture details. Infrared and visible image fusion aims to integrate the complementary information of the two modalities into a single image whose targets are more prominent and whose texture details are abundant, thereby facilitating downstream visual tasks. Although most existing methods generate satisfactory fused images in typical scenarios, their performance remains unsatisfactory in strong light scenes. To obtain fused images with prominent targets and rich texture details under strong light, we propose a mask-guided two-stage infrared and visible image fusion network.
Considering the problem of blurred salient targets and texture details under strong light conditions, we design a salient object detection (SOD) network in the first stage to extract salient targets from the infrared image and thereby guide feature extraction and reconstruction. Specifically, we embed the RepVGG block into U2-Net to achieve structural reparameterization, which effectively enhances U2-Net's ability to extract salient targets. In the second stage, to counter the degradation of texture details in strong light scenes, we build a scene segmentation and enhancement module (SSEM) in the encoder stage. This module uses the discrete wavelet transform to extract high-frequency information and injects it into the hierarchical features, thereby supplementing texture details. In addition, to avoid interference from redundant information between the foreground and background, a dual-branch feature fusion module (DFFM) is designed to fuse the foreground and background separately during the fusion stage and then merge them into a unified feature map. Moreover, in the decoder stage, a mask-guided reconstruction module (MGRM) is proposed to fully exploit the mask and address the problem of indistinct salient targets in strong light scenes. This module employs two sets of channel and spatial attention modules to extract critical information: it enhances salient regions while suppressing redundant information in both the channel and spatial dimensions, improving the network's ability to extract crucial features. Regarding the loss functions, in the first training stage the loss comprises the losses of the side-output saliency maps and the final fused saliency map, which constrains the multi-scale saliency maps to focus on key features. In the second training stage, a pixel intensity loss and a structural similarity loss guide the fused image to retain salient targets, texture details, and structural information.
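To make the module descriptions above more concrete, the following PyTorch sketch illustrates two of the ideas: Haar-wavelet extraction of high-frequency detail in the spirit of SSEM, and mask-guided channel and spatial attention in the spirit of MGRM. This is not the authors' implementation; the module names, filter choices, channel sizes, and reduction ratio are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): a Haar-DWT high-frequency
# extractor in the spirit of SSEM, and a mask-guided channel/spatial
# attention block in the spirit of MGRM. All design choices here are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HaarHighFreq(nn.Module):
    """Extract LH/HL/HH high-frequency sub-bands with a single-level Haar DWT."""
    def __init__(self):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        # Four 2x2 analysis filters, applied per channel with stride 2.
        self.register_buffer("filters", torch.stack([ll, lh, hl, hh]).unsqueeze(1))

    def forward(self, x):                      # x: (B, C, H, W), H and W even
        b, c, h, w = x.shape
        subbands = F.conv2d(x.reshape(b * c, 1, h, w), self.filters, stride=2)
        subbands = subbands.reshape(b, c, 4, h // 2, w // 2)
        high = subbands[:, :, 1:].flatten(1, 2)          # drop LL, keep LH/HL/HH
        # Upsample so the high-frequency cue matches the hierarchical feature size.
        return F.interpolate(high, size=(h, w), mode="bilinear", align_corners=False)


class MaskGuidedAttention(nn.Module):
    """Channel + spatial attention modulated by a salient-target mask."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, feat, mask):             # feat: (B, C, H, W), mask: (B, 1, H, W)
        # Emphasize mask-covered salient regions before computing attention weights.
        guided = feat * (1.0 + mask)
        # Channel attention from globally pooled descriptors.
        ca = self.channel_mlp(guided.mean(dim=(2, 3)))[:, :, None, None]
        feat = feat * ca
        # Spatial attention from channel-wise max and mean maps.
        sa = self.spatial_conv(torch.cat(
            [feat.max(dim=1, keepdim=True).values, feat.mean(dim=1, keepdim=True)],
            dim=1))
        return feat * sa
```

In this sketch the mask only reweights the pooled descriptors before attention; how the mask enters the actual MGRM is described only qualitatively in the abstract.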
During the testing stage, comparative experiments are conducted on the TNO, RoadScene, and FLIR datasets. The proposed method is compared with nine other methods, and eight quantitative indicators are used to assess the performance of these ten methods. In the qualitative experiments, the proposed method generates fusion results with clear salient targets and rich texture details under both normal and strong light conditions, whereas the compared methods do not. In the quantitative experiments, the proposed method achieves the best discrete cosine feature mutual information (FMIdct), wavelet feature mutual information (FMIw), and visual information fidelity (VIF) values, and the second-best mutual information (MI) value. This indicates that the proposed method effectively transfers relevant information from the source images to the fused image while maintaining high visual fidelity. Finally, ablation experiments are conducted to validate the effectiveness of the proposed salient object detection network, the scene segmentation and enhancement module, and the mask-guided reconstruction module, confirming the contribution of each module.
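For reference, the MI indicator mentioned above can be estimated from joint grey-level histograms, as in the minimal sketch below; the bin count and the summation over both source images are common conventions and are assumptions here, not necessarily the exact evaluation protocol of the paper.

```python
# Minimal sketch of a histogram-based mutual information (MI) estimate for
# fusion evaluation; parameters (256 bins, uint8 inputs) are assumptions.
import numpy as np


def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """MI between two single-channel uint8 images of equal size."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of image a
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of image b
    nz = p_ab > 0
    return float((p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz])).sum())


def fusion_mi(fused: np.ndarray, ir: np.ndarray, vis: np.ndarray) -> float:
    """Information the fused image shares with the infrared and visible sources."""
    return mutual_information(fused, ir) + mutual_information(fused, vis)
```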
In this study, we propose a mask-guided two-stage infrared and visible image fusion method that consists of two phases. First, the RepVGG block is introduced to reparameterize U2-Net, which enhances its ability to extract salient targets. Second, an autoencoder-based infrared and visible image fusion network is designed: in the encoder phase, the wavelet transform is employed to extract high-frequency information that supplements texture details; in the fusion phase, the foreground and background are fused separately to minimize redundant information; finally, an attention mechanism incorporates the extracted foreground targets to supplement the salient information. The fused images contain richer texture details and more prominent salient targets. Applying this method to strong light scenes can mitigate the information loss caused by intense illumination, providing a novel image processing strategy for preserving salient information under strong light conditions.
Xiaodong Zhang, Dianwei Zhang, Yuanyuan Li, Shanshan Peng, Long Zhang. Mask-Guided Two-Stage Infrared and Visible Image Fusion Network[J]. Acta Optica Sinica, 2025, 45(15): 1510007
Category: Image Processing
Received: Apr. 8, 2025
Accepted: May 12, 2025
Published Online: Aug. 15, 2025
The Author Email: Dianwei Zhang (z23070050@s.upc.edu.cn)
CSTR:32393.14.AOS250859