Acta Optica Sinica, Volume. 45, Issue 11, 1110001(2025)
Semantic Information Driven Multimodal Image Fusion Network
The increasing demand in target detection, identification, and military surveillance calls for rapid and accurate detection of traffic targets, particularly in complex environments such as smoke or nighttime conditions. Infrared (IR) and visible light (VIS) image fusion plays a vital role in these scenarios by combining complementary information from the two modalities: the detailed edges and textures of VIS images and the thermal radiation information of IR images. However, conventional fusion algorithms rely on hand-crafted rules to combine deep features, which can discard important information, and the limited explainability of deep learning models compounds this problem, constraining further algorithmic improvement. We introduce the semantic information driven multimodal image fusion (SIDM-Fusion) network, designed to identify targets such as people, vehicles, and backgrounds effectively and accurately in complex scenes.
The SIDM-Fusion network is a fusion framework built on an autoencoder model. It comprises two essential components: the dual-branch, layer-by-layer interactive feature coding network (DBL-IFCN) and the semantic prior classification network (SPC-Net). The DBL-IFCN adaptively extracts multi-scale and salient features, addressing the challenges posed by varying target sizes and complex environments. The SPC-Net employs a lightweight encoder to extract deep features and uses the class-activation mapping (CAM) mechanism to implement a learnable, binary-classification-based fusion rule. The fused features then pass through a pre-trained decoder to generate the final image. To minimize information loss during multimodal fusion, content loss and saliency loss constrain the SIDM-Fusion network, and a cross-entropy loss regulates the SPC-Net. Comprehensive experiments on multiple datasets, including the multi-sensor road scenes (MSRS) and Netherlands Organisation for Applied Scientific Research (TNO) datasets, validate the network's semantic segmentation accuracy and evaluate its overall performance.
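As a concrete illustration of the loss design described above, the following PyTorch sketch shows one way the content, saliency, and cross-entropy terms could be combined. The exact formulations and weighting coefficients are not reproduced here; the Sobel-based gradient term, the saliency mask ir_mask, and the weights alpha, beta, and gamma are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn.functional as F

def gradient(img):
    # img: (B, 1, H, W) grayscale tensor. Sobel-based gradient magnitude
    # (an assumed, common choice for the content loss's gradient term).
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    kx = kx.to(img)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def content_loss(fused, ir, vis):
    # Keep the stronger intensity and the sharper edges of the two sources.
    l_int = F.l1_loss(fused, torch.max(ir, vis))
    l_grad = F.l1_loss(gradient(fused), torch.max(gradient(ir), gradient(vis)))
    return l_int + l_grad

def saliency_loss(fused, ir, vis, ir_mask):
    # ir_mask: saliency map marking thermally salient regions (assumed input);
    # pull salient regions toward IR and the remainder toward VIS.
    return F.l1_loss(ir_mask * fused, ir_mask * ir) + \
           F.l1_loss((1 - ir_mask) * fused, (1 - ir_mask) * vis)

def total_loss(fused, ir, vis, ir_mask, logits, labels,
               alpha=1.0, beta=1.0, gamma=0.1):
    # Cross-entropy regulates SPC-Net's binary (IR/VIS) classification;
    # alpha, beta, gamma are hypothetical trade-off weights.
    l_ce = F.cross_entropy(logits, labels)
    return alpha * content_loss(fused, ir, vis) + \
           beta * saliency_loss(fused, ir, vis, ir_mask) + gamma * l_ce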
The SIDM-Fusion network exhibits superior performance across the datasets. Compared with alternative fusion algorithms, it improves the average gradient (AG) and mutual information (MI) metrics by approximately 22% and 14%, respectively. Fig. 1 depicts the structure of the network model, while Figs. 5 and 6 show visual quality comparisons across datasets. The ablation results, presented in Table 3 and Fig. 7, demonstrate the essential contributions of the MEGB, the significantly dense residual block (SDRB), the spatial bias module (SBM), and the SPC-Net to overall performance. Specifically, the MEGB enhances texture details, the SDRB strengthens pixel distribution, and the SBM facilitates progressive feature fusion. Integrating the SPC-Net further optimizes the intensity distribution while preserving key features. The final model performs well on all evaluation metrics, validating the effectiveness of the proposed architecture. In addition, semantic segmentation experiments confirm the network's feasibility, demonstrating robust performance under diverse weather and lighting conditions and making it suitable for practical applications.
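For reference, the two metrics quoted above can be computed from their standard definitions, as in the NumPy sketch below; bin count and border handling are assumptions, since the paper's exact measurement code is not given here.

import numpy as np

def average_gradient(img):
    # AG: mean magnitude of horizontal/vertical intensity differences
    # over a 2D grayscale image.
    img = img.astype(np.float64)
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    dx, dy = dx[:-1, :], dy[:, :-1]  # crop to matching (H-1, W-1) shapes
    return np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0))

def mutual_information(a, b, bins=256):
    # MI between two grayscale images via their joint histogram.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz]))

Fusion MI is commonly reported as the sum MI(fused, IR) + MI(fused, VIS), so that both sources' contributions are counted.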
To overcome the limitation of existing methods that prioritize visual effects over support for high-level tasks, the SIDM-Fusion network is engineered to deliver more accurate information for real-world applications by addressing both visual quality and utility in complex environments. The DBL-IFCN preserves edge information, enhances texture details, and improves edge clarity. The SPC-Net captures pixel correlations within and across local windows through the CAM mechanism: class-activation weights assign fusion weights to feature channels, implementing a learnable fusion rule based on binary classification, so that visual quality is no longer gained at the expense of practical image functionality. Experimental results demonstrate that the proposed method efficiently exploits the complementary information of multimodal images and significantly reduces information loss. Future research will focus on optimizing the network's speed and efficiency to enhance its practical applicability across domains.
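The CAM-driven fusion rule can be sketched in PyTorch as follows. This is a minimal illustration assuming a simplified SPC-Net in which a lightweight binary classifier's final-layer weights score each feature channel toward the IR or VIS class and are reused as per-channel fusion weights; the module name CAMFusion and all sizes are hypothetical, not the paper's exact design.

import torch
import torch.nn as nn

class CAMFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.fc = nn.Linear(channels, 2, bias=False)  # binary IR/VIS classifier

    def forward(self, feat_ir, feat_vis):
        # Classify pooled channel descriptors of each modality's feature maps;
        # the logits are supervised with the cross-entropy loss mentioned above.
        logits_ir = self.fc(self.pool(feat_ir).flatten(1))   # (B, 2)
        logits_vis = self.fc(self.pool(feat_vis).flatten(1))
        # Class-activation weights: each row of the classifier's weight matrix
        # scores every channel's contribution to that class.
        w_ir = self.fc.weight[0].view(1, -1, 1, 1)   # channel scores for "IR"
        w_vis = self.fc.weight[1].view(1, -1, 1, 1)  # channel scores for "VIS"
        # Softmax over the two scores yields a per-channel fusion weight,
        # i.e., a learnable, binary-classification-based fusion rule.
        w = torch.softmax(torch.stack([w_ir, w_vis]), dim=0)
        fused = w[0] * feat_ir + w[1] * feat_vis
        return fused, logits_ir, logits_vis

Because the weights are learned jointly with the classifier, the fusion rule adapts to which channels carry class-discriminative (semantic) information instead of following a hand-crafted rule.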
Yulan Han, Yaozu Zhai, Tong Wu, Chaofeng Lan. Semantic Information Driven Multimodal Image Fusion Network[J]. Acta Optica Sinica, 2025, 45(11): 1110001
Category: Image Processing
Received: Jan. 26, 2025
Accepted: Apr. 15, 2025
Published Online: Jun. 23, 2025
The Author Email: Yulan Han (hanyulan@hrbust.edu.cn)
CSTR:32393.14.AOS250551