Journal of Infrared and Millimeter Waves, Vol. 44, Issue 2, 275 (2025)
BDMFuse: Multi-scale network fusion for infrared and visible images based on base and detail features
The fusion of infrared and visible images should emphasize the salient targets in the infrared image while preserving the textural details of the visible image. To meet these requirements, an autoencoder-based method for infrared and visible image fusion is proposed. The encoder, designed according to an optimization objective, consists of a base encoder and a detail encoder, which are used to extract low-frequency and high-frequency information from the image. This extraction may leave some information uncaptured, so a compensation encoder is proposed to supplement the missing information. Multi-scale decomposition is also employed to extract image features more comprehensively. The decoder combines the low-frequency, high-frequency, and supplementary information to obtain multi-scale features. Subsequently, an attention strategy and a fusion module are introduced to perform multi-scale fusion for image reconstruction. Experimental results on three datasets show that the fused images generated by this network effectively retain salient targets while being more consistent with human visual perception.
Introduction
Infrared and visible image fusion is a vital task in image processing. Infrared images highlight salient thermal targets even under poor illumination, whereas visible images provide rich textural details; fusing the two produces a single image that combines the strengths of both modalities.
To maximize the retention of useful information in fused images, various methods have been developed. Traditional image fusion techniques operate in either the spatial domain or the transform domain. Spatial domain methods process image pixels directly, whereas transform domain methods operate on a frequency representation of the image obtained with transformations such as the Fourier transform or the wavelet transform.
In recent years, the feature extraction capabilities of deep learning have attracted significant attention from researchers. Studies have demonstrated that deep learning methods can substantially enhance the quality of fused images.
Although methods for infrared and visible image fusion are relatively mature, unresolved issues remain. Increasing the number of network layers enhances the network's expressive ability, but it also exacerbates the loss of image detail, which limits further improvement of fusion quality. Our proposed network effectively addresses this problem, with its main contributions as follows:
(1) Based on the optimization objective, the encoder network constructs base and detail encoders to extract low- and high-frequency information, and a compensation encoder to supplement the information missed by the other two.
(2) The decoder network first fuses the low-frequency, high-frequency, and compensatory information at each scale to obtain multi-scale features. These features are then multiplied by the corresponding attention maps, and a Fusion module is introduced to perform multi-scale fusion for image reconstruction.
(3) Compared with nine other fusion algorithms on three public datasets, the proposed method demonstrates superior fusion results in both visual assessment and objective evaluation.
1 Related works
1.1 Methods based on autoencoder
An autoencoder is an unsupervised learning model commonly used for tasks such as data compression and feature extraction. The encoder maps input data into a low-dimensional feature space, while the decoder reconstructs the original data from this reduced space. Through this process, the autoencoder retains essential information and effectively captures significant feature representations.
In image processing, the feature extraction capabilities of autoencoders are extensively utilized in image fusion tasks. Image fusion aims to combine information from multiple source images to generate richer and more informative results. In this field, the autoencoder is usually divided into three parts: the encoder, the decoder, and the fusion layer. The encoder automatically learns feature information from the source images without manually designed features. The decoder maps the low-dimensional features extracted by the encoder back to the original space, adaptively reconstructing the image. Typically, autoencoder-based image fusion methods train the encoder and decoder during the training phase; in the testing phase, a fusion layer, such as an addition or maximum strategy, is introduced to merge the features of the infrared and visible images.
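The encoder-fusion-decoder pipeline described above can be sketched as follows. The toy layer sizes and the addition fusion strategy are illustrative assumptions, not the exact design of any particular method.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Toy encoder: maps a 1-channel image to a feature map."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class ConvDecoder(nn.Module):
    """Toy decoder: reconstructs a 1-channel image from features."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )
    def forward(self, f):
        return self.net(f)

def fuse(encoder, decoder, ir, vis):
    """Test-time fusion: encode both sources, merge the features
    with an addition strategy, then decode once."""
    with torch.no_grad():
        f = encoder(ir) + encoder(vis)   # fusion layer: addition
        return decoder(f)

encoder, decoder = ConvEncoder(), ConvDecoder()
ir = torch.rand(1, 1, 64, 64)    # stand-in infrared image
vis = torch.rand(1, 1, 64, 64)   # stand-in visible image
fused = fuse(encoder, decoder, ir, vis)
print(fused.shape)  # torch.Size([1, 1, 64, 64])
```

During training only the encoder and decoder are optimized for reconstruction; the `fuse` step is applied only at test time, mirroring the strategy described above.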
Li et al. proposed DenseFuse, a representative autoencoder-based method in which an encoder with densely connected convolutional layers and a reconstruction decoder are trained jointly, while addition and l1-norm fusion strategies merge the infrared and visible features at test time.
1.2 Multi-scale transform
Multi-scale transform is a technique for processing and analyzing information by decomposing an image or signal into components at different scales or frequencies.
With the advancement of deep learning, researchers have integrated multi-scale transforms into deep learning-based networks. Lin et al. [24], for example, combined cross-scale feature fusion with a bottleneck attention module for light-weight infrared small target detection.
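One classical instance of such a decomposition is the Laplacian pyramid, sketched below with simple average pooling and bilinear upsampling; the three-level setting and the pooling-based filter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """Decompose an image into `levels` band-pass (detail) layers
    plus a final low-frequency residual, as in a Laplacian pyramid."""
    pyramid, cur = [], img
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)                      # coarser scale
        up = F.interpolate(down, size=cur.shape[-2:],
                           mode='bilinear', align_corners=False)
        pyramid.append(cur - up)                         # detail band
        cur = down
    pyramid.append(cur)                                  # low-frequency base
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and summing the bands."""
    cur = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        cur = band + F.interpolate(cur, size=band.shape[-2:],
                                   mode='bilinear', align_corners=False)
    return cur

img = torch.rand(1, 1, 64, 64)
bands = laplacian_pyramid(img)
rec = reconstruct(bands)
print(torch.allclose(rec, img, atol=1e-5))  # True: the decomposition is invertible
```

The invertibility shown here is what makes multi-scale decompositions attractive for fusion: bands can be merged per scale and the result collapsed back into a single image.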
2 Proposed fusion method
This section provides a detailed description of the proposed architecture for infrared and visible image fusion, covering the encoder, the decoder, and the training details.
2.1 Network architecture
Our proposed infrared and visible image fusion method (BDMFuse) adopts an end-to-end network architecture. The encoder comprises three branches: a base encoder, a detail encoder, and a compensation encoder. The base encoder extracts low-frequency information from the image, the detail encoder captures high-frequency information, and the compensation encoder gathers information not captured by the other two. Each encoder operates at three scales, and their outputs are fed into the decoder. Multi-scale features are first generated by summing the low-frequency, high-frequency, and compensatory information at each scale; these features are then fused to reconstruct the image. The overall architecture of the training and testing phases is shown in Fig. 1.
Figure 1.The overall network of BDMFuse: (a) overall structure of the training phase; (b) overall structure of the testing phase
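The three-branch, three-scale encoder described above can be sketched as follows. The branch layers here are identical toy convolutions, whereas in BDMFuse the base, detail, and compensation branches are designed differently; channel counts and downsampling by average pooling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def branch(ch=8):
    # Identical toy branch; in BDMFuse the three branches differ
    # (base / detail / compensation), which is not reproduced here.
    return nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())

class ThreeBranchEncoder(nn.Module):
    """Base, detail and compensation branches, each applied at
    three scales obtained by successive 2x downsampling."""
    def __init__(self, ch=8, scales=3):
        super().__init__()
        self.base, self.detail, self.comp = branch(ch), branch(ch), branch(ch)
        self.scales = scales
    def forward(self, x):
        feats = []
        for _ in range(self.scales):
            # Sum the three branches' outputs at this scale,
            # mirroring the decoder's first merging step.
            feats.append(self.base(x) + self.detail(x) + self.comp(x))
            x = F.avg_pool2d(x, 2)
        return feats

enc = ThreeBranchEncoder()
feats = enc(torch.rand(1, 1, 64, 64))
print([f.shape[-1] for f in feats])  # [64, 32, 16]
```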
2.2 Encoder
Inspired by the literature, the encoders are designed according to explicit optimization objectives. The base encoder seeks a filter whose output minimizes the reconstruction error while remaining smooth, so that its features capture the low-frequency content of the image. The detail encoder minimizes a complementary objective that emphasizes high-frequency variation, so that its features capture the textural details. The compensation encoder also obtains its feature information by solving for an optimal value, in a manner similar to the detail encoder, and its compensation filter supplements the information missed by the other two encoders.
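A minimal sketch of such a base/detail/compensation split follows, using a simple box filter as a stand-in for the optimized base filter (an assumption; the learned filters of BDMFuse are not reproduced here).

```python
import torch
import torch.nn.functional as F

def box_blur(img, k=5):
    """Box filter as a stand-in for the optimised base filter."""
    pad = k // 2
    return F.avg_pool2d(F.pad(img, (pad,) * 4, mode='reflect'), k, stride=1)

img = torch.rand(1, 1, 64, 64)
base = box_blur(img)            # low-frequency content
detail = img - base             # high-frequency content
# With this exact split base + detail reproduces the image, so the
# compensation term is zero; with learned, lossy filters it would
# carry the information the other two branches miss.
comp = img - (base + detail)
print(torch.allclose(base + detail + comp, img))  # True
```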
The base encoder network is depicted in Fig. 2.
Figure 2.Base encoder schematic
The detail encoder network is illustrated in Fig. 3.
Figure 3.Detail encoder schematic
The compensation encoder is shown in Fig. 4.
Figure 4.Compensation encoder schematic
2.3 Decoder
2.3.1 Decoder network
Initially, the corresponding low-frequency and high-frequency information at the different scales is fused, as shown in Fig. 5.
Figure 5.Decoder schematic: (a) BDC-fusion; (b) multi-fusion
To let the network learn more multi-scale information, we obtain the attention map at each scale, multiply it with the corresponding multi-scale features, and introduce the Fusion module, shown in Fig. 6, to perform the multi-scale fusion for image reconstruction.
Figure 6.Fusion module
2.3.2 Attention strategy
The attention map at each scale is computed from the corresponding multi-scale fused features, and the weight for each scale is obtained by normalizing these maps before they are multiplied with the features. Taking Attention-3 as an example, the computation procedure is illustrated in Fig. 7.
Figure 7.Attention strategy
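The exact attention formulation of BDMFuse is not reproduced here; the sketch below shows only the generic pattern of deriving a normalized spatial map from features and multiplying it back, with channel averaging and a sigmoid as assumed stand-ins.

```python
import torch

def spatial_attention(feat):
    """Toy spatial attention: average over channels, then squash
    to (0, 1) with a sigmoid. BDMFuse's exact formulation differs."""
    amap = feat.mean(dim=1, keepdim=True)   # (B, 1, H, W) attention map
    return torch.sigmoid(amap)

# Features at three scales (64, 32, 16), as in the multi-scale decoder.
feats = [torch.rand(1, 8, 64 >> i, 64 >> i) for i in range(3)]
weighted = [f * spatial_attention(f) for f in feats]
print(all(w.shape == f.shape for w, f in zip(weighted, feats)))  # True
```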
2.4 Training strategy
Our training strategy is similar to that of DenseFuse. In the training phase, the fusion network is discarded. Through training, we expect the encoder to extract multi-scale depth features and the decoder to reconstruct the image from these features. This strategy leaves more freedom of choice for the fusion layer.
In the training phase, the loss function L_total combines a pixel loss and a structural loss: L_total = L_pixel + λ L_SSIM, where λ balances the two terms. The pixel loss L_pixel = ||O − I||² measures the pixel-wise reconstruction error, where O and I denote the output and input images, respectively. The Fusion module is introduced in the decoder network to retain more feature information; however, it tends to ignore the structural information of the image to some extent. To address this, the structural loss L_SSIM = 1 − SSIM(O, I) is added, where SSIM(·) represents the structural similarity measure: the larger the value of SSIM(·), the higher the structural similarity between the output image O and the input image I.
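A minimal sketch of a reconstruction loss of this form — pixel error plus λ times an SSIM term — follows. The uniform-window SSIM below is a simplification (Gaussian windows are more common), and λ = 10 matches the training setting stated later.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, k=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM with a uniform window (simplified)."""
    mu_x, mu_y = (F.avg_pool2d(t, k, 1, k // 2) for t in (x, y))
    sx = F.avg_pool2d(x * x, k, 1, k // 2) - mu_x ** 2
    sy = F.avg_pool2d(y * y, k, 1, k // 2) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, k, 1, k // 2) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sxy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2)
    return (num / den).mean()

def total_loss(out, inp, lam=10.0):
    """L_total = L_pixel + lambda * L_SSIM, L_SSIM = 1 - SSIM(O, I)."""
    l_pixel = F.mse_loss(out, inp)
    l_ssim = 1.0 - ssim(out, inp)
    return l_pixel + lam * l_ssim

img = torch.rand(1, 1, 64, 64)
print(float(total_loss(img, img)))  # ~0 for a perfect reconstruction
```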
The goal of the training phase is to train an autoencoder network capable of effectively extracting image features and reconstructing images. We employ 1 600 images from FLIR as input images for network training. The training samples are randomly cropped to 224 × 224. We set the parameter λ to 10 to train the network.
3 Experimental results and analysis
In this section, we validate the proposed fusion method through a series of experiments. The network runs on an NVIDIA GeForce RTX 2080Ti and is implemented with the PyTorch framework.
3.1 Experimental setting
In this experiment, test images are selected from the TNO, RoadScene, and LLVIP datasets.
3.2 Ablation experiment
The proposed network incorporates the compensation encoder into the encoder network and introduces the attention strategy along with the Fusion module for multi-scale fusion in the decoder network. To validate the efficacy of these strategies, ablation experiments are conducted. Additionally, to further assess the benefits of the attention strategy, two attention mechanisms, SE and ECA, are compared with the strategy we adopt. The fusion results are illustrated in Fig. 8.
Figure 8.Fusion image of ablation experiments: (a) IR; (b) VIS; (c) SE-Attention; (d) ECA-Attention; (e) No-Attention; (f) No-Comp Encoder; (g) No-Fusion Module; (h) No-Strategy; (i) No-Strategy & No-Multi-scale; (j) proposed
To further demonstrate the superiority of our method, we conducted a quantitative analysis using test results from 40 pairs of infrared and visible images selected from the TNO dataset. The results are reported in Table 1, with the optimal values highlighted in bold.
3.3 Analysis of experimental results
3.3.1 Subjective analysis
Subjective qualitative comparisons of the nine existing fusion methods with our fusion method are presented in Figs. 9-11.
Figure 9.Fusion image of TNO: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed
Figure 10.Fusion image of RoadScene: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed
Figure 11.Fusion image of LLVIP: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed
3.3.2 Quantitative analysis
In addition to the subjective qualitative analysis, we employ objective quantitative analysis to measure the performance of the proposed method. The quantitative evaluation results for the test sets of 40 pairs from TNO, 100 pairs from RoadScene, and 100 pairs from LLVIP are shown in Tables 2-4; the best value is indicated in bold and the second-best value is underlined. Taking the LLVIP dataset as an example, we achieve the highest values on five indicators: EN, MI, SD, VIF, and SCD, which demonstrates the network's ability to extract image details effectively. The results on the other two datasets show that, while the proposed method may fall slightly short of the optimal values on some metrics, its outcomes are balanced across all metrics and generally superior to the other fusion methods. Together, the results on the three datasets indicate that our method has good feasibility and generalizability.
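Two of the metrics used above are straightforward to state precisely: EN is the Shannon entropy of the grey-level histogram, and SD is the standard deviation of pixel intensities. A sketch (the 256-bin histogram and the test images are illustrative choices):

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of the grey-level histogram (256 bins)."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins before log
    return float(-(p * np.log2(p)).sum())

def std_dev(img):
    """SD: standard deviation of pixel intensities."""
    return float(np.std(img))

flat = np.full((64, 64), 128, dtype=np.uint8)              # constant image
noisy = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
print(entropy(flat), entropy(noisy) > entropy(flat))  # 0.0 True
```

Higher EN and SD indicate that a fused image carries more information and contrast, which is why they are reported alongside MI, VIF, and SCD.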
4 Conclusions
This article constructs a base encoder and a detail encoder based on an optimization problem to extract low-frequency and high-frequency information from the image, and additionally proposes a compensation encoder to supplement the missing information. To capture more feature information, a multi-scale approach is introduced for extracting image features. The decoder combines the low-frequency, high-frequency, and compensation information to generate fused features at different scales. Attention maps are then derived from these fused features, and the corresponding weights are multiplied with them at each scale; finally, the Fusion module performs multi-scale fusion to achieve image reconstruction. During the training phase, the three encoders and one decoder are trained according to the loss function to ensure image reconstruction capability. In the testing phase, the low-frequency, high-frequency, and compensatory information at different scales, decomposed by the encoder, is fed into the fusion layer to integrate the corresponding infrared and visible images, and then sent to the decoder for reconstruction, yielding the final fused image.
Comparative experiments are conducted on the TNO, RoadScene, and LLVIP datasets. The results demonstrate that both the subjective fusion effect and the objective evaluations of the proposed network outperform those of the comparison methods, showing good generalization across datasets.
[11] Z P Lin, Y H Luo, B Y Li et al. Gradient-aware channel attention network for infrared small target image denoising before detection. Journal of Infrared and Millimeter Waves, 43, 254-260(2024).
[24] Z P Lin, B Y Li, M Li et al. Light-weight infrared small target detection combining cross-scale feature fusion with bottleneck attention module. Journal of Infrared and Millimeter Waves, 41, 1102-1112(2022).
[30] S Liu, D Huang, Y Wang. Learning spatial fusion for single-shot object detection. arXiv preprint(2019).
[34] H Xu, J Ma, J Jiang et al. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 502-518(2020).
Hai-Ping SI, Wen-Rui ZHAO, Ting-Ting LI, Fei-Tao LI, Bacao FERNADO, Chang-Xia SUN, Yan-Ling LI. BDMFuse: Multi-scale network fusion for infrared and visible images based on base and detail features[J]. Journal of Infrared and Millimeter Waves, 2025, 44(2): 275
Category: Interdisciplinary Research on Infrared Science
Received: Jul. 25, 2024
Accepted: --
Published Online: Mar. 14, 2025
The Author Email: LI Yan-Ling (lyl_lingling@163.com)