Journal of Infrared and Millimeter Waves, Volume 44, Issue 2, 275 (2025)

BDMFuse: Multi-scale network fusion for infrared and visible images based on base and detail features

Hai-Ping SI1, Wen-Rui ZHAO1, Ting-Ting LI1, Fei-Tao LI1, Bacao FERNADO2, Chang-Xia SUN1 and Yan-Ling LI1,*
Author Affiliations
  • 1College of Information and Management Science, Henan Agricultural University, Zhengzhou 450046, China
  • 2NOVA Information Management School, Universidade Nova de Lisboa, Lisboa 1070-312, Portugal

    The fusion of infrared and visible images should emphasize the salient targets of the infrared image while preserving the textural details of the visible image. To meet these requirements, an autoencoder-based method for infrared and visible image fusion is proposed. The encoder, designed according to an optimization objective, consists of a base encoder and a detail encoder, which extract low-frequency and high-frequency information from the image. Since this extraction may leave some information uncaptured, a compensation encoder is proposed to supplement the missing information. Multi-scale decomposition is also employed to extract image features more comprehensively. The decoder combines the low-frequency, high-frequency and supplementary information to obtain multi-scale features. An attention strategy and a fusion module are then introduced to perform multi-scale fusion for image reconstruction. Experimental results on three datasets show that the fused images generated by this network effectively retain salient targets while being more consistent with human visual perception.


    Introduction

    Infrared and visible image fusion is a vital task in image processing [1]. Infrared images, captured through thermal radiation, are less affected by external factors but often suffer from a significant loss of texture details and structural information [2]. In contrast, visible images contain rich texture information but are more sensitive to environmental changes [3]. Therefore, combining infrared and visible images in a bimodal manner can generate high-quality images [4].

    To maximize the retention of useful information in fused images, various methods have been developed. Traditional image fusion techniques operate in either the spatial domain or the transform domain. Spatial domain methods directly process image pixels, whereas transform domain methods handle the frequency representation of images using transformations such as the Fourier transform or wavelet transform [5-6]. While these traditional methods have produced satisfactory fusion results, they largely depend on heuristic fusion rules. Consequently, traditional image fusion techniques often fall short of meeting increasingly stringent fusion requirements.
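The transform-domain pipeline described above can be sketched in a few lines: decompose each source image into a low-frequency base and a high-frequency detail layer, then recombine them with heuristic rules. The box blur, kernel size, and fusion rules below are illustrative stand-ins, not taken from any particular method.

```python
import numpy as np

def box_blur(img, k=5):
    """Simple low-pass filter: mean over a k x k neighborhood (edge-padded)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def transform_domain_fuse(ir, vis, k=5):
    """Decompose each image into a low-frequency base and a high-frequency
    detail layer, then fuse with heuristic rules: average the bases,
    keep the stronger detail at each pixel."""
    base_ir, base_vis = box_blur(ir, k), box_blur(vis, k)
    det_ir, det_vis = ir - base_ir, vis - base_vis
    fused_base = 0.5 * (base_ir + base_vis)
    fused_det = np.where(np.abs(det_ir) >= np.abs(det_vis), det_ir, det_vis)
    return fused_base + fused_det

ir = np.random.rand(32, 32)
vis = np.random.rand(32, 32)
fused = transform_domain_fuse(ir, vis)
```

Because base plus detail reconstructs each source exactly, fusing an image with itself returns the image unchanged, a quick sanity check for such rule-based schemes.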

    In recent years, the feature extraction capabilities of deep learning have attracted significant attention from researchers. Studies have demonstrated that deep learning methods can substantially enhance the quality of fused images [7]. Currently, deep learning-based image fusion methods can be broadly categorized into three types: autoencoder (AE)-based methods, convolutional neural network (CNN)-based methods, and generative adversarial network (GAN)-based methods. AE-based methods focus on encoding data to create a low-dimensional representation, which is then decoded to reconstruct the original data [8]. CNN-based methods extract image features through local connections and shared weights [9]. GAN-based methods employ adversarial training between a generator and a discriminator to produce high-quality images [10].

    Although methods for infrared and visible image fusion are relatively mature, unresolved issues remain. Increasing the number of network layers enhances a network's expressive ability, but it also exacerbates the loss of image detail, which limits further improvement of fused image quality. Our proposed network effectively addresses this problem, with the following main contributions:

    (1) The encoder network constructs base and detail encoders that extract low- and high-frequency information according to the optimization objective, and a compensation encoder that supplements the remaining information.

    (2) The decoder network first fuses the low-frequency, high-frequency, and compensatory information at different scales to obtain multi-scale features. These multi-scale features are then multiplied by the acquired attention maps, and a Fusion module is introduced to perform multi-scale fusion for image reconstruction.

    (3) Our proposed fusion method demonstrates superior fusion results in both visual assessment and objective evaluation compared with nine other fusion algorithms on three public datasets.

    1 Related works

    1.1 Methods based on autoencoder

    An autoencoder is an unsupervised learning model commonly used for tasks such as data compression and feature extraction. The encoder maps input data into a low-dimensional feature space, while the decoder reconstructs the original data from this reduced space. Through this process, the autoencoder retains essential information and effectively captures significant feature representations [11]. As a result, autoencoders have been widely applied in areas such as target detection [12], target segmentation [13], and early warning systems [14].

    In image processing, the feature extraction capabilities of autoencoders are extensively utilized in image fusion tasks. Image fusion aims to combine information from multiple image sources to generate richer and more informative results. In this field, the autoencoder is usually divided into three parts: the encoder, the decoder, and the fusion layer. The encoder automatically learns the feature information of the source images without manually designed features. The decoder maps the low-dimensional features extracted by the encoder back to the original space, adaptively reconstructing the image. Typically, autoencoder-based image fusion methods train the encoder and decoder during the training phase. In the testing phase, a fusion layer, such as an addition or maximum strategy, is introduced to merge the features of the infrared and visible images.
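The test-phase fusion layer mentioned above reduces to a simple element-wise rule applied to the encoder features. A minimal sketch of the addition and maximum strategies (the feature values are arbitrary stand-ins):

```python
import numpy as np

def fuse_features(feat_ir, feat_vis, strategy="add"):
    """Test-phase fusion layer for autoencoder-based methods: merge the
    encoder features of the infrared and visible images with a fixed rule."""
    if strategy == "add":
        return feat_ir + feat_vis             # addition strategy
    if strategy == "max":
        return np.maximum(feat_ir, feat_vis)  # element-wise maximum strategy
    raise ValueError(f"unknown strategy: {strategy}")

f_ir = np.array([[1.0, 4.0], [2.0, 0.5]])
f_vis = np.array([[3.0, 1.0], [2.0, 2.5]])
added = fuse_features(f_ir, f_vis, "add")    # [[4.0, 5.0], [4.0, 3.0]]
maxed = fuse_features(f_ir, f_vis, "max")    # [[3.0, 4.0], [2.0, 2.5]]
```

Because the rule is fixed rather than learned, the same trained encoder-decoder pair can be reused with different fusion strategies at test time.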

    Li et al. [15] proposed a deep learning architecture that integrates convolutional layers, fusion layers, and dense blocks, where the output of each layer is connected to the outputs of the other layers. Recognizing that the features extracted by a single branch might lack comprehensiveness, Li et al. [16] introduced a method incorporating a multi-level residual encoder module and a decoder module with hybrid transmission; the encoder module has two independent branches for extracting image features. Zhao et al. [17] proposed a dual-branch structure for multimodal feature decomposition and image fusion. Tang et al. [18] proposed a darkness-free infrared and visible image fusion method, which fully considers the intrinsic relationship between low-light image enhancement and image fusion, achieving effective coupling and information complementarity. Additionally, Tang et al. [19] were the first to consider the gap between high-level vision tasks and image fusion, proposing a semantic-aware image fusion framework.

    1.2 Multi-scale transform

    Multi-scale transform is a technique for processing and analyzing information by decomposing an image or signal into different scales or frequency components [20]. It can effectively capture local details and global features in images and is widely used in tasks such as image fusion, texture analysis and feature extraction. Commonly used multi-scale transforms include pyramid transformations [21], wavelet transformations [22], and non-subsampled multi-scale multi-directional geometric transformations [23].

    With the advancement of deep learning, researchers have integrated multi-scale transforms into deep learning-based networks. Lin et al. [24] proposed a cross-scale fusion module for interactive fusion of features between encoder and decoder. Jian et al. [25] introduced a multi-scale encoder to extract image features and constructed a symmetric encoder-decoder network with residual blocks (SEDRFuse) for fusing infrared and visible images in night-vision applications. Wang et al. [26] introduced a novel and efficient fusion network based on dense Res2net and dual non-local attention models. They integrated Res2net and dense connectivity into an encoder network, enabling the use of multiple receptive fields to extract multi-scale features and retain as much effective information as possible for the fusion task.

    2 Proposed fusion method

    This section provides a detailed description of the proposed architecture for infrared and visible image fusion, encompassing the encoder, the decoder, and training specifics.

    2.1 Network architecture

    Our proposed infrared and visible image fusion method (BDMFuse) adopts an end-to-end network architecture. The encoder employs three branches: the base encoder, the detail encoder, and the compensation encoder. The base encoder extracts low-frequency information from the image, the detail encoder captures high-frequency information, and the compensation encoder gathers information not captured by the other two. Each encoder operates at three different scales, and their outputs are fed into the decoder. Multi-scale features are first generated by summing the low-frequency, high-frequency, and compensatory information at different scales. These multi-scale features are then fused to achieve image reconstruction. The overall architecture of the training and testing phases is shown in Fig. 1.

    Figure 1. The overall network of BDMFuse: (a) overall structure of the training phase; (b) overall structure of the testing phase

    2.2 Encoder

    Inspired by Ref. [27], we construct the base encoder, detail encoder, and compensation encoder around the idea of solving an optimization problem. The base encoder extracts low-frequency features by minimizing Eq. (1):

    $$B_n = B_I - \gamma_{B_1} \times B_m \tag{1}$$

    where $B_I$ is the image acquired after applying a blur filter to the input image and $\gamma_{B_1}$ is a tuning hyperparameter. $B_n$ ($n \in \{1,2,3\}$) is the decomposed base image. $B_m$ ($m \in \{1,2,3\}$) denotes the low-frequency features of different scales obtained after convolution operations of different depths, expressed as:

    $$B_m = L_n - \gamma_{B_2} \times E_n \tag{2}$$

    where $L_n$ ($n \in \{1,2,3\}$) denotes features of different scales obtained after convolution operations of different depths, $E_n$ ($n \in \{1,2,3\}$) denotes high-frequency features of different scales extracted from the $L_n$ feature maps, and $\gamma_{B_2}$ is a tuning hyperparameter.

    The detail encoder minimizes Eq. (3) to obtain the optimal solution:

    $$D_n = D_I - \gamma_{D_1} \times D_m \tag{3}$$

    where $D_I$ is the image obtained after applying a Laplacian filter to the input image and $D_m$ ($m \in \{1,2,3\}$) denotes the high-frequency features at different scales obtained from convolution operations at different depths. $\gamma_{D_1}$ is a tuning hyperparameter. $D_n$ ($n \in \{1,2,3\}$) is the decomposed detail image.

    The compensation encoder also obtains its feature information by seeking an optimal value; the procedure is similar to that of the detail encoder and is not repeated. We obtain the compensation filter by Eq. (4):

    $$C_I = 1 - B_I - D_I \tag{4}$$
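A rough sketch of the three inputs fed to the encoders: a blurred image $B_I$ for the base encoder, a Laplacian-filtered image $D_I$ for the detail encoder, and the compensation term $C_I$ of Eq. (4). The 3×3 mean and Laplacian kernels are assumptions for illustration; the paper does not specify the exact filters.

```python
import numpy as np

def conv2d(img, kernel):
    """Same-size correlation-style 2-D filtering with edge padding
    (helper for this sketch; symmetric kernels make it equal to convolution)."""
    kh, kw = kernel.shape
    padded = np.pad(img, (kh // 2, kw // 2), mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative kernels: a 3x3 mean blur as the low-pass filter and the
# standard 3x3 Laplacian as the high-pass filter.
blur_k = np.full((3, 3), 1.0 / 9.0)
lap_k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

img = np.random.rand(16, 16)
B_I = conv2d(img, blur_k)   # blurred image fed to the base encoder
D_I = conv2d(img, lap_k)    # Laplacian image fed to the detail encoder
C_I = 1 - B_I - D_I         # compensation term, Eq. (4)
```

Note that on a constant region the Laplacian response is zero and the blur is the identity, so $C_I$ there reduces to $1 - B_I$, which is what lets the compensation branch carry whatever the other two filters discard.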

    The base encoder network is depicted in Fig. 2. We adopt two processing methods to extract the low-frequency information of the input image: one applies a low-pass filter, and the other uses convolution operations. First, coarse feature extraction is performed, followed by a CBAM attention mechanism [28] to emphasize effective information. To capture more comprehensive information, we introduce multi-scale feature extraction. Maximum pooling and up-sampling operations are applied to features of different scales to estimate their high-frequency components, which are then removed from the features. The pooling operation halves the feature size, so bilinear-interpolation-based up-sampling is adopted to restore the image size.
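Our reading of Eq. (2) in this pooling-and-upsampling scheme can be sketched as follows. Nearest-neighbour repetition stands in for the bilinear interpolation used in the paper, and the value of $\gamma_{B_2}$ is illustrative.

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling, halving each spatial dimension."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour upsampling by 2 (the paper uses bilinear
    interpolation; nearest is used here only to keep the sketch short)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

gamma_b2 = 0.5                      # tuning hyperparameter (illustrative value)
L_n = np.random.rand(16, 16)        # a feature map at one scale
E_n = upsample2(max_pool2(L_n))     # pooled-and-upsampled high-frequency estimate
B_m = L_n - gamma_b2 * E_n          # low-frequency feature, Eq. (2)
```

Each value of the pooled-and-upsampled map is the maximum of its 2×2 block, so it bounds the original feature from above; subtracting a fraction of it suppresses the sharp local peaks.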

    Figure 2. Base encoder schematic

    The detail encoder network is illustrated in Fig. 3, and its architecture resembles that of the base encoder. However, the detail encoder employs two methods to obtain the high-frequency information of the image: one utilizes a high-pass filter, while the other involves convolution operations, where maximum pooling is applied directly to image features of different scales to obtain the high-frequency information. In this process, the CGP convolution blocks in the network all employ 3×3 convolution kernels with a stride of 1. Passing through a CGP convolution block alters only the number of channels and does not affect the image size.

    Figure 3. Detail encoder schematic

    The compensation encoder is shown in Fig. 4. This encoder is constructed to complement the image feature information, so it retains only the basic convolutional processing of the other encoders, without any additional operations.

    Figure 4. Compensation encoder schematic

    2.3 Decoder

    2.3.1 Decoder network

    Initially, the corresponding low-frequency and high-frequency information at different scales are fused. As shown in Fig. 5(a), $F_{bn}$ ($n \in \{1,2,3\}$) refers to the multi-scale low-frequency features, $F_{dn}$ ($n \in \{1,2,3\}$) denotes the multi-scale high-frequency features, and $F_{cn}$ ($n \in \{1,2,3\}$) represents the multi-scale compensation features. Since the compensation features may increase image artifacts, we add a hyperparameter $\beta$ to modulate them. The size of $F_{dn}$ differs from that of the other features, so it is up-sampled through a transposed convolution. Finally, $F_{bn}$, $F_{dn}$, and $F_{cn}$ are summed to obtain the multi-scale feature $F_n$ ($n \in \{1,2,3\}$).
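The summation that produces each multi-scale feature can be sketched as below; the feature maps, the value of $\beta$, and the nearest-neighbour stand-in for the transposed-convolution up-sampling are all illustrative.

```python
import numpy as np

beta = 0.1                          # hyperparameter modulating compensation features
F_b = np.random.rand(16, 16)        # low-frequency features at one scale
F_d_small = np.random.rand(8, 8)    # high-frequency features (half resolution)
F_c = np.random.rand(16, 16)        # compensation features

# The paper up-samples F_d with a transposed convolution; nearest-neighbour
# repetition stands in for it in this sketch.
F_d = np.repeat(np.repeat(F_d_small, 2, axis=0), 2, axis=1)

F_n = F_b + F_d + beta * F_c        # summed multi-scale feature at this scale
```

A small $\beta$ lets the compensation branch contribute missing information without amplifying the artifacts it may carry.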

    Figure 5. Decoder schematic: (a) BDC-fusion; (b) multi-fusion

    To enable the network to learn more multi-scale information, we obtain the attention map of the corresponding scale from $F_n$ and multiply it with the fused features. Multi-scale fusion is then performed, as shown in Fig. 5(b). $F_3$ is passed through a CGS module to align its channel count with that of $F_2$, concatenated with $F_2$, and sent to the Fusion module. After $F_3$ and $F_2$ are fused, the result passes through a CGS module to match the channel count of $F_1$, is concatenated with $F_1$, and is sent into the Fusion module. The Fusion module [29] is shown in Fig. 6; we introduce a hyperparameter into it to enhance its applicability to our proposed network. Finally, image reconstruction is achieved by two CGS modules. The CGS convolution blocks in the network adopt 3×3 convolution kernels with a stride of 1; they change only the number of channels and not the image size.

    Figure 6. Fusion module

    2.3.2 Attention strategy

    Our computation of the $F_n$ weights draws inspiration from ASFF (adaptively spatial feature fusion) [30]. The $F_n$ have the same spatial size but different numbers of channels. We adjust the channel counts of the $F_n$ to obtain three attention maps from which the weight values of the corresponding scales are extracted. The formula is shown below:

    $$\omega_1^n, \omega_2^n, \omega_3^n = \mathrm{softmax}\left(F_1^n, F_2^n, F_3^n\right),\quad n, i \in \{1,2,3\} \tag{5}$$

    where $\omega_1^n$, $\omega_2^n$, $\omega_3^n$ denote the individual weights of $F_n$ from the $n$th layer, and $F_1^n$, $F_2^n$, $F_3^n$ denote the features of all layers with their channel counts adjusted to that of the $n$th layer by the CGP convolution block. The softmax function is shown below:

    $$\mathrm{softmax}\left(F_1^n, F_2^n, F_3^n\right) = \frac{e^{F_i^n}}{e^{F_1^n} + e^{F_2^n} + e^{F_3^n}},\quad n, i \in \{1,2,3\} \tag{6}$$

    The weights corresponding to $F_n$ are multiplied with $F_n$ to optimize the $F_n$ features as follows:

    $$F_n = F_n \times \omega_n^n \tag{7}$$

    where $\omega_n^n$ is the weight corresponding to $F_n$ in the $n$th-layer attention map.

    Taking Attention-3 as an example, Fig. 7 illustrates how the weights of $F_3$ are obtained. We input the $F_n$ into CGP convolution blocks and adjust the channel counts of $F_1$ and $F_2$ to match that of $F_3$, obtaining three feature maps. Subsequently, the three feature maps are concatenated, and the weight value of each feature map is calculated using softmax. Finally, the weight corresponding to $F_3$ is extracted and multiplied by $F_3$ to optimize its features. In this process, the CGP convolution blocks all employ 1×1 convolution kernels with a stride of 1.
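A sketch of the attention computation of Eqs. (5)-(7) at one scale, with random feature maps standing in for the CGP-aligned outputs:

```python
import numpy as np

def softmax_maps(maps):
    """Pixel-wise softmax over a list of equally shaped feature maps,
    as in Eqs. (5)-(6): at every spatial position the weights sum to 1."""
    stacked = np.stack(maps)                   # shape (3, H, W)
    e = np.exp(stacked - stacked.max(axis=0))  # subtract max for numerical stability
    return e / e.sum(axis=0)

# Channel-aligned feature maps for one attention branch (random stand-ins;
# in the network they come from the CGP 1x1 convolution blocks).
F1n, F2n, F3n = (np.random.rand(16, 16) for _ in range(3))
w1, w2, w3 = softmax_maps([F1n, F2n, F3n])

F3_opt = F3n * w3   # Eq. (7): reweight the nth-layer feature by its own weight
```

Because the softmax is taken across the three maps at each pixel, the weights form a spatial partition of unity, which is what makes the reweighting in Eq. (7) a well-behaved soft selection between scales.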

    Figure 7. Attention strategy

    2.4 Training strategy

    Our training strategy is similar to that of DenseFuse. In the training phase, the fusion network is discarded. Through training, we expect the encoder to extract multi-scale depth features and the decoder to reconstruct the image from these features. This training strategy leaves more design freedom for the fusion layer.

    In the training phase, the loss function $L_{total}$ is defined as follows:

    $$L_{total} = L_p + \lambda L_s + \frac{\lambda}{2}\left(L_{s2} + L_{s1}\right) \tag{8}$$

    where $L_p$ and $L_s$ denote the pixel loss and the structural similarity loss between the input and output images, respectively, and $\lambda$ balances $L_p$, $L_s$, $L_{s1}$ and $L_{s2}$.

    $L_p$ is calculated by Eq. (9):

    $$L_p = \left\| O - I \right\|^2 \tag{9}$$

    where $O$ and $I$ denote the output and input images, respectively. $L_p$ computes the squared difference between the output and input images, ensuring that the reconstructed image is similar to the source image at the pixel level.

    $L_s$ is obtained from Eq. (10):

    $$L_s = 1 - \mathrm{SSIM}(O, I) \tag{10}$$

    where $\mathrm{SSIM}(\cdot)$ represents the structural similarity measure. A larger value of $\mathrm{SSIM}(\cdot)$ indicates a higher structural similarity between the output image $O$ and the input image $I$.
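The pixel and structural losses of Eqs. (9)-(10) can be sketched as follows. The single-window SSIM here is a simplification of the standard sliding-window SSIM index, kept only to make the loss shapes concrete.

```python
import numpy as np

def pixel_loss(O, I):
    """Eq. (9): squared L2 distance between the output and input images."""
    return np.sum((O - I) ** 2)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM computed over the whole image. The standard SSIM
    index averages over sliding Gaussian windows; this global variant is a
    simplification for the sketch."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def structural_loss(O, I):
    """Eq. (10): L_s = 1 - SSIM(O, I)."""
    return 1.0 - ssim_global(O, I)

I_img = np.random.rand(32, 32)
lp_same = pixel_loss(I_img, I_img)       # 0 for a perfect reconstruction
ls_same = structural_loss(I_img, I_img)  # 0 for a perfect reconstruction
```

Both losses vanish for a perfect reconstruction, so Eq. (8) is minimized exactly when the autoencoder reproduces its input, which is the objective of the training phase.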

    The Fusion module is introduced in the decoder network to retain more feature information. However, it tends to ignore the structural information of the image to some extent. To address this, the $L_{s2}$ and $L_{s1}$ losses are introduced.

    $L_{s2}$ is obtained from Eq. (11):

    $$L_{s2} = 1 - \mathrm{SSIM}\left(f_{3+2}, F(f_3, f_2)\right) \tag{11}$$

    where $f_{3+2}$ is the result of summing the multi-scale features $F_2$ and $F_3$, while $F(f_3, f_2)$ is the fusion result obtained from the Fusion module.

    $L_{s1}$ is obtained from Eq. (12):

    $$L_{s1} = 1 - \mathrm{SSIM}\left(f_{2+1}, F(f_2, f_1)\right) \tag{12}$$

    where $f_{2+1}$ is the result of fusing the multi-scale features $F_2$ and $F_3$ and then summing with $F_1$, and $F(f_2, f_1)$ is the fusion result obtained from the Fusion module.

    The goal of the training phase is to train an autoencoder network capable of effectively extracting image features and reconstructing images. We employ 1 600 images from FLIR as input images for network training. The training samples are randomly cropped to 224 × 224. We set the parameter λ to 10 to train the network.

    3 Experimental results and analysis

    In this section,we will validate our proposed fusion method through a series of experiments. The network is executed on an NVIDIA GeForce RTX 2080Ti and implemented using the PyTorch framework.

    3.1 Experimental setting

    In this experiment, images are selected from the TNO, RoadScene [31], and LLVIP [32] datasets, and our proposed fusion network is compared with nine typical fusion networks (DenseFuse, RFN-Nest, FusionGAN [33], U2Fusion [34], CSF [35], SEDR, SwinFusion [36], SeAFusion, CDDFuse) to evaluate its performance. We select seven commonly used image quality evaluation metrics for objective quality evaluation of the fused images: entropy (EN) [37], mutual information (MI) [38], standard deviation (SD) [39], average gradient (AG) [40], visual information fidelity (VIF) [41], the sum of the correlations of differences (SCD) [42], and the multi-scale structural similarity measure (MS_SSIM) [43]. Among them, EN measures how much information the image contains; MI measures the amount of information transferred from the source images to the fused image; SD reflects the distribution and contrast of the fused image; AG evaluates the sharpness and detail of an image; VIF measures the information fidelity of the fused image; SCD measures the differences between the fused image and the original images; and MS_SSIM is a similarity-based evaluation metric.
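Two of the simpler metrics, EN and SD, can be computed directly from the image histogram and pixel statistics. A sketch (256 grey levels assumed):

```python
import numpy as np

def entropy(img, levels=256):
    """EN: Shannon entropy (in bits) of the grey-level histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                    # drop empty bins; 0*log(0) is defined as 0
    return -np.sum(p * np.log2(p))

def std_dev(img):
    """SD: standard deviation of pixel intensities (a contrast indicator)."""
    return float(np.std(img))

flat = np.full((64, 64), 128)                 # constant image: no information, no contrast
noisy = np.random.randint(0, 256, (64, 64))   # richer grey-level distribution
en_flat, en_noisy = entropy(flat), entropy(noisy)
```

A constant image scores zero on both metrics, while a fused image that spreads intensities over many grey levels scores higher, which is why higher EN and SD are read as richer information and stronger contrast.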

    3.2 Ablation experiment

    The proposed network incorporates a compensation encoder into the encoder network and introduces an attention strategy along with a Fusion module for multi-scale fusion in the decoder network. To validate the efficacy of these strategies, ablation experiments are conducted. Additionally, to further assess the benefits of the attention strategy, two attention mechanisms, SE and ECA, are compared with the strategy we adopt. As illustrated in Fig. 8, our proposed network demonstrates superior visual fusion effects: it effectively highlights the person in the infrared image while preserving the branch texture features of the visible image.

    Figure 8. Fusion image of ablation experiments: (a) IR; (b) VIS; (c) SE-Attention; (d) ECA-Attention; (e) No-Attention; (f) No-Comp Encoder; (g) No-Fusion Module; (h) No-Strategy; (i) No-Strategy & No-Multi-scale; (j) proposed

    To further demonstrate the superiority of our method, we conducted a quantitative analysis using test results from 40 pairs of infrared and visible images selected from the TNO dataset. The optimal values are highlighted in bold. As shown in Table 1, our proposed network achieves the optimal values for the EN, MI, SD, and VIF metrics. These results provide additional evidence of the effectiveness of our approach.

      Table 1. Average quality evaluation metrics for ablation experiments

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      SE-Attention | 6.8501 | 13.7001 | 34.7315 | 3.7415 | 0.9612 | 1.8192 | 0.9416
      ECA-Attention | 6.9504 | 13.9008 | 40.6716 | 3.5503 | 0.9548 | 1.6514 | 0.9148
      No-Attention | 7.1774 | 14.3547 | 40.2576 | 4.2096 | 1.3605 | 1.8801 | 0.9306
      No-Comp Encoder | 6.8728 | 13.7456 | 38.0053 | 2.9682 | 0.9219 | 1.8060 | 0.9403
      No-Fusion Module | 7.2129 | 14.4259 | 40.8263 | 3.7076 | 1.2727 | 1.8870 | 0.9277
      No-Strategy | 7.0291 | 14.0581 | 37.9805 | 3.3666 | 1.0851 | 1.8935 | 0.9511
      No-Strategy & No-Multi-scale | 6.5910 | 13.1819 | 35.0411 | 3.2785 | 0.8063 | 1.6111 | 0.9216
      Proposed | 7.2525 | 14.5050 | 43.6183 | 4.1627 | 1.4128 | 1.8715 | 0.9207

    3.3 Analysis of experimental results

    3.3.1 Subjective analysis

    Subjective qualitative comparisons of the nine existing fusion methods with ours are presented in Figs. 9-11. As depicted in the figures, all fusion methods effectively fuse infrared and visible images, but notable differences in visual quality are observed. FusionGAN retains more information from the infrared image during fusion, which can result in blurred image details. The remaining fusion methods retain more detailed information from the visible image, especially SwinFusion, SeAFusion and CDDFuse; for example, in Fig. 9, the cloud information present in the infrared image is missing from their results. Our proposed network retains more light and shadow information, as illustrated in Fig. 10, indicating its enhanced ability to perceive light. To further validate this, we conducted experiments on the LLVIP dataset, a low-light vision dataset of paired visible and infrared images. As demonstrated in Fig. 11, our fused images reveal the basic outline of the manhole cover even under low-light conditions.

    Figure 9. Fusion image of TNO: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed

    Figure 10. Fusion image of RoadScene: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed

    Figure 11. Fusion image of LLVIP: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed

    3.3.2 Quantitative analysis

    In addition to subjective qualitative analysis, we employ objective quantitative analysis to measure the performance of the proposed method. The quantitative evaluation results for the test sets of 40 pairs from TNO, 100 pairs from RoadScene, and 100 pairs from LLVIP are shown in Tables 2-4. The best value is indicated in bold, and the second-best value is underlined. Taking the LLVIP results as an example, we achieve the highest values on five metrics: EN, MI, SD, VIF and SCD. This demonstrates our network's capability to effectively extract image details. The results from the other two datasets show that while our proposed fusion method may fall slightly short of the optimal values in some evaluation metrics, the outcomes are relatively balanced across all metrics and generally superior to those of other fusion methods. Additionally, the experimental results from all three datasets indicate that our method has good feasibility and generalizability.

      Table 2. Average quality evaluation metrics for 40 pairs of TNO fused images

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      DenseFuse | 6.7933 | 13.5867 | 33.0055 | 3.8958 | 1.1141 | 1.8951 | 0.9379
      RFN-Nest | 6.9889 | 13.9777 | 35.2026 | 2.8619 | 1.1218 | 1.8710 | 0.9138
      FusionGAN | 6.5073 | 13.0147 | 26.4478 | 2.4276 | 0.7408 | 1.1339 | 0.7511
      U2Fusion | 6.4745 | 12.9489 | 24.9858 | 3.8414 | 0.7371 | 1.6542 | 0.9433
      CSF | 6.9295 | 13.8591 | 34.2342 | 3.8565 | 1.2221 | 1.8503 | 0.9190
      SEDR | 6.8795 | 13.7590 | 39.0527 | 4.1491 | 1.5546 | 1.8554 | 0.8946
      SwinFusion | 6.6156 | 13.2311 | 31.0339 | 3.4971 | 0.7228 | 1.7147 | 0.8992
      SeAFusion | 7.1474 | 14.2949 | 39.9100 | 5.6162 | 1.7369 | 1.7323 | 0.8553
      CDDFuse | 7.0582 | 14.1164 | 39.1751 | 5.2135 | 1.3976 | 1.7966 | 0.8795
      Proposed | 7.3544 | 14.7088 | 44.5808 | 5.9907 | 1.8274 | 1.8784 | 0.8935
      Table 3. Average quality evaluation metrics for 100 pairs of RoadScene fused images

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      DenseFuse | 7.2684 | 14.5368 | 43.1679 | 4.5979 | 0.6504 | 1.6552 | 0.9220
      RFN-Nest | 7.3492 | 14.6984 | 46.1085 | 3.1668 | 0.6091 | 1.6837 | 0.8671
      FusionGAN | 7.0540 | 14.1079 | 39.0645 | 3.2246 | 0.4805 | 1.0496 | 0.7547
      U2Fusion | 7.0815 | 14.1629 | 37.8472 | 5.2514 | 0.6338 | 1.3866 | 0.9148
      CSF | 7.4257 | 14.8514 | 47.9828 | 5.0241 | 0.7952 | 1.7282 | 0.9261
      SEDR | 7.4499 | 14.8999 | 49.3581 | 4.9322 | 0.8286 | 1.6978 | 0.9005
      SwinFusion | 6.9712 | 13.9424 | 45.1634 | 4.3312 | 0.7275 | 1.5772 | 0.8470
      SeAFusion | 7.5117 | 15.0234 | 56.1119 | 6.8737 | 1.0949 | 1.6732 | 0.8786
      CDDFuse | 7.5003 | 15.0007 | 56.4331 | 6.5083 | 1.1120 | 1.7105 | 0.8740
      Proposed | 7.5623 | 15.1246 | 53.1000 | 6.5506 | 1.0359 | 1.7766 | 0.9353
      Table 4. Average quality evaluation metrics for 100 pairs of LLVIP fused images

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      DenseFuse | 6.9740 | 13.9480 | 36.7708 | 2.8641 | 0.4617 | 1.3835 | 0.9224
      RFN-Nest | 7.0028 | 14.0055 | 37.8919 | 2.2445 | 0.4216 | 1.4154 | 0.8981
      FusionGAN | 6.4153 | 12.8306 | 26.1540 | 2.0102 | 0.2746 | 0.7521 | 0.7865
      U2Fusion | 6.6531 | 13.3062 | 34.9659 | 3.3391 | 0.5283 | 1.2755 | 0.9098
      CSF | 6.8283 | 13.6566 | 35.3773 | 2.7960 | 0.4564 | 1.3621 | 0.9109
      SEDR | 6.8877 | 13.7795 | 36.5430 | 2.6319 | 0.4495 | 1.2546 | 0.8834
      SwinFusion | 7.3844 | 14.7688 | 50.8104 | 4.2064 | 0.9087 | 1.5887 | 0.9451
      SeAFusion | 7.4193 | 14.8386 | 50.4468 | 4.1898 | 0.9062 | 1.6259 | 0.9435
      CDDFuse | 7.3134 | 14.6267 | 48.3450 | 3.8052 | 0.8171 | 1.5889 | 0.9337
      Proposed | 7.5398 | 15.0795 | 52.9415 | 3.8464 | 0.9889 | 1.7071 | 0.9250

    4 Conclusions

    This article constructs a base encoder and a detail encoder based on the optimization problem to extract low-frequency and high-frequency information from the image. Additionally, a compensation encoder is proposed to supplement the missing information. To capture more feature information, a multi-scale approach is introduced for extracting image features. The decoder combines the low-frequency, high-frequency, and compensation information to generate fused features at different scales. Attention maps are then derived from these features, and the corresponding weights are multiplied with the fused features at different scales. Finally, the Fusion module is introduced for multi-scale fusion to achieve image reconstruction. During the training phase, the three encoders and one decoder are trained according to the loss function to ensure image reconstruction capability. In the testing phase, the low-frequency, high-frequency, and compensatory information at different scales, decomposed by the encoder, is fed into the fusion layer to integrate the corresponding infrared and visible images, and the result is sent into the decoder for reconstruction, yielding the final fused image.

    The network undergoes comparative experiments on the TNO,RoadScene,and LLVIP datasets. The results demonstrate that both the subjective fusion effect and the overall objective evaluations of the proposed network outperform those of the comparison methods,showing good generalization across datasets.

    [11] Z P Lin, Y H Luo, B Y Li, et al. Gradient-aware channel attention network for infrared small target image denoising before detection. Journal of Infrared and Millimeter Waves, 43, 254-260 (2024).

    [24] Z P Lin, B Y Li, M Li, et al. Light-weight infrared small target detection combining cross-scale feature fusion with bottleneck attention module. Journal of Infrared and Millimeter Waves, 41, 1102-1112 (2022).

    [30] S Liu, D Huang, Y Wang. Learning spatial fusion for single-shot object detection. arXiv preprint (2019).

    [34] H Xu, J Ma, J Jiang, et al. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 502-518 (2020).

    Paper Information

    Category: Interdisciplinary Research on Infrared Science

    Received: Jul. 25, 2024

    Accepted: --

    Published Online: Mar. 14, 2025

    The Author Email: LI Yan-Ling (lyl_lingling@163.com)

    DOI:10.11972/j.issn.1001-9014.2025.02.015
