Journal of Infrared and Millimeter Waves, Volume 44, Issue 2, 275 (2025)

BDMFuse: Multi-scale network fusion for infrared and visible images based on base and detail features

Hai-Ping SI1, Wen-Rui ZHAO1, Ting-Ting LI1, Fei-Tao LI1, Bacao FERNADO2, Chang-Xia SUN1 and Yan-Ling LI1,*
Author Affiliations
  • 1College of Information and Management Science, Henan Agricultural University, Zhengzhou 450046, China
  • 2NOVA Information Management School, Universidade Nova de Lisboa, Lisboa 1070-312, Portugal

    The fusion of infrared and visible images should emphasize the salient targets of the infrared image while preserving the textural details of the visible image. To meet these requirements, an autoencoder-based method for infrared and visible image fusion is proposed. The encoder, designed according to an optimization objective, consists of a base encoder and a detail encoder, which extract low-frequency and high-frequency information from the image. Since this extraction may leave some information uncaptured, a compensation encoder is proposed to supplement the missing information. Multi-scale decomposition is also employed to extract image features more comprehensively. The decoder combines the low-frequency, high-frequency and supplementary information to obtain multi-scale features. An attention strategy and a fusion module are then introduced to perform multi-scale fusion for image reconstruction. Experimental results on three datasets show that the fused images generated by this network effectively retain salient targets while being more consistent with human visual perception.


    Introduction

    Infrared and visible image fusion is a vital task in image processing [1]. Infrared images, captured through thermal radiation, are less affected by external factors but often suffer from a significant loss of texture details and structural information [2]. In contrast, visible images contain rich texture information but are more sensitive to environmental changes [3]. Therefore, combining infrared and visible images in a bimodal manner can generate high-quality images [4].

    To maximize the retention of useful information in fused images, various methods have been developed. Traditional image fusion techniques operate in either the spatial domain or the transform domain. Spatial domain methods directly process image pixels, whereas transform domain methods handle the frequency representation of images using transformations such as the Fourier transform or wavelet transform [5-6]. While these traditional methods have produced satisfactory fusion results, they largely depend on heuristic fusion rules. Consequently, traditional image fusion techniques often fall short of meeting increasingly stringent fusion requirements.
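The transform-domain pipeline described above can be sketched in a few lines: decompose each source image into a low-frequency base and a high-frequency detail layer, then recombine them with heuristic rules. The box blur, kernel size, and fusion rules below are illustrative stand-ins, not taken from any particular method.

```python
import numpy as np

def box_blur(img, k=5):
    """Simple low-pass filter: mean over a k x k neighborhood (edge-padded)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def transform_domain_fuse(ir, vis, k=5):
    """Decompose each image into a low-frequency base and a high-frequency
    detail layer, then fuse with heuristic rules: average the bases,
    keep the stronger detail at each pixel."""
    base_ir, base_vis = box_blur(ir, k), box_blur(vis, k)
    det_ir, det_vis = ir - base_ir, vis - base_vis
    fused_base = 0.5 * (base_ir + base_vis)
    fused_det = np.where(np.abs(det_ir) >= np.abs(det_vis), det_ir, det_vis)
    return fused_base + fused_det

ir = np.random.rand(32, 32)
vis = np.random.rand(32, 32)
fused = transform_domain_fuse(ir, vis)
```

Because base plus detail reconstructs each source exactly, fusing an image with itself returns the image unchanged, a quick sanity check for such rule-based schemes.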

    In recent years, the feature extraction capabilities of deep learning have attracted significant attention from researchers. Studies have demonstrated that deep learning methods can substantially enhance the quality of fused images [7]. Currently, deep learning-based image fusion methods can be broadly categorized into three types: autoencoder (AE)-based methods, convolutional neural network (CNN)-based methods, and generative adversarial network (GAN)-based methods. AE-based methods focus on encoding data to create a low-dimensional representation, which is then decoded to reconstruct the original data [8]. CNN-based methods extract image features through local connections and shared weights [9]. GAN-based methods employ adversarial training between a generator and a discriminator to produce high-quality images [10].

    Although methods for infrared and visible image fusion are relatively mature, unresolved issues remain. Increasing the number of network layers enhances a network's expressive ability, but it also exacerbates the loss of image detail, which limits further improvement of fused image quality. Our proposed network effectively addresses this problem, with the following main contributions:

    (1) The encoder network constructs base and detail encoders that extract low- and high-frequency information according to the optimization objective, and a compensation encoder that supplements the remaining information.

    (2) The decoder network first fuses the low-frequency, high-frequency, and compensatory information at different scales to obtain multi-scale features. These multi-scale features are then multiplied by the acquired attention maps, and a Fusion module is introduced to perform multi-scale fusion for image reconstruction.

    (3) Our proposed fusion method demonstrates superior fusion results in both visual assessment and objective evaluation compared with nine other fusion algorithms on three public datasets.

    1 Related works

    1.1 Methods based on autoencoder

    An autoencoder is an unsupervised learning model commonly used for tasks such as data compression and feature extraction. The encoder maps input data into a low-dimensional feature space, while the decoder reconstructs the original data from this reduced space. Through this process, the autoencoder retains essential information and effectively captures significant feature representations [11]. As a result, autoencoders have been widely applied in areas such as target detection [12], target segmentation [13], and early warning systems [14].

    In image processing, the feature extraction capabilities of autoencoders are extensively utilized in image fusion tasks. Image fusion aims to combine information from multiple image sources to generate richer and more informative results. In this field, the autoencoder is usually divided into three parts: the encoder, the decoder, and the fusion layer. The encoder automatically learns the feature information of the source images without manually designed features. The decoder maps the low-dimensional features extracted by the encoder back to the original space, adaptively reconstructing the image. Typically, autoencoder-based image fusion methods train the encoder and decoder during the training phase. In the testing phase, a fusion layer, such as an addition or maximum strategy, is introduced to merge the features of the infrared and visible images.
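The test-phase fusion layer mentioned above reduces to a simple element-wise rule applied to the encoder features. A minimal sketch of the addition and maximum strategies (the feature values are arbitrary stand-ins):

```python
import numpy as np

def fuse_features(feat_ir, feat_vis, strategy="add"):
    """Test-phase fusion layer for autoencoder-based methods: merge the
    encoder features of the infrared and visible images with a fixed rule."""
    if strategy == "add":
        return feat_ir + feat_vis             # addition strategy
    if strategy == "max":
        return np.maximum(feat_ir, feat_vis)  # element-wise maximum strategy
    raise ValueError(f"unknown strategy: {strategy}")

f_ir = np.array([[1.0, 4.0], [2.0, 0.5]])
f_vis = np.array([[3.0, 1.0], [2.0, 2.5]])
added = fuse_features(f_ir, f_vis, "add")    # [[4.0, 5.0], [4.0, 3.0]]
maxed = fuse_features(f_ir, f_vis, "max")    # [[3.0, 4.0], [2.0, 2.5]]
```

Because the rule is fixed rather than learned, the same trained encoder-decoder pair can be reused with different fusion strategies at test time.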

    Li et al. [15] proposed a deep learning architecture that integrates convolutional layers, fusion layers, and dense blocks, where the output of each layer is connected to the outputs of the other layers. Recognizing that the features extracted by a single branch might lack comprehensiveness, Li et al. [16] introduced a method incorporating a multi-level residual encoder module and a decoder module with hybrid transmission; the encoder module has two independent branches for extracting image features. Zhao et al. [17] proposed a dual-branch structure for multimodal feature decomposition and image fusion. Tang et al. [18] proposed a darkness-free infrared and visible image fusion method, which fully considers the intrinsic relationship between low-light image enhancement and image fusion, achieving effective coupling and information complementarity. Additionally, Tang et al. [19] were the first to consider the gap between high-level vision tasks and image fusion, proposing a semantic-aware image fusion framework.

    1.2 Multi-scale transform

    Multi-scale transform is a technique for processing and analyzing information by decomposing an image or signal into different scales or frequency components [20]. It can effectively capture local details and global features in images and is widely used in tasks such as image fusion, texture analysis and feature extraction. Commonly used multi-scale transforms include pyramid transformations [21], wavelet transformations [22], and non-subsampled multi-scale multi-directional geometric transformations [23].

    With the advancement of deep learning, researchers have integrated multi-scale transforms into deep learning-based networks. Lin et al. [24] proposed a cross-scale fusion module for interactive fusion of features between encoder and decoder. Jian et al. [25] introduced a multi-scale encoder to extract image features and constructed a symmetric encoder-decoder network with residual blocks (SEDRFuse) for fusing infrared and visible images in night-vision applications. Wang et al. [26] introduced a novel and efficient fusion network based on dense Res2net and dual non-local attention models. They integrated Res2net and dense connectivity into an encoder network, enabling the use of multiple receptive fields to extract multi-scale features and retain as much effective information as possible for the fusion task.

    2 Proposed fusion method

    This section provides a detailed description of the proposed architecture for infrared and visible image fusion, encompassing the encoder, the decoder, and training specifics.

    2.1 Network architecture

    Our proposed infrared and visible image fusion method (BDMFuse) adopts an end-to-end network architecture. The encoder employs three branches: the base encoder, the detail encoder, and the compensation encoder. The base encoder extracts low-frequency information from the image, the detail encoder captures high-frequency information, and the compensation encoder gathers information not captured by the other two. Each encoder operates at three different scales, and their outputs are fed into the decoder. Multi-scale features are first generated by summing the low-frequency, high-frequency, and compensatory information at different scales. These multi-scale features are then fused to achieve image reconstruction. The overall architecture of the training and testing phases is shown in Fig. 1.

    Figure 1. The overall network of BDMFuse: (a) overall structure of the training phase; (b) overall structure of the testing phase

    2.2 Encoder

    Inspired by Ref. [27], we construct the base encoder, detail encoder, and compensation encoder around the idea of solving an optimization problem. The base encoder extracts low-frequency features by minimizing Eq. (1):

    $$B_n = B_I - \gamma_{B_1} \times B_m \tag{1}$$

    where $B_I$ is the image acquired after applying a blur filter to the input image and $\gamma_{B_1}$ is a tuning hyperparameter. $B_n$ ($n \in \{1,2,3\}$) is the decomposed base image. $B_m$ ($m \in \{1,2,3\}$) denotes the low-frequency features of different scales obtained after convolution operations of different depths, expressed as:

    $$B_m = L_n - \gamma_{B_2} \times E_n \tag{2}$$

    where $L_n$ ($n \in \{1,2,3\}$) denotes features of different scales obtained after convolution operations of different depths, $E_n$ ($n \in \{1,2,3\}$) denotes high-frequency features of different scales extracted from the $L_n$ feature maps, and $\gamma_{B_2}$ is a tuning hyperparameter.

    The detail encoder minimizes Eq. (3) to obtain the optimal solution:

    $$D_n = D_I - \gamma_{D_1} \times D_m \tag{3}$$

    where $D_I$ is the image obtained after applying a Laplacian filter to the input image and $D_m$ ($m \in \{1,2,3\}$) denotes the high-frequency features at different scales obtained from convolution operations at different depths. $\gamma_{D_1}$ is a tuning hyperparameter. $D_n$ ($n \in \{1,2,3\}$) is the decomposed detail image.

    The compensation encoder also obtains its feature information by seeking an optimal value; the procedure is similar to that of the detail encoder and is not repeated. We obtain the compensation filter by Eq. (4):

    $$C_I = 1 - B_I - D_I \tag{4}$$
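A rough sketch of the three inputs fed to the encoders: a blurred image $B_I$ for the base encoder, a Laplacian-filtered image $D_I$ for the detail encoder, and the compensation term $C_I$ of Eq. (4). The 3×3 mean and Laplacian kernels are assumptions for illustration; the paper does not specify the exact filters.

```python
import numpy as np

def conv2d(img, kernel):
    """Same-size correlation-style 2-D filtering with edge padding
    (helper for this sketch; symmetric kernels make it equal to convolution)."""
    kh, kw = kernel.shape
    padded = np.pad(img, (kh // 2, kw // 2), mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Illustrative kernels: a 3x3 mean blur as the low-pass filter and the
# standard 3x3 Laplacian as the high-pass filter.
blur_k = np.full((3, 3), 1.0 / 9.0)
lap_k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

img = np.random.rand(16, 16)
B_I = conv2d(img, blur_k)   # blurred image fed to the base encoder
D_I = conv2d(img, lap_k)    # Laplacian image fed to the detail encoder
C_I = 1 - B_I - D_I         # compensation term, Eq. (4)
```

Note that on a constant region the Laplacian response is zero and the blur is the identity, so $C_I$ there reduces to $1 - B_I$, which is what lets the compensation branch carry whatever the other two filters discard.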

    The base encoder network is depicted in Fig. 2. We adopt two processing methods to extract the low-frequency information of the input image: one applies a low-pass filter, and the other uses convolution operations. First, coarse feature extraction is performed, followed by a CBAM attention mechanism [28] to emphasize effective information. To capture more comprehensive information, we introduce multi-scale feature extraction. Maximum pooling and up-sampling operations are applied to features of different scales to estimate their high-frequency components, which are then removed from the features. The pooling operation halves the feature size, so bilinear-interpolation-based up-sampling is adopted to restore the image size.
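Our reading of Eq. (2) in this pooling-and-upsampling scheme can be sketched as follows. Nearest-neighbour repetition stands in for the bilinear interpolation used in the paper, and the value of $\gamma_{B_2}$ is illustrative.

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling, halving each spatial dimension."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour upsampling by 2 (the paper uses bilinear
    interpolation; nearest is used here only to keep the sketch short)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

gamma_b2 = 0.5                      # tuning hyperparameter (illustrative value)
L_n = np.random.rand(16, 16)        # a feature map at one scale
E_n = upsample2(max_pool2(L_n))     # pooled-and-upsampled high-frequency estimate
B_m = L_n - gamma_b2 * E_n          # low-frequency feature, Eq. (2)
```

Each value of the pooled-and-upsampled map is the maximum of its 2×2 block, so it bounds the original feature from above; subtracting a fraction of it suppresses the sharp local peaks.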

    Figure 2. Base encoder schematic

    The detail encoder network is illustrated in Fig. 3, and its architecture resembles that of the base encoder. However, the detail encoder employs two methods to obtain the high-frequency information of the image: one utilizes a high-pass filter, while the other involves convolution operations, where maximum pooling is applied directly to image features of different scales to obtain the high-frequency information. In this process, the CGP convolution blocks in the network all employ 3×3 convolution kernels with a stride of 1. Passing through a CGP convolution block alters only the number of channels and does not affect the image size.

    Figure 3. Detail encoder schematic

    The compensation encoder is shown in Fig. 4. This encoder is constructed to complement the image feature information, so it retains only the basic convolutional processing of the other encoders, without any additional operations.

    Figure 4. Compensation encoder schematic

    2.3 Decoder

    2.3.1 Decoder network

    Initially, the corresponding low-frequency and high-frequency information at different scales are fused. As shown in Fig. 5(a), $F_{bn}$ ($n \in \{1,2,3\}$) refers to the multi-scale low-frequency features, $F_{dn}$ ($n \in \{1,2,3\}$) denotes the multi-scale high-frequency features, and $F_{cn}$ ($n \in \{1,2,3\}$) represents the multi-scale compensation features. Since the compensation features may increase image artifacts, we add a hyperparameter $\beta$ to modulate them. The size of $F_{dn}$ differs from that of the other features, so it is up-sampled through a transposed convolution. Finally, $F_{bn}$, $F_{dn}$, and $F_{cn}$ are summed to obtain the multi-scale feature $F_n$ ($n \in \{1,2,3\}$).
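The summation that produces each multi-scale feature can be sketched as below; the feature maps, the value of $\beta$, and the nearest-neighbour stand-in for the transposed-convolution up-sampling are all illustrative.

```python
import numpy as np

beta = 0.1                          # hyperparameter modulating compensation features
F_b = np.random.rand(16, 16)        # low-frequency features at one scale
F_d_small = np.random.rand(8, 8)    # high-frequency features (half resolution)
F_c = np.random.rand(16, 16)        # compensation features

# The paper up-samples F_d with a transposed convolution; nearest-neighbour
# repetition stands in for it in this sketch.
F_d = np.repeat(np.repeat(F_d_small, 2, axis=0), 2, axis=1)

F_n = F_b + F_d + beta * F_c        # summed multi-scale feature at this scale
```

A small $\beta$ lets the compensation branch contribute missing information without amplifying the artifacts it may carry.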

    Figure 5. Decoder schematic: (a) BDC-fusion; (b) multi-fusion

    To enable the network to learn more multi-scale information, we obtain the attention map of the corresponding scale from $F_n$ and multiply it with the fused features. Multi-scale fusion is then performed, as shown in Fig. 5(b). $F_3$ is passed through a CGS module to align its channel count with that of $F_2$, concatenated with $F_2$, and sent to the Fusion module. After $F_3$ and $F_2$ are fused, the result passes through a CGS module to match the channel count of $F_1$, is concatenated with $F_1$, and is sent into the Fusion module. The Fusion module [29] is shown in Fig. 6; we introduce a hyperparameter into it to enhance its applicability to our proposed network. Finally, image reconstruction is achieved by two CGS modules. The CGS convolution blocks in the network adopt 3×3 convolution kernels with a stride of 1; they change only the number of channels and not the image size.

    Figure 6. Fusion module

    2.3.2 Attention strategy

    Our computation of the $F_n$ weights draws inspiration from ASFF (adaptively spatial feature fusion) [30]. The $F_n$ have the same spatial size but different numbers of channels. We adjust the channel counts of the $F_n$ to obtain three attention maps from which the weight values of the corresponding scales are extracted. The formula is shown below:

    $$\omega_1^n, \omega_2^n, \omega_3^n = \mathrm{softmax}\left(F_1^n, F_2^n, F_3^n\right),\quad n, i \in \{1,2,3\} \tag{5}$$

    where $\omega_1^n$, $\omega_2^n$, $\omega_3^n$ denote the individual weights of $F_n$ from the $n$th layer, and $F_1^n$, $F_2^n$, $F_3^n$ denote the features of all layers with their channel counts adjusted to that of the $n$th layer by the CGP convolution block. The softmax function is shown below:

    $$\mathrm{softmax}\left(F_1^n, F_2^n, F_3^n\right) = \frac{e^{F_i^n}}{e^{F_1^n} + e^{F_2^n} + e^{F_3^n}},\quad n, i \in \{1,2,3\} \tag{6}$$

    The weights corresponding to $F_n$ are multiplied with $F_n$ to optimize the $F_n$ features as follows:

    $$F_n = F_n \times \omega_n^n \tag{7}$$

    where $\omega_n^n$ is the weight corresponding to $F_n$ in the $n$th-layer attention map.

    Taking Attention-3 as an example, Fig. 7 illustrates how the weights of $F_3$ are obtained. We input the $F_n$ into CGP convolution blocks and adjust the channel counts of $F_1$ and $F_2$ to match that of $F_3$, obtaining three feature maps. Subsequently, the three feature maps are concatenated, and the weight value of each feature map is calculated using softmax. Finally, the weight corresponding to $F_3$ is extracted and multiplied by $F_3$ to optimize its features. In this process, the CGP convolution blocks all employ 1×1 convolution kernels with a stride of 1.
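A sketch of the attention computation of Eqs. (5)-(7) at one scale, with random feature maps standing in for the CGP-aligned outputs:

```python
import numpy as np

def softmax_maps(maps):
    """Pixel-wise softmax over a list of equally shaped feature maps,
    as in Eqs. (5)-(6): at every spatial position the weights sum to 1."""
    stacked = np.stack(maps)                   # shape (3, H, W)
    e = np.exp(stacked - stacked.max(axis=0))  # subtract max for numerical stability
    return e / e.sum(axis=0)

# Channel-aligned feature maps for one attention branch (random stand-ins;
# in the network they come from the CGP 1x1 convolution blocks).
F1n, F2n, F3n = (np.random.rand(16, 16) for _ in range(3))
w1, w2, w3 = softmax_maps([F1n, F2n, F3n])

F3_opt = F3n * w3   # Eq. (7): reweight the nth-layer feature by its own weight
```

Because the softmax is taken across the three maps at each pixel, the weights form a spatial partition of unity, which is what makes the reweighting in Eq. (7) a well-behaved soft selection between scales.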

    Figure 7. Attention strategy

    2.4 Training strategy

    Our training strategy is similar to that of DenseFuse. In the training phase, the fusion network is discarded. Through training, we expect the encoder to extract multi-scale depth features and the decoder to reconstruct the image from these features. This training strategy leaves more design freedom for the fusion layer.

    In the training phase, the loss function $L_{total}$ is defined as follows:

    $$L_{total} = L_p + \lambda L_s + \frac{\lambda}{2}\left(L_{s2} + L_{s1}\right) \tag{8}$$

    where $L_p$ and $L_s$ denote the pixel loss and the structural similarity loss between the input and output images, respectively, and $\lambda$ balances $L_p$, $L_s$, $L_{s1}$ and $L_{s2}$.

    $L_p$ is calculated by Eq. (9):

    $$L_p = \left\| O - I \right\|^2 \tag{9}$$

    where $O$ and $I$ denote the output and input images, respectively. $L_p$ computes the squared difference between the output and input images, ensuring that the reconstructed image is similar to the source image at the pixel level.

    $L_s$ is obtained from Eq. (10):

    $$L_s = 1 - \mathrm{SSIM}(O, I) \tag{10}$$

    where $\mathrm{SSIM}(\cdot)$ represents the structural similarity measure. A larger value of $\mathrm{SSIM}(\cdot)$ indicates a higher structural similarity between the output image $O$ and the input image $I$.
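The pixel and structural losses of Eqs. (9)-(10) can be sketched as follows. The single-window SSIM here is a simplification of the standard sliding-window SSIM index, kept only to make the loss shapes concrete.

```python
import numpy as np

def pixel_loss(O, I):
    """Eq. (9): squared L2 distance between the output and input images."""
    return np.sum((O - I) ** 2)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM computed over the whole image. The standard SSIM
    index averages over sliding Gaussian windows; this global variant is a
    simplification for the sketch."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def structural_loss(O, I):
    """Eq. (10): L_s = 1 - SSIM(O, I)."""
    return 1.0 - ssim_global(O, I)

I_img = np.random.rand(32, 32)
lp_same = pixel_loss(I_img, I_img)       # 0 for a perfect reconstruction
ls_same = structural_loss(I_img, I_img)  # 0 for a perfect reconstruction
```

Both losses vanish for a perfect reconstruction, so Eq. (8) is minimized exactly when the autoencoder reproduces its input, which is the objective of the training phase.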

    The Fusion module is introduced in the decoder network to retain more feature information. However, it tends to ignore the structural information of the image to some extent. To address this, the $L_{s2}$ and $L_{s1}$ losses are introduced.

    $L_{s2}$ is obtained from Eq. (11):

    $$L_{s2} = 1 - \mathrm{SSIM}\left(f_{3+2}, F(f_3, f_2)\right) \tag{11}$$

    where $f_{3+2}$ is the result of summing the multi-scale features $F_2$ and $F_3$, while $F(f_3, f_2)$ is the fusion result obtained from the Fusion module.

    $L_{s1}$ is obtained from Eq. (12):

    $$L_{s1} = 1 - \mathrm{SSIM}\left(f_{2+1}, F(f_2, f_1)\right) \tag{12}$$

    where $f_{2+1}$ is the result of fusing the multi-scale features $F_2$ and $F_3$ and then summing with $F_1$, and $F(f_2, f_1)$ is the fusion result obtained from the Fusion module.

    The goal of the training phase is to train an autoencoder network capable of effectively extracting image features and reconstructing images. We employ 1 600 images from FLIR as input images for network training. The training samples are randomly cropped to 224 × 224. We set the parameter λ to 10 to train the network.

    3 Experimental results and analysis

    In this section,we will validate our proposed fusion method through a series of experiments. The network is executed on an NVIDIA GeForce RTX 2080Ti and implemented using the PyTorch framework.

    3.1 Experimental setting

    In this experiment, images are selected from the TNO, RoadScene [31], and LLVIP [32] datasets, and our proposed fusion network is compared with nine typical fusion networks (DenseFuse, RFN-Nest, FusionGAN [33], U2Fusion [34], CSF [35], SEDR, SwinFusion [36], SeAFusion, CDDFuse) to evaluate its performance. We select seven commonly used image quality evaluation metrics for objective quality evaluation of the fused images: entropy (EN) [37], mutual information (MI) [38], standard deviation (SD) [39], average gradient (AG) [40], visual information fidelity (VIF) [41], the sum of the correlations of differences (SCD) [42], and the multi-scale structural similarity measure (MS_SSIM) [43]. Among them, EN measures how much information the image contains; MI measures the amount of information transferred from the source images to the fused image; SD reflects the distribution and contrast of the fused image; AG evaluates the sharpness and detail of an image; VIF measures the information fidelity of the fused image; SCD measures the differences between the fused image and the original images; and MS_SSIM is a similarity-based evaluation metric.
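Two of the simpler metrics, EN and SD, can be computed directly from the image histogram and pixel statistics. A sketch (256 grey levels assumed):

```python
import numpy as np

def entropy(img, levels=256):
    """EN: Shannon entropy (in bits) of the grey-level histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                    # drop empty bins; 0*log(0) is defined as 0
    return -np.sum(p * np.log2(p))

def std_dev(img):
    """SD: standard deviation of pixel intensities (a contrast indicator)."""
    return float(np.std(img))

flat = np.full((64, 64), 128)                 # constant image: no information, no contrast
noisy = np.random.randint(0, 256, (64, 64))   # richer grey-level distribution
en_flat, en_noisy = entropy(flat), entropy(noisy)
```

A constant image scores zero on both metrics, while a fused image that spreads intensities over many grey levels scores higher, which is why higher EN and SD are read as richer information and stronger contrast.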

    3.2 Ablation experiment

    The proposed network incorporates a compensation encoder into the encoder network and introduces an attention strategy along with a Fusion module for multi-scale fusion in the decoder network. To validate the efficacy of these strategies, ablation experiments are conducted. Additionally, to further assess the benefits of the attention strategy, two attention mechanisms, SE and ECA, are compared with the strategy we adopt. As illustrated in Fig. 8, our proposed network demonstrates superior visual fusion effects: it effectively highlights the person in the infrared image while preserving the branch texture features of the visible image.

    Figure 8. Fusion image of ablation experiments: (a) IR; (b) VIS; (c) SE-Attention; (d) ECA-Attention; (e) No-Attention; (f) No-Comp Encoder; (g) No-Fusion Module; (h) No-Strategy; (i) No-Strategy & No-Multi-scale; (j) proposed

    To further demonstrate the superiority of our method, we conducted a quantitative analysis using test results from 40 pairs of infrared and visible images selected from the TNO dataset. The optimal values are highlighted in bold. As shown in Table 1, our proposed network achieves the optimal values for the EN, MI, SD, and VIF metrics. These results provide additional evidence of the effectiveness of our approach.

      Table 1. Average quality evaluation metrics for ablation experiments

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      SE-Attention | 6.8501 | 13.7001 | 34.7315 | 3.7415 | 0.9612 | 1.8192 | 0.9416
      ECA-Attention | 6.9504 | 13.9008 | 40.6716 | 3.5503 | 0.9548 | 1.6514 | 0.9148
      No-Attention | 7.1774 | 14.3547 | 40.2576 | 4.2096 | 1.3605 | 1.8801 | 0.9306
      No-Comp Encoder | 6.8728 | 13.7456 | 38.0053 | 2.9682 | 0.9219 | 1.8060 | 0.9403
      No-Fusion Module | 7.2129 | 14.4259 | 40.8263 | 3.7076 | 1.2727 | 1.8870 | 0.9277
      No-Strategy | 7.0291 | 14.0581 | 37.9805 | 3.3666 | 1.0851 | 1.8935 | 0.9511
      No-Strategy & No-Multi-scale | 6.5910 | 13.1819 | 35.0411 | 3.2785 | 0.8063 | 1.6111 | 0.9216
      Proposed | 7.2525 | 14.5050 | 43.6183 | 4.1627 | 1.4128 | 1.8715 | 0.9207

    3.3 Analysis of experimental results

    3.3.1 Subjective analysis

    Subjective qualitative comparisons of the nine existing fusion methods with ours are presented in Figs. 9-11. As depicted in the figures, all fusion methods effectively fuse infrared and visible images, but notable differences in visual quality are observed. FusionGAN retains more information from the infrared image during fusion, which can result in blurred image details. The remaining fusion methods retain more detailed information from the visible image, especially SwinFusion, SeAFusion and CDDFuse; for example, in Fig. 9, the cloud information present in the infrared image is missing from their results. Our proposed network retains more light and shadow information, as illustrated in Fig. 10, indicating its enhanced ability to perceive light. To further validate this, we conducted experiments on the LLVIP dataset, a low-light vision dataset of paired visible and infrared images. As demonstrated in Fig. 11, our fused images reveal the basic outline of the manhole cover even under low-light conditions.

    Figure 9. Fusion image of TNO: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed

    Figure 10. Fusion image of RoadScene: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed

    Figure 11. Fusion image of LLVIP: (a) IR; (b) VIS; (c) DenseFuse; (d) RFN-Nest; (e) FusionGAN; (f) U2Fusion; (g) CSF; (h) SEDR; (i) SwinFusion; (j) SeAFusion; (k) CDDFuse; (l) proposed

    3.3.2 Quantitative analysis

    In addition to subjective qualitative analysis, we employ objective quantitative analysis to measure the performance of the proposed method. The quantitative evaluation results for the test sets of 40 pairs from TNO, 100 pairs from RoadScene, and 100 pairs from LLVIP are shown in Tables 2-4. The best value is indicated in bold, and the second-best value is underlined. Taking the LLVIP results as an example, we achieve the highest values on five metrics: EN, MI, SD, VIF and SCD. This demonstrates our network's capability to effectively extract image details. The results from the other two datasets show that while our proposed fusion method may fall slightly short of the optimal values in some evaluation metrics, the outcomes are relatively balanced across all metrics and generally superior to those of other fusion methods. Additionally, the experimental results from all three datasets indicate that our method has good feasibility and generalizability.

      Table 2. Average quality evaluation metrics for 40 pairs of TNO fused images

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      DenseFuse | 6.7933 | 13.5867 | 33.0055 | 3.8958 | 1.1141 | 1.8951 | 0.9379
      RFN-Nest | 6.9889 | 13.9777 | 35.2026 | 2.8619 | 1.1218 | 1.8710 | 0.9138
      FusionGAN | 6.5073 | 13.0147 | 26.4478 | 2.4276 | 0.7408 | 1.1339 | 0.7511
      U2Fusion | 6.4745 | 12.9489 | 24.9858 | 3.8414 | 0.7371 | 1.6542 | 0.9433
      CSF | 6.9295 | 13.8591 | 34.2342 | 3.8565 | 1.2221 | 1.8503 | 0.9190
      SEDR | 6.8795 | 13.7590 | 39.0527 | 4.1491 | 1.5546 | 1.8554 | 0.8946
      SwinFusion | 6.6156 | 13.2311 | 31.0339 | 3.4971 | 0.7228 | 1.7147 | 0.8992
      SeAFusion | 7.1474 | 14.2949 | 39.9100 | 5.6162 | 1.7369 | 1.7323 | 0.8553
      CDDFuse | 7.0582 | 14.1164 | 39.1751 | 5.2135 | 1.3976 | 1.7966 | 0.8795
      Proposed | 7.3544 | 14.7088 | 44.5808 | 5.9907 | 1.8274 | 1.8784 | 0.8935
      Table 3. Average quality evaluation metrics for 100 pairs of RoadScene fused images

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      DenseFuse | 7.2684 | 14.5368 | 43.1679 | 4.5979 | 0.6504 | 1.6552 | 0.9220
      RFN-Nest | 7.3492 | 14.6984 | 46.1085 | 3.1668 | 0.6091 | 1.6837 | 0.8671
      FusionGAN | 7.0540 | 14.1079 | 39.0645 | 3.2246 | 0.4805 | 1.0496 | 0.7547
      U2Fusion | 7.0815 | 14.1629 | 37.8472 | 5.2514 | 0.6338 | 1.3866 | 0.9148
      CSF | 7.4257 | 14.8514 | 47.9828 | 5.0241 | 0.7952 | 1.7282 | 0.9261
      SEDR | 7.4499 | 14.8999 | 49.3581 | 4.9322 | 0.8286 | 1.6978 | 0.9005
      SwinFusion | 6.9712 | 13.9424 | 45.1634 | 4.3312 | 0.7275 | 1.5772 | 0.8470
      SeAFusion | 7.5117 | 15.0234 | 56.1119 | 6.8737 | 1.0949 | 1.6732 | 0.8786
      CDDFuse | 7.5003 | 15.0007 | 56.4331 | 6.5083 | 1.1120 | 1.7105 | 0.8740
      Proposed | 7.5623 | 15.1246 | 53.1000 | 6.5506 | 1.0359 | 1.7766 | 0.9353
      Table 4. Average quality evaluation metrics for 100 pairs of LLVIP fused images

      Method | EN | MI | SD | AG | VIF | SCD | MS_SSIM
      DenseFuse | 6.9740 | 13.9480 | 36.7708 | 2.8641 | 0.4617 | 1.3835 | 0.9224
      RFN-Nest | 7.0028 | 14.0055 | 37.8919 | 2.2445 | 0.4216 | 1.4154 | 0.8981
      FusionGAN | 6.4153 | 12.8306 | 26.1540 | 2.0102 | 0.2746 | 0.7521 | 0.7865
      U2Fusion | 6.6531 | 13.3062 | 34.9659 | 3.3391 | 0.5283 | 1.2755 | 0.9098
      CSF | 6.8283 | 13.6566 | 35.3773 | 2.7960 | 0.4564 | 1.3621 | 0.9109
      SEDR | 6.8877 | 13.7795 | 36.5430 | 2.6319 | 0.4495 | 1.2546 | 0.8834
      SwinFusion | 7.3844 | 14.7688 | 50.8104 | 4.2064 | 0.9087 | 1.5887 | 0.9451
      SeAFusion | 7.4193 | 14.8386 | 50.4468 | 4.1898 | 0.9062 | 1.6259 | 0.9435
      CDDFuse | 7.3134 | 14.6267 | 48.3450 | 3.8052 | 0.8171 | 1.5889 | 0.9337
      Proposed | 7.5398 | 15.0795 | 52.9415 | 3.8464 | 0.9889 | 1.7071 | 0.9250

    4 Conclusions

    This article constructs a base encoder and a detail encoder based on the optimization problem to extract low-frequency and high-frequency information from the image. Additionally, a compensation encoder is proposed to supplement the missing information. To capture more feature information, a multi-scale approach is introduced for extracting image features. The decoder combines the low-frequency, high-frequency, and compensation information to generate fused features at different scales. Attention maps are then derived from these features, and the corresponding weights are multiplied with the fused features at different scales. Finally, the Fusion module is introduced for multi-scale fusion to achieve image reconstruction. During the training phase, the three encoders and one decoder are trained according to the loss function to ensure image reconstruction capability. In the testing phase, the low-frequency, high-frequency, and compensatory information at different scales, decomposed by the encoder, is fed into the fusion layer to integrate the corresponding infrared and visible images, and the result is sent into the decoder for reconstruction, yielding the final fused image.

    The network undergoes comparative experiments on the TNO,RoadScene,and LLVIP datasets. The results demonstrate that both the subjective fusion effect and the overall objective evaluations of the proposed network outperform those of the comparison methods,showing good generalization across datasets.

    [11] Z P Lin, Y H Luo, B Y Li, et al. Gradient-aware channel attention network for infrared small target image denoising before detection. Journal of Infrared and Millimeter Waves, 43, 254-260 (2024).

    [24] Z P Lin, B Y Li, M Li, et al. Light-weight infrared small target detection combining cross-scale feature fusion with bottleneck attention module. Journal of Infrared and Millimeter Waves, 41, 1102-1112 (2022).

    [30] S Liu, D Huang, Y Wang. Learning spatial fusion for single-shot object detection. arXiv preprint (2019).

    [34] H Xu, J Ma, J Jiang, et al. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 502-518 (2020).

    Paper Information

    Category: Interdisciplinary Research on Infrared Science

    Received: Jul. 25, 2024

    Accepted: --

    Published Online: Mar. 14, 2025

    The Author Email: LI Yan-Ling (lyl_lingling@163.com)

    DOI:10.11972/j.issn.1001-9014.2025.02.015
