Photonics Research, Vol. 13, Issue 7, 1902 (2025)

Super-wide-field-of-view long-wave infrared gaze polarization imaging embedded in a multi-strategy detail feature extraction and fusion network

Dongdong Shi1,†, Jinhang Zhang1,†, Jun Zou1, Fuyu Huang1,2,*, Limin Liu1,3,*, Li Li1, Yudan Chen1, Bing Zhou1, and Gang Li1
Author Affiliations
  • 1Shijiazhuang Campus, Army Engineering University of PLA, Shijiazhuang 050000, China
  • 2e-mail: hfyoptics@163.com
  • 3e-mail: lk0256@163.com

    Recently, infrared polarization imaging technology has become a research hotspot because it can better resolve the physicochemical properties of objects and significantly enhance target characteristics. However, traditional infrared polarization imaging is constrained by the similar-imaging mechanism, making it difficult to acquire polarization information over a wide field in real time. We therefore report a combined hardware-software approach to super-wide-field-of-view long-wave infrared gaze polarization imaging. Using non-similar imaging theory and an inter-lens-coupled holographic line-grid infrared polarization device, we designed an infrared gaze polarization lens with a field of view exceeding 160°. For the fusion of infrared intensity images with infrared polarization images, a multi-strategy detail feature extraction and fusion network is constructed. Super-wide-field-of-view (150°×120°), large-array (1040×830), detail-rich infrared fusion images are acquired in the tests. Using the fusion images, we efficiently accomplish vehicle target detection and infrared camouflage target recognition and verify their superiority for recognizing far-field targets. Our implementation should enable and empower applications in machine vision, intelligent driving, and target detection in complex environments.

    1. INTRODUCTION

    Infrared (IR) imaging captures the intensity of an object’s radiation and quantifies its spatially varying response. Long-wave infrared (LWIR) detectors respond to the thermal radiation of the object itself, which makes them particularly important in camouflage recognition. This radiation is an electromagnetic wave and therefore carries the properties of light. However, conventional IR imaging focuses mainly on the intensity of the radiation, i.e., the square of the modulus of the electric field vector, while neglecting other optical properties, especially polarization information [1]. Polarization is not only a fundamental property of light but also a key probe of the interaction between light and materials. By recovering this neglected optical property, it is possible to reveal the physicochemical properties and internal structures of materials, a dimension that is hard to reach with conventional optical techniques [2]. Therefore, it is important to develop optical imaging hardware that can capture polarization information and to deepen the foundations of light-matter interaction research so as to promote the development of IR imaging technology [3].

    The Stokes vector is an important parameter for describing IR polarization imaging. It characterizes the polarization properties of objects comprehensively, including the degree of polarization, angle of polarization (AoP), degree of circular polarization, and degree of linear polarization (DoLP), providing the foundation for a complete description of optical properties [4-9]. It shows great potential in military target detection [10,11], encrypted communication [12], biomedicine [13], metal crack detection [14], and intelligent driving [15]. However, the development of IR polarization imaging technology faces challenges in terms of restricted imaging airspace, high cost, and poor image quality. In particular, for super-wide-field-of-view (SWFOV) IR gaze polarization aimed at large-airspace scene surveillance, it remains a challenge to construct an IR gaze polarization imaging system with large-airspace detection capability, and no convincing implementation has been proposed so far.

    Conventional IR polarization imaging is based on the similar-imaging principle, which assumes an inherent similarity between object and image. IR polarization imaging devices designed on this principle typically have a limited field of view (FOV) between 45° and 60°. Currently, SWFOV IR imaging mostly relies on multi-lens stitching techniques [16,17]. Realizing IR polarization imaging over a large airspace requires polarization optics with an FOV greater than 80°, or even beyond 180°. Such lenses must be designed with non-similar imaging theory [18], which departs from Gaussian optics by using compressive mapping to image a large-airspace object scene onto the finite focal plane of the IR movement. However, the extraction of IR polarization information depends on complex nonlinear computation. Moreover, although IR imaging provides brightness and detail information about the scene, the inherent object properties revealed by polarization imaging are often accompanied by low brightness. Deep-learning-based multimodal image fusion techniques exhibit great potential to overcome this challenge [19,20].

    At present, image fusion techniques mainly include auto-encoders (AEs) [21,22], convolutional neural networks (CNNs) [23], and generative adversarial networks (GANs) [24,25]. Lin et al. [26] proposed TIU-Net, which combines a Transformer module with an improved U-Net. By extracting short-range local features and long-range global features simultaneously, it effectively improves target reconstruction in highly turbid underwater environments, and its generalization to untrained target structures and different scattering imaging distances is better than that of traditional CNN methods. Fan et al. [27] designed a local-global contextual polarization feature learning framework that fuses the local features of dilated convolution with the global dependencies of the Swin Transformer through parallel branches, achieving high-quality reconstruction of targets of different materials and geometries in scattering media. Fan et al. further proposed a lightweight multipath collaborative 2D/3D convolutional network [28]. This network fuses 2D and 3D convolutions for the first time, balances the polarization information between sparse channels and within dense channels through cross 2D/3D non-local attention modules, and fuses low-frequency, high-frequency, and multiscale features using a multipath encoding-decoding structure. Experiments show that it achieves 99.45% overall accuracy and 93.85% intersection over union on the LDDRS dataset, a significant improvement over existing methods. For weak-target imaging, Zhou et al. designed a dual-discriminator generative adversarial network [29] to address target information being flooded by intense background light. The method guides the fusion of intensity images and polarization images through a self-attention mechanism, retaining the background luminance information while enhancing the polarization features of the target surface.

    The above network models show remarkable capability in feature extraction and target texture detail enhancement, but they mainly focus on exploring innovative fusion network structures and often neglect the information distribution characteristics of the source images [30]. Moreover, the extracted features of IR images and IR DoLP images are often inadequate and inaccurate, which further increases the fusion difficulty. Specifically, the processing of IR intensity images should focus on capturing background features, while IR DoLP images require concentration on the salient target detail features in the scene.

    We report a long-wave IR gaze polarization imaging technique for large airspaces that integrates the design of an IR polarization lens, computational image reconstruction, and an advanced image fusion network, constituting a comprehensive, non-conventional optical imaging system from hardware to software. To capture the IR polarization information of objects over a large airspace efficiently, we developed an innovative SWFOV LWIR polarization lens coupled with a holographic line-grid IR polarization device. The polarization lens not only provides large-airspace detection capability but also captures IR polarization information. It is equipped with a high-resolution (1040×830 pixels) large-array IR movement, which further enhances image quality. We also propose a high-performance image fusion network that fuses the rich background features of IR images with the salient target details of IR DoLP images, facilitating the subsequent target recognition tasks. The fusion network has excellent background feature extraction capability and polarization information retention, thereby realizing high-quality, polarization-assisted imaging of targets in large scenes through the synergy of hardware and software.

    2. SWFOV LWIR GAZE POLARIZATION IMAGING TECHNOLOGY THEORY

    In the field of IR gaze polarization imaging, it is crucial to break through existing limitations by introducing non-similar imaging theory. As the FOV expands dramatically, the size limitation of the IR movement focal plane array becomes even more obvious [18]. We compress the scene by integrating several object-space elements into a single pixel; this process inevitably introduces “barrel distortion” and is the signature of a non-similar imaging mechanism [31]. Non-similar imaging theory covers four main modes: isometric projection, equisolid angle projection, stereo projection, and orthogonal projection [32,33]. Isometric projection shows a unique advantage in extracting a target’s angular coordinates and time-domain dynamic information, since its image height is proportional to the FOV angle. In this study, we designed the SWFOV LWIR polarization lens based on the isometric projection mode [Fig. 1(c)], and its object-image relationship strictly follows [34]
$$y = f\cdot\omega, \tag{1}$$
where $y$ is the image height, $f$ is the focal length of the optical system, and $\omega$ is the half-FOV angle of the object point. It is essential to construct a visual model for SWFOV LWIR gaze polarization imaging. The target is in the world coordinate system ($O_W X_W Y_W Z_W$), the camera coordinate system is $O_C X_C Y_C Z_C$, the image plane coordinate system is $O_i UV$, and the image coordinate system is $O_I UV$. The image-plane coordinates of the image point $p$ corresponding to the object-space point $P$ are
$$[X_{C,P},\ Y_{C,P}]^T = f\omega\cdot[\cos\varphi,\ \sin\varphi]^T, \tag{2}$$
where $\varphi$ is the angle between the image point and the $U$-axis. When the focal-plane image element size is $a$, the image coordinates $(u, v)$ of the image point $p$ are
$$\begin{bmatrix} u \\ v \end{bmatrix} = f\omega \begin{bmatrix} 1/a & 0 \\ 0 & 1/a \end{bmatrix}\begin{bmatrix} \cos\varphi \\ \sin\varphi \end{bmatrix} + \begin{bmatrix} u_o \\ v_o \end{bmatrix}. \tag{3}$$
Equation (3) represents the mapping relationship between the object point and the image point. Conversely, in the camera coordinate system, the spherical coordinates $\omega$ and $\varphi$ of the object point can be recovered from the image coordinates $(u, v)$:
$$\omega = \frac{a}{f}\sqrt{(u-u_o)^2 + (v-v_o)^2}, \tag{4}$$
$$\varphi = \arccos\frac{u-u_o}{\sqrt{(u-u_o)^2 + (v-v_o)^2}}. \tag{5}$$
From $\omega$ and $\varphi$ and the correspondence between spherical coordinates and camera coordinates, we can obtain the normalized direction $(X_C, Y_C, Z_C)$ of the object point. If the object distance is $d_o$, the actual coordinates of the object point are $d_o(X_C, Y_C, Z_C)$. The mapping relation between a spherical target and the planar image in non-similar imaging theory is shown in Fig. 1(c), in which an element $\mathrm{d}s$ on the object sphere is isometrically projected into the camera coordinate system as $\mathrm{d}s'$.
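    To make the projection model concrete, the sketch below implements Eqs. (1)-(5) for the isometric (f-θ) projection in plain Python/NumPy, converting between an image coordinate (u, v) and the object-space angles (ω, φ) plus a unit viewing direction. The focal length and pixel pitch follow the values quoted in this paper, while the principal point (u0, v0) and the example angles are purely illustrative assumptions, not calibrated parameters of our lens.

```python
import numpy as np

# Illustrative parameters (the principal point is an assumption, not a calibrated value)
f = 6.0e-3              # focal length [m]
a = 17.0e-6             # detector pixel pitch [m]
u0, v0 = 520.0, 415.0   # assumed principal point [pixels]

def pixel_to_direction(u, v):
    """Invert the isometric projection: pixel (u, v) -> (omega, phi) and a unit view vector."""
    du, dv = u - u0, v - v0
    r = np.hypot(du, dv)                      # radial distance on the focal plane [pixels]
    omega = (a / f) * r                       # half-FOV angle, Eq. (4)
    phi = np.arctan2(dv, du)                  # azimuth about the optical axis, cf. Eq. (5)
    # Spherical coordinates -> normalized camera-frame direction (Z along the optical axis)
    direction = np.array([np.sin(omega) * np.cos(phi),
                          np.sin(omega) * np.sin(phi),
                          np.cos(omega)])
    return omega, phi, direction

def direction_to_pixel(omega, phi):
    """Forward model, Eqs. (1)-(3): image height y = f * omega, split into (u, v)."""
    r = (f * omega) / a                       # radial image height in pixels
    return u0 + r * np.cos(phi), v0 + r * np.sin(phi)

if __name__ == "__main__":
    u, v = direction_to_pixel(np.deg2rad(75.0), np.deg2rad(30.0))
    print(pixel_to_direction(u, v))           # recovers omega = 75 deg, phi = 30 deg
```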


    Figure 1. Concept of SWFOV LWIR gaze polarization imaging. (a) Equipment overview. It integrates a vanadium oxide uncooled IR focal plane detector with an independently designed SWFOV LWIR gaze polarization lens. The IR movement focal plane resolution is 1280×1024, the element size is 17 μm, and the frame rate is up to 30 Hz. (b) The polarization lens is coupled with a rotatable holographic line-grid IR polarization device to achieve 150°×120° large-airspace acquisition capability. (c) The polarization lens is designed based on the non-similar imaging principle and adopts isometric projection to optimize SWFOV imaging performance. (d) Stokes vectors are obtained by parsing the polarization components of the radiation information to reveal the target polarization properties. (e) Fusion imaging is used to improve image quality and enhance the value of the fusion images in subsequent applications.

    The design of the SWFOV LWIR gaze lens mainly involves reflective and transmissive types [34-36]. We chose the transmissive type for the SWFOV LWIR gaze polarization lens, since its unobstructed center gives better imaging quality. However, we faced a significant decrease in irradiance at the edge of the FOV during the design process. We therefore introduced an anti-telescopic (inverse-telephoto) optical structure that expands the object-space FOV while reducing the image-space FOV. It improves illumination uniformity and ensures a sufficient back working distance, providing a key solution for optimizing the SWFOV LWIR gaze polarization lens. In Gaussian optics, the radial magnification on a spherical object of radius $r_0$ is
$$\beta_{r0} = \frac{\mathrm{d}y_0'}{r_0\,\mathrm{d}\theta_0},$$
and the tangential magnification for the image $B'C'$ of the arc $BC$ on the sphere is
$$\beta_{t0} = \frac{B'C'}{BC} = \frac{y_0'\,\mathrm{d}\theta_0}{r_0\sin\omega_0\cdot\mathrm{d}\theta_0} = \frac{f_0'}{r_0}\cdot\frac{\omega_0}{\sin\omega_0}.$$
Hence, the relationship between the radial magnification $\beta_{r0}$ and the tangential magnification $\beta_{t0}$ can be described as
$$\beta_{t0}/\beta_{r0} = \omega_0/\sin\omega_0. \tag{6}$$
Equation (6) reveals how spherical objects are deformed during imaging: the radial magnification is inversely proportional to the radius of curvature of the target, so the image of a spherical object is elliptical. It is worth noting that as the half-FOV angle $\omega_0$ increases, the flattening of the ellipse intensifies, while in the central region of the FOV the image keeps its standard shape.
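    As a quick numerical illustration of Eq. (6), the snippet below evaluates the ratio β_t0/β_r0 = ω_0/sin ω_0 at the four half-FOV angles analyzed in Section 3; the values show how the elliptical flattening of a spherical object grows toward the edge of the field. The angle list is just an example.

```python
import numpy as np

# beta_t0 / beta_r0 = omega_0 / sin(omega_0): 1.0 on axis, growing toward the field edge
for deg in (0.0, 25.0, 50.0, 75.0):
    w = np.deg2rad(deg)
    ratio = 1.0 if w == 0.0 else w / np.sin(w)
    print(f"half-FOV {deg:5.1f} deg -> beta_t/beta_r = {ratio:.3f}")
```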

    We determined the structure of the SWFOV LWIR gaze polarization lens after ray tracing, distortion correction, and fine-tuning of various parameters [Fig. 1(b)]. The holographic line-grid IR polarization device is delicately coupled to the lens and placed at the minimum aperture. We captured the corresponding image data by rotating the polarizer to four angles (0°, 45°, 90°, and 135°) [Fig. 1(d)]. Stokes vectors are then calculated for each pixel in the image. Given that the $s_3$ parameter is relatively weak, it is usually neglected. The polarization parameters are computed from the Stokes vectors, with DoLP serving as the primary parameter:
$$\mathrm{DoLP} = \frac{\sqrt{s_1^2 + s_2^2}}{s_0}. \tag{7}$$
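    A minimal sketch of the polarization parsing in Fig. 1(d), assuming four registered intensity frames I0, I45, I90, and I135 from the rotating polarizer. The linear Stokes relations below are the standard textbook expressions; the function name and the ε regularizer are illustrative, not part of our processing chain.

```python
import numpy as np

def stokes_dolp(i0, i45, i90, i135, eps=1e-6):
    """Per-pixel linear Stokes parameters and DoLP from four polarizer orientations."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)           # total intensity (averaged redundant estimate)
    s1 = i0 - i90                                # 0/90 deg difference
    s2 = i45 - i135                              # 45/135 deg difference
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)   # Eq. (7)
    return s0, s1, s2, dolp
```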

    Although IR polarization imaging can improve image contrast significantly, the image background features are lacking severely. Fortunately, image fusion techniques based on deep learning provide an effective solution to this challenge [1,37], as shown in Fig. 1(e), demonstrating its remarkable potential in optimizing imaging quality.

    3. DESIGN OF SWFOV IR GAZE POLARIZATION LENS

    The basic parameters of the SWFOV LWIR gaze polarization lens include a focal length of 6 mm, F/# = 1.2, and a response waveband of 8-12 μm. The image quality evaluation results are shown in Fig. 2(a). The main energy of the blur spot caused by aberrations at each wavelength for the four FOV angles (0°, 25°, 50°, and 75°) is concentrated within the Airy disk, and the RMS radii are 4.941 μm, 9.229 μm, 15.430 μm, and 17.776 μm, respectively, which match the image element size of the IR movement (17 μm).


    Figure 2.Performance evaluation of SWFOV LWIR gazing polarizer heads. (a) Blur spot diagram: demonstrating diffuse spot characteristics by testing blur spots at four FOV angles with reference to the main light. (b) Calculating the percentage of total enclosed energy at the four FOV angles and comparing it with the diffraction limit curve (which represents the zero distortion response). (c) Using the FFT method, diffraction MTF is calculated for the four FOV angles, with the spatial frequency range set as [0,29] lp/mm (Nyquist frequency is 29 lp/mm). (d) The relative illuminance is calculated by integrating the effective area of the exit pupil observed at the image point (performed in cosine space). The effective F/# is inversely proportional to the square root of the stereo emission angular area of the optical pupil in cosine space, with consideration of polarization lens transmittance weighting. (e) The field curvature curve shows the offsets of the near-axis image plane. (f) The distortion curve shows the corresponding distortion offset.

    The enclosed-energy curves within a 30 μm spot radius rise rapidly and exceed 80% of the total energy, indicating that the spot energy is highly concentrated and verifying the excellent focusing performance of the SWFOV LWIR gaze polarization lens [Fig. 2(b)]. The modulation transfer function (MTF) curves are a vital indicator of the capability of the lens to render details. The Nyquist frequency is 29 lp/mm for the IR movement with an image element size of 17 μm. Except for the tangential MTF curve at 75°, the curves remain essentially consistent as the spatial frequency increases, and all of them are higher than 0.51 at the Nyquist frequency, showing that the polarization lens has excellent imaging capability [Fig. 2(c)]. The tangential MTF curve at 75° decreases significantly because of the distortion and energy loss caused by the increased FOV angle; however, an MTF value higher than 0.11 at the Nyquist frequency is acceptable for such a large airspace.

    Figure 2(d) shows the relative illuminance of the polarization lens. The relative illuminance at the main wavelength of 10 μm reaches 81.8% at 75°; the relative illuminance curve decreases gently and the illumination distribution on the image plane is uniform, which verifies the effectiveness of the anti-telescopic optical structure. Meanwhile, the F/# remains stable at around 1.2. The maximum field curvature values in the sagittal and tangential directions are less than 0.07 mm for all four field angles [Fig. 2(e)], and the f-θ distortion is far less than 4% [Fig. 2(f)]. These evaluation indexes meet the empirical standards of lens design.

    Using our imaging equipment, we conducted IR polarization imaging experiments on alloy model cars in different scenes (Fig. 3). From the checkerboard composite of the SWFOV LWIR image and the SWFOV LWIR DoLP image, it can be seen that the imaging equipment not only captures the large-airspace scene successfully but also clearly presents the IR polarization information of the salient targets. In the close-ups of targets such as model cars, roads, sheds, and iron poles, the SWFOV LWIR DoLP images exhibit higher contrast, especially at texture and edge positions, which makes target recognition more accurate. The experimental results show that SWFOV LWIR DoLP images have richer features after the introduction of the polarization dimension, and the background suppression and target highlighting are improved significantly, exceeding the capability of traditional IR images. In addition, the detection waveband of our system falls in the long-wave region, which focuses on the target’s own radiation. This offers broad application prospects for military reconnaissance, target detection, camouflage identification, and other fields.


    Figure 3. SWFOV LWIR gaze polarization imaging test. By rotating the holographic line-grid IR polarization device in the IR polarizer head, image data are captured in four directions: 0°, 45°, 90°, and 135°. Stokes vectors are calculated for each pixel, where the s0 parameter directly corresponds to the SWFOV LWIR image and is further used to obtain the SWFOV LWIR DoLP image. The SWFOV LWIR images and SWFOV LWIR DoLP images are presented as a checkerboard composite.

    4. MULTI-STRATEGY DETAIL FEATURE EXTRACTION AND FUSION NETWORK

    Although IR polarization imaging can help highlight targets in complex scenes, target detection remains challenging [10,38]. Fusing IR images with IR DoLP images therefore becomes an effective way to enhance the performance of imaging systems, aiming to reduce the false positive rate and improve tracking accuracy. In addition, the complementary information between images needs to be considered, extracting information from different scales and different feature spaces. Previous work has shown that one of the main reasons for poor fusion results is that the feature extraction network cannot extract the detail features thoroughly. Our idea is to apply several different lightweight feature extraction networks in sequence, gradually deepening the understanding of the image feature information.

    The outline and texture positions of salient targets in IR DoLP images are characterized by high contrast, and sparse features describe these distinctly differentiated positions, which may appear as edges, corners, spots, etc., in the image. Extracting sparse features is therefore one of the crucial aspects of the fusion network. To exploit local and non-local feature interactions in the primary detail features synergistically, we employ a self-attention approximation branch to model the correlation between different locations in the image, which helps to maintain the global structure during fusion; for IR DoLP images in particular, a detail estimation branch is used to capture local details such as outlines and textures [Fig. 4(c)].


    Figure 4. Image fusion network. (a) The fusion network contains the LLRR model, Detail CNN, ADFU model, and decoder network. The LLRR model extracts sparse features from the IR images and IR DoLP images ($S_{IR}$ and $S_{DoLP}$). The Detail CNN model extracts high-frequency detail features ($\Psi_{IR}^{D}$ and $\Psi_{DoLP}^{D}$). The primary detail features obtained by adding $S$ and $\Psi$ are delivered to the ADFU model, which is based on the attention mechanism, yielding the new detail features $F_{DoLP}^{D}$ of the salient targets. Meanwhile, the LLRR model and Base Poolformer model extract the low-rank features $L_{IR}$ and low-frequency features $B_{IR}$ of the IR image, respectively, which compose the background features $F_{IR}^{B}$. Finally, the detail features $F_{DoLP}^{D}$ and the background features $F_{IR}^{B}$ are fused by the decoder network to obtain the fusion image. (b) LLRR model structure. The input data is $X$. The activation function $h_\xi$ is defined as $h_\xi(x) = \mathrm{sign}(x)\cdot\max(|x|-\xi, 0)$, where $Z_0 = h_\xi(V_0 * X)$. (c) ADFU model structure. The primary detail features are split into two parts, A and B; a 3×3 depthwise convolution (DWConv) is applied to A to generate non-local features, and Pooling denotes adaptive maximum pooling to obtain low-frequency information. For B, a 3×3 depthwise convolution followed by a 1×1 convolution and an activation function directly produces enhanced high-frequency features. The features from the A and B channels are added to obtain the improved detail features. The Detail CNN model inside the ADFU model then performs a third detail feature extraction on the improved detail features to obtain the final detail features $F_{DoLP}^{D}$. (d) Base Poolformer model structure. This model acquires background features; pooling operations enhance the perception of feature information, and PoolMLP uses two 1×1 convolutional layers with activation operations.

    We propose a multi-strategy detail feature extraction and fusion network (MSDFE-Net) [Fig. 4(a)] (details of the network models and the ablation studies are described in this section). The network consists of the learned low-rank representation (LLRR) model [Fig. 4(b)] [39], the Detail CNN encoder [Fig. 4(c)] [40], the Base Poolformer encoder [Fig. 4(d)] [41], the attention detail fusion (ADFU) model [Fig. 4(c)] [42], and the decoder network. The “multi-strategy” design is as follows: LLRR and Detail CNN together form the first extraction stage, the first part of the ADFU model represents the second extraction stage, and the Detail CNN inside the ADFU model represents the third extraction stage. Down-sampling and depthwise convolution on channel A generate non-local feature information. The variance of A is then used as the statistical dispersion of the spatially de-embedded information to modulate the global description of the non-local features. Finally, the global description is aggregated with A to characterize the overall structural information of the detail features. Channel B simply captures the finer high-frequency features.
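    The following structural sketch summarizes the data flow of Fig. 4(a) as described above. The module classes (LLRR, Detail CNN, ADFU, Base Poolformer, decoder) are placeholders for the components detailed in the remainder of this section, and the way the IR and DoLP primary detail features are combined before the ADFU model (summation) is an assumption of this sketch, not our released implementation.

```python
import torch.nn as nn

class MSDFENet(nn.Module):
    """Structural sketch of the multi-strategy detail feature extraction and fusion network."""
    def __init__(self, llrr, detail_cnn, adfu, base_poolformer, decoder):
        super().__init__()
        self.llrr, self.detail_cnn = llrr, detail_cnn      # first detail extraction stage
        self.adfu = adfu                                   # second + third extraction stages
        self.base_poolformer, self.decoder = base_poolformer, decoder

    def forward(self, ir, dolp):
        # Detail branch: sparse features plus high-frequency details -> primary detail features
        l_ir, s_ir = self.llrr(ir)        # LLRR returns (low-rank, sparse) features
        _, s_dolp = self.llrr(dolp)
        psi_ir, psi_dolp = self.detail_cnn(ir), self.detail_cnn(dolp)
        primary_detail = (s_ir + psi_ir) + (s_dolp + psi_dolp)   # assumed combination
        detail = self.adfu(primary_detail)                       # final F^D_DoLP

        # Background branch: low-rank + low-frequency features of the IR image -> F^B_IR
        background = l_ir + self.base_poolformer(ir)

        return self.decoder(detail, background)                  # fused image
```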

    A. Component Models of the MSDFE-Net

    LLRR model: LLRR is a network that emphasizes image feature decomposition. According to the literature [39,43], the decomposition of image features can be defined as
$$\min_{L,S,D_1,D_2}\ \|X - D_1 L - D_2 S\|_F^2 + \lambda_1\|L\|_* + \lambda_2\|S\|_1 + \lambda_3\left(\|D_1\|_F^2 + \|D_2\|_F^2\right), \tag{8}$$
where $D_1$ and $D_2$ indicate the mappings of $L$ and $S$ from the source image to the background features and salient features, respectively; $L$ and $S$ are the low-rank and sparse coefficients, respectively; and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are constraint coefficients. $\|\cdot\|_F$ denotes the Frobenius norm, and $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the nuclear norm and $l_1$-norm, respectively. According to the literature [44], Eq. (8) can be solved efficiently by the learned iterative shrinkage-thresholding algorithm (LISTA). The iterative process is
$$Z_i = h_\xi\!\left[\alpha D^T\cdot(X - D\cdot Z_{i-1}) + (1-\lambda_3)Z_{i-1}\right],\qquad \xi = \alpha(\lambda_1, \lambda_2)^T, \tag{9}$$
where $D = (D_1, D_2)$, $Z_i$ is the result of the $i$th iteration, and $\alpha$ is a constant. According to the literature [39], convolutional layers can be used instead of matrix multiplications, so Eq. (9) can be further written as
$$Z_i = h_\xi\!\left[N_2 * (X - N_1 * Z_{i-1}) + (1-\lambda_3)Z_{i-1}\right], \tag{10}$$
where $N_1$ and $N_2$ are convolutional layers, $*$ denotes the convolution operation, and $h_\xi$ is the activation function defined above. In this network, the low-rank features and sparse features of the image are fully obtained by cascading multiple LLRR models.
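    A minimal PyTorch sketch of the convolutional LISTA iteration in Eq. (10), with h_ξ implemented as the soft-threshold defined in Fig. 4(b). The channel count, kernel size, threshold ξ, λ3, and the final channel split into low-rank and sparse parts are assumptions for illustration.

```python
import torch
import torch.nn as nn

def soft_threshold(x, xi):
    """h_xi(x) = sign(x) * max(|x| - xi, 0), the activation defined in Fig. 4(b)."""
    return torch.sign(x) * torch.clamp(torch.abs(x) - xi, min=0.0)

class LLRRBlock(nn.Module):
    """One learned iteration Z_i = h_xi[N2 * (X - N1 * Z_{i-1}) + (1 - lambda3) Z_{i-1}], Eq. (10)."""
    def __init__(self, channels=32, xi=0.01, lam3=0.1):
        super().__init__()
        self.n1 = nn.Conv2d(channels, 1, 3, padding=1)   # N1: code -> image space
        self.n2 = nn.Conv2d(1, channels, 3, padding=1)   # N2: residual -> code space
        self.xi, self.lam3 = xi, lam3

    def forward(self, x, z_prev):
        residual = x - self.n1(z_prev)
        return soft_threshold(self.n2(residual) + (1.0 - self.lam3) * z_prev, self.xi)

# Usage sketch: cascade four blocks (the setting adopted in the ablation study) and split
# the resulting code into low-rank and sparse parts along the channel dimension.
x = torch.randn(1, 1, 128, 128)                # a 128 x 128 training patch
z = torch.zeros(1, 32, 128, 128)
for block in [LLRRBlock() for _ in range(4)]:
    z = block(x, z)
low_rank, sparse = torch.chunk(z, 2, dim=1)
```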

    ADFU model: the ADFU model is a detail feature fusion model that contains an attention mechanism model and a Detail CNN model, where the Detail CNN model is also one of the encoder network components. A channel splitting operation is performed after the primary detail features are input to the ADFU model; the splitting process can be expressed as [42]
$$(A, B) = S\!\left[\ast\!\left(F_{\mathrm{in}}^{2}\right)\right], \tag{11}$$
where $A$ and $B$ denote the image features of the two channels, $A, B \in \mathbb{R}^{H\times W\times C}$, and $S[\cdot]$ denotes the channel splitting operation. Finally, the two channels undergo a fusion operation to obtain the improved detail features. The Detail CNN model inside the ADFU model then performs the third detail-feature extraction to obtain the final detail features $F_{DoLP}^{D}$. The Detail CNN model can also be considered to contain an attention mechanism, which computes attention weights through insertion calculations in both branches and focuses on learning the more salient common features. Moreover, the IRB in the Detail CNN model is essentially a residual network that introduces convolution. SWFOV LWIR DoLP images have high contrast and distinct edge details; with the Detail CNN model we aim to incorporate more polarization features into the SWFOV LWIR images to further improve the contrast of salient targets in the improved detail features. In the Detail CNN module, the input information is well preserved because the input and output features can be generated from each other, so the module can be regarded as a lossless feature extraction module. Therefore, a Detail CNN with an affine coupling layer is very appropriate here, and the process can be described as follows [41]:
$$\psi_{I,k+1}^{S}[c+1:C] = \psi_{I,k}^{S}[c+1:C] + \Gamma_1\!\left(\psi_{I,k}^{S}[1:c]\right), \tag{12}$$
$$\psi_{I,k+1}^{S}[1:c] = \psi_{I,k}^{S}[1:c] \odot \exp\!\left(\Gamma_2\!\left(\psi_{I,k+1}^{S}[c+1:C]\right)\right) + \Gamma_3\!\left(\psi_{I,k+1}^{S}[c+1:C]\right), \tag{13}$$
$$\psi_{I,k+1}^{S} = \mathrm{cat}\!\left\{\psi_{I,k+1}^{S}[1:c],\ \psi_{I,k+1}^{S}[c+1:C]\right\}, \tag{14}$$
where $\odot$ is the Hadamard product, $\psi_{I,k}^{S}[1:c] \in \mathbb{R}^{H\times W\times c}$ denotes the first to the $c$th channels of the input feature of the $k$th invertible layer ($k = 1, \ldots, K$), $\mathrm{cat}(\cdot)$ is the channel concatenation operation, and $\Gamma_i$ ($i = 1, 2, 3$) are arbitrary mapping functions. In the Detail CNN module, $\Gamma_i$ can be set to any mapping without affecting the lossless information transfer.
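    A self-contained sketch of the affine coupling step in Eqs. (12)-(14). The arbitrary mappings Γ_i are instantiated here as small 3×3 convolutional blocks purely for illustration; the inverse pass shows why the layer is lossless.

```python
import torch
import torch.nn as nn

def _mapping(cin, cout):
    """A small convolutional block standing in for an arbitrary mapping Gamma_i (assumption)."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.2),
                         nn.Conv2d(cout, cout, 3, padding=1))

class AffineCoupling(nn.Module):
    """Lossless (invertible) detail-feature layer following Eqs. (12)-(14)."""
    def __init__(self, channels, c=None):
        super().__init__()
        self.c = c or channels // 2                       # split position along the channels
        self.g1 = _mapping(self.c, channels - self.c)     # Gamma_1
        self.g2 = _mapping(channels - self.c, self.c)     # Gamma_2
        self.g3 = _mapping(channels - self.c, self.c)     # Gamma_3

    def forward(self, psi):
        a, b = psi[:, :self.c], psi[:, self.c:]           # psi[1:c], psi[c+1:C]
        b_new = b + self.g1(a)                            # Eq. (12)
        a_new = a * torch.exp(self.g2(b_new)) + self.g3(b_new)   # Eq. (13), Hadamard product
        return torch.cat([a_new, b_new], dim=1)           # Eq. (14)

    def inverse(self, psi_new):
        a_new, b_new = psi_new[:, :self.c], psi_new[:, self.c:]
        a = (a_new - self.g3(b_new)) * torch.exp(-self.g2(b_new))
        return torch.cat([a, b_new - self.g1(a)], dim=1)
```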

    Base Poolformer model: the background features of the fusion image are determined from the IR image. The low-rank features $L_{IR}$ extracted by the LLRR model and the low-frequency features $B_{IR}$ extracted by the Base Poolformer model are fused to obtain the background features $F_{IR}^{B}$. Base Poolformer is a residual network that introduces layer normalization, down-sampling, random depth, and a multilayer perceptron [40]. Its adaptive pooling operation enhances the perceptual range of the features while maintaining high computational efficiency.
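    A compact sketch of a Poolformer-style block of the kind used in the background branch: average pooling acts as the token mixer and PoolMLP is two 1×1 convolutions with an activation, wrapped in residual connections. The normalization choice, pooling window, and MLP ratio are assumptions.

```python
import torch.nn as nn

class PoolformerBlock(nn.Module):
    """Pooling token mixer + 1x1-conv MLP with residual connections (background-branch sketch)."""
    def __init__(self, channels, mlp_ratio=4, pool_size=3):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, channels)           # layer-norm-like normalization
        # Subtracting the identity keeps only the neighborhood average as the mixing signal
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.norm2 = nn.GroupNorm(1, channels)
        hidden = channels * mlp_ratio
        self.mlp = nn.Sequential(nn.Conv2d(channels, hidden, 1), nn.GELU(),
                                 nn.Conv2d(hidden, channels, 1))

    def forward(self, x):
        y = self.norm1(x)
        x = x + (self.pool(y) - y)                       # pooling as token mixing
        return x + self.mlp(self.norm2(x))               # PoolMLP: two 1x1 convs + activation
```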

    Training loss model: a variant of the Huber regression loss, the a-Huber loss, is introduced in the construction of the loss function; it can adaptively change the shape of the loss function. In addition, according to the SWFOV IR DoLP image features, the target-background contrast in the fusion images should be significantly enhanced to support operations such as target recognition and semantic segmentation. We consider $\Phi_{DoLP}^{D}$ and $\Phi_{IR}^{D}$ to contain texture, edge, and detail information that is weakly correlated, whereas the background information shared by $L_{IR}$ and $B_{IR}$ is strongly correlated. Hence, we also introduce a texture loss and a feature-map correlation loss to keep the fusion image richly characterized by texture details. The final loss function is defined as [39]
$$L_{\mathrm{total}} = L_{a\text{-Huber}} + \beta_1 L_{\mathrm{tex}} + \beta_2 L_{\mathrm{dec}}, \tag{15}$$
where $\beta_1$ and $\beta_2$ are equilibrium parameters that weight each sub-loss. $L_{a\text{-Huber}}$, $L_{\mathrm{tex}}$, and $L_{\mathrm{dec}}$ are the a-Huber loss, texture loss, and feature-map correlation loss, respectively. The three sub-losses are defined as
$$L_{a\text{-Huber}} = \begin{cases} 0.5\,\Delta^2, & |\Delta| \le \gamma, \\ \gamma\cdot\left(|\Delta| - 0.5\gamma\right), & |\Delta| > \gamma, \end{cases} \tag{16}$$
$$L_{\mathrm{tex}} = \frac{1}{HW}\left\|\, |\nabla I_F| - \max\!\left(|\nabla I_{DoLP}|, |\nabla I_{IR}|\right) \right\|_1, \tag{17}$$
$$L_{\mathrm{dec}} = \frac{\left[\mathrm{CC}\!\left(\Phi_{DoLP}^{D}, \Phi_{IR}^{D}\right)\right]^2}{\mathrm{CC}(L_{IR}, B_{IR}) + \tau}, \tag{18}$$
where $\Delta = I_F - \max(I_{DoLP}, I_{IR})$ and $\gamma$ is the threshold of the a-Huber loss. $\nabla$ denotes the Sobel gradient operator, which measures the texture information of the image, $|\cdot|$ is the absolute value operation, $\mathrm{CC}(\cdot)$ is the correlation coefficient operation, and $\tau$ is a constant that avoids a zero denominator.
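    A sketch of the training losses in Eqs. (15)-(18) for single-channel tensors of shape (N, 1, H, W), using a Sobel operator for the gradient ∇ and Pearson correlation for CC(·). The threshold γ, the constant τ, and the L1 gradient magnitude are illustrative assumptions consistent with the definitions above; β1 = 10 and β2 = 2 follow the settings reported in this section.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)

def sobel_grad(img):
    """L1 Sobel gradient magnitude of a (N, 1, H, W) image tensor."""
    kx = SOBEL_X.to(img)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, kx.transpose(2, 3), padding=1)
    return gx.abs() + gy.abs()

def a_huber_loss(i_fused, i_dolp, i_ir, gamma=0.2):
    """Eq. (16): quadratic for small residuals, linear beyond the threshold gamma (assumed value)."""
    delta = i_fused - torch.maximum(i_dolp, i_ir)          # residual of Eq. (16)
    quad = 0.5 * delta ** 2
    lin = gamma * (delta.abs() - 0.5 * gamma)
    return torch.where(delta.abs() <= gamma, quad, lin).mean()

def texture_loss(i_fused, i_dolp, i_ir):
    """Eq. (17): keep the strongest source texture in the fused image."""
    target = torch.maximum(sobel_grad(i_dolp), sobel_grad(i_ir))
    return F.l1_loss(sobel_grad(i_fused), target)

def cc(a, b, eps=1e-6):
    """Pearson correlation coefficient of two feature maps."""
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def dec_loss(phi_dolp_d, phi_ir_d, l_ir, b_ir, tau=1.01):
    """Eq. (18): penalize correlated detail features relative to correlated background features."""
    return cc(phi_dolp_d, phi_ir_d) ** 2 / (cc(l_ir, b_ir) + tau)

def total_loss(i_fused, i_dolp, i_ir, phi_dolp_d, phi_ir_d, l_ir, b_ir, b1=10.0, b2=2.0):
    """Eq. (15) with beta1 = 10 and beta2 = 2."""
    return (a_huber_loss(i_fused, i_dolp, i_ir)
            + b1 * texture_loss(i_fused, i_dolp, i_ir)
            + b2 * dec_loss(phi_dolp_d, phi_ir_d, l_ir, b_ir))
```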

    B. Experimental Setup for Converged Network

    Dataset [45]: we used the LDDRS dataset for testing the image fusion network. LDDRS is an LWIR polarization dataset of 2113 sets of road scenes captured while driving, acquired with an uncooled IR split-focal-plane polarization camera at 512×640 resolution. The image backgrounds include urban roads and highways in daytime and nighttime environments, and the dataset is dominated by moving vehicles. We randomly selected 1713 images as the training set and the remaining 400 images as the test set. The network model is implemented in the PyTorch framework, and the experiments were conducted on a computer with an NVIDIA GeForce RTX 3080 GPU and an Intel(R) Xeon(R) Silver 4210R CPU at 2.40 GHz. Each training sample is divided into 128×128 patches: for each input image, three patches are taken horizontally and two vertically with a stride of 200, so six 128×128 patches are obtained per image. We use the Adam optimizer with a batch size of two. The number of training epochs is 40. The initial learning rate is $10^{-4}$, multiplied by 0.5 every 20 epochs. In addition, the loss-function hyperparameters $\beta_1$ and $\beta_2$ in our method are set to 10 and 2, respectively.
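    The patch scheme described above (128×128 crops with a stride of 200 on 512×640 frames, giving 3×2 = 6 patches per image) can be sketched as follows; the function is a plain slicing helper for illustration, not our training code.

```python
def extract_patches(img, size=128, stride=200):
    """Slide a size x size window with the given stride; a 512 x 640 input yields 2 x 3 = 6 patches."""
    h, w = img.shape[:2]
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, stride)
            for c in range(0, w - size + 1, stride)]
```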

    Current evaluation metrics for fusion image quality mainly focus on the image itself. Although the traditional metrics perform well in terms of data values, they do not adequately meet the need to characterize the fusion of IR images with IR DoLP images. Therefore, we designed three targeted evaluation metrics for IR images fused with IR DoLP images: contrast enhancement ratio (CER), detail retention (DR), and polarization information difference (PID). Within a reasonable range, higher CER and DR values indicate better fusion, while the opposite holds for PID. Furthermore, we introduced two traditional metrics, peak signal-to-noise ratio (PSNR) and information entropy (En), giving five evaluation metrics that characterize the fusion images comprehensively. The constructed metrics are defined as
$$\mathrm{CER} = \frac{C_F - C_{IR}}{C_{IR}} \times 100\%, \tag{19}$$
$$\mathrm{DR} = \frac{E_F}{E_{IR}} \times 100\%, \tag{20}$$
$$\mathrm{PID} = \frac{1}{M\times N}\sum_{m=1}^{M}\sum_{n=1}^{N}\sqrt{\sum_{c=1}^{C}\left(I_P - I_F\right)^2}, \tag{21}$$
where $C_F$ and $C_{IR}$ indicate the contrast of the SWFOV fusion images and SWFOV LWIR images, respectively, with $C = V/\sigma$, where $V$ and $\sigma$ are the average gray value and the standard deviation of the image. A larger $C$ indicates well-defined light and dark areas, i.e., high contrast; a smaller $C$ indicates a concentrated gray-scale distribution, i.e., low contrast. If $C_F > C_{IR}$, the ratio is positive, indicating contrast enhancement. $E_F$ and $E_{IR}$ quantify the detail information of the SWFOV fusion image and SWFOV LWIR image using the Canny edge detection algorithm; the number or intensity of edge pixels directly reflects the richness of texture detail. If $E_F > E_{IR}$, the texture detail of the fusion image is stronger than that of the IR image. $I_P$ and $I_F$ denote the SWFOV LWIR DoLP image and the SWFOV fusion image, respectively. We use the Euclidean distance to measure the difference in polarization over the full image channels; the nested summation and square-root operations in the PID formula capture the exact amount of difference. This design makes the metric highly sensitive to small changes in polarization information and able to detect subtle polarization distortions introduced by the fusion algorithm. By averaging the polarization differences of all pixels in the whole image, it avoids interference from local outliers and reflects the overall polarization fidelity of the fusion image comprehensively and stably, so the metric remains robust under slight noise or uneven image quality.
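    A sketch of the three proposed metrics, assuming 8-bit grayscale inputs, the contrast measure C defined above, and OpenCV's Canny detector for the edge maps; the Canny thresholds and the ε regularizers are illustrative assumptions.

```python
import cv2
import numpy as np

def cer(fused, ir):
    """Contrast enhancement ratio, Eq. (19), using the contrast measure C = V / sigma."""
    c = lambda img: img.mean() / (img.std() + 1e-6)
    return (c(fused) - c(ir)) / c(ir) * 100.0

def dr(fused, ir, lo=50, hi=150):
    """Detail retention, Eq. (20): ratio of Canny edge energy of the fused and IR images."""
    e = lambda img: cv2.Canny(img.astype(np.uint8), lo, hi).sum()
    return e(fused) / (e(ir) + 1e-6) * 100.0

def pid(fused, dolp):
    """Polarization information difference, Eq. (21): mean per-pixel Euclidean distance over channels."""
    f = np.atleast_3d(fused).astype(np.float64)
    p = np.atleast_3d(dolp).astype(np.float64)
    return np.sqrt(((p - f) ** 2).sum(axis=2)).mean()
```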

    C. Ablation Study

    In this part, we conduct ablation studies to validate the effectiveness of the proposed approach. We denote the network without the LLRR model and the ADFU model as the baseline. Compared to the baseline network [Fig. 5(a)], the fusion images obtained by introducing the LLRR model highlight the target outline and texture details significantly, because the LLRR model efficiently extracts the sparse features ($S_{IR}$ and $S_{DoLP}$) of both the SWFOV LWIR images and the SWFOV LWIR DoLP images. We set the number of LLRR blocks to four based on our pre-tests: image fusion fails when fewer than two LLRR blocks are used, while increasing the number to eight makes the computation too heavy, so four blocks balance the fusion effect against the computational cost of the whole fusion network. Meanwhile, the fusion images obtained with the ADFU model completely retain the background features of the SWFOV LWIR images. When both the LLRR model and the ADFU model are added to the fusion network, the detail features of salient targets in the SWFOV LWIR DoLP images are extracted more completely, and the IR image contributes the background more efficiently [pseudo-color images and close-ups in Fig. 5(a)]. The pseudo-color distributions make it possible to visualize the variability of image fusion across the different networks. The loss function curve [Fig. 5(b)] shows that the fusion network has converged by the 40th epoch, so we set the number of training epochs to 40. Compared to the LLRR-only fusion network, the number of parameters in the LLRR + ADFU fusion network does not increase significantly (only $4.75\times10^6$), but it realizes an excellent fusion effect [Fig. 5(c)].


    Figure 5.Fusion effects of ablation study. (a) The fusion effects and close-ups of the fusion images obtained from the baseline model, the LLRR model, the ADFU model, and the LLRR + ADFU model are shown. Three scenes were reconstructed using pseudo-color for viewing convenience. (b) Loss function of the image fusion network. (c) Number of parameters of the four ablation models.

    As shown in Fig. 5(a), the baseline and LLRR fused images show an obvious black-or-white appearance, which clearly does not satisfy the criteria for image fusion. Hence, the CER values of both the baseline and LLRR fused images are treated as anomalous data. Since CER describes image contrast, the LLRR fused images have the maximum CER value in Table 1, and because the LLRR fused image contains more feature information than the baseline fused image [Fig. 5(a)], the CER value of the baseline fused image is the second largest.

    Table 1. Quantitative Evaluation of Ablation Study on the Test Dataset

    Method      | CER (%)  | DR (%)   | PID      | PSNR    | En
    Baseline    | 74.5631  | 74.915   | 299.857  | 2.8916  | 3.7342
    LLRR        | 128.8028 | 76.1781  | 259.3154 | 4.0078  | 3.0442
    ADFU        | 17.3612  | 128.8947 | 86.6665  | 12.9008 | 6.7563
    LLRR + ADFU | 32.0598  | 147.1214 | 81.3571  | 13.9214 | 6.5471

    Best and abnormal data are bold and double underlined, respectively.

    Table 2. Quantitative Evaluation of Different Fusion Methods on the Test Dataset

    Method     | CER (%) | DR (%)   | PID      | PSNR    | En
    RFN-Nest   | 5.1064  | 121.9426 | 99.7914  | 11.8487 | 6.7448
    SwinFusion | 4.5682  | 109.5763 | 188.0744 | 6.177   | 7.2394
    YDTR       | 19.1432 | 81.1398  | 166.2386 | 7.3832  | 7.0926
    CDDFuse    | 6.0211  | 125.2308 | 186.945  | 6.2367  | 7.3335
    CMTFusion  | 19.2669 | 131.9363 | 114.8855 | 10.6667 | 6.9971
    DIF-Fusion | 9.5311  | 126.0948 | 196.8322 | 5.9109  | 7.1988
    DCSFuse    | 23.8086 | 113.7567 | 250.0697 | 4.1158  | 7.2151
    Ours       | 32.0598 | 147.1214 | 81.3571  | 13.9214 | 6.5471

    Best and second best results are bold and underlined, respectively.

    D. Comparison with the State-of-the-Art

    We compare our proposed fusion method with existing state-of-the-art fusion methods, including RFN-Nest [22], SwinFusion [46], YDTR [47], CDDFuse [41], CMTFusion [48], DIF-Fusion [49], and DCSFuse [50].

    Qualitative comparison: Fig. 6 shows the fusion results of the different methods. Our proposed method successfully fuses the features of the IR images and IR DoLP images and significantly improves the contrast of salient targets; e.g., the detail textures of cars, pedestrians, and utility poles are clearly rendered, and target outlines are more distinct. Compared with the other fusion methods, the targets in our fusion images are visually easier to recognize. In addition, the comparison methods perform poorly in suppressing the background noise of the IR DoLP images, and the results of SwinFusion, YDTR, CDDFuse, and DCSFuse are highly similar to the IR images and contain limited IR polarization features. We believe that the crucial factor in the effective fusion achieved by the proposed network is the triple detail feature extraction of the IR and IR DoLP images, which adequately captures the texture and edge outline information of salient targets. In particular, the ADFU model further separates the extracted features into local and non-local features and focuses on texture-rich regions during fusion, so as to fully preserve the high-contrast properties of the IR DoLP images.


    Figure 6.Qualitative comparison of different fusion methods for three scenes. (a) IR. (b) IR DoLP. (c) RFN-Nest. (d) SwinFusion. (e) YDTR. (f) CDDFuse. (g) CMTFusion. (h) DIF-Fusion. (i) DCSFuse. (j) Ours.

    Quantitative comparison: as shown in Table 2, our fusion method exhibits excellent performance on all five evaluation metrics and performs best on the CER, DR, and PID metrics, which target IR polarization information fusion. It is worth noting that our proposed fusion strategy yields a slightly lower En than the comparison algorithms. This is because the En of the IR images and IR DoLP images in the test set is 7.334 and 5.6361, respectively. During fusion, our strategy fully extracts the detail features of the IR DoLP images, so the fused images contain more IR DoLP detail features and therefore have a lower En than those of the other fusion solutions. From Fig. 6, it can be seen that the fusion images obtained by the comparison algorithms are very close to the infrared intensity images, while the fusion images obtained by our network clearly exhibit the characteristics of infrared polarization images, i.e., they carry more polarization features. This indicates that our fusion method incorporates polarization information more effectively than the comparison methods, rather than being biased toward a single source modality. It further demonstrates that our fusion images better recover the detail and background information of both source images, maintaining a strong correlation with both.

    Questionnaire: we designed a questionnaire to demonstrate the advantages of our fusion images for target recognition. The questionnaire was open to undergraduate and graduate students in optics. We showed them the three sets of fusion images in Fig. 6 in shuffled order and asked them to subjectively choose the image with the best fusion effect. The fusion images obtained by our method received 60.4%, 66.7%, and 63.4% of the 146 valid votes, which further confirms the reliability of our fusion method for target recognition.

    5. APPLICATION EXPERIMENTS

    A. Application in Target Detection

    We perform the car target detection task on the SWFOV LWIR images, SWFOV LWIR DoLP images, and the fusion images obtained with the different fusion methods. The widely used and stable YOLOX network is adopted for target detection [51].

    We used an alloy model car (1:18 scale) as a substitute for real cars and acquired a dataset of 657 images with our imaging system for training the target detection network. The dataset covers SWFOV LWIR images, SWFOV LWIR DoLP images, and their SWFOV fusion images; 29 sets of images were selected randomly for evaluation in the test (Fig. 7). The fusion images obtained with our method show significant advantages in confidence and have the lowest miss rate. mAP[0.5:0.95] denotes the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The fusion images obtained by our method reach an mAP of 97.30, much higher than the mAP values of the original SWFOV LWIR images, SWFOV LWIR DoLP images, and the other SWFOV fusion methods, indicating that our proposed method achieves higher precision in target detection and shows excellent performance.


    Figure 7.Testing our dataset using different fusion methods. Using the YOLOX detection network to detect car targets in SWFOV LWIR images, SWFOV LWIR DoLP images, and SWFOV fused images, and demonstrating the average precision and confidence of car detection.

    B. Application in Detecting Distant Targets

    It can also be noticed from Fig. 5 that our method shows advantages in detecting distant targets. Based on the scale of the alloy model car, we give car detection results at equivalent real distances of 9, 18, 27, 36, and 45 m (Fig. 8). As the detection distance increases, target recognizability and detection confidence decrease. Although the SWFOV LWIR DoLP images performed slightly worse than the SWFOV LWIR images at the 9 m position, they were detected better at the more distant positions. At 27 and 36 m, only one vehicle was detected in the IR images, with detection confidences of 0.79 and 0.67, respectively, whereas two vehicles could be detected in the IR DoLP images, with confidences of 0.79, 0.66 and 0.78, 0.76, respectively. At 9, 18, 27, and 36 m, all vehicle targets can be detected in the fusion images, and the detection confidence is also higher than for the IR and IR DoLP images. At 45 m, the vehicle targets could not be detected in either the IR image or the IR DoLP image, but they could still be detected in the fusion image. Therefore, coupled with our proposed fusion method, our imaging technique has good potential for distant target recognition. Furthermore, the distortion characteristics of SWFOV infrared images can decrease target detection accuracy, but the accuracy can be improved by fusing with the SWFOV infrared polarization image, which demonstrates that our fusion algorithm is very effective in extracting target detail features. In other words, this also mitigates the effect of the distortion characteristics of the SWFOV infrared polarization lens on imaging.


    Figure 8.Car detection task is performed at different distance locations. Cars are detected using YOLOX detection network for SWFOV LWIR images, SWFOV LWIR DoLP images, and SWFOV fusion images.

    C. Application in IR Camouflage Target Recognition

    The IR camouflage suit shown in Fig. 9(b) meets the IR camouflage performance requirements within 8-12 μm. The suit effectively changes the optical characteristics of the human body and, combined with its IR camouflage segmentation feature, reduces the probability of the body being found after camouflage to less than 30%.


    Figure 9.IR camouflage target recognition experiments. (a) Experimental demonstration. (b) Visual images wearing the IR camouflage suit in bush, building, and grass environments. (c) Recognition tests of IR targets and IR camouflaged targets in road and grass environments, respectively. The SWFOV LWIR images, SWFOV LWIR DoLP images, and SWFOV fusion images are shown from left to right. The gray values of rectangular contour-marked regions in the images are counted and annotated with the HIS color model. The close-ups show the statistics of the gray values of the outline marking area, and the height indicates the gray value. (d) Analyzing the pixel distributions of the SWFOV LWIR images, SWFOV LWIR DoLP images, and SWFOV fusion images of the four scenes in Fig. 3.

    We recruited two volunteers to perform experiments on a road and in a grass environment, one in normal attire and the other wearing the IR camouflage suit [Fig. 9(c)]. Compared with the normally dressed volunteer in the IR images, the radiation intensity of the volunteer wearing the IR camouflage suit is substantially suppressed, so the volunteer is effectively camouflaged in the IR band. Since the SWFOV LWIR DoLP image focuses more on the outline information of the targets, the outlines of both volunteers are preserved completely and even somewhat enhanced; SWFOV LWIR DoLP detection thus invalidates the IR camouflage effect of the suit. However, the absolute predominance of outline information in SWFOV LWIR DoLP images is not satisfactory on its own, and this problem is solved in the fusion images. From the close-ups of the fusion images, the radiation intensity of the volunteer wearing the IR camouflage suit is slightly lower than in the SWFOV LWIR DoLP images but significantly stronger than in the IR images. Moreover, owing to the introduction of the SWFOV LWIR image features, the outline information of both people is enhanced in the fusion image, especially that of the volunteer in normal attire.

    We give the gray-value distributions of the SWFOV LWIR images, SWFOV LWIR DoLP images, and SWFOV fusion images [Fig. 9(d)] for the four scenes (Fig. 3) captured by our imaging system. The SWFOV LWIR image has a wide distribution of gray values concentrated in the middle- and high-gray-value regions, which reflects its ability to capture the low-frequency features (e.g., luminance information) of the radiation. The SWFOV LWIR DoLP image gray values are concentrated in the lower-gray-value region, which effectively reveals high-frequency features such as the target geometry and outline. It is noteworthy that the gray-value distribution of the fusion image lies in the middle-gray-value region between those of the SWFOV LWIR image and SWFOV LWIR DoLP image. This distribution suggests that the advantages of the two source images are successfully fused, preserving the radiation distribution of the scene while highlighting the surface structural features of the targets. This complementarity enhances the expressiveness and robustness of the fusion images in complex scenes and demonstrates that our image fusion network fuses the features of both modalities evenly rather than preferring a single source image. In summary, for IR camouflaged targets the performance of ordinary IR detection is limited, and our imaging technology can tackle this challenge.

    6. DISCUSSION AND CONCLUSION

    In this work, we have described an SWFOV LWIR gaze polarization imaging technique. It realizes IR polarization perception over a large airspace with an SWFOV LWIR polarization lens designed according to the non-similar imaging principle. An image fusion network with multi-strategy extraction of detail features is then embedded, and the resulting fusion image retains the rich background features of the SWFOV LWIR images and the high-contrast features of the SWFOV LWIR DoLP images. Car target detection, IR camouflage recognition, and distant target detection experiments show that the fusion images have significant advantages over SWFOV LWIR and SWFOV LWIR DoLP images. Additionally, the high-resolution IR movement used in the system further enhances image quality. These properties bring significant advantages to SWFOV LWIR polarization imaging applications, and the different applications validate the versatility, flexibility, and robustness of the imaging system.

    The SWFOV LWIR gaze polarization imaging technique can be further extended. First, we can verify the imaging capability of other polarization parameters over a large airspace to realize 3D reconstruction of more targets in the SWFOV with the AoP [52,53]. This would enable the application of SWFOV LWIR polarization vision in fields such as polarization navigation [54] and industrial production [6]. Second, it is necessary to construct datasets containing multiple targets in terrestrial, maritime, and aerial scenes to broaden the range of applications and improve the generalization ability of recognition and detection, which can help cope with anomaly detection challenges. Third, upgrading the IR movement to an IR split-focal-plane (division-of-focal-plane) polarization movement and integrating this technology into other advanced platforms (e.g., vehicles, drones) could realize real-time and accurate high-altitude camouflage target recognition, medical diagnostics, and other applications [55,56], thus improving detection sensitivity and efficiency. Although imaging resolution would be sacrificed by using an IR split-focal-plane movement, it can be recovered by IR super-resolution techniques (we have already started to study this). Overall, we believe this work fully inherits the advantages of IR intensity imaging and may provide further opportunities to create the next generation of imaging technologies with wider observation ranges, stronger polarization imaging capability, and stronger assisted identification capability.

    [15] M. Kyselák, D. Grenar, K. Slavíček. Smart sensor system for detecting temperature changes by polarized light. 6th International Conference on Smart and Sustainable Technologies (SpliTech), 1-3(2021).

    [32] B. Micusik, T. Pajdla. Estimation of omnidirectional camera model from epipolar geometry. IEEE Computer Society Conference on Computer Vision and Pattern Recognition(2003).

    [34] C. Geyer, K. Daniilidis. A unifying theory for central panoramic systems and practical implications. Computer Vision—ECCV, 445-461(2000).

    [35] S. K. Nayar. Catadioptric omnidirectional camera. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 482-488(1997).

    [36] S. J. Zahra, R. Sulaiman, A. S. Prabuwono. Improved descriptor for dynamic line matching in omnidirectional images. International Conference on Electrical Engineering and Informatics (ICEEI), 138-142(2015).

    [41] Z. Zhao, H. Bai, J. Zhang. CDDFuse: correlation-driven dual-branch feature decomposition for multi-modality image fusion. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5906-5916(2023).

    [42] M. Zheng, L. Sun, J. Dong. SMFANet: a lightweight self-modulation feature aggregation network for efficient image super-resolution. Computer Vision—ECCV, 359-375 (2024).

    [44] K. Gregor, Y. LeCun. Learning fast approximations of sparse coding. 27th International Conference on Machine Learning (ICML), 399-406 (2010).

    Paper Information

    Category: Image Processing and Image Analysis

    Received: Feb. 18, 2025

    Accepted: Apr. 24, 2025

    Published Online: Jul. 1, 2025

    The Author Email: Fuyu Huang (hfyoptics@163.com), Limin Liu (lk0256@163.com)

    DOI:10.1364/PRJ.559833

    CSTR:32188.14.PRJ.559833
