Acta Optica Sinica, Volume 45, Issue 15, 1510002 (2025)

Attention-Guided Matching for Stereo Multi-Focus Image Fusion

Sicheng Jiang, Xiaodong Chen*, Huaiyu Cai, and Yi Wang
Author Affiliations
  • Key Laboratory of Optoelectronics Information Technology, Ministry of Education, School of Precision Instrument and Optoelectronics Engineering, Tianjin University, Tianjin 300072, China

    Objective

    The limited depth-of-field (DoF) in optical imaging systems poses a fundamental challenge in digital photography, making it difficult to capture all scene elements in sharp focus simultaneously. Multi-focus image fusion (MFIF) addresses this limitation by integrating multiple images with varying focus regions into a single all-in-focus image. Traditional MFIF methods typically follow a sequential “match-then-fuse” paradigm, which is susceptible to artifacts caused by imperfect registration, particularly in stereo scenarios with significant disparities. While deep learning-based methods have improved upon traditional techniques, they mostly target monocular cases and often struggle with large-scale content mismatches caused by stereo disparities or dynamic scene changes. In this paper, we tackle the more general and challenging scenario of stereo multi-focus image fusion, where issues such as lens breathing, defocus spread effects (DSE), and inter-view disparities further complicate content alignment and information preservation. The two primary goals are: 1) to achieve robust stereo matching that accommodates large disparities without explicit image registration, and 2) to ensure high-quality fusion that retains both local details and global semantic coherence. By addressing these challenges, the proposed method enhances the applicability of MFIF in critical fields such as microscopic imaging, industrial inspection, and security surveillance, where stereo vision systems are increasingly used but existing fusion methods often fall short.

    Methods

    The proposed framework incorporates several key innovations to meet the demands of stereo multi-focus fusion. Central to the design is a novel asymmetric network architecture that explicitly separates processing into primary and secondary branches, breaking away from traditional symmetric structures. The primary encoder, based on ResNet-101, extracts fine-grained local details using a multi-scale feature pyramid (Conv 1–Conv 4). Meanwhile, the secondary encoder leverages a Vision Transformer (ViT-B/16) to capture global context by dividing input images into 16×16 patches and encoding them into 768-dimensional token vectors. This dual-path structure enables the model to simultaneously preserve fine details and global coherence. A key innovation is the disparity-aware matching module, inspired by the modality alignment capabilities of the querying Transformer (Q-Former). This module consists of six stacked matching blocks, each incorporating self-attention and cross-attention layers. Learnable queries facilitate feature alignment between the two branches, supporting stereo matching without explicit registration. The module is trained using a hybrid loss function that combines a matching loss and a feature reconstruction loss, with differentiated treatment of positive and negative sample pairs. The fusion module adopts cross-attention mechanisms to compare and selectively integrate sharp features from both branches. Unlike the matching module, it treats both streams symmetrically and uses residual learning to blend high-frequency details and reconcile inconsistent content. A U-Net-based decoder with skip connections reconstructs the final all-in-focus image. End-to-end training uses a composite loss comprising the matching loss, a fusion loss, a multi-scale reconstruction loss, and a perceptual anti-artifact loss based on the learned perceptual image patch similarity (LPIPS) metric. The training data include synthetic datasets (NYU-D2 and InStereo2K with simulated defocus) and real-world datasets (Lytro and Middlebury 2014), with a balanced emphasis on monocular and stereo tasks.
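    To make the matching design concrete, the sketch below shows one plausible form of such a Q-Former-style matching block: a small set of learnable queries self-attends, then cross-attends to the ViT-branch tokens, distilling the secondary branch into a fixed set of aligned query features that can later be matched against primary-branch features. This is a minimal PyTorch sketch under stated assumptions; the module names, query count, and exact wiring are illustrative and are not the authors' implementation (the paper stacks six such blocks and supervises them with the hybrid matching and feature-reconstruction loss).

```python
# Illustrative sketch only: a Q-Former-style matching block with learnable
# queries, self-attention, and cross-attention to 768-dim ViT-B/16 tokens.
import torch
import torch.nn as nn


class MatchingBlock(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, vit_tokens):
        # queries:    (B, N_q, dim) learnable query tokens
        # vit_tokens: (B, N_t, dim) global-context tokens from the ViT branch
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q, need_weights=False)[0]
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, vit_tokens, vit_tokens, need_weights=False)[0]
        return queries + self.ffn(self.norm3(queries))


class MatchingModule(nn.Module):
    """Stack of matching blocks driven by learnable queries (six in the paper)."""

    def __init__(self, dim=768, num_queries=32, num_blocks=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(MatchingBlock(dim) for _ in range(num_blocks))

    def forward(self, vit_tokens):
        q = self.queries.expand(vit_tokens.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, vit_tokens)
        return q  # aligned query features, matched against primary-branch features downstream
```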
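    The composite training objective can be sketched in the same hedged spirit. The per-term weights below are assumptions (the abstract does not give them), the matching and fusion losses are taken as precomputed inputs, and the perceptual anti-artifact term uses the standard LPIPS implementation from the `lpips` package.

```python
# Hedged sketch of the composite loss: matching + fusion + multi-scale
# reconstruction + LPIPS-based perceptual anti-artifact term. Weights are
# illustrative assumptions, not values from the paper.
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual distance on 3-channel images in [-1, 1]


def composite_loss(fused, target, fused_pyramid, target_pyramid,
                   match_loss, fusion_loss,
                   w_match=1.0, w_fuse=1.0, w_rec=1.0, w_lpips=0.1):
    # Multi-scale reconstruction: L1 over a pyramid of fused outputs vs. references.
    rec = sum(F.l1_loss(p, t) for p, t in zip(fused_pyramid, target_pyramid))
    # LPIPS anti-artifact term; inputs assumed in [0, 1], rescaled to [-1, 1].
    anti_artifact = lpips_fn(fused * 2 - 1, target * 2 - 1).mean()
    return (w_match * match_loss + w_fuse * fusion_loss
            + w_rec * rec + w_lpips * anti_artifact)
```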

    Results and Discussions

    Extensive experiments validate the effectiveness of the proposed method. In monocular fusion tasks using the Lytro dataset, the method achieves state-of-the-art results, with a QAB/F score of 0.7894, a 0.021 improvement over the IFCNN baseline. The normalized mutual information (NMI) score of 1.1034 further confirms its strong capacity for transferring information from input images to the fused output (Table 1). In stereo fusion tasks on the Middlebury 2014 dataset, the method shows remarkable robustness to disparities. The artifact metric (NAB/F) drops to 0.0715, a 0.06 improvement over conventional methods, while maintaining high visual fidelity [visual information fidelity for fusion (VIFF) score of 0.9813]. These results validate the effectiveness of the Q-Former-inspired matching module in handling large content mismatches (Table 2). Qualitative comparisons (Figs. 9 and 10) reveal several key advantages: 1) In regions with abrupt DoF transitions, the method achieves smooth blending without halo artifacts; 2) In regions affected by DSE, it successfully reconstructs textures that other methods blur or miss; 3) In stereo cases with substantial disparity, it preserves sharp details and avoids ghosting effects common in registration-dependent approaches. Ablation studies (Table 3) show that removing the matching module increases the artifact metric NAB/F by 0.08, confirming its importance for stereo tasks. Likewise, omitting the fusion module leads to a 0.027 drop in QAB/F, affirming its relevance to both monocular and stereo fusion. The asymmetric design also proves more effective than symmetric alternatives in addressing content mismatches.
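    For reference, the NMI metric reported above is, in the multi-focus fusion literature, usually computed as the mutual information between each source image and the fused result, normalized by the corresponding entropies; the abstract does not state the exact variant used, but the standard form is

    $$Q_{\mathrm{NMI}} = 2\left[\frac{I(A;F)}{H(A)+H(F)} + \frac{I(B;F)}{H(B)+H(F)}\right],$$

    where $A$ and $B$ are the source images, $F$ is the fused image, $I(\cdot\,;\cdot)$ denotes mutual information, and $H(\cdot)$ denotes Shannon entropy; larger values indicate that more source information is transferred into the fused output.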

    Conclusions

    We present a robust solution for stereo multi-focus image fusion, featuring several novel components. The asymmetric network structure balances local detail preservation with global context modeling. The Q-Former-inspired matching module establishes a new paradigm for handling stereo disparities without explicit registration. The cross-attention fusion mechanism effectively integrates complementary sharp features while suppressing artifacts. Experimental results across multiple benchmarks confirm that the proposed method outperforms existing approaches in both monocular and stereo settings. The framework’s robustness to content mismatches and ability to reconstruct missing details in defocused regions significantly expands the practical applicability of multi-focus fusion technology. Current limitations include computational demands from Transformer components and reduced effectiveness in extreme cases of content mismatch (e.g., full occlusion or severe motion blur). Future work will focus on 1) developing lightweight variants for real-time applications, 2) enhancing adaptability to extreme scene variations, and 3) exploring applications in multi-modal fusion and dynamic scene reconstruction. Overall, the proposed method offers a flexible and high-performance solution for advancing image fusion technologies in both academic and industrial applications.

    Sicheng Jiang, Xiaodong Chen, Huaiyu Cai, Yi Wang. Attention-Guided Matching for Stereo Multi-Focus Image Fusion[J]. Acta Optica Sinica, 2025, 45(15): 1510002

    Paper Information

    Category: Image Processing

    Received: Apr. 3, 2025

    Accepted: Apr. 27, 2025

    Published Online: Aug. 15, 2025

    The Author Email: Xiaodong Chen (xdchen@tju.edu.cn)

    DOI: 10.3788/AOS250836

    CSTR: 32393.14.AOS250836
