Acta Optica Sinica, Volume 45, Issue 15, 1510002 (2025)

Attention-Guided Matching for Stereo Multi-Focus Image Fusion

Sicheng Jiang, Xiaodong Chen*, Huaiyu Cai, and Yi Wang
Author Affiliations
  • Key Laboratory of Optoelectronics Information Technology, Ministry of Education, School of Precision Instrument and Optoelectronics Engineering, Tianjin University, Tianjin 300072, China

    Objective

    The limited depth-of-field (DoF) in optical imaging systems poses a fundamental challenge in digital photography, making it difficult to capture all scene elements in sharp focus simultaneously. Multi-focus image fusion (MFIF) addresses this limitation by integrating multiple images with varying focus regions into a single all-in-focus image. Traditional MFIF methods typically follow a sequential “match-then-fuse” paradigm, which is susceptible to artifacts caused by imperfect registration, particularly in stereo scenarios with significant disparities. While deep learning-based methods have improved upon traditional techniques, they mostly target monocular cases and often struggle with large-scale content mismatches caused by stereo disparities or dynamic scene changes. In this paper, we tackle the more general and challenging scenario of stereo multi-focus image fusion, where issues such as lens breathing, defocus spread effects (DSE), and inter-view disparities further complicate content alignment and information preservation. The two primary goals are: 1) to achieve robust stereo matching that accommodates large disparities without explicit image registration, and 2) to ensure high-quality fusion that retains both local details and global semantic coherence. By addressing these challenges, the proposed method enhances the applicability of MFIF in critical fields such as microscopic imaging, industrial inspection, and security surveillance, where stereo vision systems are increasingly used but existing fusion methods often fall short.

    Methods

    The proposed framework incorporates several key innovations to meet the demands of stereo multi-focus fusion. Central to the design is a novel asymmetric network architecture that explicitly separates processing into primary and secondary branches, breaking away from traditional symmetric structures. The primary encoder, based on ResNet-101, extracts fine-grained local details using a multi-scale feature pyramid (Conv 1–Conv 4). Meanwhile, the secondary encoder leverages a Vision Transformer (ViT-B/16) to capture global context by dividing input images into 16×16 patches and encoding them into 768-dimensional token vectors. This dual-path structure enables the model to simultaneously preserve fine details and global coherence. A key innovation is the disparity-aware matching module, inspired by the modality alignment capabilities of the querying Transformer (Q-Former). This module consists of six stacked matching blocks, each incorporating self-attention and cross-attention layers. Learnable queries facilitate feature alignment between the two branches, supporting stereo matching without explicit registration. The module is trained using a hybrid loss function that combines a matching loss and a feature reconstruction loss, with differentiated treatment of positive and negative sample pairs. The fusion module adopts cross-attention mechanisms to compare and selectively integrate sharp features from both branches. Unlike the matching module, it treats both streams symmetrically and uses residual learning to blend high-frequency details and reconcile inconsistent content. A U-Net-based decoder with skip connections reconstructs the final all-in-focus image. End-to-end training uses a composite loss comprising the matching loss, a fusion loss, a multi-scale reconstruction loss, and a perceptual anti-artifact loss based on the learned perceptual image patch similarity (LPIPS) metric. The training data include synthetic datasets (NYU-D2 and InStereo2K with simulated defocus) and real-world datasets (Lytro and Middlebury 2014), with a balanced emphasis on monocular and stereo tasks.
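    To make the matching design concrete, the sketch below shows one plausible form of such a Q-Former-style matching block: a small set of learnable queries self-attends, then cross-attends to the ViT-branch tokens, distilling the secondary branch into a fixed set of aligned query features that can later be matched against primary-branch features. This is a minimal PyTorch sketch under stated assumptions; the module names, query count, and exact wiring are illustrative and are not the authors' implementation (the paper stacks six such blocks and supervises them with the hybrid matching and feature-reconstruction loss).

```python
# Illustrative sketch only: a Q-Former-style matching block with learnable
# queries, self-attention, and cross-attention to 768-dim ViT-B/16 tokens.
import torch
import torch.nn as nn


class MatchingBlock(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, vit_tokens):
        # queries:    (B, N_q, dim) learnable query tokens
        # vit_tokens: (B, N_t, dim) global-context tokens from the ViT branch
        q = self.norm1(queries)
        queries = queries + self.self_attn(q, q, q, need_weights=False)[0]
        q = self.norm2(queries)
        queries = queries + self.cross_attn(q, vit_tokens, vit_tokens, need_weights=False)[0]
        return queries + self.ffn(self.norm3(queries))


class MatchingModule(nn.Module):
    """Stack of matching blocks driven by learnable queries (six in the paper)."""

    def __init__(self, dim=768, num_queries=32, num_blocks=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.blocks = nn.ModuleList(MatchingBlock(dim) for _ in range(num_blocks))

    def forward(self, vit_tokens):
        q = self.queries.expand(vit_tokens.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, vit_tokens)
        return q  # aligned query features, matched against primary-branch features downstream
```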
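    The composite training objective can be sketched in the same hedged spirit. The per-term weights below are assumptions (the abstract does not give them), the matching and fusion losses are taken as precomputed inputs, and the perceptual anti-artifact term uses the standard LPIPS implementation from the `lpips` package.

```python
# Hedged sketch of the composite loss: matching + fusion + multi-scale
# reconstruction + LPIPS-based perceptual anti-artifact term. Weights are
# illustrative assumptions, not values from the paper.
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual distance on 3-channel images in [-1, 1]


def composite_loss(fused, target, fused_pyramid, target_pyramid,
                   match_loss, fusion_loss,
                   w_match=1.0, w_fuse=1.0, w_rec=1.0, w_lpips=0.1):
    # Multi-scale reconstruction: L1 over a pyramid of fused outputs vs. references.
    rec = sum(F.l1_loss(p, t) for p, t in zip(fused_pyramid, target_pyramid))
    # LPIPS anti-artifact term; inputs assumed in [0, 1], rescaled to [-1, 1].
    anti_artifact = lpips_fn(fused * 2 - 1, target * 2 - 1).mean()
    return (w_match * match_loss + w_fuse * fusion_loss
            + w_rec * rec + w_lpips * anti_artifact)
```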

    Results and Discussions

    Extensive experiments validate the effectiveness of the proposed method. In monocular fusion tasks using the Lytro dataset, the method achieves state-of-the-art results, with a QAB/F score of 0.7894, a 0.021 improvement over the IFCNN baseline. The normalized mutual information (NMI) score of 1.1034 further confirms its strong capacity for transferring information from input images to the fused output (Table 1). In stereo fusion tasks on the Middlebury 2014 dataset, the method shows remarkable robustness to disparities. The artifact metric (NAB/F) drops to 0.0715, a 0.06 improvement over conventional methods, while maintaining high visual fidelity [visual information fidelity for fusion (VIFF) score of 0.9813]. These results validate the effectiveness of the Q-Former-inspired matching module in handling large content mismatches (Table 2). Qualitative comparisons (Figs. 9 and 10) reveal several key advantages: 1) In regions with abrupt DoF transitions, the method achieves smooth blending without halo artifacts; 2) In regions affected by DSE, it successfully reconstructs textures that other methods blur or miss; 3) In stereo cases with substantial disparity, it preserves sharp details and avoids ghosting effects common in registration-dependent approaches. Ablation studies (Table 3) show that removing the matching module increases the artifact metric NAB/F by 0.08, confirming its importance for stereo tasks. Likewise, omitting the fusion module leads to a 0.027 drop in QAB/F, affirming its relevance to both monocular and stereo fusion. The asymmetric design also proves more effective than symmetric alternatives in addressing content mismatches.
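    For reference, the NMI metric reported above is, in the multi-focus fusion literature, usually computed as the mutual information between each source image and the fused result, normalized by the corresponding entropies; the abstract does not state the exact variant used, but the standard form is

    $$Q_{\mathrm{NMI}} = 2\left[\frac{I(A;F)}{H(A)+H(F)} + \frac{I(B;F)}{H(B)+H(F)}\right],$$

    where $A$ and $B$ are the source images, $F$ is the fused image, $I(\cdot\,;\cdot)$ denotes mutual information, and $H(\cdot)$ denotes Shannon entropy; larger values indicate that more source information is transferred into the fused output.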

    Conclusions

    We present a robust solution for stereo multi-focus image fusion, featuring several novel components. The asymmetric network structure balances local detail preservation with global context modeling. The Q-Former-inspired matching module establishes a new paradigm for handling stereo disparities without explicit registration. The cross-attention fusion mechanism effectively integrates complementary sharp features while suppressing artifacts. Experimental results across multiple benchmarks confirm that the proposed method outperforms existing approaches in both monocular and stereo settings. The framework’s robustness to content mismatches and ability to reconstruct missing details in defocused regions significantly expands the practical applicability of multi-focus fusion technology. Current limitations include computational demands from Transformer components and reduced effectiveness in extreme cases of content mismatch (e.g., full occlusion or severe motion blur). Future work will focus on 1) developing lightweight variants for real-time applications, 2) enhancing adaptability to extreme scene variations, and 3) exploring applications in multi-modal fusion and dynamic scene reconstruction. Overall, the proposed method offers a flexible and high-performance solution for advancing image fusion technologies in both academic and industrial applications.

    Sicheng Jiang, Xiaodong Chen, Huaiyu Cai, Yi Wang. Attention-Guided Matching for Stereo Multi-Focus Image Fusion[J]. Acta Optica Sinica, 2025, 45(15): 1510002

    Paper Information

    Category: Image Processing

    Received: Apr. 3, 2025

    Accepted: Apr. 27, 2025

    Published Online: Aug. 15, 2025

    The Author Email: Xiaodong Chen (xdchen@tju.edu.cn)

    DOI: 10.3788/AOS250836

    CSTR: 32393.14.AOS250836
