Acta Optica Sinica, Volume. 43, Issue 21, 2115003(2023)

Depth Estimation Method of Light Field Based on Attention Mechanism of Neighborhood Pixel

Xi Lin, Yang Guo, Yongqiang Zhao*, and Naifu Yao
Author Affiliations
  • School of Automation, Northwestern Polytechnical University, Xi'an 710129, Shaanxi, China

    Objective

    Accurate acquisition of depth information has long been a research hotspot in computer vision. Traditional cameras capture only light intensity over a given exposure, losing information such as the incident light angle that is helpful for depth estimation. The emergence of light field cameras provides a new solution for depth estimation: unlike traditional cameras, they capture four-dimensional light field information, and micro-lens array light field cameras also avoid the large size and poor portability of camera arrays. Employing light field cameras to estimate scene depth therefore has broad research prospects. However, existing research suffers from inaccurate depth estimation, high computational complexity, and occlusions in multi-view scenarios. Occlusion has always been a challenge in light field depth estimation. For scenes without occlusion, most existing methods yield good depth estimation results, but this requires the pixels to satisfy the color consistency principle. When occluded pixels exist in the scene, this principle no longer holds across views, the accuracy of the estimated depth map drops significantly, and errors concentrate in occluded areas and at edges. Thus, we propose a light field depth estimation method based on a neighborhood-pixel attention mechanism. By exploiting the high correlation between the depth of a pixel and its neighboring pixels in the sub-aperture images, the network's depth estimation performance on light field images is improved.

    Methods

    First, after analyzing the characteristics of the sub-aperture image sequence, we exploit the correlation between the depth of a pixel in the light field image and a limited neighborhood of surrounding pixels to propose a neighborhood-pixel attention mechanism, termed Mix Attention. By combining spatial and channel attention, this mechanism efficiently models the relationship between feature maps and depth, improving the accuracy of light field depth estimation and giving the network a degree of robustness to occlusion. Next, based on Mix Attention, a sequential image feature extraction module is proposed. It employs three-dimensional convolutions to encode the spatial and angular information contained in the sub-aperture image sequence into feature maps and uses Mix Attention to adjust their weights, enhancing the representation power of the network by incorporating both spatial and angular information. Finally, a multi-branch depth estimation network is proposed that takes a subset of the sub-aperture images of the light field as input and achieves fast end-to-end depth estimation for light field images of arbitrary input size. This network leverages the proposed attention mechanism and the sequential image feature extraction module to estimate depth from the light field image effectively. Overall, we propose a novel light field depth estimation approach: by leveraging the correlation between neighboring pixels and incorporating attention mechanisms, it improves depth estimation accuracy and strengthens the network's ability to handle occlusion, enabling efficient and robust depth estimation for light field images.
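    The paper itself specifies the exact Mix Attention formulation; purely as an illustration of the idea of combining channel and spatial gating over a feature map, a minimal numpy sketch follows. The global-average pooling and sigmoid choices here are assumptions for illustration, not the authors' design:

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W). Pool each channel to a scalar, gate the channel with it.
    weights = _sigmoid(feat.mean(axis=(1, 2)))        # (C,)
    return feat * weights[:, None, None]

def spatial_attention(feat):
    # Pool across channels to a per-pixel response, gate each spatial location.
    weights = _sigmoid(feat.mean(axis=0))             # (H, W)
    return feat * weights[None, :, :]

def mix_attention(feat):
    # Hypothetical combination: channel gating followed by spatial gating, so
    # each activation is reweighted by both its channel and its neighborhood.
    return spatial_attention(channel_attention(feat))

feat = np.random.randn(8, 4, 4)   # toy feature map: 8 channels, 4x4 spatial
out = mix_attention(feat)
```

Because both gates lie in (0, 1), the output preserves the feature-map shape while attenuating activations according to their channel-wise and spatial context.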

    Results and Discussions

    For quantitative analysis, mean square error (MSE) and bad pixel rate are chosen as evaluation metrics. The proposed method performs stably, with an average bad pixel rate of 3.091% and an average MSE of 1.126 (Tables 1 and 2), and in most scenes it achieves the best or second-best depth estimation results (marked in bold and underlined, respectively, in the tables). The effectiveness of the proposed Mix Attention mechanism is further demonstrated by ablation experiments (Table 3). Qualitative analysis (Figs. 7 and 8) shows that the proposed method is robust in depth-discontinuous regions (the hanging lamp in the Sideboard scene), accurate in texture-rich and depth-continuous regions (the Cotton and Pyramids scenes), less error-prone in reflective areas (the shoes on the floor in the Sideboard scene), and smooth at depth edges (the edges in the Backgammon scene). Overall, the proposed method yields more desirable disparity estimates, and the experimental results indicate that its performance surpasses that of the other algorithms under the selected evaluation metrics.
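    The two metrics are straightforward to state in code. As a sketch (the exact scaling convention used in the tables is an assumption here; HCI-benchmark results commonly report MSE multiplied by 100, and bad pixel rate as the percentage of pixels whose disparity error exceeds a threshold such as 0.07):

```python
import numpy as np

def mse(pred, gt):
    # Mean squared disparity error, scaled by 100 (assumed HCI convention).
    return float(np.mean((pred - gt) ** 2) * 100)

def bad_pixel_rate(pred, gt, threshold=0.07):
    # Percentage of pixels whose absolute disparity error exceeds `threshold`.
    return float(np.mean(np.abs(pred - gt) > threshold) * 100)

gt = np.zeros((4, 4))             # toy ground-truth disparity map
pred = gt.copy()
pred[0, 0] = 0.1                  # one pixel with error above 0.07
print(bad_pixel_rate(pred, gt))   # 1 of 16 pixels is bad -> 6.25
```

Lower values are better for both metrics, which is why the averages of 3.091% (bad pixel rate) and 1.126 (MSE) are compared against the other algorithms per scene.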

    Conclusions

    Motivated by the characteristics of the light field depth estimation task and of light field data, we propose a neighborhood-pixel attention mechanism called Mix Attention. This mechanism captures the correlation between a pixel and its limited neighborhood in the light field and depth features: by computing attention over the neighborhood feature maps, different feature maps in the network are selectively attended to, improving the utilization efficiency of light field images. Additionally, by analyzing the pixel displacement between different sub-aperture images of the light field, we introduce a fast end-to-end multi-stream light field depth estimation network that employs three-dimensional convolutional kernels to extract sequential image features. Tests on the New HCI light field dataset demonstrate that the proposed network outperforms existing methods in three performance metrics: bad pixel rate at the 0.07 threshold, MSE, and computation time. It effectively improves depth prediction and remains robust in occluded scenes such as Boxes. Ablation experiments show that the proposed mechanism fully exploits the correlation between neighboring pixels across channels, improving the depth prediction performance of the network. However, the performance of the proposed method is unsatisfactory in regions lacking texture. In future work, we will explore techniques such as spatial pyramids to strengthen multi-scale feature extraction, smooth the depth results in textureless regions, and further improve depth estimation reliability.
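    The reason a three-dimensional kernel suits the sub-aperture sequence is that stacking the views adds an angular axis, so a single kernel sees spatial and angular context together. A naive numpy sketch of a valid 3D convolution over such a stack (the view count and kernel size are hypothetical, and real networks would use an optimized implementation such as a deep learning framework's 3D convolution):

```python
import numpy as np

def conv3d_valid(volume, kernel):
    # Naive valid-mode 3D cross-correlation over a (views, H, W) stack
    # of sub-aperture images; the first axis is the angular dimension.
    s, h, w = volume.shape
    ks, kh, kw = kernel.shape
    out = np.zeros((s - ks + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+ks, j:j+kh, k:k+kw] * kernel)
    return out

# Stack 7 hypothetical sub-aperture views of a 16x16 scene; a 3x3x3 kernel
# then mixes 3 neighboring views with a 3x3 spatial neighborhood at once.
stack = np.random.randn(7, 16, 16)
feat = conv3d_valid(stack, np.ones((3, 3, 3)) / 27)
```

Because the kernel spans neighboring views, the per-view pixel displacement (disparity) directly shapes the response, which is what the sequential image feature extraction module exploits.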

    Xi Lin, Yang Guo, Yongqiang Zhao, Naifu Yao. Depth Estimation Method of Light Field Based on Attention Mechanism of Neighborhood Pixel[J]. Acta Optica Sinica, 2023, 43(21): 2115003

    Paper Information

    Category: Machine Vision

    Received: Apr. 7, 2023

    Accepted: Jun. 26, 2023

    Published Online: Nov. 8, 2023

    The Author Email: Zhao Yongqiang (zhaoyq@nwpu.edu.cn)

    DOI:10.3788/AOS230786
