Acta Optica Sinica, Volume 43, Issue 14, 1415001 (2023)

Monocular Depth Estimation Method Based on Plane Coefficient Representation with Adaptive Depth Distribution

Jiajun Wang, Yue Liu*, Yuhui Wu, Hao Sha, and Yongtian Wang
Author Affiliations
  • Beijing Engineering Research Center of Mixed Reality and Advanced Display, School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China

    Objective

    Obtaining scene depth is crucial for 3D reconstruction, autonomous driving, and related tasks. Current methods based on lidar or time-of-flight (ToF) cameras are not widely applicable due to their high cost. In contrast, inferring scene depth from a single RGB image is far more cost-effective and therefore has broader application potential. Inspired by the recent success of deep learning on various ill-posed problems, many researchers adopt convolutional neural networks to estimate reasonable and accurate monocular depth. However, most existing deep-learning studies focus on enhancing the feature extraction capability of the network and pay little attention to the distribution of image depths. Estimating the depth distribution of each image can not only improve inference precision but also make the reconstructed 3D scenes more consistent with the ground truth. We therefore propose a new adaptive depth distribution module, which allows the model to predict a different depth distribution for each image during training.

    Methods

    The NYU Depth-v2 dataset created by New York University is employed. Overall, our model is built on an encoder-decoder structure with skip connections, which has proven effective in guiding image generation. An indirect representation of depth maps based on plane coefficients is introduced to implicitly impose a planarity constraint on depth estimation and to obtain smoother depth predictions in the planar regions of the scene. Specifically, two sub-networks with different lightweight designs are adopted at the bottleneck and at the upsampling stages of the network to enhance the model's feature extraction capability. In addition, an adaptive depth distribution estimation module is designed to estimate a different depth distribution for each input image, which brings the pixel distribution of the predicted depth maps closer to the ground truth. A two-stage training strategy is employed: in the first stage, we load ImageNet-pretrained weights into the backbone and optimize the model with a loss function at the 2D level only; in the second stage, we perform joint training with loss functions at both the 2D and 3D levels.
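    The exact plane-coefficient parameterization is not spelled out in this abstract; a common choice is to let the network predict per-pixel coefficients (p, q, r) of the plane equation p*X + q*Y + r*Z = 1 in camera coordinates, so that depth is recovered in closed form from the camera intrinsics. The sketch below illustrates this recovery step under that assumption; the function name and tensor layout are illustrative, not the authors' code.

```python
import torch

def depth_from_plane_coeffs(coeffs, intrinsics, eps=1e-6):
    """Recover depth from per-pixel plane coefficients (a sketch).

    Assumes each pixel predicts (p, q, r) of the plane equation
    p*X + q*Y + r*Z = 1 in camera coordinates, so that
    Z = 1 / (p*x + q*y + r), with (x, y) the normalized image
    coordinates. The paper's exact parameterization may differ.

    coeffs:     (B, 3, H, W) predicted coefficients.
    intrinsics: tuple (fx, fy, cx, cy).
    """
    B, _, H, W = coeffs.shape
    fx, fy, cx, cy = intrinsics
    v, u = torch.meshgrid(
        torch.arange(H, dtype=coeffs.dtype, device=coeffs.device),
        torch.arange(W, dtype=coeffs.dtype, device=coeffs.device),
        indexing="ij",
    )
    x = (u - cx) / fx                  # normalized image coordinates
    y = (v - cy) / fy
    p, q, r = coeffs[:, 0], coeffs[:, 1], coeffs[:, 2]
    denom = p * x + q * y + r          # n^T K^{-1} (u, v, 1)
    # clamp assumes valid depths are positive and bounded away from zero
    return 1.0 / denom.clamp(min=eps)  # (B, H, W) depth map
```

    Because all pixels lying on the same physical plane share one coefficient vector, supervising (p, q, r) implicitly encourages smooth, planar depth in those regions, which matches the smoother planar predictions reported in the Results.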
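    Similarly, one plausible realization of the adaptive depth distribution module, in the spirit of adaptive-binning methods such as AdaBins, is to predict image-specific depth-bin widths from globally pooled features and express each pixel's depth as a probability-weighted sum of the resulting bin centers. The module below is a minimal sketch under that assumption, not the authors' implementation; all names and hyperparameters (n_bins, d_min, d_max) are ours.

```python
import torch
import torch.nn as nn

class AdaptiveDepthDistribution(nn.Module):
    """Sketch of a per-image adaptive depth distribution head."""

    def __init__(self, feat_dim, n_bins=64, d_min=1e-3, d_max=10.0):
        super().__init__()
        self.d_min, self.d_max = d_min, d_max
        self.bin_mlp = nn.Sequential(       # per-image bin widths
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_bins),
        )
        self.prob_head = nn.Conv2d(feat_dim, n_bins, kernel_size=1)

    def forward(self, feats):
        # feats: (B, C, H, W) decoder features
        g = feats.mean(dim=(2, 3))                       # global pooling
        widths = torch.softmax(self.bin_mlp(g), dim=1)   # (B, n_bins), sums to 1
        edges = torch.cumsum(widths, dim=1)              # normalized bin edges
        edges = torch.cat([torch.zeros_like(edges[:, :1]), edges], dim=1)
        centers = self.d_min + (self.d_max - self.d_min) * \
            0.5 * (edges[:, :-1] + edges[:, 1:])         # (B, n_bins)
        probs = torch.softmax(self.prob_head(feats), dim=1)  # (B, n_bins, H, W)
        depth = (probs * centers[:, :, None, None]).sum(dim=1)
        return depth                                     # (B, H, W)
```

    Because the bin widths depend on the pooled features of the input, each image receives its own depth distribution, which is what allows the predicted histograms to track the ground truth more closely than a fixed, shared binning would.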

    Results and Discussions

    Our study employs multiple metrics, including root mean square error (RMSE), relative error (REL), and intersection over union (IoU), to quantitatively evaluate the inference ability of the proposed model. As shown in Table 1, the proposed lightweight network outperforms most of the listed methods with only 46 M parameters, which shows that the overall structure of the model is concise and effective. The visual comparisons of 3D depth reconstruction (Fig. 5) demonstrate that the proposed network outputs smoother and more continuous depth predictions in planar regions, as well as reasonable predictions in partially occluded or missing areas of those regions. In terms of depth distribution, the carefully designed adaptive depth distribution module makes the predicted distribution fit the ground-truth curve more closely and achieves a higher IoU than other methods (Fig. 6 and Table 3), indicating the effectiveness of the proposed module. Additionally, the lightweight network balances accuracy and speed in real-time scenarios (Table 2) and yields good inference and reconstruction results. However, the proposed network has some limitations in recovering fine details of the depth predictions (Fig. 7); how to design the network to recover more depth details while preserving real-time prediction performance will be the focus of our future work.
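    For reference, RMSE and REL are standard depth metrics, while the IoU between predicted and ground-truth depth distributions can be computed as a histogram overlap. The snippet below is one plausible reading of these metrics; this abstract does not give the exact IoU definition used in the paper.

```python
import numpy as np

def eval_depth(pred, gt, n_bins=64, d_max=10.0):
    """RMSE, REL, and a histogram-overlap IoU of depth distributions.

    pred, gt: (H, W) depth maps in meters; pixels with gt == 0 are invalid.
    The IoU variant here (intersection over union of normalized depth
    histograms) is an assumption; the paper may define it differently.
    """
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((p - g) ** 2))   # root mean square error
    rel = np.mean(np.abs(p - g) / g)        # mean absolute relative error
    bins = np.linspace(0.0, d_max, n_bins + 1)
    hp, _ = np.histogram(p, bins=bins)
    hg, _ = np.histogram(g, bins=bins)
    hp = hp / max(hp.sum(), 1)              # normalize to distributions
    hg = hg / max(hg.sum(), 1)
    iou = np.minimum(hp, hg).sum() / np.maximum(hp, hg).sum()
    return {"RMSE": rmse, "REL": rel, "IoU": iou}
```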

    Conclusions

    An innovative model based on plane coefficient representation with adaptive depth distribution is presented for monocular depth estimation. Qualitative and quantitative results on the NYU Depth-v2 dataset and multiple comparative experiments demonstrate that the proposed method obtains reasonable predictions for planar regions in images with partial occlusions or small viewing angles. Additionally, the proposed depth distribution prediction module provides differentiated pixel-distribution optimization for each image, which brings the model's predicted pixel depth distributions closer to those of the real images. With its lightweight design, the method balances inference speed and accuracy and is well suited to practical scenarios that require real-time accuracy, such as indoor virtual reality and human-computer interaction.

    Citation

    Jiajun Wang, Yue Liu, Yuhui Wu, Hao Sha, Yongtian Wang. Monocular Depth Estimation Method Based on Plane Coefficient Representation with Adaptive Depth Distribution[J]. Acta Optica Sinica, 2023, 43(14): 1415001

    Paper Information

    Category: Machine Vision

    Received: Jan. 12, 2023

    Accepted: Mar. 20, 2023

    Published Online: Jul. 13, 2023

    The Author Email: Liu Yue (liuyue@bit.edu.cn)

    DOI: 10.3788/AOS230468
