Significance Depth estimation, as a foundational task in computer vision, plays a critical role in enabling technologies such as autonomous driving, augmented/virtual reality (AR/VR), and robotic navigation. By recovering 3D geometric information from 2D images, it bridges the gap between visual perception and spatial understanding, serving as the backbone for immersive 3D displays and intelligent systems. Monocular and stereo depth estimation represent two complementary paradigms: the former leverages semantic reasoning and data-driven learning to infer depth from single images, while the latter relies on geometric constraints from stereo pairs to achieve absolute depth recovery. The synergy between these approaches has driven advancements in 3D scene reconstruction, dynamic object interaction, and real-time rendering, making depth estimation indispensable for applications ranging from holographic projection to digital twins.
Progress The evolution of monocular depth estimation has been marked by a shift from heuristic geometric priors to deep learning architectures. Early methods, such as linear perspective cues and structure-from-motion (SfM), relied on handcrafted features like vanishing points and sparse 3D point clouds but struggled with textureless regions and dynamic scenes. The introduction of convolutional neural networks (CNNs) revolutionized the field, enabling end-to-end frameworks such as Eigen et al.’s coarse-to-fine multi-scale network and DORN’s ordinal regression model to predict dense depth maps with substantially improved accuracy. Vision Transformers (ViTs), exemplified by DPT, further enhanced global context modeling through self-attention mechanisms, improving edge preservation and long-range consistency. Self-supervised paradigms, such as Monodepth2 and ManyDepth, bypassed the need for labeled data by exploiting photometric consistency across stereo pairs or video sequences, while hybrid classification-regression models such as AdaBins refined outputs through adaptive discretization of the depth range with per-image bin centers.
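To make the self-supervised objective concrete, the following is a minimal sketch of the SSIM-plus-L1 photometric loss that Monodepth2-style methods minimize between a target frame and a view synthesized from a neighboring frame. It assumes PyTorch; the function names are illustrative rather than taken from any specific codebase, and the view-synthesis (warping) step is omitted.

```python
# Sketch of a per-pixel photometric consistency loss for self-supervised depth
# training (PyTorch assumed). The warping step that produces `reconstructed` from a
# source frame, predicted depth, and relative pose is omitted here.
import torch
import torch.nn.functional as F

def dssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural dissimilarity (1 - SSIM) / 2, computed with 3x3 average pooling."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim) / 2, 0, 1)

def photometric_loss(reconstructed, target, alpha=0.85):
    """Weighted DSSIM + L1 penalty between the warped source view and the target frame."""
    l1 = torch.abs(reconstructed - target).mean(1, keepdim=True)
    return alpha * dssim(reconstructed, target).mean(1, keepdim=True) + (1 - alpha) * l1
```

In full pipelines this per-pixel map is typically combined with a per-source-view minimum to handle occlusion and an edge-aware smoothness term on the predicted disparity.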
Stereo depth estimation, rooted in the principles of binocular vision, has similarly transitioned from classical algorithms to deep learning frameworks. Traditional methods such as semi-global matching (SGM) and its block-matching variant SGBM approximated global energy minimization by aggregating matching costs along multiple scanline directions, but faced challenges in occluded or low-texture areas. Deep stereo networks, such as PSMNet and GC-Net, redefined the field through cost volume construction and 3D convolutions, enabling sub-pixel disparity estimation and robustness to illumination variations. Transformer-based architectures like STTR introduced cross-attention mechanisms to resolve ambiguities in repetitive textures, while lightweight designs (e.g., StereoNet) optimized real-time performance for embedded systems. Datasets such as KITTI, SceneFlow, and ApolloScape have been instrumental in benchmarking progress, with recent efforts like DA-2K and MiDaS addressing cross-domain generalization through sparse annotations and multi-dataset training, respectively.
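For reference, the classical pipeline described above can be sketched in a few lines with OpenCV's semi-global block matcher, followed by the standard triangulation depth = f * B / d. The image paths, focal length, and baseline below are placeholder assumptions (the calibration numbers are merely KITTI-like), not values prescribed by any particular dataset.

```python
# Classical disparity estimation with OpenCV's semi-global block matcher (SGBM),
# followed by the standard depth conversion Z = f * B / d.
# File paths, focal length, and baseline are placeholder values.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left image
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right image

block = 5
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=block,
    P1=8 * block * block,        # smoothness penalty for +/-1 disparity changes
    P2=32 * block * block,       # smoothness penalty for larger disparity jumps
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# SGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Triangulate depth from disparity: Z = f * B / d (f in pixels, B in meters).
focal_px, baseline_m = 718.856, 0.54   # KITTI-like calibration, assumed for illustration
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_px * baseline_m / disparity[valid]
```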
In 3D display applications, monocular methods excel in mobile AR/VR devices by enabling real-time spatial reconstruction and occlusion-aware rendering. For instance, Apple Vision Pro leverages monocular depth estimation to dynamically align virtual objects with physical environments, while diffusion models like Marigold enhance depth continuity in weak-texture regions for photorealistic volumetric displays. Stereo techniques, on the other hand, underpin high-precision holography and industrial digital twins, where millimeter-level accuracy is critical. Innovations such as neural radiance fields (NeRF) and multi-teacher distillation frameworks (e.g., Distill Any Depth) further bridge the gap between geometric reconstruction and semantic understanding, enabling dynamic light field synthesis and resource-efficient deployment.
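As a minimal illustration of how a dense depth map feeds such spatial reconstruction and occlusion-aware rendering, the sketch below back-projects metric depth into a camera-frame point cloud with the pinhole model; the intrinsics are illustrative placeholders rather than parameters of any particular device.

```python
# Back-project a metric depth map into camera-frame 3D points with the pinhole model:
# X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth(u, v).
# Intrinsics below are illustrative placeholders.
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Return an (H*W, 3) array of 3D points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example with a synthetic 480x640 depth map (2 m everywhere) and VGA-scale intrinsics.
points = depth_to_points(np.full((480, 640), 2.0, dtype=np.float32),
                         fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```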
Conclusions and Prospects Monocular and stereo depth estimation have evolved into mature yet rapidly advancing fields, each addressing unique challenges and application demands. Monocular methods, with their lightweight deployment and semantic reasoning capabilities, are ideal for edge devices and dynamic environments, while stereo systems provide unmatched geometric precision for mission-critical tasks like autonomous navigation. Future research must prioritize multi-sensor fusion (e.g., LiDAR, IMU) to mitigate monocular scale ambiguity and enhance robustness in occluded scenes. Lightweight architectures and neuromorphic hardware integration will accelerate real-time inference, enabling seamless deployment in consumer electronics and IoT devices. Cross-modal benchmarks and physics-aware rendering techniques are needed to evaluate and improve depth consistency under varying illumination and motion conditions. Emerging paradigms, such as neuro-symbolic computing and event-based vision, promise to unify geometric reconstruction with physical reasoning, paving the way for "what-you-see-is-what-you-get" immersive experiences. As depth estimation transitions from academic research to industrial standardization, collaborative efforts across disciplines will be essential to harness its full potential in shaping the future of 3D visualization and intelligent perception.