Significance Depth estimation, as a foundational task in computer vision, plays a critical role in enabling technologies such as autonomous driving, augmented/virtual reality (AR/VR), and robotic navigation. By recovering 3D geometric information from 2D images, it bridges the gap between visual perception and spatial understanding, serving as the backbone for immersive 3D displays and intelligent systems. Monocular and stereo depth estimation represent two complementary paradigms: the former leverages semantic reasoning and data-driven learning to infer depth from single images, while the latter relies on geometric constraints from stereo pairs to achieve absolute depth recovery. The synergy between these approaches has driven advancements in 3D scene reconstruction, dynamic object interaction, and real-time rendering, making depth estimation indispensable for applications ranging from holographic projection to digital twins.
Progress The evolution of monocular depth estimation has been marked by a shift from heuristic geometric priors to deep learning architectures. Early methods, such as linear perspective cues and structure-from-motion (SfM), relied on handcrafted features like vanishing points and sparse 3D point clouds but struggled with textureless regions and dynamic scenes. The introduction of convolutional neural networks (CNNs) revolutionized the field, enabling end-to-end frameworks such as Eigen et al.’s coarse-to-fine multi-scale network and DORN’s ordinal regression model to predict dense depth maps with substantially improved accuracy. Vision Transformers (ViTs), exemplified by DPT, further enhanced global context modeling through self-attention mechanisms, improving edge preservation and long-range consistency. Self-supervised paradigms, such as Monodepth2 and ManyDepth, bypassed the need for labeled data by exploiting photometric consistency across stereo pairs or video sequences, while hybrid classification-regression models such as AdaBins refined outputs through adaptive discretization of the depth range with per-image bin centers.
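To make the self-supervised objective concrete, the following is a minimal sketch of the SSIM-plus-L1 photometric loss that Monodepth2-style methods minimize between a target frame and a view synthesized from a neighboring frame. It assumes PyTorch; the function names are illustrative rather than taken from any specific codebase, and the view-synthesis (warping) step is omitted.

```python
# Sketch of a per-pixel photometric consistency loss for self-supervised depth
# training (PyTorch assumed). The warping step that produces `reconstructed` from a
# source frame, predicted depth, and relative pose is omitted here.
import torch
import torch.nn.functional as F

def dssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Structural dissimilarity (1 - SSIM) / 2, computed with 3x3 average pooling."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim) / 2, 0, 1)

def photometric_loss(reconstructed, target, alpha=0.85):
    """Weighted DSSIM + L1 penalty between the warped source view and the target frame."""
    l1 = torch.abs(reconstructed - target).mean(1, keepdim=True)
    return alpha * dssim(reconstructed, target).mean(1, keepdim=True) + (1 - alpha) * l1
```

In full pipelines this per-pixel map is typically combined with a per-source-view minimum to handle occlusion and an edge-aware smoothness term on the predicted disparity.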
Stereo depth estimation, rooted in the principles of binocular vision, has similarly transitioned from classical algorithms to deep learning frameworks. Traditional methods such as semi-global matching (SGM) and its block-matching variant SGBM approximated global energy minimization by aggregating matching costs along multiple scanline directions, but faced challenges in occluded or low-texture areas. Deep stereo networks, such as PSMNet and GC-Net, redefined the field through cost volume construction and 3D convolutions, enabling sub-pixel disparity estimation and robustness to illumination variations. Transformer-based architectures like STTR introduced cross-attention mechanisms to resolve ambiguities in repetitive textures, while lightweight designs (e.g., StereoNet) optimized real-time performance for embedded systems. Datasets such as KITTI, SceneFlow, and ApolloScape have been instrumental in benchmarking progress, with recent efforts like DA-2K and MiDaS addressing cross-domain generalization through sparse annotations and multi-dataset training, respectively.
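For reference, the classical pipeline described above can be sketched in a few lines with OpenCV's semi-global block matcher, followed by the standard triangulation depth = f * B / d. The image paths, focal length, and baseline below are placeholder assumptions (the calibration numbers are merely KITTI-like), not values prescribed by any particular dataset.

```python
# Classical disparity estimation with OpenCV's semi-global block matcher (SGBM),
# followed by the standard depth conversion Z = f * B / d.
# File paths, focal length, and baseline are placeholder values.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left image
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right image

block = 5
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=block,
    P1=8 * block * block,        # smoothness penalty for +/-1 disparity changes
    P2=32 * block * block,       # smoothness penalty for larger disparity jumps
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# SGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Triangulate depth from disparity: Z = f * B / d (f in pixels, B in meters).
focal_px, baseline_m = 718.856, 0.54   # KITTI-like calibration, assumed for illustration
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_px * baseline_m / disparity[valid]
```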
In 3D display applications, monocular methods excel in mobile AR/VR devices by enabling real-time spatial reconstruction and occlusion-aware rendering. For instance, Apple Vision Pro leverages monocular depth estimation to dynamically align virtual objects with physical environments, while diffusion models like Marigold enhance depth continuity in weak-texture regions for photorealistic volumetric displays. Stereo techniques, on the other hand, underpin high-precision holography and industrial digital twins, where millimeter-level accuracy is critical. Innovations such as neural radiance fields (NeRF) and multi-teacher distillation frameworks (e.g., Distill Any Depth) further bridge the gap between geometric reconstruction and semantic understanding, enabling dynamic light field synthesis and resource-efficient deployment.
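As a minimal illustration of how a dense depth map feeds such spatial reconstruction and occlusion-aware rendering, the sketch below back-projects metric depth into a camera-frame point cloud with the pinhole model; the intrinsics are illustrative placeholders rather than parameters of any particular device.

```python
# Back-project a metric depth map into camera-frame 3D points with the pinhole model:
# X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth(u, v).
# Intrinsics below are illustrative placeholders.
import numpy as np

def depth_to_points(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Return an (H*W, 3) array of 3D points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example with a synthetic 480x640 depth map (2 m everywhere) and VGA-scale intrinsics.
points = depth_to_points(np.full((480, 640), 2.0, dtype=np.float32),
                         fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```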
Conclusions and Prospects Monocular and stereo depth estimation have evolved into mature yet rapidly advancing fields, each addressing unique challenges and application demands. Monocular methods, with their lightweight deployment and semantic reasoning capabilities, are ideal for edge devices and dynamic environments, while stereo systems provide unmatched geometric precision for mission-critical tasks like autonomous navigation. Future research must prioritize multi-sensor fusion (e.g., LiDAR, IMU) to mitigate monocular scale ambiguity and enhance robustness in occluded scenes. Lightweight architectures and neuromorphic hardware integration will accelerate real-time inference, enabling seamless deployment in consumer electronics and IoT devices. Cross-modal benchmarks and physics-aware rendering techniques are needed to evaluate and improve depth consistency under varying illumination and motion conditions. Emerging paradigms, such as neuro-symbolic computing and event-based vision, promise to unify geometric reconstruction with physical reasoning, paving the way for "what-you-see-is-what-you-get" immersive experiences. As depth estimation transitions from academic research to industrial standardization, collaborative efforts across disciplines will be essential to harness its full potential in shaping the future of 3D visualization and intelligent perception.