Recent progress in research and applications of monocular and binocular depth estimation (invited)

Haiyang HU; Chaoping CHEN; Tianmu GAO; Baoen HAN; Yunfan YANG; Yi LIU; Xiaojun WU

doi:10.3788/IRLA20250157

Infrared and Laser Engineering, Vol. 54, Issue 7, 20250157 (2025)

State Key Laboratory of Avionics Integration and Aviation System-of-Systems Synthesis, Shanghai Jiao Tong University, Shanghai 200240, China

References(56)

[1] LECUN Y, BOTTOU L, BENGIO Y et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 86, 2278-2324(1998).

[2] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 60, 84-90(2017).

[3] VASWANI A, SHAZEER N, PARMAR N et al. Attention is all you need[J]. arXiv, 1706.03762cs, 2017.

[4] DOSOVITSKIY A, BEYER L, KOLESNIKOV A et al. An image is worth 16×16 words: Transformers for image recognition at scale[J]. arXiv, 2010.11929cs, 2020.

[5] CORDONNIER J-B, LOUKAS A, JAGGI M. On the relationship between self-attention and convolutional layers[J]. arXiv, 1911.03584cs, 2019.

[6] ARAMPATZAKIS V, PAVLIDIS G, MITIANOUDIS N et al. Monocular depth estimation: A thorough review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 2396-2414(2023).

[7] KHAN F, SALAHUDDIN S, JAVIDNIA H. Deep learning-based monocular depth estimation methods—a state-of-the-art review[J]. Sensors, 20, 2272(2020).

[8] SAXENA A, CHUNG S, NG A. Learning depth from single monocular images[J]. Advances in Neural Information Processing Systems, 2005: 1161-1168.

[9] NARASIMHAN S G, NAYAR S K. Contrast restoration of weather degraded images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 713-724(2003).

[10] SNAVELY N, SEITZ S M, SZELISKI R. Photo tourism: Exploring photo collections in 3D[J]. ACM Transactions on Graphics, 25, 835-846(2006).

[11] ZHANG R, TSAI P-S, CRYER J E et al. Shape-from-shading: A survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 690-706(1999).

[12] KARSCH K, LIU C, KANG S B. Depth transfer: Depth extraction from video using non-parametric sampling[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2144-2158(2014).

[13] EIGEN D, PUHRSCH C, FERGUS R. Depth map prediction from a single image using a multi-scale deep network[J]. Advances in Neural Information Processing Systems, 2014, 27.

[14] LAINA I, RUPPRECHT C, BELAGIANNIS V, et al. Deeper depth prediction with fully convolutional residual networks[C]//Proceedings of the 2016 Fourth International Conference on 3D Vision, 2016: 239-248.

[15] FU H, GONG M, WANG C, et al. Deep ordinal regression network for monocular depth estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 2002-2011.

[16] HATAMIZADEH A, KAUTZ J. Mambavision: A hybrid mamba-transformer vision backbone[J]. arXiv, 2407.08083cs, 2024.

[17] LIU Z, LIN Y, CAO Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 10012-10022.

[18] LIU Y, ZHANG Y, WANG Y et al. A survey of visual transformers[J]. arXiv, 2111.06091cs, 2023.

[19] RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 12179-12188.

[20] LI B, DAI Y, HE M. Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference[J]. Pattern Recognition, 83, 328-339(2018).

[21] GARG R, BG V K, CARNEIRO G, et al. Unsupervised cnn for single view depth estimation: Geometry to the rescue[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, 2016: 740-756.

[22] GODARD C, MAC AODHA O, BROSTOW G J. Unsupervised monocular depth estimation with left-right consistency[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 270-279.

[23] GODARD C, MAC AODHA O, FIRMAN M, et al. Digging into self-supervised monocular depth estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 3828-3838.

[24] ZHOU T, BROWN M, SNAVELY N, et al. Unsupervised learning of depth and ego-motion from video[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1851-1858.

[25] WATSON J, MAC AODHA O, PRISACARIU V, et al. The temporal opportunist: Self-supervised multi-frame monocular depth[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 1164-1174.

[26] LEE J H, HAN M-K, KO D W et al. From big to small: Multi-scale local planar guidance for monocular depth estimation[J]. arXiv, 1907.10326cs, 2019.

[27] CHEN L-C, PAPANDREOU G, KOKKINOS I et al. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 834-848(2017).

[28] BHAT S F, ALHASHIM I, WONKA P. Adabins: Depth estimation using adaptive bins[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 4009-4018.

[29] BHAT S F, BIRKL R, WOFK D et al. Zoedepth: Zero-shot transfer by combining relative and metric depth[J]. arXiv, 2302.12288cs, 2023.

[30] SILBERMAN N, HOIEM D, KOHLI P, et al. Indoor segmentation and support inference from rgbd images[C]//Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, 2012: 746-760.

[31] GEIGER A, LENZ P, URTASUN R. Are we ready for autonomous driving? the kitti vision benchmark suite[C]//Proceedings of the Conference on Computer Vision and Pattern Recognition, 2012: 3354-3361.

[32] GEIGER A, LENZ P, STILLER C et al. Vision meets robotics: The kitti dataset[J]. The International Journal of Robotics Research, 32, 1231-1237(2013).

[33] GAIDON A, WANG Q, CABON Y, et al. Virtual worlds as proxy for multi-object tracking analysis[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4340-4349.

[34] YANG L, KANG B, HUANG Z, et al. Depth anything: Unleashing the power of large-scale unlabeled data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 10371-10381.

[35] RANFTL R, LASINGER K, HAFNER D et al. A new dataset for monocular depth estimation under viewpoint shifts[J]. arXiv, 2409.17851cs, 2024.

[36] RANFTL R, LASINGER K, HAFNER D et al. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 1623-1637(2020).

[37] CHOI S, GOPAKUMAR M, PENG Y et al. Neural 3D holography: Learning accurate wave propagation models for 3D holographic virtual and augmented reality displays[J]. ACM Transactions on Graphics, 40, 1-12(2021).

[38] HU M, YIN W, ZHANG C et al. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation[J]. ACM Transactions on Graphics, 46, 10579-10596(2024).

[39] KE B, OBUKHOV A, HUANG S, et al. Repurposing diffusion-based image generators for monocular depth estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024: 9492-9502.

[40] HE X, GUO D, LI H et al. Distill any depth: Distillation creates a stronger monocular depth estimator[J]. arXiv, 2502.19204cs, 2025.

[41] LAGA H, JOSPIN L V, BOUSSAID F et al. A survey on deep learning techniques for stereo-based depth estimation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 1738-1764(2020).

[42] GHOSH S, GALLEGO G. Event-based stereo depth estimation: A survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-20(2025).

[43] HIRSCHMULLER H. Stereo processing by semiglobal matching and mutual information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 328-341(2007).

[44] BARRON J T, ADAMS A, SHIH Y, et al. Fast bilateral-space stereo for synthetic defocus[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 4466-4474.

[45] MAYER N, ILG E, HAUSSER P, et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4040-4048.

[46] CHANG J-R, CHEN Y-S. Pyramid stereo matching network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 5410-5418.

[47] KENDALL A, MARTIROSYAN H, DASGUPTA S, et al. End-to-end learning of geometry and context for deep stereo regression[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017: 66-75.

[48] LI Z, LIU X, DRENKOW N, et al. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 6197-6206.

[49] WILSON B, QI W, AGARWAL T et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting[J]. arXiv, 2301.00493cs, 2023.

[50] HUANG X, CHENG X, GENG Q, et al. The apolloscape dataset for autonomous driving[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018: 954-960.

[51] MAYER N, ILG E, HAUSSER P, et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 4040-4048.

[52] SCHARSTEIN D, SZELISKI R. High-accuracy stereo depth maps using structured light[C]//2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.

[53] KHAMIS S, FANELLO S, RHEMANN C, et al. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction[C]//Proceedings of the European Conference on Computer Vision, 2018: 573-590.

[54] PaddlePaddle. PaddleDepth: A Toolkit for Depth Information Argumentation[CP/OL]. GitHub, 2023(2023-05-03)[2025-07-18]. https://github.com/PaddlePaddle/PaddleDepth.

[55] CHEN C P, WU X, WANG J et al. Penta-channel waveguide-based near-eye display with two-dimensional pupil expansion[J]. Displays, 88, 102999(2025).

[56] GAO T, ZOU D, CHEN C P, et al. Online lane mapping based on multi-sensor SLAM and Catmull-Rom splines[J]. Measurement Science and Technology, 2025, 36: 026318.

Haiyang HU, Chaoping CHEN, Tianmu GAO, Baoen HAN, Yunfan YANG, Yi LIU, Xiaojun WU. Recent progress in research and applications of monocular and binocular depth estimation (invited)[J]. Infrared and Laser Engineering, 2025, 54(7): 20250157

Paper Information

Category: Special issue—Advanced display technology and applications

Received: Mar. 10, 2025

Published Online: Aug. 29, 2025

DOI:10.3788/IRLA20250157
