Acta Optica Sinica (Online), Volume 2, Issue 13, 1305001 (2025)

A Survey of Monocular Depth Estimation Methods Based on Deep Learning

Yiquan Wu* and Haobo Xie
Author Affiliations
  • College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China
    Figures & Tables (16)
    • Overall framework of this paper
    • Workflow of monocular depth estimation based on deep learning
    • An example framework of monocular depth estimation based on supervised learning [29]
    • Schematic diagram of geometric relationships between image pairs
    • Table 1. Comparison of several active and passive depth acquisition methods

      Category | Method | Principle | Advantage | Disadvantage
      Active | Time-of-flight (ToF) | Compute distance based on light travel time | Simple principle, unaffected by ambient light | High measurement cost
      Active | Structured light | Estimate depth using phase information of deformed light | Effective for complex surfaces | High measurement cost, sensitive to ambient light
      Passive | Stereo depth estimation | Disparity matching from stereo images | Unrestricted by training data | Struggle in low-texture scenes, poor performance at long distances
      Passive | Multi-view depth estimation | Extract disparity information from multiple images | Provide richer viewpoint information, effectively handle occlusion | High computational complexity, poor real-time performance
      Passive | Monocular depth estimation | Estimate depth from a single image | Require only one camera, low cost, easy to integrate | Rely on learning models, sensitive to training data and scene variations
    • Table 2. Comparison between this work and existing review papers

      Ref. | Year | Supervision classification | Recent work | Method comparison | Quantitative analysis | Application overview
      7 | 2018 | Supervised only | No | No | Yes | No
      8 | 2019 | None | No | No | No | No
      10 | 2019 | Supervised, unsupervised | No | No | Yes | No
      11 | 2020 | Supervised, self-supervised, semi-supervised | No | Yes | Yes | Yes
      9 | 2021 | Supervised, unsupervised | No | Yes | Yes | Yes
      12 | 2022 | None | No | No | Yes | No
      13 | 2022 | Supervised, unsupervised | No | No | Yes | No
      18 | 2022 | Supervised, unsupervised, semi-supervised | No | Yes | Limited | No
      14 | 2022 | Supervised, unsupervised, semi-supervised | No | No | Yes | No
      15 | 2023 | None | Yes | No | Yes | No
      16 | 2024 | Unsupervised only | No | Yes | Yes | Yes
      17 | 2024 | Unsupervised only | Yes | Yes | Yes | No
      19 | 2024 | Supervised, self-supervised, semi-supervised | Yes | Yes | Yes | Yes
      This work | 2025 | Supervised, unsupervised, semi-supervised | Yes | Yes | Yes | Yes
    • Table 3. Evolution of monocular depth estimation methods

      Category | Period | Method | Advantage | Limitation
      Depth cues-based | Before 2009 | Shadows, focus/defocus | Low cost, simple | Inefficient, non-real-time
      Traditional machine learning-based | 2009‒2014 | Parametric/non-parametric | Low computation | Texture-dependent, non-real-time
      Deep learning-based | After 2014 | Supervised, unsupervised, semi-supervised | High accuracy, multi-scene, real-time | High computation, data-hungry
    • Table 4. Monocular depth estimation methods based on supervised learning

      Ref. | Method | Contribution | Limitation
      29 | Multi-scale depth network | Preserve details without superpixels | Scale estimation errors
      31 | ResNet, BerHu loss | Address deep network issues | Limited generalization
      33 | Skip connections | Prevent inter-layer info loss | 
      35 | Laplacian pyramid | Adaptive multi-scale fusion | Lack global consistency
      39 | Transformer extractor | Global receptive field | Weak local connectivity
      40 | Multi-Transformer blocks | Integrate local‒global info, restore details | Miss long-range depth relations
      41 | Transformer + CNN | Enhance local features | High computation
      42 | Cross-distillation | Leverage long-range‒local depth, reduce cost | 
      43 | Depth gradient refinement | Improve depth gradient continuity | 
      46 | CRF superpixel refinement | Integrate color, texture, position | Struggle with complex scenes
      47 | Fully connected CRF | Fuse multiple features | High computation
      48 | Semantic integration | More accurate far-depth estimation | Limited receptive field
      49 | Transformer + CNN | Enhance long-range depth modeling | 
      50 | Scene-aware estimation | Depth at semantic-segment level | Lack structural consistency
      51 | Edge-assisted depth | High-detail maps | Boundary artifacts
      52 | PatchFusion | Merge coarse‒global and fine‒local depth | Need real-time optimization
      54 | Adaptive depth distribution | Lightweight, suited for indoors | 
      55 | BerHu sum loss | Multi-scale fusion | Unstable with outliers
      56 | Hybrid loss | Noise robustness | Focus only on loss optimization
      57 | 3D geometric loss | Faster computation, easy integration | 
      59 | Relative depth | Introduce related dataset | View synthesis artifacts
      60 | Structural ranking loss | Reduce artifacts | Inaccurate absolute depth
      61 | Two-stage relative‒absolute depth | Improve absolute depth | 
      62 | Depth classification | Lower model complexity | Higher uncertainty at greater depth
      63 | Progressive binning | Reduce uncertainty | Lack mathematical reliability
      64 | Discrete classification | Ensure depth probability distribution | Ignore depth ordering
      65 | Constrained ordinal regression | Maintain depth order | Inefficient
      66 | Adaptive depth binning | Eliminate quantization artifacts | Fixed bin count
      69 | LocalBins | Iterative grouping refinement | High complexity
      70 | IEBins | Maintain stable bin counts | Weak detail capture
      71 | CaBins | Improve detail accuracy | 
      73 | GAN-based | First to apply GAN in depth estimation | Poor texture recovery
      74 | UNet + multi-skip | Recover low-dim textures | Lack conditional guidance
      75 | cGAN | Enhance depth estimation and reconstruction | Sensitive to noisy conditions
    • Table 5. Monocular depth estimation methods based on unsupervised learning

      Ref. | Method | Key contribution | Limitation
      84 | Gradient smoothness & photometric loss | First unsupervised training with image pairs | Low accuracy, occlusion issues, local optima
      85 | Left-right consistency | Solve occlusion problem | Texture-copying artifacts
      86 | Pyramid processing | Mitigate multi-scale object impact | Depth holes in low-scale disparity maps
      87 | Local plane parameter module | Address local differentiability | Limited dataset usage
      88 | Dense feature fusion | Smoother depth maps with clearer edges | Limited dataset usage
      92 | Coupled depth-pose network | First video-based unsupervised framework | Struggle with dynamics and occlusion
      93 | Multi-scale appearance matching loss | Reduce depth artifacts | Require known camera intrinsics
      95 | Geometry-aware self-discovered mask | Solve scale inconsistency & occlusion | Lack absolute scale factor
      96 | Overlapping & blank masks | Improve accuracy | Poor in dynamic scenes
      97 | VO joint estimation | Enhance dynamic object estimation | Scale ambiguity
      98 | Depth-pose consistency loss & attention | Reduce scale ambiguity | Complex model
      100 | 3D geometric constraints | No need for camera parameters | Ignore moving objects
      101 | Minimal instance photometric residual | Remove dynamic object influence | Depth detail improvement needed
      102 | Occlusion-aware feature fusion | Integrate spatial context | 
      104 | Direct VO as pose predictor | Solve scale ambiguity | Poor for deformable objects
      106 | Competitive collaboration framework | Integrate multiple tasks | Scale inconsistency
      107 | Global loss constraint | Solve scale inconsistency | 
      108 | Cross-view completion | Improve generalization | High computation cost
      109 | Shared feature pyramid | Reduce network parameters | Ignore sequence context
      110 | ConvLSTM | Utilize sequence context | 
      111 | 3D packing-unpacking blocks | Preserve fine details | Depend on accurate speed measurements
      112 | Recurrent spatiotemporal network | Enhance self-motion estimation | High computation cost
      114 | Joint framework (GeoNet) | Reduce occlusion, blur, and dynamic object effects | Optical flow depends on depth & pose networks
      115 | Independent optical flow network | Prevent error propagation | Occlusion issues
      116 | Multi-task joint estimation | Mitigate occlusion | Limited scene flow accuracy
      117 | Independent motion estimation network | More accurate scene flow | Complex structure
      118 | Rigid vs. full scene flow | Reduce dynamic object influence | Limited by brightness constancy assumption
      121 | Semantic-guided depth (SGDepth) | Reduce dynamic object impact | Require known intrinsics
      123 | Pixel-level semantic priors | Improve depth consistency | Limited receptive field
      124 | Shared encoder with multi-task feature extraction | Enhance weak-texture estimation | 
      126 | Neighboring frames as auxiliary input | No need for calibrated videos | Sensitive to noise & local minima
      127 | Self-cross-attention layers | Improve feature matching | Ignore dynamic objects
      128 | Dynamic object motion disentanglement | Reduce dynamic object effects | Struggle in low-texture scenes
      129 | Learnable PatchMatch | Improve performance in low-texture or brightness variations | Underutilize multi-view geometry
      130 | Mono-depth as geometric prior for multi-view cost volume | Robust under multi-view ambiguity | 
      132 | GAN-based depth estimation | First GAN-based unsupervised approach | Limited accuracy
      133 | GAN for depth & pose | Improve accuracy | Fail on dynamic objects
      134 | GAN with temporal correlation | Solve depth blur & pose inaccuracy | Affected by occlusion & view changes
      135 | Boolean masking | Mitigate occlusion & view changes | Ineffective in low-light
      136 | Adversarial domain adaptation | Enhance low-light performance | Struggle with highlights & shadows
      137 | Domain separation network | Handle illumination changes | Poor for single-day‒night scenes
      138 | Multi-scale GAN | Improve depth resolution | High computation cost
      139 | Diffusion-based model | Enhance network stability | 
      140 | Lightweight model | Significantly reduce computation | 
    • Table 6. Semi-supervised monocular depth estimation methods

      Ref. | Method | Main contribution | Limitation
      147 | Sparse depth supervision | Combine supervised and unsupervised constraints | Limited accuracy
      148 | Left-right consistency loss | Improve accuracy | Rely on a large amount of labeled disparity data
      149 | Geometric constraints | Reduce dependence on labeled data | Efficiency needs improvement
      150 | Student-teacher strategy | Enhance training | 
      151 | Motion cues + view changes | Improve scale awareness | Poor performance on dynamic objects
      152 | Joint depth-semantic optimization | First to integrate semantic information | Limited dataset availability
      153 | Symmetric depth estimation | No need for paired depth data with semantic labels | Domain shift issue
      154 | Dual-branch framework | Avoid data collection challenges | Domain shift issue
      155 | Domain-aware adversarial learning | Mitigate cross-dataset differences | Does not fully utilize global context
      156 | Symbiotic Transformer + NearFarMix | Integrate global and local information, reduce overfitting | 
      157 | Cross-channel attention module | Enhance cross-task channel interactions | 
      158 | Manhattan world assumption | Ensure real-time performance | Limited applicability in complex environments
      159 | Dual-discriminator GAN | First GAN-based semi-supervised depth estimation model | Limited accuracy
      161 | Stacked recurrent GANs | Reduce errors | 
    • Table 7. Evaluation metrics

      Metric | Formula | Description
      RMSE | $\mathrm{RMSE}=\sqrt{\frac{1}{|N|}\sum_{i\in N}\left(d_i-d_i^{*}\right)^{2}}$ | Measures absolute error; sensitive to outliers
      REL | $\mathrm{REL}=\sqrt{\frac{1}{|N|}\sum_{i\in N}\left(\log d_i-\log d_i^{*}\right)^{2}}$ | The log transformation reduces the influence of outliers
      Abs | $\mathrm{Abs}=\frac{1}{|N|}\sum_{i\in N}\frac{\left|d_i-d_i^{*}\right|}{d_i^{*}}$ | Measures relative error, but differences may not be significant
      Sq | $\mathrm{Sq}=\frac{1}{|N|}\sum_{i\in N}\frac{\left(d_i-d_i^{*}\right)^{2}}{d_i^{*}}$ | Squaring enlarges the error differences
      Acc | $\mathrm{Acc}=\max\left(\frac{d_i}{d_i^{*}},\frac{d_i^{*}}{d_i}\right)=\delta<t$ | Reflects prediction accuracy, where $t\in\{1.25,1.25^{2},1.25^{3}\}$
      (A minimal Python sketch of these metrics is given after the table list below.)
    • Table 8. Quantitative evaluation results on the KITTI dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq | δ<1.25 | δ<1.25² | δ<1.25³
      29 | Supervised | 7.156 | 0.270 | 0.190 | 1.515 | 0.692 | 0.899 | 0.967
      35 | Supervised | 3.802 | 0.151 | 0.090 | 0.546 | 0.902 | 0.972 | 0.990
      39 | Supervised | 2.573 | 0.092 | 0.062 | | 0.959 | 0.995 | 0.999
      42 | Supervised | 2.032 | 0.076 | 0.050 | 0.142 | 0.977 | 0.997 | 0.999
      43 | Supervised | 2.041 | | 0.049 | | 0.979 | |
      61 | Supervised | | | 0.057 | | | |
      62 | Supervised | 3.605 | 0.187 | 0.107 | | 0.898 | 0.966 | 0.984
      65 | Supervised | | | 0.094 | 0.180 | 0.894 | |
      66 | Supervised | 2.360 | 0.088 | 0.059 | 0.190 | 0.964 | 0.995 | 0.999
      71 | Supervised | 2.322 | 0.088 | 0.057 | 0.186 | 0.964 | 0.995 | 0.999
      74 | Supervised | 2.702 | 0.116 | 0.072 | 0.325 | 0.951 | 0.989 | 0.996
      84 | Unsupervised | 5.104 | 0.273 | 0.169 | 1.080 | 0.740 | 0.904 | 0.962
      85 | Unsupervised | 5.927 | 0.247 | 0.148 | 1.344 | 0.803 | 0.922 | 0.964
      88 | Unsupervised | 5.063 | | 0.122 | 0.939 | 0.850 | 0.947 | 0.976
      92 | Unsupervised | 6.856 | 0.283 | 0.208 | 1.768 | 0.678 | 0.885 | 0.957
      93 | Unsupervised | 4.863 | 0.193 | 0.115 | 0.903 | 0.877 | 0.959 | 0.981
      95 | Unsupervised | 5.439 | 0.217 | 0.137 | 1.089 | 0.830 | 0.942 | 0.975
      97 | Unsupervised | 4.813 | 0.192 | 0.117 | 0.863 | 0.871 | 0.959 | 0.982
      98 | Unsupervised | 4.831 | 0.196 | 0.119 | 0.886 | 0.866 | 0.955 | 0.980
      100 | Unsupervised | 6.220 | 0.250 | 0.163 | 1.240 | 0.762 | 0.916 | 0.968
      104 | Unsupervised | 5.583 | 0.228 | 0.151 | 1.257 | 0.810 | 0.936 | 0.974
      107 | Unsupervised | 5.515 | 0.211 | 0.148 | 1.098 | 0.819 | 0.934 | 0.973
      108 | Unsupervised | 4.210 | | 0.098 | | 0.900 | 0.969 | 0.986
      109 | Unsupervised | 4.727 | | 0.123 | | 0.797 | |
      111 | Unsupervised | 4.601 | 0.189 | 0.111 | 0.785 | 0.878 | 0.960 | 0.982
      112 | Unsupervised | 4.524 | 0.188 | 0.108 | 0.767 | 0.891 | 0.974 | 0.991
      115 | Unsupervised | 5.507 | 0.223 | 0.150 | 1.124 | 0.806 | 0.933 | 0.973
      117 | Unsupervised | 4.853 | 0.175 | 0.106 | 0.888 | 0.879 | 0.965 | 0.987
      123 | Unsupervised | 4.839 | 0.196 | 0.114 | 0.846 | 0.853 | 0.953 | 0.982
      126 | Unsupervised | 4.459 | 0.176 | 0.098 | 0.770 | 0.900 | 0.965 | 0.983
      128 | Unsupervised | 4.458 | 0.175 | 0.096 | 0.720 | 0.897 | 0.964 | 0.984
      130 | Unsupervised | 4.216 | 0.169 | 0.089 | 0.663 | 0.904 | |
      133 | Unsupervised | 5.448 | 0.216 | 0.150 | 1.141 | 0.808 | 0.939 | 0.975
      134 | Unsupervised | 5.564 | 0.229 | 0.150 | 1.127 | 0.823 | 0.936 | 0.974
      138 | Unsupervised | 5.632 | 0.234 | 0.144 | 0.148 | 0.795 | 0.927 | 0.971
      147 | Semi-supervised | 3.518 | 0.179 | 0.108 | 0.595 | 0.875 | 0.964 | 0.988
      148 | Semi-supervised | 3.464 | 0.126 | 0.078 | 0.417 | 0.923 | 0.984 | 0.995
      150 | Semi-supervised | 3.162 | 0.162 | 0.091 | 0.505 | 0.901 | 0.969 | 0.986
      151 | Semi-supervised | 3.349 | 0.118 | 0.074 | 0.355 | 0.930 | |
      152 | Semi-supervised | 6.526 | 0.222 | 0.143 | 2.161 | 0.850 | 0.939 | 0.972
      154 | Semi-supervised | 4.361 | 0.217 | 0.116 | 0.956 | 0.848 | 0.948 | 0.974
      156 | Semi-supervised | | 0.289 | 0.080 | | 0.948 | |
      158 | Semi-supervised | 3.032 | 0.112 | 0.070 | 0.314 | 0.945 | 0.974 | 0.990
      159 | Semi-supervised | 4.195 | 0.170 | 0.093 | | | |
    • Table 9. Quantitative evaluation results on the NYU Depth-V2 dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq | δ<1.25 | δ<1.25² | δ<1.25³
      29 | Supervised | 0.907 | 0.285 | 0.215 | 0.212 | 0.611 | 0.887 | 0.971
      31 | Supervised | 0.573 | 0.195 | 0.127 | | 0.811 | 0.953 | 0.988
      35 | Supervised | 0.514 | | 0.111 | | 0.878 | 0.977 | 0.994
      39 | Supervised | 0.357 | | 0.110 | | 0.904 | 0.988 | 0.998
      42 | Supervised | 0.316 | | 0.088 | | 0.933 | 0.992 | 0.998
      43 | Supervised | 0.310 | | 0.086 | | 0.937 | |
      46 | Supervised | 0.821 | | 0.232 | | 0.621 | 0.886 | 0.968
      48 | Supervised | 0.329 | 0.125 | 0.098 | | 0.917 | 0.983 | 0.996
      50 | Supervised | 0.497 | 0.174 | 0.128 | | 0.845 | 0.966 | 0.990
      54 | Supervised | 0.416 | | 0.121 | | 0.864 | 0.974 | 0.995
      57 | Supervised | 0.552 | | 0.164 | | 0.768 | 0.940 | 0.984
      59 | Supervised | 1.130 | 0.390 | 0.360 | 0.460 | | |
      61 | Supervised | 0.277 | | 0.077 | | 0.953 | 0.995 | 0.999
      62 | Supervised | 0.540 | | 0.141 | | 0.819 | 0.965 | 0.992
      65 | Supervised | | | 0.297 | 0.340 | 0.667 | |
      66 | Supervised | 0.364 | | 0.103 | | 0.903 | 0.984 | 0.997
      71 | Supervised | 0.401 | | 0.120 | | 0.866 | 0.978 | 0.996
      73 | Supervised | 0.527 | | 0.134 | 0.103 | 0.822 | 0.971 | 0.993
      74 | Supervised | 0.403 | 0.045 | 0.109 | | 0.871 | 0.979 | 0.993
      98 | Unsupervised | 0.592 | | 0.165 | | 0.753 | 0.934 | 0.981
      109 | Unsupervised | 0.377 | | 0.095 | | 0.905 | 0.980 | 0.995
    • Table 10. Quantitative evaluation results on the Cityscapes dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq | δ<1.25 | δ<1.25² | δ<1.25³
      50 | Supervised | 6.917 | 0.414 | 0.227 | 3.800 | 0.801 | 0.913 | 0.950
      100 | Unsupervised | 7.580 | 0.334 | 0.267 | 2.686 | 0.577 | 0.840 | 0.937
      126 | Unsupervised | 6.223 | 0.170 | 0.114 | 1.193 | 0.875 | 0.967 | 0.989
      128 | Unsupervised | 5.867 | 0.157 | 0.103 | 1.000 | 0.895 | 0.974 | 0.991
      157 | Semi-supervised | 7.284 | | 0.142 | 1.553 | 0.824 | 0.956 | 0.987
      158 | Semi-supervised | 7.163 | 0.411 | 0.289 | 3.884 | 0.661 | 0.854 | 0.935
    • Table 11. Quantitative evaluation results on the Make3D dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq
      31 | Supervised | 4.460 | | 0.176 | 
      46 | Supervised | 10.270 | | 0.279 | 
      74 | Supervised | 3.788 | 0.058 | 0.154 | 
      92 | Unsupervised | 10.470 | 0.478 | 0.383 | 5.321
      93 | Unsupervised | 7.417 | | 0.322 | 3.589
      97 | Unsupervised | 7.391 | 0.167 | 0.334 | 3.520
      98 | Unsupervised | 7.609 | | 0.511 | 4.229
      104 | Unsupervised | 8.090 | 0.204 | 0.387 | 4.720
      109 | Unsupervised | 8.388 | | 0.367 | 4.267
      115 | Unsupervised | 6.890 | 0.416 | 0.331 | 2.698
      123 | Unsupervised | 7.646 | | 0.338 | 3.427
      150 | Semi-supervised | 5.136 | 0.107 | 0.265 | 
      154 | Semi-supervised | 6.497 | 0.158 | 0.402 | 4.181
      161 | Semi-supervised | 66.370 | | 0.203 | 
    • Table 12. Overview of depth estimation datasets

      Dataset | Year | Image type | Sensor type | Annotation type | Scene type | Image count | Resolution /(pixel×pixel)
      KITTI [162] | 2012 | Stereo | LiDAR | Sparse | Driving scene | 4.4×10⁴ | 1024×320
      NYU Depth-V2 [163] | 2012 | Monocular | Microsoft Kinect | Dense | Indoor | 1449 | 640×480
      Cityscapes [164] | 2016 | Monocular | Stereo camera | Disparity | Urban driving | 5×10³ | 2048×1024
      Make3D [165] | 2008 | Monocular | Laser scanner | Dense | Outdoor scene | 534 | 2272×1704
      DIODE [166] | 2019 | Stereo | Laser scanner | Dense | Indoor/outdoor | 2.55×10⁴ | 768×1024
      Middlebury 2014 [167] | 2014 | Stereo | DSLR camera | Dense | Indoor detail | 33 | 2872×1984
      Driving Stereo [168] | 2019 | Stereo | LiDAR | Sparse | Driving scene | 1.82×10⁵ | 1762×800
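The evaluation metrics defined in Table 7 can be computed directly from a predicted depth map and its ground truth. The following is a minimal sketch in Python/NumPy under the usual assumptions (strictly positive depths, and scoring only pixels with valid ground truth); the function and variable names are illustrative and are not taken from the surveyed works.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Evaluate a predicted depth map against ground truth using the Table 7 metrics.

    Only pixels with valid (positive) ground-truth depth are scored; both maps
    are assumed to hold strictly positive depths on those pixels.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = gt > eps                        # N: the set of valid pixels
    d, d_star = pred[mask], gt[mask]       # d_i and d_i*

    rmse = np.sqrt(np.mean((d - d_star) ** 2))                    # RMSE
    rel = np.sqrt(np.mean((np.log(d) - np.log(d_star)) ** 2))     # REL (log-scale RMSE)
    abs_rel = np.mean(np.abs(d - d_star) / d_star)                # Abs
    sq_rel = np.mean((d - d_star) ** 2 / d_star)                  # Sq
    ratio = np.maximum(d / d_star, d_star / d)                    # per-pixel delta
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]         # Acc at t = 1.25, 1.25^2, 1.25^3
    return {"RMSE": rmse, "REL": rel, "Abs": abs_rel, "Sq": sq_rel,
            "d<1.25": acc[0], "d<1.25^2": acc[1], "d<1.25^3": acc[2]}

if __name__ == "__main__":
    # Toy check with a synthetic depth map in metres; the values are illustrative only.
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 80.0, size=(375, 1242))
    pred = gt * rng.uniform(0.9, 1.1, size=gt.shape)
    print(depth_metrics(pred, gt))
```

In practice, published KITTI results such as those in Table 8 additionally cap the evaluation depth (commonly at 80 m) and, for scale-ambiguous unsupervised methods, apply median scaling of the prediction before computing these metrics.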
    Citation: Yiquan Wu, Haobo Xie. A Survey of Monocular Depth Estimation Methods Based on Deep Learning[J]. Acta Optica Sinica (Online), 2025, 2(13): 1305001

    Paper Information

    Category: Optical Information Acquisition, Display, and Processing

    Received: Apr. 25, 2025

    Accepted: May 15, 2025

    Published Online: Jul. 7, 2025

    The Author Email: Yiquan Wu (nuaaimage@163.com)

    DOI:10.3788/AOSOL250454

    CSTR:32394.14.AOSOL250454
