Acta Optica Sinica (Online), Volume 2, Issue 13, 1305001 (2025)

A Survey of Monocular Depth Estimation Methods Based on Deep Learning

Yiquan Wu* and Haobo Xie
Author Affiliations
  • College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China
    Figures & Tables (16)
    • Overall framework of this paper
    • Workflow of monocular depth estimation based on deep learning
    • An example framework of monocular depth estimation based on supervised learning [29]
    • Schematic diagram of geometric relationships between image pairs
    • Table 1. Comparison of several active and passive depth acquisition methods

      Category | Method | Principle | Advantage | Disadvantage
      Active | Time-of-flight (ToF) | Compute distance based on light travel time | Simple principle, unaffected by ambient light | High measurement cost
      Active | Structured light | Estimate depth using phase information of deformed light | Effective for complex surfaces | High measurement cost, sensitive to ambient light
      Passive | Stereo depth estimation | Disparity matching from stereo images | Unrestricted by training data | Struggle in low-texture scenes, poor performance at long distances
      Passive | Multi-view depth estimation | Extract disparity information from multiple images | Provide richer viewpoint information, effectively handle occlusion | High computational complexity, poor real-time performance
      Passive | Monocular depth estimation | Estimate depth from a single image | Require only one camera, low cost, easy to integrate | Rely on learning models, sensitive to training data and scene variations
    • Table 2. Comparison between this work and existing review papers

      Ref. | Year | Supervision classification | Recent work | Method comparison | Quantitative analysis | Application overview
      7 | 2018 | Supervised only | No | No | Yes | No
      8 | 2019 | None | No | No | No | No
      10 | 2019 | Supervised, unsupervised | No | No | Yes | No
      11 | 2020 | Supervised, self-supervised, semi-supervised | No | Yes | Yes | Yes
      9 | 2021 | Supervised, unsupervised | No | Yes | Yes | Yes
      12 | 2022 | None | No | No | Yes | No
      13 | 2022 | Supervised, unsupervised | No | No | Yes | No
      18 | 2022 | Supervised, unsupervised, semi-supervised | No | Yes | Limited | No
      14 | 2022 | Supervised, unsupervised, semi-supervised | No | No | Yes | No
      15 | 2023 | None | Yes | No | Yes | No
      16 | 2024 | Unsupervised only | No | Yes | Yes | Yes
      17 | 2024 | Unsupervised only | Yes | Yes | Yes | No
      19 | 2024 | Supervised, self-supervised, semi-supervised | Yes | Yes | Yes | Yes
      This work | 2025 | Supervised, unsupervised, semi-supervised | Yes | Yes | Yes | Yes
    • Table 3. Evolution of monocular depth estimation methods

      Category | Period | Method | Advantage | Limitation
      Depth cues-based | Before 2009 | Shadows, focus/defocus | Low cost, simple | Inefficient, non-real-time
      Traditional machine learning-based | 2009‒2014 | Parametric/non-parametric | Low computation | Texture-dependent, non-real-time
      Deep learning-based | After 2014 | Supervised, unsupervised, semi-supervised | High accuracy, multi-scene, real-time | High computation, data-hungry
    • Table 4. Monocular depth estimation methods based on supervised learning

      Ref. | Method | Contribution | Limitation
      29 | Multi-scale depth network | Preserve details without superpixels | Scale estimation errors
      31 | ResNet, BerHu loss | Address deep network issues | Limited generalization
      33 | Skip connections | Prevent inter-layer info loss | 
      35 | Laplacian pyramid | Adaptive multi-scale fusion | Lack global consistency
      39 | Transformer extractor | Global receptive field | Weak local connectivity
      40 | Multi-Transformer blocks | Integrate local‒global info, restore details | Miss long-range depth relations
      41 | Transformer + CNN | Enhance local features | High computation
      42 | Cross-distillation | Leverage long-range‒local depth, reduce cost | 
      43 | Depth gradient refinement | Improve depth gradient continuity | 
      46 | CRF superpixel refinement | Integrate color, texture, position | Struggle with complex scenes
      47 | Fully connected CRF | Fuse multiple features | High computation
      48 | Semantic integration | More accurate far-depth estimation | Limited receptive field
      49 | Transformer + CNN | Enhance long-range depth modeling | 
      50 | Scene-aware estimation | Depth at semantic-segment level | Lack structural consistency
      51 | Edge-assisted depth | High-detail maps | Boundary artifacts
      52 | PatchFusion | Merge coarse‒global and fine‒local depth | Need real-time optimization
      54 | Adaptive depth distribution | Lightweight, suited for indoors | 
      55 | BerHu sum loss | Multi-scale fusion | Unstable with outliers
      56 | Hybrid loss | Noise robustness | Focus only on loss optimization
      57 | 3D geometric loss | Faster computation, easy integration | 
      59 | Relative depth | Introduce related dataset | View synthesis artifacts
      60 | Structural ranking loss | Reduce artifacts | Inaccurate absolute depth
      61 | Two-stage relative‒absolute depth | Improve absolute depth | 
      62 | Depth classification | Lower model complexity | Higher uncertainty at greater depth
      63 | Progressive binning | Reduce uncertainty | Lack mathematical reliability
      64 | Discrete classification | Ensure depth probability distribution | Ignore depth ordering
      65 | Constrained ordinal regression | Maintain depth order | Inefficient
      66 | Adaptive depth binning | Eliminate quantization artifacts | Fixed bin count
      69 | LocalBins | Iterative grouping refinement | High complexity
      70 | IEBins | Maintain stable bin counts | Weak detail capture
      71 | CaBins | Improve detail accuracy | 
      73 | GAN-based | First to apply GAN in depth estimation | Poor texture recovery
      74 | UNet + multi-skip | Recover low-dim textures | Lack conditional guidance
      75 | cGAN | Enhance depth estimation and reconstruction | Sensitive to noisy conditions
    • Table 5. Monocular depth estimation methods based on unsupervised learning

      Ref. | Method | Key contribution | Limitation
      84 | Gradient smoothness & photometric loss | First unsupervised training with image pairs | Low accuracy, occlusion issues, local optima
      85 | Left-right consistency | Solve occlusion problem | Texture-copying artifacts
      86 | Pyramid processing | Mitigate multi-scale object impact | Depth holes in low-scale disparity maps
      87 | Local plane parameter module | Address local differentiability | Limited dataset usage
      88 | Dense feature fusion | Smoother depth maps with clearer edges | Limited dataset usage
      92 | Coupled depth-pose network | First video-based unsupervised framework | Struggle with dynamics and occlusion
      93 | Multi-scale appearance matching loss | Reduce depth artifacts | Require known camera intrinsics
      95 | Geometry-aware self-discovered mask | Solve scale inconsistency & occlusion | Lack absolute scale factor
      96 | Overlapping & blank masks | Improve accuracy | Poor in dynamic scenes
      97 | VO joint estimation | Enhance dynamic object estimation | Scale ambiguity
      98 | Depth-pose consistency loss & attention | Reduce scale ambiguity | Complex model
      100 | 3D geometric constraints | No need for camera parameters | Ignore moving objects
      101 | Minimal instance photometric residual | Remove dynamic object influence | Depth detail improvement needed
      102 | Occlusion-aware feature fusion | Integrate spatial context | 
      104 | Direct VO as pose predictor | Solve scale ambiguity | Poor for deformable objects
      106 | Competitive collaboration framework | Integrate multiple tasks | Scale inconsistency
      107 | Global loss constraint | Solve scale inconsistency | 
      108 | Cross-view completion | Improve generalization | High computation cost
      109 | Shared feature pyramid | Reduce network parameters | Ignore sequence context
      110 | ConvLSTM | Utilize sequence context | 
      111 | 3D packing-unpacking blocks | Preserve fine details | Depend on accurate speed measurements
      112 | Recurrent spatiotemporal network | Enhance self-motion estimation | High computation cost
      114 | Joint framework (GeoNet) | Reduce occlusion, blur, and dynamic object effects | Optical flow depends on depth & pose networks
      115 | Independent optical flow network | Prevent error propagation | Occlusion issues
      116 | Multi-task joint estimation | Mitigate occlusion | Limited scene flow accuracy
      117 | Independent motion estimation network | More accurate scene flow | Complex structure
      118 | Rigid vs. full scene flow | Reduce dynamic object influence | Limited by brightness constancy assumption
      121 | Semantic-guided depth (SGDepth) | Reduce dynamic object impact | Require known intrinsics
      123 | Pixel-level semantic priors | Improve depth consistency | Limited receptive field
      124 | Shared encoder with multi-task feature extraction | Enhance weak-texture estimation | 
      126 | Neighboring frames as auxiliary input | No need for calibrated videos | Sensitive to noise & local minima
      127 | Self-cross-attention layers | Improve feature matching | Ignore dynamic objects
      128 | Dynamic object motion disentanglement | Reduce dynamic object effects | Struggle in low-texture scenes
      129 | Learnable PatchMatch | Improve performance in low-texture or brightness variations | Underutilize multi-view geometry
      130 | Mono-depth as geometric prior for multi-view cost volume | Robust under multi-view ambiguity | 
      132 | GAN-based depth estimation | First GAN-based unsupervised approach | Limited accuracy
      133 | GAN for depth & pose | Improve accuracy | Fail on dynamic objects
      134 | GAN with temporal correlation | Solve depth blur & pose inaccuracy | Affected by occlusion & view changes
      135 | Boolean masking | Mitigate occlusion & view changes | Ineffective in low-light
      136 | Adversarial domain adaptation | Enhance low-light performance | Struggle with highlights & shadows
      137 | Domain separation network | Handle illumination changes | Poor for single-day‒night scenes
      138 | Multi-scale GAN | Improve depth resolution | High computation cost
      139 | Diffusion-based model | Enhance network stability | 
      140 | Lightweight model | Significantly reduce computation | 
    • Table 6. Semi-supervised monocular depth estimation methods

      Ref. | Method | Main contribution | Limitation
      147 | Sparse depth supervision | Combine supervised and unsupervised constraints | Limited accuracy
      148 | Left-right consistency loss | Improve accuracy | Rely on a large amount of labeled disparity data
      149 | Geometric constraints | Reduce dependence on labeled data | Efficiency needs improvement
      150 | Student-teacher strategy | Enhance training | 
      151 | Motion cues + view changes | Improve scale awareness | Poor performance on dynamic objects
      152 | Joint depth-semantic optimization | First to integrate semantic information | Limited dataset availability
      153 | Symmetric depth estimation | No need for paired depth data with semantic labels | Domain shift issue
      154 | Dual-branch framework | Avoid data collection challenges | Domain shift issue
      155 | Domain-aware adversarial learning | Mitigate cross-dataset differences | Does not fully utilize global context
      156 | Symbiotic Transformer + NearFarMix | Integrate global and local information, reduce overfitting | 
      157 | Cross-channel attention module | Enhance cross-task channel interactions | 
      158 | Manhattan world assumption | Ensure real-time performance | Limited applicability in complex environments
      159 | Dual-discriminator GAN | First GAN-based semi-supervised depth estimation model | Limited accuracy
      161 | Stacked recurrent GANs | Reduce errors | 
    • Table 7. Evaluation metrics

      Metric | Formula | Description
      RMSE | $\mathrm{RMSE}=\sqrt{\frac{1}{|N|}\sum_{i\in N}\left(d_i-d_i^{*}\right)^{2}}$ | Measures absolute error; sensitive to outliers
      REL | $\mathrm{REL}=\sqrt{\frac{1}{|N|}\sum_{i\in N}\left(\log d_i-\log d_i^{*}\right)^{2}}$ | The log transformation reduces the influence of outliers
      Abs | $\mathrm{Abs}=\frac{1}{|N|}\sum_{i\in N}\frac{\left|d_i-d_i^{*}\right|}{d_i^{*}}$ | Measures relative error, but differences may not be significant
      Sq | $\mathrm{Sq}=\frac{1}{|N|}\sum_{i\in N}\frac{\left(d_i-d_i^{*}\right)^{2}}{d_i^{*}}$ | Squaring enlarges the error differences
      Acc | $\mathrm{Acc}=\max\left(\frac{d_i}{d_i^{*}},\frac{d_i^{*}}{d_i}\right)=\delta<t$ | Reflects prediction accuracy, where $t\in\{1.25,1.25^{2},1.25^{3}\}$
      (A minimal Python sketch of these metrics is given after the table list below.)
    • Table 8. Quantitative evaluation results on the KITTI dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq | δ<1.25 | δ<1.25² | δ<1.25³
      29 | Supervised | 7.156 | 0.270 | 0.190 | 1.515 | 0.692 | 0.899 | 0.967
      35 | Supervised | 3.802 | 0.151 | 0.090 | 0.546 | 0.902 | 0.972 | 0.990
      39 | Supervised | 2.573 | 0.092 | 0.062 | | 0.959 | 0.995 | 0.999
      42 | Supervised | 2.032 | 0.076 | 0.050 | 0.142 | 0.977 | 0.997 | 0.999
      43 | Supervised | 2.041 | | 0.049 | | 0.979 | |
      61 | Supervised | | | 0.057 | | | |
      62 | Supervised | 3.605 | 0.187 | 0.107 | | 0.898 | 0.966 | 0.984
      65 | Supervised | | | 0.094 | 0.180 | 0.894 | |
      66 | Supervised | 2.360 | 0.088 | 0.059 | 0.190 | 0.964 | 0.995 | 0.999
      71 | Supervised | 2.322 | 0.088 | 0.057 | 0.186 | 0.964 | 0.995 | 0.999
      74 | Supervised | 2.702 | 0.116 | 0.072 | 0.325 | 0.951 | 0.989 | 0.996
      84 | Unsupervised | 5.104 | 0.273 | 0.169 | 1.080 | 0.740 | 0.904 | 0.962
      85 | Unsupervised | 5.927 | 0.247 | 0.148 | 1.344 | 0.803 | 0.922 | 0.964
      88 | Unsupervised | 5.063 | | 0.122 | 0.939 | 0.850 | 0.947 | 0.976
      92 | Unsupervised | 6.856 | 0.283 | 0.208 | 1.768 | 0.678 | 0.885 | 0.957
      93 | Unsupervised | 4.863 | 0.193 | 0.115 | 0.903 | 0.877 | 0.959 | 0.981
      95 | Unsupervised | 5.439 | 0.217 | 0.137 | 1.089 | 0.830 | 0.942 | 0.975
      97 | Unsupervised | 4.813 | 0.192 | 0.117 | 0.863 | 0.871 | 0.959 | 0.982
      98 | Unsupervised | 4.831 | 0.196 | 0.119 | 0.886 | 0.866 | 0.955 | 0.980
      100 | Unsupervised | 6.220 | 0.250 | 0.163 | 1.240 | 0.762 | 0.916 | 0.968
      104 | Unsupervised | 5.583 | 0.228 | 0.151 | 1.257 | 0.810 | 0.936 | 0.974
      107 | Unsupervised | 5.515 | 0.211 | 0.148 | 1.098 | 0.819 | 0.934 | 0.973
      108 | Unsupervised | 4.210 | | 0.098 | | 0.900 | 0.969 | 0.986
      109 | Unsupervised | 4.727 | | 0.123 | | 0.797 | |
      111 | Unsupervised | 4.601 | 0.189 | 0.111 | 0.785 | 0.878 | 0.960 | 0.982
      112 | Unsupervised | 4.524 | 0.188 | 0.108 | 0.767 | 0.891 | 0.974 | 0.991
      115 | Unsupervised | 5.507 | 0.223 | 0.150 | 1.124 | 0.806 | 0.933 | 0.973
      117 | Unsupervised | 4.853 | 0.175 | 0.106 | 0.888 | 0.879 | 0.965 | 0.987
      123 | Unsupervised | 4.839 | 0.196 | 0.114 | 0.846 | 0.853 | 0.953 | 0.982
      126 | Unsupervised | 4.459 | 0.176 | 0.098 | 0.770 | 0.900 | 0.965 | 0.983
      128 | Unsupervised | 4.458 | 0.175 | 0.096 | 0.720 | 0.897 | 0.964 | 0.984
      130 | Unsupervised | 4.216 | 0.169 | 0.089 | 0.663 | 0.904 | |
      133 | Unsupervised | 5.448 | 0.216 | 0.150 | 1.141 | 0.808 | 0.939 | 0.975
      134 | Unsupervised | 5.564 | 0.229 | 0.150 | 1.127 | 0.823 | 0.936 | 0.974
      138 | Unsupervised | 5.632 | 0.234 | 0.144 | 0.148 | 0.795 | 0.927 | 0.971
      147 | Semi-supervised | 3.518 | 0.179 | 0.108 | 0.595 | 0.875 | 0.964 | 0.988
      148 | Semi-supervised | 3.464 | 0.126 | 0.078 | 0.417 | 0.923 | 0.984 | 0.995
      150 | Semi-supervised | 3.162 | 0.162 | 0.091 | 0.505 | 0.901 | 0.969 | 0.986
      151 | Semi-supervised | 3.349 | 0.118 | 0.074 | 0.355 | 0.930 | |
      152 | Semi-supervised | 6.526 | 0.222 | 0.143 | 2.161 | 0.850 | 0.939 | 0.972
      154 | Semi-supervised | 4.361 | 0.217 | 0.116 | 0.956 | 0.848 | 0.948 | 0.974
      156 | Semi-supervised | | 0.289 | 0.080 | | 0.948 | |
      158 | Semi-supervised | 3.032 | 0.112 | 0.070 | 0.314 | 0.945 | 0.974 | 0.990
      159 | Semi-supervised | 4.195 | 0.170 | 0.093 | | | |
    • Table 9. Quantitative evaluation results on the NYU Depth-V2 dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq | δ<1.25 | δ<1.25² | δ<1.25³
      29 | Supervised | 0.907 | 0.285 | 0.215 | 0.212 | 0.611 | 0.887 | 0.971
      31 | Supervised | 0.573 | 0.195 | 0.127 | | 0.811 | 0.953 | 0.988
      35 | Supervised | 0.514 | | 0.111 | | 0.878 | 0.977 | 0.994
      39 | Supervised | 0.357 | | 0.110 | | 0.904 | 0.988 | 0.998
      42 | Supervised | 0.316 | | 0.088 | | 0.933 | 0.992 | 0.998
      43 | Supervised | 0.310 | | 0.086 | | 0.937 | |
      46 | Supervised | 0.821 | | 0.232 | | 0.621 | 0.886 | 0.968
      48 | Supervised | 0.329 | 0.125 | 0.098 | | 0.917 | 0.983 | 0.996
      50 | Supervised | 0.497 | 0.174 | 0.128 | | 0.845 | 0.966 | 0.990
      54 | Supervised | 0.416 | | 0.121 | | 0.864 | 0.974 | 0.995
      57 | Supervised | 0.552 | | 0.164 | | 0.768 | 0.940 | 0.984
      59 | Supervised | 1.130 | 0.390 | 0.360 | 0.460 | | |
      61 | Supervised | 0.277 | | 0.077 | | 0.953 | 0.995 | 0.999
      62 | Supervised | 0.540 | | 0.141 | | 0.819 | 0.965 | 0.992
      65 | Supervised | | | 0.297 | 0.340 | 0.667 | |
      66 | Supervised | 0.364 | | 0.103 | | 0.903 | 0.984 | 0.997
      71 | Supervised | 0.401 | | 0.120 | | 0.866 | 0.978 | 0.996
      73 | Supervised | 0.527 | | 0.134 | 0.103 | 0.822 | 0.971 | 0.993
      74 | Supervised | 0.403 | 0.045 | 0.109 | | 0.871 | 0.979 | 0.993
      98 | Unsupervised | 0.592 | | 0.165 | | 0.753 | 0.934 | 0.981
      109 | Unsupervised | 0.377 | | 0.095 | | 0.905 | 0.980 | 0.995
    • Table 10. Quantitative evaluation results on the Cityscapes dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq | δ<1.25 | δ<1.25² | δ<1.25³
      50 | Supervised | 6.917 | 0.414 | 0.227 | 3.800 | 0.801 | 0.913 | 0.950
      100 | Unsupervised | 7.580 | 0.334 | 0.267 | 2.686 | 0.577 | 0.840 | 0.937
      126 | Unsupervised | 6.223 | 0.170 | 0.114 | 1.193 | 0.875 | 0.967 | 0.989
      128 | Unsupervised | 5.867 | 0.157 | 0.103 | 1.000 | 0.895 | 0.974 | 0.991
      157 | Semi-supervised | 7.284 | | 0.142 | 1.553 | 0.824 | 0.956 | 0.987
      158 | Semi-supervised | 7.163 | 0.411 | 0.289 | 3.884 | 0.661 | 0.854 | 0.935
    • Table 11. Quantitative evaluation results on the Make3D dataset

      Ref. | Supervision type | RMSE | REL | Abs | Sq
      31 | Supervised | 4.460 | | 0.176 | 
      46 | Supervised | 10.270 | | 0.279 | 
      74 | Supervised | 3.788 | 0.058 | 0.154 | 
      92 | Unsupervised | 10.470 | 0.478 | 0.383 | 5.321
      93 | Unsupervised | 7.417 | | 0.322 | 3.589
      97 | Unsupervised | 7.391 | 0.167 | 0.334 | 3.520
      98 | Unsupervised | 7.609 | | 0.511 | 4.229
      104 | Unsupervised | 8.090 | 0.204 | 0.387 | 4.720
      109 | Unsupervised | 8.388 | | 0.367 | 4.267
      115 | Unsupervised | 6.890 | 0.416 | 0.331 | 2.698
      123 | Unsupervised | 7.646 | | 0.338 | 3.427
      150 | Semi-supervised | 5.136 | 0.107 | 0.265 | 
      154 | Semi-supervised | 6.497 | 0.158 | 0.402 | 4.181
      161 | Semi-supervised | 66.370 | | 0.203 | 
    • Table 12. Overview of depth estimation datasets

      Dataset | Year | Image type | Sensor type | Annotation type | Scene type | Image count | Resolution /(pixel×pixel)
      KITTI [162] | 2012 | Stereo | LiDAR | Sparse | Driving scene | 4.4×10⁴ | 1024×320
      NYU Depth-V2 [163] | 2012 | Monocular | Microsoft Kinect | Dense | Indoor | 1449 | 640×480
      Cityscapes [164] | 2016 | Monocular | Stereo camera | Disparity | Urban driving | 5×10³ | 2048×1024
      Make3D [165] | 2008 | Monocular | Laser scanner | Dense | Outdoor scene | 534 | 2272×1704
      DIODE [166] | 2019 | Stereo | Laser scanner | Dense | Indoor/outdoor | 2.55×10⁴ | 768×1024
      Middlebury 2014 [167] | 2014 | Stereo | DSLR camera | Dense | Indoor detail | 33 | 2872×1984
      Driving Stereo [168] | 2019 | Stereo | LiDAR | Sparse | Driving scene | 1.82×10⁵ | 1762×800
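The evaluation metrics defined in Table 7 can be computed directly from a predicted depth map and its ground truth. The following is a minimal sketch in Python/NumPy under the usual assumptions (strictly positive depths, and scoring only pixels with valid ground truth); the function and variable names are illustrative and are not taken from the surveyed works.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Evaluate a predicted depth map against ground truth using the Table 7 metrics.

    Only pixels with valid (positive) ground-truth depth are scored; both maps
    are assumed to hold strictly positive depths on those pixels.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mask = gt > eps                        # N: the set of valid pixels
    d, d_star = pred[mask], gt[mask]       # d_i and d_i*

    rmse = np.sqrt(np.mean((d - d_star) ** 2))                    # RMSE
    rel = np.sqrt(np.mean((np.log(d) - np.log(d_star)) ** 2))     # REL (log-scale RMSE)
    abs_rel = np.mean(np.abs(d - d_star) / d_star)                # Abs
    sq_rel = np.mean((d - d_star) ** 2 / d_star)                  # Sq
    ratio = np.maximum(d / d_star, d_star / d)                    # per-pixel delta
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]         # Acc at t = 1.25, 1.25^2, 1.25^3
    return {"RMSE": rmse, "REL": rel, "Abs": abs_rel, "Sq": sq_rel,
            "d<1.25": acc[0], "d<1.25^2": acc[1], "d<1.25^3": acc[2]}

if __name__ == "__main__":
    # Toy check with a synthetic depth map in metres; the values are illustrative only.
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 80.0, size=(375, 1242))
    pred = gt * rng.uniform(0.9, 1.1, size=gt.shape)
    print(depth_metrics(pred, gt))
```

In practice, published KITTI results such as those in Table 8 additionally cap the evaluation depth (commonly at 80 m) and, for scale-ambiguous unsupervised methods, apply median scaling of the prediction before computing these metrics.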
    Citation: Yiquan Wu, Haobo Xie. A Survey of Monocular Depth Estimation Methods Based on Deep Learning[J]. Acta Optica Sinica (Online), 2025, 2(13): 1305001

    Paper Information

    Category: Optical Information Acquisition, Display, and Processing

    Received: Apr. 25, 2025

    Accepted: May 15, 2025

    Published Online: Jul. 7, 2025

    The Author Email: Yiquan Wu (nuaaimage@163.com)

    DOI:10.3788/AOSOL250454

    CSTR:32394.14.AOSOL250454
