The increasing popularity of the metaverse has led to a growing interest and market size in spatial computing from both academia and industry. Developing portable and accurate imaging and depth sensing systems is crucial for advancing next-generation virtual reality devices. This work demonstrates an intelligent, lightweight, and compact edge-enhanced depth perception system that utilizes a binocular meta-lens for spatial computing. The miniaturized system comprises a binocular meta-lens, a 532 nm filter, and a CMOS sensor. For disparity computation, we propose a stereo-matching neural network with a novel H-Module. The H-Module incorporates an attention mechanism into the Siamese network. The symmetric architecture, with cross-pixel interaction and cross-view interaction, enables a more comprehensive analysis of contextual information in stereo images. Based on spatial intensity discontinuity, the edge enhancement eliminates ill-posed regions in the image where ambiguous depth predictions may occur due to a lack of texture. With the assistance of deep learning, our edge-enhanced system provides prompt responses in less than 0.15 seconds. This edge-enhanced depth perception meta-lens imaging system will significantly contribute to accurate 3D scene modeling, machine vision, autonomous driving, and robotics development.
Spatial computing1 and the emerging metaverse represent a paradigm shift in how humans interact with machines. Spatial computing refers to integrating digital information and virtual objects into the physical world, creating a mixed reality in which the boundaries between the digital and physical realms are blurred. Common augmented reality devices rely on spatial computing to perceive the depth of the real physical world while embedding virtual objects into real scenes three-dimensionally2. One of the key technologies of spatial computing is depth perception, which bridges the gap between the physical and digital realms and promises intuitive, natural interaction with virtual objects, so that digital information can be correctly placed and manipulated in the scene following physical laws. However, traditional depth-sensing systems contain many sensors (mainly cameras and LiDAR), and their weight and volume make wearable human-computer interaction devices uncomfortable. The space occupied by bulky sensors also limits battery capacity, so such devices must be recharged frequently. Advancements in portable and accurate imaging and depth-sensing systems are therefore crucial for next-generation human-computer interaction wearable devices.
Complementing spatial computing, the binocular meta-lens3 offers a breakthrough approach to depth sensing and imaging with the advantages of being lightweight4,5, thin, and compact. Meta-lenses provide advanced optical functionalities that surpass the limitations of traditional optics6,7, such as wavefront shaping8, polarization control9−11, and spectral manipulation12,13. A meta-lens utilizes nanoantennas to manipulate light14, offering opportunities to engineer optical devices that are thin, flat, broadband15, highly diffraction-efficient16, capable of extreme depth-of-field17, and compatible with complementary metal-oxide-semiconductor (CMOS) technology. By leveraging the unique properties of meta-optics, this compact and miniaturized optical meta-device allows for capturing three-dimensional information from the surrounding environment. A binocular meta-lens enables precise and accurate depth perception, similar to human binocular vision. In recent years, artificial intelligence has increasingly promoted the development of meta-devices in terms of inverse design18,19, prompt data analysis20,21, optical computation22,23, and intelligent reconfigurable meta-devices24,25. These advancements pave the way for compact, lightweight, and highly efficient optical systems seamlessly integrated into spatial computing devices, enhancing their performance and enabling novel applications.
The principle underlying depth acquisition in binocular imaging relies on presenting a stereo-image pair exhibiting discernible disparities26. Disparity denotes the horizontal displacement between corresponding pixels in the left and right images. Traditional binocular disparity computation pipelines often employ block-matching algorithms to calculate matching costs27. The combination of deep learning and photonics has been widely researched in recent years, encompassing applications such as orbital angular momentum communication28, optical neural networks29, optical encryption30, enhancing holographic data storage (HDS)31, photonic inverse design32, and hyperspectral imaging33. For disparity estimation, convolutional neural networks (CNNs) are now generally preferred owing to their speed, precision, and operational simplicity. Despite significant advancements in accuracy and speed achieved by various binocular stereo systems, finding accurate corresponding points within inherently ill-posed regions, such as textureless areas and reflective surfaces, remains challenging for depth computation34. Ambiguous depth predictions seriously affect subsequent machine decision-making. Edges are the typical representation of texture, and edge regions are guaranteed to contain texture feature points for stereo matching. Numerous studies have explored edge detection techniques utilizing meta-lenses, each with distinct characteristics. For instance, the Green's function35,36 and the spiral phase37 have been employed to enable edge detection with a single meta-lens. Another approach involves utilizing meta-lens arrays for three-dimensional (3D) edge detection38. Polarization control has been leveraged for switchable bright-field imaging and edge detection capabilities39,40. Edge detection by the Pancharatnam–Berry phase41 has emerged as a noteworthy technique, demonstrating potential in quantum applications42. Edge-based depth perception offers superior fidelity in depth estimation. Within the framework of depth edge views, non-textured regions that lack prominent edges or transitions are efficiently discarded. This filtering process reduces the impact of unreliable or ambiguous depth information originating from textureless regions, thereby enhancing the overall accuracy and reliability of depth estimation. By focusing on edges that signify depth discontinuities, edge-based depth perception provides a more resilient and accurate depiction of depth.
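As a point of reference for the learning-based approach introduced later, the conventional block-matching pipeline can be summarized in a few lines of code. The following NumPy sketch computes a disparity map by sum-of-squared-differences (SSD) matching along rectified scanlines; the window size, disparity range, and cost function are illustrative choices and not any specific implementation from the cited literature.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=64, win=5):
    """Minimal SSD block matching on a rectified grayscale stereo pair.

    left, right: 2D float arrays of the same size.
    Returns an integer disparity map (horizontal pixel shift).
    """
    h, w = left.shape
    half = win // 2
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            # Search candidate shifts along the same scanline in the right image.
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                cost = np.sum((ref - cand) ** 2)  # SSD matching cost
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

In textureless windows the SSD cost is nearly identical for every candidate shift, which is exactly the ill-posed behavior discussed above.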
We develop an edge-enhanced depth perception system based on a binocular meta-lens for spatial computing. The whole system is miniaturized, intelligent, lightweight, and compact. Its hardware consists of a binocular meta-lens, a 532 nm filter, and a CMOS sensor. Each meta-lens, measuring 2.6 mm in diameter, weighs 2.45×10−5 g and occupies a volume of 3.98×10−6 cm3. The sapphire substrate weighs 0.115 g with a volume of 0.0288 cm3. The thin and flat nature of the meta-lens keeps both the physical system configuration and the image processing pipeline simple. Without preprocessing, the raw captured image is processed directly by our proposed pyramid stereo-matching neural network, H-Net, to obtain the disparity. A novel symmetric H-Module with an attention mechanism allows H-Net to dynamically allocate resources based on the significance of the contextual features of each view and the correlation between the left and right views. With the depth-sensing results, edge enhancement is performed to filter the feature information, detecting gradients in 3D space.
Figure 1 shows the schematic of the edge-enhanced depth perception system with our binocular meta-lens. Two letter objects are placed in front of the binocular stereo-vision meta-lens. The application scenario shown in Fig. 1 contains ill-posed regions, such as the letter objects' unpatterned background and untextured surfaces. However, with the support of the proposed neural network for comprehensive context analysis and a Canny edge detector for filtering, an edge-enhanced depth perception view is realized, perceiving both intensity and depth discontinuities simultaneously.
Figure 1.Schematic of the edge-enhanced spatial computing with binocular meta-lens. There are two letter objects in front of the binocular meta-lens, which are texture-less and have no background. A binocular meta-lens is designed and fabricated to develop the stereo vision system for texture-less spatial computing scenarios. An edge-enhanced depth perception is realized with the support of a proposed neural network.
The convergence of spatial computing and meta-optics holds immense potential for transforming our daily lives. From augmented and virtual reality experiences that blend seamlessly with our physical surroundings to smart glasses that provide personalized information overlays, edge-enhanced spatial computing powered by meta-optics promises to revolutionize how we perceive and interact with the world around us. This integration can lead to breakthroughs in robotics, autonomous systems, underwater exploration, and medical imaging, where accurate depth perception is crucial for navigation, object recognition, and scene reconstruction.
Methods
Simulation and fabrication
We utilize the commercial simulation software COMSOL Multiphysics® to design and analyze the unit cells of the meta-lens. We set periodic boundary conditions for the x and y directions and a perfectly matched layer (PML) boundary condition for the z-direction. The meta-lens consists of unit cells of gallium nitride (GaN) cylindrical nanopillars on a sapphire substrate. The diameter of the nanopillars varies across the meta-lens. The refractive index of the sapphire substrate is set to 1.77, while the refractive index of GaN at the working wavelength is 2.42. Using this configuration, we calculate the simulated transmission spectra and phase shifts of the cylindrical nanopillars, as shown in Supplementary information Fig. S1. The meta-atom arrangement layout for fabrication is designed according to the focusing phase distribution

$$\varphi(x, y) = \frac{2\pi}{\lambda}\left(f - \sqrt{x^{2} + y^{2} + f^{2}}\right)$$

in which $\varphi(x, y)$ is the phase compensation required at the position $(x, y)$ under illumination of wavelength $\lambda$, and $f$ is the desired focal length of 10.0 mm. The target diameter of each meta-lens is 2.6 mm.
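For illustration, the layout step can be sketched as follows. This is a minimal example that evaluates the target phase only along a radial cut and assumes a hypothetical meta-atom pitch of 240 nm, which is not specified in this section.

```python
import numpy as np

# Minimal sketch of the hyperbolic focusing phase profile used to lay out the
# meta-atoms, evaluated along a radial cut for illustration.
wavelength = 532e-9      # working wavelength (m)
focal_length = 10.0e-3   # designed focal length (m)
radius = 1.3e-3          # meta-lens radius (m), i.e. 2.6 mm diameter
pitch = 240e-9           # assumed meta-atom pitch (m), not given in the text

r = np.arange(0, radius, pitch)  # radial sampling positions
# phi(r) = (2*pi/lambda) * (f - sqrt(r^2 + f^2))
phi = 2 * np.pi / wavelength * (focal_length - np.sqrt(r**2 + focal_length**2))
phi_wrapped = np.mod(phi, 2 * np.pi)  # wrap to [0, 2*pi)

# Each wrapped phase value is then mapped to the nanopillar diameter that
# provides the closest simulated phase shift in the unit-cell library (Fig. S1).
print(phi_wrapped[:5])
```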
The proposed binocular meta-lens is fabricated by the following process (see details in Supplementary information Fig. S2): A 750-nm-thick GaN layer is first deposited on a sapphire substrate via metalorganic chemical vapor deposition (MOCVD). A 200-nm-thick SiO2 film, which serves as the hard mask for pattern transfer to the GaN layer with a high aspect ratio, is subsequently deposited using an E-gun evaporator. A PMMA layer is spin-coated on the SiO2 film, followed by pre-baking at 180 °C for 3 min. A layer of conductive polymer is then spin-coated on the PMMA to avoid charge accumulation. The PMMA layer is exposed by electron-beam lithography (EBL; ELS-HS50, ELIONIX INC.) for pattern definition. After immersion in DI water to remove the conductive polymer layer, the patterned sample is developed with methyl isobutyl ketone (MIBK)/isopropyl alcohol (IPA) for 75 s and rinsed in IPA for 20 s. A 40-nm-thick Cr layer is then deposited on the patterned sample using an E-gun evaporator. After a lift-off process in acetone, the pattern is transferred into the Cr layer. Taking the Cr layer as the hard mask, the SiO2 layer is etched by inductively coupled plasma reactive ion etching (ICP-RIE) with CF4 gas. Chromium etchant is adopted to remove the remaining Cr. A second ICP-RIE step with a mixture of Ar and Cl2 is applied to transfer the pattern from the patterned SiO2 film to the GaN film. After removing the residual SiO2 with a buffered oxide etch (BOE) solution, the desired GaN nanostructure on the sapphire substrate is finally realized.
Figure 2(a) shows an optical image of the fabricated binocular meta-lens. The fabricated structure was characterized with scanning electron microscope (SEM) images. There are no cracks or pores on the fabricated nanopillars, as shown in the top-view SEM image of Fig. 2(b). The zoomed-in tilted-view SEM image in Fig. 2(c) shows that the 750-nm-tall nanopillars are well collimated and precisely etched. The physical dimension analysis of the binocular sample is divided into two parts: the sapphire substrate and the two GaN meta-lenses. Each meta-lens, measuring 2.6 mm in diameter with a volume of 4.25×10−6 cm3, weighs 2.61×10−5 g, less than one percent of the weight of a human hair. The sapphire substrate weighs 0.115 g and occupies a volume of 0.0288 cm3. Although the sapphire substrate accounts for most of the weight and volume, the overall footprint remains tiny and negligible.
Figure 2.Optical and SEM images of fabricated binocular meta-lens. (a) Optical image of the binocular meta-lens. (b) The zoomed-in top-view SEM image of the meta-lens. (c) The zoomed-in tilted-view SEM image at the edge of the meta-lens.
For disparity computation, we propose a pyramid stereo-matching neural network (named H-Net) with a novel "H"-shaped attention module (H-Module), as shown in Fig. 3(a). H-Net follows an end-to-end learning framework from stereo input images to disparity map prediction without any other pre- or post-processing. Global context aggregation is vital to derive the disparity information from stereo image pairs. Besides the conventional encoder-decoder architecture and pyramid pooling, H-Net adopts cross-pixel interaction and cross-view interaction to enable the utilization of contextual information and the integration of diverse perspectives (see details in Supplementary information Section 4). Compared with the conventional block matching method43 and two advanced neural networks34,44, H-Net demonstrates significant performance improvements and a more comprehensive analysis (see details in Supplementary information Section 5). With a PSMNet34 backbone, the head of H-Net is a Siamese network45 whose two branches share weights. These head Siamese CNNs utilize residual blocks46 to extract features and weight-sharing spatial pyramid pooling (SPP) modules34 to aggregate context information. The output left and right feature maps from the head backbone (Siamese CNNs) are integrated by the proposed H-Module. The introduction of the H-Module with an attention mechanism47,48 allows the model to dynamically allocate its attention or resources based on the relevance or significance of specific features or contexts. The H-Module is a symmetric processing pipeline composed of four cross-pixel interaction blocks and one cross-view interaction block. Cross-pixel interaction refers to the mutual influence between pixels in an image or visual representation. It involves considering the relationships and dependencies between neighboring pixels to capture contextual information and improve the understanding or analysis of the image. As illustrated in Fig. 3(b), the left and right feature maps are flattened and projected through separate fully connected layers into three essential vectors: Query, Key, and Value. The similarity or correlation between Query and Key is computed using the inner product, yielding weight coefficients for each Key corresponding to its associated Value, known as cross-pixel attention. The Values are then weighted and aggregated based on the attention coefficients to obtain enhanced features. The corresponding attention calculation equation49 is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V$$
Figure 3.Disparity computation with neural network. (a) Architecture overview of proposed neural network H-Net with H-Module. The stereo images are processed by weight-sharing backbones to extract features. These features are then combined using cross-pixel interaction and cross-view interaction in an H-Module. A 4D cost volume is created from the left and right image features, which is then used in a 3D CNN for depth estimation. A disparity regression module is performed before the final disparity map prediction. (b) Detailed pipelines of the cross-pixel interaction. The left and right feature maps are flattened and processed through separate fully connected layers to generate Query, Key, and Value vectors. The inner product is utilized to compute the similarity between Query and Key, resulting in weight coefficients for each Key. These coefficients are used for cross-pixel attention, associating each Key with its corresponding Value. The weighted Values are aggregated to produce enhanced features. (c) Detailed pipelines of the cross-view interaction. The difference from the cross-pixel interaction is the inner product of Key and Query vector comes from different stereo views.
where $Q$ is the Query vector, $K$ is the Key vector, $V$ is the Value vector, $\sqrt{d_k}$ serves as a scale to control the result range, $d_k$ is the dimension of the Query and Key vectors, and $\mathrm{softmax}(\cdot)$ is a normalization function utilized to transform a vector of numerical values into a vector of probability distributions. This transformation ensures that the probability associated with each value is directly proportional to its relative proportion within the original vector.
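As described next, the cross-view interaction differs from the cross-pixel interaction only in where the Key and Value come from, so both can be illustrated with a single module. The following PyTorch fragment is a minimal sketch of this idea; the layer dimensions, the single attention head, and the residual connection are our assumptions rather than the published H-Module configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionBlock(nn.Module):
    """Minimal sketch of the attention used in the H-Module.

    With kv_feat = q_feat it acts as cross-pixel (self) attention; with
    kv_feat taken from the other view it acts as cross-view attention.
    """
    def __init__(self, channels: int, dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(channels, dim)
        self.to_k = nn.Linear(channels, dim)
        self.to_v = nn.Linear(channels, dim)
        self.proj = nn.Linear(dim, channels)

    def forward(self, q_feat, kv_feat):
        # q_feat, kv_feat: (B, C, H, W) feature maps from the backbone.
        b, c, h, w = q_feat.shape
        q_seq = q_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv_seq = kv_feat.flatten(2).transpose(1, 2)
        q = self.to_q(q_seq)                         # Query
        k = self.to_k(kv_seq)                        # Key
        v = self.to_v(kv_seq)                        # Value
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        out = self.proj(attn @ v)                    # enhanced features
        # Residual connection back to the input features (our assumption).
        return out.transpose(1, 2).reshape(b, c, h, w) + q_feat

# Usage: cross-pixel vs cross-view interaction on left/right feature maps.
block = InteractionBlock(channels=32)
left = torch.randn(1, 32, 24, 48)
right = torch.randn(1, 32, 24, 48)
left_self = block(left, left)    # cross-pixel interaction (self-attention)
left_cross = block(left, right)  # cross-view interaction (queries from left)
```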
Cross-view interaction refers to the interaction or integration of information from multiple views or perspectives. In multi-view analysis, cross-view interaction aims to leverage information from different viewpoints or modalities to enhance the overall understanding or interpretation of the scene. Detailed processing steps, similar to those of cross-pixel interaction, are depicted in Fig. 3(c). The difference is that the cross-view attention is calculated from the Query and Key of different features. Specifically, the Query of the left feature map is combined with the Key of the right feature map through the inner product, and vice versa. This interaction involves feature matching and data fusion, allowing the alignment and combination of information from different views. The attention mechanism enhances the model's ability to capture dependencies, focus on relevant information, and leverage contextual relationships within the visual data (see the ablation study details in Supplementary information Section 5.4). The enhanced left and right feature maps from the H-Module are concatenated as a 4D cost volume. Three repeated encoder-decoder architectures are utilized in the 3D CNN module to further understand the contextual information comprehensively. Before the final prediction of the disparity map, a disparity regression50 is performed with a soft attention mechanism. For the disparity map, each final disparity value is obtained by weighting the candidate disparity values by their predicted probabilities. The disparity regression is performed as the equation below

$$\hat{d} = \sum_{d=0}^{D_{\max}} d \times \mathrm{softmax}(-c_d)$$
where $\hat{d}$ is the final predicted disparity, $c_d$ is the corresponding cost from the cost volume for each candidate disparity $d$, $d$ is the index associated with each disparity value in the disparity map, $D_{\max}$ is the maximum value of $d$ within the range of annotations, and the $\mathrm{softmax}(\cdot)$ function is discussed in Eq. (3). We adopt the smooth L1 loss as the loss function for its fast convergence and robustness to outliers. The final loss is averaged over the N-pixel disparity map, as shown in Eq. (5).

$$L\!\left(d^{gt}, \hat{d}\right) = \frac{1}{N}\sum_{i=1}^{N}\mathrm{smooth}_{L_1}\!\left(d_i^{gt} - \hat{d}_i\right)$$

in which

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^{2}, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

where $d^{gt}$ is the ground-truth disparity map, $\hat{d}$ is the predicted disparity map, $N$ is the number of pixels in the disparity map, $d_i^{gt}$ is the ground-truth disparity for pixel $i$, and $\hat{d}_i$ is the predicted disparity for pixel $i$. H-Net was trained on the stereo vision dataset KITTI 201251. We employed the Adam optimizer. The learning rate was 0.001 for the first 10 epochs and 0.0001 for the rest. The batch size was 3 on an Nvidia GeForce RTX 3090 GPU. After 800 epochs (64,000 iterations) of training, the final model converged with a training loss of approximately 0.3 (see details in Supplementary information Fig. S8).
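A compact way to see how the regression and loss fit together is the following PyTorch sketch. Only the soft-argmin formula and the smooth L1 averaging follow the equations above; the disparity range of 192 and the masking of unlabeled pixels are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def disparity_regression(cost_volume: torch.Tensor) -> torch.Tensor:
    """Soft-argmin disparity regression.

    cost_volume: (B, D, H, W) matching costs for D candidate disparities.
    Returns the expected disparity per pixel: d_hat = sum_d d * softmax(-c_d).
    """
    prob = F.softmax(-cost_volume, dim=1)               # per-pixel probabilities
    disp_values = torch.arange(cost_volume.size(1),
                               device=cost_volume.device,
                               dtype=cost_volume.dtype).view(1, -1, 1, 1)
    return (prob * disp_values).sum(dim=1)              # (B, H, W)

def disparity_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Smooth L1 loss averaged over the valid pixels of the disparity map;
    # masking out unlabeled pixels is customary for KITTI-style ground truth.
    mask = gt > 0
    return F.smooth_l1_loss(pred[mask], gt[mask], reduction="mean")

# Example with random tensors (a disparity range of 192 is an assumed value).
cost = torch.randn(1, 192, 32, 64)
pred = disparity_regression(cost)
loss = disparity_loss(pred, torch.rand(1, 32, 64) * 192)
```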
The depth map is calculated from the predicted disparity map. The depth calculation formula3 is

$$z = \frac{b\,f}{\mu\,(d - x_{0}) + \Delta x}$$

in which

$$x_{0} = c_{x}^{\mathrm{L}} - c_{x}^{\mathrm{R}}$$

where the focal length $f$ is 10 mm, the baseline $b$ is measured as 4.056 mm, the side length $\mu$ of a physical pixel on the CMOS sensor is 3.45 µm, the misalignment $\Delta x$ of the lens and sensor on the x-axis is 0, and the principal point offset $x_{0}$ along the x-axis is calculated as −396.6 pixels from the x coordinate $c_{x}^{\mathrm{L}}$ of the left image center and the x coordinate $c_{x}^{\mathrm{R}}$ of the right image center (see more details in Supplementary information Fig. S3). The edge image is derived from the raw captured stereo image with a Canny edge detector52, which approximates the first derivative of a Gaussian operator. Through lower-bound cut-off suppression and edge tracking by hysteresis, the detected edges are constrained to be one pixel wide and located at the center of the discontinuous area, without false noise edge points. There are no edges in non-textured regions with uniform intensity distribution. These ill-posed regions cause ambiguous depth predictions because of the feature-matching mechanism. Under the guidance of the edge image, these ill-posed regions on the depth map are discarded. The edge-enhanced depth perception is the depth map filtered by a logical conjunction (AND) operation with the edge image. Both the intensity and depth discontinuities are preserved with high fidelity and accuracy.
Results and discussion
The optical performance of the fabricated meta-lens is characterized under 532 nm illumination. The measured intensity profiles of the left and right meta-lenses along the propagation direction are presented in Fig. 4(a). The corresponding measured focal lengths of the left and right meta-lenses are 10.048 mm and 10.046 mm, which match the designed focal length of 10.0 mm. The diameter of a single meta-lens is 2.6 mm, and the meta-lens numerical aperture (NA) is about 0.13. The measured full-width at half-maximum (FWHM) of the focal spots of both meta-lenses along the X- and Y-axes ranges from 2.21 to 2.36 μm, with a measurement resolution of 0.2809 μm per division. Therefore, the averaged FWHM is 2.26 ± 0.14 μm, which is close to that of a diffraction-limited system with an FWHM of 2.1 μm (FWHM = 0.514λ/NA). The modulation transfer function (MTF), the Fourier transform of the point spread function (PSF), was also calculated, which further confirms that the fabricated meta-lens is diffraction-limited (see more details in Supplementary information Fig. S4). The measured focusing efficiency is 73.86% at the working wavelength of 532 nm. The focusing efficiency is calculated by dividing the total light power of the focal point area at the focal plane by the total input light power through the bare substrate surface (the selected area is equal to the size of the meta-lens). Several experiments were performed to characterize the 2.6 mm meta-lens using a commercial measurement system (AR-Meta-P, IDEAOPTICS INC.). The phase profile of the fabricated meta-lens was measured to check the agreement between the calculated and fabricated phase profiles. The detailed experimental setup for meta-lens phase measurement was demonstrated in our previous work53. The simulated and experimental phase distribution maps at the central region of the meta-lens are depicted in Fig. 4(b) and 4(c), respectively, and are in good agreement with each other. The small discrepancies can be attributed to fabrication defects and the spherical aberrations of the measurement system. More comparisons between the theoretical and measured phase profiles are depicted in Supplementary information Fig. S5.
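As a quick sanity check on the quoted numbers, the NA and the diffraction-limited FWHM can be recomputed from the lens geometry. This is a simple verification script, not part of the measurement pipeline.

```python
import math

diameter = 2.6e-3       # meta-lens diameter (m)
focal_length = 10.0e-3  # measured focal length, ~10.0 mm
wavelength = 532e-9     # working wavelength (m)

# Numerical aperture from the lens geometry: NA = sin(arctan(D / (2 f)))
na = math.sin(math.atan(diameter / (2 * focal_length)))
# Diffraction-limited FWHM quoted in the text: FWHM = 0.514 * lambda / NA
fwhm = 0.514 * wavelength / na

print(f"NA ≈ {na:.3f}")               # ≈ 0.129, i.e. about 0.13
print(f"FWHM ≈ {fwhm * 1e6:.2f} µm")  # ≈ 2.12 µm, close to the quoted 2.1 µm
```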
Figure 4.Characterization of the binocular meta-lens. (a) X-Z plane focusing profiles of the left and right meta-lenses at a wavelength of 532 nm. The measured focal lengths of the left and right meta-lenses are 10.048 mm and 10.046 mm, respectively, denoted by yellow dashed lines. (b) Designed phase distribution of the meta-lens. (c) Corresponding measured phase distribution of the meta-lens in (b).
Various imaging and depth sensing experiments are conducted to test the edge-enhanced depth perception performance of our binocular meta-lens. The configuration of the binocular meta-lens camera for imaging is shown in Fig. S7 in the Supplementary information. Figure 5 demonstrates the raw captured images, depth sensing results, edge-enhanced depth maps, and the integration results of raw images and 3D edges. The raw captured image is cropped from the common stereoscopic region of the left image. The proposed H-Net outputs the corresponding disparity map of the stereo images. Through Eq. (5), the depth map is calculated accordingly and illustrated in pseudocolor, as shown in the second column of Fig. 5. The 2D edge images that represent the spatial intensity discontinuity are derived from the raw captured image (first column) with the Canny operator. The 2D edge image is converted into a binary matrix $E$, in which an edge pixel is 1 and a non-edge pixel is 0. The edge-enhanced depth map $D_{E}$ is calculated as the Hadamard product of the depth map $D$ and the binary edge matrix $E$, which is similar to a logical conjunction (AND) operation. The specific calculation equation is

$$D_{E} = D \odot E$$
Figure 5.Edge-enhanced depth perception of various objects. The first column is the raw left image. The second column is the corresponding depth map. The third column is the edge-enhanced depth map. The second and third columns use the same color bar on the right of the third column. The fourth column is the integration image of the raw image and edge-enhanced depth map. (a) Two pieces of transparent plastic paper printed with "RIKEN" and "CITYU" in black letters are placed at 16.0 cm and 12.8 cm, respectively. (b) A piece of sketch paper printed with a tilted three-dimensional building is located at 17.3 cm as the background. The front ends of the two toy cars are approximately 12.9 cm and 15.7 cm, respectively. (c) The two architectural sketches are at 13.5 cm and 16.5 cm, respectively. (d) The background architecture sketch is positioned at 17.3 cm. The depth of a toy car's body ranges from 12.5 cm to 15.5 cm.
The edge-enhanced depth maps are displayed in the third column of Fig. 5 in pseudocolor. The non-edge regions with 0 values are set to black. The integration images in the fourth column of Fig. 5 are obtained by superimposing the edge-enhanced depth map onto the raw image.
The integration images aim to demonstrate the fidelity of edge-enhanced depth perception in spatial intensity and depth discontinuity detection.
Figure 5(a) depicts a scenario with ill-posed regions. Two black letter objects, "RIKEN" and "CITYU," printed on transparent plastic papers, are positioned at 16.0 cm and 12.8 cm, respectively. The letter carrier is transparent plastic paper. The background is a white wall without any texture. The absence of texture makes it difficult to establish reliable correspondences between image points in the left and right views, leading to unreliable or ambiguous depth estimates (see the middle region of the depth map in Fig. 5(a)). Such unreliable and ambiguous depth estimates cause severe trouble for decision-making tasks. In edge-enhanced depth perception, the 3D edge data agree well with the ground truth, with complete preservation of the essential details of the scene. Figure 5(b) demonstrates a multi-object traffic scene with two toy cars located at about 12.9 cm and 15.7 cm. An architectural sketch is placed at 17.3 cm as the background. Figure 5(c) shows two architectural sketches with an illusory 3D appearance positioned at 13.5 cm and 16.5 cm, respectively. With edge-enhanced depth perception, these planar false-3D objects do not deceive the system. Figure 5(d) displays a toy car with a continuous depth change, ranging from 12.5 cm to 15.5 cm. All depth sensing results are correct, demonstrating the accuracy of our H-Net. The edge-enhanced depth results discard all uniform regions and amplify the 3D feature details with high confidence.
Conclusions
Spatial computing has attracted growing attention from both academia and industry, driven by the rising popularity of the metaverse. A portable and accurate imaging and depth sensing system is of vital importance for next-generation virtual reality devices. In this work, we demonstrate an edge-enhanced depth perception system based on a binocular meta-lens, which is intelligent, lightweight, and compact for spatial computing. The miniaturized system contains a binocular meta-lens, a 532 nm filter, and a CMOS sensor. The binocular meta-lens weighs only about 0.115 g and occupies a volume of 0.0288 cm3. The imaging system based on our meta-lens minimizes the discomfort that the weight and volume of wearable devices cause to users. We propose a stereo-matching neural network with a novel H-Module for the disparity computation. The H-Module introduces the attention mechanism into the Siamese network. The symmetric architecture with cross-pixel interaction and cross-view interaction enables a more comprehensive analysis of the contextual information in stereo images. The edge enhancement based on the spatial intensity discontinuity discards the ill-posed regions in the image, where ambiguous depth predictions would be generated due to the lack of texture information. With the support of deep learning, our edge-enhanced system provides a prompt, intelligent response in less than 0.15 seconds. This edge-enhanced depth perception system will facilitate accurate 3D scene modeling and promote the development of machine vision, autonomous driving, and robotics.
Acknowledgements
We are grateful for financial support from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. C5031-22G; CityU11310522; CityU11300123], the Department of Science and Technology of Guangdong Province [Project No. 2020B1515120073], City University of Hong Kong [Project No. 9610628], and JST CREST (Grant No. JPMJCR1904).
XY Liu, T Tanaka, and MK Chen organized the project. XY Liu, BR Leng, JC Zhang, Y Zhou, JL Cheng, and MK Chen conceived the principle, numerical design, and characterization of the metasurface and meta-system. T Yamaguchi and T Tanaka conceived the fabrication of the metasurface. XY Liu and MK Chen built the optical system for measurement and the deep learning model and collected the experimental data for analysis. All authors discussed the results, prepared the manuscript, and commented on it.
The authors declare no competing financial interests.
Supplementary information for this paper is available at https://doi.org/10.29026/oes.2024.230033
[43] MS Hamid, NA Manap, RA Hamzah et al. Stereo matching algorithm based on deep learning: A survey. J King Saud Univ - Comput Inf Sci, 34, 1663-1673(2022).