The research team tested the metalens system across a range of challenging scenarios. With transparent backgrounds, for example, plastic sheets printed with "RIKEN" and "CITYU" and placed at 16.0 cm and 12.8 cm were accurately distinguished. In a scene containing a 3D building sketch at 17.3 cm and toy cars at 12.9 cm and 15.7 cm, the system captured the depth differences precisely (Fig. 2). Further tests included architectural sketches at 13.5 cm and 16.5 cm, as well as a toy car whose body spanned depths from 12.5 cm to 15.5 cm. This ability to resolve depth variations, combined with a processing time of under 0.15 seconds, highlights the system's potential for real-time applications such as AR, robotics, autonomous navigation, and other time-sensitive uses.
The researchers developed H-Net, a stereo-matching neural network that learns end to end, from stereo input images to the predicted disparity map. This design addresses the limitations of traditional stereo-matching algorithms, which frequently fail in ambiguous settings such as flat, textureless surfaces or reflective areas, where accurate depth data are difficult to recover. To overcome these difficulties, H-Net's architecture incorporates a novel "H-Module" with cross-pixel and cross-view interaction mechanisms that aggregate contextual information and fuse the two perspectives effectively. These interactions allow the network to dynamically prioritize contextual cues from the stereo views, enhancing edge detection within the images.
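The published description does not spell out H-Net layer by layer, so the following PyTorch-style sketch is only a conceptual illustration of the two interaction mechanisms: cross-pixel interaction is rendered here as self-attention within each view, and cross-view interaction as attention between the views. The class name, channel width, and use of multi-head attention are assumptions for illustration, not the authors' implementation.

```python
# Conceptual sketch of cross-pixel and cross-view interaction for stereo
# features (assumed structure, not the authors' code). Inputs are left/right
# feature maps of shape (B, C, H, W).
import torch
import torch.nn as nn

class HModuleSketch(nn.Module):
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.cross_pixel = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cross_view = nn.MultiheadAttention(channels, heads, batch_first=True)

    @staticmethod
    def _tokens(x):                              # (B, C, H, W) -> (B, H*W, C)
        return x.flatten(2).transpose(1, 2)

    def forward(self, left, right):
        b, c, h, w = left.shape
        l, r = self._tokens(left), self._tokens(right)
        # Cross-pixel interaction: aggregate context within each view.
        l = l + self.cross_pixel(l, l, l, need_weights=False)[0]
        r = r + self.cross_pixel(r, r, r, need_weights=False)[0]
        # Cross-view interaction: let each view attend to the other.
        l2 = l + self.cross_view(l, r, r, need_weights=False)[0]
        r2 = r + self.cross_view(r, l, l, need_weights=False)[0]
        unflatten = lambda t: t.transpose(1, 2).reshape(b, c, h, w)
        return unflatten(l2), unflatten(r2)
```

A full network of this kind would wrap such a block around a shared feature extractor before the cost-volume and disparity-regression stages sketched later in this article.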

The fields of spatial computing [1] and augmented reality (AR) are on the cusp of a transformative evolution, as researchers and engineers develop technologies that seamlessly blend the virtual and physical worlds. To make this vision a reality, devices must perceive depth accurately so that virtual objects integrate naturally into real-world environments. Established approaches to depth perception include stereo cameras [2], structured light [3], and time-of-flight systems [4]. Among these, stereo cameras, which recover depth by triangulating the disparity between two views, are widely used because they require no active light source and therefore consume less power. However, the traditional lenses they rely on tend to be thick and heavy, a significant drawback for compact and wearable devices.
The device combines a binocular metalens, a 532 nm filter, and a CMOS sensor to function like human binocular vision (Fig. 1). The metalens consists of an array of nanopillars fabricated from gallium nitride (GaN) on a sapphire substrate. Unlike conventional lenses, which often sacrifice form for function, these metalenses are exceptionally thin, lightweight, and efficient. Each metalens has a diameter of 2.6 mm, a volume of 4.25×10⁻⁶ cm³, and a weight of 2.61×10⁻⁵ g, less than one percent of the weight of a typical human hair. This allows the device to capture depth data with high accuracy and clarity within a design optimized for portability and comfort.
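As a quick consistency check on those figures, treating the lens as a uniform thin disc (a simplifying assumption; the real optic is a nanopillar array on a sapphire substrate) implies a sub-micron effective thickness and a density close to that of bulk GaN:

```python
# Back-of-the-envelope check of the quoted metalens figures, treating the lens
# as a uniform thin disc. This is only a consistency check, not the fabricated
# geometry, which is an array of GaN nanopillars on sapphire.
import math

diameter_cm = 0.26            # 2.6 mm aperture
volume_cm3  = 4.25e-6         # quoted volume
mass_g      = 2.61e-5         # quoted weight

area_cm2 = math.pi * (diameter_cm / 2) ** 2
thickness_nm = volume_cm3 / area_cm2 * 1e7       # cm -> nm
density_g_cm3 = mass_g / volume_cm3

print(f"effective thickness ≈ {thickness_nm:.0f} nm")    # ≈ 800 nm
print(f"implied density ≈ {density_g_cm3:.2f} g/cm^3")   # ≈ 6.1, close to bulk GaN (~6.15)
```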

Figure 2. Edge-enhanced depth perception of various objects. The first image shows the original left view; the second shows the corresponding depth map. The third image presents the edge-enhanced depth map, using the shared color scale displayed to its right. The fourth image overlays the edge-enhanced depth map on the raw image to provide a combined view.

Figure 1. Schematic of the edge-enhanced spatial computing system with the binocular metalens.
The H-Module therefore filters out unreliable depth estimates from non-textured regions, yielding a more precise and stable depth map. This edge-aware approach to depth perception, built on cross-pixel and cross-view interaction, minimizes errors in depth prediction. Traditional depth-sensing algorithms often struggle in low-texture regions, which lack sufficient feature points for depth mapping. With the H-Module in H-Net, contextual information is preserved and processed through a 4D cost volume built from the left and right image features; the estimate is then refined by a 3D CNN and passed to a disparity regression module that produces the final disparity map. As a result, the system consistently delivers accurate depth perception even in complex environments, preserving the edges that mark depth discontinuities and yielding a more realistic 3D model of the scene.
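The cost-volume, 3D CNN, and disparity-regression stages described above follow a pattern common to end-to-end stereo networks. The sketch below shows a minimal, assumed version of that pipeline: a concatenation-based 4D cost volume, a small 3D convolutional aggregator, and soft-argmin disparity regression. Channel counts and the maximum disparity are illustrative, not values from H-Net.

```python
# Minimal sketch (assumed, simplified) of the cost-volume and regression stages.
# feat_l, feat_r are left/right feature maps of shape (B, C, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_cost_volume(feat_l, feat_r, max_disp):
    """Concatenate left features with right features shifted by each candidate
    disparity, giving a 4D cost volume of shape (B, 2C, max_disp, H, W)."""
    b, c, h, w = feat_l.shape
    vol = feat_l.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        vol[:, :c, d, :, d:] = feat_l[:, :, :, d:]
        vol[:, c:, d, :, d:] = feat_r[:, :, :, :w - d]   # right view shifted by d pixels
    return vol

class DisparityHead(nn.Module):
    def __init__(self, channels=32, max_disp=64):
        super().__init__()
        self.max_disp = max_disp
        self.agg = nn.Sequential(                        # small 3D CNN aggregator
            nn.Conv3d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, 3, padding=1),
        )

    def forward(self, feat_l, feat_r):
        cost = self.agg(build_cost_volume(feat_l, feat_r, self.max_disp))  # (B, 1, D, H, W)
        prob = F.softmax(-cost.squeeze(1), dim=1)        # distribution over candidate disparities
        disps = torch.arange(self.max_disp, device=prob.device).view(1, -1, 1, 1)
        return (prob * disps).sum(dim=1)                 # soft-argmin disparity map (B, H, W)
```

The predicted disparity map would then be converted to metric depth using the calibrated geometry of the binocular optics.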
Metasurfaces have emerged as a promising alternative to conventional optical components [5-8]. These ultra-thin, lightweight surfaces manipulate light at the subwavelength scale, enabling optical functionalities that were previously unattainable with traditional lenses. In depth sensing, metasurfaces have been applied effectively to point spread function (PSF) engineering [9] and structured light [10,11], showing great potential for compact and efficient depth perception systems. As demand for lightweight, compact depth cameras grows, research on metasurface-based depth perception has accelerated. In a recent work published in Opto-Electronic Science [12], X. Liu et al. introduced a groundbreaking binocular metalens-based depth perception system. This compact, lightweight solution promises to enhance next-generation wearable devices, bringing us closer to a more immersive and practical spatial computing experience.
The fusion of spatial computing and meta-optics marks the start of a new era, enabling devices such as AR glasses and drones to seamlessly integrate digital data with the physical world. This compact, efficient system lays the foundation for immersive, practical devices and helps bring spatial computing into daily life. Innovations like the binocular metalens system are paving the way for real-time, high-quality 3D modeling across fields as diverse as robotics, healthcare, and virtual reality, an essential step toward a more integrated technological future.