Seamlessly integrating cameras into transparent displays is a foundational requirement for advancing augmented reality (AR) applications. However, existing see-through camera designs, such as the LightguideCam, often trade image quality for compactness, producing significant, spatially variant artifacts from lightguide reflections. To resolve this, we propose a physics-informed neural network framework for high-fidelity image reconstruction. The proposed method obviates the need for slow iterative algorithms, achieving a nearly 4000-fold speedup in computation. Critically, this acceleration is accompanied by a substantial quality gain, demonstrated by an approximately 8 dB improvement in peak signal-to-noise ratio. By efficiently correcting complex optical artifacts in real time, our work enables the practical deployment of the LightguideCam for demanding AR tasks, including eye-gaze tracking and user-perspective imaging.
Augmented reality (AR) is revolutionizing areas such as education, entertainment, and medical services by merging real-world and computer-generated content[1–7]. As the primary sensor for environment perception, cameras are a critical component in AR devices[8–11]. However, their inherent opacity conflicts with the see-through nature of AR displays. To address this, several transparent camera technologies have been developed. These can be broadly categorized into two paths: one using inherently semi-transparent photodetectors, for example, by stacking graphene light-sensing layers on high-transmission substrates[12], and the other employing conventional opaque photodetectors whose peripheral circuits are concealed using beam-splitting elements[13] or light modulators[14]. Another promising direction is computational imaging, which leverages hardware like luminescent concentrator films[15], windows with roughened edges[16], and holographic optical elements[17–22]. Despite these innovations, a common bottleneck remains: the imaging resolution is often constrained by low-density photodetector arrays or the necessity for computationally intensive reconstruction algorithms.
A recent hardware platform, LightguideCam, offers a compact and flexible solution for a high-resolution see-through computational camera by guiding a portion of the incoming light to an image sensor, forming a blurred image[23]. However, reconstructing a clear image from this blurred measurement using traditional model-based iterative algorithms like the fast iterative shrinkage-thresholding algorithm (FISTA)[24–26] is computationally demanding and too slow for real-time applications. This challenge has motivated a shift toward data-driven methods, as deep learning has become a powerful tool for solving such inverse problems in computational imaging[27–30]. Once a neural network is trained, it can perform image reconstruction at very high speeds[31,32]. The field has rapidly evolved from early architectures like U-Net[33,34] to more efficient physics-informed neural networks[35,36], which combine the advantages of data-driven and model-based approaches to improve performance and data efficiency[37–41].
In this work, we address the reconstruction speed and quality challenge for the LightguideCam. We employ a physics-informed deep learning framework that replaces the slow model-based FISTA algorithm. This new approach achieves a dramatic increase in reconstruction speed and a significant improvement in image quality. The simple design of the LightguideCam, combined with the high-performance reconstruction network, offers a compelling solution for seamless integration into AR display devices, with great potential for applications such as eye-gaze tracking and eye-position-perspective photography.
2. Methods
2.1. Principle of LightguideCam
The core of the see-through camera system is a lightguide that enables simultaneous viewing of a scene and imaging of it, as illustrated in Fig. 1. The system can be described by a forward model that accounts for the spatially varying point spread function (PSF) introduced by the camera optics [Fig. 1(a)]. For this model, the object image is treated as a 2D array of pixels, $o(x, y)$. Each pixel has a corresponding PSF, $h(x', y'; x, y)$, that describes its projection onto the camera sensor, where $(x', y')$ are the image space coordinates. Assuming incoherence between object pixels, a measurement $b(x', y')$ can be expressed as a linear combination of the PSFs from all object pixels:
$$b(x', y') = \sum_{x, y} h(x', y'; x, y)\, o(x, y), \qquad \text{or, in matrix form,} \qquad b = H o.$$
Here, the matrix $H$ represents the system’s complete response, mapping the object $o$ to the measurement $b$. The inverse problem, which is the central challenge, is to recover $o$ given $b$ and an estimate of $H$ [Fig. 1(b)]. In the previously published work on LightguideCam[23], the iterative FISTA algorithm was applied to solve this problem[24]. However, reconstructing a single image required several minutes, which is far too slow for real-time applications. To overcome this limitation, we employ a deep-learning-based approach for high-speed, high-quality reconstruction.
Figure 1. Principle of the LightguideCam computational imaging system. (a) Optical path for the LightguideCam. Light from the object is split by the lightguide. Some light passes through directly to the user’s eye, while the rest is guided to an image sensor, forming a blurred image due to the system’s spatially varying point spread function (PSF). The deep neural network (DNN) reconstructs the original object from this sensor measurement. (b) General schematic of a computational imaging system, where a DNN is trained to solve the inverse problem of recovering a clean object image from a measurement with artifacts.
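To make the forward model concrete, the following minimal sketch simulates a spatially varying blur by tiling the field of view into a few regions, each with its own PSF, and blending their contributions with smooth weight maps. The number of regions, the synthetic PSFs, and the weight maps are illustrative assumptions rather than the calibrated system response of the LightguideCam.

```python
# Minimal sketch of the forward model b = Ho under a patch-wise approximation:
# the field of view is split into K regions whose PSFs are blended by smooth
# weight maps that sum to 1 at every pixel (all names here are illustrative).
import numpy as np
from scipy.signal import fftconvolve

def forward_model(obj, psfs, weights):
    """obj: (H, W) object; psfs: list of K (h, w) PSFs;
    weights: (K, H, W) blending maps that sum to 1 at every pixel."""
    measurement = np.zeros_like(obj, dtype=np.float64)
    for psf, w in zip(psfs, weights):
        # Each region contributes its locally valid shift-invariant blur.
        measurement += fftconvolve(w * obj, psf / psf.sum(), mode="same")
    return measurement

# Example: a bright square blurred by a mild PSF on the left and a strong
# PSF on the right, blended with a linear left-to-right ramp.
H, W = 256, 256
obj = np.zeros((H, W)); obj[100:156, 100:156] = 1.0
psf_a = np.outer(np.hanning(9), np.hanning(9))    # mild blur
psf_b = np.outer(np.hanning(31), np.hanning(31))  # strong blur
ramp = np.tile(np.linspace(0.0, 1.0, W), (H, 1))
b = forward_model(obj, [psf_a, psf_b], np.stack([1.0 - ramp, ramp]))
```

In the actual system the columns of H are the calibrated PSFs themselves; the patch-wise blending above is only one common way to approximate such a shift-variant operator at low cost.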
Training a deep neural network (DNN) requires a large dataset of paired measurements and ground truth images. The experimental setup used for this data acquisition is shown in Fig. 2(a). A high-resolution OLED display is placed at a working distance of 70 cm from the LightguideCam. A partially reflective mirror array lightguide is employed in the system, equipped with a concave mirror for focusing. A cooling fan is used to maintain a stable sensor temperature, preventing an increase in thermal noise during the lengthy data capture process. As outlined in Fig. 2(b), images are sequentially loaded onto the display to serve as the ground truth, and the corresponding distorted images are captured by the camera’s image sensor. For the training dataset, we used approximately 6000 images sourced from the DIV2K[42] and Flickr2K[43] datasets, with some images augmented by rotation. The captured data were randomly split, with 90% used for training and 10% for validation and testing. While it is possible to simulate training data using the forward model[23], using experimentally captured data ensures that the network learns to invert the true physical properties of the system, leading to superior reconstruction quality.
Figure 2. Experimental setup and data acquisition. (a) Photograph of the experimental setup. An OLED display, placed 70 cm from the LightguideCam, projects ground truth images. The LightguideCam consists of a lightguide and an image sensor. A cooling fan is used to stabilize the sensor temperature and reduce thermal noise. (b) Schematic of the dataset collection process. Ground truth images are loaded onto the display, and the corresponding distorted images are captured by the image sensor, forming the paired dataset for network training.
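As a rough sketch of the data handling described above, the snippet below pairs each captured measurement with the corresponding displayed ground-truth image and performs the random 90/10 split. The folder layout, file naming, and PNG format are assumptions made for illustration, not the authors' actual storage scheme.

```python
# Hypothetical paired dataset: measurements and ground-truth images are stored
# under identical file names in two separate folders.
import glob, os
import torch
from torch.utils.data import Dataset, random_split
from torchvision.io import read_image

class LightguidePairs(Dataset):
    def __init__(self, meas_dir, gt_dir):
        self.meas_paths = sorted(glob.glob(os.path.join(meas_dir, "*.png")))
        self.gt_dir = gt_dir

    def __len__(self):
        return len(self.meas_paths)

    def __getitem__(self, idx):
        meas_path = self.meas_paths[idx]
        gt_path = os.path.join(self.gt_dir, os.path.basename(meas_path))
        meas = read_image(meas_path).float() / 255.0  # (C, H, W), scaled to [0, 1]
        gt = read_image(gt_path).float() / 255.0
        return meas, gt

pairs = LightguidePairs("data/measurements", "data/ground_truth")
n_train = int(0.9 * len(pairs))                       # 90% for training
train_set, val_set = random_split(pairs, [n_train, len(pairs) - n_train])
```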
For high-speed reconstruction, we use a physics-informed neural network, the Multi-Wiener Net (M. W. Net), based on a multi-Wiener deconvolution approach[36], whose architecture is detailed in Fig. 3(a). The network consists of two main stages. First, a deconvolution layer processes the input measurement to generate multiple intermediate solutions. This layer contains a bank of parallel Wiener filters. For each filter $k$, the intermediate reconstruction in the Fourier domain, $\hat{O}_k(u, v)$, is calculated as
$$\hat{O}_k(u, v) = \frac{\hat{H}_k^{*}(u, v)\, \hat{B}(u, v)}{|\hat{H}_k(u, v)|^{2} + \lambda_k},$$
where $(u, v)$ are the frequency space coordinates, $\hat{B}(u, v)$ is the Fourier transform of the measurement, $\hat{H}_k(u, v)$ is the Fourier transform of the $k$th PSF, and $\lambda_k$ is a corresponding regularization parameter. Crucially, while the PSFs are initialized by capturing the system’s response to point sources across the field of view (FoV) [Fig. 3(b), left], both the PSFs ($h_k$) and the regularization terms ($\lambda_k$) are learnable parameters. They are updated during training, allowing the network to refine the physical model and better approximate the complex, shift-variant nature of the optical system, as seen in the transition from initial to trained PSFs in Fig. 3(b).
Figure 3. Reconstruction network architecture and training analysis. (a) The architecture of the physics-informed Multi-Wiener Net. The input measurement is transformed into the frequency domain and processed by a bank of parallel Wiener filters, which use learnable PSFs ($h_k$) and regularization parameters ($\lambda_k$). The resulting intermediate reconstructions are transformed back to the image domain and fed into a U-Net, which outputs the final reconstruction $\hat{o}$. The network is trained end-to-end by minimizing the mean squared error (MSE) between the reconstruction and the ground truth. (b) Visualization of the network’s deconvolution stage. From left to right: the initial calibrated PSFs, the PSFs after training, and the corresponding intermediate reconstructions. (c) Training and validation loss curves for the R, G, and B color channels, showing convergence over 60 epochs.
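The learnable Wiener-filter bank defined by the equation above can be sketched in PyTorch as follows: each filter k applies conj(Ĥ_k)·B̂/(|Ĥ_k|² + λ_k) in the Fourier domain, with both the PSFs and the regularization parameters registered as trainable tensors. The tensor shapes, the non-negativity handling of λ_k, and the final fftshift are implementation assumptions, not the authors' exact code.

```python
# Sketch of a multi-Wiener deconvolution layer with learnable PSFs h_k and
# regularization parameters lambda_k, initialized from calibrated PSF captures.
import torch
import torch.nn as nn
import torch.fft as fft

class MultiWienerDeconv(nn.Module):
    def __init__(self, init_psfs, init_reg=1e-2):
        """init_psfs: (K, H, W) tensor of calibrated, centered PSFs."""
        super().__init__()
        self.psfs = nn.Parameter(init_psfs.clone())                 # learnable h_k
        k = init_psfs.shape[0]
        self.reg = nn.Parameter(torch.full((k,), float(init_reg)))  # learnable lambda_k

    def forward(self, b):
        """b: (N, 1, H, W) measurement -> (N, K, H, W) intermediate images."""
        B = fft.rfft2(b)                                  # (N, 1, H, W//2 + 1)
        Hk = fft.rfft2(self.psfs).unsqueeze(0)            # (1, K, H, W//2 + 1)
        lam = self.reg.abs().view(1, -1, 1, 1)            # keep lambda_k non-negative
        Xk = torch.conj(Hk) * B / (Hk.abs() ** 2 + lam)   # Wiener filtering per k
        x = fft.irfft2(Xk, s=b.shape[-2:])                # back to the image domain
        return fft.fftshift(x, dim=(-2, -1))              # undo the circular shift of centered PSFs
```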
In the second stage, the intermediate reconstructions are transformed back to the spatial domain via an inverse Fourier transform and are then concatenated and passed to a convolutional neural network (CNN). This CNN, based on the U-Net architecture[33], is modified to accept multiple input images and regress to a single output image. Its function is to intelligently fuse the information from all intermediate images, combining regions of high fidelity from each and suppressing artifacts to produce the final clean reconstruction.
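Building on the previous sketch, the two stages compose as shown below: the K intermediate reconstructions are stacked along the channel axis and fused by a U-Net that regresses a single output image. `UNet` stands for any standard U-Net implementation with configurable input and output channel counts; its interface is an assumption.

```python
# Sketch of the full two-stage reconstruction network: physics-informed
# Wiener deconvolution followed by a U-Net that fuses the intermediate images.
import torch.nn as nn

class MultiWienerNet(nn.Module):
    def __init__(self, init_psfs, unet_factory):
        super().__init__()
        k = init_psfs.shape[0]
        self.deconv = MultiWienerDeconv(init_psfs)               # physics-informed stage
        self.fuse = unet_factory(in_channels=k, out_channels=1)  # data-driven fusion stage

    def forward(self, b):
        x = self.deconv(b)   # (N, K, H, W) intermediate reconstructions
        return self.fuse(x)  # (N, 1, H, W) final reconstruction

# Example wiring (UNet is a placeholder for any standard implementation):
# model = MultiWienerNet(calibrated_psfs, unet_factory=UNet)
```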
The specialized architecture of the M. W. Net is the direct reason for its superior performance. Unlike purely data-driven “black box” models, it incorporates a physics-informed deconvolution layer in which each Wiener filter learns to invert the blur for a specific region of the image. The subsequent U-Net is then tasked with the simpler, better-defined problem of fusing these intermediate, high-fidelity reconstructions and suppressing noise. This structured approach, which explicitly models the system’s spatially variant physics, is the key to achieving superior reconstruction fidelity, particularly at the edges of the FoV where traditional methods often fail. To assess the performance gain from including the physics priors in the M. W. Net, a standard U-Net was trained on the same dataset with the same hyperparameters for comparison. On the test dataset evaluated in Fig. 4, its mean peak signal-to-noise ratio (PSNR) was 2 dB lower than that of the M. W. Net.
Figure 4. Quantitative and qualitative comparison of reconstruction results. (a)–(c) Violin plots comparing the distribution of peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and correlation coefficient (CC) across the test dataset for the raw measurement, FISTA reconstruction, and our M. W. Net reconstruction. The red bars indicate the mean values. (d) A representative visual comparison, showing from left to right: the raw blurred measurement, the FISTA reconstruction, the M. W. Net reconstruction, and the ground truth image. (e) Corresponding residue images, calculated as the absolute difference between each reconstruction and the ground truth. A darker image signifies a smaller error. (f) Pixel-wise SSIM maps. Warmer colors (yellow/red) indicate higher structural similarity to the ground truth, while cooler colors (blue) indicate lower structural similarity.
The entire network is trained end-to-end using the Adam optimizer to minimize the mean squared error (MSE) between the final reconstruction and the ground truth image. We trained a separate network for each color channel. For an image size of 2048 × 2048 pixels, each network has 51 million learnable parameters. Training was performed on an NVIDIA RTX A6000 GPU and took 31 to 56 h for 30 to 60 epochs. The loss curves in Fig. 3(c) demonstrate successful convergence for all three color channels. To prevent overfitting, training was halted as soon as the validation loss had not improved for 10 consecutive epochs. Together with the diverse training dataset, this promotes good generalization of the network.
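The training procedure can be sketched as below: each per-channel model is optimized with Adam on an MSE loss, and training stops once the validation loss has not improved for 10 consecutive epochs. The batch size and learning rate are illustrative assumptions not stated in the text.

```python
# Sketch of the per-channel training loop with Adam, MSE loss, and early stopping.
import copy
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, train_set, val_set, device="cuda", max_epochs=60, patience=10):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed learning rate
    train_loader = DataLoader(train_set, batch_size=1, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=1)
    best_val, best_state, stale = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for meas, gt in train_loader:
            meas, gt = meas.to(device), gt.to(device)
            loss = F.mse_loss(model(meas), gt)
            opt.zero_grad()
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val = sum(F.mse_loss(model(m.to(device)), g.to(device)).item()
                      for m, g in val_loader) / len(val_loader)

        if val < best_val:                                 # keep the best checkpoint
            best_val, best_state, stale = val, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:                          # early stopping
                break

    model.load_state_dict(best_state)
    return model
```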
3. Results
To evaluate the performance of our proposed method, the trained M. W. Net was benchmarked against the raw sensor measurement and the previously used iterative FISTA algorithm[23]. The comparison was conducted on a test dataset not seen during training, with the results summarized in Fig. 4 and Table 1.
Table 1. Comparison of Reconstruction Approaches for LightguideCam Test Data.

Metric      | Measurement   | FISTA         | M. W. Net
PSNR [dB]   | 12.23 ± 1.40  | 12.63 ± 1.30  | 20.32 ± 1.40
SSIM        | 0.47 ± 0.11   | 0.42 ± 0.12   | 0.68 ± 0.13
CC          | 0.79 ± 0.09   | 0.72 ± 0.08   | 0.92 ± 0.04
Time [s]    | –             | 1067 ± 42     | 0.273 ± 0.004
The quantitative metrics, illustrated in the violin plots in Figs. 4(a)–4(c), show a dramatic improvement in image quality. As detailed in Table 1, the M. W. Net achieves a mean PSNR of 20.32 dB, a significant increase of approximately 7.7 dB over FISTA (12.63 dB) [Fig. 4(a)]. Similarly, Fig. 4(b) shows the mean structural similarity index measure (SSIM) score jumps to 0.68 for our method, compared to 0.42 for FISTA. This high degree of structural fidelity is further confirmed by the Pearson’s correlation coefficient (CC) shown in Fig. 4(c), which increases to a mean of 0.92. These quantitative gains are reflected visually in the example reconstruction in Fig. 4(d). Our network’s output is significantly sharper and more detailed than the FISTA reconstruction, appearing much closer to the ground truth. The superior accuracy is explicitly visualized by the residue images in Fig. 4(e), where the substantially darker result for the M. W. Net indicates a much smaller reconstruction error. Furthermore, the pixel-wise SSIM map in Fig. 4(f) reveals that the M. W. Net provides uniformly high reconstruction quality across the entire FoV, whereas the FISTA method exhibits noticeable degradation in quality, particularly toward the edges.
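For reference, the three reported metrics can be computed for one image pair as sketched below, using scikit-image for PSNR and SSIM and NumPy for Pearson's correlation coefficient. The sketch assumes single-channel float images scaled to [0, 1], matching the per-channel networks; the tabulated values are then means over the test set.

```python
# Per-image-pair computation of PSNR, SSIM, and Pearson's correlation coefficient.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(recon, gt):
    """recon, gt: 2D float arrays (single color channel) scaled to [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, recon, data_range=1.0)
    ssim = structural_similarity(gt, recon, data_range=1.0)
    cc = np.corrcoef(recon.ravel(), gt.ravel())[0, 1]     # Pearson's CC
    return psnr, ssim, cc
```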
Beyond image quality, the most transformative advantage of the M. W. Net is the reconstruction speed. As shown in Table 1, the iterative FISTA algorithm required an average of 1067 s (nearly 18 min) to reconstruct a single image. In stark contrast, our neural network completes the task in just 0.273 s. This constitutes an acceleration by a factor of almost 4000, enabling real-time applications with imaging rates of nearly 4 frames per second.
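Per-image reconstruction times like those in Table 1 can be measured with a simple GPU benchmark such as the sketch below; the explicit torch.cuda.synchronize() calls ensure that asynchronous GPU work is included in the timing. The input shape, the dummy input, and the number of runs are assumptions for illustration.

```python
# Sketch of a single-image inference benchmark on a CUDA device.
import time
import torch

@torch.no_grad()
def time_inference(model, shape=(1, 1, 2048, 2048), device="cuda", runs=10):
    model = model.to(device).eval()
    b = torch.rand(shape, device=device)   # dummy measurement of the assumed size
    model(b)                               # warm-up run
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(runs):
        model(b)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / runs   # average seconds per image
```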
To test the network’s performance on three-dimensional scenes, we captured images of objects placed at varying depths, as shown in Fig. 5. The scenes include playing cards at discrete depths and a continuous object with depth variation. Although the network was trained on a dataset captured at a single plane (70 cm), it sharply resolves objects located at or near this training depth even though these scenes were never part of the training data, demonstrating its ability to generalize to unseen content. Objects at other depths remain blurry, demonstrating the depth-selective nature of the current network. This highlights a potential avenue for future work: extending the network by capturing PSFs from multiple depth planes to initialize additional learnable Wiener filters and using a 3D U-Net to combine the intermediate outputs could enable digital refocusing across different depths[36].
Figure 5. Reconstruction of 3D scenes with varying depths. The top row shows the raw captured measurement, and the bottom row shows the corresponding reconstruction from our network. The network was trained only on images from a single depth plane (70 cm). Consequently, it sharply reconstructs objects at that depth, while objects at other discrete depths or parts of a continuous object outside the focal plane remain blurred.
4. Discussion and Conclusion
This work presents a physics-informed deep learning framework that enables high-speed and high-fidelity image reconstruction for the LightguideCam see-through camera. Our approach utilizes an M. W. Net that integrates a learnable, spatially variant deconvolution layer with a U-Net for fusion and denoising to efficiently solve the complex inverse problem. This method yields a transformative, nearly 4000-fold increase in reconstruction speed while simultaneously providing a significant quality enhancement of almost 8 dB in PSNR compared to previous iterative algorithms. This leap in performance unlocks the potential for real-time, interactive applications for see-through computational cameras.
The practical implementation of this deep learning approach involves several considerations for which clear optimization paths exist. The current need for a large training dataset, for instance, can be significantly mitigated through advanced data augmentation strategies or by leveraging generative models to synthesize diverse training scenarios from a limited number of physical measurements. Similarly, while the current model is specialized for a specific hardware configuration, future work can enhance its flexibility. By employing transfer learning or meta-learning techniques, the network could be rapidly adapted to new optical systems with minimal recalibration data, greatly reducing the retraining overhead. Finally, the present reliance on high-performance GPUs can be addressed through model compression techniques such as quantization and pruning, paving the way for efficient deployment on resource-constrained edge devices and mobile platforms.
Looking ahead, the fusion of advanced computational algorithms with novel optical hardware opens up a vast design space. This framework could be extended beyond 3D imaging to perform more complex real-time tasks, such as glare suppression or hyperspectral analysis, by tailoring the network to different physical priors. These advancements will accelerate the development of imperceptible smart cameras, seamlessly integrating AR into our daily lives.
Acknowledgments
This work was supported by the German Research Foundation (No. CZ 55/48-1), the National Natural Science Foundation of China (No. 62235009), and the National Key Research and Development Program of China (No. 2021YFB2802000).
[5] B. C. Kress, M. Pace. Holographic optics in planar optical systems for next generation small form factor mixed reality headsets. Light: Adv. Manuf., 3, 771(2022).
[28] N. Koukourakis et al. Investigation of human organoid retina with digital holographic transmission matrix measurements. Light: Adv. Manuf., 3, 211(2022).
[29] C. Bilsing et al. Adaptive-optical 3D microscopy for microfluidic multiphase flows. Light: Adv. Manuf., 5, 385(2024).
[31] R. Kuschmierz et al. Ultra-thin 3D lensless fiber endoscopy using diffractive optical elements and deep neural networks. Light: Adv. Manuf., 2, 415(2021).
[32] J. Lich et al. Single-shot 3D incoherent imaging with diffuser endoscopy. Light: Adv. Manuf., 5, 1(2024).
[33] O. Ronneberger, P. Fischer, T. Brox. U-Net: convolutional networks for biomedical image segmentation. Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 234(2015).
[42] E. Agustsson, R. Timofte. NTIRE 2017 challenge on single image super-resolution: dataset and study. Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)(2017).
[43] B. Lim et al. Enhanced deep residual networks for single image super-resolution. Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)(2017).