The rapid advancement of computer-generated holography has bridged deep learning with traditional optical principles in recent years. However, a critical challenge in this evolution is the efficient and accurate conversion from the amplitude to phase domain for high-quality phase-only hologram (POH) generation. Existing computational models often struggle to address the inherent complexities of optical phenomena, compromising the conversion process. In this study, we present the cross-domain fusion network (CDFN), an architecture designed to tackle the complexities involved in POH generation. The CDFN employs a multi-stage (MS) mechanism to progressively learn the translation from amplitude to phase domain, complemented by the deep supervision (DS) strategy of middle features to enhance task-relevant feature learning from the initial stages. Additionally, we propose an infinity phase mapper (IPM), a phase-mapping function that circumvents the limitations of conventional activation functions and encapsulates the physical essence of holography. Through simulations, our proposed method successfully reconstructs high-quality 2K color images from the DIV2K dataset, achieving an average PSNR of 31.68 dB and SSIM of 0.944. Furthermore, we realize high-quality color image reconstruction in optical experiments. The experimental results highlight the computational intelligence and optical fidelity achieved by our proposed physics-aware cross-domain fusion.
1. INTRODUCTION
Holographic technology, with its unique ability to reconstruct images, stands at the forefront of advancing optical displays, offering realism and immersive qualities [1–10]. The quest for dynamic, high-quality computer-generated holography (CGH) has long been driven by research in both optical engineering and computational science [11–15]. Despite remarkable progress [16,17], the challenge of efficiently generating phase-only holograms (POHs) remains, primarily due to the intricate nature of modulating light with spatial light modulators (SLMs) to achieve the desired visual effects without compromising quality or computational demand [17].
To date, typical methods for generating POHs can be divided into iterative and non-iterative ones. Iterative methods, including the Gerchberg–Saxton (GS) algorithm [18–20], non-convex optimization algorithms [21,22], and the stochastic gradient descent (SGD) method [23,24], facilitate accurate hologram generation but fall short in efficiency, hindering real-time applications. In contrast, non-iterative methods such as double phase-amplitude coding (DPAC) and its variants [22,25–28] enable quicker hologram generation in fewer steps, but this comes at the cost of a significant decline in quality. A compromise between fidelity and efficiency has therefore been unavoidable.
Recent advancements in computational techniques, particularly deep learning, have provided promising solutions to longstanding problems in CGH [29–35]. These approaches, leveraging the vast computational power and flexibility of neural networks, have shown potential in transcending traditional barriers, enabling more control over the hologram generation process. By automating the intricate phase modulation required for POHs, deep learning models have facilitated significant strides towards achieving real-time, high-fidelity holographic displays.
During the exploration of combining neural networks with CGH, the UNet architecture [36] has emerged as the processing backbone in most cases due to its effective feature-extraction capabilities and encoder–decoder design [29–31,34,35,37,38]. However, adapting UNet to the specific requirements of CGH presents a significant challenge. The fundamental issue is that UNet's architecture may not be optimally designed for the intricate cross-domain transformations inherent in CGH. These transformations convert color images from the amplitude domain to phase-only holograms in the phase domain, a critical process that demands precise handling to ensure fidelity and accuracy in hologram reconstruction [39]. Furthermore, traditional activation functions, borrowed from 2D image processing, map output values to a fixed interval through nonlinear variations. This oversimplification overlooks the intrinsic periodic nature of phase values, leading to potential inaccuracies in the reconstructed holograms.
In recognition of these challenges, our study introduces an approach that combines the computational power of neural networks with the rigorous requirements of optical physics. The cross-domain fusion network (CDFN) is designed to facilitate the efficient and accurate conversion from the amplitude domain to the phase domain. By dissecting the cross-domain conversion process into multiple stages and incorporating the physics-aware mapping function, CDFN significantly enhances the quality and efficiency of hologram generation. This innovation addresses the technical hurdles inherent in CGH and aligns with the pursuit of optical excellence. Figure 1 presents a visual comparison between conventional approaches and our approach. The results of various experiments demonstrate the capability of our method to reconstruct high-quality 2K color images both in simulation and real-world experiments.
Figure 1. (a) Workflow of conventional methods based on the vanilla UNet architecture. (b) Our proposed cross-domain fusion network highlights the role of our multi-stage conversion architecture and the infinity phase mapper (IPM) in accomplishing the cross-domain transformation task. The multi-stage conversion architecture employs multiple feature maps to facilitate a gradual transformation between the two domains. Concurrently, the infinity phase mapper is designed to accommodate the periodic nature of phase values, ensuring the preservation of physical consistency.
2. METHOD

A. Overall Workflow

Figure 2 delineates the workflow of our proposed method. The algorithm begins by feeding the amplitude image into the first sub-network. The amplitude image is then combined with the initial phase image output by the first sub-network to represent the complex hologram at the object plane. Next, the algorithm applies the angular spectrum method (ASM) to compute the complex hologram at the SLM plane, simulating the physical process of forward light propagation and bridging the digital representation with real-world holograms. The complex hologram at the SLM plane is then fed into the second sub-network, which is trained to translate it into the final POH that can be loaded onto the SLM.
Figure 2. Hologram generation workflow of our proposed method and the corresponding network architecture. The feature maps representing the various conversion stages are color-coded. The first sub-network predicts the initial phase, and the second sub-network predicts the POH.
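To make the propagation step concrete, below is a minimal sketch of ASM forward propagation in PyTorch. The function name asm_propagate and the simple evanescent-wave cutoff are our illustrative assumptions, not the authors' exact implementation:

```python
import torch

def asm_propagate(field: torch.Tensor, wavelength: float,
                  pitch: float, distance: float) -> torch.Tensor:
    """Propagate a complex field of shape (..., H, W) over `distance` meters."""
    H, W = field.shape[-2:]
    # Spatial frequency grids (cycles per meter)
    fx = torch.fft.fftfreq(W, d=pitch, device=field.device)
    fy = torch.fft.fftfreq(H, d=pitch, device=field.device)
    FY, FX = torch.meshgrid(fy, fx, indexing="ij")
    # Free-space transfer function; evanescent components are zeroed out
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = 2 * torch.pi / wavelength * torch.sqrt(torch.clamp(arg, min=0.0))
    transfer = torch.exp(1j * kz * distance) * (arg > 0)
    return torch.fft.ifft2(torch.fft.fft2(field) * transfer)

# Complex hologram at the object plane: amplitude image + predicted phase
# obj_field = amplitude * torch.exp(1j * initial_phase)
# slm_field = asm_propagate(obj_field, wavelength=520e-9, pitch=6.4e-6, distance=0.2)
```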
The two sub-networks in our algorithm share a similar structure, derived from the UNet used in HoloNet [30]. However, to address the cross-domain conversion problem in CGH, we decompose the conversion process into multiple stages by constructing additional intermediate feature maps, yielding a multi-stage network architecture.
B. Multi-Stage Architecture

The multi-stage architecture is motivated by the need to map from the amplitude domain to the distinct phase domain in POH generation. The traditional UNet architecture directly forwards high-resolution feature maps from the encoder to the decoder. While this approach is efficient in many applications, it is less effective in the CGH context because the encoder operates on amplitude values while the decoder must produce phase values [39].
To tackle this issue, our network introduces additional intermediate feature maps representing multiple stages, which serve as a buffer and enable a more gradual conversion. These stages resolve the discrepancy between the encoder inputs (amplitude values) and the final decoder outputs (phase values). As depicted in Fig. 2, the feature maps representing the various conversion stages are color-coded. This design ensures a smoother and more cohesive conversion from the amplitude domain to the phase domain, minimizing abrupt changes and improving the results.
To prove the effectiveness of the multi-stage design, we exclude the influence of parameter count by keeping the number of parameters similar to that of HoloNet [30]. Specifically, we downsize the top-down feature maps' channel numbers from [32, 64, 128, 256] to [16, 32, 64, 128]. As shown in Table 2 presented later, our algorithm's network has fewer parameters than HoloNet.
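To convey the idea in code, the sketch below buffers a UNet skip connection with a short chain of residual conversion blocks. The class name MultiStageSkip, the block layout, and the stage count are hypothetical, meant only to illustrate how intermediate feature maps can smooth the amplitude-to-phase transition:

```python
import torch.nn as nn

def conv_block(channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class MultiStageSkip(nn.Module):
    """Skip path buffered by `num_stages` intermediate feature maps."""
    def __init__(self, channels: int, num_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(conv_block(channels) for _ in range(num_stages))

    def forward(self, enc_feat):
        x = enc_feat
        intermediates = []          # also usable for deep supervision
        for stage in self.stages:
            x = x + stage(x)        # residual update keeps the transition gradual
            intermediates.append(x)
        return x, intermediates
```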
C. Infinity Phase Mapper
In CGH, the network must predict phase values when generating a POH. To obtain a phase value in the range $(-\pi, \pi]$, existing algorithms append a traditional activation function, e.g., the hardTanh function in HoloNet [30], after the final output layer of the network to restrict values to the required range. However, this approach introduces inconsistencies in the predicted phase values. Ideally, the phase angle, represented by the network's raw output value, can span the infinite range $(-\infty, +\infty)$. A well-defined mapping function should account for the periodicity of the phase angle and map any value in $(-\infty, +\infty)$ to a corresponding value in $(-\pi, \pi]$. Most activation functions, however, simply truncate, mapping values beyond $[-\pi, \pi]$ to the endpoints $-\pi$ and $\pi$, resulting in a misrepresented mapping of phase values that significantly impacts the results.
To address this problem, we propose a physics-aware mapping function. The primary objective of this function is to ensure an accurate mapping of output values, carefully mapping each value into the desired range. The function is formulated as

$$\varphi_{\mathrm{out}} = \mathcal{M}(\varphi_{\mathrm{in}}) = \operatorname{atan2}\!\big(\sin \varphi_{\mathrm{in}},\, \cos \varphi_{\mathrm{in}}\big),$$

where the phase mapping function $\mathcal{M}$ takes the input phase value $\varphi_{\mathrm{in}}$ and produces the output phase value $\varphi_{\mathrm{out}}$. The $\operatorname{atan2}$ function is a modified version of the arctan function that considers all four quadrants when computing the angle from the positive $x$-axis to the point $(\cos \varphi_{\mathrm{in}}, \sin \varphi_{\mathrm{in}})$.
This specific phase mapping function, named the infinity phase mapper, translates any phase value in the infinite range $(-\infty, +\infty)$ into its corresponding value in $(-\pi, \pi]$. This approach avoids the limitations of hard truncation. The graphical representation of the infinity phase mapper is shown in Fig. 3.
Figure 3. Schematic of the infinity phase mapper (IPM), which maps infinite phase values to their corresponding points within the $(-\pi, \pi]$ interval.
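In code, the mapper reduces to a single composition of elementary functions. The following is a minimal PyTorch sketch, assuming the network emits an unbounded raw phase tensor (the function name is ours, not from the paper):

```python
import torch

def infinity_phase_mapper(raw_phase: torch.Tensor) -> torch.Tensor:
    # atan2(sin x, cos x) wraps any real x into (-pi, pi] while
    # respecting the 2*pi periodicity of the phase angle.
    return torch.atan2(torch.sin(raw_phase), torch.cos(raw_phase))

# Unlike hard truncation, the mapping is periodic:
# infinity_phase_mapper(torch.tensor([3.5 * torch.pi]))  # ~ -0.5 * pi
```

A convenient side effect of this formulation is that its derivative with respect to the raw output equals 1 almost everywhere, so the wrap does not attenuate gradients during backpropagation.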
D. Deep Supervision

To improve our method's performance, we incorporate a deep supervision strategy into the training process. When training the network, we supervise the loss between the reconstructed image from the POH and the target image. Additionally, we supervise the losses between the target image and the images reconstructed from the top multi-stage feature maps. By applying supervision to the intermediate feature maps in the initial stages of the network, the network is encouraged to learn task-relevant features early in the model. This is critical because features learned in the initial stages form the foundation for subsequent stages. Deep supervision ensures that these foundational features are sensitive to the cross-domain task, enhancing the overall performance of the network for cross-domain transformations. The loss function employed under this deep supervision paradigm can be formulated as

$$\mathcal{L}_{\mathrm{DS}} = \sum_{i=1}^{N} \lambda_i \, \mathcal{L}\big(\hat{A}_i, A\big),$$

where the target amplitude image is denoted as $A$ and the amplitude image reconstructed at the $i$th level is represented as $\hat{A}_i$. $N$ equals the network depth, which is 4 in the implementation. The weighting coefficient $\lambda_i$ balances the losses at different levels, with $i$ indexing the intermediate feature images; $\lambda_i$ is set to 0.25, 0.5, 0.75, and 1.0 in the implementation.
Moreover, because of the importance of the final POH, special attention is given to it during training. An extra perceptual loss [40] is computed between the image reconstructed from the final POH and the target image. Therefore, the overall loss function is formulated as

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{DS}} + \mathcal{L}_{\mathrm{perc}}\big(\hat{A}_{\mathrm{POH}}, A\big),$$

where $\hat{A}_{\mathrm{POH}}$ denotes the amplitude image reconstructed from the final POH.
This overall loss, taking into account both the deep supervision loss from multi-stage feature maps and the perceptual loss related to the final POH, guarantees the algorithm’s effectiveness in producing high-quality POHs.
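A hedged sketch of how this objective could be assembled in PyTorch is given below. The use of MSE as the per-level distance and the perceptual_loss interface are our assumptions; the per-level weights follow the values stated above:

```python
import torch.nn.functional as F

LEVEL_WEIGHTS = [0.25, 0.5, 0.75, 1.0]  # lambda_i for N = 4 levels

def total_loss(level_recons, final_recon, target, perceptual_loss):
    """level_recons: amplitude images reconstructed from the intermediate
    stages; final_recon: amplitude reconstructed from the final POH."""
    ds = sum(w * F.mse_loss(r, target)
             for w, r in zip(LEVEL_WEIGHTS, level_recons))
    return ds + perceptual_loss(final_recon, target)
```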
3. EXPERIMENT
The experiments are performed on a 24 GB NVIDIA RTX 3090 GPU. The DIV2K dataset [41] is used for both training and testing; the training set consists of 800 images and the testing set of 100. During training, data augmentation techniques such as image flipping and rotation are employed to facilitate learning, similar to HoloNet [30]. Training is limited to a maximum of 20 epochs, and the learning rate is set to 0.001. For the ASM model used in our algorithm, the parameters are as follows: the diffraction distance is 0.2 m, the SLM's pixel pitch is 6.4 μm, and the wavelengths for the red, green, and blue lights are 638, 520, and 450 nm, respectively.
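For reference, these settings can be collected as plain constants; the pixel pitch is taken from the SLM described in the optical experiments below:

```python
WAVELENGTHS = {"red": 638e-9, "green": 520e-9, "blue": 450e-9}  # meters
PIXEL_PITCH = 6.4e-6   # SLM pixel pitch in meters (HOLOEYE LETO-3)
DISTANCE = 0.2         # diffraction distance in meters
LEARNING_RATE = 1e-3
MAX_EPOCHS = 20
```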
A. Numerical Simulation
As shown in Fig. 4, we first compare the numerical simulation results of our proposed CDFN with those of the non-iterative method DPAC [26], the iterative method SGD [23], and the learning-based methods HoloNet [30] and CCNN [38] on the DIV2K dataset. For consistency, RGB images are first resized and then zero-padded to the hologram resolution of 1920 × 1080 pixels. Figure 4 shows enlarged sections to better highlight the differences between the methods. Our results demonstrate CDFN's ability to capture fine details and preserve structural integrity, proving its effectiveness and superiority.
Figure 4. Comparison of numerically reconstructed color images. From left to right: results of double phase-amplitude coding (DPAC), stochastic gradient descent method (SGD), HoloNet, CCNN, and our proposed CDFN, respectively (PSNR in dB).
Table 1 presents the quantitative comparison of the various methods. Among all methods, SGD achieves the best PSNR, while CDFN excels in SSIM. Among the learning-based methods, under the same training conditions, our proposed CDFN achieves the highest performance, surpassing the current state-of-the-art method CCNN. This demonstrates that our method holds a competitive advantage over current CGH approaches.
Table 1. Quantitative Results of Different Methods Tested on the DIV2K Testing Dataset, Which Consists of 100 Images (Color Channels)

Method       DPAC         SGD          HoloNet      CCNN         CDFN (Ours)
PSNR/SSIM    19.97/0.689  32.69/0.942  29.87/0.926  30.72/0.927  31.68/0.944

Evaluation metrics include PSNR (dB) and SSIM.
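As an illustration of how such numbers are obtained, the sketch below computes PSNR and SSIM with scikit-image (version 0.19 or later); this is our example, not the authors' evaluation script:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(recon: np.ndarray, target: np.ndarray):
    """Both images as floats in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(target, recon, data_range=1.0)
    ssim = structural_similarity(target, recon, data_range=1.0, channel_axis=-1)
    return psnr, ssim
```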
Table 2 compares the computational efficiency of SGD, HoloNet, and our CDFN method in terms of computation time, parameter count, and floating-point operations (FLOPs). As learning-based methods, CDFN and HoloNet show a significant speed advantage over the traditional SGD algorithm. Although CDFN uses a greater number of intermediate feature maps, which increases its computational demand, it compensates by reducing the number of convolution channels. Consequently, CDFN has fewer parameters (210,000) than HoloNet (287,000), underscoring that its enhanced performance stems from effective network design rather than mere parameter count.
Table 2. Efficiency Comparison of Three CGH Methods Tested on the DIV2K Testing Dataset, Which Consists of 100 Images with a Resolution of 1920 × 1080 Pixels

Metric               SGD       HoloNet    CDFN (Ours)
Time (s)             16.851    0.010      0.012
Parameter quantity   –         287,000    210,000
FLOPs                –

Evaluation metrics include the methods' average inference time, number of parameters, and floating-point operations (FLOPs). The evaluation is performed on a 24 GB NVIDIA RTX 3090 GPU.
Our CDFN method represents a meaningful advancement in CGH, striking a balance between reconstruction accuracy and computational efficiency, as evidenced by its PSNR and SSIM evaluations. By employing multi-stage feature maps, CDFN outperforms state-of-the-art learning-based methods, establishing itself as a sophisticated and effective CGH approach.
B. Ablation Study
In our ablation study, we evaluate each of the strategies integral to our proposed method, beginning with the pure multi-stage (MS) architecture, illustrated in Fig. 5. Even in this most basic form, our approach achieves high reconstruction accuracy, outperforming HoloNet.
Figure 5. Comparison of simulated reconstruction images in our ablation study (green channel). From left to right: reconstruction results of HoloNet, multi-stage architecture (MS), multi-stage architecture with infinity phase mapper (MS w IPM), multi-stage architecture with deep supervision (MS w DS), and multi-stage architecture with infinity phase mapper and deep supervision (MS w IPM&DS) (PSNR in dB).
Quantitative evidence, as detailed in Table 3, underscores this performance advantage. The MS architecture attained a PSNR of 30.88 dB and an SSIM of 0.932, surpassing HoloNet's corresponding metrics of 30.15 dB and 0.926.
Table 3. Ablation Study with Average PSNR (dB) and SSIM Metrics on the DIV2K Testing Dataset, Which Consists of 100 Images (Green Channel)

Metric      HoloNet      MS           MS w IPM     MS w DS      MS w IPM&DS
PSNR/SSIM   30.15/0.926  30.88/0.932  32.13/0.948  31.65/0.940  32.26/0.948
Our analysis extends to evaluating enhancements within the multi-stage architecture through the incorporation of an infinity phase mapper (MS w IPM) and deep supervision (MS w DS), individually and in combination (MS w IPM&DS). The addition of the infinity phase mapper alone elevates the reconstruction accuracy, achieving a PSNR of 32.13 dB and an SSIM of 0.948. This improvement underscores the crucial contribution of the phase mapping function to the algorithm’s performance.
Integrating deep supervision into the multi-stage architecture (MS w DS) also improves reconstruction quality over the pure MS baseline, with PSNR increasing to 31.65 dB and SSIM to 0.940. This indicates that deep supervision improves reconstruction quality on its own.
The peak performance is realized when both the phase mapping function and deep supervision are applied together (MS w IPM&DS), culminating in a PSNR of 32.26 dB and an SSIM of 0.948. These outstanding results not only highlight the individual strengths of each strategy but also the superior reconstruction accuracy achieved through their combined implementation, showcasing a significant synergistic effect.
To further explore the impact of the IPM on reconstruction quality, we analyze the distribution of phase values in POHs generated under different experimental conditions, as shown in Fig. 6. Notably, the use of the hardTanh activation function, as implemented in HoloNet and CDFN w/o IPM, introduces a discontinuity around $\pm\pi$. This discontinuity arises from hardTanh's insensitivity to phase periodicity. In CDFN w IPM, this discontinuity is eliminated, leading to enhanced reconstruction quality. The improved metrics obtained with the IPM, as shown in Table 3, support this finding and reinforce our hypothesis that the IPM enhances the accuracy of POH generation by respecting phase periodicity. These results also illustrate the unique properties of holographic data compared with traditional 2D image data, emphasizing the need for specialized treatment.
Figure 6. Normalized phase value distribution of a randomly picked POH. From left to right: results of HoloNet, CDFN without the infinity phase mapper (CDFN w/o IPM), and CDFN with the infinity phase mapper (CDFN w IPM). Note that values in $(-\pi, \pi]$ are shifted to $[0, 2\pi)$ in this figure for easier observation of values around $\pi$. The arrows highlight that the IPM generates a continuous phase value distribution around $\pi$, where the conventional activation function generates a gap.
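A minimal sketch of this analysis, assuming the POH phase is available as a NumPy array; the shift from $(-\pi, \pi]$ to $[0, 2\pi)$ mirrors the presentation in Fig. 6:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_phase_histogram(poh_phase: np.ndarray, bins: int = 256):
    shifted = np.mod(poh_phase, 2 * np.pi)   # map (-pi, pi] -> [0, 2*pi)
    plt.hist(shifted.ravel(), bins=bins, range=(0.0, 2 * np.pi), density=True)
    plt.xlabel("Phase (rad)")
    plt.ylabel("Normalized frequency")
    plt.show()
```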
C. Optical Experiment

Our algorithm's effectiveness is further validated through optical reconstruction experiments. In the experiments, we load the computed 1920 × 1080 pixel POHs onto an SLM and capture images using a Sony A7 Mark III camera. The optical display system setup is illustrated in Fig. 7. The laser source is a FISBA READYBeam, which emits light at wavelengths of 638, 520, and 450 nm for the red, green, and blue channels, respectively. After passing through a collimating lens, the beam is split into two paths by a beam splitter. The incident beam is then modulated by a HOLOEYE LETO-3-CFS-127 SLM, which has a resolution of 1920 × 1080 pixels and a pixel pitch of 6.4 μm. The modulated beam is reflected, and an aperture filters out undesired high diffraction orders. The target plane is situated at a distance of 0.2 m from the SLM, and the image plane is determined by the thin lens equation.
Figure 7. Holographic display setup. (a) Schematic diagram of the optical display system setup. (b) Photograph of the optical display system setup.
Figure 8 illustrates the optical reconstruction results obtained with green light. To enhance clarity, magnified patches of the reconstructed images are also presented. In this experiment, we evaluate the performance of our CDFN against GS [18], DPAC [26], SGD [23], and HoloNet [30]. The optical results are consistent with our simulations. DPAC's reconstructions struggle to depict fine details accurately, as seen in the parrot's feathers and the butterfly's intricate patterns. The SGD reconstructions are marred by significant speckle noise that obscures delicate features. In contrast, our CDFN method overcomes these issues, producing reconstructions with enhanced clarity.
Figure 8. Optical reconstruction images in the green channel. From left to right: results of the Gerchberg–Saxton algorithm (GS), double phase-amplitude coding (DPAC), HoloNet, CCNN, and CDFN, respectively.
For full-color imaging, as shown in Fig. 9, we sequentially load the SLM with the POH for each color channel while synchronously activating the corresponding laser color. The detailed results, showcased in Fig. 10, affirm our method's capability to reconstruct full-color images with high fidelity. The experimental outcomes underscore not only the practical effectiveness of our approach but also its potential for advancing optical reconstruction technologies.
Figure 9. Optical reconstruction images from 1920 × 1080 pixel POHs on an SLM in the red, green, blue, and color channels. The images are directly captured by a camera. The color image is obtained by synchronizing the three-color laser source and sequentially loading different POHs. From left to right: results in the red, green, blue, and color channels, respectively.
Figure 10. Optical reconstruction images of our method in the color channel, directly captured by a camera, and the corresponding phase-only holograms. (a) Cropped and zoomed patch of the first color image in (b) for visualization. (b) Captured color images. (c) The corresponding phase-only holograms of (b).
To evaluate our method's generalization across different data types, we test the model trained on the color DIV2K dataset using binary USAF images. Figure 11 shows the results from both simulation and optical experiments. The PSNR and SSIM results for the three images are 18.37 dB and 0.604, 17.03 dB and 0.643, and 20.71 dB and 0.781, respectively. While these results confirm that our method extends from color to binary images, they also highlight certain challenges. Notably, some speckle noise can be observed in the reconstructions. This occurs because binary images, with their high contrast between black and white regions, introduce significant high-frequency components that are difficult for a network trained on color images to reconstruct accurately. Despite this, our method demonstrates robustness, as identifiable areas are evident in the reconstruction results. These findings have prompted us to explore further refinements to our method to enhance its generalizability across data types.
Figure 11. Results of our method applied to binary images. (a) Simulated images. (b) Zoomed-in patches. (c) Experimental images. Note that the target binary images are not in our training set.
4. CONCLUSION

In conclusion, our study advances the field of computer-generated holography by introducing the cross-domain fusion network (CDFN) as a solution for generating high-quality phase-only holograms efficiently. By addressing the complexities of cross-domain conversion, CDFN demonstrates the potential of integrating computational methods with traditional optical principles. The inclusion of the infinity phase mapper, which encodes an understanding of optical physics, ensures that the generated holograms maintain a high level of fidelity to the underlying optical phenomena. This work signifies not only technological progress but also a deeper integration between the realms of computation and physical reality, promising a future where holographic displays achieve unprecedented levels of realism.
The proposed physics-aware CDFN encounters scalability limitations at higher resolutions and requires substantial computational resources, specifically over 24 GB of GPU memory. Additionally, the infinity phase mapper, while effective, may not completely model all elements of holographic displays. Future efforts will focus on exploring computational models for greater efficiency [42], supporting higher resolutions [43], and refining physical assumptions to broaden CDFN's applicability in CGH. These efforts aim not only to improve current metrics but also to expand the network's utility in broader holographic imaging and display challenges. Finally, while this work does not delve into the details of 3D CGH [44], we note that the proposed method could potentially extend to such settings and leave this question for future investigation.
[14] M. Zhou, S. Jiao, P. Chakravarthula. Point spread function-inspired deformable convolutional network for holographic displays. Proc. SPIE, 13104, 131042M(2024).
[16] G. A. Koulieris, K. Akşit, M. Stengel. Near-eye display and tracking technologies for virtual and augmented reality. Computer Graphics Forum, 38, 493-519(2019).
[36] O. Ronneberger, P. Fischer, T. Brox. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention, 234-241(2015).
[40] J. Johnson, A. Alahi, F.-F. Li. Perceptual losses for real-time style transfer and super-resolution. Proceedings of European Conference of Computer Vision, 694-711(2016).
[41] E. Agustsson, R. Timofte. NTIRE 2017 challenge on single image super-resolution: dataset and study. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 126-135(2017).
[42] Z. Dong, J. Jia, Y. Li. Divide-conquer-and-merge: memory- and time-efficient holographic displays. IEEE Conference Virtual Reality and 3D User Interfaces (VR), 493-501(2024).