Speckle-free holography with a diffraction-aware global perceptual model

Yiran Wei; Yiyun Chen; Mi Zhou; Mu Ku Chen; Shuming Jiao; Qinghua Song; Xiao-Ping Zhang; Zihan Geng

doi:10.1364/PRJ.523650

1. INTRODUCTION

A holographic display retains the amplitude and phase information of the object and can physically reproduce the complete light wave emitted by the object, resulting in realistic displays [1–10]. Computer-generated holography (CGH), which eliminates the need for complex optical-interference-based recording processes and enables the display of physically non-existent objects, is the dominant technology for holographic displays [11–18].

Contemporary mainstream CGH algorithms utilizing artificial neural networks mainly employ convolutional neural networks (CNNs) [5,19–34]. CNNs have demonstrated a robust performance in image processing tasks owing to their translational invariance and the inductive bias of local receptive fields [35]. However, in the hologram generation process, as each pixel on the hologram can act as a point light source affecting the entire reconstructed image, the network’s capacity to learn non-local features significantly impacts the quality of the generated hologram [31]. Unfortunately, the receptive fields of current CNN-based CGH models are typically much smaller than the image size, and thus cannot effectively learn long-range dependencies.

A more versatile neural network model, the Transformer, has been proposed and has demonstrated superior performance over CNNs in a multitude of computer vision tasks. This superiority is primarily attributed to the Transformer’s capacity to capture long-range dependencies in images through a global self-attention mechanism [36–38]. With the global self-attention mechanism, the Transformer is capable of learning the comprehensive relationships between holographic images and target images within a single network layer. Despite the potent performance of Transformers, their computational demands far exceed those of CNNs, limiting their applicability to high-resolution images.

In this work, we propose a diffraction-aware CGH model called Holomer. Through the sliding window self-attention mechanism and embedding-based feature dimensionality reduction, Holomer is capable of learning the complex relationship between target images and holograms with fewer layers compared to some existing methods [10,24,34]. At the same time, we propose a metric called perceptive index (PI). It evaluates the network-based CGH model’s ability to learn the inverse diffraction process. Holomer achieves a PSNR and an SSIM of 35.59 dB and 0.93, respectively, on the DIV2K dataset. The fidelity of the generated holograms with a resolution of $1920 \times 1024$ is evaluated through numerical simulations and experimental validation.

2. METHOD

Holomer generates holograms using a two-stage architecture as shown in Fig. 1. The target phase generator estimates the phase of the target amplitude. Afterward, the predicted phase and the target amplitude are propagated to the SLM plane using the angular spectrum method (ASM) [39,40]. The amplitude and phase propagated to the SLM plane are converted to a phase-only hologram through a phase-only hologram encoder. During the training phase, the generated phase-only hologram undergoes further propagation to the imaging plane through ASM. The resulting reconstructed amplitude of the hologram is compared to the target amplitude, and the mean squared error (MSE) between the amplitudes of the two images is computed as a loss. This loss is utilized to update the parameters of both the target phase generator and the phase-only hologram encoder. The MSE loss is calculated as follows: $Loss = \frac{1}{N} \sum_{i = 1}^{N} {‖ A_{rec, i} - A_{tar, i} ‖}^{2},$ (1) where $A_{rec, i}$ and $A_{tar, i}$ represent the amplitude values of the reconstructed and target images at the $i$th pixel, respectively, and $N$ is the total number of pixels.
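
The training pipeline described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `phase_generator` and `hologram_encoder` are hypothetical stand-ins for the two networks, the sign convention (negative distance toward the SLM plane, positive distance back to the image plane) is assumed, and the propagation uses the standard angular spectrum transfer function without band limiting.

```python
import numpy as np

def asm_propagate(field, wavelength, pixel_size, distance):
    """Propagate a complex field by `distance` with the angular spectrum method."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_size)
    fy = np.fft.fftfreq(ny, d=pixel_size)
    fxx, fyy = np.meshgrid(fx, fy)
    # Argument of the square root in the ASM transfer function.
    arg = 1.0 - (wavelength * fxx) ** 2 - (wavelength * fyy) ** 2
    propagating = arg > 0  # evanescent components are discarded
    kz = 2 * np.pi / wavelength * np.sqrt(np.where(propagating, arg, 0.0))
    transfer = np.where(propagating, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)

def amplitude_mse(a_rec, a_tar):
    """Eq. (1): mean squared error between reconstructed and target amplitudes."""
    return float(np.mean((a_rec - a_tar) ** 2))

def forward_loss(a_tar, phase_generator, hologram_encoder,
                 wavelength=520e-9, pixel_size=6.4e-6, distance=0.2):
    """One forward pass of the two-stage pipeline, returning the training loss."""
    phi_tar = phase_generator(a_tar)                       # stage 1: target phase
    slm_field = asm_propagate(a_tar * np.exp(1j * phi_tar),
                              wavelength, pixel_size, -distance)
    phi_holo = hologram_encoder(np.abs(slm_field), np.angle(slm_field))  # stage 2
    rec_field = asm_propagate(np.exp(1j * phi_holo),
                              wavelength, pixel_size, distance)
    return amplitude_mse(np.abs(rec_field), a_tar)

# Hypothetical stand-ins for the two networks (no learning involved).
zero_phase = lambda amplitude: np.zeros_like(amplitude)
keep_phase = lambda amplitude, phase: phase

target = np.random.default_rng(0).random((64, 64))
loss = forward_loss(target, zero_phase, keep_phase)
```

In an actual training loop, the two stand-ins would be differentiable networks and the loss would be back-propagated through both propagation steps.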

Figure 1. (a) Schematic of Holomer’s (our method) two-stage architecture and its training pipeline. (b) Network architecture schematic for the target phase generator and the phase-only hologram encoder. The target phase generator receives a single-channel target amplitude as input. The input of the phase-only hologram encoder is dual-channel amplitude and phase images. The U-shaped network architecture ensures the consistency between output and input dimensions, enhancing the network’s capability to learn multi-level features. (c) Holomer block schematic; each Holomer block contains two modules, window multihead self-attention (W-MSA) and sliding window multihead self-attention (SW-MSA). The two diagrams correspond to the receptive field of the two modules. By employing the sliding window self-attention mechanism, the Holomer block significantly enhances its receptive field, thereby improving its ability to learn long-range features of the diffraction process.

Holomer maintains resolution consistency between the generated hologram and the input image through a U-Net architecture, as depicted in Fig. 1(b). The model incorporates the residual connections of U-Net [35] for efficient training. The Holomer block, with window self-attention and sliding window self-attention, first calculates the self-attention within each independent window. Following this step, the windows are shifted by half the window size, both vertically and horizontally, introducing cross-window connections as shown in Fig. 1(c). This shifting mechanism ensures that subsequent layers capture interactions beyond the initially defined windows, enabling the model to effectively integrate local and global information.
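
The window partitioning and shifting step can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and realizing the shift as a cyclic roll of the feature map (as is common in shifted-window Transformers) is an assumption, since the text does not specify how the window boundaries are handled.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping (window, window, C) tiles."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)

def shifted_window_partition(x, window):
    """Shift the map by half a window in both directions, then partition it.

    Assumption: the shift is a cyclic roll, so every tile stays full-sized.
    """
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

features = np.arange(32 * 32, dtype=float).reshape(32, 32, 1)
regular = window_partition(features, 16)          # tiles aligned to the grid
shifted = shifted_window_partition(features, 16)  # tiles straddling four regular tiles
```

Self-attention would then be computed independently inside each tile; alternating the two partitions lets information cross the boundaries of the regular windows.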

According to the angular spectrum theory [41], the reconstructed image of a hologram $I (x, y, z)$ at a specific depth $z$ can be obtained by convolving the hologram $h (x, y)$ with the diffraction impulse response (DIR) $t (x, y, k, z)$, in which $k = \frac{2 π}{λ}$ represents the wave number, and $λ$ denotes the wavelength of the light source. The computational process is illustrated in Fig. 2: $I (x, y, z) = h (x, y) * t (x, y, k, z) .$ (2)

Figure 2. (a), (b) Illustration of the amplitude and phase distribution of diffraction impulse response (DIR), respectively. In the computation of the DIR, the wavelength employed is 520 nm, the pixel size is 6.4 μm, and the propagation distance is 20 cm. (c) The schematic illustration for image reconstruction through the convolution of DIR with a hologram. To fulfill the conditions of linear convolution, it is necessary to zero-pad the hologram to twice its original size before proceeding with the convolution. (d) A comparative schematic illustration contrasting the receptive fields of Holomer (ours) and a single-layer CNN, along with their respective coverage of the DIR region size. (e) The curve illustrating the variation in the proportion of the sum of intensities over the region covered by the receptive field to the total DIR intensity with respect to changes in the receptive field size, annotated with markers for a single-layer CNN and Holomer. The DIR coverage of Holomer exhibits a notable improvement compared to CNN.

Since the angular spectrum method is a rigorous solution to the Rayleigh-Sommerfeld diffraction formula, we can derive the diffraction impulse response (DIR) $t (x, y, k, z)$ as follows: $t (x, y, k, z) = \frac{1}{2 π} \frac{z}{r} (\frac{1}{r} - j k) \frac{e^{j k r}}{r} .$ (3)

In the formula, $r = \sqrt{x^{2} + y^{2} + z^{2}}$ denotes the distance between the source point and the observation point.
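
As a rough numerical illustration of Eqs. (2) and (3), the sketch below samples the DIR on a pixel grid and applies it to a hologram by zero-padded FFT convolution. The function names, the grid centering, and the omission of the pixel-area scaling factor are assumptions made for brevity; padding to the full linear-convolution size plays the same role as the zero-padding to twice the hologram size mentioned in the caption of Fig. 2.

```python
import numpy as np

def diffraction_impulse_response(n, pixel_size, wavelength, distance):
    """Sample Eq. (3) on an n x n grid centered on the optical axis."""
    k = 2 * np.pi / wavelength
    coords = (np.arange(n) - n // 2) * pixel_size
    x, y = np.meshgrid(coords, coords)
    r = np.sqrt(x ** 2 + y ** 2 + distance ** 2)
    return (1 / (2 * np.pi)) * (distance / r) * (1 / r - 1j * k) * np.exp(1j * k * r) / r

def linear_convolve(hologram, kernel):
    """Eq. (2) as a linear ('full') 2D convolution via zero-padded FFTs."""
    hy, hx = hologram.shape
    ky, kx = kernel.shape
    shape = (hy + ky - 1, hx + kx - 1)  # padding avoids circular wrap-around
    spectrum = np.fft.fft2(hologram, s=shape) * np.fft.fft2(kernel, s=shape)
    return np.fft.ifft2(spectrum)

# Parameters quoted in the caption of Fig. 2; the grid size is an arbitrary choice.
dir_kernel = diffraction_impulse_response(65, 6.4e-6, 520e-9, 0.2)
phase = np.random.default_rng(0).uniform(0, 2 * np.pi, (64, 64))
reconstruction = linear_convolve(np.exp(1j * phase), dir_kernel)
```

The output is larger than the hologram because a full linear convolution is returned; cropping the central region would recover an image of the original size.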

The current common CNN-based CGH models are limited by the smaller receptive field of CNNs, which cannot learn the whole diffraction process in a single layer. However, this can be achieved using a global self-attention mechanism: $Attention (Q, K, V) = Softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V,$ (4) where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by multiplying the input features $X$ with three learnable parameter matrices $W^{Q}$, $W^{K}$, and $W^{V}$, respectively, and $d_{k}$ is the dimensionality of the keys. Despite the extraordinary performance of the Transformer in computing holograms, its computational demands are substantial. Calculating global self-attention for a $1920 \times 1080$ image requires about 17,060 GB of GPU memory, which makes it infeasible.
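
Eq. (4) and the memory argument can be illustrated with the short sketch below. It is a single-head NumPy sketch with hypothetical function names. The memory estimate assumes one token per pixel and counts only the $N \times N$ attention matrix in 32-bit floats, so it indicates the order of magnitude rather than reproducing the exact figure quoted above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Eq. (4) for a single head: x is (tokens, features)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (tokens, tokens) attention logits
    return softmax(scores) @ v

def attention_matrix_gigabytes(height, width, bytes_per_value=4):
    """Memory of one (tokens x tokens) attention matrix, one token per pixel."""
    tokens = height * width
    return tokens ** 2 * bytes_per_value / 1e9

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 4)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                  # shape (5, 4)
full_hd_memory = attention_matrix_gigabytes(1080, 1920)  # on the order of 1e4 GB
```

Because the attention matrix grows with the square of the number of tokens, halving the image side length reduces this term by a factor of 16, which motivates restricting attention to windows.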

Therefore, there is a trade-off between computational efficiency and the receptive field in model design. Our proposed Holomer employs two designs to address this issue. First, an embedding layer is used in the input layer to reduce the dimension of the target image, where each pixel in the low-dimensional feature map contains information from 16 pixels of the original image. Second, the main structure of the model employs window self-attention layers and sliding-window self-attention layers, with each window size being $16 \times 16$. After computing the window self-attention, the windows are shifted to compute sliding window self-attention, which expands the receptive field by a factor of 1.5. The receptive field of each Holomer block is $16 \times 16 \times 1.5$, which is about 43 times that of a convolutional layer.
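
The receptive-field arithmetic above can be checked with a few lines. The $3 \times 3$ kernel is an assumption: the text does not state the kernel size of the reference convolutional layer, but it is the size for which the quoted ratio of about 43 is obtained.

```python
def holomer_block_receptive_field(window=16, shift_gain=1.5):
    """Feature-map positions covered by one Holomer block (window area x 1.5)."""
    return window * window * shift_gain

def conv_receptive_field(kernel=3):
    """Positions covered by one convolutional layer (assumed 3 x 3 kernel)."""
    return kernel * kernel

holomer_rf = holomer_block_receptive_field()   # 16 * 16 * 1.5 = 384
conv_rf = conv_receptive_field()               # 3 * 3 = 9
ratio = holomer_rf / conv_rf                   # 384 / 9, about 42.67

# Each embedded position summarizes 16 original pixels (from the text),
# so one block spans this many pixels of the original image.
original_pixels = holomer_rf * 16              # 6144
```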

To directly compare the learning capabilities of CNN and Holomer concerning the diffraction process, we propose a metric called perceptive index (PI), which is the ratio of the intensity integral of the DIR covered by the receptive field of a single module, $I_{recep}$, to the total intensity integral of the DIR, $I_{total}$. The definition is as follows: $P I = \frac{I_{recep}}{I_{total}} .$ (5)

A higher value of PI indicates that the receptive field of the module covers a larger area of the DIR, signifying stronger learning capabilities for the diffraction model. The calculated results are shown in Fig. 2(e). The DIR used for this computation is based on a pixel size of 6.4 μm, a holographic image resolution of $256 \times 256$ pixels, and a propagation distance of 20 cm. It is noteworthy that using a larger receptive field can indeed enhance the learning capabilities, but this also increases the computational complexity. Additionally, due to the spherical wave characteristics of the energy distribution in DIR, the intensity at positions far from the central point is quite low, imposing minimal impact on the reconstructed image. Therefore, the performance improvement gained from enlarging the receptive field has diminishing returns. Considering these trade-offs, we set the window size of Holomer to $16 \times 16$.
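
A literal evaluation of Eq. (5) might look like the sketch below. It assumes a square receptive field centered on the DIR, samples Eq. (3) directly on a $256 \times 256$ grid, and uses hypothetical function names. It is not intended to reproduce the reported PI values, which depend on how the authors computed the DIR and mapped each module's receptive field onto it.

```python
import numpy as np

def dir_intensity(n, pixel_size, wavelength, distance):
    """Intensity |t|^2 of the DIR in Eq. (3), sampled on a centered n x n grid."""
    k = 2 * np.pi / wavelength
    coords = (np.arange(n) - n // 2) * pixel_size
    x, y = np.meshgrid(coords, coords)
    r = np.sqrt(x ** 2 + y ** 2 + distance ** 2)
    t = (1 / (2 * np.pi)) * (distance / r) * (1 / r - 1j * k) * np.exp(1j * k * r) / r
    return np.abs(t) ** 2

def perceptive_index(intensity, side):
    """Eq. (5): intensity inside a centered side x side square over total intensity."""
    n = intensity.shape[0]
    lo = max(n // 2 - side // 2, 0)
    hi = min(lo + side, n)
    return float(intensity[lo:hi, lo:hi].sum() / intensity.sum())

intensity = dir_intensity(256, 6.4e-6, 520e-9, 0.2)
# Hypothetical receptive-field sides, in pixels of the DIR grid.
pi_values = {side: perceptive_index(intensity, side) for side in (3, 16, 64, 256)}
```

By construction the index grows monotonically with the side of the square and reaches 1 when the square covers the whole DIR.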

3. RESULTS

A. Performance Comparison

To validate Holomer’s performance in generating holograms, we demonstrate the reconstructed images through numerical simulations and optical experiments. The dataset used for training and testing is DIV2K [42]; the training and test sets contain 800 and 100 high-resolution images, respectively. The network is trained on an NVIDIA RTX 3090 GPU. We compare the performance of Holomer with an iterative algorithm and model-driven self-supervised models, including the Wirtinger algorithm [10], HoloNet [24], and CCNN [34].

Table 1 compares the performance of Wirtinger, HoloNet, CCNN, and Holomer on the DIV2K test dataset. In the simulation, we trained three different networks to generate holograms at wavelengths of 450, 520, and 638 nm, respectively. The reconstruction distance is 20 cm. Each network is trained using the images from the training set corresponding to the respective color channel. Holomer’s PSNR improves by 1.76 dB to 6.14 dB over the previous algorithms, and its SSIM improves by 0.02 to 0.11. These metrics demonstrate a clear improvement in holographic fidelity over the other algorithms. Additionally, Fig. 3 juxtaposes the outcomes of the four algorithms alongside Holomer’s full-color simulation, utilizing images sourced from the test set of the DIV2K dataset. The holographic reconstructions generated by the Wirtinger algorithm exhibit superior detail quality, albeit compromised by the pronounced speckle noise resulting from its stochastic initialization. The reconstruction quality of Holomer is the highest among all learning-based models, and its metrics surpass those of all presented algorithms.

Table 1. Performance Comparison of CGH Algorithms

Algorithm	Perceptive Index	PSNR (dB)	SSIM
Wirtinger	None	31.88	0.82
HoloNet	0.6%	33.83	0.91
CCNN	0.6%	29.45	0.84
Holomer (ours)	32.6%	35.59	0.93
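
For reference, PSNR can be computed from the mean squared error as sketched below, and the improvement ranges quoted in the text follow directly from the PSNR column of the table. The sketch assumes images normalized to the range [0, 1]; the function name and the dictionary are illustrative only.

```python
import numpy as np

def psnr(reconstructed, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB, assuming values in [0, max_value]."""
    mse = np.mean((reconstructed - target) ** 2)
    return float(10 * np.log10(max_value ** 2 / mse))

# PSNR values (dB) copied from the table above.
reported_psnr = {"Wirtinger": 31.88, "HoloNet": 33.83, "CCNN": 29.45, "Holomer": 35.59}

# Improvement of Holomer over each baseline, rounded to two decimals.
gains = {name: round(reported_psnr["Holomer"] - value, 2)
         for name, value in reported_psnr.items() if name != "Holomer"}
smallest_gain, largest_gain = min(gains.values()), max(gains.values())
```

The smallest gain is over HoloNet and the largest is over CCNN, matching the 1.76 dB to 6.14 dB range stated in the text.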

Figure 3. A comparative demonstration of numerical simulation results encompasses Wirtinger, HoloNet, CCNN, and our proposed Holomer, utilizing images sourced from the DIV2K validation set. In specific experiments, Wirtinger underwent 200 iterations, while HoloNet, CCNN, and Holomer underwent training for an equivalent duration on the DIV2K training set.

B. Experimental Demonstration

In the experiments, the laser source employed is the FISBA READYBeam, operating at wavelengths of 450, 520, and 638 nm. The spatial light modulator (SLM) used is the HOLOEYE LETO-3-CFS-127 with a pixel size of 6.4 μm. The experimental setup is shown in Fig. 5(a).

The display efficacy of holograms generated by Holomer was validated through optical reconstruction, with the results depicted in Fig. 4. The reconstructed images indicate that iterative algorithms are severely impacted by significant speckle noise. Additionally, CGH generation models based on CNNs are constrained by small receptive fields, unable to eliminate the influence of speckle noise. In contrast, Holomer, leveraging a diffraction-inspired module, successfully eradicates background speckle noise. Furthermore, Holomer exhibits superior local learning capabilities, such as enhanced detail representation and brightness fidelity, when compared to other methods. Experimental results demonstrate that Holomer achieves higher resolution in the reconstructed holographic images, showcasing improved detail texture, and enhanced contrast in darker regions.

Figure 4. Captured reconstructed images of the green channel (520 nm) for Wirtinger, HoloNet, CCNN, and Holomer methods, along with the corresponding ground truth image.

Figure 5. (a) Schematic diagram of the experimental setup used to verify the optical reconstruction of Holomer-generated holograms, using an aperture to eliminate diffracted images from other diffraction orders. (b), (c) Optical reconstruction images of our method in color channels directly captured by a camera. (d), (e) The corresponding color holograms loaded on the SLM.

To further substantiate the practical utility of our proposed methodology, the holograms generated in simulation are synthesized into a full-color holographic image via a temporal multiplexing approach. The holograms of the three RGB channels are sequentially loaded on the SLM and the light sources are synchronized to the corresponding wavelengths to achieve full-color display. Experimental results, as illustrated in Fig. 5, utilize images sourced from the DIV2K dataset. It is observable that the full-color display rendered by our method exhibits high contrast and clarity. The ringing artifacts at the edge occur because of signal truncation at the edges of the SLM and the SLM’s limited aperture [43].

4. CONCLUSION

In summary, we propose Holomer, a diffraction-aware deep learning model for the generation of high-quality phase-only holograms. Leveraging the sliding window self-attention mechanism and embedding-based feature dimensionality reduction, the model’s receptive field is significantly enhanced. A larger receptive field in Holomer aligns the inference with the diffraction process of holography, resulting in higher-quality holograms. Meanwhile, we propose a metric called perceptive index for evaluating the learning ability of network-based CGH models for diffraction processes. On the DIV2K dataset, Holomer achieved a PSNR of 35.59 dB and an SSIM of 0.93. Optical reconstruction validates that our method surpasses traditional iterative algorithms and those based on convolutional neural networks in terms of reconstruction quality. The effectiveness of full-color reconstruction demonstrates Holomer’s practical utility in the application domain of computer-generated holography.

Category: Holography, Gratings, and Diffraction

Received: Mar. 14, 2024

Accepted: Jul. 21, 2024

Published Online: Oct. 10, 2024

The Author Email: Zihan Geng (geng.zihan@sz.tsinghua.edu.cn)

CSTR:32188.14.PRJ.523650