Computer-generated holography (CGH) based on neural networks has been actively investigated in recent years, and convolutional neural networks (CNNs) are frequently adopted. A convolutional kernel captures local dependencies between neighboring pixels. However, in CGH, each pixel on the hologram influences all the image pixels on the observation plane, thus requiring a network capable of learning long-distance dependencies. To tackle this problem, we propose a CGH model called Holomer. Its single-layer perceptual field is 43 times larger than that of a widely used convolutional kernel, thanks to the embedding-based feature dimensionality reduction and multi-head sliding-window self-attention mechanisms. In addition, we propose a metric to measure the networks’ learning ability of the inverse diffraction process. In the simulation, our method demonstrated noteworthy performance on the DIV2K dataset at a resolution of , achieving a PSNR and an SSIM of 35.59 dB and 0.93, respectively. The optical experiments reveal that our results have excellent image details and no observable background speckle noise. This work paves the path of high-quality hologram generation.
1. INTRODUCTION
A holographic display retains the amplitude and phase information of the object and can physically reproduce the complete light wave emitted by the object, resulting in realistic displays [1–10]. Computer-generated holography (CGH), which eliminates the need for complex optical-interference-based recording processes and enables the display of physically non-existent objects, is the dominant technology for holographic displays [11–18].
Contemporary mainstream CGH algorithms utilizing artificial neural networks mainly employ convolutional neural networks (CNNs) [5,19–34]. CNNs have demonstrated a robust performance in image processing tasks owing to their translational invariance and the inductive bias of local receptive fields [35]. However, in the hologram generation process, as each pixel on the hologram can act as a point light source affecting the entire reconstructed image, the network’s capacity to learn non-local features significantly impacts the quality of the generated hologram [31]. Unfortunately, the receptive fields of current CNN-based CGH models are typically much smaller than the image size, and thus cannot effectively learn long-range dependencies.
A more versatile neural network model, the Transformer, has been proposed and has demonstrated superior performance over CNNs in a multitude of computer vision tasks. This superiority is primarily attributed to the Transformer’s capacity to capture long-range dependencies in images through a global self-attention mechanism [36–38]. With global self-attention, a Transformer can learn the relationships between holograms and target images across the whole image within a single network layer. However, the computational demands of Transformers far exceed those of CNNs, which limits their applicability to high-resolution images.
In this work, we propose a diffraction-aware CGH model called Holomer. Through the sliding window self-attention mechanism and embedding-based feature dimensionality reduction, Holomer is capable of learning the complex relationship between target images and holograms with fewer layers compared to some existing methods [10,24,34]. At the same time, we propose a metric called perceptive index (PI). It evaluates the network-based CGH model’s ability to learn the inverse diffraction process. Holomer achieves a PSNR and an SSIM of 35.59 dB and 0.93, respectively, on the DIV2K dataset. The fidelity of the generated holograms with a resolution of is evaluated through numerical simulations and experimental validation.
2. METHOD
Holomer generates holograms using a two-stage architecture, as shown in Fig. 1. The target phase generator estimates the phase of the target amplitude. The predicted phase and the target amplitude are then propagated to the spatial light modulator (SLM) plane using the angular spectrum method (ASM) [39,40]. The amplitude and phase at the SLM plane are converted to a phase-only hologram by a phase-only hologram encoder. During training, the generated phase-only hologram is propagated further to the imaging plane through ASM. The amplitude reconstructed from the hologram is compared with the target amplitude, and the mean squared error (MSE) between the two amplitudes is computed as the loss. This loss is used to update the parameters of both the target phase generator and the phase-only hologram encoder. The MSE loss is calculated as follows:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(A_{i}^{\mathrm{rec}} - A_{i}^{\mathrm{tar}}\right)^{2},$$

where $A_{i}^{\mathrm{rec}}$ and $A_{i}^{\mathrm{tar}}$ represent the amplitude values of the reconstructed and target images at the $i$th pixel, respectively, and $N$ is the total number of pixels.
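The training pipeline described above can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: `asm_propagate` and `amplitude_mse` are hypothetical helper names, the two networks are replaced by simple stand-ins (a random phase and a phase-extraction step), and the sign convention for backward propagation (negative distance) is an assumption. The wavelength, pixel size, and distance are the values quoted in the text.

```python
import numpy as np

def asm_propagate(field, wavelength, pitch, distance):
    """Propagate a complex field over `distance` (metres) with the angular spectrum method."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pitch)          # spatial frequencies along x (1/m)
    fy = np.fft.fftfreq(ny, d=pitch)          # spatial frequencies along y (1/m)
    fxx, fyy = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Transfer function; evanescent components (arg <= 0) are discarded.
    transfer = np.where(arg > 0, np.exp(1j * kz * distance), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)

def amplitude_mse(recon_field, target_amp):
    """Mean squared error between reconstructed and target amplitudes."""
    return float(np.mean((np.abs(recon_field) - target_amp) ** 2))

# Toy forward pass with stand-ins for the two networks (assumptions, not the real models).
rng = np.random.default_rng(0)
target_amp = rng.random((64, 64))
predicted_phase = rng.uniform(0.0, 2.0 * np.pi, (64, 64))   # stand-in: target phase generator
wavelength, pitch, z = 520e-9, 6.4e-6, 0.2

target_field = target_amp * np.exp(1j * predicted_phase)
slm_field = asm_propagate(target_field, wavelength, pitch, -z)   # image plane -> SLM plane
hologram = np.exp(1j * np.angle(slm_field))                      # stand-in: phase-only encoder
recon_field = asm_propagate(hologram, wavelength, pitch, z)      # SLM plane -> image plane
loss = amplitude_mse(recon_field, target_amp)
```

In an actual training loop the same steps would be written in an automatic-differentiation framework so that the loss gradient can reach both networks.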
Figure 1. (a) Schematic of Holomer’s (our method) two-stage architecture and its training pipeline. (b) Network architecture schematic for the target phase generator and the phase-only hologram encoder. The target phase generator receives a single-channel target amplitude as input. The input of the phase-only hologram encoder is dual-channel amplitude and phase images. The U-shaped network architecture ensures the consistency between output and input dimensions, enhancing the network’s capability to learn multi-level features. (c) Holomer block schematic; each Holomer block contains two modules, window multihead self-attention (W-MSA) and sliding window multihead self-attention (SW-MSA). The two diagrams correspond to the receptive field of the two modules. By employing the sliding window self-attention mechanism, the Holomer block significantly enhances its receptive field, thereby improving its ability to learn long-range features of the diffraction process.
Holomer maintains resolution consistency between the generated hologram and the input image through a U-Net architecture, as depicted in Fig. 1(b). The model incorporates the residual connections of U-Net [35] for efficient training. The Holomer block, which combines window self-attention and sliding window self-attention, first calculates self-attention within each independent window. The windows are then shifted by half the window size, both vertically and horizontally, which introduces cross-window connections as shown in Fig. 1(c). This shifting mechanism ensures that subsequent layers capture interactions beyond the initially defined windows, enabling the model to integrate local and global information effectively.
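The window and shifted-window steps can be illustrated with a small NumPy sketch in the style of the Swin Transformer [38]. The function names, the 8×8 toy feature map, and the window size of 4 are assumptions chosen for illustration, and the cyclic shift by half a window is an interpretation of the shifting step. The attention masking that Swin applies to wrapped-around regions is omitted.

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping (m, m, C) windows."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)

def shift_windows(x, m):
    """Cyclically shift the map by half a window so new windows straddle old borders."""
    return np.roll(x, shift=(-(m // 2), -(m // 2)), axis=(0, 1))

# Label every position of an 8x8 map with the index of its original 4x4 window.
ids = (np.arange(8)[:, None] // 4) * 2 + (np.arange(8)[None, :] // 4)
feature_map = ids[..., None]                       # shape (8, 8, 1)

plain = window_partition(feature_map, 4)           # windows before shifting
shifted = window_partition(shift_windows(feature_map, 4), 4)

print(np.unique(plain[0]))     # a plain window sees one original window
print(np.unique(shifted[0]))   # a shifted window mixes several original windows
```

After the shift, a single window contains positions from four of the original windows, which is how attention computed in the second module links regions that the first module treated independently.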
According to angular spectrum theory [41], the reconstructed image of a hologram at a specific depth can be obtained by convolving the hologram with the diffraction impulse response (DIR) $h$, which depends on the wave number $k = 2\pi/\lambda$, where $\lambda$ denotes the wavelength of the light source. The computational process is illustrated in Fig. 2.
Figure 2. (a), (b) Illustration of the amplitude and phase distribution of diffraction impulse response (DIR), respectively. In the computation of the DIR, the wavelength employed is 520 nm, the pixel size is 6.4 μm, and the propagation distance is 20 cm. (c) The schematic illustration for image reconstruction through the convolution of DIR with a hologram. To fulfill the conditions of linear convolution, it is necessary to zero-pad the hologram to twice its original size before proceeding with the convolution. (d) A comparative schematic illustration contrasting the receptive fields of Holomer (ours) and a single-layer CNN, along with their respective coverage of the DIR region size. (e) The curve illustrating the variation in the proportion of the sum of intensities over the region covered by the receptive field to the total DIR intensity with respect to changes in the receptive field size, annotated with markers for a single-layer CNN and Holomer. The DIR coverage of Holomer exhibits a notable improvement compared to CNN.
Since the angular spectrum method is a rigorous solution to the Rayleigh-Sommerfeld diffraction formula, we can derive the diffraction impulse response (DIR) as follows:
In the formula, $r$ denotes the distance between the source point and the observation point.
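For reference, the impulse response of the first Rayleigh-Sommerfeld solution, consistent with standard Fourier optics treatments such as [41], can be written as below. The symbols $h$, $z$ (propagation distance), and $(x, y)$ (transverse offset between the source and observation points) are notation assumed here and may differ from the authors' own.

```latex
\begin{aligned}
h(x, y; z) &= \frac{1}{2\pi}\,\frac{z}{r}\left(\frac{1}{r} - jk\right)\frac{e^{jkr}}{r},
\qquad r = \sqrt{x^{2} + y^{2} + z^{2}}, \quad k = \frac{2\pi}{\lambda}, \\
h(x, y; z) &\approx \frac{z}{j\lambda}\,\frac{e^{jkr}}{r^{2}}
\qquad \text{for } r \gg \lambda .
\end{aligned}
```

The second line follows from the first because the $1/r$ term is negligible next to $k$ when $r \gg \lambda$, and $-jk/(2\pi) = 1/(j\lambda)$.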
The current common CNN-based CGH models are limited by the small receptive field of CNNs and cannot learn the whole diffraction process in a single layer. However, this can be achieved using a global self-attention mechanism:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices derived from multiplying the input features with three learnable parameter matrices $W_Q$, $W_K$, and $W_V$, and $d_k$ is the dimensionality of the key. Despite the extraordinary performance of the Transformer in computing holograms, its computational demands are substantial. Calculating self-attention for a image requires 17,060 GB of GPU memory, making it infeasible.
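A minimal NumPy sketch of scaled dot-product self-attention [36] makes the memory problem concrete: the score matrix has one entry for every pair of tokens. The helper `attention_matrix_bytes` is hypothetical, it counts only a single score matrix with one token per pixel and 4-byte values, and the 1024×1024 example resolution is an assumption; it is not intended to reproduce the memory figure quoted in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head global self-attention over a (tokens, features) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (tokens, tokens) score matrix
    return softmax(scores) @ v

def attention_matrix_bytes(height, width, bytes_per_value=4):
    """Size of one score matrix when every pixel is a token (hypothetical estimate)."""
    n = height * width
    return n * n * bytes_per_value

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))          # 16 tokens with 8 features each
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)

tib = attention_matrix_bytes(1024, 1024) / 2**40   # score matrix size in TiB
```

Because the score matrix grows with the square of the token count, halving the image side length cuts its size by a factor of 16, which motivates both dimensionality reduction and windowed attention.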
Therefore, model design involves a trade-off between computational efficiency and the receptive field. Holomer employs two designs to address this issue. First, an embedding layer at the input reduces the dimension of the target image, so that each pixel in the low-dimensional feature map contains information from 16 pixels of the original image. Second, the main structure of the model employs window self-attention layers and sliding-window self-attention layers, with each window size being . After computing the window self-attention, the windows are shifted to compute sliding window self-attention, which expands the receptive field by a factor of 1.5. The receptive field of each Holomer block is , which is 43 times that of the convolutional layer.
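The embedding step can be sketched as follows, assuming that the 16 source pixels form a non-overlapping 4×4 patch and that the embedding is a single linear projection. Both are assumptions, as is the function name `patch_embed`; the text states only that each feature-map pixel carries information from 16 image pixels.

```python
import numpy as np

def patch_embed(image, weight, patch=4):
    """Map each non-overlapping patch x patch block of a 2D image to an embedding vector.

    image:  (H, W) array with H and W divisible by `patch`
    weight: (patch * patch, embed_dim) projection matrix (learned in a real model)
    """
    h, w = image.shape
    blocks = image.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // patch, w // patch, patch * patch)
    return blocks @ weight                      # (H / patch, W / patch, embed_dim)

image = np.arange(64, dtype=float).reshape(8, 8)
averaging = np.full((16, 1), 1.0 / 16.0)        # toy projection: mean of each patch
features = patch_embed(image, averaging)        # shape (2, 2, 1)
```

Under these assumptions the token count drops by a factor of 16, so a full attention score matrix over the embedded map would be 256 times smaller than one over the raw pixels.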
To directly compare the capabilities of CNNs and Holomer in learning the diffraction process, we propose a metric called the perceptive index (PI), which is the ratio of the intensity integral of the DIR covered by the receptive field of a single module, $I_{\mathrm{RF}}$, to the total intensity integral of the DIR, $I_{\mathrm{total}}$. The definition is as follows:

$$\mathrm{PI} = \frac{I_{\mathrm{RF}}}{I_{\mathrm{total}}}.$$
A higher value of PI indicates that the receptive field of the module covers a larger area of the DIR, signifying stronger learning capabilities for the diffraction model. The calculated results are shown in Fig. 2(e). The DIR used for this computation is based on a pixel size of 6.4 μm, a holographic image resolution of , and a propagation distance of 20 cm. It is noteworthy that using a larger receptive field can indeed enhance the learning capabilities, but this also increases the computational complexity. Additionally, due to the spherical wave characteristics of the energy distribution in DIR, the intensity at positions far from the central point is quite low, imposing minimal impact on the reconstructed image. Therefore, the performance improvement gained from enlarging the receptive field has diminishing returns. Considering these trade-offs, we set the window size of Holomer to .
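The ratio described above can be computed numerically for any sampled DIR. The sketch below is illustrative only: `perceptive_index` is a hypothetical helper that assumes a square receptive field centered on the DIR, and the Gaussian-weighted chirp used as input is a toy stand-in, not the DIR from the paper, so the resulting values do not correspond to those in Fig. 2(e) or Table 1.

```python
import numpy as np

def perceptive_index(dir_field, rf_size):
    """Fraction of total DIR intensity inside a centered rf_size x rf_size window."""
    intensity = np.abs(dir_field) ** 2
    ny, nx = intensity.shape
    if rf_size > min(ny, nx):
        raise ValueError("receptive field larger than the sampled DIR")
    y0, x0 = (ny - rf_size) // 2, (nx - rf_size) // 2
    covered = intensity[y0:y0 + rf_size, x0:x0 + rf_size].sum()
    return float(covered / intensity.sum())

# Toy DIR: quadratic phase with an amplitude that decays away from the center.
n = 65
yy, xx = np.mgrid[:n, :n] - n // 2
r2 = xx**2 + yy**2
toy_dir = np.exp(-r2 / (2 * 10.0**2)) * np.exp(1j * 0.05 * r2)

sizes = (3, 15, 33, 65)
pis = [perceptive_index(toy_dir, s) for s in sizes]
```

For a DIR whose intensity falls off with distance from the center, the values in `pis` rise quickly for small windows and then level off, which mirrors the diminishing returns noted in the text.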
3. RESULTS
A. Performance Comparison
To validate Holomer’s performance in generating holograms, we evaluate the reconstructed images through numerical simulations and optical experiments. The dataset used for training and testing is DIV2K [42]; the training and test sets contain 800 and 100 high-resolution images, respectively. The network is trained on an NVIDIA RTX 3090 GPU. We compare the performance of Holomer with an iterative algorithm, the Wirtinger algorithm [10], and with two model-driven self-supervised models, HoloNet [24] and CCNN [34].
Table 1 compares the performance of Wirtinger, HoloNet, CCNN, and Holomer on the DIV2K test dataset. In the simulation, we trained three separate networks to generate holograms at wavelengths of 450, 520, and 638 nm, respectively. The reconstruction distance is 20 cm. Each network is trained using the corresponding color channel of the training images. Holomer’s PSNR is 1.76 dB to 6.14 dB higher than that of the other algorithms, and its SSIM is 0.02 to 0.11 higher. These metrics demonstrate a clear improvement in holographic fidelity over the compared algorithms. Figure 3 shows the simulated reconstructions of the four algorithms, together with Holomer’s full-color simulation, using images from the DIV2K test set. The reconstructions generated by the Wirtinger algorithm preserve fine detail but are degraded by pronounced speckle noise resulting from its random initialization. Holomer produces the highest reconstruction quality among the learning-based models and also achieves the best metrics of all the presented algorithms.
Table 1. Performance Comparison of CGH Algorithms

| Algorithm | Perceptive Index | PSNR (dB) | SSIM |
| --- | --- | --- | --- |
| Wirtinger | None | 31.88 | 0.82 |
| HoloNet | 0.6% | 33.83 | 0.91 |
| CCNN | 0.6% | 29.45 | 0.84 |
| Holomer (ours) | 32.6% | 35.59 | 0.93 |
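For readers unfamiliar with the PSNR values reported in Table 1, the metric can be computed as in the following sketch. It assumes images normalized to a peak value of 1.0; whether the authors evaluate PSNR on amplitude or intensity images is not stated here, so the function is a generic illustration rather than their evaluation code.

```python
import numpy as np

def psnr(recon, target, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((np.asarray(recon, dtype=float) - np.asarray(target, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")                    # identical images
    return float(10.0 * np.log10(peak**2 / mse))

target = np.zeros((4, 4))
recon = np.full((4, 4), 0.1)                   # uniform error of 0.1, so MSE = 0.01
value = psnr(recon, target)                    # about 20 dB
```

Since PSNR is logarithmic in the MSE, a gain of about 1.76 dB corresponds to roughly a one-third reduction in mean squared error.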
Figure 3. Comparison of numerical simulation results for Wirtinger, HoloNet, CCNN, and our proposed Holomer, using images from the DIV2K validation set. Wirtinger was run for 200 iterations, while HoloNet, CCNN, and Holomer were trained for the same duration on the DIV2K training set.
The laser source used in the experiments is the FISBA READYBeam, operating at wavelengths of 450, 520, and 638 nm. The spatial light modulator (SLM) is the HOLOEYE LETO-3-CFS-127, with a pixel size of 6.4 μm. The experimental setup is shown in Fig. 5(a).
The display quality of holograms generated by Holomer was validated through optical reconstruction, with the results shown in Fig. 4. The reconstructed images indicate that the iterative algorithm suffers from severe speckle noise. CNN-based CGH models are constrained by their small receptive fields and cannot eliminate speckle noise. In contrast, Holomer, which uses a diffraction-inspired module, eliminates observable background speckle noise. Holomer also exhibits better local learning capabilities, such as improved detail representation and brightness fidelity, than the other methods. The experimental results show that Holomer achieves higher resolution in the reconstructed holographic images, with improved detail texture and enhanced contrast in darker regions.
Figure 4. Captured reconstructed images of the green channel (520 nm) for Wirtinger, HoloNet, CCNN, and Holomer methods, along with the corresponding ground truth image.
Figure 5. (a) Schematic diagram of the experimental setup used to verify the optical reconstruction of Holomer-generated holograms; an aperture is used to block the other diffraction orders. (b), (c) Optical reconstruction images of our method in color channels directly captured by a camera. (d), (e) The corresponding color holograms loaded on the SLM.
To further substantiate the practical utility of our proposed methodology, the holograms generated in simulation are synthesized into a full-color holographic image via a temporal multiplexing approach. The holograms of the three RGB channels are sequentially loaded on the SLM and the light sources are synchronized to the corresponding wavelengths to achieve full-color display. Experimental results, as illustrated in Fig. 5, utilize images sourced from the DIV2K dataset. It is observable that the full-color display rendered by our method exhibits high contrast and clarity. The ringing artifacts at the edge occur because of signal truncation at the edges of the SLM and the SLM’s limited aperture [43].
4. CONCLUSION
In summary, we propose Holomer, a diffraction-aware deep learning model for the generation of high-quality phase-only holograms. Leveraging the sliding window self-attention mechanism and embedding-based feature dimensionality reduction, the model’s receptive field is significantly enhanced. A larger receptive field in Holomer aligns the inference with the diffraction process of holography, resulting in higher-quality holograms. Meanwhile, we propose a metric called perceptive index for evaluating the learning ability of network-based CGH models for diffraction processes. On the DIV2K dataset, Holomer achieved a PSNR of 35.59 dB and an SSIM of 0.93. Optical reconstruction validates that our method surpasses traditional iterative algorithms and those based on convolutional neural networks in terms of reconstruction quality. The effectiveness of full-color reconstruction demonstrates Holomer’s practical utility in the application domain of computer-generated holography.
[23] N. Muramatsu, C. W. Ooi, Y. Itoh. Deepholo: recognizing 3D objects using a binary-weighted computer-generated hologram. SIGGRAPH Asia 2017 Posters, 1-2 (2017).
[35] O. Ronneberger, P. Fischer, T. Brox. U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Proceedings, Part III, 234-241 (2015).
[36] A. Vaswani, N. Shazeer, N. Parmar. Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000-6010 (2017).
[38] Z. Liu, Y. Lin, Y. Cao. Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012-10022 (2021).
[41] J. W. Goodman. Introduction to Fourier Optics (2005).
[42] E. Agustsson, R. Timofte. NTIRE 2017 challenge on single image super-resolution: dataset and study. Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 1122-1131 (2017).
Yiran Wei, Yiyun Chen, Mi Zhou, Mu Ku Chen, Shuming Jiao, Qinghua Song, Xiao-Ping Zhang, Zihan Geng, "Speckle-free holography with a diffraction-aware global perceptual model," Photonics Res. 12, 2418 (2024)