Chinese Optics Letters, Volume 23, Issue 9, 090501 (2025)

High-quality hologram generation based on a complex-valued hierarchical multi-fusion neural network

Jiahui Fu1, Wenqiang Wan1,*, Yunrui Wang1, and Yanfeng Su2
Author Affiliations
  • 1School of Science, East China Jiaotong University, Nanchang 330013, China
  • 2College of Optical and Electronic Technology, China Jiliang University, Hangzhou 310018, China

    In this paper, we propose a novel complex-valued hierarchical multi-fusion neural network (CHMFNet) for generating high-quality holograms. The proposed architecture builds upon a U-Net framework, incorporating a complex-valued multi-level perceptron (CMP) module that enhances complex feature representation through optimized convolutional operations and advanced activation functions, enabling effective extraction of intricate holographic patterns. The framework further integrates an innovative complex-valued hierarchical multi-fusion (CHMF) block, which implements multi-scale hierarchical processing and advanced feature fusion through its specialized design. This integration of complex-valued convolution and specialized CHMF design enables superior optical information representation, generating artifact-reduced high-fidelity holograms. The computational results demonstrate the superior performance of the proposed method, achieving an average peak signal-to-noise ratio (PSNR) of 34.11 dB and structural similarity index measure (SSIM) of 0.95, representing significant improvements over conventional approaches. Both numerical simulations and experimental validations confirm CHMFNet’s enhanced capability in hologram generation, particularly in terms of detail reproduction accuracy and overall image fidelity.


    1. Introduction

    With the rapid development of virtual reality (VR), augmented reality (AR), and three-dimensional (3D) display technologies, computer-generated holography (CGH), as a core technology for achieving high-quality visual experiences, has become a research hotspot in both academia and industry[1–5]. Traditional iterative algorithms for CGH, such as the Gerchberg-Saxton (GS) and stochastic gradient descent (SGD) algorithms, are time-consuming because they require several tens of iterations to achieve high-quality reconstructed images, making them unsuitable for real-time holographic display[6–9]. Non-iterative algorithms, such as double-phase amplitude encoding (DPAC) and error diffusion, are comparatively fast; nevertheless, their reconstructed images often suffer from severe speckle noise and artifacts, seriously degrading image quality and visual effect[10–12]. Traditional CGH algorithms therefore face a fundamental challenge in balancing reconstructed image quality against computational efficiency.

    The rise of deep learning has opened a new avenue for solving this problem. Data-driven deep learning CGH relies on a supervised training strategy built on large-scale labeled datasets, which not only increases the computational burden but also ties the quality and generalization ability of the generated holograms to the quality of the labeled training data[13–15]. In contrast, model-driven CGH approaches facilitate unsupervised learning of holographic latent representations through the integration of physical diffraction models within the network architecture. This methodology eliminates the requirement for labeled training datasets while enabling direct hologram prediction through physics-informed neural computation[16]. Wu et al. developed the Holo-encoder based on an autoencoder neural network architecture, which can autonomously learn the potential encoding of phase-only holograms in an unsupervised manner[17]. Its innovation lies in the self-learning ability of the autoencoder, but it still has limitations in reconstructing complex scenes. Peng et al. proposed a new neural network architecture, HoloNet, which optimizes the diffraction model and adopts a camera-in-the-loop strategy to effectively reduce artifacts in optical reconstructions and improve hologram quality[18]. Zheng et al. applied a hybrid-domain loss training strategy to enhance the quality of reconstructed images[19]. Zhou et al. introduced an end-to-end compression-aware framework for computer-generated holography, specifically designed to overcome the challenges associated with holographic data storage and transmission. This approach integrates data compression directly into the hologram synthesis pipeline, optimizing both reconstruction quality and computational efficiency[20]. Liu et al. proposed a novel approach that integrates deep learning methodologies to enhance the conventional error diffusion algorithm, with the objective of enabling efficient generation of high-quality holograms[21]. Liu et al. developed an innovative lightweight local and global self-attention network (LGSANet) architecture that implements a self-attention mechanism for hierarchical feature extraction in holographic reconstruction[22]. By deploying this optimized network architecture, the proposed solution achieves a significant reduction in computational complexity while enhancing the quality and generation speed of holograms.

    However, most existing deep learning methods are based on real-valued neural network architectures. When dealing with complex-valued light wave fields, they typically treat them as two-channel spatial domain images, mapping the amplitude and phase components onto real-valued convolution kernels by concatenating them along the channel dimension[23,24]. This encoding approach fails to fully capitalize on the computational properties of complex amplitudes, restricting the further enhancement of reconstructed image quality. To surmount this obstacle, Zhong et al. introduced a complex-valued convolutional neural network (CCNN) to directly learn the complex-valued wave field in the spatial light modulator (SLM) plane and generate higher-quality phase-only holograms[25]. Qin et al. proposed a complex-valued generative adversarial network (CV-GAN) to effectively enhance hologram quality and generation speed by employing a complex-valued U-Net generator and an adversarial learning mechanism, offering a novel approach for real-time high-quality CGH[26]. Zhang et al. developed a novel residual-block-based complex-valued convolutional neural network (ResC-CNN) that significantly improves complex-domain feature representation in computer-generated holography. This architecture enhances the network’s capability to process phase and amplitude information while optimizing the learning efficiency through residual connections[27]. In addition, complex amplitude modulation (CAM) methods have brought new perspectives to the encoding of complex amplitude information[28,29]. These methods encode the target complex amplitude into two distinct real-valued phase components. Then, through the application of complementary two-dimensional binary gratings, these components are sampled and fused to generate a double-phase hologram (DPH). Nevertheless, CAM methods may introduce additional noise or artifacts during the encoding and decoding processes, which can degrade the final hologram quality. On the other hand, the field of deep-learning-based image super-resolution provides a relevant perspective for holographic applications[30]. In the holographic domain, these super-resolution techniques share fundamental similarities with conventional image processing in terms of information representation and reconstruction. Consequently, the theoretical frameworks and methodologies developed for image super-resolution can be leveraged to derive valuable insights for advancing holographic technology. Such cross-disciplinary integration may facilitate novel solutions to the existing challenges in hologram generation.

    In this paper, we propose a novel complex-valued hierarchical multi-fusion neural network (CHMFNet) for high-fidelity hologram generation. The proposed framework builds upon a U-Net backbone, incorporating a complex-valued multi-level perceptron (CMP) module that significantly enhances complex feature representation through optimized convolutional operations and advanced activation functions. This architectural innovation enables precise extraction and processing of intricate holographic patterns inherent in input data, facilitating superior feature representation and reconstruction accuracy. Furthermore, the integration of an innovative complex-valued hierarchical multi-fusion (CHMF) block into the architecture enables multi-level hierarchical processing and advanced feature fusion and refinement. Compared with existing complex-valued neural network methods, the CHMF integrates two specialized components: a hologram-related complex local and non-local aggregation (HCLNA) module and a hologram-oriented complex progressive local feature (HCPLF) module, enabling superior feature modulation and optimization. The HCLNA module’s unique capability to simultaneously process local and non-local features within a complex-valued framework represents a substantial improvement in capturing intricate spatial relationships inherent in holograms. Complementing this functionality, the HCPLF module implements progressive feature refinement, significantly enhancing the network’s capacity for complex hologram generation tasks. These architectural innovations not only overcome fundamental limitations of conventional approaches but also establish new directions for hologram generation algorithm development. Simulation and experimental results confirm that CHMFNet generates holograms with enhanced detail precision and superior image quality over conventional methods. These capabilities demonstrate its strong potential for holographic display applications.

    2. Principles and Methods

    2.1. CHMFNet algorithm principle

    Figure 1 illustrates the fundamental principle of the proposed network architecture. Our architecture adopts complex-valued neural networks to directly process complex-valued data, which is better suited to preserving the phase and amplitude information of optical fields[31]. The architecture incorporates two distinct CHMFNets. The first operates as a complex amplitude generator, responsible for predicting the complex amplitude distribution at the target plane and refining the diffraction results to serve as input for the second network. The second network functions as a phase optimizer, which generates the complex amplitude at the SLM plane, enabling the collaborative generation of the final hologram. Both networks play a pivotal role in the computation of holograms.


    Figure 1. Fundamental principle of the proposed network architecture.

    For a given input image, the intensity information A of its initial complex amplitude is multiplied by a zero-phase component and converted into a complex-valued representation. This serves as the input to the CHMFNet1 model, which generates the complex amplitude I1 at the target plane. This process can be mathematically expressed as

    I1 = A·exp(iφ),

    where φ denotes the phase predicted by CHMFNet1. The diffraction simulation from the target plane to the SLM plane is performed using the forward propagation angular spectrum diffraction method (ASDM)[32], which can be represented as

    H1 = F⁻¹{F[I1]·T1(fx, fy)},

    T1(fx, fy) = exp[i(2π/λ)z·√(1 − λ²fx² − λ²fy²)] if √(fx² + fy²) < 1/λ, and 0 otherwise,

    where F[·] denotes the Fourier transform and F⁻¹[·] represents the inverse Fourier transform. T1(fx, fy) corresponds to the transfer function of the ASDM. fx and fy represent the spatial frequencies along the x- and y-axes, respectively, while λ and z represent the wavelength and diffraction distance, respectively. The complex amplitude H1, obtained by propagating the output of CHMFNet1, serves as the input to the second network. Subsequently, CHMFNet2 predicts and optimizes the complex amplitude I2 at the SLM plane, which is then converted into a phase-only hologram. The image reconstruction of the predicted complex amplitude I2 from the SLM plane to the image plane can be calculated using the backward propagation ASDM. The mean square error (MSE) loss function is used to calculate the average squared difference between the initial input amplitude and the reconstructed amplitude. To enhance the model’s accuracy, the convolution kernels are iteratively updated based on the loss value through the Adam optimization algorithm, minimizing the discrepancy between the predicted and target values[33].
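    To make the propagation step concrete, the following is a minimal PyTorch sketch of the band-limited ASDM transfer function and propagation described above; the function name asm_propagate and its defaults (532 nm wavelength, 4.5 µm pitch, matching the settings in Sec. 3) are ours, not code released by the authors.

```python
import torch

def asm_propagate(field, z, wavelength=532e-9, pitch=4.5e-6):
    """Band-limited angular spectrum propagation over a distance z (meters).

    field: complex tensor (..., H, W). A negative z gives the backward
    propagation used to reconstruct the image plane from the SLM plane.
    """
    H, W = field.shape[-2], field.shape[-1]
    fy = torch.fft.fftfreq(H, d=pitch)          # spatial frequencies along y
    fx = torch.fft.fftfreq(W, d=pitch)          # spatial frequencies along x
    FY, FX = torch.meshgrid(fy, fx, indexing="ij")
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    mask = arg > 0                              # keep propagating waves only
    kz = torch.sqrt(torch.clamp(arg, min=0.0))
    T = torch.where(mask,
                    torch.exp(1j * (2 * torch.pi / wavelength) * z * kz),
                    torch.zeros((), dtype=torch.cfloat))
    return torch.fft.ifft2(torch.fft.fft2(field) * T)
```

    For example, H1 = asm_propagate(I1, z=0.2) evaluates the forward transfer for a 200 mm propagation distance, while a negative z implements the backward ASDM used for image reconstruction.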

    2.2. Framework of CHMFNet

    The overall network architecture integrates multiple core modules that operate in a synergistic manner. Its design is meticulously tailored to address the processing and optimization of the complex optical properties inherent in holograms. Figure 2 illustrates the schematic architecture of the proposed CHMFNet. The network architecture is built upon a U-Net framework[34], comprising four downsampling blocks (DBs), four upsampling blocks (UBs), and skip connections (SCs). Each DB integrates a complex-valued convolutional layer (CConv) with 3×3 kernels, a complex-valued rectified linear unit (CRelu) activation layer, a CMP module, and a CHMF block. The initial three UBs (denoted UB1) incorporate a CConv layer with 4×4 kernels, a CRelu layer, and a CHMF block, while the final UB (denoted UB2) contains a CConv layer with 4×4 kernels and a CHMF block, ensuring comprehensive complex amplitude output generation.


    Figure 2. Schematic representation of the proposed CHMFNet.

    The CConv layer is widely employed in neural networks for processing complex-valued data, with its operational mechanism grounded in specific mathematical formulations. Given an input complex tensor x, let xr and xi denote its real and imaginary components, respectively. Similarly, for the convolution kernel k, let kr and ki represent the real and imaginary parts, respectively. The convolution operation is independently performed on the real and imaginary components of the input tensor. The resultant real (yr) and imaginary (yi) components are computed through the following mathematical formulations:

    yr = xr*kr − xi*ki,

    yi = xr*ki + xi*kr.

    These outputs are then combined to form the complex convolution output:

    CConv(Y) = yr + i·yi,

    where Y represents the input complex-valued matrix. The CConv layer performs precise extraction and processing of both amplitude and phase information from complex optical data for hologram generation. By effectively capturing these features, it generates essential representations for downstream network modules, enabling accurate holographic reconstruction. The CRelu layer integrates nonlinearity into the neural network architecture by applying the ReLU activation function to both the real and imaginary components of the input complex-valued matrix. This nonlinear transformation enables the network to learn more complex relationships in the feature space, enhancing its ability to model the complex optical phenomena in hologram generation.
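    As a concrete illustration of the CConv and CRelu operations, the sketch below implements them in PyTorch with two real-valued convolutions per complex layer; the class names are ours, and this is a minimal rendering rather than the authors' released code. The later sketches in this section reuse these two classes.

```python
import torch
import torch.nn as nn

class CConv2d(nn.Module):
    """Complex convolution: y = (xr*kr - xi*ki) + i(xr*ki + xi*kr)."""
    def __init__(self, in_ch, out_ch, kernel_size, **kwargs):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)  # kernel kr
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)  # kernel ki

    def forward(self, x):                                # x: complex (N, C, H, W)
        yr = self.conv_r(x.real) - self.conv_i(x.imag)   # yr = xr*kr - xi*ki
        yi = self.conv_i(x.real) + self.conv_r(x.imag)   # yi = xr*ki + xi*kr
        return torch.complex(yr, yi)

class CReLU(nn.Module):
    """ReLU applied independently to the real and imaginary parts."""
    def forward(self, x):
        return torch.complex(torch.relu(x.real), torch.relu(x.imag))
```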

    The CMP module serves as a critical component in the network’s feature-processing pipeline. During initialization, it automatically configures parameters based on input dimensions and a predefined growth rate. The module initially processes input features through a grouped complex-valued convolutional layer, where the grouping mechanism is determined by the channel-wise correlation patterns of the input features. This architecture enables selective feature extraction by capturing distinct feature representations across different groups, thereby enhancing the specificity and efficiency of feature learning. Following initial feature extraction, the intermediate features undergo dimensionality transformation through a 1×1 complex-valued convolutional layer. Subsequently, the CRelu activation function is applied to introduce nonlinear transformations. For a complex feature Z with real part Zr and imaginary part Zi, the complex representation and its activation are given by

    Z = Zr + i·Zi,

    CReLU(Z) = ReLU(Zr) + i·ReLU(Zi).

    The feature dimension is ultimately transformed to align with the input dimension. This architectural design enables the CMP module to perform in-depth processing and transformation of the input features. When dealing with holographic scenes comprising multiple objects with complex background structures, the CMP module demonstrates superior capability in extracting critical feature representations, including detailed texture patterns and precise spatial relationships. By augmenting the network’s capacity for complex feature representation, the CMP module significantly enhances the extraction of intricate holographic features embedded in the input image, improving the fidelity of holographic reconstruction.
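    One plausible arrangement of the CMP stages (grouped CConv, 1×1 dimensionality transform, CReLU activation, and projection back to the input dimension) is sketched below, reusing the CConv2d and CReLU classes above. The concrete group count and growth rate are illustrative assumptions, since the paper states they are configured from the input dimensions; the channel count must be divisible by the group count.

```python
class CMP(nn.Module):
    """Sketch of the CMP module: grouped CConv -> 1x1 expansion -> CReLU
    -> 1x1 reduction back to the input dimension."""
    def __init__(self, channels, growth_rate=2, groups=4):  # assumed values
        super().__init__()
        hidden = channels * growth_rate
        self.grouped = CConv2d(channels, channels, 3, padding=1, groups=groups)
        self.expand = CConv2d(channels, hidden, 1)      # dimensionality transform
        self.act = CReLU()
        self.reduce = CConv2d(hidden, channels, 1)      # align with input dimension

    def forward(self, x):
        return self.reduce(self.act(self.expand(self.grouped(x))))
```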

    The CHMF module, serving as the core component for feature modulation and optimization, integrates the functions of the HCLNA module and the HCPLF module. Specifically, the HCLNA module plays an essential role in extracting local and non-local feature representations from holographic data. As shown in Fig. 3(a), the HCLNA module operates through the following process: the input features are initially decomposed into two distinct components. For one component, the module employs adaptive max-pooling with an 8× downsampling factor to extract low-frequency features. This adaptive pooling mechanism dynamically selects representative low-frequency information according to the feature distribution characteristics, ensuring optimal feature representation. Compared with traditional pooling methods, this method demonstrates superior capability in preserving features essential for downstream processing. During low-frequency feature extraction, the HCLNA module simultaneously computes the corresponding variance xv, as mathematically described by the following equation:

    xv = (1/N)·Σₙ(xn − μ)², n = 1, …, N,

    where N represents the total number of elements in the feature map, indicating its spatial dimensions, while xn denotes the nth element, uniquely identifying each spatial location. The mean value μ represents the average level of the elements in the feature map. The variance xv quantifies the feature dispersion, providing essential statistical characteristics for subsequent feature analysis and processing. After low-frequency feature extraction, the HCLNA module applies interpolation to restore the spatial dimensions for subsequent integration with the original features. A 1×1 CConv operation is then performed to facilitate feature fusion and refinement. This convolutional operation enables channel-wise information integration while preserving spatial dimensions, thereby enhancing feature representation capabilities. Subsequently, a normalization operation (NO) is applied to map feature values to a predefined range, enhancing model training efficiency and stability. During feature integration, the low-frequency components are combined with the original features and their corresponding variances through an optimized fusion algorithm, generating enhanced feature representations. This fusion strategy effectively leverages the complementary advantages of low-frequency and original features, significantly improving feature discriminability. The module concludes by merging the two processed feature streams, transforming them into a unified, highly representative feature space that constitutes the final output. The HCLNA module’s processing pipeline comprises eight sequential operations: FS (feature splitting) for feature decomposition, AMP (adaptive max-pooling) for low-frequency feature extraction, VC (variance computation) for variance calculation, Interp (interpolation) for spatial dimension restoration, 1×1 CConv for feature transformation, NO (normalization operation) for feature normalization, ME (modulation enhancement) for feature enhancement, and FF (feature fusion) for multi-stream integration. This modular architecture systematically captures multi-scale feature representations through its process-oriented design, enabling comprehensive processing of holographic spatial information. The module’s hierarchical processing capability provides robust support for subsequent hologram generation and optimization tasks.
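    A schematic PyTorch rendering of the eight-stage HCLNA pipeline follows, reusing CConv2d from above. Because the text does not spell out the exact modulation-enhancement (ME) formula, the variance-guided fusion below is an explicit assumption, as are applying the real-valued pooling and interpolation to the real and imaginary parts separately and taking the variance of the real part.

```python
import torch.nn.functional as F

def c_map(fn, x):
    """Apply a real-valued operation separately to real and imaginary parts."""
    return torch.complex(fn(x.real), fn(x.imag))

class HCLNA(nn.Module):
    """Sketch of FS -> AMP -> VC -> Interp -> 1x1 CConv -> NO -> ME -> FF."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = CConv2d(channels // 2, channels // 2, 1)
        self.out = CConv2d(channels, channels, 1)

    def forward(self, x):                                # x: complex (N, C, H, W)
        x1, x2 = torch.chunk(x, 2, dim=1)                # FS: feature splitting
        h, w = x1.shape[-2:]
        low = c_map(lambda t: F.adaptive_max_pool2d(t, (h // 8, w // 8)), x1)  # AMP, 8x
        xv = x1.real.var(dim=(-2, -1), keepdim=True)     # VC: variance xv (assumed: real part)
        low = c_map(lambda t: F.interpolate(t, size=(h, w)), low)              # Interp
        low = self.fuse(low)                             # 1x1 CConv
        low = low / (low.abs().amax(dim=(-2, -1), keepdim=True) + 1e-8)        # NO
        mod = x1 * (1 + xv) + low                        # ME: assumed fusion form
        return self.out(torch.cat([mod, x2], dim=1))     # FF: merge both streams
```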


    Figure 3. (a) Architecture of the HCLNA module for local and non-local feature integration. (b) Architecture of the HCPLF module for complex feature enhancement.

    The HCPLF module serves as a critical component for advanced feature refinement, working in synergy with HCLNA to enable comprehensive processing and enhancement of holographic feature representations. As shown in Fig. 3(b), the HCPLF module processes input features through the following workflow: initially, a 1×1 complex convolution operation expands the feature dimensionality, enhancing information capacity. Consider an input feature tensor X ∈ ℂ^(Cin×H×W) undergoing 1×1 complex-valued convolution to produce an output Y ∈ ℂ^(Cout×H×W). For complex-valued operations, the convolution is performed separately on the real and imaginary components. Let Kr ∈ ℂ^(1×1×Cin×Cout) and Ki ∈ ℂ^(1×1×Cin×Cout) denote the learnable convolution kernels for the real and imaginary parts, respectively. The final output YO of the 1×1 complex convolution can be mathematically expressed as

    YO = Σₖ(Kr*Xr) + i·Σₖ(Ki*Xi), k = 1, …, Cin,

    where Xr and Xi represent the real and imaginary components of the input tensor X, respectively, and * denotes the convolution operation. The features then undergo FS, splitting into two parallel processing branches. One branch employs a 3×3 complex convolutional layer for local context encoding, which extracts detailed spatial correlations and local patterns through localized receptive fields, analogous to microscopic examination of holographic features. The branch features are then processed through the CRelu activation function, which introduces nonlinear transformations to enhance feature representation capacity, enabling the model to effectively capture complex feature patterns. The other branch undergoes normalization to transform the feature distribution into an optimal range, thereby accelerating model convergence and enhancing training stability. Following normalization, selective feature enhancement and information filtering are applied to emphasize salient features while suppressing redundant information, producing refined feature representations for subsequent integration. After parallel processing through distinct pathways, the two feature streams converge at the FF stage. Let F1 ∈ ℂ^(C×H×W) and F2 ∈ ℂ^(C×H×W) denote the feature tensors generated from the two processing branches. To effectively integrate these features while preserving their relative importance, we propose a weighted fusion mechanism. Introducing a learnable parameter α ∈ [0, 1], with (1 − α) as its complementary weighting coefficient, the fused feature Ffusion ∈ ℂ^(C×H×W) can be expressed through the following formulation:

    Ffusion = α·F1 + (1 − α)·F2.

    Here, the weighting coefficient α is optimized through backpropagation using the Adam optimizer, with its value dynamically adjusted during training to maximize the fusion effectiveness based on experimental validation. This adaptive fusion mechanism enables the model to automatically determine the optimal contribution ratio between the two feature branches, effectively leveraging their complementary characteristics to enhance overall feature representation. The combined features then undergo a final 1×1 complex convolution for feature refinement, producing enhanced feature representations that constitute the module’s output. The hierarchical and multi-path architecture of the HCPLF module enables it to process features at different abstraction levels, effectively integrating multi-scale feature representations. The design significantly enhances overall feature discriminability and provides robust support for subsequent hologram generation and optimization tasks.
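    To make the two-branch workflow concrete, here is our sketch of the HCPLF module. The sigmoid gating standing in for "selective feature enhancement and information filtering" and the expansion factor are assumptions, while the learnable α fusion follows Ffusion = α·F1 + (1 − α)·F2 above.

```python
class HCPLF(nn.Module):
    """Sketch of HCPLF: 1x1 expansion -> FS -> (3x3 CConv + CReLU | NO +
    gating) -> weighted FF with learnable alpha -> 1x1 refinement."""
    def __init__(self, channels, expand=2):              # expansion factor assumed
        super().__init__()
        hidden = channels * expand
        self.proj_in = CConv2d(channels, hidden, 1)      # expand dimensionality
        self.local = CConv2d(hidden // 2, hidden // 2, 3, padding=1)
        self.act = CReLU()
        self.gate = CConv2d(hidden // 2, hidden // 2, 1)
        self.proj_out = CConv2d(hidden // 2, channels, 1)  # final 1x1 refinement
        self.alpha = nn.Parameter(torch.tensor(0.5))     # learnable fusion weight

    def forward(self, x):
        f1, f2 = torch.chunk(self.proj_in(x), 2, dim=1)  # FS
        f1 = self.act(self.local(f1))                    # local context branch
        f2 = f2 / (f2.abs().amax(dim=(-2, -1), keepdim=True) + 1e-8)  # NO
        f2 = f2 * torch.sigmoid(self.gate(f2).abs())     # selective enhancement (assumed)
        a = self.alpha.clamp(0.0, 1.0)                   # keep alpha within [0, 1]
        return self.proj_out(a * f1 + (1 - a) * f2)      # Ffusion = a*F1 + (1-a)*F2
```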

    The CHMF module establishes a hierarchical feature processing pipeline through sequential operations. Initially, input features undergo normalization preprocessing before HCLNA module processing. The HCLNA output merges with the original input, and the combined data is subsequently normalized before HCPLF module processing. The final feature representation is produced by strategic aggregation of both module outputs. This architecture enables comprehensive multi-scale, multi-dimensional feature processing by effectively integrating the complementary processing capabilities of both modules. The CHMF module exhibits superior performance in processing complex holographic scenes through precise feature representation fusion. This integration capability significantly enhances the network’s capacity for generating high-quality holograms. Through optimized sequential processing, the network synthesizes high-fidelity holographic reconstructions at the output stage.
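    The sequential pipeline just described maps onto a short residual composition; rendering the normalization preprocessing as simple magnitude scaling is again our assumption.

```python
class CHMF(nn.Module):
    """Sketch of the CHMF block: NO -> HCLNA -> merge with input -> NO ->
    HCPLF -> aggregation of both module outputs."""
    def __init__(self, channels):
        super().__init__()
        self.hclna = HCLNA(channels)
        self.hcplf = HCPLF(channels)

    @staticmethod
    def _no(x):                                  # normalization operation (assumed form)
        return x / (x.abs().amax(dim=(-2, -1), keepdim=True) + 1e-8)

    def forward(self, x):
        y = self.hclna(self._no(x)) + x          # HCLNA output merged with input
        return self.hcplf(self._no(y)) + y       # aggregate both module outputs
```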

    3. Simulation and Experiments

    The proposed model was implemented using Python 3.10 with PyTorch 2.1.1 and trained on the DIV2K super-resolution dataset[35], containing 800 images for training and 100 images for validation. The network optimization employed the Adam optimizer with an initial learning rate of 1×10⁻³. Training was conducted over a total of 30 epochs, with empirical observations indicating that the model converges and stabilizes around the 25th epoch. The SLM is configured with a pixel pitch of 4.5 µm × 4.5 µm, and the system operates at a laser wavelength of 532 nm. The optical path distance between the SLM plane and the image plane is maintained at a fixed 200 mm. All simulations were performed on a high-performance computing system equipped with an Intel(R) Xeon(R) Platinum 8160T CPU (2.10 GHz) and an NVIDIA Tesla V100 GPU with 16 GB memory.
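    Under these settings, one training step might look like the sketch below. Here CHMFNet stands for the full network constructor and loader for a DIV2K amplitude loader, both hypothetical names, and the sign convention for z follows our asm_propagate sketch in Sec. 2.1.

```python
net1, net2 = CHMFNet(), CHMFNet()     # hypothetical: amplitude generator / phase optimizer
opt = torch.optim.Adam(list(net1.parameters()) + list(net2.parameters()), lr=1e-3)

for amp in loader:                    # amp: (N, 1, H, W) target amplitude in [0, 1]
    I1 = net1(amp.to(torch.cfloat))   # zero-phase complex input -> target-plane field
    H1 = asm_propagate(I1, z=0.2)     # forward ASDM: target plane -> SLM plane
    I2 = net2(H1)                     # predicted SLM-plane complex amplitude
    phi = I2.angle()                  # phase-only hologram
    recon = asm_propagate(torch.exp(1j * phi), z=-0.2)     # backward ASDM to image plane
    loss = torch.nn.functional.mse_loss(recon.abs(), amp)  # MSE between amplitudes
    opt.zero_grad(); loss.backward(); opt.step()           # Adam update of all kernels
```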

    3.1. Simulation results

    Figure 4 presents a comparative analysis of numerical reconstruction results from Holo-encoder, HoloNet, CCNN, and CHMFNet. The peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) were adopted as the evaluation metrics. The images reconstructed by the CHMFNet model demonstrate exceptional clarity and precise detail restoration. When reconstructing complex textures, CHMFNet accurately maintains edge sharpness and detail fidelity, closely approximating the ground truth reference. In comparison, Holo-encoder reconstructions exhibit significant artifacts and blurring, particularly evident in edge ghosting and detail loss. While HoloNet shows improved reconstruction quality, it exhibits noticeable distortions when processing images with complex backgrounds or intricate details. Although CCNN outperforms both Holo-encoder and HoloNet, it still falls short in terms of detail restoration and image clarity when compared to the proposed CHMFNet model.


    Figure 4. Comparative analysis of numerical reconstruction results from Holo-encoder, HoloNet, CCNN, and CHMFNet.

    Next, the model’s performance was quantitatively evaluated using PSNR and SSIM metrics. As shown in Fig. 5(a), 100 images from the DIV2K validation dataset were utilized as test samples, and average SSIM and PSNR values were calculated for all four algorithms. The proposed CHMFNet achieves superior performance, attaining 34.11 dB PSNR and 0.95 SSIM. In contrast, Holo-encoder demonstrates significantly lower performance with 23.51 dB PSNR and 0.75 SSIM, while HoloNet achieves moderate results of 30.40 dB PSNR and 0.91 SSIM. Although CCNN shows competitive performance with 32.42 dB PSNR and 0.93 SSIM, it remains inferior to CHMFNet. These results clearly demonstrate the significant advantages of the CHMFNet model in terms of image quality evaluation metrics.
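    The reported metrics can be reproduced with standard library routines; the sketch below uses scikit-image and is not the authors' evaluation script.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def average_metrics(recons, targets):
    """Average PSNR (dB) and SSIM over paired lists of [0, 1] float images."""
    psnrs = [peak_signal_noise_ratio(t, r, data_range=1.0)
             for r, t in zip(recons, targets)]
    ssims = [structural_similarity(t, r, data_range=1.0)
             for r, t in zip(recons, targets)]
    return float(np.mean(psnrs)), float(np.mean(ssims))
```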


    Figure 5. (a) Average PSNR and SSIM values of all four algorithms. (b) Average MSE of the validation dataset during training.

    Furthermore, the average MSE of the validation dataset during training was analyzed, as illustrated in Fig. 5(b). The results demonstrate that the CHMFNet model exhibits faster convergence and lower average MSE compared to other models. This indicates that CHMFNet reaches optimal performance more rapidly during training, with smaller deviations between its predictions and the ground truth. These findings further validate CHMFNet’s efficiency and accuracy in generating high-quality holograms, highlighting its potential for real-time holographic display applications.

    To evaluate model generalization performance, we conducted experiments using the Flickr2K dataset[36]. A randomly selected subset of 500 images served as test samples, on which we assessed four pre-trained models: Holo-encoder, HoloNet, CCNN, and our proposed CHMFNet. Reconstruction quality was quantitatively evaluated by computing PSNR and SSIM metrics between reconstructed and original images. The quantitative results presented in Table 1 demonstrate CHMFNet’s superior performance on the Flickr2K dataset, achieving state-of-the-art metrics with an average PSNR (AvgPSNR) of 34.37 dB and an average SSIM (AvgSSIM) of 0.96. Comparative analysis reveals significant performance gaps: while Holo-encoder (24.59 dB/0.79) exhibits fundamental limitations in noise suppression and detail preservation, and HoloNet (30.73 dB/0.92) shows improved yet still insufficient reconstruction quality, even the competitive CCNN (32.78 dB/0.94) underperforms CHMFNet in critical aspects of detail restoration and structural similarity. The experimental results demonstrate CHMFNet’s robust cross-dataset generalization capability, consistently producing high-fidelity holograms with enhanced reconstruction quality and superior detail preservation. These findings substantiate the model’s effectiveness and performance advantages in complex real-world applications across diverse data distributions.

      Table 1. Average PSNR and SSIM Values of the Four Algorithms on the Flickr2K Dataset

                     Holo-encoder   HoloNet   CCNN    CHMFNet
      AvgPSNR (dB)   24.59          30.73     32.78   34.37
      AvgSSIM        0.79           0.92      0.94    0.96

    For a comprehensive performance comparison, Fig. 6 provides quantitative analyses of computational efficiency across multiple models. Figure 6(a) details the architectural parameters of each model, while Fig. 6(b) presents the average computation time for 100 CGH generation cycles when trained on the DIV2K dataset. The results demonstrate that HoloNet exhibits the highest computational complexity, with both parameter count and processing time significantly exceeding those of the other models. Although Holo-encoder has a smaller parameter count than HoloNet, its computational requirements remain significantly higher than those of contemporary approaches; nevertheless, Holo-encoder is fast in terms of generation time. CHMFNet’s model parameters are only one-sixth those of Holo-encoder. Compared to CCNN, CHMFNet improves hologram reconstruction quality while maintaining comparable processing speed. To thoroughly assess the practical applicability of our model, we conducted additional evaluations using the Flickr2K dataset comprising 2,650 diverse images. Figure 6(c) presents the average computation time required for 100 CGH calculations across all four models trained on this dataset. Comparative analysis reveals that the computational performance remains consistent with the DIV2K results shown in Fig. 6(b), demonstrating no statistically significant differences in processing times. Figure 6(d) shows the computational throughput analysis of CHMFNet across varying dataset scales, evaluating performance on randomly sampled subsets of 100, 200, 300, and 400 images from the Flickr2K benchmark. The results demonstrate consistent temporal performance, with marginal variation in average generation time across different dataset sizes. This stability in computational efficiency validates the architecture’s scalability and predictable performance characteristics, independent of input data volume.


    Figure 6. (a) Detailed architectural parameters of each evaluated model. (b) Average computation time per 100 CGH generation cycles using the DIV2K training dataset. (c) Comparative computation time metrics for 100 CGH generation cycles with the Flickr2K dataset training. (d) Scalability analysis showing CHMFNet’s computation time across varying dataset sizes (N = 100–400).

    To further evaluate the performance improvement of the proposed CHMFNet compared to traditional CNN architectures, Table 2 presents an ablation study. The average PSNR and SSIM values were calculated using a subset of 100 images from the DIV2K validation dataset. The results in Table 2 demonstrate that CHMFNet achieves significant improvements in both PSNR and SSIM metrics compared to traditional real-valued CNN (RCNN), real-valued HMFNet (RHMFNet), and complex-valued CNN (CCNN). RCNN achieves an average PSNR of 23.51 dB and an average SSIM of 0.75, while RHMFNet attains an average PSNR of 28.24 dB and an average SSIM of 0.82, indicating improved signal fidelity and structural similarity. However, despite its effectiveness in reducing speckle noise, RHMFNet fails to achieve the desired reconstruction quality. Compared to the CCNN’s 32.42 dB PSNR and 0.93 SSIM, CHMFNet demonstrates superior performance with 34.11 dB PSNR and 0.95 SSIM. These results highlight CHMFNet’s enhanced capability in detail preservation and structural integrity maintenance, demonstrating its effectiveness in high-quality hologram generation. By combining complex-valued convolution and hierarchical multi-fusion blocks, CHMFNet shows significant advantages in hologram reconstruction quality over existing models.

      Table 2. Ablation Studies on Various CNNs

                     RCNN    RHMFNet   CCNN    CHMFNet
      AvgPSNR (dB)   23.51   28.24     32.42   34.11
      AvgSSIM        0.75    0.82      0.93    0.95

    To systematically investigate the contribution of each core module to the model performance, we performed ablation studies by sequentially removing the CMP module, HCLNA module, and HCPLF module, as shown in Table 3. All architectural variants were compared to the complete CHMFNet model using average PSNR and SSIM metrics. For experimental validation, we selected 100 representative images from the DIV2K validation dataset as test samples. Each model variant (including the complete architecture) was evaluated under identical conditions, with the average PSNR and SSIM values computed across all reconstructed outputs to ensure statistically reliable comparisons. Table 3 demonstrates that the complete CHMFNet architecture achieves optimal performance, with average PSNR (34.11 dB) and SSIM (0.95) values significantly outperforming ablated variants. Specifically, the removal of the CMP module reduces performance to 31.90 dB PSNR and 0.92 SSIM, confirming its critical role in hierarchical feature representation. This performance degradation suggests the CMP module is essential for capturing fine holographic features, with its absence directly compromising reconstruction fidelity. The HCLNA-ablated model exhibits reduced performance (PSNR: 32.85 dB, SSIM: 0.94), demonstrating this module’s critical role in local-non-local feature integration. Its absence impairs spatial relationship modeling and multi-scale feature utilization, degrading reconstruction quality. The HCPLF-ablated model has an average PSNR of 31.41 dB and an average SSIM of 0.89, revealing this module’s essential role in progressive feature refinement. The absence degrades feature optimization capability, limiting complex feature enhancement and integration, which ultimately reduces reconstructed image detail fidelity compared to the complete model. The ablation studies quantitatively validate the critical contributions of each module (CMP, HCLNA, and HCPLF) in CHMFNet. These components synergistically enhance hologram generation quality, with each module demonstrating significant, non-redundant impacts on reconstruction performance. The systematic performance degradation observed in all ablated variants confirms both the architectural coherence of CHMFNet and the essential role of each module, providing empirical foundations for future model optimization.

      Table 3. Ablation Studies on Various Modules

                     w/o CMP   w/o HCLNA   w/o HCPLF   CHMFNet
      AvgPSNR (dB)   31.90     32.85       31.41       34.11
      AvgSSIM        0.92      0.94        0.89        0.95

    3.2. Experiments and results

    The experimental setup, illustrated in Fig. 7, was implemented to evaluate the performance of the proposed model. A 532 nm green laser served as the illuminating light source, with the beam first passing through an expander system and being collimated by a lens with an 80 mm focal length. The collimated beam was then directed onto a reflective phase-only SLM (CAS Microstar FSLM-2K39-P02) featuring a 4.5 µm pixel pitch and 1920×1080 resolution. Prior to loading onto the SLM, the hologram underwent preprocessing through zero-padding to adapt 1920×1072 resolution data to the SLM’s native format. A beam splitter (BS) was employed to redirect the modulated light through a 4f optical system, incorporating two Fourier lenses with 250 mm focal lengths and a spatial filter to attenuate both zero-order and high-order diffraction beams generated by the SLM. To ensure the SLM operated exclusively in the phase-only modulation mode, a polarizer was strategically positioned to optimize the polarization state of the incident light. The reconstructed holographic images were captured using a high-resolution digital camera (Canon EOS 90D), which underwent rigorous calibration procedures to ensure precise and reproducible image acquisition. The entire experimental setup was mounted on a vibration-isolated optical platform to minimize mechanical vibrations and external disturbances, ensuring the integrity of the experimental results.
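    The zero-padding step amounts to one call; splitting the eight missing rows symmetrically (four above, four below) is our assumption, since the text only states that 1920×1072 data are padded to the SLM's native 1920×1080 frame.

```python
import torch
import torch.nn.functional as F

holo = torch.rand(1072, 1920)        # placeholder 1920x1072 phase hologram
padded = F.pad(holo, (0, 0, 4, 4))   # (left, right, top, bottom) zero-padding
assert padded.shape == (1080, 1920)  # matches the SLM's native resolution
```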


    Figure 7. (a) Schematic representation of the experimental setup. (b) Photograph of the implemented optical configuration.

    Figure 8 shows the optical reconstruction results of CGH rendered by different models, including Holo-encoder, HoloNet, CCNN, and CHMFNet. It should be noted that environmental factors such as ambient light, camera instability, and the non-ideal fill factor of the SLM may introduce variability into the experimental results. Even so, the captured images exhibit a high degree of consistency with the simulation results, validating the reliability of the experimental setup. The images reconstructed by the Holo-encoder model exhibit noticeable artifacts and blurred regions, significantly degrading the overall image quality. Although the HoloNet model demonstrates improved detail preservation relative to the Holo-encoder, it still suffers from distortions and residual noise in the reconstructed images. These limitations hinder the accurate reproduction of fine details, resulting in a lack of realism that falls short of the requirements for high-precision holographic displays. The CCNN model demonstrates moderate reconstruction quality, effectively capturing the general structure of the scene but failing to match the CHMFNet. Specifically, the textures and fine details in the images reconstructed by CCNN are noticeably less defined and less sharp than those produced by the CHMFNet, resulting in a less impressive overall visual effect. In contrast, the CHMFNet model demonstrates exceptional performance in generating high-quality holographic reconstructions. The reconstructed images exhibit sharp details, minimal artifacts, low noise levels, and well-defined object edges. In summary, the experimental results provide conclusive evidence of CHMFNet’s reconstruction capabilities, establishing its potential for application in holographic display.


    Figure 8. Optical reconstruction results of CGH rendered by different models. (a) The CHMFNet method, (b) the CCNN method, (c) the HoloNet method, and (d) the Holo-encoder method.

    4. Discussion and Conclusion

    This study presents a novel CHMFNet architecture for high-quality hologram generation. The framework, built upon a U-Net backbone, integrates a CMP module that enhances complex feature representation through optimized convolutional operations and advanced activation functions. To further improve the reconstruction quality, the architecture incorporates an innovative CHMF block that enables multi-scale hierarchical processing and advanced feature fusion. The CHMF combines HCLNA and HCPLF functionalities, where the HCLNA simultaneously processes local and non-local features within a complex-valued domain, representing a substantial improvement in capturing intricate spatial relationships inherent in holograms. Complementing this, the HCPLF implements progressive feature refinement, substantially enhancing hologram generation capability. The integration of complex-valued convolution and specialized CHMF design enables more effective optical information processing, generating holograms with reduced artifacts and enhanced fidelity. The computational results demonstrate that the proposed method achieves superior reconstruction quality, with average PSNR and SSIM values of 34.11 dB and 0.95, respectively, significantly outperforming conventional approaches. Both simulation and experimental results conclusively demonstrate that the proposed CHMFNet model generates holograms with more precise details and superior image quality compared to traditional models. The capabilities demonstrated by our proposed method highlight its strong potential for applications in holographic display systems.

    [30] M. Zheng, L. Sun, J. Dong et al. SMFANet: a lightweight self-modulation feature aggregation network for efficient image super-resolution. European Conference on Computer Vision, 375 (2024).

    [33] D. P. Kingma, J. Ba. Adam: a method for stochastic optimization. International Conference on Learning Representations (2015).

    [35] R. Timofte, E. Agustsson, L. Van Gool et al. NTIRE 2017 challenge on single image super-resolution: methods and results. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 125 (2017).

    [36] E. Agustsson, R. Timofte. NTIRE 2017 challenge on single image super-resolution: dataset and study. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 135 (2017).

    Paper Information

    Category: Diffraction, Gratings, and Holography

    Received: Mar. 26, 2025

    Accepted: May 12, 2025

    Posted: May 14, 2025

    Published Online: Aug. 21, 2025

    The Author Email: Wenqiang Wan (wqwan_ecjtu@163.com)

    DOI:10.3788/COL202523.090501

    CSTR:32184.14.COL202523.090501
