Feature fusion and variational autoencoder based deep coded aperture design for a CUP-VISAR diagnostic system

Miao Li; Chenyan Wang; Xi Wang; Lingqiang Zhang; Chaorui Chen; Zhaohui Guo; Xueyin Zhao

doi:10.3788/COL202523.041101

1. Introduction

It was reported that over 80 diagnostics had been installed at the 100-kJ-level ShenGuang III laser facility. Shock speeds and merger time can be measured using the VISAR diagnostic. This instrument measures the propagation speed of the leading shock for continuous time and shows the shock mergers as velocity jumps when subsequent shocks overtake the preceding one^[1,2]. Traditional VISAR diagnostic systems lack the capacity to achieve both high temporal resolution and two-dimensional (2D) information acquisition. Yang et al.^[3] developed the compressed ultrafast photography for a 2D velocity interferometer system for any reflector (CUP-VISAR)^[4] to surmount this limitation by utilizing spatial domain coding, data compression, and image inversion techniques to capture high-speed transient phenomena. This advancement facilitates the calculation of the inertial confinement fusion (ICF) velocity field, prediction of compression dynamics, and practical experimentation. Figure 1 illustrates the configuration of the CUP-VISAR system. Initially, the probe light emitted by the probe light source passes through the beam splitter BS1 and lens L1, reaching the target surface for reflection. The probe light, carrying Doppler frequency shift information, retraces its original path and enters the interferometer via reflection from BS2, generating interference fringes. These fringes are then directed through the BS7 beam splicing mirror into the CUP system, where they are encoded by the digital micro-mirror device (DMD). Subsequently, the encoded fringes are captured and superimposed by the streak camera, which simultaneously encodes, compresses, and records the 2D spatial information of the shock wave. Observing the ICF 2D shock wave velocity field presents several challenges: (1) High temporal and spatial resolutions for transient processes increase aliasing in streak camera data, reducing the signal sampling rate. (2) Enhancing the signal-to-noise ratio (SNR) requires considerations of light source brightness, detector sensitivity, and optical system resolution, necessitating a larger coding aperture relative to detector pixels. Specifically, DMD projection employs a $7 pixel \times 7 pixel$ ratio instead of $1 pixel \times 1 pixel$ , causing multiple pixels within the coding aperture to share the same encoding, leading to aliasing. This aliasing complicates the extraction of detailed features from the velocity fringe distribution, resulting in significant errors in calculating shock wave velocity.

Figure 1.Basic structure of the CUP-VISAR system for measuring shock wave velocity.

Download full size

View all figures

Creating a reliable linear observation matrix is essential for the CUP system to guarantee image inversion and efficient data compression. Yang et al.^[5,6] proposed an optimization method using a genetic algorithm, capturing dynamic scene information through random coding and refining it to enhance image inversion quality. This method, however, incurs significant computational and temporal costs for large-scale data processing. Conversely, deep learning methods optimize coding matrices by learning intrinsic structures and features from extensive datasets. Iliadis et al.^[7,8] proposed the “deep binary mask” architecture, where the encoder learns binary weights to create a coding matrix, enhancing the average peak signal-to-noise ratio (PSNR) by approximately 1 dB compared to the random mask. Marquez et al.^[9] developed the deep high-dimensional adaptive network (D-HAN) method, generating a coding matrix for training data while reducing its 2D cross-correlation.

In this Letter, we deal with the obstacle of accurate feature extraction hindered by encoding aperture aliasing in the CUP-VISAR system when analyzing multiple sampled frames. A deep learning-based coding aperture design framework is introduced for a 2D shock wave velocity field diagnosis system. Employing a convolutional variational auto-encoder (CVAE) network, the framework captures velocity fringe features and integrates frequency domain features, thereby enhancing signal compression sampling efficiency, mitigating aliasing from a large coding aperture, and improving reconstruction accuracy. Various reconstruction methods were employed to assess the recovery effect. The significance of adjusting code transmittance to enhance reconstruction accuracy is highlighted, particularly in the presence of experimental noise.

2. Coding Aperture Design Based on the Convolutional Variation Auto-Encoder Network

The CVAE network maps fringe image inputs to a low-dimensional latent variable space and achieves feature fusion by mixing frequency domain features. Figure 2 illustrates the comprehensive framework of the model. Mean and variance are utilized to generate the binary mask of the desired dimension, which is then expanded to match the input dimension. During sampling, the coding matrix is multiplied by the original image data and overlaid to produce the sampled measurement, which serves as the input for the reconstruction method. To calculate the loss function, we need to find the average absolute error between the input data and reconstructed output, the binary cross-entropy loss between the input data and decoder output, and the Kullback–Leibler (KL) divergence loss between the prior distribution and the latent variable distribution. Model parameters are updated using backpropagation, resulting in the generation of a binary coding matrix that captures the properties of input data.

Figure 2.Framework for coding matrix design utilizing a CVAE network.

Download full size

View all figures

Figure 3.Network architecture diagram for encoding and decoding.

Download full size

View all figures

Variational auto-encoder network: The variational auto-encoder (VAE) network utilizes latent variables and their distributions to represent the original streak image data and their prior distributions $p (x)$ . It is assumed that $q (x)$ approximates $p (x)$ , denoted as $q (x, z)$ approximating $p (x, z)$ . The disparity between the latent variable distributions and the prior distributions is measured using the KL divergence, $KL [p (x, z) ∥ q (x, z)] = E_{x \sim p (x)} \ln p (x) + E_{x \sim p (x)} \int p (z | x) \ln \frac{p (z | x)}{q (x, z)},$ (1)where $p (x)$ is the prior distribution around $x$ , i.e., $E_{x \sim p (x)} \ln p (x)$ is a constant. If the latent variable distribution is closer to the prior distribution, the smaller the KL dispersion, the smaller the loss function, and the loss function $ℓ$ is defined as $ℓ = - E_{x \sim p (x)} {E_{z \sim p (z | x)} [- \ln q (x | z)]} + E_{x \sim p (x)} {KL [p (z | x) ∥ q (z)]} .$ (2)

To ensure that the loss function $ℓ$ is minimized and $KL [p (z | x) ∥ q (z)] \neq 0$ , i.e., the model learns the features of the original data distribution, two terms in the loss are pitted against each other during training.

If $z$ is distributed according to the multivariate standard normal distribution $z \sim N (0, I)$ , where $I$ represents the unit matrix and $p (z | x)$ follows a multivariate normal distribution $p (z | x) \sim N (μ, σ^{2})$ with independent components, then Eq. (2) can be expressed as per the given distribution description: $ℓ = \frac{1}{2} \sum_{k = 1}^{d} [- \ln σ_{k}^{2} (x) + σ_{k}^{2} (x) + μ_{k}^{2} (x) - 1] - \frac{1}{n} \sum_{i = 1}^{n} \ln q (x | z_{i}), z_{i} \sim p (z | x) .$ (3)

Equation (3) comprises two components of the loss function: The first component is the KL divergence between the latent variable distribution and the prior distribution, and the second component is the binary cross-entropy loss between the decoder’s output and the original image data.

By utilizing the reparameterization technique, the issue of non-differentiability in random sampling can be addressed. This involves transforming the sampling of $z$ from $p (z | x) \sim N (μ, σ^{2})$ into sampling $ε$ from $z \sim N (0, I)$ so that $z = μ + σ \times ε$ . In other words, the sampling outcome is achieved by parameter transformation. Breaking down the sampling process into deterministic mapping and random noise components enables the sampling outcomes to be included in gradient calculations and optimization, leading to model training and the production of high-quality samples.

Network Architecture: The convolutional neural network is utilized to extract global and local properties, along with their spatio–temporal correlations, from shockwave diagnostic streak data images. The network, as shown in Fig. 3, adeptly captures a structured representation of the data by sharing parameters across the temporal and spatial dimensions. The feature map is denoted as $h_{k}$ , where $k$ represents the $k$ th element of the input data $x$ . The encoder executes convolution as specified below: $h_{k} (x_{k}) = κ_{i} [n_{i} (W_{i} x_{k} + b_{i})],$ (4)where $concat (\cdot, \cdot)$ denotes the concatenation operation along the channel dimension. This means that the features $h_{k_DF}$ and $h_{k_SF}$ are combined into a single tensor by appending the channels of $h_{k_SF}$ to those of $h_{k_SF}$ , resulting in an enriched feature representation $h$ that integrates information from both the fringe and spectrum features.

The encoder extracts the distribution features $h_{k_DF}$ from the original fringe image. The shock wave image undergoes Fourier transform to preprocess the amplitude spectrum, which involves logarithmic processing and frequency domain centralization, yielding spectrum features $h_{k_SF}$ . These features are then stitched along the channel dimension, leveraging complementary information to generate the fusion feature $h$ : $h = concat (h_{k_DF}, h_{k_SF}),$ (5)where $concat (\cdot, \cdot)$ denotes the concatenation operation along the channel dimension. This means that the features $h_{k_DF}$ and $h_{k_SF}$ are combined into a single tensor by appending the channels of $h_{k_SF}$ to those of $h_{k_SF}$ , resulting in an enriched feature representation $h$ that integrates information from both the fringe and spectrum features.

The fusion features undergo a linear mapping resulting in the calculation of mean and variance: $μ_{k} = W_{i} \times h + b_{i}, \log σ_{k}^{2} = W_{i} \times h + b_{i},$ (6)where $b_{i}$ is the linear mapping bias and $W_{i}$ is the linear mapping weight.

By training the model, the distribution features of the original fringe data were obtained and determined by the mean and variance. Based on these distribution features, a basic binary mask was generated with dimensions matching the defined potential variable $(207 / 7) \times (256 / 7)$ . For the base mask, the expansion coefficient is 7 (coding aperture size) for both row and column. The coding matrix is shifted one unit along the column direction per frame, generating a three-dimensional coding matrix composed of multi-frame images for reconstruction.

Utilizing backpropagation to calculate the gradient and update the model parameters minimizes the loss function. Introducing the average absolute error between the reconstruction result and the original image data as a term in the loss function, Eq. (3) can be rewritten as $ℓ = \frac{1}{2} \sum_{k = 1}^{d} [- \ln σ_{k}^{2} (x) + σ_{k}^{2} (x) + μ_{k}^{2} (x) - 1] - \frac{1}{n} \sum_{i = 1}^{n} \ln q (x | z_{i}) \frac{1}{n} \sum_{i = 1}^{n} | {\hat{x}}_{k} - x_{k} |, z_{i} \sim p (z | x),$ (7)where $μ_{k} (x)$ and $σ_{k}^{2} (x)$ are the mean and variance of the $k$ th latent variable from the variational distribution $q (x, z)$ . ${\hat{x}}_{k}$ is the decoder’s predicted value for the $k$ th pixel, $x_{k}$ is the actual value in the original image, $z_{i} \sim p (z | x)$ denotes sampling from the latent distribution, $q (x | z_{i})$ is the decoder’s probability distribution for the reconstructed image, and $n$ is the number of samples in the batch.

Model Training: To train the model, 8320 simulated shock wave diagnosis fringe images, each with dimensions of $256 \times 256 \times 50$ , were used as the training set. These fringe images encompass the various periods, shapes, moving directions, and speeds included in the experiment. During the training process, the images are divided into specified batch sizes and input into the CVAE network to capture the fusion features and generate the coding matrix. PyTorch is utilized as the experimental platform and optimized using the Adam optimizer.

The training procedure is segmented into four steps:

1.Initialization of the optimizer and gradient zeroing.
2.Forward propagation, which involves encoding, feature fusion, and coding matrix generation using mean and variance.
3.Backpropagation, which is performed to calculate the gradient regarding layer activation. Binary cross-entropy loss and KL divergence were calculated, and the total loss was determined using the reconstruction results obtained from the coding matrix generated in Step 2.
4.A model parameter update, leveraging the optimizer to update parameters based on gradients.

The training algorithm comprises two steps: initially focusing on coding matrix generation, followed by reconstruction using this matrix. A subsequent training phase aims to enhance the reconstruction quality by adjusting the coding matrix.

3. Simulation and Experimental Findings

3.1. Simulation verification and analysis

A coding matrix with a coding aperture of $7 pixel \times 7 pixel$ is generated by the feature fusion CVAE network. The large coding aperture coding matrices generated by the same training round model before and after feature fusion are compared with the random coding matrix, as shown in Figs. 4(a)–4(c). During the simulation sampling process, the coding matrix is Hadamard multiplied with the shock wave fringe image data frame by frame. The measurements and three masks are then simulated and reconstructed with the ADMM-TV^[10], TVAL3^[11], E3DTV^[12], and ADMM-UNet^[13] algorithms. Figure 5 presents the reconstructed results of the shock diagnosis fringe images, utilizing three large-aperture $7 pixel \times 7 pixel$ encoded matrices. The diagnosis fringe distribution has a sample data dimension of $256 \times 256$ and comprises 50 frames.

Figure 4.Mask. (a) Random mask; (b) 300 epoch feature-free fusion mask; (c) 300 epoch feature fusion mask.

Download full size

View all figures

Figure 5.Velocity fringe reconstruction of shock wave diagnosis by four algorithms with different masks. (a1) Random mask; (a2) measurement with the random mask; (a3)–(a6) 25th frame reconstruction of the random mask; (b1) feature-free fusion mask; (b2) measurement with the feature-free fusion mask; (a3)–(a6) 25th frame reconstruction of the feature-free fusion mask; (c1) feature fusion mask; (c2) measurement with feature fusion mask; (c3)–(c6) 25th frame reconstruction of the feature fusion mask.

Download full size

View all figures

The simulation of fringe image (25th frame) reconstruction utilized three distinct coding matrices and four different algorithms as shown in Fig. 5. From a visual perspective, the images reconstructed by the four algorithms using the random coding matrix and the feature-free fusion coding matrix still exhibit small-scale blurring at the edges. Although the ADMM-TV and TVAL3 algorithms produce smoother edges, they result in the loss of fringe image details. In comparison, when the coding matrix generated after feature fusion is used for reconstruction, the results of the four algorithms all enhance the detailed representation of the fringe contours, producing a more complete reconstruction of the fringe image than the previous two coding matrices. Subsequently, the performance of the reconstructed 25th frame is analyzed in detail. The PSNR values for the 25th frame fringes reconstructed by ADMM-TV, TVAL3, E3DTV, and ADMM-UNet algorithms with a random coding matrix are 17.48, 18.57, 24.35, and 24.59 dB, respectively. The structural similarity (SSIM) values are 62.85%, 70.73%, 78.42%, and 87.04%, respectively. The PSNR values for the 25th frame fringes reconstructed by ADMM-TV, TVAL3, E3DTV, and ADMM-UNet algorithms with a random coding matrix are 17.48, 18.57, 24.35, and 24.59 dB, respectively. The SSIM values are 62.85%, 70.73%, 78.42%, and 87.04%, respectively. For the generated feature-free fusion coding matrix restoration fringe (25th frame), the PSNR values increased by 0.35, 0.76, 1.48, and 1.43 dB compared to the random coding matrix with the algorithms mentioned above, while the SSIM values improved by 10.88%, 5.87%, 10.10%, and 2.08%. The PSNR values of 25th frame fringes recovered by the feature fusion coding matrix improved by 1.11, 1.99, 2.24, and 2.94 dB, with corresponding SSIM enhancements of 11.66%, 8.71%, 10.79%, and 2.81%. The coding matrix generated by our network has enhanced the reconstruction performance of the algorithms, surpassing that of the random coding matrix.

3.2. Experimental verification and analysis

Accounting for ambient noise, the dynamic response range constraint of the streak camera, and the presence of speckle noise in the constructed CUP-VISAR system, we completed a static verification experiment for the designed encoding mask. The fringe pattern was directly projected onto the corresponding mask in the optical system. These fringes were encoded by the coding plate, then passed through the 4f system, amplified by the lenses, and directed through mirrors to the streak camera. The streak camera recorded the compressed and encoded fringe data using a fully open slit (4 mm) without lens bias. Enhancing the transmittance of the coding matrix can improve the reception and transmission of previous weak light signals, augment the distinction between fringe data and background noise, and thereby enhance the SNR of the entire system. Excessive light transmittance may cause system saturation and diminish measurement accuracy. It is crucial to adjust light transmittance to maintain signal acquisition efficiency and overall system performance stability. Figure 6 outlines the specifications of the fabricated two group masks under the condition that the mask pixel lengths are 25.8 and 51.6 µm. The intensities of the light signal, which correspond to the light intensity passing through the fabricated coding matrices, were acquired and compared. The experimental results demonstrate that, under the two mask sizes, the average periodic voltage of the generated code increased by 12.56% and 49.03% compared to the random code, that is, the transmittance is increased by 12.56% and 49.03%.

Figure 6.Experimental mask plate specification drawing. The coding-aperture ratio of the mask is 7:3. (a) Feature fusion mask; (b) random mask.

Download full size

View all figures

The effect of the transmittance of the coding matrix on the quality of reconstruction is further verified through experiments. Various coding aperture ratios (A:B, where A indicates transmittance aperture numbers and B indicates opacity aperture numbers) were employed to generate the coding matrix. In the experiment, the diagnosis fringe pattern has a sample data dimension of $256 \times 256$ and comprises 35 frames. To verify the effect of the generated coding matrix on the reconstruction quality, the coded fringe data were recorded by the streak camera with different coding matrices. Measurements were reconstructed using four algorithms, and the results are presented in Fig. 7. The PSNR values and SSIM values of these reconstruction results peak when the coding matrix ratio is 7:3. Our coding matrix outperforms the random coding matrix in reconstruction performance, with the maximum PSNR and SSIM values for the ADMM-UNet algorithm increasing by 1.65 dB and 8.60%. Additionally, the 25th frame of the reconstruction findings demonstrates that the coding matrix provides more detailed fringe profiles compared to the random coding matrix [refer to Figs. 7(a1) and 7(b1)].

Figure 7.Velocity fringe reconstruction performance in various coding aperture ratios. (a1), (b1) PSNR performance comparison for different algorithms; (a2), (b2) SSIM performance comparison for different algorithms.

Download full size

View all figures

4. Conclusion

This letter presents a deep learning framework for the coding aperture design of the CUP-VISAR system in ICF research. Utilizing a CVAE network and frequency domain features, it generates a large aperture coding with fringe features, mitigating aliasing and improving detail extraction. Specifically, the ADMM-TV, TVAL3, E3DTV, and ADMM-UNet algorithms yield increased PSNR values of restored fringes (25th frame) by 1.11, 1.99, 2.24, and 2.94 dB compared to random coding matrices. Correspondingly, the SSIM values are augmented by 11.66%, 8.71%, 10.79%, and 2.81%. To account for noise in the experimental environments and the limited dynamic response range of the streak camera, a simulation-based 2D arbitrary reflector velocity interferometer imaging experiment is conducted. Results confirm that optimal average PSNR and SSIM peaks are attained when the coding aperture ratio is 7:3. Specifically, utilizing the ADMM-UNet algorithm with the generated coding matrix enhances the average PSNR peak by 2.56 dB and the average SSIM peak by 9.55% compared to regular 5:5 ratio matrix. In the comparative analysis, the average PSNR peak value of the coding matrix generated with this framework demonstrates a 1.65 dB increase compared to the random coding matrix. Similarly, the average SSIM peak value exhibits an 8.60% enhancement, while the transmission value in the large-size mask experiment rises by 49.03%. The proposed feature fusion and CVAE-based deep coding aperture design framework for CUP-VISAR real data shock wave diagnosis provides a theoretical foundation and practical utility for ICF research. In the ICF diagnostic process, the training set for the model should include as many features of experimental fringe data as possible. Prior to formal experiments, historical experimental datasets are input as extensively as possible. The purpose is to generate a more optimized encoding mask by learning the latent statistical distribution characteristics of real fringe data, thereby improving the modulation efficiency of the beam. This process ensures that the encoding matrix exhibits high adaptability and robustness in specific ICF experiments.

Category: Imaging Systems and Image Processing

Received: May. 21, 2024

Accepted: Oct. 9, 2024

Published Online: Apr. 11, 2025

The Author Email: Miao Li (limiao@cqupt.edu.cn), Xi Wang (xiwang@cqupt.edu.cn)

DOI:10.3788/COL202523.041101

CSTR:32184.14.COL202523.041101

微信扫一扫：分享