Advanced Photonics Nexus, Volume 4, Issue 5, 056007 (2025)

Learning from better simulation: creating highly realistic synthetic data for deep learning in scattering media

Bozhen Zhou1, Zhitao Hao1, Zhenbo Ren2,3,*, Edmund Y. Lam4, Jianshe Ma1, and Ping Su1,*
Author Affiliations
  • 1Tsinghua University, Tsinghua Shenzhen International Graduate School, Shenzhen, China
  • 2Northwestern Polytechnical University, School of Physical Science and Technology, Xi’an, China
  • 3Northwestern Polytechnical University, Shenzhen Research Institute, Shenzhen, China
  • 4The University of Hong Kong, Department of Electrical and Electronic Engineering, Hong Kong, China

    Obtaining the ground truth for imaging through scattering objects is always a challenging task. Furthermore, the scattering process caused by complex media is too intricate to be accurately modeled by either traditional physical models or neural networks. To address this issue, we present a learning-from-better-simulation (LBS) method. Utilizing the physical information from a single experimentally captured image through an optimization-based approach, the LBS method bypasses the multiple-scattering process and directly creates highly realistic synthetic data. The data can then be used to train downstream models. As a proof of concept, we train a simple U-Net solely on the synthetic data and demonstrate that it generalizes well to experimental data without requiring any manual labeling. 3D holographic particle field monitoring is chosen as the test bed, and simulation and experimental results are presented to demonstrate the effectiveness and robustness of the proposed technique for imaging through complex scattering media. The proposed method lays the groundwork for reliable particle field imaging at high concentrations. The concept of utilizing realistic synthetic data for training can be significantly beneficial in various deep-learning-based imaging tasks, especially those involving complex scattering media.


    1 Introduction

    Optical imaging through scattering media,1 which includes elements such as turbid water,2 fog, and biological tissues,3 is vastly applicable across various fields. These range from high-speed color imaging4 and optical coherence tomography5 to quantitative phase imaging6 and biomedical imaging.7,8 However, the process of getting a deeper, clearer view of the details is predominantly hindered by scattering, particularly the multiple scattering caused by the complexity and inhomogeneity of tissues.9 Recently, deep learning techniques have emerged as critical tools for addressing these complex imaging problems, and they have been successfully applied across various imaging disciplines.10 These include phase imaging,11,12 auto-focusing,13 and super-resolution imaging.14,15 Considerable work has also been devoted to the specific area of imaging through scattering media.16–18

    In the field of computational imaging,19 although deep learning has made significant strides in addressing imaging challenges, its dependency on large-scale training data presents a crucial issue: the acquisition of ground truth labels for model training is a cumbersome and time-consuming task. In certain domains, these labels can be obtained through established gold standards;20 however, this approach is not only inefficient but also becomes impractical when the volume of data scales into the millions. Furthermore, in some fields, attaining ground truth remains an elusive goal due to the absence of effective solutions to specific technical challenges. For example, in particle field monitoring, acquiring ground truth for real, dynamic particle fields is nearly impossible,21 especially at high particle concentrations, where multiple scattering and defocusing significantly interfere with particle imaging. Many researchers22–25 have modified raw reconstructed images or scanning-microscope images via digital image processing methods and treated the results as “ground truth.” This approach, however, lacks stability and accuracy and often requires significant time to complete. Moreover, although “ground truth” derived from low concentrations may be somewhat reliable, its credibility diminishes at higher concentrations. Thus, efficiently obtaining sufficient data for deep learning training within a brief timeframe, while ensuring credible ground truth, stands as a pivotal challenge in utilizing deep learning for imaging through scattering media.

    A promising strategy entails the use of synthetic data for network training while guaranteeing that the network retains the ability to process experimental data effectively. Domain adaptation from simulation to reality26 is currently accomplished primarily through two methodologies: data manipulation and model manipulation.27 Some researchers have successfully generated realistic data efficiently in specific domains, such as the generation of ultrasonic multimodal total focusing method (M-TFM) images.28,29 In particle holography, however, due to the extreme complexity of the multiple-scattering process,30,31 highly realistic synthetic data can be quite difficult to generate, and current research focuses on model manipulation. Chen et al.21 introduced a simulation-training model, Holo-Net, which incorporates the physical prior of the point spread function (PSF), enhancing the network’s stability in particle localization and tracking tasks. However, the exclusive reliance on the PSF prior introduces certain limitations when addressing more complex particle fields. Consequently, their experimental validation involved a particle field of low concentration (50  particles/mL) and relatively large errors (1  mm in the axial direction). Tahir et al.32 developed an extensive synthesized network, comprising multiple deep neural networks, to capture the diversity in the data, together with a substantial training set to mitigate the risk of overfitting, ensuring adaptability across various scattering conditions. Both approaches require a substantial amount of data to train an extensive model, which incurs a heavy computational cost.

    Diverging from their approaches, we manipulate the data and introduce a method termed “learning from better simulation” (LBS). This method greatly reduces the required amount of data for a modified U-Net model by enhancing the synthetic data used in training to closely mirror real data characteristics. As a result, the model can effectively handle experimental data with high particle concentrations. As shown in Fig. 1(a), an experimental hologram of the particle field is captured and subsequently backpropagated to various distances to acquire raw reconstruction images. By feeding these images into the simulator-trained U-Net as the input, we can get the required 2D layer images as the output. After applying the output processing algorithm to the output, a 3D particle field is reconstructed. Figure 1(b) shows how we get better synthetic training images for the U-Net by utilizing experimental images. At step 1, we use an optimization algorithm to continuously match the reference experimental image from an experimental hologram with the reference synthetic image from a simulated hologram, aiming to minimize the differences and identify optimal parameters. At step 2, we take the raw simulated hologram as the input for the Holo-processing with the identified optimal parameters in step 1, which helps to convert the simulated hologram to better synthetic images. These better synthetic images are the input images for the training of the U-Net.


    Figure 1. LBS method framework. (a) The reconstruction process for the experimental 3D particle field. (b) The training data preparation for the U-Net. P is the parameter to be optimized, and P* is the optimal parameter.

    In summary, we propose an LBS method to generate better, highly realistic data by utilizing a single experimentally captured image sample through an optimization method, without the help of any neural networks. In addition, we validate the quality of the generated better synthetic data with a simple U-Net. We believe that this technique not only paves the way toward the application of 3D particle field monitoring but also holds potential for addressing problems that either lack a ground truth or for which obtaining a ground truth is prohibitively difficult. This is particularly applicable in scenarios involving multiple-scattering imaging, such as speckle denoising,33 imaging in dynamic scattering media,34 and the phase recovery of medical biological tissues.35 Our LBS method opens up a new paradigm in creating highly realistic synthetic data for deep learning-based imaging techniques.

    2 LBS Method

    In this section, we introduce the framework of the LBS method in detail. The LBS method includes high-contrast simulation (HCS), optimization in combination with experiments (OCE), and a modified U-Net model.

    2.1 HCS Framework

    Figure 2 shows the details of the HCS method. The HCS is intended to generate simulation images with high contrast to train the modified U-Net. More specifically, we want to obtain the simulated hologram, which will next be used to generate the training input data, together with the corresponding ground truth images as labels.


    Figure 2. HCS framework.

    We prepare a stack of layered images with a size of 256  pixel×256  pixel, inside which are a certain number of particles (the number can be 0, that is, no particles). In the simulated layer images, the gray value of the pixels occupied by particles is 1 (white), whereas the gray value of the background area is 0 (see sec. 1A in the Supplementary Material). These layered images are equally spaced along the optical axis (z-axis), forming a 3D particle field. Each layer image is illuminated by an on-axis plane wave and diffracts onto the sensor plane using the angular spectrum (AS) diffraction model.36 All the complex diffraction distributions are superimposed, and the intensity of the synthetic distribution is the original hologram (see sec. 1B in the Supplementary Material). The DC component is then subtracted from the hologram. The simulated hologram is preserved in its continuous gray-scale form instead of being quantized into gray-level steps to simulate the hologram recorded by an image sensor. The reason behind this is that, based on our observation, the quantization process of the hologram introduces a certain level of distortion to the frequency spectrum of the simulated hologram. This distortion then affects the subsequent image reconstruction performance through backpropagation, resulting in less ideal outcomes (see sec. 1C in the Supplementary Material).
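    As a rough sketch of this layer-by-layer synthesis (in Python/NumPy, matching the software stack listed in Sec. 2.3): the 532 nm wavelength, the opaque-particle transmission model, and the simple field superposition across layers are our assumptions for illustration; the exact diffraction and superposition model is given in Secs. 1A and 1B of the Supplementary Material.

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, pixel_pitch, z):
    """Propagate a complex field over distance z with the angular spectrum (AS) method."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pixel_pitch)
    FX, FY = np.meshgrid(fx, fx)
    # Free-space transfer function; evanescent components (arg <= 0) are suppressed.
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    H = np.where(arg > 0,
                 np.exp(1j * 2 * np.pi * z / wavelength * np.sqrt(np.maximum(arg, 0.0))),
                 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * H)

def synthesize_hologram(layers, z_list, wavelength=532e-9, pixel_pitch=2.2e-6):
    """Superimpose the diffracted fields of all layers and subtract the DC component."""
    total = np.zeros(layers[0].shape, dtype=complex)
    for layer, z in zip(layers, z_list):
        # Plane-wave illumination: particle pixels (value 1) block light, background transmits.
        total += angular_spectrum_propagate(1.0 - layer, wavelength, pixel_pitch, z)
    hologram = np.abs(total) ** 2
    # Keep the continuous gray-scale form (no quantization), DC term subtracted.
    return hologram - hologram.mean()
```

    Note that the hologram is returned as a continuous-valued array, mirroring the choice above to avoid quantization distortion of the frequency spectrum.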

    To make the output processing algorithm [Fig. 1(a)] more efficient, we encode the layer images into ground truth images as the training labels of the modified U-Net. Specifically, in ground truth images, each particle corresponds to a square pattern measuring 3  pixel×3  pixel. For a particle with a diameter of Dj on a layer (if present), its central position (xj,yj) coincides with the central position of the pattern in the ground truth image. The gray value of the 9 pixels in the pattern is equal to Dj×4. For instance, a particle with a diameter of 25  μm corresponds to a gray value of 100. These ground truth images will later be used as training labels for a U-Net network.
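    A minimal sketch of this ground-truth encoding, assuming particle centers fall at integer pixel positions away from the image border:

```python
import numpy as np

def encode_ground_truth(particles, size=256):
    """Encode particles as 3x3 squares whose gray value is 4x the diameter (in um)."""
    gt = np.zeros((size, size), dtype=np.float32)
    for xj, yj, dj_um in particles:
        # 3x3 pattern centered at (xj, yj), value Dj x 4 (e.g., 25 um -> 100).
        gt[yj - 1:yj + 2, xj - 1:xj + 2] = dj_um * 4
    return gt
```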

    2.2 OCE Framework

    The simulated hologram generated by the HCS method is then backpropagated to different distances to generate the raw reconstruction images (see sec. 1B in the Supplementary Material). If we aim to generalize a simulator-trained model to experiments, these ideally simulated raw reconstruction images, free of any noise introduced by the experimental setup and process, are not suitable for training the modified U-Net. Experiments contain abundant noise, and even within the same experiment, we sometimes need to adjust the exposure time of the image sensor or the intensity of the illumination, which can dramatically change the background light intensity of the captured hologram. The experimental raw reconstructed images are therefore generated from holograms with different noise and different DC terms. To make the simulated data resemble the experimental data, the OCE framework incorporates the experimentally introduced noise into the simulated hologram and compensates for the fluctuating illumination intensity. Averaging all raw reconstructed images acquired from a 3D particle field effectively removes random noise, directly revealing the system noise introduced by the experiment and the background light intensity. We use Gaussian noise and the Gamma transformation to synthesize the experimental system noise and the average light intensity, respectively. We identify the optimal Gaussian noise parameters and the Gamma transformation factor by comparing the average of the raw reconstructed images obtained from an experimental hologram with that obtained from a simulated hologram. With the optimal parameters, the reconstruction images obtained by simulation become more similar to those obtained in the specific experiment.
The OCE framework includes data preparation, a Holo-processing module, and an Optimization module for the optimization of the noise parameters and Gamma transformation parameter, which is illustrated in Fig. 3.


    Figure 3. (a) In-line holography experimental setup and data preparation. (b) OCE framework.

    2.2.1 Data preparation and Holo-processing

    As shown in Fig. 3(a), we used an in-line holography setup to capture particle field images in the experiment. The captured experimental hologram is cropped to a size of 256×256, a representative region of the overall experimental conditions, chosen so that the image quality is not poor and there are not too many impurities inside. It is then backpropagated to get a series of raw reconstructed images $I_i$, which we average to obtain $I_e$:

    $$I_e = \frac{1}{n}\sum_{i=1}^{n} I_i,$$

    where $I_e$ is the average of the raw reconstructed images from the experiment, $I_i$ is the $i$th raw reconstructed image, and $n$ is the number of reconstruction images. As shown in Fig. 3(b), we take this average image as the reference experimental image, which is one input to the subsequent optimization module in the OCE method. On the other hand, in the simulation, as in Sec. 2.1, we model the 3D particle field by layer images and obtain the simulated hologram. We then propose the Holo-processing, which aims to generate better reconstructed images so that we can average them to obtain the synthetic average image as the other input to the subsequent optimization module in the OCE method.

    In Holo-processing, we first add Gaussian random noise to the raw simulated hologram. However, as explained in Sec. 2.1, we did not quantize the hologram into an image format, so strictly speaking, our hologram does not exist as image data. Consequently, we cannot simply specify the mean and variance of the Gaussian noise and add it directly to the hologram as others do. Instead, we first calculate the range R (the difference between the maximum and minimum values) and the standard deviation σ of the raw simulated hologram. Then, we specify the multiples $N_R$ and $N_\sigma$ to control the level of the added noise. Random numbers of size 256×256 are generated from a normal distribution $\mathcal{N}(N_R \cdot R,\; N_\sigma^2 \cdot \sigma^2)$. Adding noise at this level to the simulated hologram yields a new hologram, which is then backpropagated to create raw reconstructed images. After that, we apply image enhancement to all the raw reconstructed images using the gamma transformation,37 which is expressed as

    $$s = c\,r^{\gamma},$$

    where r is the input gray value, s is the output gray value, c is the gray-scale factor (equal to 1 here), and γ is the Gamma factor, which controls the scale of the entire transformation. These transformed reconstructed images are the output of the Holo-processing module. We again average them to obtain the reference synthetic image $I_s$. Subsequently, the reference experimental image $I_e$ and the reference synthetic image $I_s$ are sent into the optimization module together as input.
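    The two Holo-processing steps can be sketched as follows; the normalization of the image to [0, 1] before the power law is our assumption for illustration:

```python
import numpy as np

def holo_processing_noise(hologram, N_R, N_sigma, rng=None):
    """Add Gaussian noise N(N_R*R, N_sigma^2 * sigma^2) scaled by the hologram's own statistics."""
    rng = rng or np.random.default_rng(0)
    R = hologram.max() - hologram.min()     # range of the continuous-valued hologram
    sigma = hologram.std()                  # standard deviation of the hologram
    noise = rng.normal(loc=N_R * R, scale=N_sigma * sigma, size=hologram.shape)
    return hologram + noise

def gamma_transform(image, gamma, c=1.0):
    """Gamma transformation s = c * r**gamma on an image normalized to [0, 1]."""
    lo, hi = image.min(), image.max()
    r = (image - lo) / (hi - lo + 1e-12)    # normalize before the power law (assumption)
    return c * r ** gamma
```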

    2.2.2 Optimization

    The objective of the Optimization module is to determine the optimal values for three parameters $(N_R, N_\sigma, \gamma)$ such that the input reference synthetic image closely matches the reference experimental image. To accomplish this, we aim to solve the following optimization problem:

    $$(N_R^*,\, N_\sigma^*,\, \gamma^*) = \underset{N_R,\, N_\sigma,\, \gamma}{\arg\min}\; 1 - \mathrm{SSIM}\!\left[I_e,\, I_s(N_R, N_\sigma, \gamma)\right], \tag{3}$$

    where $(N_R^*, N_\sigma^*, \gamma^*)$ are the corresponding optimal parameters for $(N_R, N_\sigma, \gamma)$, $I_e$ and $I_s$ are the reference experimental image and reference synthetic image, respectively, and SSIM is the structural similarity index measure.38

    The problem in Eq. (3) is a complex, high-dimensional, nonconvex nonlinear optimization problem. It is challenging to formulate a precise mathematical expression for this problem, let alone obtain complete gradient information. Moreover, empirical investigations using gradient-based optimization methods have revealed many local optima, causing the algorithms to become easily trapped and unable to escape. To address this, we employ the genetic algorithm (GA),39,40 a highly effective global optimization method for intricate high-dimensional problems, especially those that are not smooth, not differentiable, or lack an explicit mathematical model (see sec. 1F in the Supplementary Material).
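    A minimal real-coded GA sketch for this search is shown below. Everything here is an illustrative assumption: we use a single-window (global) SSIM rather than the windowed SSIM of Ref. 38 (in practice, e.g., scikit-image's implementation would be used), and the population size, mutation scale, and selection scheme are placeholders; the fitness would wrap the full Holo-processing pipeline of Fig. 3(b).

```python
import numpy as np

def ssim_global(a, b, C1=1e-4, C2=9e-4):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / \
           ((mu_a ** 2 + mu_b ** 2 + C1) * (va + vb + C2))

def genetic_optimize(fitness, bounds, pop=30, gens=50, seed=0):
    """Minimal real-coded GA: tournament selection, blend crossover, Gaussian mutation."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    P = rng.uniform(lo, hi, size=(pop, len(bounds)))
    best, best_f = None, np.inf
    for _ in range(gens):
        f = np.array([fitness(p) for p in P])
        i = int(np.argmin(f))
        if f[i] < best_f:
            best, best_f = P[i].copy(), f[i]
        # Tournament selection of parents.
        idx = rng.integers(0, pop, size=(pop, 2))
        parents = P[np.where(f[idx[:, 0]] < f[idx[:, 1]], idx[:, 0], idx[:, 1])]
        # Blend crossover with a reshuffled partner, then Gaussian mutation.
        partners = parents[rng.permutation(pop)]
        alpha = rng.uniform(0, 1, size=(pop, 1))
        P = alpha * parents + (1 - alpha) * partners
        P += rng.normal(0, 0.05 * (hi - lo), size=P.shape)
        P = np.clip(P, lo, hi)
        P[0] = best   # elitism: carry the best-so-far unchanged
    return best, best_f
```

    In use, the fitness would be `lambda p: 1 - ssim_global(I_e, holo_process(p))`, matching Eq. (3), with `holo_process` the noise-plus-gamma pipeline evaluated at parameters `p`.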

    The optimal parameters (NR*,Nσ*,γ*) obtained from the optimization module are utilized as inputs to the Holo-processing module. Specifically, the noise distribution determined by NR* and Nσ* is incorporated into the simulated hologram. Subsequently, the raw reconstructed images are derived, and a gamma transformation using γ* is applied to them. The resulting better synthetic images, which best replicate the experimental noise, together with the corresponding ground truth images by the HCS framework, are utilized as inputs and labels for training the modified U-Net model.

    2.3 Modified U-Net

    U-Net is a network architecture widely adopted in image segmentation tasks, particularly for medical images41 or images related to scattering media,22 where spatial precision is critical. It is especially well-suited for pixel-wise prediction tasks as it generates outputs that are spatially aligned with the inputs, enabling dense, per-pixel mapping.

    We have made some modifications based on the original U-Net,41 as shown in Fig. 4. We use mean square error as the loss function and choose the Adam optimizer, with the learning rate set to 10−4 or 10−5 according to specific conditions. When testing the network, the original predictive output image must pass through the image processing algorithm (see sec. 1D in the Supplementary Material) to obtain the required particle field information. After that, we apply the evaluation metrics to quantify the result (see Appendix for more details).


    Figure 4. Modified U-Net framework with the training input and label.

    The U-Net architecture is implemented using Keras 2.6.0 with the TensorFlow-GPU 2.6.0 backend. Other software environments are Python 3.7, CUDA 11.2, CUDNN 8.1, and MATLAB R2022b. The network training is conducted on an NVIDIA RTX 3090 24 GB GPU, and the generation of training data is conducted on an Intel Core i5-11400H CPU.

    3 Results

    3.1 Simulation Results

    In this section, we show how far we can go in the limit of particle concentration on the simulation side. We choose the particle per pixel (ppp) concentration metric21,22 here (see sec. 2A in the Supplementary Material).

    We model a 3D particle field containing 60 particles with diameters of 20 to 30  μm, 63 layers with an axial layer interval of 30  μm, and an axial distance range of 1800 to 3660  μm. We take 26.82  μm as the lateral size of a voxel. It is worth pointing out that, in the simulation, the transformed reconstructed images are obtained by Holo-processing with the original parameters (0, 0, 1), given that no experimental data are involved here. Because each hologram contains 256  pixel×256  pixel with a pixel pitch of 2.2  μm, each group of data thus corresponds to a particle field of 563  μm×563  μm×1860  μm (21  voxel×21  voxel×63  voxel). A total of 128 groups of training data are prepared, corresponding to a data shape of 256×256×8064, and the 100 epochs of training take 135 min with a learning rate of 10−4.

    In the test, we prepared nine groups of data and spliced them in a horizontal 3×3 arrangement, so the particle field size after splicing is 63  voxel×63  voxel×63  voxel. As each group of data contains 60 particles, this particle field contains 540 particles; that is to say, the ppp concentration is 1.36×10−1. Compared with 6.1×10−2, the highest concentration used by Shao et al.,22 this concentration is very high. We show how difficult it is to identify particles at this high concentration in Fig. 5(a), where two transformed reconstructed images at different distances [defined in Fig. 3(b)] are shown. Perhaps A1 and A4, the two in-focus particles, can be identified using the traditional image-threshold method or an iterative calculation method, but it is impossible to identify A2 and A3 using those methods. A2 and A3 are seriously disturbed by the defocused images of other particles, and even partially occluded, so only parts of the particle profiles are visible. In such cases, where particles overlap in the axial direction, it is difficult to obtain accurate results using traditional methods. In addition, traditional methods are generally based on light intensity (gray level), so the four in-focus particles B1 to B4 are likely to be considered background because their gray values are similar to the surrounding environment, whereas the three out-of-focus particles C1 to C3 are likely to be considered in-focus particles because of their greater brightness. Figure 5(a) fully demonstrates the difficulty and importance of reconstructing high-concentration particle fields.
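    Under the definition apparently used here (particle count divided by the lateral voxel count; see Sec. 2A of the Supplementary Material for the exact definition), the quoted concentration can be checked directly:

```python
# ppp (particles per pixel) for the spliced 3x3 test field described above
particles = 9 * 60            # nine groups of 60 particles each
lateral_voxels = 63 * 63      # 63 voxel x 63 voxel lateral extent
ppp = particles / lateral_voxels
print(f"ppp = {ppp:.2e}")     # ppp = 1.36e-01
```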


    Figure 5. (a) Two transformed reconstructed images with additional name labels, at a ppp concentration of 1.36×10−1. The solid red boxes mark all the in-focus particles, and the dashed blue boxes mark several out-of-focus particles. (b) Location comparison between the prediction result and the ground truth.

    Here, as in other works, both the lateral positioning error and the axial positioning error must be less than 1 voxel (26.82  μm in the transverse direction, 30  μm in the axial direction) for a particle to be considered correctly located. Figure 5(b) demonstrates the reconstructed particle field obtained using the reconstruction method proposed in this paper under the simulation conditions described in this section. The reconstructed particle field and the simulated ground-truth particle field are displayed simultaneously in the same coordinate system, with particles exhibiting positional errors marked in black. Among these more than 500 particles, the number of false-negative (FN) particles is 15 and the number of false-positive (FP) particles is 1 (see Appendix for related definitions). All 16 particles are considered error particles. The result shows that even at such a high concentration (2.2 times 6.1×10−2  ppp), we still obtain an extraction rate (ER) of 97.21%, and the Jaccard index (JI) reaches 0.97, which is better than both Shao et al.’s22 ER of 95% at 6.1×10−2  ppp and Chen et al.’s21 ER of 82.4% at 6.1×10−2  ppp. The results in Fig. 5(b) show that, with the proposed method, the reconstruction accuracy reaches a very high level at high concentration.
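    Assuming the usual definitions of ER and JI over the TP/FN/FP counts (the exact definitions are given in the Appendix), the quoted figures follow from the counts above:

```python
def extraction_rate(tp, fn):
    """ER: fraction of true particles that are correctly extracted."""
    return tp / (tp + fn)

def jaccard_index(tp, fn, fp):
    """JI: intersection over union of the predicted and true particle sets."""
    return tp / (tp + fn + fp)

tp, fn, fp = 540 - 15, 15, 1        # 540 particles, 15 FN, 1 FP
print(round(100 * extraction_rate(tp, fn), 2), round(jaccard_index(tp, fn, fp), 2))
```

    With these assumed definitions, ER evaluates to 97.22% and JI rounds to 0.97; the paper quotes 97.21%, a small discrepancy that may stem from rounding or definitional details in the Appendix.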

    3.2 Experimental Results

    We use polystyrene microspheres with diameters of 20 to 30  μm, freely suspended in ultrapure water and placed in a quartz cuvette with a thickness of 3 mm, as the particle field for the experiment (see sec. 2B in the Supplementary Material). After dilution, the particle concentration is 10,000  particles/mL. A hologram is captured by the image sensor, and a region with a size of 1024×1024 within the whole hologram is cropped and selected as the test data [Fig. 3(a)]. We crop this region into 4×4 sub-holograms with a size of 256×256 and backpropagate them to get 976 reconstructed images (1800 to 4800  μm, interval 50  μm, shape 256×256×976). It is important to note that, to align the experimental images with the synthetic ones, in which particles appear white instead of black (low gray value), we apply image inversion (255−Io) to the original backpropagated images. To examine the performance of the test result, we use the self-made compound focusing criterion (see sec. 1E in the Supplementary Material) and manual examination to obtain the reference results, which are used to evaluate the output of the simulator-trained model. These reference results are not the ground truth for training the U-Net but are used to evaluate the predictions of the simulator-trained U-Net. When validating the network, these raw reconstructed experimental images are taken as the input. Through the simulator-trained U-Net and the output processing algorithm, the prediction results are generated and then evaluated against the reference results. As the pixel size of the image sensor is 2.2  μm, the volume size corresponding to the test image is 2253  μm×2253  μm×3000  μm.

    To generate better synthetic training data, we must first find the optimal parameters for the Holo-processing. On the simulation side [Fig. 3(b)], as presented in Sec. 2.1, we generate a random particle field containing 12 particles with diameters of 20 to 30  μm. The z-direction distance from the layers in the particle field to the sensor plane is 1800 to 4800  μm, with an interlayer interval of 200  μm. One simulated hologram is generated and used as the reference simulated hologram. On the experiment side, among the 16 sub-holograms mentioned above, we choose one with relatively good image quality and few impurities as the experimental hologram for reference. During optimization, with the fitness function set to the objective function in Eq. (3), the GA runs for up to 100 generations unless termination conditions are met. We conduct multiple runs of the GA within a time duration of 20 min. Although the optimal parameters (NR*,Nσ*,γ*) vary with each run, they are found to be close to each other. We select the set of parameters that minimizes the fitness function as the best parameter values, which in this case are (0.3068, 0.2164, 0.8337). After that, we directly send (0.3068, 0.2164, 0.8337) into Holo-processing as parameters and generate all 256 groups of data processed by Holo-processing (1800 to 4800  μm, interval 200  μm, 12 particles per group). We thereby obtain 4096 better synthetic images (shape 256×256×4096) as the training input images. Figure 6 shows the advantage of the better synthetic data over normal simulation data, which is processed by neither the HCS method nor the OCE method.


    Figure 6. Reconstruction comparison among the normal simulation data, better synthetic data, and real experimental data under different reconstruction distances.

    As observed, of the three defocused particles A1, A2, and A3, A2 virtually mirrors A3, whereas A1 deviates noticeably by featuring an unusual black core. This attribute might pose a significant challenge for the model when attempting to distinguish in-focus particles. Similarly, the in-focus particle B2 bears a striking resemblance to B3, whereas B1 contrasts starkly, being partially obscured by surrounding noise. Moreover, the in-focus particles C2 and C3 share strong similarities, yet C1 is disproportionately drowned out by the background. Hence, the attributes of the better synthetic data align more closely with the characteristics of actual experimental data.

    With the corresponding 4096 synthetic ground truth images from the simulated particle fields generated by HCS as the labels, the U-Net model is trained for 100 epochs with a learning rate of 10−5, which takes 80  min. During training and validation, we consider particles with lateral errors within 10 pixels (22  μm) and axial errors within 150  μm to be true positives (TPs). To evaluate the overall performance, we define another indicator, point, which is equal to (ER + JI) when the number of test particles exceeds a certain threshold and 0 otherwise (see sec. 2B in the Supplementary Material).

    Figure 7 shows the results of our training and validation. As a comparison, we also used the other two types of data from Fig. 6 for training. Specifically, for the normal simulation data, we used exactly the same particle field as for the better synthetic data, meaning the labels are identical between the two, with the only difference being the training input; all other parameters remain consistent. For the real experimental data, we used another hologram captured at a different time to create the training input. For the labels corresponding to these training inputs, we directly employed the focusing method mentioned in Ref. 42 to locate the in-focus positions of the particles (though the resulting labels are not fully accurate); again, all other parameters remain consistent. We can see that for the better synthetic data, after only 100 epochs of training, with a training time of less than 80 min, the best result appears at the 84th epoch: JI reaches 0.886 and ER reaches 85.57%. However, for the normal simulation data, the highest JI is only around 0.3, and the ER does not exceed 30%. On the other hand, for the real experimental data, the highest JI is below 0.7, and the ER reaches only 50%. In addition, the model trained with the better synthetic data is the most stable during validation. In contrast, the model trained with real experimental data is highly unstable, and its training loss does not decrease to a sufficiently low value. Meanwhile, the performance of the model trained with normal simulation data on the validation dataset shows no significant improvement. This result indicates that although using real experimental data as input is better than using simulation data, this holds only if the corresponding labels are sufficiently accurate. When accurate labels cannot be ensured, using simulation data with perfectly accurate labels may be the better option.
Furthermore, when comparing the two types of simulation data—better synthetic data and normal simulation data—it becomes clear that the realism of the simulation data is quite crucial to the network’s performance.

    Figure 7.Training results for the three types of training datasets: loss during training, and JI, ER, and point metrics for validation.

    After training, the trained models produce the predicted particle-layer images. Figure 8 compares the predicted experimental results with the reference results. Here, the experimental results are generated by three U-Nets trained on the three types of data, respectively: normal simulation data, better synthetic data (obtained through the OCE framework proposed in this paper), and real experimental data. TP, FN, and FP particles are marked with different colors in the figure. For the better-synthetic-data result, in the lateral direction, most of the reconstructed particles match the reference result very well; in the axial direction, although some particles do not match the reference exactly, they are still within our tolerance (150 μm). Moreover, by inspecting the positions of the FN particles, we find that most of them lie at the edges of the 256 pixel × 256 pixel sub-images. In other words, the primary reason the model fails to detect some particles is that crucial information at the edges is lost during the image cropping process. These undetected FN particles are therefore understandable and acceptable, which, to some extent, further demonstrates the advantage of our LBS method. In the reconstruction from the model trained with normal simulation data, by contrast, most of the particles are missed. Although the model trained with real experimental data performs better than the one trained with normal simulation data, it is still clearly much worse than the model trained with better synthetic data. The results in Fig. 8 fully demonstrate the effectiveness of the proposed method.
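    The edge loss described above arises from tiling the full hologram into fixed-size sub-images before inference. A minimal tiling sketch (the 256-pixel size matches the text; the non-overlapping layout is our assumption, not the authors' exact cropping pipeline) illustrates how a particle near a tile border loses part of its diffraction pattern:

```python
import numpy as np

def tile_image(img, size=256):
    """Split a 2D image into non-overlapping size x size tiles.

    A particle whose diffraction fringes straddle a tile border is split
    across tiles, so each tile sees only part of its pattern.
    """
    h, w = img.shape
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]
```

    Cropping with an overlap comparable to the fringe extent and merging the per-tile predictions is a common way to recover such border particles.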

    Figure 8.Comparison of particle locations between the reconstruction and the reference results for the three types of training datasets. The reconstructed particles are categorized as true positive (TP), false negative (FN), and false positive (FP).

    For comparison, the concentration in Chen et al.’s21 experimental validation is around 50 particles/mL, and their ER is similar to ours, but their axial error is 1 mm, which is much larger than ours (150 μm). Compared with Tahir et al.’s32 experimental result, our criterion for a TP is stricter than theirs, which allows an error of 4 particle sizes in the transverse direction and 12 particle sizes in the axial direction. Moreover, we did not encounter the problem that appeared in their experiments, where performance drops dramatically as depth increases: as Fig. 8 shows, even at depths beyond 4000 μm, our performance remains good.

    4 Discussion and Conclusion

    In this paper, we present and experimentally demonstrate a fast and adaptive method, termed the LBS method, for generating highly realistic synthetic images for scattering media. Without any help from a neural network, an optimization process extracts, from a single measurement, the noise caused by multiple scattering, defocusing, the experimental system, and the environment; adding this noise to the simulated data yields the better synthetic data. To validate the quality of the better synthetic data, we use it to train a simple U-Net model. The accurate reconstruction of real high-concentration 3D particle fields demonstrates both the effectiveness of the created highly realistic synthetic data and the generalization capability of the simulator-trained U-Net model. The major challenge in reconstructing a high-concentration 3D particle field is the complex multiple-scattering process.31 Designing a network specifically tailored to this process can be exceedingly complex, and the training cost can be tremendous.21,32 In this paper, however, we successfully generate synthetic data that accurately reflect the impact of this complex physical process. The proposed LBS method therefore addresses the two critical issues in 3D particle field monitoring: the absence of experimental ground truth and the complexity of the physical model.

    The new concept that we propose in this paper, which is to synthesize the complex physical model output by optimizing the synthetic data using experimental output, offers inspiration and guidance to others employing simulation-based training methods, where paired ground-truth labels are also hard to obtain. For instance, medical imaging of tissue samples (phase objects) has predominantly depended on time-intensive staining procedures due to the significant barrier presented by multiple scattering in deep biological tissue. The implementation of the proposed HCS and OCE methods could potentially enable us to swiftly generate a sizable dataset with corresponding high-contrast ground truth images. This would negate the need for laborious processes such as mechanical movements and repeated microscopic imaging to acquire training data.

    This strategy of simulator-trained deep learning will make significant contributions to the next-generation deep learning methods,26,43 where strenuous manual labeling is no longer needed. We expect that the successful creation of synthetic data adhering to the multiple-scattering process will amplify our comprehension of this process in complex scattering media,44 potentially sparking groundbreaking formulations or models in this field. In addition, we expect that this LBS method will enhance the applicability of three-dimensional particle field imaging in areas such as particle tracking velocimetry,45,46 holographic cytometry,47–50 and water quality measurement.51–53

    Despite the promising results, our approach has certain limitations. We have validated the method only on polystyrene microspheres over a limited range of concentrations, which may impose an upper bound on the applicable sample density. Although our framework does not require explicit calibration of the experimental system, as the LBS method inherently captures the system’s noise characteristics and illumination conditions, more complex, denser, or thicker scattering media may require further modeling of the physical interactions, including more sophisticated noise and transformation representations. Although the noise characteristics are system-specific, our method is general and can be applied to other systems, such as in-line holographic particle imaging, if researchers choose to adopt our approach.

    Our method can be further improved in the future. Notably, the optimal SSIM value between the reference synthetic and experimental images currently stands at 0.7. If we can increase this value closer to 1, thereby enhancing the similarity between the synthetic and experimental images, it is likely that the network’s performance could also be improved. One promising approach is using a generative adversarial network.54 This technique, given a training set, learns to produce new data that statistically resembles the training set data. This method holds potential for generating superior synthetic data similar to experimental data, which could improve the model’s generalization and broaden its application across various research fields.
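    For reference, the SSIM value discussed above can be reproduced in its simplest global (single-window) form with a few lines of numpy. Production implementations, such as the windowed SSIM in scikit-image, average the index over local windows, so this is only an illustrative sketch:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified single-window SSIM between two images (2D numpy arrays)."""
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the
    c2 = (0.03 * data_range) ** 2  # standard SSIM convention
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

# identical images yield an SSIM of 1; noise lowers the value
img = np.random.default_rng(0).random((64, 64))
print(global_ssim(img, img))
```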

    5 Appendix: Evaluation Metrics

    We typically focus on two values, namely, the extraction rate and the accuracy. In the network’s prediction, there are three kinds of particles: true positives (TPs), false positives (FPs), and false negatives (FNs). A TP is a correctly positioned particle; an FP is an erroneously detected particle that appears in the reconstruction but not in the ground truth; an FN is a missed particle that is present in the ground truth but not detected in the reconstruction. The extraction rate (ER) is defined as

    ER = (1 − n_FN / n_GT) × 100%,

    where n_FN is the number of FN particles and n_GT is the number of particles in the ground truth (or in the reference result). The accuracy metric we use here is the Jaccard index (JI), computed as

    JI = n_TP / (n_TP + n_FP + n_FN).

    Because the network’s predictions cannot coincide exactly with the ground truth, we set a tolerance when judging whether a particle is a TP; the tolerance differs between simulation and experiment.
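    The definitions above translate directly into code. The greedy nearest-neighbor matching and the single isotropic tolerance below are illustrative simplifications (the paper uses different lateral and axial tolerances), not the authors' exact evaluation script:

```python
import numpy as np

def match_particles(pred, gt, tol):
    """Greedily match predicted to ground-truth particles within a tolerance.

    pred, gt: (N, 3) arrays of particle coordinates; tol: max allowed distance.
    Returns (n_tp, n_fp, n_fn).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    unmatched = list(range(len(gt)))  # ground-truth indices not yet claimed
    n_tp = 0
    for p in pred:
        if not unmatched:
            break
        d = np.linalg.norm(gt[unmatched] - p, axis=1)
        j = int(np.argmin(d))
        if d[j] <= tol:  # nearest unclaimed ground-truth particle is close enough
            n_tp += 1
            unmatched.pop(j)
    return n_tp, len(pred) - n_tp, len(gt) - n_tp

def extraction_rate(n_fn, n_gt):
    """ER = (1 - n_FN / n_GT) x 100%."""
    return (1 - n_fn / n_gt) * 100

def jaccard_index(n_tp, n_fp, n_fn):
    """JI = n_TP / (n_TP + n_FP + n_FN)."""
    return n_tp / (n_tp + n_fp + n_fn)
```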

    Acknowledgments

    This work was supported by the National Key Research and Development Program of China (Grant Nos. 2022YFB2804300 and 2023YFB3611500), the National Natural Science Foundation of China (Grant No. 62275218), the China Postdoctoral Science Foundation (Grant No. 2022M712586), and the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515011335).

    Bozhen Zhou is currently a graduate student at the Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. His current research interests include computational imaging, digital holography, and machine learning.

    Zhitao Hao received his MS degree from the Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. His work focuses on particle fields, digital holography, and computational imaging.

    Zhenbo Ren received his BS degree in optoelectronic information engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2011, his MS degree in instrument science and technology from Tsinghua University, Beijing, China, in 2014, and his PhD in optical engineering from the University of Hong Kong, Hong Kong, in 2018. He is currently an associate professor with the School of Physical Science and Technology, Northwestern Polytechnical University, Xi’an, China. His research interests include computational imaging and intelligent sensing.

    Edmund Y. Lam received his BS, MS, and PhD degrees in electrical engineering from Stanford University, Palo Alto, US. He is now a professor of electrical and electronic engineering and director of the Imaging Systems Laboratory at the University of Hong Kong, Hong Kong, China. He has broad research interests around the theme of computational optics and imaging and has published over 300 journal and conference papers. A recipient of the IBM Faculty Award, he is also a fellow of SPIE, Optica, the Institute of Electrical and Electronics Engineers (IEEE), the Society for Imaging Science and Technology (IS&T), and the Hong Kong Institution of Engineers (HKIE).

    Jianshe Ma received his BS degree in mechanical engineering from the Beijing University of Science and Technology, Beijing, China, in 1992, his MS degree in machine manufacturing from the Kunming University of Science and Technology, Kunming, China, in 1997, and his PhD in mechatronic engineering from the Harbin Institute of Technology, Harbin, China, in 2000. He is currently an associate professor and a member of the Optical Memory National Engineering Research Center. His research interests include optical storage, semiconductor lighting, and ultraviolet communication.

    Ping Su received her BS degree in measurement and control technology and instruments and her PhD in optical engineering from Tsinghua University, Beijing, China, in 2005 and 2010, respectively. She is now an associate professor at the Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. Her research interests include optical storage, holography, and ultraviolet communication. She is currently a member of SPIE and Optica.

    [27] S. I. Nikolenko, Synthetic Data for Deep Learning (2021).

    [39] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence (1992).

    [54] I. J. Goodfellow et al., “Generative adversarial nets” (2014).

    Bozhen Zhou, Zhitao Hao, Zhenbo Ren, Edmund Y. Lam, Jianshe Ma, Ping Su, "Learning from better simulation: creating highly realistic synthetic data for deep learning in scattering media," Adv. Photon. Nexus 4, 056007 (2025)

    Paper Information

    Category: Research Articles

    Received: Mar. 27, 2025

    Accepted: Jul. 14, 2025

    Published Online: Aug. 27, 2025

    Author emails: Zhenbo Ren (zbren@nwpu.edu.cn), Ping Su (su.ping@sz.tsinghua.edu.cn)

    DOI:10.1117/1.APN.4.5.056007

    CSTR:32397.14.1.APN.4.5.056007
