Lensless fiber endomicroscopy is an emerging tool for minimally invasive in vivo imaging, where quantitative phase imaging can be utilized as a label-free modality to enhance image contrast. Nevertheless, current phase reconstruction techniques in lensless multi-core fiber endomicroscopy are effective for simple structures but face significant challenges with complex tissue imaging, thereby restricting their clinical applicability. We present SpecDiffusion, a speckle-conditioned diffusion model tailored to achieve high-resolution and accurate reconstruction of complex phase images in lensless fiber endomicroscopy. Through iterative refinement of speckle data, SpecDiffusion effectively reconstructs structural details, enabling high-resolution and high-fidelity quantitative phase imaging. This capability is particularly advantageous for digital pathology applications, such as cell segmentation, where precise and reliable imaging is essential for accurate cancer diagnosis and classification. Our approach opens new perspectives for early and accurate cancer detection using minimally invasive endomicroscopy.
Lensless fiber endomicroscopy is an emerging tool in medical imaging and diagnostics, noted for its high imaging resolution and minimal invasiveness[1–10]. Unlike traditional endoscopes that rely on lenses on the application side to capture images, the lensless fiber endomicroscope employs computational imaging techniques to visualize microscopic structures with an ultra-thin probe. It therefore provides access to previously inaccessible areas and reduces neuronal damage during clinical operation, thereby minimizing patient discomfort and contributing to faster post-operative recovery[11–13]. Typical lensless fiber endomicroscopes employing wavefront shaping technologies usually rely on in vivo confocal fluorescence imaging[14–19]. However, the potential toxicity of fluorescent dyes has prompted the exploration of safer alternatives[16], and approaches are being sought to save the effort and time required for tissue staining[20–22]. Quantitative phase imaging (QPI) is a promising label-free technique that utilizes phase information to enhance endomicroscopic imaging capabilities[5,12,23]. Moreover, it enables the extraction of critical biophysical properties of biological samples, such as the refractive index[24,25], cell volume[26], and dry mass[27,28], which have been proven to be potential biomarkers for cancer diagnosis[29]. Therefore, achieving QPI through lensless fiber endomicroscopes can aid early-stage cancer diagnosis with minimal invasiveness. Nevertheless, due to the discrete core spacing and optical path differences in multi-core fibers (MCFs), the phase information gets distorted through the lensless endomicroscope[30,31]. To address this problem, the far-field amplitude-only speckle transfer (FAST) technique[5,32] was introduced to correct the phase distortion via an additional reference measurement, enabling high-fidelity QPI through the lensless fiber endomicroscope.
However, the iterative propagation process and the requirement of the additional calibration procedure limit FAST’s imaging speed. This constraint significantly hampers its application in real-time medical diagnostics during clinical procedures, thereby hindering the precision of a doctor’s clinical operation.
Recently, deep learning has emerged as a powerful tool in QPI[33–38]. Its effectiveness was first demonstrated in reconstructing phase and amplitude images through a multi-mode fiber[2,39,40], successfully recovering simple structures such as the handwritten digits in the MNIST dataset[41]. Nevertheless, multi-mode fibers are sensitive to bending, which limits their ability to achieve high-resolution phase imaging. In contrast, MCFs are relatively less susceptible to external factors and exhibit a lateral memory effect. Although deep learning has proved its potential in phase reconstruction through MCFs[42,43], previously reported methods are limited to reconstructing simple structures like handwritten digits and icons[44,45]. The performance of deep learning models is contingent upon the complexity of the images being reconstructed. Consequently, the relatively simplistic architecture of the conventional U-Net[46], generally utilized in previous works, is inadequate for the phase reconstruction of complex images through a scattering medium like an MCF. U-Net's locality-biased convolutional operations cannot capture the long-range, strongly non-linear dependencies introduced by the strong scattering, leaving its inverse mapping from speckle intensity to phase fundamentally under-constrained. The diffusion model[47–49] is a recently prevailing artificial intelligence (AI) model noted for generating high-quality, complex images. Unlike traditional image-to-image reconstruction methods, the generation procedure of diffusion models involves two key processes: the forward process, where data are gradually transformed into Gaussian noise through a sequence of steps, and the reverse process, where the model learns to reconstruct the original data from this noise.
The effectiveness of the diffusion model originates from its iterative generation process, which alleviates the generation challenge in each step and is useful for tasks where fine details are critical. Hence, diffusion models have been widely implemented in sophisticated tasks including drug design[50], protein design[51], and microscopic image data analysis[51–55]. While typical diffusion models offer powerful capabilities in unsupervised image generation, applying them in complex phase reconstruction tasks remains challenging. Typically, the unsupervised design of the diffusion model hinders phase reconstruction from the speckle images captured at the detection system. The most recent work incorporates the physical process into the denoising process of the diffusion model, achieving high-quality holographic reconstruction[56]. However, it relies on specific initialization and well-crafted step size to avoid undesired artifacts, which undermines the fidelity of the reconstructed images. Therefore, high-fidelity phase reconstruction of complex images using diffusion models remains challenging.
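The forward process described above can be made concrete with a toy sketch. Below is a minimal numpy illustration of the forward corruption step, assuming a linear β schedule; the actual schedule, step count, and scaling used by SpecDiffusion are not specified here.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factor for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0): mix the clean phase with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
phase = rng.uniform(0.0, 2.0 * np.pi, (64, 64))   # toy "phase image"
x_late, _ = forward_noise(phase, 999, alpha_bar, rng)
# by the last step the signal fraction is negligible: nearly pure Gaussian noise
```

Because each factor (1 − β) is slightly below one, the retained signal decays monotonically toward zero, which is why the final state is almost unrecognizable noise.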
In this paper, we introduce a conditional diffusion model, termed SpecDiffusion, to achieve high-fidelity quantitative phase reconstruction in the lensless fiber endomicroscope using speckle images captured at the detection system. As illustrated in Fig. 1, SpecDiffusion implements an iterative framework to break the challenging speckle reconstruction task into several steps. Inspired by phase retrieval algorithms[57], we use the intensity-only speckle images as a supervisor for the diffusion model in each step to ensure high-fidelity phase reconstruction. For the phase retrieval process, SpecDiffusion needs only one intensity image as the reference, and the quality of the reconstructed phase image is significantly enhanced through the iterative reconstruction process. To verify the effectiveness of the proposed diffusion model, we build an optical setup to generate a training dataset consisting of images from ImageNet[58] and their corresponding speckles through MCFs. Given the complexity and diversity of ImageNet images, this dataset can serve as a benchmark for assessing the phase reconstruction capability of the lensless fiber endomicroscope in various scenarios. The pretrained SpecDiffusion model on this dataset demonstrates high robustness in unseen structures. This is validated by a USAF resolution test chart, where the proposed diffusion-driven endomicroscope demonstrates its ability to resolve details down to several micrometers. Additionally, we explore the practical potential of SpecDiffusion in biomedical applications, achieving cancer tissue image reconstruction with high resolution by transfer learning. To further validate the efficacy of SpecDiffusion’s reconstructed tissue images in digital pathology, a zero-shot cell segmentation task is conducted on these reconstructed tissue images. 
Notably, SpecDiffusion not only achieves improvements in key image quality metrics but also closely mirrors the segmentation results obtained from the real phase image. By integrating a physics-informed, speckle-conditioned diffusion model with an ultra-thin, lensless fiber endomicroscope, our approach enables high-resolution, label-free QPI of complex biological tissues. This advancement facilitates real-time, minimally invasive diagnostics, offering sustainable clinical applications such as early cancer detection and longitudinal metabolic monitoring without the need for bulky optics or staining procedures.
Figure 1.Illustration demonstrates the working principle of the diffusion-driven lensless fiber endomicroscope. The quantitative phase reconstruction process involves using speckle images captured at the detection system to guide the denoising process of the SpecDiffusion model. The reconstructed phase images are applicable to digital pathology tasks, including cell segmentation, enhancing both accuracy and detail in diagnostic imaging.
2.1 Diffusion-Driven Quantitative Phase Imaging Through a Lensless Fiber Endomicroscope
We propose a conditional diffusion model, termed SpecDiffusion, for phase reconstruction of complex images. It breaks down the reconstruction process into iterative denoising steps, gradually transforming the initial random phase into the desired phase image. The speckle images captured at the far-field of the MCF serve as the conditioning images that guide the iterative denoising process during phase reconstruction. The SpecDiffusion architecture consists of three core components: an encoder, self-attention middle layers, and a decoder. Both the encoder and decoder are built using residual blocks, while the middle layers employ self-attention mechanisms. This design enables the network to effectively model long-range dependencies between features extracted from different spatial regions of the input image, enhancing its ability to capture complex patterns in the data.
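As a rough illustration of the self-attention operation in the middle layers, the sketch below shows a single attention head over flattened spatial features. The weight matrices `Wq`, `Wk`, `Wv` are hypothetical stand-ins for learned parameters, not the actual network.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a (tokens, features) array.

    Each spatial token attends to every other token, which is how the
    middle layers can relate features from distant regions of the image.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v
```

Unlike a convolution, whose receptive field is local, every output row here is a weighted mixture of all input tokens, capturing the long-range dependencies mentioned above.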
SpecDiffusion is trained from scratch using a supervised approach, ensuring high reconstruction fidelity. As shown in Fig. 2(a), during training, Gaussian noise is incrementally added to the ground truth phase images over several time steps, degrading them to an almost unrecognizable state. SpecDiffusion then operates as a denoising model, predicting and progressively removing the noise. Unlike conventional diffusion models, each denoising step is guided by conditioning speckle images, steering the noise reduction process in a physics-informed manner and enhancing the accuracy of phase reconstruction.
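A minimal sketch of one such training step under the standard noise-prediction objective follows; the `denoiser` callable and its `(noisy_phase, speckle, t)` signature are illustrative assumptions, not the actual SpecDiffusion interface.

```python
import numpy as np

def train_loss(denoiser, phase_label, speckle, t, alpha_bar, rng):
    """One supervised step: corrupt the ground-truth phase at step t, then
    score how well the speckle-conditioned denoiser predicts the injected noise."""
    eps = rng.standard_normal(phase_label.shape)
    x_t = np.sqrt(alpha_bar[t]) * phase_label + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_pred = denoiser(x_t, speckle, t)   # conditioned on the speckle image
    return np.mean((eps_pred - eps) ** 2)  # noise-prediction MSE objective
```

In practice this loss would be back-propagated through the network; the sketch only shows how the speckle enters the objective as a conditioning input at every step.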
Figure 2.Architecture of the speckle-conditioned diffusion model (SpecDiffusion). (a) Training process of SpecDiffusion. The phase label is mixed with Gaussian noise, and SpecDiffusion is trained to predict the imposed noise with the guidance of speckle. (b) Inference process of SpecDiffusion. With the guidance from input speckle, the randomly generated initial phase is gradually denoised and transformed toward the label phase by SpecDiffusion.
The inference process employed by SpecDiffusion is depicted in Fig. 2(b). In this process, the model learns to reconstruct the phase information incident to the MCF from a noisy random initial phase by performing a series of learned denoising steps. Each denoising step utilizes not only the current state of the reconstructed phase but also the speckle images to reverse the noise addition accurately. Subsequently, this denoised phase is mixed with a smaller amount of Gaussian noise, serving as the initialization for the next iteration. This denoising step is repeated over many iterations, progressively driving the phase image toward the actual phase. The culmination of this process yields the final phase image, which represents the predicted phase corresponding to the input speckle pattern. Detailed principles of SpecDiffusion are explained in Appendix A.
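The iterative inference loop can be sketched as a DDPM-style reverse process in which every step conditions on the speckle. This is a toy numpy illustration with an assumed linear β schedule and the same hypothetical `denoiser` signature as above.

```python
import numpy as np

def reverse_sample(denoiser, speckle, num_steps, shape, rng,
                   beta_start=1e-4, beta_end=0.02):
    """DDPM-style reverse loop: start from a random phase and repeatedly
    denoise, re-injecting a smaller amount of noise before each next step.
    Every step is guided by the same conditioning speckle image."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # random initial phase
    for t in range(num_steps - 1, -1, -1):
        eps_pred = denoiser(x, speckle, t)              # speckle-conditioned prediction
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
        if t > 0:                                       # smaller noise for the next iteration
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

The decreasing noise injection mirrors the description above: early iterations make coarse corrections, while late iterations refine fine structural detail.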
To visualize the phase reconstruction performance of SpecDiffusion, ground truth phase images are illustrated in Fig. 3(a). They are holographically projected on the facet of the MCF. When the modulated light field is transmitted through the MCF, its phase gets distorted due to the intrinsic optical path differences between cores and the honeycomb artifacts[8], thus forming the speckle patterns in Fig. 3(b). Based on the speckle patterns, we compare the performance of U-Net and SpecDiffusion in Figs. 3(c) and 3(d). To optimize computational efficiency, input images were down-sampled to a lower resolution during training, reducing model parameter requirements; with higher resolution, the reconstruction fidelity can be further enhanced. Benefiting from the iterative reconstruction process, SpecDiffusion demonstrates the capacity to reconstruct detailed structures within complex images. In the magnified region of the crab image, U-Net manages to reconstruct only the approximate shape of the pincers. In contrast, SpecDiffusion achieves a much clearer and more detailed reconstruction of the pincers. A further detailed comparison of the phase reconstruction results is given in the Supplement 1, Fig. S2. To quantitatively characterize the phase reconstruction fidelity, we employ the mean absolute error (MAE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM)[59], 2D correlation coefficient, and phase residual standard deviation (PRSTD) as evaluation metrics. The metric distributions of U-Net and SpecDiffusion are illustrated in the Supplement 1, Figs. S2(e)–S2(h). The averaged evaluation metrics are summarized in Table S1 in the Supplement 1. Quantitatively, SpecDiffusion outperforms the conventional U-Net on all metrics, indicating higher fidelity of the reconstructed phase images. The evaluation on ImageNet underlines SpecDiffusion's ability to enhance diagnostic imaging by providing clearer, more detailed reconstructions.
In medical applications, clearer reconstructions result in sharper boundaries between different regions, thereby improving discrimination for cancer diagnosis. Moreover, the detailed reconstructions provide more imaging information retrieved from MCF speckles, which is beneficial for further virtual staining.
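For reference, the pixel-wise metrics used here can be computed as follows (a numpy sketch; the assumed PSNR data range of 2π corresponds to the phase span and may differ from the paper's convention, and SSIM/PRSTD are omitted for brevity):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error (rad)."""
    return np.mean(np.abs(pred - gt))

def psnr(pred, gt, data_range=2.0 * np.pi):
    """Peak signal-to-noise ratio (dB) for a given data range."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def corr2d(pred, gt):
    """2D (Pearson) correlation coefficient between two images."""
    p, g = pred - pred.mean(), gt - gt.mean()
    return np.sum(p * g) / np.sqrt(np.sum(p ** 2) * np.sum(g ** 2))
```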
Figure 3.Diffusion-driven phase reconstruction of ImageNet images through the lensless fiber endomicroscope. (a) Ground truth phase images. (b) Speckle patterns captured at the detection side of the lensless fiber endomicroscope. (c) and (d) Reconstructed phase images by U-Net and SpecDiffusion. The SSIM value for each reconstructed image with respect to its corresponding phase label is shown. Scale bars: 50 µm.
2.2 Characterization of the Resolution Through the Diffusion-Driven Fiber Endomicroscope
Evaluation on natural images is important for assessing the general performance of SpecDiffusion. However, to precisely quantify the resolution of the diffusion-driven fiber endomicroscope, it is essential to employ the standardized USAF 1951 test chart as the evaluation target. Reconstructing the test chart with a deep neural network presents a unique challenge because it is an unseen object with distinct high-frequency features that differ significantly from the natural images used to train such models. Deep neural networks, particularly those trained on natural image datasets, are often biased toward capturing low-frequency, global features, which may result in poor reconstruction of fine, high-frequency details like those found in the test chart[60]. This issue is exacerbated in speckle-based imaging systems, where the random and complex nature of speckle patterns further complicates the process of accurately resolving small, detailed structures. As a result, the USAF test chart is a widely recognized benchmark for evaluating the performance of neural networks in resolving fine details, especially when applied to unseen objects with features that differ from the training data.
In the experiment, phase images of the USAF chart were projected onto the spatial light modulator (SLM), with the corresponding speckle patterns captured by the detection system. The reconstructed images were then analyzed to determine the system’s resolution limit by identifying the smallest resolvable elements on the test chart. As depicted in Fig. 4(d), the results demonstrate that the diffusion-driven fiber endomicroscope is capable of resolving bars with a width of 4.10 µm, indicating a precision suitable for complex imaging tasks. In comparison, conventional neural networks like U-Net can hardly achieve high-resolution reconstruction in this range. As demonstrated in Fig. S3 in the Supplement 1, the optimum resolution of the system using the U-Net for phase reconstruction is 10.25 µm. In Fig. 4(e), the reconstruction results from SpecDiffusion effectively delineate the peak phase across three lines, exhibiting a pronounced contrast between the peaks and valleys within the phase image. This distinct clarity demonstrates that SpecDiffusion, through the accurate reconstruction of speckle patterns, enables micrometer-level imaging resolution for the lensless fiber endomicroscope, indicating that the diffusion-driven fiber endomicroscope can be a promising high-resolution imaging probe for in vivo tissue imaging.
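For context, the line width of a USAF 1951 element follows a standard relation between its group and element numbers; the sketch below computes it. The reported 4.10 µm lies between group 6, element 6 (≈4.38 µm) and group 7, element 1 (≈3.91 µm).

```python
def usaf_linewidth_um(group, element):
    """Line width (µm) of a USAF 1951 target element.

    Resolution in line pairs per mm is 2 ** (group + (element - 1) / 6);
    one line is half a line pair, so the width is 500 / (lp/mm) µm.
    """
    lp_per_mm = 2.0 ** (group + (element - 1) / 6.0)
    return 500.0 / lp_per_mm
```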
Figure 4.Diffusion-driven phase reconstruction of the test chart through the lensless fiber endomicroscope. (a) Speckle pattern captured at the detection side of the lensless fiber endomicroscope. (b) Ground truth images of the test chart. (c) U-Net reconstructed phase image. (d) SpecDiffusion reconstructed phase image. (e) Phase reconstruction contrast of ground truth, U-Net, and SpecDiffusion. Scale bars: 50 µm.
2.3 Diffusion-Driven Fiber-Optic Phase Reconstruction Toward Human Cancer Tissue Samples
To further evaluate SpecDiffusion’s potential in biomedical applications, such as cancer diagnosis, we construct an additional dataset consisting of human cancer tissue images. Typically, deep learning models encounter significant challenges with overfitting, resulting in difficulties when generalizing across various application domains, particularly in the medical field. Nevertheless, the diversity inherent in the ImageNet dataset, combined with SpecDiffusion’s robust reconstruction abilities, fortifies our model against overfitting. Notably, with transfer learning on a small set of tissue images, which are readily obtainable in clinical scenarios, SpecDiffusion not only sustains its performance but also exhibits improved efficacy. A video demonstrating the denoising process of SpecDiffusion is shown in Visualization 1.
Based on the model pretrained on ImageNet, we fine-tune it on 150 tissue images; the reconstructed images of U-Net and SpecDiffusion are depicted in Figs. 5(c) and 5(d). SpecDiffusion clearly outlines the cell morphology and boundaries, facilitating precise cell analysis and enumeration, which are crucial for pathological assessment. Moreover, it vividly captures cellular sociology, depicting the cellular aggregation patterns within tissues. The MAE, PSNR, SSIM, and 2D correlation coefficient distributions of U-Net and SpecDiffusion are illustrated in Figs. 5(e)–5(h). The distributions of SpecDiffusion are concentrated in regions indicative of better metric performance and exhibit a more centered tendency on most metrics. This suggests that the phase reconstruction capacity of SpecDiffusion is reliable and consistent. The averaged evaluation metrics are summarized in Table 1. Quantitatively, SpecDiffusion reduces the MAE from 0.2091 to 0.0304 rad, indicating its accurate reconstruction. This reduction also highlights SpecDiffusion’s potential for further applications in measuring biophysical properties, such as the refractive index, dry mass, and optical transmission matrix[12]. As a result, real-time quantitative measurement during clinical operation becomes possible, enabling precise identification of boundaries between healthy and diseased regions and minimizing patient harm. Meanwhile, the PSNR is improved from 19.43 to 34.98 dB, suggesting that SpecDiffusion extracts more information from speckles, thereby supporting the success of subsequent virtual staining. SpecDiffusion also improves the SSIM from 0.4310 to 0.9697, an indicator that more structural information is retained, making the reconstructed tissue structures reliable for clinical cancer diagnosis.
Furthermore, the elevated 2D correlation coefficient, from 0.6999 to 0.9915, and the improved PRSTD value, from 0.2532 to 0.0422, demonstrate greater overall fidelity with less reconstruction bias between different regions, further proving the reliability of the tissue images. These metrics collectively affirm that SpecDiffusion’s reconstructions are high-fidelity and reliable, thereby demonstrating its applicability to complex image reconstruction tasks. More tissue reconstruction results and the corresponding residual maps are illustrated in the Supplement 1, Fig. S4. The residual maps demonstrate low reconstruction errors for SpecDiffusion, verifying its phase reconstruction precision. This enhanced performance and precision make SpecDiffusion a promising tool for advanced medical imaging and diagnostics, offering significant benefits for clinical practices.
Figure 5.Diffusion-driven cancer tissue reconstruction through the lensless fiber endomicroscope. (a) Ground truth phase images. (b) Speckle patterns from MCFs captured at the detection side of the lensless fiber endomicroscope. (c) and (d) Reconstructed phase image by U-Net and SpecDiffusion. (e) MAE, (f) PSNR, (g) SSIM, and (h) 2D correlation coefficient distributions evaluated on 1,000 reconstructed tissue images by U-Net and SpecDiffusion. (i) MAE, (j) PSNR, (k) SSIM, and (l) 2D correlation coefficient evaluated on 1,000 reconstructed tissue images by U-Net and SpecDiffusion with varying tissue dataset size. The SSIM value for each reconstructed image with respect to its corresponding phase label is shown. Scale bars: 50 µm.
Table 1. Phase Reconstruction Through the Lensless Fiber Endomicroscope Using U-Net and SpecDiffusion on a Human Cancer Tissue Dataset.
Method         MAE (rad)   PSNR (dB)   SSIM     2D correlation   PRSTD    IoU
U-Net          0.2091      19.43       0.4310   0.6999           0.2532   0.2480
SpecDiffusion  0.0304      34.98       0.9697   0.9915           0.0422   0.7658
Additionally, we examine how the size of the tissue dataset used for transfer learning impacts the reconstruction capabilities of SpecDiffusion, as detailed in Figs. 5(i)–5(l). Initially, SpecDiffusion displays limited reconstruction efficiency when applied directly to the tissue dataset. However, it rapidly adapts to the tissue reconstruction task and improves its performance with the support of a modestly sized dataset. In contrast, the conventional U-Net model shows only marginal enhancements in its reconstruction abilities through transfer learning, remaining substantially less effective overall. Moreover, we investigated how the number of denoising steps in SpecDiffusion influences its reconstruction quality, as depicted in the Supplement 1, Fig. S5. By increasing the number of denoising steps, SpecDiffusion trades inference speed for greater reconstruction accuracy and fidelity. With 100 denoising steps, SpecDiffusion achieves a reconstruction speed of 1.77 frames per second (fps). This renders SpecDiffusion a feasible tool for clinical applications where time sensitivity is not a critical concern.
To further assess the potential of our reconstructed images in conjunction with other AI technologies, we employ the Segment Anything Model (SAM)[61] for zero-shot cell segmentation tasks on reconstructed and ground truth images, as illustrated in Fig. 6(a). SAM stands out as an AI segmentation model characterized by its exceptional generalization ability, capable of delivering outstanding segmentation outcomes without prior knowledge about the samples. The details of the segmentation task are explained in the appendix. As AI technologies like SAM are increasingly integrated into medical diagnostics, it is important to ensure that the reconstructed images maintain fidelity to their originals within the realm of modern AI technology.
Figure 6.Cell segmentation by SAM on ground truth and reconstructed images. (a) Cell segmentation based on lensless MCF phase imaging by SAM. (b) Ground truth phase images and their cell segmentation results. (c) and (d) Reconstructed phase images by U-Net and SpecDiffusion, and their cell segmentation results. Scale bars: 50 µm.
The segmentation results of SpecDiffusion’s reconstructed images are illustrated in Fig. 6(d), which closely mirror those of the ground truth in Fig. 6(b), accurately depicting cell locations, sizes, and morphologies. The color-coded segmentation effectively distinguishes individual cells within dense populations, enabling precise isolation and analysis of cellular structures. This enhanced visualization capability supports clinical evaluation of cell morphology and states, significantly improving lesion detection potential. This performance starkly contrasts with U-Net, whose segmentation results, indicated in Fig. 6(c), are limited to identifying the general boundaries of different tissues without providing the granularity required for detailed cell-level segmentation. To quantitatively evaluate the segmentation quality, we employ the intersection over union (IoU) to measure the agreement between segmentation results derived from reconstructed phase images and their ground truth counterparts. The methodology for calculating IoU is detailed in the Supplement 1, Sec. 6. The averaged IoU is illustrated in Table 1. SpecDiffusion substantially enhances segmentation accuracy, improving the IoU from 0.2480 to 0.7658. The distribution of IoU values, highlighted in the Supplement 1, Fig. S6, indicates the aggregation of SpecDiffusion’s segmentation results within the higher IoU range, suggesting that the majority of its reconstructions are conducive to effective segmentation. The results signify the alignment between SpecDiffusion’s reconstructed images and the ground truth in the context of segmentation tasks, indicating its higher fidelity for downstream applications.
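The IoU used above can be computed per mask pair as follows (a minimal sketch over binary masks; the paper's exact matching of SAM instances to their ground truth counterparts is detailed in the Supplement 1, Sec. 6):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two binary segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 1.0
```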
3 Discussion
Typical diffusion models[62] rely on a sequence of noise-reduction steps based purely on learned data distributions without explicit physics-based guidance, which limits their effectiveness in applications requiring high precision. SpecDiffusion stands out from conventional diffusion models by incorporating speckle patterns as conditioning inputs, enhancing its capability for detailed and accurate image reconstruction in biomedical imaging. By utilizing physics-informed guidance in its denoising process, SpecDiffusion not only outperforms traditional models like U-Net in performance but also meets the specific demands of fiber endomicroscopic imaging more effectively. Furthermore, we have validated the diffusion-driven fiber endomicroscope’s performance in cell segmentation, and this capability can be further extended to other digital pathology applications, such as tumor boundary detection, tissue classification, and morphological analysis, with minimum invasiveness.
In the realm of generative models, SpecDiffusion presents notable advantages over established techniques such as generative adversarial networks (GANs)[63] and vision transformers[64]. While GANs are adept at generating visually appealing images, they frequently encounter issues related to training stability and convergence, particularly in complex imaging tasks where the preservation of fine details is critical. Conversely, vision transformers, despite their efficacy in processing sequential data, demand substantial computational resources and often struggle to efficiently capture spatial hierarchies that are crucial for high-resolution imaging. SpecDiffusion employs an iterative noise reduction process that is both stable and robust, facilitating the high-resolution reconstruction of phases in a physics-informed manner. This approach not only enhances the model’s ability to handle detailed imaging tasks but also mitigates common problems associated with other generative models, such as training instability and excessive computational load. The integration of speckle patterns as a conditioning input further aids in guiding the denoising process, ensuring that SpecDiffusion maintains a high level of accuracy and detail essential for biomedical applications. Furthermore, SpecDiffusion offers a resolution comparable to the iterative phase retrieval method[5], while significantly reducing reconstruction time from 24 s per image to just 0.56 s. This substantial improvement in efficiency highlights SpecDiffusion’s potential for real-time clinical applications. Moreover, the lensless fiber endomicroscope offers substantial advantages for in vivo imaging, allowing for deep tissue visualization with minimal invasiveness. We have demonstrated that, combined with the high-resolution and accurate phase reconstructions provided by SpecDiffusion, this system could be further employed for real-time, label-free imaging of tissues. 
When integrated with advanced AI models like SAM, the system enables precise cell segmentation and tumor boundary detection, enhancing the accuracy of cancer diagnosis and tissue classification. The minimally invasive nature of the fiber endomicroscope, along with its potential for high-fidelity imaging, makes this approach particularly suited for real-time diagnostics, surgical guidance, and early detection of diseases in hard-to-reach areas, improving both patient outcomes and digital pathology workflows.
Despite these promising results, there are limitations to the current implementation of SpecDiffusion. While the model is trained on a large dataset of 100,000 images, its performance may still be limited when applied to tissue types or imaging conditions not represented in the dataset. Additionally, although the reconstruction speed is a significant improvement, further optimizations may be necessary for real-time clinical applications. Future work could address these limitations by expanding the dataset to include more diverse tissue types and imaging conditions, as well as optimizing the model for faster, real-time applications. Additionally, exploring the integration of SpecDiffusion with other imaging modalities could provide more comprehensive diagnostic capabilities. Overall, SpecDiffusion represents a significant step forward in quantitative phase reconstruction for lensless fiber endomicroscopy, offering a powerful tool for improving precision in medical diagnostics, particularly in cancer detection and digital pathology with minimal invasiveness.
Acknowledgments
This work was supported by the Innovative Key Project of Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences (No. CX202501003), the National Key R&D Program of China (No. 2024YFC2417400), the National Natural Science Foundation of China (No. 62375280), and the Deutsche Forschungsgemeinschaft (DFG) (No. CZ55/40). The authors would like to thank Dr. Nektarios Koukourakis for the valuable support.
Methods
Experimental Setup
We implement the optical setup illustrated in Fig. S1 in the Supplement 1 to acquire the training images for the diffusion-driven fiber endomicroscope. In this setup, phase images are projected onto an SLM and imaged by the fiber endomicroscope, with the speckle patterns captured by the detection system. A detailed description of the optical setup is provided in the Supplement 1. In our experiment, we employ a 40 cm FIGH-350S multi-core fiber (Fujikura) with a 350 µm outer diameter, a 3 µm core diameter, and a 32.5% filling factor. Since SpecDiffusion acquires a general phase reconstruction capability through training on our ImageNet-derived dataset, there is no specific requirement on the optical fiber used. Regarding polarization, we illuminate the MCF with a linearly polarized light source and place a linear polarizer in front of the camera to filter the detected light. During the experiment, the SLM and detection camera were triggered together so that each phase image and its corresponding speckle pattern were captured simultaneously, ensuring accurate pairing in the training dataset. The distance between the far-field image plane for capturing speckles and the MCF facet on the detection side was 0.5 mm.
Training Dataset
Utilizing the optical setup, we construct a tailored dataset for QPI in lensless fiber endomicroscopy. Complex natural images from the ImageNet dataset[58] are converted to grayscale and further transformed into phase images. These phase images are resized to a resolution of 980 × 980 pixels and subsequently zero-padded to 1920 × 1080 pixels to accommodate the display resolution of the SLM. Computer-generated holograms are then calculated and displayed on the SLM to achieve a precise holographic representation of the ground-truth image. Compared to the MNIST and Fashion-MNIST datasets in our previous study[42], images from ImageNet are more complex and detailed, thus providing a more robust benchmark for evaluating phase reconstruction capabilities in real-world scenarios. In the experiment, we captured 100,000 paired phase and speckle images, allocating 90,000 pairs for training and the remaining 10,000 pairs for testing.
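The grayscale-to-phase conversion and SLM padding described above can be sketched as follows. The phase modulation range (here 0 to π) and the nearest-neighbour resize are illustrative assumptions, not the exact routines used in the paper; only the 980 × 980 target size and the 1920 × 1080 SLM canvas come from the text.

```python
import numpy as np

def grayscale_to_padded_phase(img, target=(980, 980), canvas=(1080, 1920),
                              max_phase=np.pi):
    """Convert a grayscale image to a phase map sized for the SLM.

    `max_phase` is an assumed modulation depth; the paper does not state
    the exact phase range used.
    """
    img = np.asarray(img, dtype=np.float64)
    # Nearest-neighbour resize to the 980 x 980 target (a simple stand-in
    # for a proper interpolation routine).
    rows = (np.arange(target[0]) * img.shape[0] / target[0]).astype(int)
    cols = (np.arange(target[1]) * img.shape[1] / target[1]).astype(int)
    resized = img[rows][:, cols]
    # Map intensities [0, 255] to phase values [0, max_phase].
    phase = resized / 255.0 * max_phase
    # Zero-pad to the SLM resolution (1080 x 1920), centring the image.
    out = np.zeros(canvas)
    r0 = (canvas[0] - target[0]) // 2
    c0 = (canvas[1] - target[1]) // 2
    out[r0:r0 + target[0], c0:c0 + target[1]] = phase
    return out

demo = grayscale_to_padded_phase(np.full((64, 64), 255.0))
print(demo.shape)  # (1080, 1920)
```

The zero-padded margin keeps the modulated region inside the SLM's active area while matching its native display resolution.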
SpecDiffusion Model
Our proposed SpecDiffusion model employs a conditional diffusion pipeline, decomposing the phase reconstruction process into sequential image denoising steps. This method significantly mitigates the complexity of phase reconstruction at each individual step. As depicted in Fig. 2(a), the SpecDiffusion training process adheres to a Markov chain framework. In image denoising task t, it begins with diffusing the target phase x_0 with Gaussian noise ε, generating a pseudo-input x_t to be denoised. This diffusion process is controlled by the diffusion coefficient β_t, as illustrated in Eq. (A1). Subsequently, this pseudo-input fuses with the speckle to form a composite input, thereby incorporating speckle information to supervise the following image denoising step performed by SpecDiffusion. The goal of SpecDiffusion is to accurately estimate the added Gaussian noise ε, compare its estimate with the actual noise, and refine its parameters through backpropagation. Notably, as t increases, the diffusion coefficient β_t also increases, transforming the target phase toward approximate Gaussian noise. As the target phase information within the composite input is gradually erased in the diffusion process, SpecDiffusion must rely on the speckle pattern to predict the added Gaussian noise; it thus potentially shifts its prediction goal from mere Gaussian noise estimation to a modification guiding the diffused input closer to the authentic phase image.
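The training step described above can be sketched in NumPy, assuming a standard DDPM-style forward diffusion and channel-wise fusion of the diffused phase with the speckle. The β schedule endpoints (10⁻⁶ to 0.01 over 2,000 steps) are taken from the Neural Network Training section; the zero-returning `model` stand-in and the exact parameterization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule; endpoints and step count follow the
# Neural Network Training section (1e-6 -> 0.01 over 2,000 steps).
T = 2000
betas = np.linspace(1e-6, 0.01, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal-retention ratio

def diffuse(x0, t, eps):
    """Forward diffusion: noise the target phase x0 toward timestep t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def training_step(x0, speckle, model):
    """One training step (sketch): diffuse the target phase, fuse it with
    the conditioning speckle, and regress the added noise with an L1 loss.
    `model` is a stand-in for the denoising network."""
    t = int(rng.integers(T))
    eps = rng.standard_normal(x0.shape)
    x_t = diffuse(x0, t, eps)
    composite = np.concatenate([x_t, speckle], axis=0)  # channel-wise fusion
    eps_hat = model(composite, t)
    return np.abs(eps_hat - eps).mean()                 # L1 objective

x0 = rng.standard_normal((1, 8, 8))
speckle = rng.standard_normal((1, 8, 8))
loss = training_step(x0, speckle, lambda c, t: np.zeros((1, 8, 8)))
print(loss > 0)  # True: the zero "model" cannot match the sampled noise
```

Because the diffused input carries less and less phase information as t grows, the speckle channel becomes the only reliable cue, which is what lets the trained network reconstruct phase from speckle alone at inference time.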
In the inference process, SpecDiffusion accomplishes phase reconstruction via sequential image denoising steps, methodically transforming Gaussian noise into the desired phase image. This process begins with a randomly generated initial phase x_T, which is combined with the given speckle input to create a composite input for the SpecDiffusion model. Leveraging the speckle information within this composite input, SpecDiffusion estimates the noise and eliminates it from x_t, yielding a denoised phase estimate. However, due to SpecDiffusion's inherent limitations, this estimate typically deviates from the exact target phase. To refine the predicted phase image, the estimate is renoised with Gaussian noise and then utilized as the initial phase x_{t−1} for the subsequent image denoising step t−1. The initial phase for any given denoising step is obtained as depicted in Eq. (A2), maintaining the same diffusion ratio as in the training process. Through T denoising iterations, the phase image is progressively refined, leading to a final output that accurately replicates the target phase.
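A minimal sketch of this denoise-then-renoise loop, under the same assumed DDPM-style parameterization as the training sketch: the inference schedule (10⁻⁴ to 0.09 over 1,000 steps) comes from the Neural Network Training section, while the stand-in `model` and the exact renoising rule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Inference schedule from the Neural Network Training section
# (1e-4 -> 0.09 over 1,000 steps).
T = 1000
betas = np.linspace(1e-4, 0.09, T)
alpha_bars = np.cumprod(1.0 - betas)

def sample(speckle, model, shape):
    """Denoise-then-renoise inference loop (sketch). `model` stands in
    for the trained noise-prediction network."""
    x_t = rng.standard_normal(shape)  # randomly generated initial phase
    for t in range(T - 1, -1, -1):
        composite = np.concatenate([x_t, speckle], axis=0)
        eps_hat = model(composite, t)
        # Eliminate the estimated noise to obtain a denoised phase estimate.
        x0_hat = (x_t - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        if t > 0:
            # Renoise the estimate so the next step sees the same
            # diffusion ratio as in training.
            z = rng.standard_normal(shape)
            x_t = (np.sqrt(alpha_bars[t - 1]) * x0_hat
                   + np.sqrt(1 - alpha_bars[t - 1]) * z)
        else:
            x_t = x0_hat
    return x_t

out = sample(np.zeros((1, 4, 4)), lambda c, t: np.zeros((1, 4, 4)), (1, 4, 4))
print(out.shape)  # (1, 4, 4)
```

Renoising the intermediate estimate, rather than carrying it forward directly, keeps every denoising step operating on inputs with the noise statistics the network saw during training.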
Neural Network Training
SpecDiffusion is implemented with the PyTorch framework and Python 3.9.12 and trained on a platform with a GeForce RTX 3090 GPU and an Intel Xeon Platinum 8369B CPU. Before training, speckle patterns and phase images are cropped to a fixed size. During training, we set the batch size to 16 with a fixed learning rate. The L1 distance between the predicted and actual noise is adopted as the loss function. The diffusion ratio of SpecDiffusion is set to 10⁻⁶ initially and linearly increases to 0.01 over 2,000 steps. Training proceeds for 400 epochs with the Adam optimizer[65] and takes 5.5 days on the ImageNet-derived dataset. During testing, SpecDiffusion is evaluated on a platform with an NVIDIA Tesla A100 GPU and an Intel(R) Xeon(R) Gold 6248R CPU, where the diffusion ratio is set to 10⁻⁴ initially and linearly increases to 0.09 over 1,000 steps.