Advanced Imaging, Volume 2, Issue 2, 021001 (2025)

Self-supervised PSF-informed deep learning enables real-time deconvolution for optical coherence tomography

Weiyi Zhang, Haoran Zhang, Qi Lan, Chang Liu, Zheng Li, Chengfu Gu, and Jianlong Yang*
Author Affiliations
  • School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China

    Deconvolution is a computational technique in imaging to reduce the blurring effects caused by the point spread function (PSF). In the context of optical coherence tomography (OCT) imaging, traditional deconvolution methods are limited by the time costs of iterative algorithms, and supervised learning approaches face challenges due to the difficulty in obtaining paired pre- and post-convolution datasets. Here we introduce a self-supervised deep-learning framework for real-time OCT image deconvolution. The framework combines denoising pre-processing, blind PSF estimation, and sparse deconvolution to enhance the resolution and contrast of OCT imaging, using only noisy B-scans as input. It has been tested under diverse imaging conditions, demonstrating adaptability to various wavebands and scenarios without requiring experimental ground truth or additional data. We also propose a lightweight deep neural network that achieves high efficiency, enabling millisecond-level inference. Our work demonstrates the potential for real-time deconvolution in OCT devices, thereby enhancing diagnostic and inspection capabilities.


    1. Introduction

Among on-site and in vivo tomographic imaging modalities (computed tomography, ultrasound, etc.), optical coherence tomography (OCT) is advantageous in resolution, speed, and sensitivity[1]. It has been successfully applied in various biomedical and industrial applications, such as diagnosing ocular and cardiovascular diseases[1,2] and detecting internal defects in materials[3]. However, OCT imaging is often plagued by speckle noise[4], which diminishes contrast and obscures fine details. Additionally, the point spread function (PSF) can introduce blurring and distortion[5,6], affecting the accuracy of OCT in applications requiring precise visualization.

Numerous efforts have been made to reduce speckle noise in OCT. Before the era of deep learning, averaging was the most common technique for OCT speckle noise reduction[7,8], and it has been employed in commercial systems to generate so-called high-definition scans. The use of deep learning algorithms significantly improves the efficiency of speckle noise reduction[9–13]. Nowadays, trained denoising deep neural networks (DNNs) are deployed in commercial OCT systems for so-called artificial intelligence (AI) enhancement.

On the other hand, deconvolution, another useful computational technique for OCT imaging enhancement that can reverse the blurring effects introduced by the PSF[14], has not yet been deployed in commercial OCT systems. In OCT, the application of deconvolution dates back to the work of Schmitt et al. in the 1990s[15,16]. They employed the CLEAN algorithm, originally developed for radio astronomy[17], to restore OCT images of biological tissue. The results of their work and subsequent studies[18–23] show that the resolution and contrast of OCT images can be improved by deconvolving the PSF of the OCT scanner. Recently, Wang et al. demonstrated that deconvolution could also be used to suppress sidelobe artifacts in OCT[6]. Ge et al. addressed the noise-contamination issue by combining Lucy–Richardson (LR) deconvolution with random phase masking[24]. However, the aforementioned studies utilized or devised iterative deconvolution algorithms that are computationally intensive and, as a result, fail to satisfy the real-time inference demands of OCT devices.

DNNs offer end-to-end architectures and rapid inference, making them suitable for developing real-time deconvolution solutions. Supervised or semi-supervised deep learning methods, though, typically require paired pre- and post-convolution data for training, which is difficult to obtain due to the highly scattering and interferometric nature of OCT imaging[25]. Iterative deconvolution algorithms, such as LR deconvolution[17], can be used to generate pseudo pre-convolution data, but they tend to introduce speckle-noise-induced artifacts[24]. These artifacts and noise features would then be learned by supervised DNNs during training.

By developing a self-supervised PSF-informed learning framework, we propose a method that enables real-time OCT deconvolution without requiring either experimental ground truth (GT) or additional data collection. By employing a combination of denoising pre-processing, blind PSF estimation, and sparse deconvolution[26], our method effectively enhances the resolution and contrast of OCT images, using only noisy B-scans as input. The resulting DNNs are not only capable of rapid inference but also adaptable to various imaging conditions, including different wavebands and imaging modalities, thus enabling the practical application of real-time deconvolution in OCT devices for improved diagnostic and inspection capabilities.

    2. Methods

    2.1. Theoretical background

In OCT, the image formation process can be characterized by the convolution of the original object $h(x,y)$ with the system's PSF $f(x,y)$, as described by the following equation:
$$g(x,y) = \Phi_{\mathrm{conv}}[h(x,y)] = h(x,y) * f(x,y).$$
This convolution process $\Phi_{\mathrm{conv}}$ represents the ideal blurring of the image without considering noise.

However, in practical OCT imaging, speckle noise significantly degrades image quality by introducing random intensity variations. This noise, represented as $\mathrm{noise}(x,y)$, contaminates the observed image $g(x,y)$, resulting in the noisy image $g_{\mathrm{noisy}}(x,y)$:
$$g_{\mathrm{noisy}}(x,y) = g(x,y) \cdot \mathrm{noise}(x,y) = [h(x,y) * f(x,y)] \cdot \mathrm{noise}(x,y).$$
When noise is minimal, the observed image can be approximated as $g(x,y) \approx g_{\mathrm{noisy}}(x,y)$, allowing deconvolution to reconstruct an estimate $\hat{h}(x,y) = \Phi_{\mathrm{deconv}}[g(x,y)]$ that closely resembles the original object $h(x,y)$. However, as noise increases, the deconvolved image $\hat{h}(x,y)$ becomes increasingly dominated by noise-induced artifacts, making the true structures unrecognizable.

To mitigate this issue, our OCT deconvolution method includes a denoising pre-processing step before the deconvolution:
$$\hat{h}(x,y) = \Phi_{\mathrm{deconv}}\{\Phi_{\mathrm{denoise}}[g_{\mathrm{noisy}}(x,y)]\},$$
where $g_{\mathrm{noisy}}(x,y)$ is the noisy image obtained by the OCT system, $\hat{h}(x,y)$ is the reconstructed high-resolution image, and $\Phi_{\mathrm{deconv}}$ and $\Phi_{\mathrm{denoise}}$ are the processes that enhance resolution and suppress noise, respectively. The denoising step specifically targets speckle noise, preparing the image for the deconvolution process to produce an artifact-free image with improved resolution and contrast.
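To make the formation model concrete, the sketch below simulates the blur-then-speckle process and the denoise-then-deconvolve ordering on a synthetic object. The Gaussian PSF widths and the unit-mean gamma speckle model are illustrative assumptions, not measured system parameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# Synthetic object h(x, y): a few bright point scatterers on a dark background.
h = np.zeros((128, 128))
h[32, 32] = h[64, 96] = h[96, 48] = 1.0

# Ideal blurring g = h * f, with f approximated by a 2D Gaussian PSF
# (sigma values are illustrative, not a measured OCT PSF).
g = gaussian_filter(h, sigma=(2.0, 4.0))  # (axial, transverse)

# Multiplicative speckle: g_noisy = g * noise, unit-mean gamma noise.
noise = rng.gamma(shape=4.0, scale=1.0 / 4.0, size=g.shape)
g_noisy = g * noise

# The framework's ordering: denoise first, then deconvolve, so that
# deconvolution does not amplify speckle into artifacts.
# h_hat = deconvolve(denoise(g_noisy))   # placeholders for the two stages
```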

    2.2. Self-supervised learning framework

    In this work, we present a self-supervised learning framework that enables real-time denoising and deconvolution of OCT images, as illustrated in Fig. 1. Rather than relying on paired datasets before and after deconvolution, which are challenging to obtain due to the highly scattering and interferometric nature of OCT, we utilize only noisy OCT B-scans for training.

Figure 1. Schematic of the real-time OCT deconvolution framework, encompassing both training and inference phases. The training phase involves collecting a diverse dataset and applying a subsampling strategy to generate denoised images, which are then utilized to train a denoising network as detailed in this paper. Subsequently, the sparse deconvolution is applied to the denoised images to create a set of enhanced images for supervision. The trained network is then deployed in the inference phase for direct enhancement of OCT scans, facilitating real-time visualization.

    For the denoising process, we assume that the signals across different pixels are correlated while the noise is uncorrelated[27]. This allows us to employ various subsampling strategies to acquire B-scan pairs that share the same content but exhibit different noise distributions. These pairs are then used to train a neural network that focuses on preserving the signal component of the B-scans while effectively reducing noise.

    For the deconvolution process, we capitalize on the known PSF characteristics of OCT imaging systems and integrate sparse deconvolution techniques[26]. This combination allows us to generate paired data from the denoised OCT B-scans, where each pair consists of the original blurred image and a version where the blur has been artificially reduced. By training a neural network on these paired data, we emphasize the recovery of the significant features, leading to a cleaner and more accurate restoration of the original object.

    2.2.1. Training denoising network

In the training of the denoising network, we adopt the checkerboard subsampling strategy[28], which enables us to create a discrete four-image training set from a single noisy image. This strategy divides the image into two sets of pixels ("even" and "odd") by sampling pixels in a checkerboard pattern. Specifically, we define two subsampled images $x_{\mathrm{even}}$ and $x_{\mathrm{odd}}$ as follows:
$$x_{\mathrm{even}}(i,j) = x[i,\, 2j + (i \bmod 2)], \qquad x_{\mathrm{odd}}(i,j) = x[i,\, 2j + (i \bmod 2) + 1],$$
where $x$ is the original noisy image, and $i$ and $j$ are the row and column indices, respectively. Similarly, we can define two other subsampled images along the other direction:
$$x'_{\mathrm{even}}(i,j) = x[2i + (j \bmod 2),\, j], \qquad x'_{\mathrm{odd}}(i,j) = x[2i + (j \bmod 2) + 1,\, j].$$
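A minimal NumPy sketch of this subsampling is given below; the modulo wrap at the right image edge is our implementation choice for odd rows, not something specified in the paper:

```python
import numpy as np

def checkerboard_subsample(x):
    """Split a noisy B-scan into two checkerboard-interleaved half-images.

    Implements x_even(i, j) = x[i, 2j + (i mod 2)] and
    x_odd(i, j)  = x[i, 2j + (i mod 2) + 1], wrapping at the right edge.
    """
    H, W = x.shape
    j = np.arange(W // 2)
    rows = np.arange(H)[:, None]
    even_cols = (2 * j[None, :] + rows % 2) % W
    odd_cols = (2 * j[None, :] + rows % 2 + 1) % W
    return x[rows, even_cols], x[rows, odd_cols]

# Example: build self-supervised training pairs from one noisy image.
x = np.random.rand(256, 256).astype(np.float32)
net_input, net_target = checkerboard_subsample(x)        # one direction
net_input_v, net_target_v = checkerboard_subsample(x.T)  # other direction
# (transpose the second pair back if row/column orientation matters)
```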

The neural network is then trained to learn the mapping from the even subsampled image to the odd subsampled image:
$$f_\theta(x_{\mathrm{even}}) \approx x_{\mathrm{odd}} \quad \text{or} \quad f_\theta(x'_{\mathrm{even}}) \approx x'_{\mathrm{odd}}.$$

    This can be viewed as a form of self-supervised learning, where the network is trained to predict the missing pixels in the odd subsampled image based on the information from the even subsampled image.

During training, the network's parameters $\theta$ are optimized to minimize the loss between the predicted and actual odd subsampled images. A common choice for the loss function in image-denoising tasks is the L2 loss, which can be expressed as
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i,j}\left[x_{\mathrm{odd}}(i,j) - f_\theta(x_{\mathrm{even}})(i,j)\right]^2,$$
where $N$ is the total number of pixels in the image, $x_{\mathrm{odd}}$ represents the actual odd subsampled pixel values, and $f_\theta(x_{\mathrm{even}})$ represents the values predicted by the network from the even subsampled image.

The final denoised image $\hat{x}$ is obtained by applying the trained network $f_\theta$ to the original noisy image $x$:
$$\hat{x} = f_\theta(x).$$

    This process results in a denoised image that has effectively learned to reconstruct the missing information in the original noisy image through self-supervision, without the need for any external clean reference images.

    2.2.2. Blind PSF estimation

In the training of the deconvolution network, the PSFs of OCT imaging systems are required for the sparse deconvolution[26], which is then used to generate paired samples. We extended the blind estimation method proposed in Ref. [6] to enable the direct extraction of 2D PSF information from OCT B-scans. It involves minimizing a cost function that balances fidelity to the measured data and the sparsity of the representation. The cost function can be expressed as
$$\min_{d,\, x_k}\left\{\frac{1}{2}\sum_{k=1}^{K}\left\|s_k(z) - d(z) * x_k(z)\right\|_2^2 + \lambda\sum_{k=1}^{K}\left\|x_k(z)\right\|_1\right\},$$
subject to the constraint that the norm of the PSF is unity: $\|d(z)\|_2 = 1$. Here, $s_k(z)$ represents a set of $K$ measured OCT A-line data, $d(z)$ is the axial PSF, $x_k(z)$ is the sparse representation of the tissue structure, and $\lambda$ is the regularization parameter that controls the trade-off between fidelity and sparsity.

    This problem can be efficiently solved using the alternating direction method of multipliers (ADMM) algorithm, which iteratively updates the estimates of the PSF and the sparse vectors. The ADMM algorithm involves the following steps:

Update the sparse vectors $X$ while keeping the PSF $D$ fixed:
$$X^{(t+1)} = \arg\min_X\left\{\frac{1}{2}\|DX - S\|_2^2 + \rho\|X - Y^{(t)} + U^{(t)}\|_2^2\right\}.$$

Update the PSF $D$ while keeping the sparse vectors $X$ fixed, subject to the unit-norm constraint on $d(z)$:
$$D^{(t+1)} = \arg\min_D\left\{\frac{1}{2}\|DX - S\|_2^2 + \rho\|X - Y^{(t)} + U^{(t)}\|_2^2\right\}.$$

Update the dual variable $U$:
$$U^{(t+1)} = U^{(t)} + X^{(t+1)} - Y^{(t+1)}.$$

We then apply the aforementioned procedures to the line data along the transverse direction of OCT B-scans to acquire the transverse PSF. Assuming that the 2D PSF can be modeled using a 2D Gaussian function, the full width at half-maximum (FWHM) values of both the estimated axial and transverse PSFs are calculated and used to generate a PSF kernel with the same FWHM in both directions:
$$f(x,y) = A \cdot \exp\left\{-\left[\frac{(x - x_0)^2}{2\sigma_x^2} + \frac{(y - y_0)^2}{2\sigma_y^2}\right]\right\}.$$
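The helper below sketches how such a kernel can be built from the estimated FWHM values, using the standard relation $\sigma = \mathrm{FWHM}/(2\sqrt{2\ln 2})$; the kernel size and FWHM values are illustrative, and the amplitude $A$ drops out after unit-sum normalization:

```python
import numpy as np

def gaussian_psf_kernel(fwhm_z, fwhm_x, size=31):
    """Build a normalized 2D Gaussian PSF kernel from per-axis FWHM values
    (in pixels), using sigma = FWHM / (2 * sqrt(2 * ln 2))."""
    sigma_z = fwhm_z / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    sigma_x = fwhm_x / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    r = np.arange(size) - size // 2
    zz, xx = np.meshgrid(r, r, indexing="ij")
    psf = np.exp(-(zz**2 / (2 * sigma_z**2) + xx**2 / (2 * sigma_x**2)))
    return psf / psf.sum()  # unit-sum normalization for deconvolution

# Example: hypothetical FWHM estimates (in pixels) for the two axes.
psf = gaussian_psf_kernel(fwhm_z=4.2, fwhm_x=9.7)
```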

    2.2.3. Training deconvolution network

    The next step is the sparse deconvolution of the denoised B-scans, which utilizes the estimated PSF as prior information. We proceed by solving an iterative deconvolution process to enhance the spatial resolution of the reconstructed image. This procedure can be divided into two primary phases: sparse representation and iterative reconstruction.

In the first phase, we aim to find a sparse representation $g$ by minimizing a regularized cost function. This approach penalizes non-essential features while preserving key structural information in $g$. The optimization problem is formulated as
$$\arg\min_g\left\{\frac{\lambda}{2}\|f - g\|_2^2 + R_{\mathrm{Hessian}}(g) + \lambda_{L_1}\|g\|_1\right\},$$
where $f$ represents the observed image (here, the denoised B-scan), $\lambda$ is a regularization parameter that controls the trade-off between data fidelity and sparsity, and $\lambda_{L_1}$ is another regularization factor applied to the L1-norm of $g$. The term $R_{\mathrm{Hessian}}(g)$ enforces smoothness and sparsity in $g$ by penalizing high-frequency components, thus reducing noise and preserving significant image features.

The Hessian-based regularization term $R_{\mathrm{Hessian}}(g)$ is defined as
$$R_{\mathrm{Hessian}}(g) = \|g_{xx}\|_1 + \|g_{yy}\|_1 + 2\|g_{xy}\|_1,$$
where $g_{xx}$, $g_{yy}$, and $g_{xy}$ denote the second-order partial derivatives of $g$ in the $x$-direction, $y$-direction, and the mixed $xy$-direction, respectively. This regularization encourages sparsity in the gradient domain, suppressing noise while retaining important structural features in the image.
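A direct way to evaluate this penalty is with finite-difference approximations of the second derivatives, as in the sketch below; the particular stencil is our assumption, and any consistent discretization would serve:

```python
import numpy as np

def hessian_l1(g):
    """R_Hessian(g) = ||g_xx||_1 + ||g_yy||_1 + 2 ||g_xy||_1, with the
    second-order derivatives approximated by finite differences."""
    g_xx = np.diff(g, n=2, axis=1)          # second difference along x
    g_yy = np.diff(g, n=2, axis=0)          # second difference along y
    g_xy = np.diff(np.diff(g, axis=1), axis=0)  # mixed derivative
    return np.abs(g_xx).sum() + np.abs(g_yy).sum() + 2.0 * np.abs(g_xy).sum()
```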

In the second phase, we perform iterative deconvolution to obtain the final solution $x$. A Bayesian-based LR deconvolution algorithm is employed to process these images. The optimization problem for deconvolution is formulated as
$$\arg\min_x \|g - Ax\|_2^2,$$
where $g$ is the image after sparsity reconstruction and $A$ is the previously estimated PSF. The objective is to minimize the difference between the observed image $g$ and the reconstructed image $x$ through iterative updates. The update rules are as follows:
$$y^{j+1} = x^j\left(h^T \cdot \frac{g}{h \cdot x^j}\right), \qquad v^j = x^{j+1} - y^j,$$
$$\alpha^{j+1} = \frac{\sum v^j \cdot v^{j-1}}{\sum v^{j-1} \cdot v^{j-1}},$$
with the sums taken over all pixels, and the next iteration is updated as
$$x^{j+1} = y^{j+1} + \alpha^{j+1}\left(y^{j+1} - y^j\right).$$
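The sketch below implements an accelerated Richardson–Lucy loop of this kind in the standard Biggs–Andrews momentum style; the exact update schedule and the clipping of the momentum factor are our assumptions rather than a verbatim transcription of the toolbox procedure:

```python
import numpy as np
from scipy.signal import fftconvolve

def accelerated_rl(g, psf, n_iter=50, eps=1e-8):
    """Accelerated Richardson-Lucy deconvolution (momentum-style sketch)."""
    psf_flip = psf[::-1, ::-1]          # adjoint of convolution with psf
    x = np.full_like(g, g.mean())       # flat non-negative initialization
    y_prev = x.copy()
    v_prev = None
    alpha = 0.0
    for _ in range(n_iter):
        # Multiplicative RL step: y = x * (h^T * (g / (h * x)))
        ratio = g / (fftconvolve(x, psf, mode="same") + eps)
        y = x * fftconvolve(ratio, psf_flip, mode="same")
        v = y - x                        # latest correction direction
        if v_prev is not None:           # momentum from successive steps
            alpha = np.clip((v * v_prev).sum()
                            / ((v_prev * v_prev).sum() + eps), 0.0, 1.0)
        x = np.clip(y + alpha * (y - y_prev), 0.0, None)
        y_prev, v_prev = y, v
    return x
```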

The algorithm iterates until convergence, progressively refining the reconstructed image. This process is performed on various images, forming a dataset of image pairs before and after the deconvolution. A network with the same structure is then trained using these image pairs, with the following loss function:
$$\mathcal{L}(\phi) = \alpha_1 \cdot \frac{1}{N}\sum_{i=1}^{N}\left[f_\phi(x_i) - y_i\right]^2 + \alpha_2 \cdot \left\{1 - \mathrm{SSIM}\left[f_\phi(x), y\right]\right\}.$$

Specifically, the loss is calculated as a weighted sum of the L2 loss (the first term above) between the network output after denoising and network deconvolution, $f_\phi(x)$, and the reference image obtained after denoising and sparse deconvolution, $y$, together with the structural similarity index measure (SSIM) loss (the second term), which measures the perceptual similarity between these images. The parameters $\alpha_1$ and $\alpha_2$ are weighting factors that balance the contributions of the L2 and SSIM losses, respectively.
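A possible PyTorch form of this mixed loss is sketched below; the uniform-window SSIM is a common stand-in for the Gaussian-windowed variant, and the weights alpha1/alpha2 are placeholders rather than the paper's tuned values. Inputs are assumed to lie in [0, 1], consistent with the sigmoid output layer:

```python
import torch
import torch.nn.functional as F

def ssim(a, b, c1=0.01**2, c2=0.03**2, win=11):
    """Simplified differentiable SSIM over (N, C, H, W) tensors,
    using a uniform local window via average pooling."""
    mu_a = F.avg_pool2d(a, win, 1, win // 2)
    mu_b = F.avg_pool2d(b, win, 1, win // 2)
    var_a = F.avg_pool2d(a * a, win, 1, win // 2) - mu_a**2
    var_b = F.avg_pool2d(b * b, win, 1, win // 2) - mu_b**2
    cov = F.avg_pool2d(a * b, win, 1, win // 2) - mu_a * mu_b
    s = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2))
    return s.mean()

def mixed_loss(pred, target, alpha1=0.8, alpha2=0.2):
    """L = alpha1 * L2(pred, target) + alpha2 * (1 - SSIM(pred, target))."""
    return alpha1 * F.mse_loss(pred, target) + alpha2 * (1.0 - ssim(pred, target))
```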

    2.3. Lightweight DNN architecture

    We propose a lightweight DNN architecture to accelerate the inference process for both denoising and deconvolution tasks, as illustrated in Fig. 2. This design is optimized for efficiency, combining compactness with a streamlined configuration to minimize computational resource requirements while maintaining high performance.

Figure 2. Architecture of the proposed lightweight DNN. (a) Overview of the DNN structure, highlighting four convolutional blocks followed by concatenations and an activation layer. The numbers indicate the corresponding number of channels. (b) Detailed composition of a convolutional block, consisting of two 3×3 convolutional layers with batch normalization and ReLU activation, and a 1×1 convolutional layer for residual connections. (c) Configuration of the activation layer, incorporating a 1×1 convolutional layer and a Sigmoid function to produce the final output.

The architecture begins with four convolutional blocks, followed by a concatenation operation that serves as the primary feature extraction mechanism. Each convolutional block comprises two 3×3 convolutional layers, each followed by batch normalization (BN) and a ReLU activation function. To enhance learning efficiency, a 1×1 convolutional layer is incorporated as a shortcut connection, aligning the dimensions of input and output feature maps to facilitate residual learning. Each block outputs a feature map with a channel depth of 32. This configuration can be mathematically expressed as
$$Y = \mathrm{ReLU}\big\{\mathrm{BN}\{\mathrm{Conv}_2[\mathrm{ReLU}(\mathrm{BN}\{\mathrm{Conv}_1(X)\})]\} + \mathrm{Conv}_{1\times 1}(X)\big\},$$
where $X$ represents the input feature maps, $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are the sequential 3×3 convolutions, and $Y$ represents the output feature maps. The residual connections improve gradient flow and stability during training, leading to more efficient convergence.
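A minimal PyTorch rendering of one such block is given below; the concatenation wiring across the four blocks is omitted, and the module name ConvBlock is ours:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Residual block: Y = ReLU(BN(Conv2(ReLU(BN(Conv1(X))))) + Conv1x1(X))."""

    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)  # aligns channel dims
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))
```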

    After each convolutional block, the output feature maps are concatenated along the channel dimension using a concatenation module. This operation increases the channel depth and integrates multi-scale features, enriching the representation for subsequent processing.

    The outputs of the four convolutional blocks are passed through an activation layer, which includes a 1×1 convolutional layer to reduce the channel dimensions to the desired output size, followed by a sigmoid activation function to generate the final output.

    Our proposed architecture achieves a significant reduction in network parameters, requiring only 69,000 parameters in total. The use of lightweight convolutional layers and BN ensures computational efficiency without compromising output quality. This makes the design highly suitable for real-time inference tasks in OCT imaging.

    2.4. Data collection

To assess the efficacy and generalization capability of our proposed method, we conducted both qualitative and quantitative analyses using a variety of private and public datasets, as detailed in Table 1. These datasets encompass diverse OCT imaging scenarios, including free-space and endoscopic imaging, and they span three primary operational wavelengths (840, 1060, and 1310 nm), with data collected both ex vivo and in vivo. The B-scans from gold nanoparticles, orange slices, organoids, and rabbit retinas were obtained using our custom-built 840 nm central wavelength spectral-domain OCT systems [as illustrated in Fig. 3(a)]. For further specifics, refer to our previous works[29–31]. The B-scans of swine arteries were derived from the intravascular OCT study detailed in Ref. [32], which comprises data from both a research prototype and a commercial system, both utilizing 1310 nm swept-source OCT technology. The human retina B-scans were sourced from the publicly accessible OCTA-500 dataset, acquired with an 840 nm spectral-domain OCT system, as reported in Ref. [33].

• Table 1. Summary of Datasets Used in Our Experiments and Their Conditions

  | Object | Waveband (nm) | Imaging optics | Acquisition condition | B-scan number | Data source |
  |---|---|---|---|---|---|
  | Gold nanoparticle | 840 | Free-space | Ex vivo | 300 | Our home-built systems |
  | Laser viewing card | 840 | Free-space | Ex vivo | 512 | Our home-built systems |
  | Human eye | 1060 | Free-space | In vivo | 135 | Commercial prototypes |
  | Orange slice | 840 | Free-space | Ex vivo | 300 | Our home-built systems |
  | Organoid | 840 | Free-space | Ex vivo | 600 | Our home-built systems |
  | Swine artery | 1310 | Endoscopic | In vivo | 1000 | Ref. [32] |
  | Swine artery (commercial system) | 1310 | Endoscopic | In vivo | 830 | Ref. [32] |
  | Human retina | 840 | Free-space | In vivo | 506 | Ref. [33] |
  | Rabbit retina | 840 | Free-space | In vivo | 291 | Our home-built systems |

Figure 3. Methodology and results of PSF estimation. (a) Schematic of our home-built OCT system. (b) Workflow for determining the PSF of our custom OCT systems using gold nanoparticles, including imaging, averaging, selection, fitting, and modeling steps. (c) Comparative analysis of OCT images from orange and organoid samples before and after denoising with the proposed method. (d) Comparative assessment of axial and transverse PSFs derived from orange and organoid images, contrasting estimates from original and denoised samples with GT measurements.

    2.5. Implementation

The collected data were initially processed to suppress noise prior to the deconvolution step. Each category was processed separately, as noise distributions and PSFs may vary across datasets. The previously described checkerboard subsampling was performed to generate two sets of images with nearly identical structures but different noise realizations, serving as the input and reference, respectively.

The denoising network was trained on a computer workstation equipped with an NVIDIA RTX 4090 GPU with 24 GB of memory, using the PyTorch framework. The dataset sizes ranged from 135 to 1000 images, and the image dimensions within each category were consistent, ranging from 250×250 to 1600×1024. For larger datasets, we allocated 10%–20% of the data as a test set to ensure robust generalization evaluation while mitigating the risk of overfitting. For smaller datasets, we opted not to partition the data into a test set in this stage, prioritizing full utilization of the available data for self-supervised training. Each network was trained for 6 epochs with a batch size of 1 and a learning rate of $10^{-4}$. The loss function employed in training was the L2 loss. Upon completing the training, all original images were processed using the trained network to generate denoised outputs.

    The blind PSF estimation method was employed to obtain an initial guess of the 2D PSF for each category. One pre-denoising image from each category was selected for estimation, based on the assumption that the PSF is consistent across images within the same category. The sparse deconvolution was then applied to the denoised image using a set of PSFs, slightly larger or smaller in both directions around the initial guess. Adjustments to the PSFs were made in both directions based on these deconvolution results. The underlying assumption is that a PSF that is too small will produce an image with no enhancement, whereas a PSF that is too large will introduce artifacts.
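In code, this adjustment amounts to sweeping candidate PSFs around the blind initial guess, roughly as sketched below (reusing the gaussian_psf_kernel helper from the earlier sketch; sparse_deconvolve is a placeholder for the sparse deconvolution routine, and the scale factors and initial FWHM values are illustrative):

```python
# Sweep candidate PSFs around the blind initial estimate; each result is
# then inspected: a too-small PSF yields no enhancement, a too-large PSF
# introduces artifacts, and the PSF is adjusted accordingly.
fwhm_z0, fwhm_x0 = 4.2, 9.7  # hypothetical initial blind estimate, in pixels
for s_z in (0.8, 1.0, 1.2):
    for s_x in (0.8, 1.0, 1.2):
        psf = gaussian_psf_kernel(fwhm_z0 * s_z, fwhm_x0 * s_x)
        # result = sparse_deconvolve(denoised_bscan, psf)  # placeholder
```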

The adjusted PSF was then used to perform sparse deconvolution on all denoised images in the category, generating a second dataset. The PSF estimation and sparse deconvolution for the endoscopic data were performed in the Cartesian coordinate system. Because some datasets are relatively small, all denoised–deconvolved data pairs except a 10% randomly selected test set were used for training, utilizing the same device and network structure. The training was conducted with a batch size of 1, a learning rate of $10^{-4}$, and the mixed loss function described in Sec. 2.2.3. The network was trained for 15 epochs, and the resulting model was applied to the test sets for evaluation.

    3. Results

    3.1. Blind PSF estimation directly from OCT B-scans

    We begin by evaluating the accuracy of our blind PSF estimation method, using the known PSF of OCT systems as GT for validation. For this purpose, we employ a custom-built OCT system[30] and experimentally measure the PSF using gold nanoparticles with dimensions ranging from 50 to 70 nm. As illustrated in Fig. 3(a), the OCT system consists of a spectral-domain interferometer operating at a central wavelength of 840 nm with an A-line rate of 80 kHz. A 2D galvo scanner is used for acquiring B-scans and volumes. It has an imaging range of 6 mm, a working distance of 30 mm, an axial resolution of 5 µm, and a transverse resolution of 20 µm.

Figure 3(b) illustrates the acquisition process of the GT PSF. We first imaged dispersed gold nanoparticle samples and acquired 500 consecutive scan images of the same cross-section. Frame averaging was applied to these images to reduce speckle noise, yielding a denoised B-scan representation of the gold nanoparticles. Next, the 2D Gaussian function defined in Sec. 2.2.2 was used to fit the nanoparticle imaging results. For B-scan images, the accuracy of the derived PSF can be impacted by particle adhesion or deviations from the scanning plane, which introduce artifacts that do not accurately represent the system's PSF. To address this, we manually selected particles exhibiting regular shapes that align with theoretical expectations, while excluding those distorted by adhesion or overly small and dim particles caused by shifts. The selected particles were then fitted with the Gaussian function using the Python Astropy library. The fitted parameters were averaged to produce a PSF model that reliably characterizes the imaging system.
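A sketch of this per-particle fit with Astropy is shown below; the ROI cropping and initial-guess values are our assumptions:

```python
import numpy as np
from astropy.modeling import models, fitting

def fit_particle_psf(roi):
    """Fit a 2D Gaussian to a cropped nanoparticle ROI (a small intensity
    patch around one selected particle) and return per-axis FWHM values."""
    yy, xx = np.mgrid[: roi.shape[0], : roi.shape[1]]
    g_init = models.Gaussian2D(
        amplitude=roi.max(),
        x_mean=roi.shape[1] / 2,
        y_mean=roi.shape[0] / 2,
        x_stddev=2.0,   # illustrative initial guesses, in pixels
        y_stddev=2.0,
    )
    g_fit = fitting.LevMarLSQFitter()(g_init, xx, yy, roi)
    to_fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0))
    return to_fwhm * g_fit.x_stddev.value, to_fwhm * g_fit.y_stddev.value
```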

    To assess the efficacy of our proposed PSF estimation technique, we tested it on samples exhibiting diverse imaging characteristics, as depicted in Fig. 3(c). We utilized both original and denoised samples to explore the impact of speckle noise on PSF estimation. The images, from left to right, represent original and denoised B-scans from orange and organoid samples. Notably, the orange samples display numerous grid-like patterns, which may aid in estimating the PSF. In contrast, organoid samples lack distinct features. Accurate PSF estimation in the absence of distinct features would underscore the method’s robust generalization capabilities.

    Figure 3(d) presents the PSF estimates derived from these B-scans, alongside the GT measurements obtained using the method outlined in Fig. 3(b). For the orange B-scans, the estimated PSFs in both axial and transverse directions closely match the GT, irrespective of whether the samples were original or denoised. However, for the organoid B-scans, the original samples provided accurate PSF estimates, while the denoised ones showed discrepancies, particularly in the transverse direction. This suggests that in scenarios where imaged objects lack structural features, speckle noise might offer additional cues for PSF estimation. Consequently, within our methodological framework illustrated in Fig. 1, we opt to use the original B-scans for the blind PSF estimation.

    3.2. OCT deconvolution in various imaging conditions

As early astronomical studies[34] have pointed out, deconvolution can enhance the resolution of discrete imaging objects and improve the quality (sharpness, contrast, etc.) of continuous imaging objects. To validate the effectiveness of the proposed OCT deconvolution method, we conducted tests on two types of OCT images, with discrete and continuous features, respectively. The baseline methods used for comparison include the well-known LR deconvolution algorithm[20], a recent breakthrough in the field of biophotonics, namely the sparse deconvolution algorithm[26], and a method that also employs our proposed self-supervised approach but adopts the classic U-Net architecture[35] as its network configuration.

    3.2.1. Resolution improvement for discrete objects

We first used gold nanoparticles, which are much smaller than the diffraction limit of the OCT system, as imaging objects to quantitatively evaluate the performance of the deconvolution algorithms in terms of resolution improvement. The experimental results are shown in Fig. 4(a), which demonstrates the imaging characteristics of individual gold nanoparticles in OCT B-scans. The original gold nanoparticle images are simultaneously affected by the limitations of the system's optical design and by speckle noise, resulting in a blurred appearance. After applying the LR deconvolution, the resolution of the gold nanoparticles is effectively enhanced, but significant noise sidelobes and artifacts remain. The sparse deconvolution, which incorporates the continuity of the Hessian matrix as a prior constraint, effectively reduces artifacts. Owing to its additional self-supervised denoising module, our method is even more effective in suppressing sidelobes and artifacts. Compared to the U-Net architecture, the network proposed in this paper, despite a significant reduction in the number of parameters, still achieves comparable deconvolution performance. Figure 4(b) further provides a three-dimensional characterization of a gold nanoparticle image before and after deconvolution with our method. Compared to the multi-peak structure before deconvolution, the image after deconvolution shows a single-peak distribution, closely resembling the theoretical imaging response of objects smaller than the diffraction limit.

Figure 4. Comparative analysis of resolution enhancement by different deconvolution algorithms on gold nanoparticles. (a) Imaging characteristics of individual gold nanoparticles in OCT B-scans, demonstrating the effects of various deconvolution techniques. (b) Three-dimensional intensity distributions of a nanoparticle before and after processing with the proposed method. (c) Axial and transverse intensity profiles at the maximum intensity point of a nanoparticle, comparing original and deconvolved images. (d) Quantitative resolution comparison, measured as FWHM, for 30 gold nanoparticles processed with different methods, with the statistical representation of the interquartile range and outliers. LR, Lucy–Richardson deconvolution; SD, sparse deconvolution.

    Figure 4(c) presents the axial and transverse intensity profiles of a nanoparticle as captured in OCT B-scans, both before and after deconvolution. The original PSF shows broader widths and is affected by speckle noise. While the LR and sparse deconvolution methods successfully enhance resolution, they introduce sidelobe artifacts to varying extents. In contrast, our approach not only achieves superior resolution enhancement but also effectively avoids the introduction of artifacts. We further calculate the FWHM of these profiles to quantitatively characterize the improvement in resolution. The results are given in Fig. 4(d). We randomly selected 30 OCT images of gold nanoparticles for statistical analysis. The box in the figure represents the interquartile range from the 25th to the 75th percentile, and the whiskers denote the boundaries for outliers. It can be observed that after the deconvolution, there is an approximate 40% improvement in resolution. Moreover, our method effectively removes artifacts, resulting in a lower statistical variance.

We further evaluate the resolution enhancement capabilities of the deconvolution algorithms on OCT imaging of orange samples, as the cell walls of these samples appear as discrete linear structures. The results are shown in Fig. 5. Figure 5(a) shows the comparison between the original B-scan and its deconvolved counterpart using our proposed method. Figure 5(b) presents the intensity profiles of the images processed with various deconvolution algorithms along the green lines indicated in Fig. 5(a). Figure 5(c) compares the original images and the results after processing with different deconvolution algorithms for the regions of interest outlined by dashed boxes in Fig. 5(a). Compared to the original image, the various deconvolution algorithms achieve approximately 40%–50% enhancement in both axial and transverse resolution. However, the LR deconvolution introduces image artifacts resembling salt-and-pepper noise, while the sparse deconvolution, though not generating artifacts, does not effectively address the speckle noise within the tissue. In comparison, our method, whether it employs the U-Net architecture or the architecture proposed in this paper, achieves superior resolution enhancement and more effective suppression of noise and artifacts.

Figure 5. Comparison of different algorithms for OCT images of orange samples. (a) Comparison between the original B-scan and its deconvolved counterpart using our proposed method. (b) Intensity profiles of the images processed with various deconvolution algorithms along the green lines indicated in (a). (c) Comparison of the original images and the results after processing with different deconvolution algorithms for the regions of interest outlined by dashed boxes in (a). LR, Lucy–Richardson deconvolution; SD, sparse deconvolution.

    3.2.2. Imaging enhancement for continuous objects

For the validation of deconvolution methods on continuous imaging objects, we first employ various ocular OCT imaging datasets, given that ophthalmology is one of the most significant application areas for OCT. The results are depicted in Fig. 6. Figure 6(a) shows the results from a commercial prototype designed for ocular biometry, employing a 1060 nm swept-source (SS) OCT system with an enhanced depth of focus, enabling simultaneous imaging of both the anterior and posterior segments of the eye. Our deconvolution method significantly enhances the image sharpness and contrast-to-noise ratio (CNR), and it effectively improves the resolution of discrete point-like objects within the image, such as eyelashes. In particular, the enhanced signals from the posterior segment of the eye contribute to increased measurement accuracy in ocular biometry. In contrast, while the LR deconvolution does enhance the image clarity to some extent, it introduces severe noise artifacts. The sparse deconvolution mitigates the artifact issue to a certain degree, but its performance in image enhancement is comparatively limited.

Figure 6. Enhanced OCT images of ocular samples using different deconvolution methods. (a) Comparison of human eye images, with insets showing detailed regions. (b) and (c) Posterior human eye[33] and rabbit retina images before and after enhancement with our proposed method, highlighting improved clarity and resolution. LR, Lucy–Richardson deconvolution; SD, sparse deconvolution.

    Figures 6(b) and 6(c) demonstrate the results of a publicly available retinal OCT dataset and our own in vivo rabbit imaging used for retinal surgery research, respectively. Figure 6(b) highlights pathological features indicative of retinal detachment, while Fig. 6(c) shows low signal-to-noise ratios (SNRs) in retinal layering due to cataract induced by surgical invasion. It can be seen that our deconvolution methods significantly enhance the CNR of various retinal layers. In comparison, the results from the LR and sparse deconvolution methods, similar to those in Fig. 6(a), lead to artifacts and only marginal improvements in CNR.

Fiber-based endoscopic imaging is another important application scenario for OCT. Here we employ the swine artery datasets acquired from a research prototype using an MHz SS laser [the results are shown in Figs. 7(a) and 7(b)] and a commercial intravascular OCT system [the results are shown in Figs. 7(c) and 7(d)]. The details of these data are described in Refs. [32,36]. All deconvolution algorithms enhance the imaging and improve the resolution of endoscopic OCT. However, the LR and sparse deconvolution also introduce smear-like artifacts into the images. Such artifacts do not occur with our method, whether using the U-Net or the lightweight network architecture proposed in this paper. At the same time, our method delivers superior enhancement in terms of SNR and sharpness.

Figure 7. Comparison of different algorithms for swine artery endoscopic scans. (a) and (c) Endoscopic scans of the swine artery on two systems[32] before and after enhancement using our method. (b) and (d) Comparisons of different algorithms corresponding to the respective regions marked on the left.

To quantitatively assess the impact of deconvolution algorithms on the enhancement of continuous imaging objects, in the absence of GT data, we employed the CNR as our metric, defined as
$$\mathrm{CNR} = \frac{|\mu_s - \mu_b|}{\sigma_b},$$
where $\mu_s$ and $\mu_b$ are the means of the signal and background regions, respectively, and $\sigma_b$ is the standard deviation of the background region.
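Computing this metric requires only the two region masks, as in the sketch below; the masks select the signal and background regions of interest, which in practice are chosen manually:

```python
import numpy as np

def cnr(img, signal_mask, background_mask):
    """CNR = |mu_s - mu_b| / sigma_b for boolean region masks."""
    mu_s = img[signal_mask].mean()
    mu_b = img[background_mask].mean()
    sigma_b = img[background_mask].std()
    return abs(mu_s - mu_b) / sigma_b
```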

Table 2 presents a quantitative assessment of the enhancement in CNR for OCT images of swine artery, human retina, human eye, and rabbit retina samples following deconvolution with various methods. The table provides mean CNR values along with their standard deviations for each sample type before and after deconvolution. The original images, which have not undergone any deconvolution, serve as a baseline with CNR values of 5.64, 6.00, 2.53, and 12.03 for the respective sample types. The LR deconvolution method decreases the CNR for all four sample types, with values of 3.23, 2.82, 1.31, and 6.04. The sparse deconvolution method also yields lower CNR values for the first three sample types (3.68, 3.10, and 2.21), with a modest increase to 15.28 for the rabbit retina. In contrast, the proposed method, with both the U-Net architecture and the lightweight network, significantly enhances the CNR: the U-Net variant yields values of 6.06, 28.42, 13.35, and 58.50, and the lightweight network achieves 6.19, 23.24, 17.71, and 72.45. These results indicate that the proposed deconvolution methods effectively improve the CNR of OCT images across different sample types, suggesting better image quality and clearer visualization of details after deconvolution; the lightweight network shows higher CNR than the U-Net variant on three of the four sample types.

• Table 2. Comparison of CNR Enhancement in OCT Images: Mean and Standard Deviation Values for Swine Artery, Human Retina, Human Eye, and Rabbit Retina Samples Before and After the Deconvolution with LR Deconvolution, Sparse Deconvolution, and Our Proposed Methods

  | Method | Swine artery | Human retina | Human eye | Rabbit retina |
  |---|---|---|---|---|
  | Original | 5.64 (1.87) | 6.00 (1.53) | 2.53 (0.89) | 12.03 (0.48) |
  | LR | 3.23 (0.99) | 2.82 (0.74) | 1.31 (0.46) | 6.04 (0.24) |
  | SD | 3.68 (1.48) | 3.10 (0.69) | 2.21 (0.80) | 15.28 (0.65) |
  | Ours (U-Net) | 6.06 (2.05) | 28.42 (5.38) | 13.35 (5.86) | 58.50 (4.78) |
  | Ours | 6.19 (2.19) | 23.24 (10.29) | 17.71 (8.05) | 72.45 (7.22) |

    3.2.3. Generalization to unseen objects

To evaluate the generalization capability of our proposed method, we deployed the trained deconvolution model on OCT images of previously unseen objects. Since such data are not available in public datasets, we acquired them using our home-built OCT imaging system. Here we employed a deconvolution model trained using all the gold nanoparticle, laser viewing card, orange slice, organoid, and rabbit retina data listed in Table 1.

We used the data of optical tape [Figs. 8(a)–8(c)] and a human finger [Figs. 8(d)–8(f)] as examples. Figures 8(a) and 8(d) present the original and enhanced images, with regions marked by white dashed boxes enlarged and shown in Figs. 8(b) and 8(e), respectively. Figures 8(c) and 8(f) depict the intensity profiles along the green lines indicated in Figs. 8(a) and 8(d). As shown in the figure, our method enhances resolution and structural clarity effectively in both ex vivo and in vivo scenarios. The boundaries of each layer and the particles within the optical tape become sharper and thinner after enhancement. For the human finger scans, the contrast and definition of the skin layer boundaries and sweat ducts improved, even though no similar samples were included in the training set. The intensity profile of the optical tape accurately identifies the positions of four layers and an interstitial particle, even under high noise conditions. Similarly, the enhanced intensity profile of the finger scan reveals the stratum corneum, epidermis, and dermis with higher contrast and sharper boundaries.

Figure 8. Generalization evaluation of the proposed deconvolution method using previously unseen OCT images of optical tape and a human finger. (a), (d) Original and enhanced OCT images, with white dashed boxes highlighting regions of interest that are enlarged in (b), (e). (c), (f) Intensity profiles along the green lines in (a) and (d), demonstrating improved resolution and structural clarity.

    3.3. Real-time inference capabilities

We further compare the capabilities of these deconvolution methods in terms of inference speed. Figure 9(a) presents a comparative analysis of inference times for images of diverse sizes, each category comprising 100 samples. The inference time is plotted on a log scale.

Figure 9. Inference time and model parameter sizes for different algorithms. (a) Comparison of inference time for different algorithms on images of varying sizes. (b) Model parameter sizes for the U-Net and the proposed lightweight network.

The sparse deconvolution method, due to its iterative nature, is consistently the most time-consuming across all image sizes. The LR deconvolution, leveraging MATLAB's optimized built-in function, offers relatively faster processing by skipping steps such as sparse reconstruction. However, its processing speed still cannot meet real-time requirements. In contrast, our deep-learning-based methods, particularly the proposed lightweight network, exhibit substantially faster inference speeds. This is remarkable considering that they require two sequential network passes for processing. The efficiency of our design allows for millisecond-level inference even with larger image sizes. Additionally, our proposed network architecture is nearly twice as fast as the U-Net, a result of its compact design that is 90% smaller, as shown in Fig. 9(b). These advantages of our methods, with either the U-Net or the proposed architecture, stem from the end-to-end, iteration-free inference of deep learning.
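For reference, inference times of this kind can be measured with a CUDA-synchronized timing loop such as the sketch below; the warm-up count and batch shape are our choices, and in deployment a denoising pass and a deconvolution pass would run back to back:

```python
import time
import torch

@torch.no_grad()
def mean_inference_ms(model, size, n=100, device="cuda"):
    """Average per-B-scan inference time over n runs (requires a CUDA GPU)."""
    model = model.to(device).eval()
    x = torch.rand(1, 1, *size, device=device)
    for _ in range(10):                  # warm-up runs
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        model(x)                         # one network pass per iteration
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n * 1e3
```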

    3.4. Performance deconstruction

We assessed the impact of the parameters and methods used in our framework on the enhancement quality of our final results. The comparisons are illustrated in Fig. 10. In Fig. 10(a), we present the outcomes of our processing framework when the denoising method is replaced. Compared to the original image, the result without the denoising step is noisy, and structural details are compromised. Any denoising method contributes to quality improvement: the Noise2Noise[37] method, implemented with our proposed lightweight network, demonstrates strong denoising capabilities but requires paired images for training; the BM3D algorithm smooths structures and suppresses noise, though its long inference time makes it unsuitable for real-time scenarios; and the K-SVD method introduces artifacts into the structures. The method we employed offers a balanced and versatile performance in terms of inference time, denoising effectiveness, and dataset availability. Notably, our comparison indicates that under different conditions, such as paired datasets or no inference-time constraints, the modules of our framework can be swapped to achieve even better results.

Figure 10. The impact of parameters on the results at various stages of our workflow. (a) Original and enhanced images of orange slices using different denoising techniques. (b) Enhanced images with varying PSF parameters applied in the sparse deconvolution step. Each row or column shares the same standard deviation along the y or x axis, respectively. (c) Enhanced images with different sparsity (SP) or fidelity (FI) parameters, while the other parameter is optimized in the sparse deconvolution step.

    We then illustrate the impact of the PSF employed in deconvolution by comparing it in Fig. 10(b). The images within the blue region exhibit relatively optimized outcomes in each direction, characterized by improved resolution and minimal artifacts compared to those outside this area. It is evident that the standard deviation along any axis governs the resolution enhancement in that specific direction. From this, we infer that a PSF that is too small in a given direction results in negligible enhancement, while one that is too large causes structural distortion. The PSF at the center, with parameters optimized in both directions, demonstrates the most effective enhancement among the examples shown. Additionally, it is noteworthy that images processed with a PSF close to the optimized value can sometimes yield similar results. This observation underscores the robustness of our algorithm, indicating that as long as the PSF estimation lies within a certain range, the outcomes remain satisfactory.

We analyze the roles of sparsity and fidelity in shaping fine details and controlling structural deformation. These two parameters, integral to the sparse-reconstruction objective in Sec. 2.2.3, are illustrated in Fig. 10(c). As shown in the left column, excessive sparsity results in an overall reduction in pixel intensity, causing true structures to disappear, whereas insufficient sparsity retains all the noise, compromising the clarity of the reconstruction. In contrast, the right column demonstrates that overly high fidelity preserves noise, similar to low sparsity, while overly low fidelity leads to structural deformation and uneven brightness distribution within the structures. Our optimization process targets a balance between these two parameters to achieve high-quality reconstructions. By tuning sparsity and fidelity, we aim to preserve essential details while minimizing noise and avoiding structural distortions, ensuring that the reconstructed results maintain both clarity and structural fidelity.

    4. Discussion

    In this study, we present a self-supervised deep learning framework designed for real-time deconvolution in OCT imaging. By addressing the dual challenges of speckle noise and PSF-induced blurring, we have significantly enhanced the resolution and contrast of OCT images, using only noisy B-scans as input. This approach leverages a combination of denoising pre-processing, blind PSF estimation, and sparse deconvolution, which together improve the quality of OCT images and mitigate the impact of imaging artifacts.

    The denoising pre-processing stage is crucial in reducing speckle noise, which is a well-known source of degradation in OCT images. By utilizing a self-supervised learning framework, we effectively trained the denoising network without the need for paired clean-noisy image datasets, which are often difficult to obtain. This method shows considerable promise, particularly in settings where paired data is unavailable or impractical. However, this approach may have limitations when noise characteristics are highly variable or not sufficiently represented in the training data. In such cases, the network’s performance could degrade, underscoring the need for more robust noise modeling and data augmentation techniques.

    The blind PSF estimation component of our framework demonstrated commendable robustness across a variety of imaging conditions. However, challenges may arise when imaging systems are subject to non-ideal conditions, such as equipment malfunctions or calibration errors, which could lead to deviations from the idealized PSF model. In our current implementation, we assume a consistent PSF within a given image category, modeled as a 2D Gaussian. While this assumption holds in many cases, it may not generalize to situations where the PSF is complex or dynamically changing. Future work could explore more sophisticated PSF models, including those capable of capturing spatially or temporally varying PSFs, to enhance the generalizability of the framework.

    The DNN architecture employed for the deconvolution process offers a lightweight solution that balances performance with computational efficiency. Our results demonstrate a significant improvement in image quality, with minimal computational overhead, enabling real-time image enhancement. However, the network’s performance may be influenced by the diversity of the images encountered. OCT images from novel devices, varying tissue types, or different acquisition protocols may not fully match the characteristics of the training data, potentially limiting the generalizability of our framework. To address this, expanding the diversity of training datasets and incorporating domain adaptation techniques could enhance the model’s robustness across a broader range of clinical and industrial settings.

    While our framework excels at mitigating speckle noise and PSF-induced blurring, it does not address other potential sources of image degradation, such as motion artifacts, shadowing effects, or other system-induced distortions. These factors may still impact the diagnostic quality of OCT images, particularly in dynamic or clinical settings. Future iterations of this framework could integrate multi-modal processing techniques or complementary strategies to account for additional sources of degradation and further improve image fidelity.

    In terms of real-time performance, our DNN architecture demonstrated millisecond-level inference times even for relatively large OCT image sizes, making it suitable for practical real-time applications. However, inference speed may be influenced by hardware limitations, particularly on lower-specification devices or in multi-task environments where the model must share computational resources. Optimizing the network’s efficiency, perhaps through model pruning or hardware-specific optimizations, could help alleviate this issue.

Furthermore, our method is built upon a simplified image formation model that accounts only for the interplay among image intensity, PSF, and noise, as formulated in Sec. 2.1. While this approach is effective for structural OCT imaging, it presents inherent limitations when applied to functional OCT imaging modalities. Functional OCT techniques, such as OCT angiography, polarization-sensitive OCT, and Doppler OCT, capture additional dimensions of the detected light field, including the phase, frequency, and polarization state. These factors are crucial for accurately representing tissue dynamics, microvascular flow, birefringence properties, and velocity measurements. To extend our deconvolution approach to these modalities, the imaging formation model must incorporate these additional optical properties, and the learning process should be adapted to handle multi-dimensional data representations.

    5. Conclusion

    In conclusion, the proposed self-supervised deep learning framework achieves real-time deconvolution for OCT imaging by enhancing resolution and contrast while mitigating the effects of speckle noise and PSF-induced blurring. The framework integrates denoising pre-processing, blind PSF estimation, and sparse deconvolution, requiring only noisy B-scans as input. It has been evaluated across various imaging conditions and demonstrated adaptability to different wavebands and sample types without the need for paired training data. The method effectively improves structural clarity in both ex vivo and in vivo scenarios and provides consistent performance across diverse datasets. These results validate the framework’s potential for practical implementation in OCT devices for enhanced diagnostic and inspection capabilities.

[15] J. M. Schmitt, Z. Liang, "Deconvolution and enhancement of optical coherence tomograms," Proc. SPIE 2981, 46 (1997).

[17] J. Högbom, "Aperture synthesis with a non-regular distribution of interferometer baselines," Astron. Astrophys. Suppl. Ser. 15, 417 (1974).

[27] A. Krull, T.-O. Buchholz, F. Jug, "Noise2Void: learning denoising from single noisy images," in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, 2129 (2019).

[35] O. Ronneberger, P. Fischer, T. Brox, "U-Net: convolutional networks for biomedical image segmentation," in Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 234 (2015).

[37] J. Lehtinen et al., "Noise2Noise: learning image restoration without clean data," in Proc. Int. Conf. on Machine Learning (2018).

    Paper Information

    Category: Research Article

    Received: Dec. 24, 2024

    Accepted: Feb. 11, 2025

    Published Online: Mar. 19, 2025

Author email: Jianlong Yang (jyangoptics@gmail.com)

DOI: 10.3788/AI.2025.10026
