1. Introduction
Three-dimensional (3D) surface imaging is an important technique in a variety of fields and has been realized by several methods. Classical interferometry[1,2] uses light interference to make precise surface measurements. Geometric moiré and holographic interferometry[3,4] provide insight into surface deformation by analyzing interference patterns. Digital holography[5–9] advances this approach by using digital techniques for enhanced analysis. Fringe projection profilometry and deflectometry[10–13] use structured light to reveal surface topographies. Digital image correlation and stereovision provide noncontact surface imaging. Finally, confocal microscopy[14–16] measures surfaces by detecting the peak response while scanning along the optical axis.
In this paper, we focus on confocal surface metrology, a method that uses z-axis scanning at regular intervals in confocal microscopy to acquire the depth image. Traditional data-processing techniques, including the center-of-mass algorithm, parabolic fitting, Gaussian fitting[17], and sinc fitting[18,19], are typically used to determine the point of maximum response along the z axis. These conventional approaches require densely sampled axial signals for high-quality data fitting. This means a significant amount of measurement time is required, which severely limits the application scenarios of confocal microscopy for 3D surface imaging, such as dynamic measurements.
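As a concrete illustration of these conventional approaches, the center-of-mass and Gaussian-fit estimators for the per-pixel axial peak can be sketched as follows. This is a minimal NumPy sketch of the generic algorithms, not the commercial implementation; the function names are ours.

```python
import numpy as np

def center_of_mass_depth(stack, z_step):
    """Depth as the intensity-weighted mean z position (center of mass).

    stack: (Nz, H, W) axial response sampled at regular z intervals.
    z_step: spacing between slices along z.
    """
    z = np.arange(stack.shape[0])[:, None, None] * z_step  # (Nz, 1, 1)
    return (z * stack).sum(axis=0) / (stack.sum(axis=0) + 1e-12)

def gaussian_fit_depth(stack, z_step):
    """Three-point parabolic fit to log-intensity around the per-pixel peak,
    which is exact when the axial response is Gaussian."""
    k = np.clip(stack.argmax(axis=0), 1, stack.shape[0] - 2)  # peak slice index
    gather = lambda i: np.take_along_axis(stack, i[None], axis=0)[0]
    lm, l0, lp = (np.log(gather(k + d) + 1e-12) for d in (-1, 0, 1))
    # vertex of the parabola through the three log-intensity samples
    offset = 0.5 * (lm - lp) / (lm - 2.0 * l0 + lp + 1e-12)
    return (k + offset) * z_step
```

Both estimators degrade quickly as the z sampling becomes coarser, which is the limitation motivating this work.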
With the advent of deep learning, supervised learning methods have been applied to 3D surface measurement[20], but these approaches introduce new challenges. First, neural networks trained in labeled scenarios struggle to adapt effectively to new unlabeled microscopy environments[21]; second, the reliance on labeled data naturally limits measurement capabilities to the range of the provided labels and restricts exploration beyond that range.
To address these challenges, we introduce a novel transformer architecture[22,23], SSL Depth, which incorporates physical information and utilizes self-supervised learning (SSL). Specifically, our method efficiently predicts the intensity, depth, and width of a Gaussian function directly from raw images captured by confocal microscopy. Here, the intensity corresponds to the scattered light intensity of the surface, the depth represents the morphology, and the width indicates the peak width of the confocal system in the z direction. In SSL Depth, the raw images themselves serve as the supervising signal, and our method learns by reconstructing these raw images. We perform z-axis scans on a set of surface roughness comparators using a commercial confocal microscope. Notably, these real-world experiments show that our method can increase the imaging speed by a factor of 16 and improve the resolving capability of the system. This significant result underscores the effectiveness of our SSL approach in advancing confocal microscopy-based 3D surface imaging.
2. Framework for SSL
SSL has the ability to extract physical parameters directly from raw microscopy images[24]. Central to our approach, as shown in Fig. 1, is a physics-informed transformer for analyzing confocal scanning images. Inspired by the spatiotemporal vision transformer (ViT)[25], our model incorporates a high-dimensional data embedding scheme that allows the application of pretrained weights from foundation image models[26], enhancing the capabilities of the network. Additionally, we integrate adapters[27] to accelerate the training process. Our network consists of three decoders for predicting intensity, depth, and width. These predictions are then synthesized by a Gaussian function to reconstruct the raw images.
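The Gaussian synthesis step at the end of the pipeline can be written compactly. Below is a minimal JAX sketch, under the assumption that the predicted width is the standard deviation of the axial response expressed in z-step units; the function name is ours.

```python
import jax.numpy as jnp

def reconstruct_stack(intensity, depth, width, num_steps):
    """Rebuild a (Nz, H, W) raw stack from the three predicted (H, W) maps:
    at each pixel, a Gaussian along z with peak `intensity`, center `depth`,
    and standard deviation `width` (all in z-step units)."""
    z = jnp.arange(num_steps)[:, None, None]  # (Nz, 1, 1)
    return intensity * jnp.exp(-0.5 * ((z - depth) / width) ** 2)
```

Comparing this synthetic stack against the raw measurement is what drives the self-supervised training.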

Figure 1.Network architecture. The encoder consists of a ViT and an adapter for accelerated training. Intensity, depth, and width are predicted by three decoders to finally reconstruct the raw data.
As shown in Fig. 2(a), the patch embedding module in our method is inspired by the handling of the temporal dimension in the spatiotemporal ViT. We use a 3D convolution operation to transform stacks of confocal microscopy images scanned along the z axis into patches. We then add lateral position encoding and z-directional position encoding to these patches and align them with the transformer architecture of image-based pretrained foundation models. Furthermore, we implement an adapter that integrates with the ViT. During the training process, we freeze the parameters of the ViT while updating the adapter and other components.
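A 3D convolution with stride equal to its kernel is equivalent to slicing the stack into non-overlapping patches and projecting each one linearly, so the embedding step can be sketched as follows. This is a minimal JAX sketch; the patch size, token grid, and the shapes of the two position encodings are illustrative assumptions.

```python
import jax.numpy as jnp

def patch_embed_3d(stack, w_proj, patch=(16, 16, 16)):
    """(Nz, H, W) stack -> (num_tokens, embed_dim) tokens via non-overlapping
    3D patches, equivalent to a 3D convolution with stride = kernel size."""
    pz, ph, pw = patch
    nz, h, w = stack.shape
    x = stack.reshape(nz // pz, pz, h // ph, ph, w // pw, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5).reshape(-1, pz * ph * pw)
    return x @ w_proj  # linear projection of each flattened patch

def add_positions(tokens, pos_lateral, pos_z, grid):
    """Add separable lateral and z-directional position encodings."""
    gz, gh, gw = grid
    t = tokens.reshape(gz, gh * gw, -1)
    t = t + pos_lateral[None, :, :] + pos_z[:, None, :]
    return t.reshape(gz * gh * gw, -1)
```

Keeping the lateral encoding separate from the z-directional one is what lets the tokens line up with the 2D position encodings of image-based pretrained models.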

Figure 2.Network modules design. (a) Convert the microscopic imaging stack into patches by a 3D convolution; (b) prediction head design for intensity, depth, and width.
To predict physical features, we decode each feature separately using dedicated decoders designed for physical information. These decoders consist of convolutional and upsampling operations that transform the features into images of the same size as the input. As shown in Fig. 2(b), we process three physical parameters: intensity, depth, and width. (1) Since the raw confocal microscopy images are normalized, we apply a sigmoid function to the intensity to ensure that it remains in the range 0–1. (2) Since the measurable morphology is within the range of the confocal scan, we apply a sigmoid function to the depth and then multiply it by the number of steps in the z direction. (3) The peak width in the z direction of the microscopy image is equal to the width of the point spread function of the confocal microscope and does not undergo drastic changes within the field of view. Therefore, we impose a constraint on the output width for gradual spatial variation, specifically convolving it with a Gaussian kernel.
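These three constraints can be sketched as follows in JAX. The softplus for width positivity and the smoothing-kernel size are our assumptions; the text specifies only the two sigmoid mappings and the Gaussian-kernel smoothing.

```python
import jax
import jax.numpy as jnp

def gaussian_kernel(size=9, sigma=2.0):
    """Small normalized 2D Gaussian kernel for smoothing the width map."""
    ax = jnp.arange(size) - size // 2
    g = jnp.exp(-0.5 * (ax / sigma) ** 2)
    k = jnp.outer(g, g)
    return k / k.sum()

def constrain_heads(raw_i, raw_d, raw_w, num_steps, kernel):
    """Map raw (H, W) decoder outputs to physically plausible maps."""
    intensity = jax.nn.sigmoid(raw_i)             # normalized data -> [0, 1]
    depth = jax.nn.sigmoid(raw_d) * num_steps     # confined to the z scan range
    width = jax.nn.softplus(raw_w)                # positivity (our assumption)
    width = jax.scipy.signal.convolve2d(width, kernel, mode="same")
    return intensity, depth, width
```

The smoothing acts as a soft prior that the point spread function varies only gradually across the field of view.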
In our self-supervised training process, the goal is to ensure that the reconstructed stack is highly similar to the raw confocal microscopy images, a task that involves measuring the similarity between 3D stacks. To achieve this, we extend the sum of the mean absolute difference (L1) and multiscale structural similarity (MS-SSIM) metrics, which have proven effective in similar 2D tasks[24], to their 3D versions. The key challenge here is to transform MS-SSIM for 3D application. We approach this by implementing a 3D SSIM and applying multiscale processing only in the lateral dimensions.
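One way to realize this 3D loss is sketched below in JAX. The window size, the number of scales, and the strided lateral downsampling (standing in for the average pooling of full MS-SSIM) are simplifying assumptions.

```python
import jax
import jax.numpy as jnp

def ssim3d(x, y, c1=0.01**2, c2=0.03**2, win=7):
    """Single-scale SSIM computed with 3D local statistics (uniform window)."""
    k = jnp.ones((win, win, win)) / win**3
    mu = lambda a: jax.scipy.signal.convolve(a, k, mode="valid")
    mx, my = mu(x), mu(y)
    vx, vy = mu(x * x) - mx**2, mu(y * y) - my**2
    cxy = mu(x * y) - mx * my
    s = ((2 * mx * my + c1) * (2 * cxy + c2)) / \
        ((mx**2 + my**2 + c1) * (vx + vy + c2))
    return s.mean()

def reconstruction_loss(recon, raw, levels=3):
    """Mean absolute difference plus multiscale 3D SSIM, where the
    downsampling acts only on the lateral dimensions (z is kept intact)."""
    l1 = jnp.abs(recon - raw).mean()
    ms = 0.0
    for _ in range(levels):
        ms = ms + 1.0 - ssim3d(recon, raw)
        recon, raw = recon[:, ::2, ::2], raw[:, ::2, ::2]  # lateral-only
    return l1 + ms / levels
```

Restricting the pyramid to the lateral axes avoids collapsing the already-coarse z sampling of the faster scan modes.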
3. Data Collection
For our measurements, we use a commercial confocal microscope, the Sensofar S-mart Optical Profilometer[28]. As shown in Fig. 3, the objects of our measurements are surface roughness comparators conforming to the ISO 2632/1-1975 standard. These comparators are tools used to determine surface roughness by comparative methods, visual estimation, or with the aid of a magnifying glass. Our focus is on measuring standard parts labeled “VERTICAL MILLING.” We perform three sets of measurements on the same fields of view but with different z-axis step sizes, representing the 1×, 4×, and 16× measurement modes. Specifically, we randomly select 40 areas for these measurements. In the lateral direction, each pixel represents 1.38 µm, providing a high-resolution image of the surface. For the z-direction steps, the 1× measurement mode uses a step size of 2 µm, the 4× measurement mode uses a larger step size of 8 µm, and the 16× measurement mode uses the largest step size of 32 µm.

Figure 3.Examples of raw measurement data. Scale bar is 50 µm.
4. Results
4.1. Training
In our study, the raw measurement data have pixel sizes of (151, 1028, 1232) for the 1× mode, (38, 1028, 1232) for the 4× mode, and (10, 1028, 1232) for the 16× mode. During training, we randomly crop the data: for the 1× mode, the crop size is (96, 224, 224); for the 4× mode, it is (25, 224, 224); and for the 16× mode, it is (6, 224, 224). We perform 10,000 random crops per epoch across all 40 regions. To augment the data set, we add Gaussian noise with a random scale of 0–1 to the input data. In terms of data processing, the patch embedding for the 1× mode data creates small patches of size (16, 16, 16), while for the 4× mode the patch size is (5, 16, 16), and for the 16× mode it is (1, 16, 16). Prior to training, the encoder in our network is loaded with weights from the vision-based pretrained foundation model CLIP[26]. During training, these weights are frozen while the other parts of the network are updated. We use the AdamW optimizer[29]; the training process starts with a cosine warm-up for the first 30 epochs, followed by a constant learning rate of 3 × 10−5 for the next 60 epochs.
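The random-crop-plus-noise augmentation can be sketched as follows in JAX. Drawing the noise scale uniformly from [0, 1) is our reading of "a random scale of 0–1"; the function name is ours.

```python
import jax
import jax.numpy as jnp

def augment(key, stack, crop=(96, 224, 224)):
    """Random 3D crop of `stack` followed by additive Gaussian noise whose
    scale is drawn uniformly from [0, 1)."""
    kpos, kscale, knoise = jax.random.split(key, 3)
    starts = [jax.random.randint(k, (), 0, s - c + 1)
              for k, s, c in zip(jax.random.split(kpos, 3), stack.shape, crop)]
    patch = jax.lax.dynamic_slice(stack, starts, crop)
    scale = jax.random.uniform(kscale)
    return patch + scale * jax.random.normal(knoise, crop)
```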
We develop our code using JAX[30], a high-performance numerical computing library that provides an expressive array programming model and is optimized for both GPU and CPU performance. Training is performed on a single NVIDIA Tesla A100 40 GB GPU. For the 1× mode, we set the batch size to 1, with each epoch consisting of 10,000 steps and taking approximately 31 min. In contrast, the 4× mode has a batch size of 2, with each epoch containing 5000 steps and taking about 17 min.
4.2. Evaluation
After training, we save the weights for the inference process. We start by performing a central crop on the raw data in the z direction, with a size of 96 for the 1× mode, 25 for the 4× mode, and 6 for the 16× mode, to fit our trained model. In the lateral direction, we perform regular 224 × 224 pixel crops with an overlap of 32 pixels and then stitch the results back to the original lateral dimensions of 1028 pixel × 1232 pixel. The inference time for a stack is about 10 s in the 1× mode and about 6 s in the 4× mode.
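The overlapped tiling and stitching can be sketched as follows, where predictions are averaged in the overlap regions (our choice; the text does not specify how overlaps are merged) and `predict` is a hypothetical stand-in for the trained model.

```python
import numpy as np

def stitch_predict(predict, image, tile=224, overlap=32):
    """Run `predict` (a callable mapping a (tile, tile) crop to a
    (tile, tile) map) over overlapping lateral tiles and average the
    predictions wherever tiles overlap."""
    h, w = image.shape
    step = tile - overlap
    # tile origins, with an extra tile flush against each far edge
    ys = sorted(set(list(range(0, h - tile + 1, step)) + [h - tile]))
    xs = sorted(set(list(range(0, w - tile + 1, step)) + [w - tile]))
    out = np.zeros((h, w))
    weight = np.zeros((h, w))
    for y0 in ys:
        for x0 in xs:
            out[y0:y0 + tile, x0:x0 + tile] += \
                predict(image[y0:y0 + tile, x0:x0 + tile])
            weight[y0:y0 + tile, x0:x0 + tile] += 1.0
    return out / weight
```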
4.3. Comparable measurement capability at 16× speed
We compare the results of the traditional commercial algorithm on 1× and 4× mode data with the results of SSL Depth on 1×, 4×, and 16× mode data (note that the commercial algorithm does not support the 16× mode). The depth image produced by the commercial microscope in its standard 1× operating mode is sufficiently clear, so we treat it as the ground truth. Our results show that SSL Depth is superior. For this comparison, we extract the region shown in Fig. 4(a). Figures 4(b) and 4(c) correspond to the 1× mode, i.e., a z step of 2 µm; both the traditional commercial algorithm and SSL Depth detect the traces at the positions of the cross section. Next, Figs. 4(d) and 4(e) show the 4× mode with a z step of 8 µm; here the traditional commercial algorithm, based on function fitting, fails to detect the traces at the positions of the cross section, while SSL Depth still distinguishes them. Then, Fig. 4(f) shows the 16× mode with a z step of 32 µm; the SSL Depth result still distinguishes the traces. Finally, Fig. 4(g) shows how the mean absolute error along the cross-sectional line varies with speed, taking the traditional commercial 1× result as the ground truth; both the 4× and 16× errors of SSL Depth are smaller than the 4× error of the commercial microscope.

Figure 4.Comparative Experiment 1. (a) Confocal microscope intensity image, with the yellow area indicating the field of view for subsequent analysis; (b) depth image obtained from 1× mode data using the traditional commercial algorithm; (c) depth image obtained from 1× mode data using SSL Depth; (d) depth image obtained from 4× mode data using the traditional commercial algorithm; (e) depth image obtained from 4× mode data using SSL Depth; (f) depth image obtained from 16× mode data using SSL Depth; (g) mean absolute error (L1) corresponding to the cross-sectional line, assuming that the traditional commercial 1× result is true. The red dashed line shows the error at 4× speed for the commercial microscope. Scale bar is 50 µm.
4.4. Superior measurement with the same data
We further find that SSL Depth gives better results under the same standard measurement conditions. As shown in Fig. 5(a), we select a region and collect data in the 1× mode (z step of 2 µm). We then process these data using both the traditional commercial algorithm and SSL Depth. We find that at the cross-sectional position, the traditional commercial algorithm fails to detect the scratches, while SSL Depth does. This finding is significant because it indicates that our approach extends the capability of the current system.

Figure 5.Comparative Experiment 2. (a) Confocal microscope intensity image, with the yellow area indicating the field of view for subsequent analysis; (b) depth image obtained from 1× mode data using the traditional commercial algorithm; (c) depth image obtained from 1× mode data using SSL Depth. Scale bar is 50 µm.
5. Discussion
Our SSL approach does not rely on additional prior knowledge, such as labeled data or extra hand-crafted rules, nor does it require the collection of more data for the same measurement target. Instead, our method makes more comprehensive use of the measurement data than standard fitting methods do. Specifically, we use SSL to learn effective representations of the data distribution from the 40 raw measurement stacks. This process essentially compresses the raw data and acts as an implicit constraint on the solution. Traditional methods, which typically involve a combination of function fitting, filtering, and image processing, only allow direct processing at the pixel level and do not facilitate feature learning. Our method is more effective because it goes beyond mere pixel manipulation to learn and exploit the deeper features of the data.
One disadvantage of our method is the memory consumption and the extended training time (about 2 days) during the training stage. This is because our network directly processes 3D data, resulting in a higher number of patches compared to typical visual tasks. This is an area for future improvement. However, the inference stage is remarkably fast, processing a complete data stack in less than 10 s. In addition, our model only needs to be trained once for samples of the same type. In the future, we plan to explore training on a larger data set to achieve sufficient generalizability. This would allow the model to be directly applicable to entirely new types of samples, facilitating easier deployment and broader adoption.
6. Conclusion
In this paper, we present a self-supervised approach, SSL Depth, for processing 3D surface measurements using confocal microscopy. We have developed a novel architecture based on the ViT, which incorporates physical information relevant to 3D surface measurements. This method is trained directly on raw confocal microscopy measurement stacks without the need for labeled data. Our approach demonstrates superiority over the traditional commercial algorithm in confocal microscopy measurements of surface roughness comparators. Not only can our method achieve comparable measurement capability at 16× speed, but it also produces better results than the traditional commercial algorithm when using the usual amount of measurement data. This indicates that our approach not only speeds up confocal microscopy measurements but also expands the current limits of confocal microscopy measurement capability, offering improved accuracy and efficiency.