Small-angle X-ray scattering (SAXS) is a promising metrology technology for complex nanostructures in semiconductor manufacturing. However, parameter reconstruction based on SAXS measurement often faces challenges in achieving high precision and repeatability due to the increasing complexity of structures and the demands for precise measurement. To address these challenges, a correlation learning-based method is proposed to enhance the accuracy and reduce the uncertainty of the profile reconstruction in SAXS measurement. This method leverages the long short-term memory (LSTM) mechanism to capture and learn inherent parameter correlation effectively. The precision and reliability of the proposed method are demonstrated through the simulations of synthetic Si gratings. Our method exhibits remarkable measurement accuracy with an improvement of at least 13.9%, and the measurement repeatability is nearly 1.4 times higher compared to the previous learning-based methods. We expect that our approach will provide a novel solution for SAXS measurement, enabling accurate and reliable profile reconstruction of nanostructures.
Small-angle X-ray scattering (SAXS), as a non-contact, non-destructive, and high-resolution measurement scheme[1,2], has gradually become one of the most important techniques for measuring diverse nanostructures, especially in material science and biology[3–5]. It is also applied to inline critical dimension metrology in semiconductor manufacturing to ensure product quality and optimize the fabrication process[6–8]. However, SAXS measurement, a typical ill-posed inverse problem[9], is not straightforward, as it involves a complicated process of estimating structural parameters from the scattering patterns[10,11]. Low signal-to-noise ratios often lead to inaccurate or even erroneous reconstruction results and high uncertainty[12]. These drawbacks become more severe with increasing structural complexity[13] and seriously hinder correct characterization of the nanostructures and the fabrication process. It is therefore crucial to employ suitable methods to enhance measurement accuracy and reduce uncertainty[12].
Incorporating certain priors into the reconstruction process can improve measurement precision in various semiconductor measurement techniques[11,12,14,15]. This strategy, such as providing a fin-shaped mirrored prior[16,17], is also beneficial for improving reconstruction accuracy and reducing uncertainty in the SAXS inverse problem. However, conventional iteration-based methods[3,7,11,18,19], despite incorporating certain prior knowledge, still suffer from inherent inefficiency because they must repeatedly call a predefined forward physical model to calculate scattering patterns and assess many candidate solutions[13,20]. Deep learning-based methods have gained significant attention in semiconductor metrology[21–24] due to their exceptional capabilities in rapid inference and robust feature extraction[25]. Recently, researchers have extended their application to parameter reconstruction of nanostructured samples in SAXS measurement, including the tilt of memory hole structures, the width and height of simple trapezoidal gratings, and the profile of gratings with complex cross sections[22,26,27]. These approaches can be regarded as model-dependent neural networks[28], providing prior knowledge from the perspective of the physical forward model. These successes provide a valuable insight: by integrating reliable prior knowledge into the reconstruction algorithm, or by guiding the algorithm to learn such prior knowledge, we can greatly enhance the accuracy and reliability of the reconstruction process[29]. In semiconductor manufacturing, the etching process often leads to the aspect ratio-dependent etching (ARDE) effect[30] or significant tilt variation near the edge of the wafer[31]. These phenomena also result in strong nonlinear inter-layer correlations within the nanostructure[32]. Hence, guiding the reconstruction process to learn this correlation is a promising enhancement scheme.
In this work, we propose a correlation learning-based method for profile reconstruction in SAXS measurement. Utilizing the long short-term memory (LSTM) mechanism[33], our approach can actively learn the inter-layer correlation of profile structural parameters. Based on this characteristic, we design and evaluate a novel network model to reconstruct the profile parameters of synthetic complex Si gratings. The analysis results demonstrate that this learning-based approach significantly enhances accuracy and reduces uncertainty in nanostructure profile measurement.
2. Theory and Method
2.1. X-ray scattering principle
SAXS measurement can effectively extract structural parameters of nanostructured samples; its schematic diagram is shown in Fig. 1. Figure 1(a) illustrates the fundamental process of small-angle X-ray scattering. The monochromatic X-ray beam impinges upon the sample at an angle of incidence (AOI), and the scattering pattern within a small-angle region is recorded by a detector. After exposures at multiple AOIs, the recorded scattering patterns are integrated into the scattering map of the sample[16], as shown in Figs. 1(b) and 1(c). According to the kinematical approximation, if the relative electron density ρ(r) of a sample is known, its Fourier transform is

F(q) = ∫ ρ(r) exp(iq · r) dr,   (1)

where r is the spatial vector in real space, q is the transferred wave vector in reciprocal space, and the scattering vector q is the difference between the scattered wave vector k_f and the incident wave vector k_i, i.e., q = k_f − k_i. According to the first-order Born approximation, the scattering map is expressed as I(q) ∝ |F(q)|². Thus, considering the periodicity of the sample, surface roughness, and noise, the scattering map is further expressed as[1,16]

I(q) = S(q) |F(q)|² exp(−σ_r² q²) + N(q),   (2)

where S(q) is the structure factor, exp(−σ_r² q²) is the Debye–Waller factor, σ_r is the root mean square of the surface roughness amplitude, and N(q) is the noise, including the background noise and Poisson noise.
Figure 1.(a) Schematic diagram of the SAXS measurement, (b) scattering patterns of multiple AOIs, and (c) scattering map of the sample.
Figure 2(a) shows the top and side views of a complex profile grating sample. The sample is characterized by an inter-line distance p, which is called the grating pitch. Hence, the structure factor can be expressed as S(q_x) ∝ Σ_m δ(q_x − 2πm/p), where m is the diffraction order. In the side view, the sample can be modeled as a stack of rectangles. Its profile is described by a structural profile parameter set t = {CD_n, CP_n} (n = 1, …, N), where N is the structure's layer number and CD_n and CP_n are the critical dimension and center position of the nth layer, respectively[34].
Figure 2. (a) Top and side views of a complex profile grating sample. The profile is described by a structural parameter set t = {CD_n, CP_n} (n = 1, …, N). (b) Various types of phenomena, including tilting, twisting, bowing, tapering, and potential combinations thereof. (c) Kendall τ correlation coefficients for the CDs and CPs. (d), (e) Correlation ellipses for the CDs and CPs, which demonstrate that the parameters of two adjacent layers are more correlated than those of the distant layers.
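To make the stack-of-rectangles model concrete, the sketch below evaluates |F(q)|² for a single grating line built from N rectangular layers, including the Debye–Waller damping described above. The layer count (40), layer thickness (75 nm), and roughness amplitude (2 nm) follow the text; the function names and the sampled q values are ours, and the code is a minimal kinematical sketch, not the paper's forward solver.

```python
import cmath
import math

def rect_form_factor(qx, qz, width, center, z_bottom, height):
    """Kinematical form factor of one rectangular layer: lateral width `width`
    (the CD) centered at `center` (the CP), extending from z_bottom upward."""
    def sinc(u):
        return 1.0 if u == 0 else math.sin(u) / u
    z_center = z_bottom + height / 2.0
    phase = cmath.exp(1j * (qx * center + qz * z_center))
    return width * height * sinc(qx * width / 2.0) * sinc(qz * height / 2.0) * phase

def grating_intensity(qx, qz, cds, cps, layer_height, sigma_r=2e-9):
    """|F(q)|^2 for a stack of rectangles, damped by the Debye-Waller factor."""
    F = sum(
        rect_form_factor(qx, qz, cd, cp, n * layer_height, layer_height)
        for n, (cd, cp) in enumerate(zip(cds, cps))
    )
    q2 = qx * qx + qz * qz
    return abs(F) ** 2 * math.exp(-sigma_r ** 2 * q2)

# A vertical, untilted 40-layer stack: all CDs equal, all CPs zero,
# so the intensity is symmetric in qx.
cds = [75e-9] * 40
cps = [0.0] * 40
I_plus = grating_intensity(2e8, 1e8, cds, cps, 75e-9)
I_minus = grating_intensity(-2e8, 1e8, cds, cps, 75e-9)
```

For a tilted stack (nonzero CPs varying with height), the qx symmetry is broken, which is what makes the scattering map sensitive to the profile.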
In an ideal condition, the CDs of all layers are equal, and the CPs of all layers are zero. However, in fabrication processes (e.g., etching or multi-patterning)[35], nanostructures may exhibit various phenomena[7], including tilting, twisting, bowing, tapering, and potential combinations thereof, as depicted in Fig. 2(b). To investigate the impact of these phenomena on the profile parameters, we generated 100,000 samples. The samples consist of basic structures exhibiting the above-mentioned profile phenomena[7], randomly stacked to form different grating samples. Each grating sample is a stack of 40 rectangles, each 75 nm thick. Through interpolation, the rectangle size of each layer varies smoothly with the height of a sample[34]. The mean value of CD at each height is about 75 nm, and the mean value of CP is 0 nm; both vary within bounded ranges. The correlation ellipses for the CDs and CPs are shown in Figs. 2(d) and 2(e), which illustrate that the parameters of two adjacent layers are more correlated than those of two distant layers, and this correlation may exhibit nonlinear characteristics. We calculate the Kendall τ correlation coefficient to quantify this relationship in Fig. 2(c). The farther apart two layers are, the lower the correlation coefficient between their corresponding parameters; the correlation between the parameters of two distant layers can even become a nonlinear negative correlation. Similar characteristics are observed with Spearman correlation analysis. Utilizing this inherent inter-layer correlation offers a potential way to improve SAXS measurement accuracy and reduce uncertainty.
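The adjacent-versus-distant correlation can be reproduced on toy data. The sketch below generates smoothly varying profiles as a random walk along the height (an assumption for illustration, not the paper's sample generator) and computes the Kendall τ-a statistic (no tie correction) between layer pairs; the helper names are ours.

```python
import random

def kendall_tau(x, y):
    """Kendall tau-a: normalized count of concordant minus discordant pairs."""
    def sign(u):
        return (u > 0) - (u < 0)
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))

rng = random.Random(0)
n_layers, n_samples = 40, 300
profiles = []
for _ in range(n_samples):
    cd, profile = 75.0, []
    for _ in range(n_layers):
        cd += rng.gauss(0.0, 1.0)  # smooth random-walk variation along the height
        profile.append(cd)
    profiles.append(profile)

def layer(k):
    """Values of layer k across all samples."""
    return [p[k] for p in profiles]

tau_adjacent = kendall_tau(layer(0), layer(1))   # neighboring layers
tau_distant = kendall_tau(layer(0), layer(39))   # bottom vs top layer
```

On this toy model, τ between adjacent layers is markedly larger than between the bottom and top layers, mirroring the trend in Fig. 2(c).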
2.2. Correlation learning-based method
We propose a novel method that learns the inter-layer correlation to enhance SAXS measurement accuracy and reduce uncertainty. For the SAXS inverse problem of reconstructing structural parameters from the scattering map, the scattering map (input) can be regarded as an angular series, while the structural parameters (output) can be seen as a layered series. The LSTM mechanism can effectively convey and express information across a series and readily learn the inherent correlations within it[36]; hence, using the LSTM mechanism to solve this inverse problem has distinct advantages. Specifically, the LSTM achieves this through gate structures, which enable selective transmission of information: a Sigmoid function layer produces weights that are applied by point-by-point multiplication, regulating the flow of information between neural layers. There are three types of gate structures in an LSTM cell: the forget gate, the input gate, and the output gate.
The forget gate determines which information from the last state needs to be forgotten. It takes the hidden state h_{t−1} from the last state and the input information x_t of the current state, and then outputs a weight f_t for the last cell state C_{t−1}. The Sigmoid function σ limits f_t between 0 and 1, representing the extent to which the hidden state h_{t−1} is forgotten. In this way, the LSTM can retain important information for a long period, and the memory can be dynamically adjusted with the current input information x_t. The weight f_t, as well as its forgetting effect on C_{t−1}, can be written as

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),   f_t ⊙ C_{t−1},   (3)

where W_f and b_f are the weights and biases of the forget gate, respectively, and ⊙ is the Hadamard product.
The input gate determines how many components of the current input information x_t to add into the cell. The tanh function extracts the valid candidate information C̃_t from x_t. Using the Sigmoid function, the input gate i_t controls the extent of the valid information; in other words, the input gate filters the extracted valid information and scores each component from 0 to 1. The higher the score, the more of that component enters the current cell state C_t. The extent i_t and the candidate information C̃_t are given by

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),   C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C),   (4)

where W_i, b_i, W_C, and b_C are the weights and biases of the input gate, respectively.
The output gate is the neural gate that calculates the current state's output. Like the forget and input gates, the output gate uses the Sigmoid function, which helps to further extract information from the current cell state C_t. The current cell state C_t is also mapped to the interval (−1, 1) as the current hidden state h_t by the tanh function. The current cell state C_t, output o_t, and hidden state h_t are expressed as

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,   o_t = σ(W_o · [h_{t−1}, x_t] + b_o),   h_t = o_t ⊙ tanh(C_t),   (5)

where W_o and b_o are the weight and bias of the output gate, respectively.
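The three gates above compose into a single time step. The sketch below implements one scalar LSTM step following Eqs. (3)–(5); the toy weight values and the dictionary layout are ours, chosen only to make the step executable.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step with scalar input and state (Eqs. (3)-(5)).

    W and b hold the weights/biases of the forget gate ('f'), input gate ('i'),
    candidate ('c'), and output gate ('o'); each gate sees [h_prev, x_t].
    """
    f_t = sigmoid(W['f'][0] * h_prev + W['f'][1] * x_t + b['f'])        # forget gate
    i_t = sigmoid(W['i'][0] * h_prev + W['i'][1] * x_t + b['i'])        # input gate
    c_tilde = math.tanh(W['c'][0] * h_prev + W['c'][1] * x_t + b['c'])  # candidate info
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    o_t = sigmoid(W['o'][0] * h_prev + W['o'][1] * x_t + b['o'])        # output gate
    h_t = o_t * math.tanh(c_t)                                          # new hidden state
    return h_t, c_t

# Toy fixed weights; a real cell learns these by backpropagation.
W = {k: (0.5, 0.5) for k in 'fico'}
b = {k: 0.0 for k in 'fico'}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # a short input series
    h, c = lstm_step(x, h, c, W, b)
```

Because o_t ∈ (0, 1) and tanh(C_t) ∈ (−1, 1), the hidden state h_t is always bounded in (−1, 1), while the cell state C_t can accumulate information over many steps.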
The overall framework (named CLNet), shown in Fig. 3, has a residual module and a correlation learning module. The residual module first extracts features from the scattering map, and the correlation learning module then reconstructs the profile parameters from the residual module's output. The residual module is based on the ResNet34 backbone and has a separate convolution layer at the front end, followed by four large blocks comprising three, four, six, and three groups of small blocks, respectively. Each small block contains two convolutional layers. An adaptive average pooling layer then follows and finally extracts an angular feature series. The correlation learning module comprises two distinct channels dedicated to capturing the inherent parameter correlations of the CDs and CPs, respectively; the channels directly reconstruct the CDs and CPs from the series, as shown in Fig. 3. In each channel, the series is processed by an LSTM cell over 40 time steps. The operations of the LSTM cell are illustrated in the local enlargement of Fig. 3, and the mechanism is described by Eqs. (3)–(5). The bidirectional LSTM structure, which involves two sets of LSTM cells processing the series in the forward and backward directions, is employed for accurate identification and reconstruction[37]. The bidirectional LSTM produces an output at each time step, and all outputs form a new layered series that is fed into a fully connected layer with a Sigmoid function at the end of each channel. After linearly scaling the outputs of both channels to their respective reconstruction ranges, the reconstructed CDs and CPs are finally obtained. To suppress overfitting caused by the large number of model weights, batch and layer normalization are performed after each convolution layer and LSTM layer, respectively[38,39].
Figure 3. Framework architecture of the correlation learning-based method. The framework, named CLNet, consists of a residual module and a correlation learning module. The residual module is a modified ResNet34, which has an adaptive average pooling layer at the beginning and another one at the end. The correlation learning module is a Bi-LSTM structure, which can directly reconstruct the profile parameters.
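The Bi-LSTM channel described above (forward and backward passes, per-step concatenation, fully connected layer with a Sigmoid, linear scaling to the reconstruction range) can be sketched end to end. The scalar cell, the fixed weights, and the range [65, 85] nm are toy assumptions for illustration only; the real CLNet channel uses trained vector-valued LSTM cells.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def lstm_pass(series, w):
    """Minimal scalar LSTM pass; returns the hidden state at each time step."""
    h = c = 0.0
    outputs = []
    for x in series:
        f = sigmoid(w * (h + x))                  # forget gate
        i = sigmoid(w * (h + x))                  # input gate
        g = math.tanh(w * (h + x))                # candidate information
        c = f * c + i * g                         # cell state update
        h = sigmoid(w * (h + x)) * math.tanh(c)   # output gate -> hidden state
        outputs.append(h)
    return outputs

def bilstm_head(series, w_fwd, w_bwd, cd_min, cd_max):
    """Bi-LSTM channel: concatenate forward/backward hidden states per step,
    apply a toy fully connected layer with a Sigmoid, then scale linearly to
    the reconstruction range [cd_min, cd_max]."""
    fwd = lstm_pass(series, w_fwd)
    bwd = list(reversed(lstm_pass(list(reversed(series)), w_bwd)))
    preds = []
    for hf, hb in zip(fwd, bwd):
        y = sigmoid(1.0 * hf + 1.0 * hb)          # FC weights fixed at 1 for the sketch
        preds.append(cd_min + y * (cd_max - cd_min))
    return preds

features = [0.1 * k for k in range(40)]           # stands in for the 40-step feature series
cds = bilstm_head(features, 0.5, 0.5, 65.0, 85.0)
```

One prediction is produced per layer (40 in total), each guaranteed to lie within the reconstruction range by the Sigmoid plus linear scaling, which is the same mechanism CLNet uses at the end of each channel.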
2.3. Dataset and training details
We generate the scattering maps of the grating samples described in Sec. 2.1. The sample material is Si. The grating pitch is 170 nm, and σ_r is set to 2 nm for the surface roughness. The X-ray photon energy is 24 keV. The exposures of each sample are taken at 100 AOIs, uniformly spanning from −20° to 20°. The data recorded by the 512 pixel × 512 pixel detector only contain information in the middle few rows of pixels[27]. Hence, we take 4 pixel × 512 pixel from each AOI's scattering pattern. These data from multiple AOIs are concatenated into 400 pixel × 512 pixel. Poisson noise is added to the scattering data, where the noise level is related to the photon number. The photon number is 10 times the number of pixels in the scattering data, set according to actual experimental conditions[40]. The background noise is set to a low relative intensity[10]. It should be noted that our method can theoretically be applied to other X-ray photon energies.
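The Poisson-noise step above (a fixed photon budget of 10 photons per pixel) can be sketched as follows. The decaying toy intensity profile and the function names are our assumptions; Knuth's sampler is used only because it needs no external library and is adequate for the moderate counts involved.

```python
import math
import random

def poisson_knuth(lam, rng):
    """Knuth's Poisson sampler (fine for the moderate per-pixel counts here)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def add_counting_noise(intensity_map, photons_per_pixel, rng):
    """Scale a noise-free map so the total photon budget equals
    photons_per_pixel * n_pixels, then draw Poisson counts per pixel."""
    n_pixels = len(intensity_map)
    total = sum(intensity_map)
    budget = photons_per_pixel * n_pixels
    return [poisson_knuth(v / total * budget, rng) for v in intensity_map]

rng = random.Random(42)
clean = [math.exp(-0.05 * k) for k in range(400)]  # toy decaying scattering profile
noisy = add_counting_noise(clean, 10, rng)         # 10 photons per pixel on average
```

Because the noise is counting noise, the relative fluctuation grows as the intensity drops, which is why the high-q tails of the scattering map are the least reliable.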
The dataset includes the scattering maps and the corresponding structural parameters of the samples. The train, validation, and test sets are divided in an 8:1:1 ratio. The model is trained on a workstation with two NVIDIA Quadro RTX 8000 GPUs and implemented in PyTorch. The Adam optimizer is used to update the network weights. The loss function is the L1 loss with the physical symmetry prior[27]. The batch size is 16. These hyperparameters are chosen to ensure that the CLNet model can sufficiently learn and express the underlying information in the dataset. Training runs for 200 epochs, taking about 25 h in total.
3. Results and Discussion
Figure 4 shows the reconstructed results and the corresponding fitting scattering data of one randomly selected instance in the test set. The reconstructed profile, CDs, and CPs are shown in Figs. 4(a)–4(c), respectively. Our method can precisely estimate the structural parameters. Figure 4(d) shows the measured scattering map of this sample recorded by the detector. To further evaluate the accuracy of our method, we recalculated the fitting data and chose six q-slices to compare with the map in Fig. 4(d). Figures 4(e)–4(j) show the measured data, fitting data, and corresponding residuals at the six q-slices indicated in Fig. 4(d). The fitting scattering data at these slices exhibit excellent consistency with the measured data. The proposed method achieves high fitting accuracy in the low and medium q regions, where the measured scattering intensity is significant and the signal-to-noise ratio is high. In the high q regions, however, the noise impact becomes more significant, leading to increased residuals and reduced accuracy.
Figure 4.Reconstructed results and corresponding fitting scattering data of a randomly selected sample. (a)–(c) Reconstructed results for profiles CD and CP, respectively. (d) Measured qx-qz scattering map obtained from the scattering patterns at 100 AOIs, uniformly spanning from −20° to 20°. (e)–(j) Measured data, fitting data, and the corresponding residuals at the q-slices, which are indicated by the dashed green lines in (d). In the high q regions, the residual increases due to the low signal-to-noise ratio of the measured data.
Reconstruction can be achieved for diverse samples, including those exhibiting tilting, bowing, and tapering, as illustrated in Fig. 5. Figures 5(a) and 5(b) show the reconstructed CDs and CPs of three randomly selected samples in the test set, respectively. In addition to CLNet, the results of well-trained ResNet34[41], ResNet50[27,41], AlexNet[22,42], EffNet-b2[43], and TinyViT-5m[44] are also shown in Fig. 5 for ablation and comparison experiments. The hyperparameters used by these models are consistent with those described in Sec. 2.3. Our method performs best in this comparison. To quantitatively analyze and compare the performance of these six models, we use three key metrics to evaluate 10,000 instances in the test set: mean absolute error (MAE), R-squared value (R²), and slope,

MAE = (1/M) Σ_i |t_i − t̂_i|,   R² = 1 − Σ_i (t_i − t̂_i)² / Σ_i (t_i − t̄)²,

where t_i is the ground truth, t̂_i is the reconstructed parameter, t̄ is the mean of the ground truth, and M is the total number of samples. The slope is that of the linear fit between the reconstructed parameters and the ground truth.
Figure 5.Reconstructed results of three randomly selected samples in the test set. Each grating consists of 40 layers. (a), (b) Corresponding reconstructed CDs and CPs, respectively. Besides our method, the plots also include results from five other models, namely, ResNet34, ResNet50, AlexNet, EffNet-b2, and TinyVit-5m. These are used for ablation and comparative experiments.
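The three metrics above can be computed directly; a minimal pure-Python sketch is given below. The function names and the toy values are ours; the slope is implemented as the least-squares regression slope of predicted versus true values, which is the usual reading in metrology correlation plots.

```python
def mae(truth, pred):
    """Mean absolute error between ground truth and reconstruction."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def r_squared(truth, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot

def slope(truth, pred):
    """Least-squares slope of predicted vs true values (1.0 = no systematic bias)."""
    n = len(truth)
    mt = sum(truth) / n
    mp = sum(pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred))
    var = sum((t - mt) ** 2 for t in truth)
    return cov / var

# Hypothetical CDs (nm) and their reconstructions.
truth = [70.0, 72.5, 75.0, 77.5, 80.0]
pred = [70.1, 72.4, 75.2, 77.4, 80.1]
```

An R² and a slope both close to 1 together indicate that the reconstructions track the ground truth without a systematic over- or under-estimation.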
For clarity, it is essential to distinguish three types of MAE since each grating sample has 40 layers: per-layer MAE, per-sample MAE, and overall MAE[27]. The per-layer MAE refers to the MAE of each layer over all samples, and the per-sample MAE indicates the MAE of each sample over all 40 layers. The MAE of the entire test set is denoted as the overall MAE. The per-layer MAE results for CD and CP are shown in Figs. 6(a) and 6(c), respectively. Our method has the minimum error at each layer. The results of CLNet for the edge layers demonstrate significant improvement compared to AlexNet and EffNet-b2, although the margin over ResNet34, ResNet50, and TinyViT-5m is smaller there. Figures 6(b) and 6(d) illustrate the cumulative error distribution of the per-sample MAE for CD and CP, respectively. Compared with the other models, the cumulative curves of CLNet reach 100% most quickly, indicating that our method achieves high accuracy in reconstructing structural parameters. The CLNet method has an overall MAE for the CDs of 0.110 nm, while ResNet34, ResNet50, AlexNet, EffNet-b2, and TinyViT-5m have 0.132, 0.142, 0.510, 0.456, and 0.156 nm, respectively. The overall MAE for the CPs of our method is 0.031 nm, while the values of ResNet34, ResNet50, AlexNet, EffNet-b2, and TinyViT-5m are 0.036, 0.041, 0.098, 0.087, and 0.045 nm, respectively. These results indicate that our method achieves a 20.0% improvement in accuracy for reconstructing the CDs and a 13.9% improvement for reconstructing the CPs over the second-best results, achieved by ResNet34. The maximum per-sample MAE of CLNet is 1.982 nm for CD and 0.631 nm for CP, whereas the next-best results are 2.264 nm for CD (TinyViT-5m) and 0.764 nm for CP (ResNet34).
Figure 6.(a), (c) Per-layer MAE for CD and CP in the test set, respectively. (b), (d) Cumulative error distributions of per-sample MAE for CD and CP, respectively. Per-layer MAE represents the mean absolute errors for each layer of all samples, and per-sample MAE represents the mean absolute errors for individual samples.
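The three MAE variants differ only in the axis over which errors are averaged. The sketch below makes the distinction explicit on a tiny hypothetical 2-sample × 3-layer array (values and function names are ours, for illustration only).

```python
def per_layer_mae(truth, pred):
    """MAE of each layer across all samples -> one value per layer."""
    n_samples, n_layers = len(truth), len(truth[0])
    return [sum(abs(truth[s][l] - pred[s][l]) for s in range(n_samples)) / n_samples
            for l in range(n_layers)]

def per_sample_mae(truth, pred):
    """MAE of each sample across its layers -> one value per sample."""
    n_layers = len(truth[0])
    return [sum(abs(t - p) for t, p in zip(ts, ps)) / n_layers
            for ts, ps in zip(truth, pred)]

def overall_mae(truth, pred):
    """MAE over the whole set (all samples, all layers)."""
    flat = [(t, p) for ts, ps in zip(truth, pred) for t, p in zip(ts, ps)]
    return sum(abs(t - p) for t, p in flat) / len(flat)

# Two hypothetical samples x three layers (nm).
truth = [[75.0, 74.0, 73.0], [75.0, 76.0, 77.0]]
pred = [[75.1, 74.2, 73.0], [74.9, 76.0, 77.3]]
pl = per_layer_mae(truth, pred)
ps = per_sample_mae(truth, pred)
```

With equal layer counts, the overall MAE equals the mean of the per-sample MAEs; the per-layer MAE instead reveals whether errors concentrate at particular heights, such as the edge layers discussed above.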
It should be noted that in Fig. 6(d), the cumulative error distribution curve rises quickly within 0.01 nm. In the ideal sample situation, where the ground truth CPs of each layer are all zero, CLNet's reconstructed results for CP are almost perfect, as shown in Fig. 7. In other words, our method fully utilizes the LSTM mechanism to learn the prior knowledge about this kind of inter-layer correlation.
Figure 7.(a), (b) Reconstructed CP and the corresponding per-layer MAE, respectively, under the ideal situation where the ground truth CPs of each layer are equal to zero.
The R² and slope are widely used to evaluate the fitting performance of the methods. The R² for the CDs of our proposed method is 0.9995, while those of the compared models are all no more than 0.9993. For the CPs, the R² of CLNet is 0.9953, while those of the other methods are no more than 0.9942. This indicates that our method has superior fitting ability. The slopes of CLNet are 0.9993 for CD and 0.9942 for CP, respectively. Apart from CLNet, the best slopes achieved for CD and CP are 0.9997 (TinyViT-5m) and 0.9920 (ResNet34), respectively. The slope achieved by TinyViT-5m for the CDs is closer to 1 than that of our method, but its slope for the CPs is only 0.9897. The detailed evaluation results are shown in Table 1. Notably, our method still demonstrates good fitting ability at this background level, with both the R² and the slope approaching 1.
Table 1. Evaluation Comparison of Six Types of Methods

| Method | Overall MAE, CD (nm) | Overall MAE, CP (nm) | Max per-sample MAE, CD (nm) | Max per-sample MAE, CP (nm) | R², CD | R², CP | Slope, CD | Slope, CP |
|---|---|---|---|---|---|---|---|---|
| CLNet (our method) | 0.110 | 0.031 | 1.982 | 0.631 | 0.9995 | 0.9953 | 0.9993 | 0.9942 |
| ResNet34 | 0.132 | 0.036 | 2.396 | 0.764 | 0.9993 | 0.9942 | 0.9978 | 0.9920 |
| ResNet50 | 0.142 | 0.041 | 2.967 | 0.645 | 0.9992 | 0.9926 | 0.9995 | 0.9867 |
| AlexNet | 0.510 | 0.098 | 3.871 | 1.358 | 0.9933 | 0.9644 | 0.9934 | 0.9558 |
| EffNet-b2 | 0.456 | 0.087 | 2.606 | 1.065 | 0.9948 | 0.9716 | 0.9968 | 0.9625 |
| TinyViT-5m | 0.156 | 0.045 | 2.264 | 0.742 | 0.9991 | 0.9921 | 0.9997 | 0.9897 |
Uncertainty quantification holds significant importance as it enables the estimation of method repeatability and facilitates comparability across diverse approaches[45,46]. To quantify the model uncertainty of our method, we sample the posterior distribution using stochastic gradient Langevin dynamics (SGLD)[47] and perform Bayesian inference averaging on the obtained samples. Typically, the uncertainty is represented by three times the standard deviation (3σ)[48]. Figure 8 illustrates the 3σ uncertainty quantifications for CD and CP of one randomly selected sample, using our method and the five comparison methods. Besides accurately recovering the structural parameters, our method achieves narrower confidence intervals than the other methods. To further evaluate the uncertainty of our method, we calculated the mean 3σ over the entire test set. The CLNet method has a mean 3σ for CD of 0.322 nm, while ResNet34, ResNet50, AlexNet, EffNet-b2, and TinyViT-5m have 0.527, 0.570, 0.673, 0.795, and 0.424 nm, respectively. The mean 3σ for CP of our method is 0.116 nm, while the values of ResNet34, ResNet50, AlexNet, EffNet-b2, and TinyViT-5m are 0.204, 0.155, 0.216, 0.231, and 0.189 nm, respectively. CLNet reduces the uncertainty for CD and CP by 24.1% and 25.2%, respectively, relative to the next-best method. These results demonstrate that our method not only performs better in terms of accuracy but also enhances the measurement repeatability by nearly 1.4 times.
Figure 8.(a), (b) Model uncertainty quantification for CD and CP, respectively, based on six types of methods, including CLNet (our method), ResNet34, ResNet50, AlexNet, EffNet-b2, and TinyVit-5m. Our method performs best in terms of reconstruction accuracy and repeatability.
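The SGLD-based 3σ estimate can be illustrated on a toy problem. The sketch below samples the posterior of a Gaussian mean (unit noise variance, flat prior); each step is a gradient step on the negative log-posterior plus injected Gaussian noise of variance ε, and 3σ of the retained samples gives the repeatability figure. All step sizes, sample counts, and the full-batch gradient (proper SGLD would use minibatches) are toy assumptions, not the paper's settings.

```python
import math
import random

def sgld_posterior_samples(data, n_steps=5000, burn_in=1000, eps=1e-3, seed=0):
    """Toy SGLD for the mean of Gaussian data with unit noise variance.

    Update: theta <- theta - (eps/2) * grad U(theta) + sqrt(eps) * N(0, 1),
    where U(theta) = 0.5 * sum((theta - x_i)^2) is the negative log-posterior.
    """
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for step in range(n_steps):
        grad = sum(theta - x for x in data)  # full-batch gradient for simplicity
        theta += -0.5 * eps * grad + math.sqrt(eps) * rng.gauss(0.0, 1.0)
        if step >= burn_in:                  # discard the burn-in transient
            samples.append(theta)
    return samples

rng = random.Random(1)
data = [5.0 + rng.gauss(0.0, 1.0) for _ in range(50)]  # observations around 5
samples = sgld_posterior_samples(data)
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
uncertainty = 3.0 * std  # the 3-sigma repeatability figure used in the text
```

The posterior mean recovers the true parameter, and the 3σ spread of the chain quantifies the repeatability, analogous to the per-parameter intervals in Fig. 8.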
In this work, we propose a correlation learning-based method for reconstructing complex profile nanostructures in SAXS measurement. Considering the inter-layer correlation of profile structural parameters and the nature of SAXS signals, we have designed and evaluated the CLNet framework, which incorporates an LSTM mechanism. The CLNet actively learns the inter-layer correlations of samples, significantly enhancing the accuracy and reducing the uncertainty of profile reconstruction. Notably, our approach improves reconstruction accuracy by at least 13.9% and exhibits a measurement repeatability nearly 1.4 times better than the best of five other learning-based methods. In the future, we hope to extend our method to a broader range of nanostructures, such as 3D hole nanostructures. Additionally, the proposed approach may have potential applications in other profile metrology techniques, such as optical critical dimension (OCD) metrology, which merits further exploration.
[7] M. Wormington, A. Ginsburg, I. Reichental et al. X-ray critical dimension metrology solution for high aspect ratio semiconductor structures. Metrology, Inspection, and Process Control for Semiconductor Manufacturing XXXV, 11611, 150(2021).
[13] J. Reche, Y. Blancquaert, G. Freychet et al. Dimensional control of line gratings by small angle X-ray scattering: Shape and roughness extraction. 2020 31st Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), 1(2020).
[18] R. Ciesielski, L. M. Lohr, H. Mertens et al. Pushing the boundaries of EUV scatterometry: reconstruction of complex nanostructures for next-generation transistor technology. Metrology, Inspection, and Process Control XXXVII, 12496, 447(2023).
[22] S. Liu, T. Yang, J. Zhang et al. X-ray scatterometry using deep learning. Tenth International Symposium on Precision Mechanical Measurements, 12059, 481(2021).
[24] B. Dey, D. Cerbu, K. Khalil et al. Unsupervised machine learning based CD-SEM image segregator for OPC and process window estimation. Design-Process-Technology Co-optimization for Manufacturability XIV, 11328, 317(2020).
[26] A. Baranovskiy, I. Grinberg, M. G. Greene et al. Deep learning for the analysis of X-ray scattering data from high aspect ratio structures. Metrology, Inspection, and Process Control XXXVII, 12496, 837(2023).
[29] S. S. Ginosar. Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns(2020).
[30] J. Zhang, T. Lan, Y. Gao et al. CDSAXS study of 3D NAND channel hole etch pattern edge effects and etched hole pattern variance. Metrology, Inspection, and Process Control XXXVIII, 12955, 813(2024).
[37] S. Siami-Namini, N. Tavakoli, A. S. Namin. The performance of LSTM and BiLSTM in forecasting time series. 2019 IEEE International Conference on Big Data (Big Data), 3285(2019).
[38] S. Santurkar, D. Tsipras, A. Ilyas et al. How does batch normalization help optimization?. Advances in Neural Information Processing Systems, 31(2018).
[39] J. L. Ba, J. R. Kiros, G. E. Hinton. Layer normalization(2016).
[40] C. Settens, B. Bunday, B. Thiel et al. Critical dimension small angle X-ray scattering measurements of FinFET and 3D memory structures. Metrology, Inspection, and Process Control for Microlithography XXVII, 8681, 200(2013).
[41] K. He, X. Zhang, S. Ren et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770(2016).
[42] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25(2012).
[43] M. Tan, Q. Le. EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning, 6105(2019).
[44] K. Wu, J. Zhang, H. Peng et al. TinyViT: fast pretraining distillation for small vision transformers. European Conference on Computer Vision, 68(2022).
[45] B. Bunday, G. Orji. Metrology. 2021 IEEE International Roadmap for Devices and Systems Outbriefs, 1(2021).
[47] C. Li, C. Chen, D. Carlson et al. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 30(2016).