Real-time 3D imaging based on ROI fringe projection and a lightweight phase-estimation network

Yueyang Li; Junfei Shen; Zhoujie Wu; Yajun Wang; Qican Zhang

doi:10.3788/AI.2024.10008

1. Introduction

The technique of structured-light three-dimensional (3D) imaging based on fringe projection, also known as fringe projection profilometry (FPP)^[1–4], has long been a focal point in both academic research and industrial applications due to its efficiency, precision, and cost-effectiveness on converting data into 3D models. Through the processing of 3D imaging data, precise measurement and analysis of object dimensions, shapes, and surface features have been well achieved in static scenes^[5–7]. However, various limiting factors exist in the 3D imaging of dynamic scenes, such as inter-frame movements of objects and the low efficiency of pattern encoding, preventing FPP from achieving the same level of precision as in static scenes. Furthermore, in scenarios such as augmented reality and online monitoring, it is crucial to perform high-precision 3D imaging in real-time, which brings challenges to FPP algorithms for their complexity and robustness.

FPP-based 3D reconstruction consists of four pivotal stages: fringe pattern projection and modulation, phase demodulation, phase unwrapping, and 3D calibration and reconstruction. To solve the problems of dynamic scene measurement, these stages have been optimized by many researchers over the past decade to enhance the adaptability of the algorithms.

One intuitive way is to reduce the number of required projected patterns using composite fringes. Some researchers have proposed different ways to embed additional encoding information into the fringes, such as binary coding patterns^[8], speckle patterns^[9,10], triangular waves^[11], and coded phases^[12]. Through corresponding decoding algorithms, the composite fringes are separated into sinusoidal intensity and another portion used to compute fringe orders. This approach employs encoded information to assist with phase unwrapping, obviating the necessity for directly projecting additional images.

Besides, the reuse of fringes or redundant information can also reduce the number of images required for phase unwrapping. Wu et al. proposed cyclic complementary Gray-code^[13] and shifting Gray-code^[14] by re-encoding Gray-code sequences in both temporal and spatial domains, which circumvent edge errors while simultaneously reducing the number of fringes by one. Zuo et al.^[15] assumed that the background light intensity of the fringes remains constant throughout the measurement process and utilized the same background light intensity for two adjacent sets of phase-shifting fringes to compute phase values, while his alternative method (2 + 2)^[16] employed ramp patterns with a linear intensity variation to compute phase. The previously mentioned methods have been proven to be beneficial in improving measurement efficiency. However, due to the utilization of multiple fringes, the motion-induced error will be introduced, leading to periodic error distributions.

An alternative approach involves modulating multiple fringe patterns with different carrier frequencies along orthogonal directions and combining them into a single pattern image^[17,18]. The wrapped phases corresponding to these patterns can be separated from the Fourier spectrum of the composite fringe pattern and then unwrapped using temporal phase unwrapping (TPU)^[19,20]. While frequency multiplexing can enhance the efficiency of FPP, careful selection of frequencies is necessary to prevent spectrum aliasing. Additionally, the filtering operations involved in Fourier fringe analysis can pose challenges for achieving high-accuracy measurements on complex surfaces.

Recently, numerous deep learning-based FPP methods have been developed and have demonstrated outstanding performance in various aspects, including fringe enhancement^[21], fringe analysis^[22,23], and high-dynamic-range measurement^[24,25]. Among these, some methods are particularly suitable for dynamic scene measurement due to their reduced requirement for the number of fringes. Sam Van der Jeught et al.^[26] designed a convolutional neural network (CNN) to directly predict depth from a single fringe pattern without any additional intermediate processing steps, showcasing the potent mapping capability of deep learning technology. However, such direct mapping approaches often struggle to accurately learn the physical parameters during the calibration process, resulting in reduced precision when dealing with complex surfaces.

Many methods have instead adopted multi-stage processing pipelines, wherein deep learning is employed to obtain accurate fringe patterns or retrieve phase information, followed by the use of traditional methods to convert phase into 3D shapes—for instance, separating different grayscale phase-shifted fringes from a single-frame color composite fringe pattern through CNN^[27] and then using traditional phase-shifting profilometry (PSP)^[28,29] and TPU methods to compute the 3D results. Alternatively, the method of extracting phase information and coarse unwrapped phase from a frequency multiplexing fringe^[30], followed by utilizing TPU to obtain the unwrapped phase, can be used. Benefiting from the support of traditional methods, such approaches can achieve greater accuracy when dealing with complex surfaces, but the large number of parameters of neural networks and the complexity of network structures make it hard to achieve real-time performance.

To address this issue, Li et al.^[31] utilized network architecture search (NAS) to automatically design a lightweight phase demodulation network and combined it with a modified heterodyne phase unwrapping (MHPU) method, successfully achieving real-time 3D imaging at 58 frame/s. As the speed increases, the accuracy of this method tends to decrease compared to models with larger parameter numbers. Yin et al.^[32] introduced a physics-informed deep learning method for phase retrieval, which replaces the filtering process of Fourier fringe analysis with two learnable filters. By combining this approach with a lightweight network, they achieved high-precision measurements in low latency. However, this method requires prior knowledge similar to Fourier transform profilometry (FTP) to determine the positions, sizes, and initial weights of the learnable filters, which may limit its flexibility when faced with different frequencies.

In order to strike a better balance between the measuring accuracy, speed, and flexibility, we proposed a real-time 3D imaging method based on region of interest (ROI) projection and a phase-estimation network. For fringe projection, by determining an appropriate projection resolution for fringes, we effectively increased the number of fringe periods within the tested target (i.e., ROI) and thereby improved phase accuracy. For phase demodulation, a phase estimation module was designed to provide reliable initial phases for the subsequent lightweight neural network, resulting in the phase-estimation network (PE-Net). In formulating the loss function, the similarity of information in both spatial and frequency domains was considered for effectively improving the accuracy and speed of the phase prediction. For phase unwrapping, theoretical analysis was employed for the MHPU method from the perspective of depth constraint, resulting in a more rational selection of frequencies. Experimental results validate that the proposed method, by jointly improving the three stages mentioned above, can achieve real-time 3D imaging with an error of less than 0.031 mm at a resolution of over a million points. Moreover, the imaging speed exceeds 100 frame/s, which provides an efficient lightweight solution for deep-learning-assisted 3D shape measurement in dynamic scenes.

2. Principle

2.1. Basic principle of fringe projection profilometry

A typical monocular FPP system consists of three main components: a projector, a camera, and a computational unit. The projection device projects one or more encoded fringe patterns onto the target object. The imaging device captures the deformed fringes reflected from the object’s surface. Finally, the computational unit decodes the deformed fringes, providing the 3D shape information of the tested object. In $N$ -step PSP, the captured fringe patterns can be represented as $I_{n} = A + B \cos (ϕ - δ_{n}), δ_{n} = 2 π (n - 1) / N, n = 1, 2, \dots, N .$ (1)Here, $(x, y)$ donates the pixel coordinate, $A$ is background intensity, $B$ is intensity modulation, and $ϕ$ is the desired phase map. According to the least squares algorithm, the phase map $ϕ$ and intensity modulation can be calculated as $ϕ = \arctan \frac{M}{D} = \arctan \frac{\sum_{n = 1}^{N} I_{n} \sin δ_{n}}{\sum_{n = 1}^{N} I_{n} \cos δ_{n}},$ (2) $B = \frac{2}{N} \sqrt{M^{2} + D^{2}} .$ (3)Here, $M (x, y)$ and $D (x, y)$ represent the numerator and denominator of the arctangent function, respectively. Due to the nature of the distribution of values computed by the arctangent function, the phase obtained will be truncated within $(- π, π]$ and needs to be unwrapped. By employing an appropriate phase unwrapping method, the wrapped phase $ϕ$ can be mapped to its unwrapped counterpart: $Φ = ϕ + 2 π k .$ (4)Here, $k$ is the integer fringe order. Finally, after obtaining the unwrapped phase containing object information, further conversion of the unwrapped phase from pixel coordinates to real-world 3D coordinates can be achieved through calibration models such as the phase-height mapping model^[33,34] and triangular stereo model^[35,36]. This completes the overall process of 3D imaging. The triangular stereo model was employed in the experimental section.

Traditional FPP techniques are highly effective for measuring static scenes, but they often encounter limitations in achieving high-precision and real-time processing in dynamic scenes. Therefore, many recent works have attempted to integrate deep learning into FPP to overcome the efficiency bottlenecks in measurements.

2.2. Proposed real-time 3D imaging based on ROI projection and phase estimation

2.2.1. Framework of the proposed real-time 3D imaging method

Currently, most deep-learning-based 3D imaging methods adopt end-to-end CNNs, leveraging a large dataset to establish an accurate mapping relationship between input fringes and ground truth. UNet^[37] and its derivatives are the most popular networks used for fringe analysis. While such methods can achieve high precision in phase retrieval, the introduction of numerous convolutional layers or complex modules makes it challenging to meet the real-time measurement demands of dynamic scenes. Although some methods have proposed lightweight solutions to solve this problem, they often sacrifice accuracy or flexibility. Therefore, achieving improved inference speed without compromising the phase precision and method flexibility remains a pressing issue.

To address this issue, a real-time 3D imaging method based on ROI projection and phase estimation is proposed, as illustrated in Fig. 1. First, the ROI fringe projection strategy is employed to precisely target the area of interest on the tested object, narrowing down the ROI for fringe projection and imaging, thus reducing the fringe period and enhancing phase accuracy. Subsequently, a lightweight network architecture based on phase estimation is designed for real-time and high-precision phase demodulation. Finally, an improved dual-frequency phase unwrapping method is utilized to achieve robust and efficient single-frame 3D imaging. Since the a priori knowledge about phase unwrapping is required in ROI projection, the details about each module in the workflow will be introduced, respectively, in the order of phase demodulation, phase unwrapping, and fringe projection within Secs. 2.2.2 to 2.2.4.

Figure 1.Flowchart of real-time 3D imaging based on ROI projection and phase estimation. The green and orange arrows indicate the processing flow for low- and high-frequency fringe patterns, respectively.

Download full size

View all figures

2.2.2. Lightweight phase-estimation neural network

The proposed PE-Net, as shown in Fig. 2(a) and inspired by the work of Yin et al.^[32], utilizes a phase estimation module to pre-estimate initial phase information from a single-frame fringe pattern. Subsequently, this initial phase and the input fringe pattern are jointly used as inputs to the lightweight network, yielding the output phase information. Two learnable filters are constructed for estimating the initial phase, each aimed at removing zero frequency and filtering out the fundamental frequency in the FTP method. However, the sizes and center points of these learnable filters still need to be empirically chosen and cannot be flexibly applied to different numbers of fringe periods. Furthermore, existing deep learning frameworks still lack complete support for 2D fast Fourier transform operations. For example, in PyTorch^[38], FFT cannot be converted to open neural network exchange (ONNX), and its absence in the model may necessitate custom implementation or integration with specialized libraries for FFT operations, which could impact overall speed depending on the implementation efficiency. Therefore, learnable convolution operations in the spatial domain are employed to replace frequency-domain filtering operations, as shown in Fig. 2(b).

Figure 2.Real-time phase estimation network structure. (a) Outer structure of PE-Net, loss function design, and training process; (b) internal structure of the phase estimation module; (c) internal structure of the lightweight network.

Download full size

View all figures

The phase estimation module utilizes two convolutional layers combined with ReLU^[39] activation functions as the main components. It extracts the fundamental frequency phase information from the input image through spatial convolution. The first convolutional layer has 64 channels, responsible for converting the fringe pattern into multidimensional features, while the second one has 2 channels, responsible for learning the arctan numerator and denominator terms used to compute the initial phase from the multidimensional features. This module is relatively simple and does not require prior knowledge about FTP, making it flexible for various fringe analysis tasks.

The initial phase information predicted by the phase estimation module is concatenated with the input fringe pattern along the channel dimension and input into the lightweight network to obtain the final phase. The specific design of the lightweight network structure and the internal structures of each module are illustrated in Figs. 2(b), 2(c), and 3, respectively.

Figure 3.NonBt1d module, downsampling module, dilated convolution module, and upsampling module within the lightweight network.

Download full size

View all figures

The overall structure adopts an encoder-decoder architecture, incorporating modules from ERFNet^[40] to achieve a balance between high-quality phase and low computational requirements. The encoder is responsible for capturing hierarchical features from the input fringe images. It reduces the spatial resolution of the feature maps and increases the number of channels through three downsampling steps. The initial number of channels is 64, and the number of channels doubles with each subsequent downsampling step.

The downsampling module adopts the design of ENet^[41], which combines $2 \times 2$ max pooling and $3 \times 3$ convolution with a stride of 2. This combination extracts richer fringe feature information compared to a single convolution operation. The features obtained from downsampling are further processed by the non-bottleneck (NonBt1d) module. This module decomposes the standard convolution in the residual module into two convolutions of sizes $3 \times 1$ and $1 \times 3$ , ensuring approximate accuracy while reducing the required parameters, which is also known as depthwise separable convolution. The last downsampling module utilizes NonBt1d modules with different dilation rates for dilated convolutions to achieve a larger receptive field, which is advantageous for handling high-resolution fringe patterns.

In the decoder part, a symmetrical structure to the encoder is adopted. The upsampling module in the decoder uses $3 \times 3$ transposed convolution with a stride of 2 to adjust the spatial resolution of the feature maps. After three upsampling steps, the output is used to predict the numerator and denominator terms for phase computation, consistent with the $M$ and $D$ in Eq. (2).

The training process of PE-Net is illustrated in Fig. 2(a). The loss function evaluates the network’s output from both spatial and frequency domains. The spatial loss function measures the mean squared error (MSE) among the initial phase information ${\hat{y}}_{init}$ obtained by the phase estimation module, the phase $\hat{y}$ obtained by the lightweight network, and the ground truth phase $y$ : $L_{spatial} = α_{1} {[\hat{y} (w) - y]}^{2} + α_{2} {[N {\hat{y}}_{init} (w) - y]}^{2},$ (5)where $w$ represents the neural network parameters; the tensor sizes of ${\hat{y}}_{init}$ , $\hat{y}$ , and $y$ are $H \times W \times 2$ , where the first channel contains the numerator terms required for phase computation; and the second channel contains the corresponding denominator terms. The frequency domain loss function is given by $L_{Fourier-mse} = β_{1} {[\hat{G} (w) - G]}^{2} + β_{2} {[{\hat{G}}_{init} (w) - G]}^{2},$ (6) $L_{Fourier-ssi} = γ_{1} \frac{\sum | \hat{G} (w) - G |}{\sum | \hat{G} (w) + G |} + γ_{2} \frac{\sum | {\hat{G}}_{init} (w) - G |}{\sum | {\hat{G}}_{init} (w) + G |},$ (7)where ${\hat{G}}_{init}$ , $\hat{G}$ , and $G$ are the Fourier transforms of ${\hat{y}}_{init}$ , $\hat{y}$ , and $y$ , respectively. $L_{Fourier-mse}$ measures the pixel-wise difference between the network output and the ground truth from the perspective of spectral amplitude, while $L_{Fourier-ssi}$ evaluates the structural difference between the network output and the ground truth from the perspective of spectral similarity. The parameters $α_{1}$ , $α_{2}$ , $β_{1}$ , $β_{2}$ , $γ_{1}$ , and $γ_{2}$ are hyperparameters that adjust the importance of spatial and frequency domain components. The total loss function is expressed as $L = L_{spatial} + L_{Fourier-mse} + L_{Fourier-ssi} .$ (8)

The proposed PE-Net includes an additional phase estimation module, which provides more effective input features for the lightweight network. In terms of network structure, spatial separable convolutions are utilized to reduce computational complexity, and multiple layers of dilated convolution NonBt1d modules are stacked to increase the receptive field, facilitating the handling of high-resolution fringe patterns.

2.2.3. Theoretical analysis of the modified heterodyne phase unwrapping method

The aforementioned PE-Net achieves high-precision phase retrieval from fringe patterns captured by ROI projection. The MHPU, as reported in our previous work^[31], is adopted to unwrap the phase for a wider selection range of the frequencies, as shown in Fig. 4. The unwrapping process begins with taking the phase differences between the phases of two frequencies: $ϕ_{eq} = ϕ_{h} - ϕ_{l},$ (9)where $ϕ_{l}$ , $ϕ_{h}$ , and $ϕ_{eq}$ represent the low frequency, high frequency, and equivalent phase, respectively. The corresponding beat frequency $f_{eq} = f_{h} f_{l} / (f_{h} - f_{l})$ . Different from the traditional dual-frequency heterodyne phase unwrapping method, the MHPU method relaxes the limitation on the number of fringe periods, allowing the beat frequency to be greater than 1. This means that higher numbers of fringe periods can be used to modulate the information of the object, thereby improving phase accuracy. Then an unwrapped phase $ϕ_{r}$ of the reference plane is employed to eliminate the ambiguity of $ϕ_{eq}$ : $Φ_{eq} = ϕ_{eq} + 2 π \cdot Round [\frac{Φ_{r} - ϕ_{eq}}{2 π}] .$ (10)

Figure 4.Flowchart of the MHPU method. (a) Equivalent phase with the phase difference of low- and high-frequency fringe; (b) unwrapped equivalent phase assisted by a reference plane; (c) unwrapped phases of low frequency and high frequency.

Download full size

View all figures

Furthermore, the wrapped phases of low frequency and high frequency can be unwrapped by ${\begin{matrix} Φ_{l} = ϕ_{l} + 2 π \cdot Round [\frac{(f_{l} / f_{eq}) Φ_{eq} - ϕ_{l}}{2 π}] \\ Φ_{h} = ϕ_{h} + 2 π \cdot Round [\frac{(f_{h} / f_{eq}) Φ_{eq} - ϕ_{h}}{2 π}] \end{matrix} .$ (11)

Utilizing the pre-computed phase of the reference plane can significantly expand the selectable range of frequency for heterodyning, enhancing accuracy and aiding the network’s feature extraction from input fringes. However, the introduction of a reference plane may lead to a reduction in the measurement range. Therefore, theoretical analysis of MHPU is necessary to ensure a reasonable measurement range while avoiding errors in phase unwrapping.

The limitations of MHPU are similar to those of absolute phase unwrapping methods based on geometric constraints^[42], as shown in Fig. 5. The angle between the projector and camera optical axes is $θ$ , and the physical size of the fringe period is $Δ x$ . Points $P$ and $Q$ represent the intersection of the same ray projected by the projector with planes at two different depths, $z_{\min}$ and $z_{\max}$ . The maximum depth range of the measurement is $[z_{\min}, z_{\max}]$ , and $Δ z_{\max} = z_{\max} - z_{\min}$ . To avoid errors in phase unwrapping, the maximum depth range $Δ z_{\max}$ needs to satisfy $Δ z_{\max} = \frac{Δ x}{\tan θ} .$ (12)

Figure 5.Depth limitation of MHPU.

Download full size

View all figures

Assume the projector’s field of view (FOV) is $W_{FOV}$ (mm). When the camera can capture the entire projected fringe pattern, and the depth range of the object under test is $Δ z_{obj}$ . Then the following inequality holds: $\frac{W_{FOV}}{\tan θ \sqrt{f_{h}}} \geq Δ z_{obj} .$ (13)

The number of fringe periods can be obtained by solving the above inequality: $f_{h} \leq {(\frac{W_{FOV}}{Δ z_{obj} \tan θ})}^{2} .$ (14)

In conclusion, the selection of the low-frequency $f_{l}$ and high-frequency $f_{h}$ in the improved dual-frequency phase unwrapping strategy depends on the robustness requirements of the phase unwrapping and the desired range of measurement depth. To ensure good noise immunity in phase unwrapping, the low and high frequencies must satisfy $f_{l} = f_{h} - \sqrt{f_{h}}$ . Meanwhile, to ensure the maximum measurable depth range, the high-frequency $f_{h}$ needs to satisfy Eq. (14).

2.2.4. ROI fringe projection strategy

For a fixed FPP system, reducing camera resolution to decrease the number of pixels is a common approach to meet the real-time demands of dynamic scenes. However, this approach will result in the camera’s FOV being smaller than that of the projector, causing a reduction in the effective number of fringe periods from the camera’s perspective and subsequently decreasing phase accuracy. Additionally, the size of the target object is usually smaller than the projector’s FOV. If the generated fringes are projected according to the full resolution of the projector, it will lead to a limited number of fringe periods effectively modulated by the object within the FOV. To address these two problems, a fringe projection strategy based on the ROI is proposed.

Equation (14) can be rewritten as $p_{h} \geq W_{p} {(\frac{Δ z_{obj} \tan θ}{W_{FOV}})}^{2},$ (15)where $p_{h}$ represents the width of the high-frequency fringe period and $W_{p}$ represents the width of the generated fringe (pixels). It is evident that appropriately reducing $W_{p}$ while ensuring that the projected fringe can cover the ROI can effectively decrease the lower bound of $p_{h}$ , which can increase the corresponding number of fringe periods.

The normal fringe projection strategy, as shown in Fig. 6(a), utilizes the entire resolution $W_{p} \times H_{p}$ pixels of the projector to project fringes. For areas outside of the tested objects in the measurement scene, where the fringe does not contain valid information, the corresponding fringe pattern pixels under the projector’s perspective are not involved in calculations. Therefore, during fringe generation, actively constraining and reducing the area can maintain the same number of fringe periods while decreasing the fringe period, thereby enhancing measurement accuracy.

Figure 6.Two fringe projection strategies. (a) Normal fringe projection. (b) ROI fringe projection.

Download full size

View all figures

As shown in Fig. 6(b), the resolution of the ROI fringe pattern $W_{p}^{'} \times H_{p}$ is determined based on the actual occupied area of the object in the measurement FOV. The origin of the ROI has shifted by $Δ W$ relative to the original image in the $x$ -axis direction. The phase distribution of normal and ROI fringe projection can be expressed as ${\begin{matrix} Φ = \frac{2 π f x}{W_{p}} \\ Φ^{'} = \frac{2 π f}{W_{p}^{'}} (x - Δ W_{p}^{'}) \end{matrix} .$ (16)

The phase relationship before and after changing the projection strategy is $Φ = \frac{W_{p}^{'}}{W_{p}} Φ^{'} + \frac{2 π f Δ W_{p}^{'}}{W_{p}} .$ (17)

In the triangular stereo model, the calculation of pixel coordinates is based on the origin of the phase in the normal projection strategy. Therefore, when using the ROI projection strategy, the obtained phase needs to be transformed according to Eq. (17) to obtain the correct pixel coordinates, ensuring the phase starts at the appropriate zero point.

In conclusion, the ROI fringe projection makes it more effective to utilize the measurement system’s FOV. In practical applications, only the approximate FOV of the measurement scene and the angle between the projector and camera optical axes are required. This allows for increasing the fringe frequency while keeping the number of fringe periods constant, thereby enhancing the precision of 3D measurements.

2.3. Performance analysis of the proposed real-time 3D imaging method

The proposed real-time 3D imaging method introduces several innovations. In the aspect of phase demodulation, a neural network-driven approach is adopted, using more flexible, learnable spatial convolution layers to replace manually designed operations in the spectral domain. This provides reliable initial phase estimation for the subsequent lightweight network, allowing the network to learn accurate mapping relationships with fewer parameters without sacrificing phase accuracy. In the loss function of PE-Net, the network training is constrained in both spatial and frequency domains with the consideration of pixel-level differences and structural similarity differences, resulting in more accurate output phase information.

In the aspect of phase unwrapping, the conditions required for the MHPU method to successfully unwrap the phase are analyzed from the perspective of depth constraint. The selectable range of low frequency and high frequency is determined to ensure that the correct 3D shape can be obtained while the highest frequency can be used to modulate the object information.

In the aspect of fringe projection, based on the conclusions drawn from analyzing the MHPU method, the region of actually projected fringes is reduced to within the ROI, decreasing the number of fringe periods without losing valid information. Additionally, the phase relationship before and after using ROI fringe projection is derived to provide the correct phase when performing 3D reconstruction with the fixed calibration parameters.

Through the analysis and improvements in ROI fringe projection, PE-Net phase demodulation, and MHPU phase unwrapping, the entire 3D imaging process gains the capability to efficiently process fringe images while maintaining high precision and flexibility.

3. Experiment

3.1. ROI determination and dataset establishment

The experimental setup of the FPP system consists of a camera (Baumer, VCXU-31M) with a resolution of 1280 pixel × 800 pixel and a projector (TI, DLP4500) with a resolution of 912 pixel × 1140 pixel. The focal length of the camera lens is 12 mm. Both the camera and projector have an exposure time set to 10 ms. The system’s FOV is approximately 300 mm × 150 mm, with a working distance of around 600 mm.

During the measurement, the projector’s FOV exceeds that of the camera, allowing for further reduction in the fringe period using ROI fringe projection. First, a white pattern was projected onto the surface of the target object. Next, the valid area of the object within the image captured by the camera can be determined. Finally, the resolution of the ROI under the projector’s perspective was calculated. An additional redundancy is added to the maximum size of the object area to calculate the ROI. The actual required fringe width $W_{p}^{'}$ for projection is 500 pixel, with an offset of 256 pixel in the $x$ -axis direction.

After determining the ROI for projecting fringes, it is necessary to calculate the suitable number of fringe periods $f_{l}$ and $f_{h}$ required in MHPU. The projected fringes occupy the entire FOV of the camera, with the measured $W_{FOV}$ approximately 300 mm. The depth range $Δ z_{obj}$ of the object under test was approximately 120 mm. The angle $θ$ between the optical axes of the camera and projector was obtained through calibration as 16.32°. Substituting $W_{FOV} = 300$ , $Δ z_{obj} = 120$ , and $θ = 16.32 °$ into Eq. (14) yields $f_{h} \leq {(\frac{300}{120 \times \tan 16.32})}^{2} \approx 72.9 .$ (18)

The maximum selectable fringe period number is 72.9. Therefore, $f_{h}$ was selected as 72. The maximum measurement depth range can be driven by Eq. (14), $Δ z_{\max} = 120.7 mm$ , which is quite reasonable in many application scenarios. Consequently, $f_{l}$ was selected as $f_{l} = 72 - \sqrt{72} \approx 64$ to ensure better noise resistance, which can be derived from the traditional three-frequency phase unwrapping method.

12-step phase-shifting fringe patterns with $f_{l} = 64$ and $f_{h} = 72$ were projected to calculate the ground truth of the numerator and denominator terms in the phase demodulation task. A total of 100 sets of data have been collected under different scenes and divided into training and validation sets at a ratio of $9 ∶ 1$ . A mask was obtained by setting the average intensity threshold to 10, which is then applied to the fringe pixels to remove unreliable regions with low-intensity modulation. To further improve the model’s generalization and robustness, crop transformation was adopted to input images during the training stage. Specifically, a random region with a width of 64 to 256 pixel was selected from the image, and the pixel values within that region were set to zero.

3.2. Quantitative evaluation of ROI fringe projection

In this section, we compared the accuracy variation between the normal and the ROI projection strategy, using the three-step phase shift method to calculate the wrapped phase and MHPU for phase unwrapping.

Figure 7 shows a comparison between two projection strategies, using a number of fringe periods of 72 for both strategies. It can be seen that, after determining the appropriate ROI, the fringe period width is significantly reduced, indicating that more phase values are used for modulation of the target object in the ROI, resulting in higher measurement accuracy.

Figure 7.Comparison of two fringe projection strategies. (a) Generated fringe pattern (top left) and perspective view (top right) of normal fringe projection. (b) Corresponding results of ROI fringe projection.

Download full size

View all figures

Two standard spheres were measured to quantitatively evaluate the accuracy of the ROI fringe projection strategy. The diameters of the two standard spheres are $D_{A} = 50.7991 mm$ (left sphere A) and $D_{B} = 50.7970 mm$ (right sphere B), with a distance between their centers $D_{C}$ of 100.2537 mm. 3D reconstruction results of the two strategies are shown in Fig. 8, and the corresponding values are given in Table 1. After using ROI projection, the fringe period width on the surface of the standard sphere was significantly reduced, and the ${MAE}_{A}$ and ${RMS}_{A}$ errors decreased by 0.0225 and 0.0137 rad, respectively. The ${MAE}_{B}$ and ${RMS}_{B}$ decreased by 0.028 and 0.0243 rad, respectively, proving the effectiveness of the proposed ROI projection strategy.

Figure 8.Measurement results of standard spheres using three-step PSP and two projection strategies. (a), (b) Fringe patterns of two projection strategies. (c),(d) 3D reconstruction results of two projection strategies.

Download full size

View all figures

Table 1. Quantitative Results of Standard Sphere Measurement Using Three-Step PSP and Two Projection Strategies

View table
View all Tables
Table 1. Quantitative Results of Standard Sphere Measurement Using Three-Step PSP and Two Projection Strategies

$D_{A}$ $D_{B}$ $D_{C}$ ${MAE}_{A}$ ${MAE}_{B}$ ${RMS}_{A}$ ${RMS}_{B}$
Normal 50.8002 50.8158 100.1140 0.0995 0.0855 0.1207 0.1100
ROI 50.8000 50.8062 100.1362 0.0770 0.0718 0.0927 0.0857

3.3. Training and validation of PE-Net

PE-Net was implemented by PyTorch 1.13.1. The training and validation process was completed in a hardware environment with an Intel Xeon Gold 622 R 2.90 GHz CPU and one 40 GB NVIDIA A40 GPU. The inference time measurement was conducted with FP16 precision on a laptop with an Intel i9-13800HX 2.20 GHz and 8 GB NVIDIA GeForce RTX 4070 Laptop GPU through TensorRT. An AdamW^[43] optimizer with an initial learning rate of $10^{- 3}$ was used to minimize the joint loss function in Eq. (9). During the training stage. The learning rate was adjusted by the cosine annealing strategy within the range of $10^{- 3}$ to $10^{- 5}$ . The hyperparameters of the loss function were set to $α_{1} = 1$ , $β_{1} = 0.5$ , and $γ_{1} = γ_{2} = 1000$ . The initial values of $α_{2}$ and $β_{2}$ were set to 0.1 and gradually decreased using the step decay strategy to 0.02 as the epoch increased. The batch size was set to 4. The train and validation loss curves are shown in Fig. 9.

Figure 9.Train and validation loss curves of PE-Net.

Download full size

View all figures

The phase error of PE-Net relative to the 12-step PSP was given in Table 2, and Feng’s method^[22], UNet method^[23], and NAS method^[31] were trained and validated on the same dataset for comparison. The inference time was calculated by averaging the execution time of each algorithm, obtained from running it 1000 times in FP16. It can be seen that the MAE of the proposed method is 0.02979 rad, which is the smallest among the four methods, and the corresponding RMS is 0.05163 rad, only 0.0016 rad larger than that of UNet with the largest parameter number. The mean inference time of PE-Net is 6.01 ms (about 166 frame/s), which is the fastest among the four methods. The inference time of Feng’s method, UNet method, and NAS method are approximately 11.26 times, 6.103 times, and 2.83 times that of PE-Net, respectively. The experimental results preliminarily demonstrate the superiority of the proposed method in terms of speed and accuracy.

Table 2. MAE and RMS of the Phase Error, Inference Time, and Model Size of Four Methods

View table
View all Tables
Table 2. MAE and RMS of the Phase Error, Inference Time, and Model Size of Four Methods

Method MAE (rad) RMS (rad) Inference time (ms) Model size (M)
Feng 0.03417 0.05811 67.70 0.70
UNet 0.03103 0.05007 41.02 31.00
NAS 0.03728 0.06364 17.02 0.13
PE-Net 0.02979 0.05163 6.01 2.10

3.4. Quantitative evaluation of the proposed method

To verify the effectiveness of the proposed method in complex scenarios, we first measured a sculpture of David. The 3D reconstruction results obtained from the 12-step PSP (ground truth), Feng’s method, UNet method, NAS method, and PE-Net are shown in Fig. 10. Additionally, Figure 11 illustrates the depth error distributions relative to the 12-step PSP, along with the corresponding MAE and RMS values. It can be seen that near the area around David’s nose, the errors are more pronounced for Feng’s method and the NAS method, whereas UNet and the proposed method exhibit smaller errors. Benefiting from the phase estimation module providing initial phase information to the network, the proposed PE-Net achieves lower MAE and RMS errors compared to the UNet method by 0.001 and 0.0009 mm, respectively, while maintaining relatively fast inference speed and lower computational complexity. The results indicate that the proposed method can obtain accurate 3D reconstruction results when dealing with complex objects.

Figure 10.3D reconstruction results of the David sculpture. (a) Ground truth. (b) Feng’s method; (c) UNet method; (d) NAS method; (e) proposed PE-Net.

Download full size

View all figures

Figure 11.Error distributions of the David sculpture. (a) Feng’s method; (b) UNet method; (c) NAS method; (d) proposed PE-Net. (e)–(h) Localized enlarged images of the dashed-boxed regions corresponding to each method.

Download full size

View all figures

Two standard spheres were measured to quantitatively evaluate the four methods, with parameters identical to those in Sec. 3.2. The 3D reconstruction results and error values of the four methods are shown in Fig. 12. From the depth error distributions, it can be observed that the proposed method exhibits smaller residual periodic errors compared to the other three methods. This is reflected in the corresponding MAE and RMS errors, all of which are less than 0.031 mm for the proposed method, which is attributed to the PE module proposed in the method, which simulates the FTP method’s extraction of coarse phase, allowing the overall network to achieve higher accuracy with fewer parameters and faster inference speed. These results demonstrate that, by combining ROI projection, PE-Net phase demodulation, and MHPU phase unwrapping, high-precision real-time 3D imaging can be achieved through the entire pipeline, balancing the speed, accuracy, and flexibility of the algorithm.

Figure 12.3D reconstruction results and error values of the standard spheres. (a) Feng’s method; (c) UNet method; (e) NAS method; (g) proposed PE-Net. (b), (d), (f), (h) Corresponding depth error distributions and error values.

Download full size

View all figures

3.5. Real-time 3D imaging of the dynamic scene

To validate the performance of the proposed method in dynamic scene measurement, we developed a real-time 3D imaging system on the laptop with an Intel i9-13800HX 2.20 GHz and 8 GB NVIDIA GeForce RTX 4070 Laptop GPU, which is mentioned in Sec. 3.2. Before the measurement process, it is necessary to generate ROI patterns based on parameters such as FOV and working distance and upload them to the projector firmware. Additionally, the network model and pre-calibrated parameters are stored in the GPU. During the measurement process, the projector triggered the camera to capture images via hardware signals. The captured image stream was transferred sequentially from the CPU to the GPU. Following normalization, PE-Net execution, phase calculation, MHPU phase unwrapping, and 3D reconstruction, point cloud results were obtained. Finally, the results can be visualized through OpenGL in real-time.

As shown in Figs. 13(a) and 13(b), the measurement scene contained a static David and a rotating Nike sculpture. Two fringe patterns with 64 and 72 fringe periods were projected cyclically. Although the inference time of PE-Net is 6.01 ms (166 frame/s), consideration must be given to the extra time consumed by data transmission between the CPU and GPU, point cloud rendering time, and GPU synchronization waiting time. Therefore, the projection period of the projector was set to 10 ms (100 frame/s) to prevent potential data transmission bottlenecks and thread blocking, thereby ensuring real-time data processing efficiency. Since the entire 3D imaging can be completed within the projection interval, the speed of the entire system is consistent with the projection speed of the projector, both being 100 frame/s. The real-time measurement result is shown in Visualization 1.

Figure 13.Real-time 3D reconstruction results (Visualization 1). (a), (b) Measurement scene comprising a stationary David and a rotating Nike sculpture. (c)–(f) 3D point cloud of different frames.

Download full size

View all figures

Figures 13(c)–13(f) present the point cloud results of different frames during the imaging process. It is evident that intricate details such as the wings and waist of the rotating statue of Nike, as well as complex variations in the position of details like the hair and neck of David, are accurately reconstructed. The experimental results demonstrate that PE-Net can accurately recover the phase information of objects with complex surfaces. Combined with the ROI fringe projection strategy and MHPU, high-accuracy real-time 3D imaging has been achieved with a speed of 100 frame/s and a resolution of $1280 \times 800$ .

4. Conclusion and Discussion

In this paper, the spatial and spectral information of the phase is utilized to provide a reasonable initial phase for the neural network. The lightweight network, combined with the phase estimation module, can predict high-precision phase results in real time. Additionally, theoretical analysis is conducted on the optimal selection of fringe cycle numbers in the MHPU method considering the FOV, measurement depth, and projector resolution. Based on this analysis, adjustments are made to the ROI of the fringe patterns from the perspective of the projector, further enhancing the phase accuracy. Our main contributions can be summarized in the following three points: 1.Efficient phase demodulation with PE-Net. The phase estimation module is designed to provide a reliable initial phase, enabling the lightweight neural network to achieve faster inference speeds without sacrificing the final accuracy. In the design of the loss function, both spatial and spectral information similarity are simultaneously considered.2.ROI fringe projection strategy. To address mismatches among the projector’s FOV, camera’s FOV, and target object size, appropriate ROI and resolution for actually projected fringes are carefully selected based on system parameters and theoretical analysis, which enables the reduction of fringe period width and thus improvement of the phase accuracy.3.Real-time 3D imaging that achieves a balance among speed, accuracy, and flexibility. Optimization is conducted at different stages of the 3D imaging process, including ROI fringe projection strategy, PE-Net phase demodulation, and MHPU phase unwrapping, which allows for one 3D reconstruction result for each newly captured image. The entire pipeline achieves real-time 3D imaging with an RMS error of less than 0.031 mm, a resolution of $1280 \times 800$ , and a speed over 100 frame/s, providing a more efficient lightweight solution for dynamic scene measurement.

Several aspects require further improvement in future investigations. First, with the increase in imaging speed, the data transmission bottleneck becomes the primary constraint on the algorithm performance. The transmission time of images between CPU and GPU already accounts for approximately 30% (1.79/6.01) of the entire algorithmic process time. Hence, exploring methods to reduce the data transmission time is crucial. This could involve enhancing data compression algorithms or computing pixel values solely for regions experiencing scene changes, thereby refining the algorithm’s performance.

Second, when dealing with objects with large-scale motion, it is challenging to flexibly determine suitable ROI projection strategies based on the current object pose in real-time. This limitation hampers the effective enhancement of measurement accuracy. Utilizing common techniques in deep learning such as object detection and semantic segmentation, it is possible to predict the spatial positions of the objects of interest in the current frame based on the captured fringe sequence. Continuously adjusting the ROI of fringe projection accordingly represents a promising avenue for future research in achieving high-precision real-time 3D imaging.

Lastly, recollecting data is necessary if the system parameters change. The sensitivity of the model to changes in specific system parameters (e.g., fringe frequency selection based on optical setup) plays a crucial role in the performance of the proposed method. Techniques like fine-tuning or transfer learning can enhance the model’s adaptability. If system parameters change within certain bounds or patterns, fine-tuning the pre-trained model on new data reflecting these changes can update the model without starting from scratch. This approach can reduce the need for extensive data recollection.

Category: Research Article

Received: Jun. 11, 2024

Accepted: Sep. 2, 2024

Published Online: Sep. 25, 2024

The Author Email: Qican Zhang (zqc@scu.edu.cn)

DOI:10.3788/AI.2024.10008

Table 1. Quantitative Results of Standard Sphere Measurement Using Three-Step PSP and Two Projection Strategies

Table 1. Quantitative Results of Standard Sphere Measurement Using Three-Step PSP and Two Projection Strategies

Table 2. MAE and RMS of the Phase Error, Inference Time, and Model Size of Four Methods

Table 2. MAE and RMS of the Phase Error, Inference Time, and Model Size of Four Methods