Predictive pixel-wise optical encoding: towards single-shot high dynamic range moving object recognition

Yutong He; Yu Liang; Honghao Huang; Chengyang Hu; Sigang Yang; Hongwei Chen

doi:10.1364/PRJ.533288

1. INTRODUCTION

High dynamic range (HDR) imaging is an essential imaging technology which aims to preserve as much effective information as possible from the natural scene, which satisfies the needs of applications in extreme conditions in which low dynamic range (LDR) imaging fails. For example, autonomous driving using the LDR visual pipeline could be saturated and over-exposed by highlight, resulting in the loss of information and an impact on decision-making, while HDR imaging alleviates these problems and guarantees the safety of autonomous driving.

Dynamic range (DR) is expressed as the ratio between the highest and lowest luminance value [1], also defined as the ratio of the highest to the lowest pixel value in logarithmic form within digital image domain [2]: $DR = 20 \lg (\frac{I_{\max}}{I_{\min}}) .$ (1)

From the moonlight illumination at night to bright sunshine conditions, the DR of the natural world reaches approximately 280 dB. The widest range of luminance that human eyes can perceive is 120 dB [3], while the DR of a conventional 8-bit camera cuts down to only 48 dB, according to Eq. (1). Generally, the process of capturing an HDR scene to acquire one LDR image via a conventional LDR camera can be modeled by three major steps: dynamic range clipping, non-linear mapping, and quantization [4]. Dynamic range clipping leads to a hard cutoff at some top irradiance determined by the full well capacity of the sensor, while quantization operation causes missing information for pixels in the under-exposed regions determined by the inherent noise of the sensor. It can be imagined how tremendous the information loss is during such a process in an LDR imaging pipeline. Yet, abundant applications are posing more requirements on extending dynamic range, including robotics, driver assistance systems, self-driving, and drones. Thus, in order to capture and accurately represent the rich information in natural scene content, plenty of HDR imaging techniques have emerged and drawn much attention from researchers over the last decades.

The mainstream approach which is widely deployed on current smartphone devices is to fuse a series of LDR images captured by a conventional sensor under different exposures [5 –8]. By sequentially capturing LDR measurements and merging them through brackets [9], these conventional HDR methods successfully expand the dynamic range and perform well with static scenes. However, these multi-shot fusion methods suffer from motion artifacts and difficulty with image alignment when dealing with motional scenes or a moving target. Although a batch of modified fusion solutions aims to eliminate these problems by leveraging optical flow [10] or using deep learning [11,12], they still have to face same inherent problems of these multi-shot fusion approaches: (1) the high memory consumption due to the large data volume (at least 3 LDR images for a single HDR output); (2) the high computational workload caused by complicated fusing algorithms. The above drawbacks limit the usage of multi-shot fusion methods in application scenes whose performance requirements are beyond mobile phone HDR.

To eliminate the aforementioned limitations of fusion methods, researchers dived into achieving HDR via the single-shot scheme. The prevailing single-shot HDR schemes are broadly categorized into two domains: software solutions concentrating on algorithmic enhancements and hardware strategies achieved through the revision of the sensor or optical path.

In the software domain, inverse tone mapping or inverse LDR camera pipelines, which developed deep networks to explicitly learn to reverse the image formation process [4], have provided a way for the reconstruction of HDR image from a single LDR image. Yet the end-to-end deep learning method still posed challenge on computational workload and generalization. Besides, HDR reconstruction is hallucinated from LDR images, which cannot be a credible alternative for downstream vision tasks that require accurate content from the real world.

In the hardware domain, researchers have attempted to capture multiple exposures simultaneously, within a single exposure period. To this end, some works aimed to redesign the pixel structure of the imaging sensor [13,14] and fuse multi-exposure measurements of optically aligned multiple sensors [15,16]. These methods achieve single-shot HDR reconstruction at the cost of fabrication complexity, device cost, alignment difficulty, footprint, and practical applicability.

As mentioned above, reconstructing HDR images from a single LDR image captured by a conventional sensor is an ill-posed inverse problem. A line of studies has attempted to overcome the challenge of ill-posed problems through the hardware domain by leveraging similar strategies from the field of computational imaging. They placed various optical components as optical encoders in front of the sensor in order to modulate the light field flexibly. The optical encoder encoded the saturated details, providing physical priors for the subsequent algorithms’ decoder to reconstruct an HDR image from a single LDR capture. The utilization of diffractive optical elements (DOEs) with point spread function (PSF) optimization strategies [17,18] was one aspect of these approaches, while spatially varying pixel exposures (SVEs) [2,19] were proposed as another aspect. Loading the custom patterns on spatial light modulators (SLMs) is a representative way of SVE methods, in which liquid crystals on silicon (LCoS) [20,21] or a digital micromirror device (DMD) [22,23] was employed to modulate the light field and encode the HDR information within the spatial domain or frequency domain. These existing optical-encoding-based single-shot HDR methods suffer drawbacks either in a further increase on DR, the recovery of an HDR scene with larger saturated regions, or time latency for a single HDR capture.

In this paper, we proposed a novel framework for a specific HDR vision task, namely a pixel-wise optical encoder driven by video prediction (POE-VP), which mainly consists of two modules: (1) the pixel-wise optical encoder module (POEM), as the hardware part; (2) the coded mask prediction module (CMPM), as the software counterpart. In POEM, we employ a DMD as an optical encoder to modulate the incident light intensity and selectively control the exposure, providing physical fundamentals for single-shot pixel-wise highlight suppression. The modulation speed of the DMD can reach a high level, up to several hundred or thousand hertz. In order to make full use of the DMD’s performance, we introduce the CMPM into our framework, which activates the high-speed light modulation potential of the DMD. Drawing inspiration from predictive models which perform well on decision-making applications and motional scenes, we utilize a deep-learning-based video prediction (VP) model as the CMPM. We leverage its ability to infer the motions in a video via extracting representative spatiotemporal correlations within the frame sequence [24] so as to predict the motion and variation of highlight regions in motional scenes. Thus, by inputting current and previous captured frames, the CMPM can predict a coded mask for the future frame in advance. Then, the mask will be loaded on the DMD to suppress local highlight during the capture of next frame, allowing us to acquire high-speed HDR capture via a single-shot method. To evaluate the proposed system’s performance on revealing lost information in saturated regions, we select moving license plate recognition (LPR) in highlight conditions as an HDR vision task on a motional scene. We established the first dataset consisting of paired HDR-LDR moving license plate videos for model training and framework simulation tests. To further demonstrate our methodology, a DMD camera prototype is fabricated to evaluate an HDR moving object recognition scene. Driven by the VP-based mask prediction, the proposed single-shot framework is adaptive to the real motional scene, performing well on both the simulated dataset and real scene. To the best of our knowledge, this is the first work investigating the integration of per-pixel exposure control and video prediction in the HDR imaging domain. The proposed single-shot framework overcomes the data volume and artifact drawbacks in multi-shot techniques. Our approach is able to faithfully recover the HDR information at relatively short time cost, which addresses the trade-off between DR and time efficiency. Overall, our approach raises a new single-shot solution in the field of motional HDR imaging, especially HDR vision tasks which desire both sufficient information and high time efficiency.

2. PRELIMINARIES ABOUT VIDEO PREDICTION

VP is a self-supervised extrapolation task, aiming to infer the subsequent frames in a video, based on a sequence of previous frames used as a context [24]. For deep-learning-based VP, the spatial-temporal correlations within the consecutive input frames are learned to form a prediction of the future. Drawing inspiration from its application in motional scenes such as autonomous driving, we adopt VP in POE-VP for HDR motional scenes. For the VP algorithm utilized in our work, we put much emphasis on the inference speed and generalization ability, in order to achieve high-speed HDR and make our framework adaptive for various over-exposure scenes. The VP model employed in our proposed framework is mainly based on the architecture of the dynamic multi-scale voxel flow network (DMVFN) [25]. By estimating the dynamic optical flow between adjacent frames, the DMVFN models various scales of motions between frames in order to predict future frames. The utilization of optical flow enhances the quality of prediction frames as well as the generalization ability and stability over diverse kinds of scenes. On the other hand, the DMVFN consists of a differentiable routing module, which is able to select different architecture sizes adaptive to the inputs with different motion scales. The routing module lowers the computational costs and significantly reduces the inference time. Specifically, for predicting a frame with resolution of $368 \times 640$ , the DMVFN costs only 0.016 s on average. To sum up, the DMVFN achieves state-of-the-art performance on inference speed while maintaining the high quality of predicted images under a strong generalization ability. These two advantages match well with the requirement of our work, that is, to generalize the VP model to a motional scene consisting of an over-exposed moving license plate in order to predict a DMD mask in advance and achieve a single-shot HDR at a relatively high speed. More information about the DMVFN is provided in Appendix B.

3. PRINCIPLE

As mentioned in Section 1, the proposed POE-VP in this paper consists of two modules, namely the POEM and CMPM, respectively. The POEM is driven by the CMPM; in particular, the coded grayscale mask loaded on the DMD in the POEM is predictively computed by the VP model in the CMPM. The conceptual framework of POE-VP performing in an HDR motional license plate scene is illustrated in Fig. 1.

Figure 1.The simplified illustration of the conceptual framework of POE-VP, demonstrating with an HDR motional license plate recognition downstream vision task. POEM (shown in the first row in beige color) and CMPM (shown in the second row in purple color) are two core modules that build up POE-VP.

Download full size

View all figures

We employed the DMD to achieve per-pixel optical encoding in order to control the exposure on each spatial position. The imaging hardware used in POE-VP is summarized inside the orange dashed box in Fig. 1. The arrangement of these optical components of a practical prototype will be discussed in Section 7.

The first row of Fig. 1 (in the beige background color) illustrates the circumstance when no mask is loaded on DMD. In this case, the whole imaging pipeline can be seen as a conventional LDR imaging system, which allows the entire original light from the HDR scene to enter the sensor without any modulations. The highlight in the HDR scene forms the over-exposure regions on the imaging sensor, leading to LDR captures (Frame t−1/t) whose license plate cannot be recognized by any license plate detection and recognition (LPDR) approaches because of the information loss caused by saturation.

As for the second row of Fig. 1 (in the purple background color), it depicts the performance of the CMPM. In the purple dashed box, the VP model, DMVFN, takes the previous captured LDR frame sequence (Frame t−1/t, $x_{t - 1} / x_{t}$ ) as input. By estimating the optical flow in Frame t−1/t, the VP model infers the motional trends within the scene, especially the motion of saturated regions, and ultimately yields the prediction of the future frame (Predicted Frame $t + 1$ , ${\tilde{x}}_{t + 1}$ ). Then, on the basis of Predicted Frame $t + 1$ , the exposure control algorithm is utilized to compute a predictive grayscale mask (Predicted Mask Frame $t + 1$ ). Inspired by Ref. [22], we leverage the thought of threshold segmentation as the exposure control algorithm. The grayscale mask in the 8-bit unsigned integer data type, ${\tilde{M}}_{t + 1} (x, y)$ , is calculated as follows: ${\tilde{M}}_{t + 1} (x, y) = {\begin{matrix} 255, & {\tilde{x}}_{t + 1} (x, y) < V_{s} \\ 255 - a {\tilde{x}}_{t + 1} (x, y) - b \frac{V_{s}}{255} {\tilde{x}}_{t + 1} (x, y), & {\tilde{x}}_{t + 1} (x, y) \geq V_{s} \end{matrix},$ (2)where $V_{s}$ is the threshold pixel value, and $a$ and $b$ are mask coefficients of light intensity suppression. The threshold value should be set properly at less than but close to the saturated value (255) in order to suppress the highlight in the license plate region and maintain a smoother pixel value gradient within the same region. We experimentally set $V_{s}$ to 200 for a better evaluation performance of the framework on the target HDR vision task.

As shown in the third row of Fig. 1, as the objects in the scene keep moving, the predicted mask will be loaded on the DMD at a relatively high refresh rate concomitantly. Then, in the next frame’s capture process, the pattern of the mask (Predicted Mask Frame $t + 1$ , ${\tilde{M}}_{t + 1}$ ) matches with the highlight region in the HDR scene (Frame $t + 1$ , $x_{t + 1}^{'}$ ). In other words, each one of our spatiotemporal varying masks is customized for its corresponding scene at the same time stamp. This mask $t + 1$ is computed in a previous time stamp t instead of being iteratively generated after capturing Frame $t + 1$ , which reduces the motional artifact. As a result, we are able to realize a high-speed single-shot highlight suppression, and finally acquire an HDR capture (HDR image Frame $t + 1$ ). In HDR capture, the over-exposure regions of the moving car scene are successfully suppressed. By revealing the license plate information, which is saturated due to the highlight, the content on the license plate is clear enough for the LPDR net to recognize.

According to the aforementioned mechanism, for the first 2 frames at the very beginning, there are no predicted masks for them because of the absence of previous frames as inputs. Therefore, when operating, our framework has an initialization process. The first 2 frames captured by the POEM are used for initializing the VP model instead of HDR imaging. After the initialization, the whole system will operate under the aforementioned mechanism, capturing HDR images through POE-VP frames by frames at a certain rate.

To sum up, the key features of POE-VP are as follows. (1) We employ the DMD as the optical encoder in order to achieve pixel-wise exposure control, which is the hardware basis for expanding DR. The utilization of the DMD places a part of the computational procedures into the optical domain, which is originally done in the algorithm domain by electronics. Compared with approaches in the software domain, this pre-sensor optical processing lowers the backend computational costs and enhances the processing speed due to the high refresh rate of the DMD. (2) We use VP for predictively computing the coded mask to drive DMD, so that the patterns of grayscale masks are able to follow the motional change in the real scene. As a result, there are marked distinctions between our masks and other SVE methods’ masks. Our spatiotemporal varying mask sequence contains the spatiotemporal correlations depicting the motion of over-exposure regions. In contrast, the SVE methods generally use periodic static masks missing the temporal dimension or iteratively generated masks, which have an attenuation in temporal dimension. The principle of our per-pixel exposure control hits the goal of single-shot HDR, alleviating the motional artifacts and reducing data size.

4. IMPLEMENTATION OF LEARNING IMAGE SEQUENCE PREDICTION

A. Dataset Preparation

In the field of HDR, there is a common challenge on acquiring paired HDR-LDR datasets. In general, a natural HDR scene turns out to be the LDR image via an LDR camera. As a result, it is difficult to get an HDR ground truth from the real scene, which is corresponding to LDR measurements captured by the conventional LDR camera. Among the existing benchmark datasets for HDR, there is still a gap in single exposure with motional scenes [26].

In our work, we prepared our dataset mainly based on the LSV-LP dataset, a large-scale and multi-source video-based LPDR dataset [27]. The LSV-LP consists of three categories: move versus static, static versus move, and move versus move. The move versus static means that the data collection device is moving while the vehicle stays still. The static versus move means that the device is static while the vehicle is moving. The move versus move shows that the device and vehicle are both moving. We extract a subset from the category move versus static, including 68 videos, with 18,322 frames which are divided into a training dataset (62 videos, 18,185 frames) and a test dataset (6 videos, 137 frames). We also collect 26 videos with 2298 frames of a motional license plate on the road and organize them as another part of the test dataset. A mobile phone camera is used as the video data collection device. While the mobile phone keeps moving, the videos are recorded in 1080p, 60 fps (frames per second) from a road driving vehicle and a parked vehicle. All frames are resized to $720 \times 1280$ .

To set up a paired HDR-LPR license plate dataset, we need to acquire HDR information at first. However, the LSV-LP dataset is collected via a conventional LDR camera; thus, HDR information has already been cut off during the collection process in this dataset. Therefore, it is necessary to preprocess the original LSV-LP dataset. The preprocess approach is illustrated in Fig. 2. Based on these videos, we apply a process strategy called artificial highlight to generate HDR information. Specifically, we operate a Hadamard product at each frame from the original LSV-LP dataset with a grayscale map. The map, namely the artificial highlight map, follows a 2D Gaussian distribution whose extreme point aligns with the center point of license plate, written as $H (x, y) = O (x, y) ⊙ [E \cdot G (x, y) + 1], G (x, y) = e^{- \frac{\sqrt{{(x - c_{x})}^{2} + {(y - c_{y})}^{2}}}{2 R}},$ (3)where $O (x, y)$ represents the original frames in the LSV-LP dataset and $H (x, y)$ represents the generated HDR maps. $G (x, y)$ represents the grayscale Gaussian map, in which $(c_{x}, c_{y})$ is the coordinate of the license plate center point. The over-exposure rate E and radius R are parameters that control the intensity and size of the Gaussian map. The operation simulates the scene in which a Gaussian highlight (e.g., high-beam light) illuminates the license plate region, which makes some pixel values of the original LDR frames increase beyond 255, forming HDR information artificially. Using a heat map, Fig. 2(c) figuratively explains the artificial HDR information we create around the license plate region. Note that the real HDR frames we create here cannot be displayed without tone mapping and other processes because some of their pixel values exceed the maximum pixel value of the 8-bit display. In other words, Fig. 2(c) is not the real HDR frame displayed on the screen; it is only a visualization about how we exceed the boundary pixel value of 255 and create artificial HDR information.

Figure 2.(a) The normalized artificial highlight map $G (x, y)$ illustrated in the form of a heat map. The map follows a 2D Gaussian distribution. (b) The frame $O (x, y)$ extracted from an LDR original video in test dataset. The yellow crossline marks the center position of the license plate bounding box, which is coincident with the extreme point of Gaussian distribution in (a). (c) The figurative visualization of HDR information maps $H (x, y)$ . The pixel values over 255 in the deep blue region indicate the created artificial HDR information. (d) The simulated LDR frame.

Download full size

View all figures

Then, we imitate the main step of the LDR image formation pipeline, the dynamic range clipping, to simulate the scene of capturing LDR measurements via a conventional LDR camera.

In each HDR frame we generate, those pixels whose values are over 255 are cut off to 255, while other pixels remain unchanged. The over-255 pixels are saturated and form over-exposure regions, which produces a simulated LDR frame for us, as shown in Fig. 2(d). Via the aforementioned steps, we collected the paired HDR-LDR license plate dataset, namely HDR-LDR-LP.

B. VP Model Training

The preprocessing parameters E and R for the training dataset are set to 5 and 50, respectively. The DMVFN model used in our framework is trained on $368 \times 640$ image patches, which are resized from $720 \times 1280$ training data after the preprocessing mentioned above. The model is trained on 4 NVIDIA GeForce RTX3090 GPUs for 300 epochs, with a batch size of 16. Following the configuration in Ref. [25], the AdamW optimizer with a learning rate which reduces from $10^{- 4}$ to $10^{- 5}$ is employed in model training.

5. EVALUATION METRICS

Our framework in this work is emphasized on HDR imaging towards specific vision tasks, instead of HDR photography for human vision. As a result, the evaluation metrics in this paper put more weights on the amount of information we can reveal to enhance the downstream vision tasks in highlight conditions than image quality and perceptual quality. The main quantitative metrics we use are introduced and defined as follows.

A. Recognition Accuracy

Recognition accuracy (Rec-Acc) is calculated per test video, defined as Eq. (4). $N_{T}$ represents the total count of frames in a video. For the same video, $N_{p}$ represents the count of frames whose license plate is correctly recognized (every character in the recognition result matches the corresponding character in license plate label): $Rec-Acc = \frac{N_{p}}{N_{T}} .$ (4)

B. Information Entropy

To quantitatively measure the amount of information around the license plate region in both the HDR frame sequence and LDR frame sequence, we introduce the information entropy S [28] defined as follows: $S = - \sum_{i = 0}^{255} p_{i} \log_{2} p_{i},$ (5)where $p_{i}$ is the probability of pixels with a value of i ( $0 \leq i \leq 255$ ) in the image. The information entropy is calculated within the bounding-box of the license plate.

C. Local Information Entropy

A certain pixel’s local information entropy is calculated in a $3 \times 3$ neighborhood kernel. The average information entropy of 9 pixels in the kernel represents the local information entropy of the centric pixel. We use the local information entropy value for each pixel to draw a local information entropy heat map. As illustrated, we visualize the information difference between HDR frames and LDR frames for further analysis.

6. SIMULATION ASSESSMENT

A. Simulation Experiment

The test dataset prepared for the simulation experiment consists of two parts. The first part is the test set extracted from the original LSV-LP dataset, which contains 6 videos, with 137 frames. We employ the preprocess strategy mentioned above for the test set, setting the Gaussian map parameters E to 9 and R to 50. Furthermore, in order to evaluate the generalization ability of our framework, we capture additional videos of the license plate from a different province in China via an iPhone 13 Pro Max camera and fill up the second part of test dataset, which contains 26 videos. The second part of test dataset is then divided into three subsets, with a different E (9, 29, 49) and the same R (50) in preprocessing. The three configurations of over-exposure rate E simulate the various highlight conditions, which allow us to evaluate the generalization performance and stability of our POE-VP.

Figure 3 illustrates the simulation pipeline. The original frame sequence extracted from each video in the test dataset is sent into preprocess unit, yielding paired HDR information maps and simulated LDR frames. The simulated LDR frames are resized to $368 \times 640$ , which fit the input size of the VP model. The trained VP model will use the previous two simulated LDR frames to predict the next frame. For example, as shown in Fig. 3, simulated LDR frames $T = 0 /T = 1 (T = 1 /T = 2)$ are used to produce predicted LDR frames $T = 2 (T = 3)$ . The rest of the frames of this sequence follow the same method. One predicted LDR frame will form its corresponding mask prediction via the mask calculation unit. Then, the Hadamard product is operated between the mask predictions and HDR information maps with the same time stamp T, which simulates the optical encoding and light field modulation process of the DMD. Note that the first two original frames ( $T = 0$ and $T = 1$ in Fig. 3) are used for initializing the framework. After the over-255-cutoff that simulates the sensor’s dynamic range clipping effect, we can obtain the simulated sensor captures with HDR information beginning with $T = 2$ .

Figure 3.The illustration of the simulation pipeline. The exemplary frames of “JingA 3MU55” video from the test dataset is used as an example. For this example, preprocessing parameters (E, R) are set to (29, 50). The red arrow indicates the moving direction of license plate. “T” in the top left corner of each frame stands for the frame index of the video. In the simulation experiment, the mask coefficients (a, b) are set to (0.9, 0).

Download full size

View all figures

We evaluate our framework’s performance on HDR via a downstream license plate recognition (LPR) task. We use the architecture in Ref. [29] as ALPR Net, which employs yolov5 [30] and CRNN [31] as the license plate detection model and license plate recognition model, respectively. For comparative experiment, the simulated LDR frames are evaluated via the same ALPR Net.

Note that we directly send the simulated sensor captures into ALPR Net, as our target task is recognition task. As for HDR image reconstruction, the simulated sensor captures need to be first divided by corresponding mask predictions to obtain the full HDR reconstruction matrix, in which pixel values over 255 appear. Then, by utilizing tone mapping [32], the HDR results can be displayed properly on the LDR screen for human observation.

B. Simulation Results

Figure 4.(a) Simulation results of “JingA 3MU55” license video, which is from the R50E29 class of test datasets. Exemplary frames are shown with the frame index annotations (T). The “Simulated sensor captures” row shows the simulations of captures before “decoding” the inverse process and tone mapping, while the “HDR results” row shows the HDR frames for better human-eye perception by implementing tone mapping operation. The “Simulated sensor captures” in the red font color and “Simulated LDR results” in the blue font color are compared quantitatively and qualitatively in this experiment. The red arrow in the first frame of the original scene indicates the moving direction of the license plate. The first two frames of the video are used for initialization. Frames in the orange box are chosen as exemplary frames to calculate corresponding local information entropy maps. (b) The local information entropy heat maps of the exemplary frames. The license plate regions are zoomed in for better visualization and comparison. See Visualization 1 for video results.

Download full size

View all figures

We also compare the proposed framework with existing techniques, including classical multi-exposure fusion [8], modified multi-exposure fusion with artifact reduction [33], and deep optics single-shot HDR [17]. The qualitative results are shown in Fig. 5, and all the methods are evaluated using the same dataset mentioned in Section 4.A. Note that in the implementation of multi-exposure fusion methods [8,33], we use three consecutive frames as a group for generating a fused HDR frame result. We simulated the exposure difference on three frames via multiplying them by three different global exposure coefficients. The coefficients we set in comparison simulations are 1/8, 1/5, and 1, corresponding to low exposure, medium exposure, and high exposure, respectively. The 3-frame groups were selected one by one, from the first three frames to the last three frames, covering all consecutive 3-frame groups in the video data. It can be observed that our POE-VP outperforms other methods in highlight suppression and HDR information preserving (around target regions). Specifically, as a single-shot HDR method, our POE-VP is able to eliminate the artifacts, which are easily found in the classical exposure fusion method due to the multiple captures of moving scenes. Though the multi-exposure fusion with artifact reduction method remarkably reduces the artifact, it still faces a performance limit in some extreme illumination scenes. For example, in the bottom row of Fig. 5, modified multi-exposure fusion suppresses less highlighting around the character “M” in the license plate, compared to our POE-VP. For the test videos in the case of minimal over-exposure (preprocess parameter $E = 9$ ), the deep optics method is able to recover part of the HDR information around the license plate region in some of the test videos (see the top row of Fig. 5 as an example). At the same time, our framework achieves a more stable and more effective highlight suppression in the same test video. The bottom row of Fig. 5 shows an example of $E = 29$ . It can be seen that the deep optics method fails to suppress a large area of highlighting and hardly recovers any useful information for license plate recognition. The reason for the failure is mentioned in Ref. [17]: the mechanism that uses nearby pixels to encode the brightest pixel value has an inherent limitation when working in the case of extremely bright and large-area highlighting. On the contrary, our proposed framework is able to control the exposure pixel-wise, which allows the highlight suppression occurring within a single pixel without the occupation of nearby pixel values.

Figure 5.The exemplary frames of comparison simulations with the other 3 existent methods. The results of our POE-VP are highlighted with the red color. Parts of license plate are zoomed in for detailed comparison on motional artifacts between the multi-shot method and proposed single-shot POE-VP. Note that all the HDR results of the 4 methods are tone mapped under the same approach for displaying.

Download full size

View all figures

It can also be evaluated quantitatively in Table 2, which shows the comparison on the average recognition accuracy of the LPR task. The result shows a prominent superiority of our work towards other methods when working in higher saturation cases. Besides, time efficiency is an important feature of our work as our target vision task is a motional HDR task. We compare the lowest running time of the algorithms for a single frame. The comparison results in Table 2 show that POE-VP has a higher time efficiency, which enables faster capture speed.Table 2.

Quantitative Results of Comparison Simulation^a

Capture Mode	Method	Average Recognition Accuracy	Average Running Time/ms
Multi-shots (at least 3 shots)	Exposure fusion (classical)	65.07%	135.261
Multi-shots (at least 3 shots)	Exposure fusion (artifact reduced)	88.56%	424.831
Single-shot (with physical setup)	Deep optics	21.64%	1161.458
Single-shot (with physical setup)	POE-VP (ours)	94.86%	16.928

The numbers in bold represent the best performance.

7. REAL SCENE EXPERIMENT

A. DMD Camera Prototype

Figure 6.(a) The schematic of optical layout inside the DMD prototype. (b) The DMD prototype. (c) The photograph of the light box in the experimental setup. The red arrow shows the moving direction of the license plate. (d) The overall view of the experimental setup.

Download full size

View all figures

B. Experimental Setup

The real scene experimental setup is shown in Fig. 6(d). The moving license plate scene in the highlight condition we create is shown in Fig. 6(c), corresponding to the red dashed box part in Fig. 6(d). We use two different real license plates as the objects in the experiment. The license plate is fixed on a sliding optical bench, which makes the license plate move along the oblique optical rail. We use an LED luminescent array in the light box (VC-118-X 3nh) as highlight source A, which is tuned to the max illuminance. Besides, we employ another LED flash light (MSN-400pro) as highlight source B, which is tuned to its max illuminance as well. The two light sources emit the light towards the license plate and form a highlight illuminate area, which leads to the saturation of a conventional image sensor without the DMD’s modulation. The DMD is set in front of the light box, at a distance of 5 m. In this experiment, we only use a part of prototype’s field of view, that is, a region of interest (ROI) at the size of $720 \times 1280$ .

Figure 7 illustrates the mechanism and imaging procedure of our framework implemented via the DMD prototype, including the trigger mode and timing scheme. By generating control signals via an arbitrary waveform generator (AWG, UNI-T UTG2000B), we synchronize the DMD and image sensor [34]. As shown in Fig. 7(e), the trigger signal of the image sensor is a relatively long rectangular wave with pulse width $t_{I}$ . The trigger signal of the DMD is a periodical rectangular wave as shown in Fig. 7(d). Within time $t_{D}$ , the DMD refreshes a series of binary patterns [Fig. 7(c)] at each rising edge. Via pulse-width modulation (PWM) [36], the binary patterns form the grayscale coded masks [Fig. 7(b)], which are predicted using previous captures. The DMD and image sensor are connected with a PC, which reads the previous captures from the image sensor, calculates the predicted mask for the next shot, and drives the DMD with the mask. In our experiment, $t_{I}$ and $t_{D}$ are set to 33.33 ms and 20 ms, respectively. Within each single-shot $t_{I}$ , we obtain the HDR captures of the highlight license plate as shown in Fig. 7(f). To implement a fair comparison, the LDR captures of the same motional scene are acquired via the same hardware setup without masks on the DMD, i.e., fully white patterns are loaded within the entire LDR acquisition period. The prototype with the fully white DMD pattern is equivalent to an LDR camera with the same LDR sensor, which is proved in detail in Appendix C. The HDR captures as well as the LDR captures [Fig. 7(a)] are evaluated via the same ALPR Net for comparative experiment. Note that the capture results are grayscale images due to the grayscale image sensor we use in the prototype.

Figure 7.(a) LDR capture sequence. The red arrow shows the moving direction of the license plate. (b) The grayscale masks predicted by VP. In the real scene experiment, we set mask coefficients ( $V_{s}$ , a, b) to (200, 0.7, 0.2). (c), (d) Schematic of the PWM and trigger mode of the DMD. The DMD refreshes the binary patterns at each rising edge and generates the grayscale masks within each $t_{D}$ ; for the rest time, fully white patterns are loaded (the DMD works as a reflective mirror when all micromirrors are “on”). (e) The exposure periods of the image sensor. $t_{I}$ is the effective exposure time of every single shot. (f) The HDR captures acquired by the prototype with the DMD on. The first two frames are under initialization.

Download full size

View all figures

C. Results and Evaluations

Qualitative evaluation. The exemplary HDR results captured using our prototype are shown in Fig. 8, along with LDR captures and intermediate results. It can be observed that highlights on LDR license plates are suppressed in HDR captures. Thus, the revealed license plate characters are able to be accurately recognized. The results show that our framework successfully preserves HDR information of a moving target. Furthermore, one can observe in Fig. 8 that, for each test video, the grayscale distribution of the predicted mask changes dynamically along with the movement of the highlighted region in the LDR frame sequence. In other words, it is validated that the grayscale mask sequence has similar spatiotemporal correlations with the LDR frame sequence of the motional scene. This time varying optical encoding strategy driven by VP is a unique feature of our framework, which ensures that the grayscale pattern coincides with the highlighted region in the scene at every time stamp. As mentioned above, the mask loaded at a certain time stamp is calculated at previous time stamp. As a result, the single-shot HDR imaging is realized. It is clear that the combination of pixel-wise optical encoding with VP enables single-shot HDR imaging for a motional scene.

Figure 8.(a) Hardware experiment results of “JingN F9H28” license plate motional scene. (b) Hardware experiment results of “JingL J7629” license plate motional scene. Exemplary frames of 2 result videos are shown with the frame index annotations (T). The red arrow indicates the moving direction of the license plate. The first two frames of the video are used for initialization. Frames in orange box are chosen as exemplary frames to calculate corresponding local information entropy maps. See Visualization 1 for video results.

Download full size

View all figures

Quantitative evaluation. In order to conduct a fair comparison, LDR captures and HDR captures are sent into the same ALPR Net for quantitative validation. The quantitative results of three evaluation metrics are summarized in Table 4.Table 4.

Quantitative Results of Two Real HDR License Plate Recognition Scenes^a

License Plate	Captured Frame Count	HDR Rec-Acc	LDR Entropy	HDR Entropy	Time/ms
JingN F9H28	28	92.86%	1.2018	6.9323	16.807
JingL J7629	39	97.44%	1.2038	7.2774	14.317
Average		95.15%	1.2028	7.1049	15.562

The numbers in bold represent the best performance.

It can be seen that the POE-VP overcomes the failure that LDR captures have in the recognition task. The Rec-Acc of HDR captured videos has an improvement of 95.15% on average, compared to the Rec-Acc of LDR videos. The information entropy calculated on HDR captures is 5.9021 better on average than on LDR captures, which shows a 490.70% increase. POE-VP successfully suppresses the highlight and preserves much richer information on the license plate from the original HDR scene. The local information entropy maps of exemplary frames are shown in Fig. 9 for further demonstration, which visualizes the considerable information entropy increase obtained by the involvement of POE-VP. In the heat maps, the deep blue color in LDR license plate regions indicates that there are information losses. In contrast, the warm colored HDR license plate regions show the ability that POE-VP can preserve a relatively large amount of information in the same highlight condition. Using a 36-patch dynamic range test chart (Sine image), we estimate the DR of the prototype. Without optical encoding (fully white patterns on the DMD), the prototype scores only 70.35 dB in DR, while the prototype with optical encoding expands DR to 120.74 dB, which achieves a gain of 71.63%.

Figure 9.(a) The local information entropy heat maps of the exemplary frames in “JingN F9H28” license videos. (b) The local information entropy heat maps of the exemplary frames in “JingL J7629” license videos. The HDR and LDR license plate results are set by corresponding entropy maps for comparison.

Download full size

View all figures

The above results indicate that POE-VP enhances the performance on motional object recognition tasks in highlight conditions to a great extent. Furthermore, we also calculate the time consumption and analyze the high-speed property of POE-VP. In the real scene experiment, we utilize a PC to receive previous captures from the image sensor, operating VP, and output masks to control the DMD, as illustrated in Fig. 6(a). We run the VP and mask calculation algorithm on an NVIDIA GeForce RTX3090 GPU on a PC and summarize the running time in Table 4. The average running time for one frame in the real scene experiment is 15.562 ms, which is nearly the theoretical bound of time consumption of our POE-VP working in the real scene. However, due to the limitation on the data transfer rate of the USB and the DMD driving mode of the PC, the actual capturing speed cannot reach the ideal one, despite that the DMD can refresh its patterns at a rate over thousands of Hz. Thus, the example video results we demonstrate in Visualization 1 take more than 100 ms to complete a single shot of HDR, resulting in a relatively low frame rate. In engineering implementation, edge computing technology, such as the field programmable gate array (FPGA), is a possible solution to accelerate the whole pipeline. Employing FPGA linked with an image sensor and DMD by a data bus, we can not only reduce the time cost of data transfer but also speed up the VP and mask calculation algorithm. According to our theoretical calculation, the proposed POE-VP can reach over 100 fps when implementing via FPGA (Zynq XC7Z035) [37] and the data bus (200 MHz, 32-bit), which is promising to satisfy the needs for real-time HDR motional target recognition. Please refer to Appendix D for the detailed calculation of the theoretical frame rate.

Last but not least, acquiring a single HDR output via our framework occupies memory with only two previous frames. Compared with multi-frame fusion methods which at least take 3 frames of different exposures, POE-VP achieves lower captured data size, reducing the data volume at a ratio of 66.67% at a minimum.

8. DISCUSSION AND CONCLUSION

In summary, POE-VP offers a new paradigm for single-shot HDR vision tasks of motional scenes. By implementing via a DMD, the POEM realizes pixel-wise spatially varying exposure control for suppressing highlight of each captured frame. As the motional artifact and time latency are two critical challenges to overcome in motional scene HDR, we propose the CMPM utilizing VP for ahead-of-time prediction. The CMPM introduces a temporal consistency between the masks for consecutive captures and the motional scene. Driven by VP, DMDs are capable of producing a mask sequence whose spatiotemporal correlation is highly similar to the capture sequence of the highlight motional scene. This spatiotemporally varying optical encoding mask is the core feature of POE-VP, distinguished from other SVE approaches in the field of HDR. Thereby, a single-shot HDR is achieved, giving the credit to the ahead-of-time mask prediction via VP and the high-speed mask loading via the high refresh rate DMD. The dynamic range of the prototype could reach 120 dB under test. We choose the highlight motional license plate recognition as the downstream HDR vision task to evaluate the performance of POE-VP. According to subjective and objective analysis of results in simulation and the real scene experiment, POE-VP shows obvious advantages in expanding DR and preserving HDR information. Compared with conventional LDR captures, POE-VP enhances the recognition performance, which reflected the recognition accuracy and information entropy. Running on an NVIDIA GeForce RTX3090 GPU, the model and algorithm of POE-VP take 15.562 ms in total for a single HDR capture. In the practical engineering deployment, the frame rate can reach over 100 fps by using a large bandwidth and a high transmission rate edge computing accelerator assisted with a data bus, which achieves real-time HDR for motional scenes to some extent. Thus, POE-VP alleviates the artifacts in motional HDR imaging, which we have discussed in comparison simulation in Section 6. Furthermore, compared to the mainstream multi-frame fusion methods, the data volume is also reduced by POE-VP, leading to a lower data size. It should be mentioned that the performance improvements on dynamic range expansion and artifact suppression of POE-VP mainly benefit from its physical setup, i.e., the use of the DMD as a pixel-wise optical modulator and its controller, which of course has a limitation on the device size and additional cost compared to multi-exposure methods with a conventional camera in some practical cases, for example, HDR imaging for static scenes (almost no artifact) with more requirements on device portability and cost. In summary, limited by current technology on the DMD and optical design, POE-VP trades some miniaturization and hardware cost for the prominent improvement on time efficiency, data size, artifact reduction, and dynamic range expansion. Besides, POE-VP is always able to provide additional dynamic range for the used sensor in imaging systems via its physical setup regardless of the sensor’s bit depth. Overall, it can play the role of an enhancement component for various systems in practical applications, which is able to bring additional dynamic range and fast speed, as well as additional size and cost.

Our POE-VP can be improved in several aspects. First, with the DR, we expand focusing on the high illuminance conditions, while at the opposite side in the axis of DR, there is low illuminance scenario to be extended. Expanding the ability of lowlight sensing is a future work. Second, the POE-VP proposed in this paper is optimized and finetuned for a specific vision task. A generic HDR scheme and HDR imaging at the photography level for human visualization based on the framework of POE-VP are attractive works for further exploration. For instance, the parameters in Eq. (2) that control the mask generation need to be optimized more finely for more complicated HDR tasks or even photography-level HDR imaging. Making them differentiable and training it end-to-end with a video prediction network and other imaging process networks for specific HDR tasks or general HDR photography are further research subjects, which we will dive into. Third, POE-VP has a limitation based on the characteristic of the VP algorithm we used. That is, when the moving object has a precipitous large margin motion, e.g., a sudden 180° turn, the video prediction may fail to catch up with such sudden variation of the scene within two frames, resulting in the imperfect highlight suppression among those sudden movement frames. To achieve improvement, we will dive into the joint optimization of imaging optics, optical encoding strategies, and image processing algorithms.

Category: Imaging Systems, Microscopy, and Displays

Received: Jun. 24, 2024

Accepted: Aug. 11, 2024

Published Online: Oct. 31, 2024

The Author Email: Hongwei Chen (chenhw@tsinghua.edu.cn)

DOI:10.1364/PRJ.533288

CSTR:32188.14.PRJ.533288

Training Dataset	Test Data	PSNR/dB	MS-SSIM	LPIPS
Regular image sequences (RGB)	RGB	25.37	0.8469	0.1397
Synthesized image sequences (RGB)	RGB	25.71	0.8603	0.1379
Synthesized image sequences (RGB)	Grayscale	25.78	0.8620	0.1338

Notation	Value	Description
$H$	1280	Image height
$W$	720	Image width
$BD$	8-bit	Image bit depth
$f$	200 MHz	Clock frequency of AHB
$BW$	32-bit	Bit width of AHB
$T_{AHB}$	1.152 ms	Time consumption on AHB
$CC$	$\sim 10^{10} FLOPS$	Computational complexity
$CP$	${2.525 \times 10}^{12} FLOPS$	Computing power
$T_{FPGA}$	$\sim 3.960 ms$	Running time on FPGA
$f_{DMD}$	300 Hz	Modulation speed of DMD
$T_{DMD}$	3.333 ms	Time consumption on DMD
$T_{total}$	9.597 ms	Total time consumption

微信扫一扫：分享