Photonics Research, Volume 12, Issue 11, 2524 (2024)

Predictive pixel-wise optical encoding: towards single-shot high dynamic range moving object recognition

Yutong He1,2, Yu Liang1,2, Honghao Huang1,2, Chengyang Hu1,2, Sigang Yang1,2, and Hongwei Chen1,2,*
Author Affiliations
  • 1Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
  • 2Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China

    Conventional low dynamic range (LDR) imaging devices fail to preserve much information for further vision tasks because of the saturation effect. Thus, high dynamic range (HDR) imaging is an important imaging technology under extreme illuminance conditions, enabling a wide range of applications, including photography, autonomous driving, and robotics. Mainstream approaches rely on multi-shot methods because a conventional camera can only control the exposure globally. Although they perform well on static HDR imaging, they struggle with real-time HDR imaging of motional scenes because of the artifacts and time latency caused by multiple shots and post-processing. To this end, we propose a framework, termed POE-VP, which achieves single-shot HDR imaging via a pixel-wise optical encoder driven by video prediction. We use highlighted motional license plate recognition as a downstream vision task to demonstrate the performance of POE-VP. From the results of simulation and real scene experiments, we validate that POE-VP outperforms conventional LDR cameras by more than 5 times in recognition accuracy and by more than 200% in information entropy. The dynamic range could reach 120 dB, and the captured data size is verified to be lower than that of mainstream multi-shot methods by 67%. The running time of POE-VP is also validated to satisfy the needs of high-speed HDR imaging.

    1. INTRODUCTION

    High dynamic range (HDR) imaging is an essential imaging technology that aims to preserve as much effective information as possible from the natural scene, satisfying the needs of applications in extreme conditions in which low dynamic range (LDR) imaging fails. For example, an autonomous driving system using an LDR visual pipeline could be saturated and over-exposed by highlight, resulting in the loss of information and an impact on decision-making, while HDR imaging alleviates these problems and guarantees the safety of autonomous driving.

    Dynamic range (DR) is expressed as the ratio between the highest and lowest luminance values [1], and is also defined as the ratio of the highest to the lowest pixel value in logarithmic form within the digital image domain [2]:
$$\mathrm{DR} = 20\,\lg\!\left(\frac{I_{\max}}{I_{\min}}\right). \tag{1}$$

    From the moonlight illumination at night to bright sunshine conditions, the DR of the natural world reaches approximately 280 dB. The widest range of luminance that human eyes can perceive is 120 dB [3], while the DR of a conventional 8-bit camera cuts down to only 48 dB, according to Eq. (1). Generally, the process of capturing an HDR scene to acquire one LDR image via a conventional LDR camera can be modeled by three major steps: dynamic range clipping, non-linear mapping, and quantization [4]. Dynamic range clipping leads to a hard cutoff at some top irradiance determined by the full well capacity of the sensor, while quantization operation causes missing information for pixels in the under-exposed regions determined by the inherent noise of the sensor. It can be imagined how tremendous the information loss is during such a process in an LDR imaging pipeline. Yet, abundant applications are posing more requirements on extending dynamic range, including robotics, driver assistance systems, self-driving, and drones. Thus, in order to capture and accurately represent the rich information in natural scene content, plenty of HDR imaging techniques have emerged and drawn much attention from researchers over the last decades.
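    As a quick illustration of Eq. (1), the snippet below computes the dynamic range for the examples quoted above; the helper name is ours, not from the paper.

```python
import math

def dynamic_range_db(i_max: float, i_min: float) -> float:
    """Dynamic range in dB according to Eq. (1): DR = 20 lg(I_max / I_min)."""
    return 20.0 * math.log10(i_max / i_min)

# An 8-bit sensor spans pixel values 1-255, giving roughly the 48 dB quoted above.
print(round(dynamic_range_db(255, 1), 1))   # 48.1
# A 120 dB range (human vision) corresponds to a luminance ratio of one million.
print(round(dynamic_range_db(1e6, 1), 1))   # 120.0
```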

    The mainstream approach, which is widely deployed on current smartphone devices, is to fuse a series of LDR images captured by a conventional sensor under different exposures [5–8]. By sequentially capturing LDR measurements and merging them via exposure bracketing [9], these conventional HDR methods successfully expand the dynamic range and perform well with static scenes. However, these multi-shot fusion methods suffer from motion artifacts and difficulty with image alignment when dealing with motional scenes or a moving target. Although a batch of modified fusion solutions aims to eliminate these problems by leveraging optical flow [10] or using deep learning [11,12], they still face the same inherent problems of multi-shot fusion approaches: (1) high memory consumption due to the large data volume (at least 3 LDR images for a single HDR output); (2) a high computational workload caused by complicated fusing algorithms. These drawbacks limit the usage of multi-shot fusion methods in application scenes whose performance requirements go beyond mobile phone HDR.

    To eliminate the aforementioned limitations of fusion methods, researchers dived into achieving HDR via the single-shot scheme. The prevailing single-shot HDR schemes are broadly categorized into two domains: software solutions concentrating on algorithmic enhancements and hardware strategies achieved through the revision of the sensor or optical path.

    In the software domain, inverse tone mapping or inverse LDR camera pipelines, which develop deep networks to explicitly learn to reverse the image formation process [4], have provided a way to reconstruct an HDR image from a single LDR image. Yet the end-to-end deep learning method still poses challenges in computational workload and generalization. Besides, the reconstructed HDR content is hallucinated from LDR images, so it cannot be a credible alternative for downstream vision tasks that require accurate content from the real world.

    In the hardware domain, researchers have attempted to capture multiple exposures simultaneously, within a single exposure period. To this end, some works aimed to redesign the pixel structure of the imaging sensor [13,14] and fuse multi-exposure measurements of optically aligned multiple sensors [15,16]. These methods achieve single-shot HDR reconstruction at the cost of fabrication complexity, device cost, alignment difficulty, footprint, and practical applicability.

    As mentioned above, reconstructing HDR images from a single LDR image captured by a conventional sensor is an ill-posed inverse problem. A line of studies has attempted to overcome this ill-posedness in the hardware domain by leveraging strategies from the field of computational imaging. They placed various optical components as optical encoders in front of the sensor in order to modulate the light field flexibly. The optical encoder encodes the saturated details, providing physical priors for the subsequent algorithmic decoder to reconstruct an HDR image from a single LDR capture. The utilization of diffractive optical elements (DOEs) with point spread function (PSF) optimization strategies [17,18] is one branch of these approaches, while spatially varying pixel exposures (SVEs) [2,19] were proposed as another. Loading custom patterns on spatial light modulators (SLMs) is a representative SVE method, in which liquid crystal on silicon (LCoS) [20,21] or a digital micromirror device (DMD) [22,23] is employed to modulate the light field and encode the HDR information in the spatial or frequency domain. These existing optical-encoding-based single-shot HDR methods suffer drawbacks in further increasing DR, in recovering HDR scenes with larger saturated regions, or in the time latency of a single HDR capture.

    In this paper, we propose a novel framework for a specific HDR vision task, namely a pixel-wise optical encoder driven by video prediction (POE-VP), which mainly consists of two modules: (1) the pixel-wise optical encoder module (POEM), as the hardware part; (2) the coded mask prediction module (CMPM), as the software counterpart. In the POEM, we employ a DMD as an optical encoder to modulate the incident light intensity and selectively control the exposure, providing the physical fundamentals for single-shot pixel-wise highlight suppression. The modulation speed of the DMD can reach a high level of up to several hundred or even several thousand hertz. In order to make full use of the DMD’s performance, we introduce the CMPM into our framework, which activates the high-speed light modulation potential of the DMD. Drawing inspiration from predictive models, which perform well in decision-making applications and motional scenes, we utilize a deep-learning-based video prediction (VP) model as the CMPM. We leverage its ability to infer the motions in a video by extracting representative spatiotemporal correlations within the frame sequence [24] so as to predict the motion and variation of highlight regions in motional scenes. Thus, by inputting the current and previous captured frames, the CMPM can predict a coded mask for the future frame in advance. Then, the mask is loaded on the DMD to suppress local highlight during the capture of the next frame, allowing us to acquire high-speed HDR captures via a single-shot method. To evaluate the proposed system’s performance on revealing lost information in saturated regions, we select moving license plate recognition (LPR) in highlight conditions as an HDR vision task on a motional scene. We established the first dataset consisting of paired HDR-LDR moving license plate videos for model training and framework simulation tests. To further demonstrate our methodology, a DMD camera prototype is fabricated to evaluate an HDR moving object recognition scene. Driven by the VP-based mask prediction, the proposed single-shot framework is adaptive to the real motional scene, performing well on both the simulated dataset and the real scene. To the best of our knowledge, this is the first work investigating the integration of per-pixel exposure control and video prediction in the HDR imaging domain. The proposed single-shot framework overcomes the data volume and artifact drawbacks of multi-shot techniques. Our approach is able to faithfully recover the HDR information at a relatively short time cost, which addresses the trade-off between DR and time efficiency. Overall, our approach offers a new single-shot solution in the field of motional HDR imaging, especially for HDR vision tasks that demand both sufficient information and high time efficiency.

    2. PRELIMINARIES ABOUT VIDEO PREDICTION

    VP is a self-supervised extrapolation task, aiming to infer the subsequent frames in a video, based on a sequence of previous frames used as a context [24]. For deep-learning-based VP, the spatial-temporal correlations within the consecutive input frames are learned to form a prediction of the future. Drawing inspiration from its application in motional scenes such as autonomous driving, we adopt VP in POE-VP for HDR motional scenes. For the VP algorithm utilized in our work, we put much emphasis on the inference speed and generalization ability, in order to achieve high-speed HDR and make our framework adaptive for various over-exposure scenes. The VP model employed in our proposed framework is mainly based on the architecture of the dynamic multi-scale voxel flow network (DMVFN) [25]. By estimating the dynamic optical flow between adjacent frames, the DMVFN models various scales of motions between frames in order to predict future frames. The utilization of optical flow enhances the quality of prediction frames as well as the generalization ability and stability over diverse kinds of scenes. On the other hand, the DMVFN consists of a differentiable routing module, which is able to select different architecture sizes adaptive to the inputs with different motion scales. The routing module lowers the computational costs and significantly reduces the inference time. Specifically, for predicting a frame with resolution of 368×640, the DMVFN costs only 0.016 s on average. To sum up, the DMVFN achieves state-of-the-art performance on inference speed while maintaining the high quality of predicted images under a strong generalization ability. These two advantages match well with the requirement of our work, that is, to generalize the VP model to a motional scene consisting of an over-exposed moving license plate in order to predict a DMD mask in advance and achieve a single-shot HDR at a relatively high speed. More information about the DMVFN is provided in Appendix B.

    3. PRINCIPLE

    As mentioned in Section 1, the proposed POE-VP consists of two modules, namely the POEM and the CMPM. The POEM is driven by the CMPM; in particular, the coded grayscale mask loaded on the DMD in the POEM is predictively computed by the VP model in the CMPM. The conceptual framework of POE-VP performing in an HDR motional license plate scene is illustrated in Fig. 1.


    Figure 1.The simplified illustration of the conceptual framework of POE-VP, demonstrating with an HDR motional license plate recognition downstream vision task. POEM (shown in the first row in beige color) and CMPM (shown in the second row in purple color) are two core modules that build up POE-VP.

    We employed the DMD to achieve per-pixel optical encoding in order to control the exposure on each spatial position. The imaging hardware used in POE-VP is summarized inside the orange dashed box in Fig. 1. The arrangement of these optical components of a practical prototype will be discussed in Section 7.

    The first row of Fig. 1 (in the beige background color) illustrates the circumstance in which no mask is loaded on the DMD. In this case, the whole imaging pipeline can be seen as a conventional LDR imaging system, which allows all of the original light from the HDR scene to enter the sensor without any modulation. The highlight in the HDR scene forms over-exposure regions on the imaging sensor, leading to LDR captures (Frame t−1/t) whose license plate cannot be recognized by any license plate detection and recognition (LPDR) approach because of the information loss caused by saturation.

    As for the second row of Fig. 1 (in the purple background color), it depicts the operation of the CMPM. In the purple dashed box, the VP model, DMVFN, takes the previously captured LDR frame sequence (Frame t−1/t, $x_{t-1}/x_t$) as input. By estimating the optical flow between Frame t−1 and Frame t, the VP model infers the motional trends within the scene, especially the motion of saturated regions, and ultimately yields the prediction of the future frame (Predicted Frame t+1, $\tilde{x}_{t+1}$). Then, on the basis of Predicted Frame t+1, the exposure control algorithm is utilized to compute a predictive grayscale mask (Predicted Mask Frame t+1). Inspired by Ref. [22], we adopt the idea of threshold segmentation as the exposure control algorithm. The grayscale mask in the 8-bit unsigned integer data type, $\tilde{M}_{t+1}(x,y)$, is calculated as
$$\tilde{M}_{t+1}(x,y)=\begin{cases}255, & \tilde{x}_{t+1}(x,y)<V_s,\\[4pt] \dfrac{255\,a\,\tilde{x}_{t+1}(x,y)-b\,V_s\cdot 255}{\tilde{x}_{t+1}(x,y)}, & \tilde{x}_{t+1}(x,y)\ge V_s,\end{cases} \tag{2}$$
where $V_s$ is the threshold pixel value, and $a$ and $b$ are mask coefficients of light intensity suppression. The threshold value should be set properly, less than but close to the saturated value (255), in order to suppress the highlight in the license plate region and maintain a smoother pixel value gradient within that region. We experimentally set $V_s$ to 200 for a better evaluation performance of the framework on the target HDR vision task.
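    To make the thresholding concrete, the sketch below applies the mask rule to a predicted frame with NumPy. The piecewise structure follows Eq. (2); because the printed form of the suppression branch is partly ambiguous, the suppress() helper implements the reconstruction shown above and should be read as an assumption rather than the authors’ exact expression. The default coefficients are the simulation values (a, b) = (0.9, 0); the real scene experiment uses (0.7, 0.2).

```python
import numpy as np

def suppress(x_pred: np.ndarray, vs: float, a: float, b: float) -> np.ndarray:
    # Reconstructed suppression branch of Eq. (2); treat this exact form as an assumption.
    return (255.0 * a * x_pred - b * vs * 255.0) / np.maximum(x_pred, 1.0)

def predict_mask(x_pred: np.ndarray, vs: float = 200.0,
                 a: float = 0.9, b: float = 0.0) -> np.ndarray:
    """Grayscale coded mask for the next frame, computed from the VP output (0-255)."""
    x = x_pred.astype(np.float64)
    mask = np.full_like(x, 255.0)                  # below threshold: full exposure
    hot = x >= vs                                  # pixels predicted to be (near-)saturated
    mask[hot] = suppress(x[hot], vs, a, b)         # suppress predicted highlights
    return np.clip(mask, 0, 255).astype(np.uint8)  # 8-bit mask to load on the DMD
```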

    As shown in the third row of Fig. 1, as the objects in the scene keep moving, the predicted mask is loaded on the DMD at a relatively high refresh rate concomitantly. Then, in the next frame’s capture process, the pattern of the mask (Predicted Mask Frame t+1, $\tilde{M}_{t+1}$) matches the highlight region in the HDR scene (Frame t+1, $x_{t+1}$). In other words, each of our spatiotemporally varying masks is customized for its corresponding scene at the same time stamp. Mask t+1 is computed at the previous time stamp t instead of being iteratively generated after capturing Frame t+1, which reduces motional artifacts. As a result, we are able to realize high-speed single-shot highlight suppression and finally acquire an HDR capture (HDR image Frame t+1). In the HDR capture, the over-exposure regions of the moving car scene are successfully suppressed. The license plate information, previously saturated by the highlight, is revealed and becomes clear enough for the LPDR net to recognize.

    According to the aforementioned mechanism, there are no predicted masks for the first two frames at the very beginning because no previous frames are available as inputs. Therefore, when operating, our framework has an initialization process. The first two frames captured by the POEM are used for initializing the VP model instead of for HDR imaging. After the initialization, the whole system operates under the aforementioned mechanism, capturing HDR images through POE-VP frame by frame at a certain rate, as sketched below.
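    The pseudo-loop below summarizes this operating mechanism. All function names (video_predictor, compute_mask, load_dmd_mask, capture_frame) are placeholders for the DMVFN, Eq. (2), the DMD driver, and the sensor readout; they do not come from the authors’ code.

```python
def run_poe_vp(video_predictor, compute_mask, load_dmd_mask, capture_frame):
    """Hypothetical control loop for POE-VP (a sketch, not the released implementation)."""
    # Initialization: the first two captures only seed the predictor; no mask is loaded yet.
    frame_prev = capture_frame()      # Frame t-1, plain LDR capture
    frame_curr = capture_frame()      # Frame t, plain LDR capture
    while True:
        # Predict Frame t+1 from the two most recent captures.
        frame_pred = video_predictor(frame_prev, frame_curr)
        # Turn the prediction into a grayscale coded mask (Eq. (2)).
        mask = compute_mask(frame_pred)
        # Load the mask before the next exposure starts, then take the single-shot HDR capture.
        load_dmd_mask(mask)
        frame_hdr = capture_frame()
        yield frame_hdr
        # Slide the two-frame context window forward.
        frame_prev, frame_curr = frame_curr, frame_hdr
```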

    To sum up, the key features of POE-VP are as follows. (1) We employ the DMD as the optical encoder in order to achieve pixel-wise exposure control, which is the hardware basis for expanding DR. The utilization of the DMD places part of the computational procedure into the optical domain, work that is conventionally done in the algorithm domain by electronics. Compared with approaches in the software domain, this pre-sensor optical processing lowers the backend computational cost and enhances the processing speed due to the high refresh rate of the DMD. (2) We use VP to predictively compute the coded mask that drives the DMD, so that the patterns of the grayscale masks are able to follow the motional changes in the real scene. As a result, there are marked distinctions between our masks and those of other SVE methods. Our spatiotemporally varying mask sequence contains the spatiotemporal correlations depicting the motion of over-exposure regions. In contrast, SVE methods generally use periodic static masks that miss the temporal dimension, or iteratively generated masks that lag in the temporal dimension. The principle of our per-pixel exposure control achieves the goal of single-shot HDR, alleviating motional artifacts and reducing data size.

    4. IMPLEMENTATION OF LEARNING IMAGE SEQUENCE PREDICTION

    A. Dataset Preparation

    In the field of HDR, there is a common challenge in acquiring paired HDR-LDR datasets. In general, a natural HDR scene ends up as an LDR image when captured by an LDR camera. As a result, it is difficult to obtain an HDR ground truth from the real scene corresponding to the LDR measurements captured by a conventional LDR camera. Among the existing benchmark datasets for HDR, there is still a gap for single-exposure motional scenes [26].

    In our work, we prepared our dataset mainly based on the LSV-LP dataset, a large-scale and multi-source video-based LPDR dataset [27]. The LSV-LP consists of three categories: move versus static, static versus move, and move versus move. The move versus static means that the data collection device is moving while the vehicle stays still. The static versus move means that the device is static while the vehicle is moving. The move versus move shows that the device and vehicle are both moving. We extract a subset from the category move versus static, including 68 videos, with 18,322 frames which are divided into a training dataset (62 videos, 18,185 frames) and a test dataset (6 videos, 137 frames). We also collect 26 videos with 2298 frames of a motional license plate on the road and organize them as another part of the test dataset. A mobile phone camera is used as the video data collection device. While the mobile phone keeps moving, the videos are recorded in 1080p, 60 fps (frames per second) from a road driving vehicle and a parked vehicle. All frames are resized to 720×1280.

    To set up a paired HDR-LDR license plate dataset, we need to acquire HDR information first. However, the LSV-LP dataset is collected via a conventional LDR camera; thus, HDR information has already been cut off during the collection process of this dataset. Therefore, it is necessary to preprocess the original LSV-LP dataset. The preprocessing approach is illustrated in Fig. 2. Based on these videos, we apply a processing strategy called artificial highlight to generate HDR information. Specifically, we operate a Hadamard product on each frame from the original LSV-LP dataset with a grayscale map. The map, namely the artificial highlight map, follows a 2D Gaussian distribution whose extreme point aligns with the center point of the license plate, written as
$$H(x,y)=O(x,y)\,[E\cdot G(x,y)+1],\qquad G(x,y)=\exp\!\left(-\frac{(x-c_x)^2+(y-c_y)^2}{2R}\right), \tag{3}$$
where O(x,y) represents the original frames in the LSV-LP dataset and H(x,y) represents the generated HDR maps. G(x,y) represents the grayscale Gaussian map, in which $(c_x,c_y)$ is the coordinate of the license plate center point. The over-exposure rate E and radius R are parameters that control the intensity and size of the Gaussian map. The operation simulates a scene in which a Gaussian highlight (e.g., a high-beam light) illuminates the license plate region, which makes some pixel values of the original LDR frames increase beyond 255, forming HDR information artificially. Using a heat map, Fig. 2(c) figuratively explains the artificial HDR information we create around the license plate region. Note that the real HDR frames we create here cannot be displayed without tone mapping and other processes because some of their pixel values exceed the maximum pixel value of an 8-bit display. In other words, Fig. 2(c) is not the real HDR frame displayed on the screen; it is only a visualization of how we exceed the boundary pixel value of 255 and create artificial HDR information.


    Figure 2.(a) The normalized artificial highlight map G(x,y) illustrated in the form of a heat map. The map follows a 2D Gaussian distribution. (b) The frame O(x,y) extracted from an LDR original video in test dataset. The yellow crossline marks the center position of the license plate bounding box, which is coincident with the extreme point of Gaussian distribution in (a). (c) The figurative visualization of HDR information maps H(x,y). The pixel values over 255 in the deep blue region indicate the created artificial HDR information. (d) The simulated LDR frame.

    Then, we imitate the main step of the LDR image formation pipeline, the dynamic range clipping, to simulate the scene of capturing LDR measurements via a conventional LDR camera.

    In each HDR frame we generate, pixels whose values are over 255 are clipped to 255, while other pixels remain unchanged. The over-255 pixels are saturated and form over-exposure regions, producing a simulated LDR frame, as shown in Fig. 2(d). Via the aforementioned steps, we collected the paired HDR-LDR license plate dataset, namely HDR-LDR-LP.
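    A minimal NumPy sketch of this preprocessing is given below: it builds the artificial highlight map of Eq. (3), forms the HDR information map, and clips it at 255 to obtain the simulated LDR frame. The function and argument names are ours.

```python
import numpy as np

def make_hdr_ldr_pair(frame_ldr: np.ndarray, cx: float, cy: float,
                      E: float = 5.0, R: float = 50.0):
    """Build one HDR/LDR pair from an original LSV-LP frame O(x, y).

    (cx, cy): license plate center; E, R: over-exposure rate and radius in Eq. (3).
    """
    h, w = frame_ldr.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    # Artificial highlight map G(x, y): a 2D Gaussian centered on the license plate.
    G = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * R))
    if frame_ldr.ndim == 3:          # color frame: broadcast the map over channels
        G = G[..., None]
    # HDR information map H = O * (E * G + 1); values may exceed 255.
    H = frame_ldr.astype(np.float64) * (E * G + 1.0)
    # Simulated LDR capture: dynamic range clipping at the 8-bit saturation level.
    ldr_sim = np.clip(H, 0, 255).astype(np.uint8)
    return H, ldr_sim
```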

    B. VP Model Training

    The preprocessing parameters E and R for the training dataset are set to 5 and 50, respectively. The DMVFN model used in our framework is trained on 368×640 image patches, which are resized from the 720×1280 training data after the preprocessing mentioned above. The model is trained on 4 NVIDIA GeForce RTX3090 GPUs for 300 epochs, with a batch size of 16. Following the configuration in Ref. [25], the AdamW optimizer with a learning rate that decays from $10^{-4}$ to $10^{-5}$ is employed in model training.
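    For reference, the optimizer setup below matches the stated configuration (AdamW, learning rate decayed from 1e-4 to 1e-5 over 300 epochs). The cosine decay shape and the stand-in model are our assumptions; this is not the authors’ training script.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)   # stand-in for the DMVFN, for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
epochs = 300
# Decay the learning rate from 1e-4 down to 1e-5 over training (cosine shape assumed).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-5)

for epoch in range(epochs):
    # ... one pass over the 368x640 training patches with batch size 16 ...
    optimizer.step()
    scheduler.step()
```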

    5. EVALUATION METRICS

    Our framework emphasizes HDR imaging for specific vision tasks rather than HDR photography for human vision. As a result, the evaluation metrics in this paper put more weight on the amount of information we can reveal to enhance downstream vision tasks in highlight conditions than on image quality and perceptual quality. The main quantitative metrics we use are introduced and defined as follows.

    A. Recognition Accuracy

    Recognition accuracy (Rec-Acc) is calculated per test video, defined in Eq. (4). $N_T$ represents the total count of frames in a video. For the same video, $N_p$ represents the count of frames whose license plate is correctly recognized (every character in the recognition result matches the corresponding character in the license plate label):
$$\text{Rec-Acc}=\frac{N_p}{N_T}. \tag{4}$$

    B. Information Entropy

    To quantitatively measure the amount of information around the license plate region in both the HDR frame sequence and the LDR frame sequence, we introduce the information entropy S [28], defined as
$$S=-\sum_{i=0}^{255} p_i \log_2 p_i, \tag{5}$$
where $p_i$ is the probability of pixels with value $i$ ($0\le i\le 255$) in the image. The information entropy is calculated within the bounding box of the license plate.

    C. Local Information Entropy

    A certain pixel’s local information entropy is calculated in a 3×3 neighborhood kernel: the average information entropy of the 9 pixels in the kernel represents the local information entropy of the central pixel. We use the local information entropy value of each pixel to draw a local information entropy heat map, which visualizes the information difference between HDR frames and LDR frames for further analysis.
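    The two entropy metrics can be computed as below; the 3×3 rule is interpreted here as the entropy evaluated over each pixel's 3×3 neighborhood, and the function names are ours.

```python
import numpy as np

def information_entropy(patch: np.ndarray) -> float:
    """Shannon entropy of an 8-bit patch, Eq. (5): S = -sum_i p_i * log2(p_i)."""
    hist = np.bincount(patch.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                                  # empty bins contribute 0 * log 0 := 0
    return float(-(p * np.log2(p)).sum())

def local_entropy_map(img: np.ndarray) -> np.ndarray:
    """Straightforward (slow) reference for the local information entropy heat map."""
    h, w = img.shape
    padded = np.pad(img, 1, mode='edge')
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[i, j] = information_entropy(padded[i:i + 3, j:j + 3])
    return out

# Usage: Eq. (5) is evaluated inside the license plate bounding box,
# e.g. information_entropy(frame[y0:y1, x0:x1]).
```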

    6. SIMULATION ASSESSMENT

    A. Simulation Experiment

    The test dataset prepared for the simulation experiment consists of two parts. The first part is the test set extracted from the original LSV-LP dataset, which contains 6 videos with 137 frames. We employ the preprocessing strategy mentioned above for this test set, setting the Gaussian map parameters E to 9 and R to 50. Furthermore, in order to evaluate the generalization ability of our framework, we capture additional videos of a license plate from a different province in China via an iPhone 13 Pro Max camera to fill the second part of the test dataset, which contains 26 videos. The second part of the test dataset is then divided into three subsets, with a different E (9, 29, 49) and the same R (50) in preprocessing. The three configurations of over-exposure rate E simulate various highlight conditions, which allows us to evaluate the generalization performance and stability of our POE-VP.

    Figure 3 illustrates the simulation pipeline. The original frame sequence extracted from each video in the test dataset is sent into the preprocessing unit, yielding paired HDR information maps and simulated LDR frames. The simulated LDR frames are resized to 368×640, which fits the input size of the VP model. The trained VP model uses the previous two simulated LDR frames to predict the next frame. For example, as shown in Fig. 3, simulated LDR frames T=0/T=1 (T=1/T=2) are used to produce predicted LDR frame T=2 (T=3). The rest of the sequence follows the same scheme. Each predicted LDR frame forms its corresponding mask prediction via the mask calculation unit. Then, the Hadamard product is taken between the mask predictions and the HDR information maps with the same time stamp T, which simulates the optical encoding and light field modulation process of the DMD. Note that the first two original frames (T=0 and T=1 in Fig. 3) are used for initializing the framework. After the over-255 cutoff that simulates the sensor’s dynamic range clipping effect, we obtain the simulated sensor captures with HDR information beginning at T=2.
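    The DMD encoding step of this pipeline can be simulated in a few lines; normalizing the 8-bit mask to [0, 1] before the Hadamard product is our assumption about how the mask is applied.

```python
import numpy as np

def simulate_capture(hdr_map: np.ndarray, mask_pred: np.ndarray) -> np.ndarray:
    """Simulated optically encoded capture for one frame.

    hdr_map: HDR information map H at time T (float, values may exceed 255).
    mask_pred: predicted grayscale mask for time T (uint8, 0-255).
    """
    # Hadamard product models the DMD light-field modulation.
    modulated = hdr_map * (mask_pred.astype(np.float64) / 255.0)
    # Over-255 cutoff models the sensor's dynamic range clipping.
    return np.clip(modulated, 0, 255).astype(np.uint8)
```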


    Figure 3.The illustration of the simulation pipeline. Exemplary frames of the “JingA 3MU55” video from the test dataset are used as an example. For this example, the preprocessing parameters (E, R) are set to (29, 50). The red arrow indicates the moving direction of the license plate. “T” in the top left corner of each frame stands for the frame index of the video. In the simulation experiment, the mask coefficients (a, b) are set to (0.9, 0).

    We evaluate our framework’s performance on HDR via a downstream license plate recognition (LPR) task. We use the architecture in Ref. [29] as the ALPR Net, which employs yolov5 [30] and CRNN [31] as the license plate detection model and the license plate recognition model, respectively. For the comparative experiment, the simulated LDR frames are evaluated via the same ALPR Net.

    Note that we directly send the simulated sensor captures into the ALPR Net, as our target task is a recognition task. As for HDR image reconstruction, the simulated sensor captures first need to be divided by the corresponding mask predictions to obtain the full HDR reconstruction matrix, in which pixel values over 255 appear. Then, by utilizing tone mapping [32], the HDR results can be displayed properly on an LDR screen for human observation.
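    A sketch of this decoding step is shown below. Dividing by the normalized mask and the Reinhard-style operator are stand-ins chosen by us; the paper's tone mapping [32] may differ.

```python
import numpy as np

def reconstruct_and_tonemap(capture: np.ndarray, mask_pred: np.ndarray,
                            eps: float = 1e-3) -> np.ndarray:
    """Invert the optical encoding, then tone map for an 8-bit display (illustrative only)."""
    # "Decoding": divide the capture by the normalized mask; values above 255 reappear.
    hdr = capture.astype(np.float64) / (mask_pred.astype(np.float64) / 255.0 + eps)
    # Simple global Reinhard-style tone mapping as a placeholder for Ref. [32].
    norm = hdr / 255.0
    return np.clip(255.0 * norm / (1.0 + norm), 0, 255).astype(np.uint8)
```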

    B. Simulation Results


    Figure 4.(a) Simulation results of the “JingA 3MU55” license video, which is from the R50E29 class of the test dataset. Exemplary frames are shown with frame index annotations (T). The “Simulated sensor captures” row shows the simulated captures before the inverse “decoding” process and tone mapping, while the “HDR results” row shows the HDR frames after the tone mapping operation for better human-eye perception. The “Simulated sensor captures” in the red font color and the “Simulated LDR results” in the blue font color are compared quantitatively and qualitatively in this experiment. The red arrow in the first frame of the original scene indicates the moving direction of the license plate. The first two frames of the video are used for initialization. Frames in the orange box are chosen as exemplary frames to calculate the corresponding local information entropy maps. (b) The local information entropy heat maps of the exemplary frames. The license plate regions are zoomed in for better visualization and comparison. See Visualization 1 for video results.

    We also compare the proposed framework with existing techniques, including classical multi-exposure fusion [8], modified multi-exposure fusion with artifact reduction [33], and deep optics single-shot HDR [17]. The qualitative results are shown in Fig. 5, and all the methods are evaluated using the same dataset mentioned in Section 4.A. Note that in the implementation of the multi-exposure fusion methods [8,33], we use three consecutive frames as a group for generating a fused HDR frame result. We simulated the exposure difference on the three frames by multiplying them by three different global exposure coefficients. The coefficients we set in the comparison simulations are 1/8, 1/5, and 1, corresponding to low exposure, medium exposure, and high exposure, respectively. The 3-frame groups were selected one by one, from the first three frames to the last three frames, covering all consecutive 3-frame groups in the video data, as sketched in the snippet below. It can be observed that our POE-VP outperforms the other methods in highlight suppression and HDR information preservation (around target regions). Specifically, as a single-shot HDR method, our POE-VP is able to eliminate the artifacts that are easily found in the classical exposure fusion method due to the multiple captures of moving scenes. Though the multi-exposure fusion with artifact reduction method remarkably reduces the artifacts, it still faces a performance limit in some extreme illumination scenes. For example, in the bottom row of Fig. 5, modified multi-exposure fusion suppresses the highlight around the character “M” in the license plate less effectively than our POE-VP. For the test videos in the case of minimal over-exposure (preprocess parameter E=9), the deep optics method is able to recover part of the HDR information around the license plate region in some of the test videos (see the top row of Fig. 5 as an example). At the same time, our framework achieves a more stable and more effective highlight suppression in the same test video. The bottom row of Fig. 5 shows an example of E=29. It can be seen that the deep optics method fails to suppress a large area of highlighting and hardly recovers any useful information for license plate recognition. The reason for the failure is mentioned in Ref. [17]: the mechanism that uses nearby pixels to encode the brightest pixel value has an inherent limitation when working in the case of extremely bright and large-area highlighting. On the contrary, our proposed framework is able to control the exposure pixel-wise, which allows highlight suppression to occur within a single pixel without occupying nearby pixel values.
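    The bracket construction used for the fusion baselines is summarized below; the function name is ours, and only the sliding 3-frame grouping and the three coefficients come from the description above.

```python
import numpy as np

def exposure_brackets(frames, coeffs=(1 / 8, 1 / 5, 1.0)):
    """Sliding 3-frame exposure brackets for the multi-exposure fusion baselines.

    frames: list of HDR information maps (float arrays). Each consecutive triplet is
    scaled by the low/medium/high global exposure coefficients and clipped to 8 bits.
    """
    brackets = []
    for i in range(len(frames) - 2):
        group = [np.clip(frames[i + k] * c, 0, 255).astype(np.uint8)
                 for k, c in enumerate(coeffs)]
        brackets.append(group)
    return brackets
```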


    Figure 5.Exemplary frames of the comparison simulations with the other three existing methods. The results of our POE-VP are highlighted in red. Parts of the license plate are zoomed in for a detailed comparison of motional artifacts between the multi-shot methods and the proposed single-shot POE-VP. Note that the HDR results of all four methods are tone mapped with the same approach for display.

    The comparison can also be evaluated quantitatively in Table 2, which lists the average recognition accuracy of the LPR task. The results show a prominent superiority of our work over the other methods in higher-saturation cases. Besides, time efficiency is an important feature of our work, as our target vision task is a motional HDR task. We compare the lowest running time of the algorithms for a single frame. The comparison results in Table 2 show that POE-VP has a higher time efficiency, which enables a faster capture speed.

    Table 2. Quantitative Results of Comparison Simulation^a

    Capture Mode                       | Method                             | Average Recognition Accuracy | Average Running Time/ms
    Multi-shot (at least 3 shots)      | Exposure fusion (classical)        | 65.07%                       | 135.261
                                       | Exposure fusion (artifact reduced) | 88.56%                       | 424.831
    Single-shot (with physical setup)  | Deep optics                        | 21.64%                       | 1161.458
                                       | POE-VP (ours)                      | 94.86%                       | 16.928

    ^aThe numbers in bold represent the best performance.

    7. REAL SCENE EXPERIMENT

    A. DMD Camera Prototype


    Figure 6.(a) The schematic of optical layout inside the DMD prototype. (b) The DMD prototype. (c) The photograph of the light box in the experimental setup. The red arrow shows the moving direction of the license plate. (d) The overall view of the experimental setup.

    B. Experimental Setup

    The real scene experimental setup is shown in Fig. 6(d). The moving license plate scene in the highlight condition we create is shown in Fig. 6(c), corresponding to the red dashed box in Fig. 6(d). We use two different real license plates as the objects in the experiment. The license plate is fixed on a sliding optical bench, which makes it move along the oblique optical rail. We use an LED luminescent array in the light box (VC-118-X 3nh) as highlight source A, which is tuned to its maximum illuminance. Besides, we employ another LED flash light (MSN-400pro) as highlight source B, which is tuned to its maximum illuminance as well. The two light sources emit light towards the license plate and form a highlighted illumination area, which leads to the saturation of a conventional image sensor without the DMD’s modulation. The DMD is set in front of the light box, at a distance of 5 m. In this experiment, we only use a part of the prototype’s field of view, that is, a region of interest (ROI) of size 720×1280.

    Figure 7 illustrates the mechanism and imaging procedure of our framework implemented via the DMD prototype, including the trigger mode and timing scheme. By generating control signals via an arbitrary waveform generator (AWG, UNI-T UTG2000B), we synchronize the DMD and the image sensor [34]. As shown in Fig. 7(e), the trigger signal of the image sensor is a relatively long rectangular wave with pulse width tI. The trigger signal of the DMD is a periodic rectangular wave, as shown in Fig. 7(d). Within time tD, the DMD refreshes a series of binary patterns [Fig. 7(c)] at each rising edge. Via pulse-width modulation (PWM) [36], the binary patterns form the grayscale coded masks [Fig. 7(b)], which are predicted using previous captures. The DMD and the image sensor are connected to a PC, which reads the previous captures from the image sensor, calculates the predicted mask for the next shot, and drives the DMD with the mask. In our experiment, tI and tD are set to 33.33 ms and 20 ms, respectively. Within each single-shot tI, we obtain the HDR captures of the highlighted license plate, as shown in Fig. 7(f). To ensure a fair comparison, the LDR captures of the same motional scene are acquired via the same hardware setup without masks on the DMD, i.e., fully white patterns are loaded within the entire LDR acquisition period. The prototype with the fully white DMD pattern is equivalent to an LDR camera with the same LDR sensor, which is proved in detail in Appendix C. The HDR captures as well as the LDR captures [Fig. 7(a)] are evaluated via the same ALPR Net for the comparative experiment. Note that the capture results are grayscale images due to the grayscale image sensor we use in the prototype.


    Figure 7.(a) LDR capture sequence. The red arrow shows the moving direction of the license plate. (b) The grayscale masks predicted by VP. In the real scene experiment, we set the mask coefficients (Vs, a, b) to (200, 0.7, 0.2). (c), (d) Schematic of the PWM and trigger mode of the DMD. The DMD refreshes the binary patterns at each rising edge and generates the grayscale masks within each tD; for the rest of the time, fully white patterns are loaded (the DMD works as a reflective mirror when all micromirrors are “on”). (e) The exposure periods of the image sensor. tI is the effective exposure time of every single shot. (f) The HDR captures acquired by the prototype with the DMD on. The first two frames are used for initialization.

    C. Results and Evaluations

    Qualitative evaluation. The exemplary HDR results captured using our prototype are shown in Fig. 8, along with LDR captures and intermediate results. It can be observed that highlights on the LDR license plates are suppressed in the HDR captures. Thus, the revealed license plate characters can be accurately recognized. The results show that our framework successfully preserves the HDR information of a moving target. Furthermore, one can observe in Fig. 8 that, for each test video, the grayscale distribution of the predicted mask changes dynamically along with the movement of the highlighted region in the LDR frame sequence. In other words, it is validated that the grayscale mask sequence has spatiotemporal correlations similar to those of the LDR frame sequence of the motional scene. This time-varying optical encoding strategy driven by VP is a unique feature of our framework, which ensures that the grayscale pattern coincides with the highlighted region in the scene at every time stamp. As mentioned above, the mask loaded at a certain time stamp is calculated at the previous time stamp. As a result, single-shot HDR imaging is realized. It is clear that the combination of pixel-wise optical encoding with VP enables single-shot HDR imaging for a motional scene.


    Figure 8.(a) Hardware experiment results of the “JingN F9H28” license plate motional scene. (b) Hardware experiment results of the “JingL J7629” license plate motional scene. Exemplary frames of the two result videos are shown with frame index annotations (T). The red arrow indicates the moving direction of the license plate. The first two frames of each video are used for initialization. Frames in the orange box are chosen as exemplary frames to calculate the corresponding local information entropy maps. See Visualization 1 for video results.

    Quantitative evaluation. In order to conduct a fair comparison, LDR captures and HDR captures are sent into the same ALPR Net for quantitative validation. The quantitative results of three evaluation metrics are summarized in Table 4.

    Table 4. Quantitative Results of Two Real HDR License Plate Recognition Scenes^a

    License Plate | Captured Frame Count | LDR Rec-Acc | HDR Rec-Acc | LDR Entropy | HDR Entropy | Time/ms
    JingN F9H28   | 28                   | 0%          | 92.86%      | 1.2018      | 6.9323      | 16.807
    JingL J7629   | 39                   | 0%          | 97.44%      | 1.2038      | 7.2774      | 14.317
    Average       |                      | 0%          | 95.15%      | 1.2028      | 7.1049      | 15.562

    ^aThe numbers in bold represent the best performance.

    It can be seen that POE-VP overcomes the failure that LDR captures exhibit in the recognition task. The Rec-Acc of the HDR captured videos shows an improvement of 95.15 percentage points on average compared to the Rec-Acc of the LDR videos. The information entropy calculated on the HDR captures is 5.9021 higher on average than that of the LDR captures, a 490.70% increase. POE-VP successfully suppresses the highlight and preserves much richer information on the license plate from the original HDR scene. The local information entropy maps of exemplary frames are shown in Fig. 9 for further demonstration, visualizing the considerable information entropy increase obtained by involving POE-VP. In the heat maps, the deep blue color in the LDR license plate regions indicates information loss. In contrast, the warm-colored HDR license plate regions show that POE-VP preserves a relatively large amount of information in the same highlight condition. Using a 36-patch dynamic range test chart (Sine image), we estimate the DR of the prototype. Without optical encoding (fully white patterns on the DMD), the prototype scores only 70.35 dB in DR, while the prototype with optical encoding expands the DR to 120.74 dB, a gain of 71.63%.


    Figure 9.(a) The local information entropy heat maps of the exemplary frames in the “JingN F9H28” license videos. (b) The local information entropy heat maps of the exemplary frames in the “JingL J7629” license videos. The HDR and LDR license plate results are shown beside the corresponding entropy maps for comparison.

    The above results indicate that POE-VP greatly enhances the performance of motional object recognition tasks in highlight conditions. Furthermore, we also calculate the time consumption and analyze the high-speed property of POE-VP. In the real scene experiment, we utilize a PC to receive previous captures from the image sensor, run VP, and output masks to control the DMD, as illustrated in Fig. 6(a). We run the VP and mask calculation algorithm on an NVIDIA GeForce RTX3090 GPU on a PC and summarize the running time in Table 4. The average running time for one frame in the real scene experiment is 15.562 ms, which is nearly the theoretical bound of the time consumption of our POE-VP working in the real scene. However, due to the limitations of the USB data transfer rate and the DMD driving mode of the PC, the actual capture speed cannot reach the ideal one, even though the DMD can refresh its patterns at rates of thousands of hertz. Thus, the example video results we demonstrate in Visualization 1 take more than 100 ms to complete a single shot of HDR, resulting in a relatively low frame rate. In an engineering implementation, edge computing technology, such as a field programmable gate array (FPGA), is a possible solution to accelerate the whole pipeline. Employing an FPGA linked with the image sensor and the DMD by a data bus, we can not only reduce the time cost of data transfer but also speed up the VP and mask calculation algorithm. According to our theoretical calculation, the proposed POE-VP can reach over 100 fps when implemented via an FPGA (Zynq XC7Z035) [37] and a data bus (200 MHz, 32-bit), which is promising for satisfying the needs of real-time HDR motional target recognition. Please refer to Appendix D for the detailed calculation of the theoretical frame rate.
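    As a rough, back-of-the-envelope illustration (not the Appendix D derivation), the snippet below estimates the per-frame transfer time over the stated 200 MHz, 32-bit bus for the 720×1280 8-bit ROI, suggesting that data movement consumes only a small part of a 10 ms-per-frame (100 fps) budget.

```python
# Illustrative estimate only; the rigorous frame-rate analysis is in Appendix D.
width, height, bit_depth = 1280, 720, 8            # ROI used in the experiment
bits_per_frame = width * height * bit_depth
bus_bits_per_second = 200e6 * 32                   # 200 MHz x 32-bit data bus
transfer_ms = bits_per_frame / bus_bits_per_second * 1e3
print(f"{transfer_ms:.2f} ms per frame")           # about 1.15 ms
```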

    Last but not least, acquiring a single HDR output via our framework requires keeping only two previous frames in memory. Compared with multi-frame fusion methods, which take at least 3 frames of different exposures, POE-VP achieves a lower captured data size, reducing the data volume by at least 66.67%.

    8. DISCUSSION AND CONCLUSION

    In summary, POE-VP offers a new paradigm for single-shot HDR vision tasks in motional scenes. By implementing via a DMD, the POEM realizes pixel-wise spatially varying exposure control for suppressing the highlight in each captured frame. As the motional artifact and time latency are two critical challenges to overcome in motional scene HDR, we propose the CMPM, which utilizes VP for ahead-of-time prediction. The CMPM introduces a temporal consistency between the masks for consecutive captures and the motional scene. Driven by VP, the DMD is capable of producing a mask sequence whose spatiotemporal correlation is highly similar to that of the capture sequence of the highlight motional scene. This spatiotemporally varying optical encoding mask is the core feature of POE-VP, distinguishing it from other SVE approaches in the field of HDR. Thereby, single-shot HDR is achieved, thanks to the ahead-of-time mask prediction via VP and the high-speed mask loading via the high-refresh-rate DMD. The dynamic range of the prototype could reach 120 dB under test. We choose highlight motional license plate recognition as the downstream HDR vision task to evaluate the performance of POE-VP. According to subjective and objective analysis of the results in the simulation and the real scene experiment, POE-VP shows obvious advantages in expanding DR and preserving HDR information. Compared with conventional LDR captures, POE-VP enhances the recognition performance, as reflected in the recognition accuracy and information entropy. Running on an NVIDIA GeForce RTX3090 GPU, the model and algorithm of POE-VP take 15.562 ms in total for a single HDR capture. In practical engineering deployment, the frame rate can reach over 100 fps by using a high-bandwidth, high-transmission-rate edge computing accelerator assisted by a data bus, which achieves real-time HDR for motional scenes to some extent. Thus, POE-VP alleviates the artifacts in motional HDR imaging, which we have discussed in the comparison simulation in Section 6. Furthermore, compared to the mainstream multi-frame fusion methods, the data volume is also reduced by POE-VP, leading to a lower data size. It should be mentioned that the performance improvements of POE-VP in dynamic range expansion and artifact suppression mainly benefit from its physical setup, i.e., the use of the DMD as a pixel-wise optical modulator together with its controller, which of course brings limitations in device size and additional cost compared to multi-exposure methods with a conventional camera in some practical cases, for example, HDR imaging of static scenes (almost no artifacts) with stronger requirements on device portability and cost. In summary, limited by current DMD technology and optical design, POE-VP trades some miniaturization and hardware cost for prominent improvements in time efficiency, data size, artifact reduction, and dynamic range expansion. Besides, POE-VP is always able to provide additional dynamic range for the sensor used in an imaging system via its physical setup, regardless of the sensor’s bit depth. Overall, it can play the role of an enhancement component for various systems in practical applications, bringing additional dynamic range and fast speed, as well as additional size and cost.

    Our POE-VP can be improved in several aspects. First, the DR extension we focus on addresses high-illuminance conditions, while the low-illuminance end of the DR axis remains to be extended. Expanding the ability of low-light sensing is a future work. Second, the POE-VP proposed in this paper is optimized and fine-tuned for a specific vision task. A generic HDR scheme and HDR imaging at the photography level for human visualization based on the framework of POE-VP are attractive directions for further exploration. For instance, the parameters in Eq. (2) that control the mask generation need to be optimized more finely for more complicated HDR tasks or even photography-level HDR imaging. Making them differentiable and training them end-to-end with a video prediction network and other imaging-process networks for specific HDR tasks or general HDR photography is a further research subject that we will dive into. Third, POE-VP has a limitation arising from the characteristics of the VP algorithm we used. That is, when the moving object undergoes an abrupt, large-magnitude motion, e.g., a sudden 180° turn, the video prediction may fail to catch up with such a sudden variation of the scene within two frames, resulting in imperfect highlight suppression in those sudden-movement frames. To achieve improvement, we will dive into the joint optimization of imaging optics, optical encoding strategies, and image processing algorithms.

    APPENDIX A: DIGITAL MICROMIRROR DEVICE USED AS A PIXEL-WISE OPTICAL ENCODER

    The digital micromirror device (DMD), a high-speed amplitude SLM, carries several thousand microscopic mirrors in a rectangular array configuration on its surface. Controlled by two pairs of electrodes, each of these micromirrors can be individually tilted ±12° to reflect the incident light towards two different directions. Thus, the DMD is able to control the binary on-off state of each pixel, so each pixel can be regarded as an optical switch. Based on this mechanism, the DMD is able to produce high-contrast binary patterns at a modulation speed of up to tens of kilohertz [38]. However, binary patterns cannot satisfy the needs of our HDR application, in which coding with finer precision than binary is required for suppressing the highlight accurately. As a result, the DMD needs to work in the pulse-width modulation (PWM) mode [36] in order to load non-binary grayscale coded masks. Although the modulation speed reduces from over 10 kHz to 300 Hz [39] in PWM mode, it still satisfies the needs of the HDR moving object recognition task. Placed in front of an imaging sensor, a DMD in PWM mode selectively determines the transmittance of incident light that reaches the sensor.

    Here, we dive into the control mechanism of the micromirrors to further discuss the pixel-wise features of POE-VP’s special exposure modulation strategy. In PWM mode, the mirrors are held in the state corresponding to the previously loaded data until the next bit is loaded [36]. If the adjacent data loaded on a certain micromirror are in the same state, the micromirror simply holds its pose instead of changing its tilt angle. In PWM mode, a fully white pixel (value 255 in 8-bit grayscale) means that the micromirror stays in the “on” state during the whole refresh time. We can see from the consecutive masks of POE-VP in Fig. 4 that a range of pixels remains fully white (pixel value of 255), which means they do not need to suppress high illumination. For these pixels, the corresponding micromirrors keep their pose in the “on” state according to the aforementioned principle. Only the micromirrors corresponding to pixels that have a gradient of pixel values between consecutive masks need to tilt under PWM during each mask refresh time. Ultimately, the core target of POE-VP is to adaptively modulate the light intensity in the moving highlight area while maintaining the intensity in the normal-illuminance area. Note that, among the micromirrors that need to tilt, the control from the DMD is also spatially varying for modulating the different pixel intensities described in Eq. (2), which correspond to different duty cycles [36] according to the PWM principle. Therefore, in terms of power consumption, our pixel-wise exposure modulation can save energy to a certain extent because we only consume energy to drive a part of the micromirrors to tilt. In summary, the pixel-wise features are reflected in two aspects: (1) the exposure modulation, i.e., each mask loaded on the DMD; (2) the control of each individual micromirror.

    By ensuring each mirror of the DMD aligns with its corresponding pixel on the sensor, the precision of optical coding can be enhanced. After aligning the DMD with the sensor, the process in which the DMD modulates the light field via a coded mask and forms a measurement on the sensor can be described mathematically as follows [36]:
$$V(x,y)=M(x,y)\odot I(x,y),$$
where $(x,y)$ is the spatial coordinate, $\odot$ denotes element-wise multiplication, $V(x,y)$ represents the measurement captured by the sensor, $M(x,y)$ denotes the DMD modulation function, and $I(x,y)$ stands for the light intensity of the scene.
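    The two mechanisms above can be illustrated with a short sketch: an 8-bit mask is decomposed into binary bit-plane patterns whose PWM duty cycles average to M/255, and the encoded measurement follows the element-wise model V = M ⊙ I. The timing details are generic PWM, not the specific controller behavior.

```python
import numpy as np

def bitplanes(mask: np.ndarray):
    """Decompose an 8-bit (uint8) grayscale mask into 8 binary DMD patterns.

    Under PWM, bit-plane k is displayed for a duration proportional to 2**k within one
    refresh period, so each micromirror's time-averaged transmittance equals mask / 255.
    """
    return [((mask >> k) & 1).astype(np.uint8) for k in range(8)]

def modulate(scene: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise modulation V(x, y) = M(x, y) ⊙ I(x, y), with M normalized to [0, 1]."""
    return scene.astype(np.float64) * (mask.astype(np.float64) / 255.0)
```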

    APPENDIX B: VIDEO PREDICTION MODEL

    The video prediction model, the dynamic multi-scale voxel flow network (DMVFN), utilizes nine convolutional sub-blocks, named multi-scale voxel flow blocks (MVFBs), to estimate dynamic optical flow between adjacent frames. The $i$th MVFB takes two frames ($x_{t-1}$, $x_t$) as well as the outputs of the $(i-1)$th block (the synthesized frame and the voxel flow estimated by that block) as inputs and predicts the voxel flow used to obtain predictive frames via pixel-wise backward warping [40]. By setting different receptive fields, the nine MVFBs make it possible to predict diverse motion scales from different pairs of adjacent input frames. Specifically, the input is down-sampled by a different scaling factor in each MVFB, from large to small values; in our work, the scaling factors of the nine sub-blocks are set as [4, 4, 4, 2, 2, 2, 1, 1, 1]. Besides, the DMVFN is superior in time efficiency, which is essential for our work on moving HDR scenes. The different down-sampling factors give the MVFBs different computational workloads. A routing module is employed to adaptively select dynamic architectures of the DMVFN by weighting each block with 0 or 1 (0 for skipping the block and 1 for activating it). The learnable routing module consists of a small neural network that generates the binary vector according to the motion scale of the input sample. Based on this design, the DMVFN obtains a dynamic, adaptive architecture that saves inference time while maintaining its representation performance on large-scale motion.
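    The routed, coarse-to-fine cascade can be summarized by the following schematic Python sketch. The function and callable names, as well as the block interface, are ours and purely illustrative of how the binary routing vector gates the nine MVFBs; this is not the DMVFN implementation of Ref. [25].

```python
# Schematic of the routed coarse-to-fine MVFB cascade (illustrative only).
SCALES = [4, 4, 4, 2, 2, 2, 1, 1, 1]   # down-sampling factors of the nine MVFBs

def dmvfn_forward(x_prev, x_curr, mvfbs, routing_vector):
    """Run only the MVFBs selected by the binary routing vector.

    Each element of `mvfbs` is a callable taking the two input frames plus the
    previous flow estimate and synthesized frame, and returning a refined
    (flow, synthesized_frame) pair. Blocks weighted 0 by the router are
    skipped entirely, which is where the inference-time saving comes from.
    """
    flow, synthesized = None, x_curr
    for block, scale, keep in zip(mvfbs, SCALES, routing_vector):
        if keep == 0:
            continue                    # block skipped by the routing module
        flow, synthesized = block(x_prev, x_curr, flow, synthesized, scale)
    return synthesized                  # the predicted next frame

# Toy usage with dummy blocks that simply pass the current estimates through.
dummy_block = lambda xp, xc, flow, syn, scale: (flow, syn)
pred = dmvfn_forward("frame_t-1", "frame_t", [dummy_block] * 9,
                     [1, 0, 1, 1, 0, 1, 1, 1, 1])
```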

    Two further evaluations of the VP model need to be mentioned here. The first concerns the training dataset of the prediction model. The preprocessing method in Section 4.A synthesizes our paired HDR-LDR dataset by creating HDR information in part of the image and clipping it to form an LDR image. The information loss introduced by this preprocessing strategy is necessary for (1) simulating conventional sensor capture results for comparison, and (2) training and testing DMVFN's performance in predicting the movement of saturated objects, which is essential for POE-VP to adaptively generate spatiotemporally varying masks that track the moving highlight area. It needs to be assessed whether the VP model trained on our synthesized LDR dataset retains a performance similar to a model trained on the regular dataset, i.e., on normal-exposure image sequences without the preprocessing strategy. We therefore conduct a supplementary network training experiment. In this experiment, we retrained the VP network using the same subset of the LSV-LP training dataset mentioned in the main text, which serves as the model trained on regular image sequences, while the network we had already trained on the preprocessed dataset serves as the model trained on synthesized LDR image sequences. Both models use the training configuration described in Section 4.B. We use the same test dataset to compare the two trained models in terms of the peak signal-to-noise ratio (PSNR) [41], the multi-scale structural similarity index measure (MS-SSIM) [42], and the learned perceptual image patch similarity (LPIPS) [43]. Note that a lower LPIPS means the prediction results are more similar to the ground truth. The results are summarized in the first and second rows of Table 5. From the results, we can see that the model trained on the synthesized dataset performs slightly better than the model trained on the regular dataset. In conclusion, the synthesized dataset is suitable for training the VP network used for inference in the scenes of this work.

    Table 5. Results of Models Trained on the Regular RGB Dataset (Test on RGB Images), Models Trained on the Synthesized RGB Dataset (Test on RGB Images), and Models Trained on the Synthesized RGB Dataset (Test on Grayscale Images)

    Training Dataset | Test Data | PSNR/dB | MS-SSIM | LPIPS
    Regular image sequences (RGB) | RGB | 25.37 | 0.8469 | 0.1397
    Synthesized image sequences (RGB) | RGB | 25.71 | 0.8603 | 0.1379
    Synthesized image sequences (RGB) | Grayscale | 25.78 | 0.8620 | 0.1338

    Note: higher PSNR and MS-SSIM scores and lower LPIPS scores indicate better prediction quality.
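    The PSNR values in Table 5 follow the standard definition based on mean squared error; the following minimal numpy sketch (our own, for reference) shows the computation, while MS-SSIM and LPIPS require the dedicated implementations of Refs. [42] and [43].

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a prediction and its ground truth."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))
```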

    In our hardware experiment, we convert the single-channel grayscale capture into a 3-channel image, as required by the video prediction network, by replicating the single channel three times and concatenating the copies along the channel dimension. The second evaluation therefore concerns the generalization of the DMVFN, which was trained on RGB data but tested on such grayscale data. We use the synthesized test dataset (RGB) mentioned in Section 4 as the evaluation dataset. One version retains the original RGB format and is the dataset used in Section 6 for the simulation test, while the other is read in grayscale and then converted into 3-channel images for the comparison evaluation. Both evaluation datasets are assessed with the same video prediction model trained on the synthesized training dataset (RGB). We compare the PSNR, MS-SSIM, and LPIPS between the two evaluations. The results are summarized in the second and third rows of Table 5 and show only very slight differences between the two input formats on all three metrics. Therefore, the video prediction network generalizes to grayscale data.
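    The grayscale-to-3-channel conversion described above amounts to a single replication step; a minimal sketch (with array names of our own choosing) is given below.

```python
import numpy as np

def gray_to_3channel(gray: np.ndarray) -> np.ndarray:
    """Replicate a single-channel (H, W) capture into an (H, W, 3) input."""
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

gray_frame = np.zeros((720, 1280), dtype=np.uint8)   # e.g., one grayscale capture
rgb_like = gray_to_3channel(gray_frame)
print(rgb_like.shape)                                # (720, 1280, 3)
```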

    APPENDIX C: PROOF DETAIL OF LDR CAPTURES VIA THE POE-VP PROTOTYPE WITH DMD-OFF IN THE HARDWARE EXPERIMENT

    When a fully white pattern is loaded on the DMD, all the micromirrors are switched "on" and tilted to the same angle, reflecting the light toward the sensor. In this case, the DMD behaves like one large mirror composed of identically tilted micromirrors, each reflecting its incident light onto the corresponding pixel of the imaging sensor. We conduct an additional experiment to prove that POE-VP with a fully white DMD pattern is equivalent to a traditional LDR camera, i.e., that the DMD in this non-modulating ("DMD-off") state does not affect LDR image capture. We set up a high-illumination scene using the same LED luminescent array and light box mentioned in Section 7.B, as shown in Fig. 10(a). As shown in Fig. 10(b), the POE-VP prototype (with DMD off) and the LDR camera, using the same type of 8-bit LDR sensor (FLIR GS3-U3-123S6M-C), are placed in parallel facing the same scene with the same exposure time of 15 ms. The LDR images captured by each device are shown in Fig. 10(c). It can be seen that the highlight regions of the two LDR captures are equally saturated. The very subtle differences between the two captures are caused by the slight angle difference and horizontal misalignment between the two devices. For a quantitative comparison of the two LDR captures, we also calculate the information entropy and the local information entropy under the same definitions as in Sections 5.B and 5.C. The information entropy of the POE-VP prototype and the LDR camera is 5.4868 and 5.5181, respectively. The local information entropy maps are compared in Fig. 10(d). Moreover, because the prototype itself captures both the HDR and LDR frames in the hardware experiment, we do not need to be much concerned about misplacement or misalignment between two separate devices with respect to the scene. Based on the above evidence and analysis, we show that our experimental setting provides a fair comparison between our HDR capture and traditional LDR capture.
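    For reference, the global information entropy of an 8-bit capture can be computed from its gray-level histogram as in the following sketch; this assumes the standard Shannon definition and may differ in minor details from the implementation used in Section 5.

```python
import numpy as np

def information_entropy(image_8bit: np.ndarray) -> float:
    """Shannon entropy (in bits) of the gray-level distribution of an image."""
    hist, _ = np.histogram(image_8bit, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

# The two LDR captures would be compared like this:
# print(information_entropy(capture_poe_vp), information_entropy(capture_ldr))
```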


    Figure 10.(a) The high illuminance scene of the additional experiment. (b) The setting of the two compared LDR capture devices, with the POE-VP prototype in the orange dashed box and the LDR camera in the blue dashed box. (c) The LDR images captured by two devices. (d) The local information entropy maps of two LDR captures.

    APPENDIX D: THEORETICAL ANALYSIS ON TIME CONSUMPTION AND REAL-TIME PROPERTY

    As mentioned in the main paper, the limitation on time consumption in our real-scene experiment is mainly caused by the limited running speed of the algorithms and the low data transfer rate. For an engineering implementation, using a field programmable gate array (FPGA) linked to the image sensor and the DMD by a data bus would not only reduce the time cost of data transfer but also speed up the VP and mask-calculation algorithms. In this section, we provide a detailed theoretical calculation of the achievable time consumption for capturing a single HDR frame with the proposed POE-VP under such engineering conditions, from which we derive the estimated theoretical frame rate of POE-VP.

    In our estimate, we select a 32-bit-wide advanced high-performance bus (AHB) with a 200 MHz clock frequency for data transfer between the computing unit and the DMD, as well as between the computing unit and the image sensor. The time consumed by AHB data transfer for a single frame is calculated as $T_{\mathrm{AHB}} = \dfrac{H \cdot W \cdot BD}{f \cdot BW}$, where $H$, $W$, and $BD$ represent the image height, width, and bit depth, respectively, and $f$ and $BW$ stand for the clock frequency and bit width of the AHB. We use the grayscale image format in this calculation, which conforms to the real-scene experiment setup. As a result, the values of $H$, $W$, and $BD$ are 1280, 720, and 8, respectively.

    For the edge computing unit, we adopt the FPGA (Zynq XC7Z035)-based CNN accelerator designed for pixel-to-pixel applications in Ref. [37]. The FPGA provides a computing power of $2.525\times10^{12}$ floating point operations per second (FLOPS). According to our measurement and calculation, the total computational complexity of the algorithms (video prediction + mask calculation) in POE-VP is approximately $10^{10}$ floating point operations (FLOPs). The running time of the algorithms on the FPGA can then be estimated as $T_{\mathrm{FPGA}} = \dfrac{CC}{CP}$, where $CC$ and $CP$ represent the computational complexity and the computing power. As mentioned above, the modulation speed of the DMD, $f_{\mathrm{DMD}}$, is 300 Hz in PWM mode, so the time cost of the DMD is $T_{\mathrm{DMD}} = 1/f_{\mathrm{DMD}} = 3.333\ \mathrm{ms}$.

    The total time consumption is calculated as $T_{\mathrm{total}} = 2T_{\mathrm{AHB}} + T_{\mathrm{FPGA}} + T_{\mathrm{DMD}}$.

    $T_{\mathrm{AHB}}$ is counted twice, corresponding to the data transfers from the image sensor to the FPGA and from the FPGA to the DMD. The final result for $T_{\mathrm{total}}$ is 9.597 ms, and the theoretical frame rate is 104 fps. All notations are summarized in Table 6 with their values and descriptions.

    Table 6. Main Notations for the Calculation of Time Consumption

    Notation | Value | Description
    $H$ | 1280 | Image height
    $W$ | 720 | Image width
    $BD$ | 8 bit | Image bit depth
    $f$ | 200 MHz | Clock frequency of AHB
    $BW$ | 32 bit | Bit width of AHB
    $T_{\mathrm{AHB}}$ | 1.152 ms | Time consumption on AHB
    $CC$ | $10^{10}$ FLOPs | Computational complexity
    $CP$ | $2.525\times10^{12}$ FLOPS | Computing power
    $T_{\mathrm{FPGA}}$ | 3.960 ms | Running time on FPGA
    $f_{\mathrm{DMD}}$ | 300 Hz | Modulation speed of DMD
    $T_{\mathrm{DMD}}$ | 3.333 ms | Time consumption on DMD
    $T_{\mathrm{total}}$ | 9.597 ms | Total time consumption
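    The numbers in Table 6, as well as the final frame rate, can be reproduced with the following short calculation using only the quantities defined above.

```python
# Reproduces the theoretical timing estimate of Appendix D.
H, W, BD = 1280, 720, 8            # image height, width, bit depth
f_ahb, bw_ahb = 200e6, 32          # AHB clock frequency (Hz) and bit width
cc, cp = 1e10, 2.525e12            # algorithm complexity (FLOPs), FPGA power (FLOPS)
f_dmd = 300                        # DMD modulation speed in PWM mode (Hz)

t_ahb = H * W * BD / (f_ahb * bw_ahb)   # single-frame AHB transfer time
t_fpga = cc / cp                        # VP + mask calculation on the FPGA
t_dmd = 1 / f_dmd                       # one DMD refresh period

t_total = 2 * t_ahb + t_fpga + t_dmd    # sensor->FPGA and FPGA->DMD transfers
print(f"T_AHB = {t_ahb*1e3:.3f} ms, T_FPGA = {t_fpga*1e3:.3f} ms, "
      f"T_DMD = {t_dmd*1e3:.3f} ms")
print(f"T_total = {t_total*1e3:.3f} ms  ->  {1/t_total:.0f} fps")
```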

    [1] E. Reinhard, G. Ward, S. Pattanaik. High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting(2010).

    [2] S. K. Nayar, T. Mitsunaga. High dynamic range imaging: spatially varying pixel exposures. IEEE Conference on Computer Vision and Pattern Recognition, 472-479(2000).

    [3] F. Dufaux, P. Le Callet, R. Mantiuk. High Dynamic Range Video: From Acquisition, to Display and Applications(2016).

    [4] Y.-L. Liu, W.-S. Lai, Y.-S. Chen. Single-image HDR reconstruction by learning to reverse the camera pipeline. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1651-1660(2020).

    [5] P. E. Debevec, J. Malik. Recovering high dynamic range radiance maps from photographs. 24th Annual Conference on Computer Graphics and Interactive Techniques, 369-378(1997).

    [6] F. Banterle, A. Artusi, K. Debattista. Advanced High Dynamic Range Imaging(2017).

    [7] S. W. Hasinoff, F. Durand, W. T. Freeman. Noise-optimal capture for high dynamic range photography. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 553-560(2010).

    [9] E. Onzon, F. Mannan, F. Heide. Neural auto-exposure for high-dynamic range object detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7710-7720(2021).

    [10] Ce. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis(2009).

    [12] Q. Yan, D. Gong, Q. Shi. Attention-guided network for ghost-free high dynamic range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1751-1760(2019).

    [13] T. Asatsuma, Y. Sakano, S. Iida. Sub-pixel architecture of CMOS image sensor achieving over 120 dB dynamic range with less motion artifact characteristics. International Image Sensor Workshop, R31(2019).

    [14] S. Iida, Y. Sakano, T. Asatsuma. A 0.68 e-rms random-noise 121 dB dynamic-range sub-pixel architecture CMOS image sensor with LED flicker mitigation. IEEE International Electron Devices Meeting (IEDM), 10.2.1-10.2.4(2018).

    [16] J. Han, C. Zhou, P. Duan. Neuromorphic camera guided high dynamic range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1730-1739(2020).

    [17] C. A. Metzler, H. Ikoma, Y. Peng. Deep optics for single-shot high-dynamic-range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1375-1385(2020).

    [18] Q. Sun, E. Tseng, Q. Fu. Learning rank-1 diffractive optics for single-shot high dynamic range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1386-1396(2020).

    [19] S. Hajisharif, J. Kronander, J. Unger. Adaptive dualiSO HDR reconstruction. EURASIP Journal on Image and Video Processing, 1-13(2015).

    [25] X. Hu, Z. Huang, A. Huang. A dynamic multi-scale voxel flow network for video prediction. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6121-6131(2023).

    [32] E. Reinhard, M. Stark, P. Shirley. Photographic tone reproduction for digital images. Seminal Graphics Papers: Pushing the Boundaries, 2, 661-670(2023).

    [36] D. Doherty, G. Hewlett. 10.4: Phased reset timing for improved digital micromirror device (DMD) brightness. SID Symposium Digest of Technical Papers 29, 125-128(1998).

    [37] W. Sun, C. Tang, Z. Yuan. A 112-765 GOPS/W FPGA-based CNN accelerator using importance map guided adaptive activation sparsification for pix2pix applications. IEEE Asian Solid-State Circuits Conference (A-SSCC), 1-4(2020).

    [40] M. Jaderberg, K. Simonyan, A. Zisserman. Spatial transformer networks. Advances in Neural Information Processing Systems, 1-9(2015).

    [42] Z. Wang, E. Simoncelli, A. Bovik. Multiscale structural similarity for image quality assessment. 37th Asilomar Conference on Signals, Systems & Computers, 2, 1398-1402(2003).

    [43] R. Zhang, P. Isola, A. A. Efros. The unreasonable effectiveness of deep features as a perceptual metric. IEEE Conference on Computer Vision and Pattern Recognition, 586-595(2018).

    Paper Information

    Category: Imaging Systems, Microscopy, and Displays

    Received: Jun. 24, 2024

    Accepted: Aug. 11, 2024

    Published Online: Oct. 31, 2024

    Author Email: Hongwei Chen (chenhw@tsinghua.edu.cn)

    DOI: 10.1364/PRJ.533288
