Advanced Photonics Nexus, Volume 4, Issue 4, 046017 (2025)

High-speed video imaging via multiplexed temporal gradient snapshot

Yifei Zhang1, Xing Liu2, Lishun Wang2, Ping Wang2, Ganzhangqin Yuan1, Mu Ku Chen3, Kui Jiang4, Xin Yuan2,*, and Zihan Geng1,5,*
Author Affiliations
  • 1Tsinghua University, Tsinghua Shenzhen International Graduate School, Shenzhen, China
  • 2Westlake University, Research Center for Industries of the Future and School of Engineering, Hangzhou, China
  • 3City University of Hong Kong, Department of Electrical Engineering, Hong Kong, China
  • 4Harbin Institute of Technology, School of Computer Science and Technology, Harbin, China
  • 5Pengcheng Laboratory, Shenzhen, China

    High-speed imaging is crucial for understanding the transient dynamics of the world, but conventional frame-by-frame video acquisition is limited by specialized hardware and substantial data storage requirements. We introduce “SpeedShot,” a computational imaging framework for efficient high-speed video imaging. SpeedShot features a low-speed dual-camera setup, which simultaneously captures two temporally coded snapshots. Cross-referencing these two snapshots extracts a multiplexed temporal gradient image, producing a compact and multiframe motion representation for video reconstruction. Recognizing the unique temporal-only modulation model, we propose an explicable motion-guided scale-recurrent transformer for video decoding. It exploits cross-scale error maps to bolster the cycle consistency between predicted and observed data. Evaluations on both simulated datasets and real imaging setups demonstrate SpeedShot’s effectiveness in video-rate up-conversion, with pronounced improvement over video frame interpolation and deblurring methods. The proposed framework is compatible with commercial low-speed cameras, offering a versatile low-bandwidth alternative for video-related applications, such as video surveillance and sports analysis.


    1 Introduction

    High-speed imaging provides direct visual information about transient processes such as fluid dynamics and sports movements, but its adoption is limited by complicated hardware, data volume, and high power consumption.1–4 Unlike ordinary cameras, which capture sharp images frame by frame, the method introduced in this paper, named SpeedShot, is a low-bandwidth video imaging approach that reconstructs up to 32 frames of a highly dynamic scene from two coded exposure snapshots. It is compatible with standard low-speed cameras and presents a memory-efficient alternative for capturing transient moments.

    High-speed video reconstruction from low-speed, low-storage captures has been extensively studied. One major research direction focuses on software post-processing. Some methods, known as video frame interpolation (VFI), predict multiple high-speed frames from two consecutive low-speed frames. Based on the assumption of motion continuity, VFI spatially warps the pixels to generate unseen frames. Although optical flow analysis can accurately recover transformations between low-speed frames, its performance is limited by the reliance on precise motion assumptions, which are often impractical for complex, large-scale movements. As a result, the maximum ratio of video rate up-conversion via VFI is below 4× speed-up.

    Advancements in computational imaging have inspired researchers to develop novel hardware systems that capture temporal dynamics more effectively. Video snapshot compressive imaging (VSCI) multiplexes a sequence of video frames, each modulated with temporally varying masks, into a single-shot observation.5–7 Although VSCI can speed up the video frame rate by tens of times, the need for pixel-wise exposure control puts pressure on circuit design and system calibration, limiting its practicality. Another example is the emerging event camera, which employs a dynamic vision sensor (DVS) to record only the pixels whose light intensity changes over time, resulting in much less data to process than a conventional image sensor. However, DVS technology is still in its infancy and carries a high economic cost. It also needs to work in tandem with a traditional RGB camera to restore static textures in the scene.

    In this work, we explore the opportunity of using only low-cost commercial hardware components for high-speed capture and reconstruction. We propose SpeedShot, a temporal-modulation-only framework for computational high-speed video acquisition. As shown in Fig. 1(a), we consider the inherent redundancy of video data and recognize the spatial sparsity of temporal gradient images. The temporal difference between neighboring high-speed video frames reveals the frame-wise object motion and provides crucial information for video reconstruction. At the same time, the frame-wise dynamics are spatially sparse and affect only a few pixels in the temporal gradient image. Thus, several such images can be reasonably multiplexed with minimal information loss, as shown in Fig. 1(b). Observing this, SpeedShot employs a dual-camera system with coded exposure for multiplexed temporal gradient imaging. During one long exposure, the cameras' shutters are programmed with a predefined modulation pattern that alternately switches the input video signal between the two cameras, producing two coded snapshots. The subtraction between the coded exposure images creates a compact multiframe motion representation for video reconstruction. In this way, SpeedShot offers data-efficient and hardware-friendly high-speed video capture. The proposed framework can be easily implemented with two low-speed cameras with built-in or external programmable shutters. We also expect compatibility with other emerging camera systems, where a two-tap sensor enables the acquisition of two temporally weighted images within a single snapshot.8,9


    Figure 1.Working principles for SpeedShot. (a) Temporal gradient (TG) images are sparse motion representations. (b) SpeedShot multiplexes TG images for multiframe motion representations, which assists in high-speed video reconstruction. (c) The proposed framework is compatible with low-end commercial cameras with coded exposure photography.

    The high-speed video reconstruction from SpeedShot measurements involves solving an ill-posed inverse problem. Although several pioneering methods have been proposed for video compressive sensing, they do not consider SpeedShot’s unique temporal-only modulation model. Inspired by the plug-and-play (PnP) framework, we introduce the motion-guided scale-recurrent transformer (MSRT) for effective video restoration. MSRT is structured around two innovative components: 1) motion-guided hybrid enhancement (MoGHE), which extracts and utilizes the motion cues from the observed data to adaptively refine the video restoration process, and 2) an error-aware scale recurrent network, which operates on multiple scales to reconstruct the video, informed by the forward imaging model. This combination allows for the precise reconstruction of static and dynamic regions in the scene.

    The key contributions of this paper are summarized as follows:

    In addition, SpeedShot's performance was rigorously evaluated through extensive experiments on various open video datasets. Comparative analyses with VFI, VSCI, and single-frame deblurring methods highlight SpeedShot's advantages in high-quality, hardware-efficient high-speed imaging.

    2 Related Work

    2.1 Compressive Video Acquisition

    Numerous modalities have been developed to encode high-speed frames into low-speed embeddings with hardware, enhancing temporal resolution.10,11 Predominantly, these methods rely on specialized spatial-temporal modulation devices, such as VSCI and compressed ultrafast photography.12 The associated high costs underscore the importance of developing cost-effective alternatives.13,14 However, current methods still rely on spatial coding, leading to a heavy burden of alignment and calibration between the sensor and the modulator. This paper presents a novel perspective by demonstrating that spatial modulation is not indispensable for achieving high-quality video compressive imaging. We utilize a dual-camera setup, a common configuration in computational imaging,15–17 where the difference between two coded observations significantly improves the temporal decipherability of complex motions.

    2.2 Coded Exposure Photography

    Also known as flutter shutter photography, coded exposure photography (CEP) is a classic method in computational imaging.18–22 In conventional photography, the camera is exposed only once, followed by the sensor read-out. CEP, on the other hand, allows multiple exposures to form a single image, and many modern digital cameras support coded exposure as a built-in function. CEP was originally designed for motion deblurring: research has revealed that multiple exposures alter the shape of the blurring kernel and make motion deblurring easier. Holloway et al.18 combined coded exposure with video frame interpolation, achieving a 7× frame rate up-conversion. However, CEP-based video acquisition is limited to simple translational motion; when the object motion in the scene becomes more complex, the reconstructions suffer from blur and unrealistic artifacts. We show that instead of conducting coded exposure directly in the image domain, conducting temporal multiplexing in the gradient domain is more effective for video restoration.

    3 Multiplexed Temporal Gradient Imaging

    3.1 Motivation

    Our method aims to enhance the efficiency of video acquisition by leveraging a powerful perspective on video data: the temporal gradient (TG), which captures differences between successive frames. As illustrated in Fig. 1, TG images filter out static backgrounds, leaving only motion-related signals. Non-zero pixels in a TG image correspond to moving parts of objects, serving as indicators of the objects’ speed and direction. Previous work in video processing has used TG images as an additional modality to better describe and understand the transient motions.23

    Unlike previous approaches that use TG to capture single motion information,8,23 SpeedShot captures multiplexed TG images for multiframe motion representation. Compared with conventional intensity images, TG images exhibit a high degree of spatial sparsity, making them highly compressible. Over short intervals, motion cues in TG images appear as sharp lines occupying only a few pixels, allowing multiple TG images from different time points to be multiplexed without compromising the separability of motion cues. This approach enables the multiplexed TG image to effectively capture frame-by-frame details of objects’ translations, rotations, and deformations, facilitating high-speed video reconstruction with enhanced accuracy.

    3.2 Observation Formation Model

    SpeedShot simultaneously captures two temporally coded observations for an N-frame video clip, reducing the required hardware frame rate to 1/N. This process can be expressed mathematically as
    $$Y = [Y_{c1};\, Y_{c2}] = \sum_{i=1}^{N} \left[ m_i \cdot X_i;\, (1 - m_i) \cdot X_i \right], \tag{1}$$
    where $Y_{c1}, Y_{c2} \in \mathbb{R}^{n_x \times n_y}$ are the coded snapshots captured by the first and second cameras, and $X_i \in \mathbb{R}^{n_x \times n_y}$ represents the $i$th sharp frame of the N-frame video X.

    The binary vector $m = \{m_i\}_{i=1}^{N} \in \{0,1\}^N$ defines the shutter states of the first camera for each frame, forming the temporal modulation pattern. In m, half of the elements are randomly set to "1", indicating that the frame is exposed, whereas the other half are "0", indicating that it is blocked. The modulation patterns for the two cameras are complementary: each frame $X_i$ contributes to either $Y_{c1}$ or $Y_{c2}$, ensuring full coverage across the frames. The temporal gradient image T of the video clip X can be obtained by calculating the difference between $Y_{c1}$ and $Y_{c2}$:
    $$T = Y_{c1} - Y_{c2} = \sum_{j=1}^{N/2} T_j, \tag{2}$$
    where each $T_j$ represents the difference between two exposed frames, $X_p$ and $X_q$, from the video. This formulation explicitly encodes temporal information across the video clip. The overall process is visualized in Fig. 2, and a simulation sketch of the process is given after Fig. 2.


    Figure 2.(a) Mathematical model for hardware encoding. (b) Visualization of SpeedShot’s splitting an uncoded long exposure image into two coded exposures, Yc1 and Yc2, which leads to a multiplexed TG image T.
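    To make the formation model concrete, the following PyTorch sketch simulates Eqs. (1) and (2) on a synthetic clip. The helper name `simulate_speedshot`, the tensor shapes, and the random test clip are illustrative assumptions rather than part of the actual capture pipeline.

```python
import torch

def simulate_speedshot(x: torch.Tensor, m: torch.Tensor):
    """Simulate SpeedShot's dual coded snapshots and the multiplexed TG image.

    x: (N, H, W) high-speed clip; m: (N,) binary pattern with N/2 ones.
    Returns (Y_c1, Y_c2, T) following Eqs. (1) and (2).
    """
    w = m.view(-1, 1, 1).to(x.dtype)
    y_c1 = (w * x).sum(dim=0)          # frames exposed on camera 1
    y_c2 = ((1 - w) * x).sum(dim=0)    # complementary frames on camera 2
    t = y_c1 - y_c2                    # multiplexed temporal gradient image
    return y_c1, y_c2, t

if __name__ == "__main__":
    N, H, W = 8, 64, 64
    clip = torch.rand(N, H, W)                    # stand-in for a high-speed clip
    m = torch.zeros(N)
    m[torch.randperm(N)[: N // 2]] = 1.0          # random pattern, half exposed
    y1, y2, tg = simulate_speedshot(clip, m)
    print(y1.shape, y2.shape, tg.abs().mean())
```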

    4 MSRT for Software Decoding

    MSRT takes the dual observation as input and reconstructs a multiframe video. As shown in Fig. 3, MSRT follows a scale-recurrent structure. At each stage, the network mainly consists of an error-aware cross-scale fusion (ECSF) module and N MoGHE blocks. Two simple modules, each with three stacked Conv3D layers activated by ReLU, are used for shallow feature extraction (FE) and video restoration (VR), respectively.


    Figure 3.Overview of MSRT. A three-level snapshot pyramid is first generated. For each pyramid level k, the dual observation Y(k) passes through the network to obtain a predicted video X^(k) and a refined feature F^(k). Guided by the error E(k) between X^(k) and Y(k), F^(k) is then fused into the next-level reconstruction on a larger scale. The network is recurrently passed through three iterations.

    4.1 Scale-Recurrent Unfolding Network

    The frame-wise temporal information is explicitly encoded in the dual observations and their difference image. To better use this information, it is crucial to enhance the consistency between the reconstruction and the dual observations so that the reconstructed video has a similar motion pattern to what is recorded in the temporal gradient image. Addressing this, a scale-recurrent unfolding network (SRUN) is proposed. Video estimation from the observation $Y = [Y_{c1};\, Y_{c2}]$ is treated as an optimization problem24,25 as in Eq. (3):
    $$\hat{X} = \arg\min_{X} \frac{1}{2} \left\| Y - MX \right\|_2^2 + \lambda R(X). \tag{3}$$
    The first term controls data fidelity, the similarity between the prediction and the observation with coding $M = [m;\, 1-m]$. The second term $\lambda R(X)$ is the prior regularization.

    To solve Eq. (3), we unfold the optimization problem with the plug-and-play (PnP) framework,6,25,26 which leads to the alternating iterative update in Eq. (4), where G and R are responsible for the data-fidelity and prior updates, respectively:
    $$\hat{X}^{(k)} = R\!\left[ G\!\left( M, \hat{X}^{(k-1)}, Y \right) \right]. \tag{4}$$

    Traditional compressive sensing frameworks often necessitate a uniform input form throughout iterative processes. Breaking from this convention, MSRT facilitates a unique coarse-to-fine reconstruction approach. Owing to the absence of spatial modulation in its forward model, both G and R are spatially invariant. This allows for their approximation through upscaling and downscaling operations. Such flexibility enables MSRT to manage varying reconstruction levels with multiscale processing.

    In MSRT, the input Y is down-sampled into a three-level pyramid, denoted as $[Y^{(0)}, Y^{(1)}, Y^{(2)}]$, for multiscale analysis. The update equation [Eq. (4)] becomes
    $$\hat{X}^{(k)} = R\!\left[ G\!\left( M, \hat{X}^{(k-1)}, Y^{(k-1)}, Y^{(k)} \right) \right],$$
    where $\hat{X}^{(k)}$ is the kth-level down-scaled estimate of X, the scaling factor is 2, and $k \in \{0, 1, 2\}$. Starting from the smallest scale $Y^{(0)}$, G not only ensures the consistency between $\hat{X}^{(k-1)}$ and $Y^{(k-1)}$ but also incorporates details from $Y^{(k)}$ for enhanced resolution at the larger scale. Meanwhile, R progressively refines the features at each resolution, leading to a finely detailed prediction. The output at the last scale, $\hat{X}^{(2)}$, is the final video reconstruction.

    G and R, unlike in other deep unfolding networks, are recurrently used across all stages with shared weights for parameter efficiency and training effectiveness. Addressing SpeedShot's dual-input nature, we introduce forward and backward projection operators P and $P^{-1}$. These operators link the observation and video domains, assisting in initial prediction and data-consistency evaluation:
    $$Y = P(X) = \mathrm{Concat}\!\left[ m X,\, (1-m) X \right],$$
    $$P^{-1}(Y) = \mathrm{Concat}\!\left[ m^{\dagger} Y_{c1},\, (1-m)^{\dagger} Y_{c2} \right],$$
    where $\dagger$ denotes the Moore–Penrose inverse of the coding patterns. $P^{-1}$ is a least-squares estimate of the original video content. For instance, as the starting point of the iteration, $X^{(0)}$ is initialized as $P^{-1}(Y^{(0)})$, whereas at the end of each iteration, an error map is obtained as $E^{(k)} = P(\hat{X}^{(k)}) - Y^{(k)}$.
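    As a minimal sketch of how these projections could be realized in PyTorch, the helpers below implement the forward operator, a least-squares back-projection, and the error map. The function names and the 1/(N/2) normalization assumed for the pseudoinverse of the binary summation pattern are illustrative; the authors' implementation may differ.

```python
import torch

def P(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Forward projection: video (N, H, W) -> stacked observation (2, H, W)."""
    w = m.view(-1, 1, 1).to(x.dtype)
    return torch.stack([(w * x).sum(0), ((1 - w) * x).sum(0)], dim=0)

def P_inv(y: torch.Tensor, m: torch.Tensor, n_frames: int) -> torch.Tensor:
    """Least-squares back-projection (2, H, W) -> video estimate (N, H, W).

    For a binary summation pattern, the Moore-Penrose inverse spreads each
    snapshot back over its exposed frames with a 1/(N/2) weight (assumed
    normalization).
    """
    w = m.view(-1, 1, 1).to(y.dtype)
    return (w * y[0] + (1 - w) * y[1]) / (n_frames // 2)

def error_map(x_hat: torch.Tensor, y: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Error map E = P(X_hat) - Y used to guide the next scale's refinement."""
    return P(x_hat, m) - y
```

    With these helpers, the iteration would start from `P_inv(Y0, m, N)` as the initial video estimate, and the per-scale error map is simply the discrepancy between the re-projected prediction and the observed snapshots.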

    In the following section, we describe the key components for G and R, namely, error-aware cross-scale fusion and motion-guided hybrid enhancement.

    4.2 Error-Aware Cross-Scale Fusion (ECSF)

    ECSF is designed to supervise data consistency in the feature domain, incorporating a cross-channel-attention mechanism to balance physics-based and learned priors across scales. As shown in Fig. 4, ECSF processes shallow features F(k) from the current pyramid level and upscaled features F^(k−1) from the previous stage. These features are split and partially mixed via transposed attention, guided by the spatial error map E(k−1) that encodes discrepancies between the inputs and predictions from the last stage. This map, after processing through a ConvBlock (Conv3D-LReLU-Conv3D), is combined with F^(k−1) for refinement. The attention mechanism then uses the refined F^(k−1) for query Q generation, and F(k) for the key K and value V. The mixed features, further processed with an FFN and concatenated with the unaltered halves of the splits, pass through a 1×1×1 convolution to yield the fused output Fout(k). A minimal sketch of this fusion step is given after Fig. 4.


    Figure 4.Details for the ECSF. Feature Fin(k) from the feature extraction block is fused with F(k−1), yielding a fused feature Fout(k).
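    The sketch below gives one possible reading of the ECSF block in PyTorch: a single-head transposed (channel) attention whose query comes from the error-refined previous-stage features and whose key and value come from the current-scale features, with half of the channels bypassing the attention. The class name `ECSFSketch`, the channel split, the FFN width, and the absence of normalization layers are assumptions; only the overall flow follows the description above.

```python
import torch
import torch.nn as nn

class ECSFSketch(nn.Module):
    """Minimal single-head sketch of error-aware cross-scale fusion (assumptions noted above)."""

    def __init__(self, c: int):
        super().__init__()
        assert c % 2 == 0
        half = c // 2
        self.err_conv = nn.Sequential(              # ConvBlock: Conv3D-LReLU-Conv3D
            nn.Conv3d(2, c, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv3d(c, c, 3, padding=1),
        )
        self.to_q = nn.Conv3d(half, half, 1)
        self.to_k = nn.Conv3d(half, half, 1)
        self.to_v = nn.Conv3d(half, half, 1)
        self.temperature = nn.Parameter(torch.ones(1))
        self.ffn = nn.Sequential(nn.Conv3d(half, 2 * half, 1), nn.GELU(),
                                 nn.Conv3d(2 * half, half, 1))
        self.fuse = nn.Conv3d(3 * half, c, 1)       # final 1x1x1 fusion

    def forward(self, f_cur, f_prev_up, err):
        # f_cur, f_prev_up: (B, C, T, H, W); err: (B, 2, 1, H, W) error map from the last stage
        f_prev_up = f_prev_up + self.err_conv(err)          # error-guided refinement
        a_cur, b_cur = f_cur.chunk(2, dim=1)                # half attends, half bypasses
        a_prev, b_prev = f_prev_up.chunk(2, dim=1)

        b, ch, t, h, w = a_cur.shape
        q = self.to_q(a_prev).flatten(2)                    # (B, C/2, THW)
        k = self.to_k(a_cur).flatten(2)
        v = self.to_v(a_cur).flatten(2)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.temperature, dim=-1)  # transposed attention
        mixed = (attn @ v).view(b, ch, t, h, w)
        mixed = mixed + self.ffn(mixed)                     # feed-forward refinement
        return self.fuse(torch.cat([mixed, b_cur, b_prev], dim=1))
```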

    4.3 Motion-Guided Hybrid Enhancement

    4.3.1 Motion-guided space-channel selection (MGSCS)

    MGSCS enhances the motion-static relationship in video features. This process involves calculating the motion prior $M \in \mathbb{R}^{C \times 1 \times H \times W}$ from the feature $F \in \mathbb{R}^{C \times T \times H \times W}$, defined as
    $$M = \frac{\mathrm{std}(F)}{1 + \mathrm{abs}\!\left[\mathrm{mean}(F)\right]}.$$
    M can be seen as the feature-domain equivalent of the difference between $Y_{c1}$ and $Y_{c2}$. As shown in Fig. 5, the extracted motion priors are processed through a ConvBlock and transformed into a spatial-channel attention map.27–29 This map selectively enhances important regions of the feature, both spatially and channel-wise. An additional Inception-like branch is incorporated to prevent blurriness.30,31
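    A direct translation of the motion-prior formula into PyTorch might look as follows; reducing the statistics over the temporal axis is an assumption implied by the stated shapes (C×T×H×W to C×1×H×W).

```python
import torch

def motion_prior(f: torch.Tensor) -> torch.Tensor:
    """Motion prior M = std(F) / (1 + |mean(F)|), reduced over the frame axis.

    f: feature tensor of shape (B, C, T, H, W); returns (B, C, 1, H, W).
    """
    mean = f.mean(dim=2, keepdim=True)      # temporal mean (static content)
    std = f.std(dim=2, keepdim=True)        # temporal variation (motion cue)
    return std / (1.0 + mean.abs())
```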

    4.3.2 Dense pyramid temporal self-attention (DPTSA)

    DPTSA builds upon the temporal self-attention (TSA) mechanism.32,33 To overcome TSA's limitation of a small receptive field, DPTSA incorporates atrous spatial pyramid pooling (ASPP)34 to enable hierarchical spatial interactions. An input feature $L \in \mathbb{R}^{C \times T \times H \times W}$ is divided into three groups along the channel dimension, forming $[L_0, L_1, L_2]$ with each $L_i \in \mathbb{R}^{C/3 \times T \times H \times W}$. At each level, $L_i$ is fused with the outputs from previous levels via a 1×1×1 convolution, followed by a dilated residual block. The dilation rates are set to [4, 2, 1] for the different scales, facilitating a broad range of spatial interactions. Subsequently, each level undergoes temporal self-attention and a 3D feed-forward operation via the TSA block. The process concludes with the concatenation and fusion of the features from all levels, again using a 1×1×1 convolution for comprehensive feature integration.
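    A compact sketch of how such a dense pyramid of dilated convolutions and temporal self-attention could be assembled is given below. The class names, block depths, and the use of `nn.MultiheadAttention` as a stand-in for the TSA block are assumptions; only the three-way channel split, the dilation rates [4, 2, 1], the dense fusion with earlier levels, and the final 1×1×1 fusion are taken from the description above.

```python
import torch
import torch.nn as nn

class TSABlock(nn.Module):
    """Temporal self-attention over the frame axis, applied per spatial location."""

    def __init__(self, c: int, heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Conv3d(c, 2 * c, 1), nn.GELU(),
                                 nn.Conv3d(2 * c, c, 1))

    def forward(self, x):                                   # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        y = x + out                                         # temporal self-attention
        return y + self.ffn(y)                              # 3D feed-forward

class DPTSASketch(nn.Module):
    """Minimal sketch of dense pyramid temporal self-attention (assumptions noted above)."""

    def __init__(self, c: int):
        super().__init__()
        assert c % 3 == 0
        g = c // 3                                          # channels per pyramid level
        self.fuse_in = nn.ModuleList([nn.Conv3d(g * (i + 1), g, 1) for i in range(3)])
        self.dilated = nn.ModuleList([
            nn.Conv3d(g, g, 3, padding=(1, d, d), dilation=(1, d, d))
            for d in (4, 2, 1)])
        self.tsa = nn.ModuleList([TSABlock(g) for _ in range(3)])
        self.fuse_out = nn.Conv3d(3 * g, c, 1)

    def forward(self, x):                                   # x: (B, C, T, H, W)
        levels = x.chunk(3, dim=1)
        outs = []
        for i, level in enumerate(levels):
            dense = torch.cat([level] + outs, dim=1)        # dense fusion with earlier levels
            f = self.fuse_in[i](dense)
            f = f + self.dilated[i](f)                      # dilated residual block
            outs.append(self.tsa[i](f))                     # TSA + feed-forward
        return self.fuse_out(torch.cat(outs, dim=1))        # 1x1x1 fusion of all levels
```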

    5 Experiments

    5.1 Simulated Experiments

    We experimentally demonstrate the efficacy of the SpeedShot framework. The framework is implemented in PyTorch, and we adopt the data augmentation scheme from BIRNAT.35 Paired observations and ground truths were generated following Eq. (1). The networks are initially trained on the DAVIS2017 dataset36 for 300 epochs using the Adam optimizer with a batch size of 4. The base model, aimed at 8× restoration, started with a learning rate of $10^{-4}$, gradually reduced to $10^{-5}$ with cosine annealing. Subsequently, the model is fine-tuned on the Adobe240 dataset37 for 8×, 16×, and 32× restoration over 30 epochs with a learning rate of $10^{-5}$. The training objective is to minimize the mean absolute error (MAE).
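    For reference, a minimal training skeleton matching the stated schedule might look as follows; `model` and `train_loader` are placeholders, and details such as the BIRNAT-style augmentation and fine-tuning stage are omitted.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 300, device: str = "cuda"):
    """Sketch: Adam, lr 1e-4 annealed to 1e-5 with cosine annealing, MAE loss."""
    model = model.to(device)
    criterion = nn.L1Loss()                                   # mean absolute error
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-5)

    for epoch in range(epochs):
        for y, x_gt in train_loader:                          # coded snapshots, ground-truth clip
            y, x_gt = y.to(device), x_gt.to(device)
            loss = criterion(model(y), x_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                      # anneal the learning rate per epoch
```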

    5.1.1 Comparison with VFI in multi-frame restoration

    We evaluate SpeedShot's video speed-up capability on two widely used video benchmarks, Adobe240 and GoPro.38 State-of-the-art image-only39–42 and event-based43–45 VFI methods are selected for comparison. For a fair assessment, videos are segmented into N-frame clips corresponding to the intended speed-up rate, with the VFI methods tasked with predicting N−1 frames of each clip. For the VFI methods, the inputs are the first and final sharp frames extracted from the high-speed video clips, from which intermediate frames are interpolated. For the event-based VFI methods, event data cubes are also provided to supply additional temporal information. The quantitative results are reported in Table 1. The proposed model and the VFI methods are trained on the Adobe240 dataset and evaluated on both datasets. The results of the three event camera-assisted methods are taken directly from their original papers. MSRT shows a significant performance gap (6–11 dB) over frame-based VFI. Given that SpeedShot doubles the data throughput by introducing an extra camera, we also address the cross-compression-rate comparison (e.g., 32× SpeedShot versus 16× VFI). MSRT's superiority is still pronounced, with up to 7 dB improvements. Visual comparisons in Figs. 6 and 7 show that VFI reconstructions either blur out or deviate by a few pixels from the ground truth, whereas SpeedShot correctly restores the fine textures and details.


    Figure 5.Details for the motion-guided hybrid enhancement (MoGHE) block.


    Figure 6.Example of SpeedShot’s reconstruction from Adobe240. With a pair of simultaneously taken observations, SpeedShot records and restores tens of frames of a dynamic scene with nonlinear motions. Interpolation methods often fail in such scenarios due to a lack of pixel correspondence between the first and the last frames. (a) Input. (b) Reconstructions.


      Table 1. Results for multiframe restoration on Adobe240 and GoPro. Best results are in bold. Note that known sharp frames are not included in the metrics for VFI methods, whereas all frames are considered for MSRT.

      Method | Network inputs | Adobe240 8× | Adobe240 16× | Adobe240 32× | GoPro 8× | GoPro 16× | GoPro 32× | Parameters (M)/Runtime (s)
       | | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM |
      SuperSloMo39 | The first and the last clear images | 28.64/0.884 | 22.80/0.728 | 20.86/0.602 | 28.98/0.875 | 24.38/0.747 | 20.45/0.618 | 39.6/0.38
      IFRNet41 | The first and the last clear images | 30.49/0.916 | 25.59/0.807 | 21.43/0.653 | 29.81/0.893 | 24.38/0.746 | 20.69/0.620 | 19.7/0.42
      GiMM-VFI42 | The first and the last clear images | 33.00/0.939 | 26.83/0.837 | 21.93/0.663 | 30.31/0.894 | 25.03/0.752 | 21.56/0.638 | 30.6/0.93
      UPR-Net40 | The first and the last clear images | 32.34/0.934 | 26.46/0.823 | 21.84/0.661 | 29.69/0.885 | 24.86/0.749 | 21.50/0.635 | 3.7/0.76
      TimeReplayer44 | The first and the last clear images + event data stream | 34.14/0.950 | — | — | 34.02/0.960 | — | — | —
      TimeLens*43 | The first and the last clear images + event data stream | 34.45/0.951 | — | — | 34.81/0.959 | 33.21/0.942 | — | 72.2/—
      A2OF45 | The first and the last clear images + event data stream | 36.59/0.960 | — | — | 36.61/0.971 | — | — | —
      MSRT (Ours) | Two coded blurry images | 39.08/0.983 | 35.22/0.968 | 33.26/0.958 | 39.42/0.984 | 36.44/0.968 | 32.54/0.939 | 14.9/0.74


    Figure 7.Visual comparison of the GoPro dataset at 8× speed-up.

    The computational efficiency of our MSRT and the other VFI methods, in terms of model size and inference speed, is compared on 640×480 input images using an RTX 3090 GPU. Our MSRT contains 14.9 million parameters and takes 0.74 s per inference. The numbers of parameters and inference times of the other methods are listed in Table 1. Our MSRT offers superior reconstruction quality with a reasonably lightweight architecture, achieved by reusing the same network module across multiple iterations. Future improvements in real-time performance can be anticipated through model compression techniques such as pruning and quantization.
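    A rough way to reproduce this kind of parameter-count and runtime comparison is sketched below; the `benchmark` helper and the dual-snapshot input layout are illustrative assumptions, not the authors' evaluation code.

```python
import time
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module, device: str = "cuda", repeats: int = 10):
    """Report parameter count (M) and mean inference time on a 640x480 input."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    dummy = torch.randn(1, 2, 480, 640, device=device)        # assumed dual coded observation
    model(dummy)                                              # warm-up pass
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{n_params:.1f} M parameters, {(time.time() - start) / repeats:.2f} s per inference")
```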

    The event camera and SpeedShot are theoretically similar. Event cameras record pixel intensity changes as sparse data, whereas SpeedShot captures these changes through spatial multiplexing. Experimental results show that SpeedShot outperforms event-assisted methods by 1.6 dB. This improvement is due to the binary quantization in event cameras, which loses substantial information during event generation. In contrast, SpeedShot preserves full precision in recording intensity changes, resulting in more detailed data.

    5.1.2 Comparison with image deblurring

    The SpeedShot system simultaneously captures two coded blurry exposures, so multiframe decoding can be interpreted as a more challenging form of image deblurring. From this perspective, we compare SpeedShot's performance with several SOTA deblurring methods46–50 on the GoPro dataset. These methods take as input a single uncoded long-exposure blurry frame generated by averaging high-speed frames. As shown in Table 2, although MSRT faces the more challenging task of restoring all N latent frames, it still delivers an additional 0.5 to 3.5 dB improvement over the SOTA event-based deblurring methods. It also outperforms the best CEP-based deblurring method by 4.4 dB, validating the effectiveness of recording the multiframe motion dynamics in the temporal gradient domain.


      Table 2. Comparison with image deblurring methods on GoPro. Best result is in bold, second best in italic.

      Method | Input | Amplitude | Output | PSNR/SSIM
      NAFNet51 | 1 blurry frame | 7 to 13 frames | 1 sharp frame | 33.69/0.967
      Restormer46 | 1 blurry frame | 7 to 13 frames | 1 sharp frame | 32.92/0.961
      Stripformer47 | 1 blurry frame | 7 to 13 frames | 1 sharp frame | 33.08/0.962
      DeepCE21 | 1 coded frame | 32 frames | 1 sharp frame | 28.10/0.8627
      EFNet48 | 1 blurry frame + event frames | 7 to 13 frames | 1 sharp frame | 35.46/0.972
      REFID49 | 1 blurry frame + event frames | 7 to 13 frames | 1 sharp frame | 35.91/0.973
      MSRT (Ours) | 2 coded frames | 8 frames | 8 sharp frames | 39.42/0.984
      MSRT (Ours) | 2 coded frames | 16 frames | 16 sharp frames | 36.44/0.968
      MSRT (Ours) | 2 coded frames | 32 frames | 32 sharp frames | 32.54/0.939

      Table 3. Comparison for 8× video reconstruction on Set6. Best results are in bold, second best in italic.

      Method | Input | PSNR | SSIM
      UPR-Net40 | VFI | 31.08 | 0.872
      IFRNet41 | VFI | 28.50 | 0.813
      EfficientSCI33 | VSCI | 35.43 | 0.959
      MSRT (Ours) | SpeedShot | 34.72 | 0.925

    5.1.3 Comparison with VSCI

    The proposed SpeedShot framework has a close relationship with VSCI.6,7,33,52 Both methods reconstruct multiple frames from a highly compressed measurement. Although VSCI employs both spatial and temporal dimensions for video compressive encoding, SpeedShot works only on the temporal dimension. We test the 8× speed-up performance of SpeedShot on Set6,53 a widely used dataset for VSCI. The SOTA method, EfficientSCI,33 is chosen for comparison. As shown in Table 3, SpeedShot achieves comparable performance. As visualized in Fig. 8, both methods recover the details of motion qualitatively and quantitatively. SpeedShot even outperforms VSCI in scenes with linear motion, e.g., "Traffic."


    Figure 8.Selected reconstructions of Set6. EfficientSCI is the SOTA VSCI method.

    5.2 Optical Prototype

    To validate SpeedShot's practicality, we construct two optical prototypes with commercial low-speed cameras. The first prototype, shown in Fig. 1(c), utilizes dual RGB cameras [OmniVision OV4689, 4 MP, 60 frames per second (fps)] operating in multi-exposure mode. Each camera's sensor combines four coded frames into a single observation at 7.5 fps. Spatial alignment is achieved using homography-based calibration with a checkerboard pattern, ensuring accurate registration of the paired observations. The MSRT network reconstructs a 60 fps video from these coded observations, as shown in Fig. 9. SpeedShot manages to recover the nonlinear moving trajectory of the bottle while reducing the motion blur in the observations.


    Figure 9.Paired observations from our SpeedShot dual RGB camera prototype, and the corresponding 8× reconstructions. The temporal gradient image highlights the frame-wise object motion and reflects the trajectory of the movement.

    The second prototype, depicted in Fig. 10, employs an advanced optical setup with an external shutter (Fldiscovery F4320) for precise temporal encoding. The imaging system splits the light path into two using a beam splitter (Lbtek MBS1455-A, R:T = 5:5). The first path captures high-speed dynamic scenes directly on the Blurry Cam (HIKROBOT MV-CS020-10UC, 1624 × 1240 pixels) through a Nikon Macro lens (L1, 35 mm). The second path encodes the scenes using the reflective shutter. The encoded light passes through a Coolens WWK10-110-111 lens [L2, field of view (FOV): 18 mm] and a high-performance lens (L3) before being captured by the Coded Cam (HIKROBOT MV-CA050-20UM, 2591 × 2048 pixels). Both cameras are triggered simultaneously after encoding 16 shutter frames at a 960 Hz frame rate. The system is synchronized via a laptop-controlled Arduino microcontroller. Temporal calibration of this setup involves synchronizing the shutter and cameras, verified using a high-frequency LED signal to ensure precise modulation timing. This setup reconstructs a 960 fps video from 60 fps observations, as shown in Fig. 11. Although the captured image is severely blurred, the proposed MSRT can still reconstruct the intermediate frames of the abrupt process correctly with fine details.


    Figure 10.Imaging prototype with an external shutter for optical modulation. It accelerates a 60 Hz camera to 960 Hz.


    Figure 11.Inputs and reconstructions from the SpeedShot prototype with an external mechanical shutter at 960 Hz. One camera captures a temporally coded observation, whereas the other captures a blurry, uncoded image. Subtracting the coded observation from the blurry one yields an additional coded observation, enabling 960 fps video reconstruction.

    5.3 Ablation Study

    5.3.1 Multiplexed temporal gradient imaging

    A key contribution of our paper is the introduction of a second camera and the execution of CEP in the gradient domain. To validate this, an adjusted version of MSRT is implemented, taking two consecutive coded exposure snapshots from a single camera as input and predicting the N frames corresponding to one coded snapshot. The result on 8× Set6 is 26.58 dB, an 8 dB decrease in PSNR. As shown in Fig. 12, the existing CEP struggles to distinguish nonlinear motion cues from the static textures of the scene, leading to blurry predictions.18,54 In contrast, our dual-camera design resolves this ambiguity, enabled by the explicitly defined and extracted motion map.


    Figure 12.Comparison with previous single-camera CEP.

    5.3.2 Coding pattern selection

    We recommend using random asymmetric coding patterns M. As shown in Table 4, the ablation results are averaged over 10 different patterns to ensure statistical reliability. Each pattern is generated by randomly shuffling a binary sequence with half zeros and half ones. Pattern selection has minimal impact on reconstruction performance,22 as the variations in PSNR are relatively small. However, symmetric patterns can introduce ambiguity in determining the motion direction and lead to blur.54 A sketch of this pattern-generation procedure is given after Table 4.


      Table 4. Analysis of coding pattern selection on Set6.

      Coding length | PSNR (dB) | Standard deviation of PSNR across patterns
      8× | 34.08 | 0.45
      8×, symmetric | 26.13 | —
      16× | 29.17 | 0.31
      32× | 25.25 | 0.27
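    A sketch of how such random asymmetric patterns could be drawn is given below; rejecting temporally palindromic sequences is one illustrative way to enforce asymmetry, as the paper does not specify the exact test, and the helper name is hypothetical.

```python
import torch

def random_asymmetric_pattern(n: int, max_tries: int = 100) -> torch.Tensor:
    """Draw a random binary pattern with n/2 ones, rejecting symmetric ones."""
    for _ in range(max_tries):
        m = torch.zeros(n)
        m[torch.randperm(n)[: n // 2]] = 1.0     # shuffle half zeros and half ones
        if not torch.equal(m, m.flip(0)):        # reject temporally symmetric patterns
            return m
    raise RuntimeError("could not draw an asymmetric pattern")

# Example: an 8x pattern for camera 1; camera 2 uses the complement 1 - m.
m = random_asymmetric_pattern(8)
print(m, 1 - m)
```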

    5.3.3 Noise robustness

    Using the camera noise model,55 we train and test three MSRT variants, each considering different noise sources. As shown in Table 5, the results demonstrate that the proposed SpeedShot framework is resilient to common noise, maintaining relatively high performance and making it practical for real-world applications.


      Table 5. Analysis of noise and calibration robustness on 8× Set6.

      Read noise | Shot noise | Timing jitter | PSNR/SSIM
      34.15/0.904
      34.25/0.912
      33.82/0.896
      0% to 1% | 34.27/0.921
      0% to 3% | 33.52/0.916

      Table 6. Ablation on MSRT network design. Best result is in bold.

      SRUN | Error-aware | Motion-guidance | PSNR/SSIM
      33.44/0.915
      33.76/0.917
      34.05/0.920
      34.40/0.923
      34.27/0.921
      34.72/0.925

    5.3.4 Mis-calibration robustness

    To ensure reliable performance in practical settings, we evaluate SpeedShot’s robustness to hardware calibration errors. A key advantage of SpeedShot is its robustness to spatial calibration errors, as it does not require pixel-wise alignment between the two cameras. Global alignment, achieved via homography-based calibration, is sufficient. For temporal calibration errors, we introduce timing jitter as temporal noise by blending each captured high-speed frame with adjacent frames using random weights of 0% to 1% and 0% to 3% to mimic inaccurate shutter modulation or synchronization errors. The results show negligible degradation, with a PSNR reduction of less than 1 dB for 3% jitter, demonstrating robust performance against minor temporal noise.
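    The jitter simulation described above could be sketched as follows; the per-frame weighting scheme is an assumption consistent with the stated blending of each captured frame with its neighbors using random weights up to the given ratio.

```python
import torch

def add_timing_jitter(clip: torch.Tensor, max_ratio: float = 0.03) -> torch.Tensor:
    """Simulate shutter-timing jitter by blending each frame with its neighbors.

    clip: (N, H, W). Each frame receives random fractions (up to max_ratio)
    of the previous and next frames, mimicking inaccurate modulation timing.
    """
    n = clip.shape[0]
    out = clip.clone()
    for i in range(n):
        w_prev = torch.rand(1).item() * max_ratio
        w_next = torch.rand(1).item() * max_ratio
        prev_f = clip[max(i - 1, 0)]
        next_f = clip[min(i + 1, n - 1)]
        out[i] = (1 - w_prev - w_next) * clip[i] + w_prev * prev_f + w_next * next_f
    return out
```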

    5.3.5 MSRT network design

    We first analyze MSRT's improvement over simple convolutional networks. We implement a U-net model with residual 3D convolutional blocks, which has a similar number of parameters to our MSRT (14.9 M). The result on 8× Set6 is 31.65 dB, an over 3 dB decrease in PSNR. Then, the MSRT ablation involves retraining the network on a single scale and replacing the error map and motion prior with zero tensors. The results in Table 6 show that incorporating the scale-recurrent unfolding structure yields the most significant improvement, enhancing PSNR by 0.6 dB. In addition, motion guidance from the MGSCS block and error-aware fusion from the ECSF block contribute PSNR increases of 0.4 and 0.2 dB, respectively, validating the effectiveness of the proposed network design.

    6 Conclusion

    This paper presents SpeedShot, a compressive high-speed imaging framework that employs dual coded-exposure cameras for multiplexed temporal gradient imaging. This approach offers a viable alternative to existing VFI and VSCI methods and is especially advantageous for popular dual-camera devices such as smartphones. The proposed method shows strong potential for real-life applications. In consumer-level scenarios such as surveillance, sports analysis, and video analytics, SpeedShot can reduce the required capture frame rate, relaxing the demands on lighting conditions, data bandwidth, and storage. This makes high-speed video analysis more accessible on low-power and low-cost hardware. In high-speed scientific imaging, SpeedShot has the potential to boost the effective frame rate by tens of times, offering an affordable alternative to specialized high-speed cameras. Future work will focus on more challenging scenarios, such as handling back-and-forth motion and imaging in high-noise environments.

    Acknowledgments

    This work was supported by the National Natural Science Foundation of China (Grant No. 62305184), the Basic and Applied Basic Research Foundation of Guangdong Province (Grant No. 2023A1515012932), the Science, Technology, and Innovation Commission of Shenzhen Municipality (Grant No. JCYJ20241202123919027), the Major Key Project of Pengcheng Laboratory (Grant No. PCL2024A1), the Science Fund for Distinguished Young Scholars of Zhejiang Province (Grant No. LR23F010001), the Research Center for Industries of the Future (RCIF) at Westlake University, the Key Project of Westlake Institute for Optoelectronics (Grant No. 2023GD007), the Zhejiang "Pioneer" and "Leading Goose" R&D Program (Grant Nos. 2024SDXHDX0006 and 2024C03182), the Ningbo Science and Technology Bureau "Science and Technology Yongjiang 2035" Key Technology Breakthrough Program (Grant No. 2024Z126), the Research Grants Council of the Hong Kong Special Administrative Region, China (Grant Nos. C5031-22G, CityU11310522, and CityU11300123), and the City University of Hong Kong (Grant No. 9610628).

    Biographies of the authors are not available.

    [32] Is space-time attention all you need for video understanding? (2021).

    [36] The 2017 DAVIS challenge on video object segmentation (2017).

    Paper Information

    Category: Research Articles

    Received: Mar. 26, 2025

    Accepted: Jul. 4, 2025

    Published Online: Aug. 15, 2025

    The Author Email: Xin Yuan (xyuan@westlake.edu.cn), Zihan Geng (geng.zihan@sz.tsinghua.edu.cn)

    DOI:10.1117/1.APN.4.4.046017

    CSTR:32397.14.1.APN.4.4.046017
