Advanced Photonics Nexus, Volume 4, Issue 4, 046017 (2025)
High-speed video imaging via multiplexed temporal gradient snapshot
High-speed imaging is crucial for understanding the transient dynamics of the world, but conventional frame-by-frame video acquisition is limited by specialized hardware and substantial data storage requirements. We introduce “SpeedShot,” a computational imaging framework for efficient high-speed video imaging. SpeedShot features a low-speed dual-camera setup, which simultaneously captures two temporally coded snapshots. Cross-referencing these two snapshots extracts a multiplexed temporal gradient image, producing a compact and multiframe motion representation for video reconstruction. Recognizing the unique temporal-only modulation model, we propose an explicable motion-guided scale-recurrent transformer for video decoding. It exploits cross-scale error maps to bolster the cycle consistency between predicted and observed data. Evaluations on both simulated datasets and real imaging setups demonstrate SpeedShot’s effectiveness in video-rate up-conversion, with pronounced improvement over video frame interpolation and deblurring methods. The proposed framework is compatible with commercial low-speed cameras, offering a versatile low-bandwidth alternative for video-related applications, such as video surveillance and sports analysis.
1 Introduction
High-speed imaging has the advantage of giving direct visual information about transient processes such as fluid dynamics and sports movements, but its adoption is limited by complicated hardware, data volume, and high power needs.1
High-speed video reconstruction from low-speed, low-storage captures has been extensively studied. One major research direction focuses on software post-processing. Some methods, known as video frame interpolation (VFI), predict multiple high-speed frames from two consecutive low-speed frames. Based on the assumption of motion continuity, VFI spatially warps pixels to generate unseen frames. Although optical flow analysis can accurately recover transformations between low-speed frames, its performance is limited by the reliance on precise motion assumptions, which are often impractical for complex, large-scale movements. As a result, the video-rate up-conversion achievable with VFI is typically below 4×.
Advancements in computational imaging inspire researchers to develop novel hardware systems that capture temporal dynamics more effectively. Video snapshot compressive imaging (VSCI) multiplexes a sequence of video frames, each modulated with temporally varying masks, into a single-shot observation.5
In this work, we explore the opportunity of only using low-cost commercial hardware components for high-speed capture and reconstruction. We propose SpeedShot, a temporal-modulation-only framework for computational high-speed video acquisition. As shown in Fig. 1(a), we consider the inherent redundancy of video data and recognize the spatial sparsity of temporal gradient images. The temporal difference between neighboring high-speed video frames reveals the frame-wise object motion and provides crucial information for video reconstruction. On the other hand, the frame-wise dynamic is spatially sparse and affects only a few pixels in the temporal gradient image. Thus, several such images can be reasonably multiplexed with minimal information loss, as shown in Fig. 1(b). Observing this, SpeedShot employs a dual-camera system with coded exposure for multiplexed temporal gradient imaging. During one long exposure, the cameras’ shutters are programmed with a predefined modulation pattern that alternately routes the incoming video signal to one of the two cameras, producing two coded snapshots. The subtraction between the coded exposure images creates a compact multiframe motion representation for video reconstruction. In this way, SpeedShot offers data-efficient and hardware-friendly high-speed video capture. The proposed framework can be easily implemented with two low-speed cameras with built-in or external programmable shutters. We also expect its compatibility with other emerging camera systems, where a two-tap sensor enables the acquisition of two temporally weighted images within a snapshot.8,9
Figure 1. Working principles for SpeedShot. (a) Temporal gradient (TG) images are sparse motion representations. (b) SpeedShot multiplexes TG images for multiframe motion representations, which assists in high-speed video reconstruction. (c) The proposed framework is compatible with low-end commercial cameras with coded exposure photography.
The high-speed video reconstruction from SpeedShot measurements involves solving an ill-posed inverse problem. Although several pioneering methods have been proposed for video compressive sensing, they do not consider SpeedShot’s unique temporal-only modulation model. Inspired by the plug-and-play (PnP) framework, we introduce the motion-guided scale-recurrent transformer (MSRT) for effective video restoration. MSRT is structured around two innovative components: 1) motion-guided hybrid enhancement (MoGHE), which extracts and utilizes the motion cues from the observed data to adaptively refine the video restoration process, and 2) an error-aware scale-recurrent network, which operates on multiple scales to reconstruct the video, informed by the forward imaging model. This combination allows for the precise reconstruction of static and dynamic regions in the scene.
The key contributions of this paper are summarized as follows:
In addition, SpeedShot’s performance was rigorously evaluated through extensive experiments on various open video datasets. Comparative analysis with VFI, VSCI, and image deblurring methods highlights SpeedShot’s advantages in high-quality, hardware-efficient high-speed imaging.
2 Related Work
2.1 Compressive Video Acquisition
Numerous modalities have been developed to encode high-speed frames with hardware for low-speed embeddings, enhancing temporal resolution.10,11 Predominantly, these methods rely on specialized spatial-temporal modulation devices, such as VSCI and compressed ultrafast photography.12 The associated high costs underscore the importance of developing cost-effective alternatives.13,14 However, current methods still rely on spatial coding, leading to the heavy burden of alignment and calibration between the sensor and the modulator. This paper presents a novel perspective by demonstrating that spatial modulation is not indispensable for achieving high-quality video compressive imaging. We utilize a dual-camera setup, a common configuration in computational imaging,15 to realize temporal-only modulation without a spatial light modulator.
2.2 Coded Exposure Photography
Also known as flutter shutter photography, coded exposure photography (CEP) is a classic method in computational imaging.18
3 Multiplexed Temporal Gradient Imaging
3.1 Motivation
Our method aims to enhance the efficiency of video acquisition by leveraging a powerful perspective on video data: the temporal gradient (TG), which captures differences between successive frames. As illustrated in Fig. 1, TG images filter out static backgrounds, leaving only motion-related signals. Non-zero pixels in a TG image correspond to moving parts of objects, serving as indicators of the objects’ speed and direction. Previous work in video processing has used TG images as an additional modality to better describe and understand the transient motions.23
Unlike previous approaches that use TG to capture single motion information,8,23 SpeedShot captures multiplexed TG images for multiframe motion representation. Compared with conventional intensity images, TG images exhibit a high degree of spatial sparsity, making them highly compressible. Over short intervals, motion cues in TG images appear as sharp lines occupying only a few pixels, allowing multiple TG images from different time points to be multiplexed without compromising the separability of motion cues. This approach enables the multiplexed TG image to effectively capture frame-by-frame details of objects’ translations, rotations, and deformations, facilitating high-speed video reconstruction with enhanced accuracy.
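To make the sparsity argument concrete, the following NumPy sketch (the synthetic clip, threshold, and array shapes are illustrative choices, not taken from the paper) computes TG images for a small moving square and checks how little they overlap when summed into one multiplexed image.

```python
import numpy as np

def temporal_gradients(frames):
    """Frame-wise temporal gradient (difference) images of a video clip."""
    return frames[1:] - frames[:-1]

# Synthetic clip: a bright 8x8 square translating across a static background.
T, H, W = 9, 64, 64
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 20:28, 4 + 6 * t: 12 + 6 * t] = 1.0

tg = temporal_gradients(frames)                     # shape (T-1, H, W)
sparsity = (np.abs(tg) > 1e-3).mean(axis=(1, 2))    # fraction of non-zero pixels per TG image
print("per-frame TG occupancy:", sparsity)          # only a few percent of the pixels

# Multiplexing: several TG images can share one spatial frame with little overlap.
multiplexed = tg.sum(axis=0)
overlap = (np.abs(tg) > 1e-3).sum(axis=0).max()
print("max number of TG images overlapping at any pixel:", overlap)
```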
3.2 Observation Formation Model
SpeedShot simultaneously captures two temporally coded observations for an $N$-frame video clip, reducing the required hardware frame rate to $1/N$ of the target video rate. This process can be expressed mathematically as

$$Y_1 = \sum_{t=1}^{N} c_t X_t, \qquad Y_2 = \sum_{t=1}^{N} (1 - c_t) X_t, \tag{1}$$

where $X_t$ denotes the $t$-th high-speed frame and $Y_1$, $Y_2$ are the two coded exposures.
The binary vector $c = [c_1, \ldots, c_N]$ defines the shutter states of the first camera for each frame, forming the temporal modulation pattern. In $c$, half of the elements are randomly set to “1”, indicating that the frame is exposed, whereas the other half are “0”, indicating it is blocked. The modulation patterns for the two cameras are complementary: each frame contributes to either $Y_1$ or $Y_2$, ensuring full coverage across the $N$ frames. The temporal gradient image $D$ of the video clip can be obtained by calculating the difference between $Y_1$ and $Y_2$:

$$D = Y_1 - Y_2 = \sum_{t=1}^{N} (2c_t - 1)\, X_t. \tag{2}$$
Figure 2. (a) Mathematical model for hardware encoding. (b) Visualization of SpeedShot’s splitting an uncoded long-exposure image into two coded exposures, $Y_1$ and $Y_2$.
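For reference, a minimal simulation of the encoding in Eqs. (1) and (2) can be written directly from the definitions above; the function and variable names here are illustrative and do not correspond to any released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def speedshot_encode(frames, code):
    """Split an N-frame clip into two complementary coded exposures, Eq. (1)."""
    c = code.reshape(-1, 1, 1).astype(frames.dtype)      # (N, 1, 1) binary shutter states
    y1 = (c * frames).sum(axis=0)                         # camera 1: exposed frames
    y2 = ((1.0 - c) * frames).sum(axis=0)                 # camera 2: complementary frames
    return y1, y2

N, H, W = 8, 64, 64
frames = rng.random((N, H, W)).astype(np.float32)

# Random asymmetric pattern: half ones, half zeros (see Sec. 5.3.2).
code = rng.permutation(np.repeat([0, 1], N // 2))

y1, y2 = speedshot_encode(frames, code)
tg = y1 - y2                                              # multiplexed temporal gradient, Eq. (2)

# Sanity check: the two exposures together cover every frame exactly once.
assert np.allclose(y1 + y2, frames.sum(axis=0))
```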
4 MSRT for Software Decoding
MSRT takes the dual observation as input and reconstructs multiframe videos. As shown in Fig. 3, MSRT follows a scale-recurrent structure. At each stage, the network mainly consists of an error-aware cross-scale fusion (ECSF) module and MoGHE blocks. Two simple modules, each built from three stacked Conv3D layers activated by ReLU, are used for shallow feature extraction (FE) and video restoration (VR), respectively.
Figure 3. Overview of MSRT. A three-level snapshot pyramid is first generated. For each pyramid level, shallow features are extracted, fused with the previous stage by the ECSF module, enhanced by MoGHE blocks, and decoded into a video estimate.
4.1 Scale-Recurrent Unfolding Network
The frame-wise temporal information is explicitly encoded in the dual observations and their difference image. To better use this information, it is crucial to enhance the consistency between the reconstruction and the dual observations so that the reconstructed video has a motion pattern similar to what is recorded in the temporal gradient image. Addressing this, a scale-recurrent unfolding network (SRUN) is proposed. Video estimation from the observations is treated as an optimization problem,24,25

$$\hat{X} = \arg\min_{X}\ \tfrac{1}{2}\left\| Y - \mathcal{A}_{c}(X) \right\|_2^2 + \lambda R(X), \tag{3}$$

where $Y$ stacks the dual observations and $\mathcal{A}_{c}$ denotes the coded-exposure forward operator defined by the pattern $c$. The first term controls data fidelity, i.e., the similarity between the prediction and the observations under the coding $c$. The second term is the prior regularization $R(X)$ with weight $\lambda$.
To solve Eq. (3), we unfold the optimization problem with the plug-and-play (PnP) framework,6,25,26 which leads to an alternating iterative update,

$$X^{(k+1)} = \mathcal{P}\!\left(\mathcal{D}\!\left(X^{(k)}, Y\right)\right), \tag{4}$$

where $\mathcal{D}$ and $\mathcal{P}$ are responsible for the data-fidelity and prior updates, respectively.
Traditional compressive sensing frameworks often necessitate a uniform input form throughout the iterative process. Breaking from this convention, MSRT facilitates a unique coarse-to-fine reconstruction approach. Owing to the absence of spatial modulation in its forward model, both $\mathcal{D}$ and $\mathcal{P}$ are spatially invariant. This allows them to be approximated across resolutions through upscaling and downscaling operations. Such flexibility enables MSRT to manage varying reconstruction levels with multiscale processing.
In MSRT, the input is down-sampled into a three-level pyramid, denoted as $\{Y^{(1)}, Y^{(2)}, Y^{(3)}\}$, for multiscale analysis. The update equation [Eq. (4)] becomes its scale-indexed counterpart,

$$X_{s}^{(k+1)} = \mathcal{P}\!\left(\mathcal{D}\!\left(X_{s}^{(k)}, Y^{(s)}\right)\right), \quad s = 1, 2, 3. \tag{5}$$
$\mathcal{D}$ and $\mathcal{P}$, unlike in other deep unfolding networks, are recurrently used across all stages with shared weights for parameter efficiency and training effectiveness. Addressing SpeedShot’s dual-input nature, we introduce forward and backward projection operators, $\mathcal{A}$ and $\mathcal{A}^{\top}$. These operators link the observation and video domains, assisting in initial prediction and data-consistency evaluation.
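The overall unfolding can be sketched as follows. This is a schematic illustration of the alternating update in Eq. (4) over a three-level pyramid, assuming a simple gradient-like data-fidelity step, an identity placeholder for the learned prior, and bilinear resampling; it does not reproduce the trained MSRT modules.

```python
import torch
import torch.nn.functional as F

def forward_project(video, code):
    """A: temporal-only coded-exposure forward model; sums coded frames into two observations."""
    c = code.view(1, -1, 1, 1)                          # (1, N, 1, 1) shutter states
    y1 = (c * video).sum(dim=1, keepdim=True)
    y2 = ((1 - c) * video).sum(dim=1, keepdim=True)
    return y1, y2

def backward_project(y1, y2, code):
    """A^T-like step: spread each observation back onto the frames it exposed."""
    c = code.view(1, -1, 1, 1)
    n1 = code.sum().clamp(min=1)
    n0 = (1 - code).sum().clamp(min=1)
    return c * y1 / n1 + (1 - c) * y2 / n0

def unfold(y1, y2, code, prior_net, scales=(0.25, 0.5, 1.0)):
    """Coarse-to-fine PnP-style iterations over a three-level observation pyramid."""
    x = None
    for scale in scales:
        y1_s = F.interpolate(y1, scale_factor=scale, mode="bilinear")
        y2_s = F.interpolate(y2, scale_factor=scale, mode="bilinear")
        if x is None:
            x = backward_project(y1_s, y2_s, code)      # initial prediction at the coarsest scale
        else:
            x = F.interpolate(x, scale_factor=2, mode="bilinear")
        p1, p2 = forward_project(x, code)               # re-project the current estimate
        residual = backward_project(y1_s - p1, y2_s - p2, code)
        x = x + residual                                # data-fidelity update (D)
        x = prior_net(x)                                # learned prior update (P), MoGHE in MSRT
    return x

code = torch.tensor([1., 0., 1., 0., 1., 0., 1., 0.])
video = torch.rand(1, 8, 64, 64)                        # (batch, frames, H, W)
y1, y2 = forward_project(video, code)
x_hat = unfold(y1, y2, code, prior_net=lambda x: x)     # identity stand-in for the prior network
print(x_hat.shape)                                       # torch.Size([1, 8, 64, 64])
```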
In the following subsections, we describe the key components of $\mathcal{D}$ and $\mathcal{P}$, namely, error-aware cross-scale fusion and motion-guided hybrid enhancement.
4.2 Error-Aware Cross-Scale Fusion (ECSF)
ECSF is designed to supervise data consistency in the feature domain, incorporating a cross-channel-attention mechanism to balance physics-based and learned priors across scales. As shown in Fig. 4, ECSF processes the shallow features of the current pyramid level and the upscaled features from the previous stage. These features are split and partially mixed via transposed attention, guided by a spatial error map that encodes the discrepancies between the inputs and the predictions from the last stage. This error map, after processing through a convolution layer, is combined with the upscaled features for refinement. The attention mechanism then uses the refined features to generate the query $Q$, and the shallow features to generate the key $K$ and value $V$. The mixed features, further processed with a feed-forward network (FFN) and concatenated with the unaltered halves of the splits, pass through a convolution to yield the fused output features.
Figure 4. Details of the ECSF module.
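A rough sketch of the error-guided transposed (channel-wise) attention at the core of ECSF is given below. The channel splitting, FFN, and final fusion convolution are omitted, and all layer sizes and the softmax scaling are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class TransposedAttention(nn.Module):
    """Channel-wise ("transposed") attention: attention is computed across channels
    rather than spatial positions, so the cost stays linear in image size."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.to_kv = nn.Conv2d(dim, dim * 2, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, feat_q, feat_kv):
        b, c, h, w = feat_q.shape
        q = self.to_q(feat_q).flatten(2)                                # (B, C, HW)
        k, v = self.to_kv(feat_kv).chunk(2, dim=1)
        k, v = k.flatten(2), v.flatten(2)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)
        return self.proj(out)

# Error-guided fusion: the upscaled features from the previous stage are refined
# with the spatial error map, then queried against the current-level shallow features.
dim = 32
err_proj = nn.Conv2d(1, dim, 3, padding=1)
attn = TransposedAttention(dim)

shallow = torch.rand(1, dim, 64, 64)       # shallow features of the current pyramid level
upscaled = torch.rand(1, dim, 64, 64)      # upscaled features from the previous stage
error_map = torch.rand(1, 1, 64, 64)       # |observation - reprojection| from the last stage

refined = upscaled + err_proj(error_map)   # error-aware refinement
fused = attn(refined, shallow)             # query from refined, key/value from shallow
print(fused.shape)                         # torch.Size([1, 32, 64, 64])
```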
4.3 Motion-Guided Hybrid Enhancement
4.3.1 Motion-guided space-channel selection (MGSCS)
MGSCS enhances the motion-static relationship in video features. It uses the motion map extracted from the dual observations to adaptively select the spatial locations and feature channels that are most relevant to the frame-wise dynamics, refining moving regions while preserving static content.
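One plausible reading of this selection mechanism is sketched below: the motion map gates the features both spatially and channel-wise in a residual fashion. The specific gating layers are assumptions for illustration and are not the paper’s exact design.

```python
import torch
import torch.nn as nn

class MotionGuidedSelection(nn.Module):
    """Illustrative motion-guided space-channel selection: a motion map derived
    from the dual observations gates where (spatial) and what (channel) to enhance."""
    def __init__(self, dim):
        super().__init__()
        self.spatial_gate = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(1, dim, 1), nn.Sigmoid())

    def forward(self, feat, motion_map):
        s = self.spatial_gate(motion_map)    # (B, 1, H, W): emphasize moving pixels
        c = self.channel_gate(motion_map)    # (B, C, 1, 1): select motion-relevant channels
        return feat + feat * s * c           # residual modulation of the features

feat = torch.rand(1, 32, 64, 64)
motion_map = torch.rand(1, 1, 64, 64)        # e.g., a normalized |Y1 - Y2| map
out = MotionGuidedSelection(32)(feat, motion_map)
print(out.shape)                             # torch.Size([1, 32, 64, 64])
```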
4.3.2 Dense pyramid temporal self-attention (DPTSA)
DPTSA builds upon the temporal self-attention (TSA) mechanism.32,33 To overcome TSA’s limitation of a small receptive field, DPTSA incorporates atrous spatial pyramid pooling (ASPP)34 to enable hierarchical spatial interactions. An input feature is divided into three groups along the channel dimension, one group per pyramid level. At each level, the group is fused with the outputs from the previous levels via a convolution, followed by a dilated residual block. The dilation rates are set at [4, 2, 1] for the different scales, facilitating a broad range of spatial interactions. Subsequently, each level undergoes temporal self-attention and a 3D feed-forward operation via the TSA block. The process concludes with the concatenation and fusion of the features from all levels, again using a convolution for comprehensive feature integration.
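A compact sketch of this pyramid is shown below, assuming 1×1 fusion convolutions, plain (non-residual) dilated 3D convolutions, and a per-pixel temporal multi-head attention in place of the full TSA block with its 3D feed-forward operation; the group size and head count are illustrative.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Attention over the temporal axis only: every pixel attends to its own history."""
    def __init__(self, dim, heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

class DPTSA(nn.Module):
    """Sketch of dense pyramid temporal self-attention: three channel groups with
    dilation rates [4, 2, 1], each followed by temporal self-attention, then fused."""
    def __init__(self, dim, rates=(4, 2, 1)):            # dim must be divisible by 3
        super().__init__()
        g = dim // 3
        self.dilated = nn.ModuleList(
            nn.Conv3d(g, g, 3, padding=(1, r, r), dilation=(1, r, r)) for r in rates)
        self.tsa = nn.ModuleList(TemporalSelfAttention(g) for _ in rates)
        self.fuse = nn.ModuleList(nn.Conv3d(g * (i + 1), g, 1) for i in range(len(rates)))
        self.out = nn.Conv3d(g * 3, dim, 1)

    def forward(self, x):                                # x: (B, C, T, H, W)
        groups = x.chunk(3, dim=1)
        outs = []
        for g_in, conv, tsa, fuse in zip(groups, self.dilated, self.tsa, self.fuse):
            dense = fuse(torch.cat([g_in] + outs, dim=1))   # fuse with previous levels
            outs.append(tsa(conv(dense)))                    # dilated conv, then temporal attention
        return self.out(torch.cat(outs, dim=1))

x = torch.rand(1, 48, 8, 32, 32)                         # (batch, channels, frames, H, W)
print(DPTSA(48)(x).shape)                                 # torch.Size([1, 48, 8, 32, 32])
```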
5 Experiments
5.1 Simulated Experiments
We experimentally demonstrate the efficacy of the SpeedShot framework. The implementation is in PyTorch, with the data augmentation adopted from BIRNAT.35 Paired observations and ground truths are generated following Eq. (1). The networks are first trained on the DAVIS201736 dataset for 300 epochs using the Adam optimizer with a batch size of 4. The base model, aimed at 8× restoration, is trained with a cosine-annealing learning-rate schedule. Subsequently, the model is fine-tuned on Adobe24037 for 8×, 16×, and 32× restoration over 30 epochs with a reduced learning rate. The training objective is to minimize the mean absolute error (MAE).
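For orientation, the training recipe can be expressed as a short PyTorch configuration. The stand-in model, synthetic loader, and learning-rate values in this sketch are placeholders (the exact rates are not reproduced here); only the Adam optimizer, cosine-annealing schedule, 300-epoch budget, and MAE objective follow the description above.

```python
import torch
import torch.nn as nn

# Placeholder model and data; substitute MSRT and observation/ground-truth pairs from Eq. (1).
model = nn.Conv3d(1, 8, 3, padding=1)
loader = [(torch.rand(2, 1, 8, 64, 64), torch.rand(2, 8, 8, 64, 64)) for _ in range(4)]

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)           # placeholder initial rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-6)
criterion = nn.L1Loss()                                              # MAE objective

for epoch in range(3):                                               # 300 epochs in the paper
    for y, gt in loader:
        loss = criterion(model(y), gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
print("final learning rate:", scheduler.get_last_lr())
```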
5.1.1 Comparison with VFI in multiframe restoration
We evaluate SpeedShot’s video speed-up capability on two widely used video benchmarks, Adobe240 and GoPro.38 State-of-the-art image-only39 and event-assisted VFI methods are included for comparison.
Figure 5. Details of the motion-guided hybrid enhancement (MoGHE) block.
Figure 6. Example of SpeedShot’s reconstruction from Adobe240. With a pair of simultaneously taken observations, SpeedShot records and restores tens of frames of a dynamic scene with nonlinear motions. Interpolation methods often fail in such scenarios due to a lack of pixel correspondence between the first and the last frames. (a) Input. (b) Reconstructions.
Figure 7. Visual comparison of the GoPro dataset at 8× speed-up.
The computational efficiency of our MSRT and other VFI methods, in terms of model size and inference speed, is compared using an RTX 3090 GPU. Our MSRT contains 14.9 million parameters and takes 0.74 s per inference. The numbers of parameters and inference times of the other methods are listed in Table 1. Our MSRT offers superior reconstruction quality with a reasonably lightweight architecture, achieved by reusing the same network module across multiple iterations. Future improvements in real-time performance can be anticipated through model compression techniques such as pruning and quantization.
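The reported model size and runtime can be measured with a routine such as the following; the stand-in network, input size, and warm-up/repeat counts are arbitrary choices for illustration.

```python
import time
import torch
import torch.nn as nn

def count_parameters(model):
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def time_inference(model, x, warmup=3, runs=10):
    """Average inference time in seconds, with CUDA synchronization when on GPU."""
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Stand-in model and input; replace with MSRT and a real coded observation pair.
model = nn.Sequential(nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 8, 3, padding=1))
x = torch.rand(1, 2, 256, 256)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
print(f"{count_parameters(model):.2f} M params, {time_inference(model, x):.4f} s / inference")
```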
Event cameras and SpeedShot are conceptually related. Event cameras record pixel intensity changes as sparse data, whereas SpeedShot multiplexes these changes into a single coded snapshot. Experimental results show that SpeedShot outperforms event-assisted methods by 1.6 dB. This improvement is attributed to the binary quantization in event cameras, which discards substantial information during event generation. In contrast, SpeedShot records intensity changes at full precision, preserving more detail.
5.1.2 Comparison with image deblurring
The SpeedShot system simultaneously captures two coded blurry exposures. Multiframe decoding can thus be interpreted as a more challenging variant of image deblurring. From this perspective, we compare SpeedShot’s performance with several SOTA deblurring methods.46
5.1.3 Comparison with VSCI
The proposed SpeedShot framework has a close relationship with VSCI.6,7,33,52 Both methods reconstruct multiple frames from a highly compressed measurement. Although VSCI employs both the spatial and temporal dimensions for video compressive encoding, SpeedShot works only in the temporal dimension. We test the 8× speed-up performance of SpeedShot on Set6,53 a widely used dataset for VSCI. The SOTA method, EfficientSCI,33 is chosen for comparison. As shown in Table 3, SpeedShot achieves comparable performance. As visualized in Fig. 8, both methods recover the details of motion, both qualitatively and quantitatively. SpeedShot even outperforms VSCI in scenes with linear motion, e.g., “Traffic.”
Figure 8. Selected reconstructions of Set6. EfficientSCI is the SOTA VSCI method.
5.2 Optical Prototype
To validate SpeedShot’s practicality, we construct two optical prototypes with commercial low-speed cameras. The first prototype, shown in Fig. 1(c), utilizes dual RGB cameras [OmniVision OV4689, 4MP, 60 frames per second (fps)] operating in multi-exposure mode. Each camera’s sensor combines four coded frames into a single observation at 7.5 fps. Spatial alignment is achieved using homography-based calibration with a checkerboard pattern, ensuring accurate registration of the paired observations. The MSRT network reconstructs a 60 fps video from these coded observations, as shown in Fig. 9. SpeedShot manages to recover the nonlinear moving trajectory of the bottle while reducing the motion blur in the observations.
Figure 9. Paired observations from our SpeedShot dual RGB camera prototype, and the corresponding 8× reconstructions. The temporal gradient image highlights the frame-wise object motion and reflects the trajectory of the movement.
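The homography-based spatial alignment described above can be prototyped with OpenCV roughly as follows; the checkerboard size, file names, and RANSAC choice are assumptions, not the calibration code used for the prototype.

```python
import cv2
import numpy as np

def align_with_checkerboard(img_ref, img_moving, pattern=(9, 6)):
    """Estimate a homography from checkerboard corners seen by both cameras,
    then warp the second camera's image onto the first camera's view."""
    gray_ref = cv2.cvtColor(img_ref, cv2.COLOR_BGR2GRAY)
    gray_mov = cv2.cvtColor(img_moving, cv2.COLOR_BGR2GRAY)
    ok1, corners_ref = cv2.findChessboardCorners(gray_ref, pattern)
    ok2, corners_mov = cv2.findChessboardCorners(gray_mov, pattern)
    if not (ok1 and ok2):
        raise RuntimeError("checkerboard not detected in both views")
    H, _ = cv2.findHomography(corners_mov, corners_ref, cv2.RANSAC)
    h, w = img_ref.shape[:2]
    return cv2.warpPerspective(img_moving, H, (w, h)), H

# Usage with hypothetical file names: warp camera 2's coded snapshot into camera 1's
# frame before computing the multiplexed temporal gradient Y1 - Y2.
# aligned, H = align_with_checkerboard(cv2.imread("cam1.png"), cv2.imread("cam2.png"))
```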
The second prototype, depicted in Fig. 10, employs an advanced optical setup with an external shutter (Fldiscovery F4320) for precise temporal encoding. The imaging system splits the light path into two using a beam splitter (Lbtek MBS1455-A, R:T = 5:5). The first path captures high-speed dynamic scenes directly on the Blurry Cam (HIKROBOT MV-CS020-10 UC) through a Nikon Macro lens (L1, 35 mm). The second path encodes the scenes using the reflective shutter. The encoded light passes through a Coolens WWK10-110-111 lens [L2, field of view (FOV): 18 mm] and a high-performance lens (L3) before being captured by the Coded Cam (HIKROBOT MV-CA050-20 UM). Both cameras are triggered simultaneously after encoding 16 shutter frames at a 960 Hz frame rate. This system is synchronized via a laptop-controlled Arduino microcontroller. Temporal calibration of this setup involves synchronization of the shutter and cameras, verified using a high-frequency LED signal to ensure precise modulation timing. This setup reconstructs a 960 fps video from 60 fps observations, as shown in Fig. 11. Although the captured image is severely blurred, the proposed MSRT can still reconstruct the intermediate frames of the abrupt process correctly with fine details.
Figure 10. Imaging prototype with an external shutter for optical modulation. It accelerates a 60 Hz camera to 960 Hz.
Figure 11. Inputs and reconstructions from the SpeedShot prototype with an external mechanical shutter at 960 Hz. One camera captures a temporally coded observation, whereas the other captures a blurry, uncoded image. Subtracting the coded observation from the blurry one yields an additional coded observation, enabling 960 fps video reconstruction.
5.3 Ablation Study
5.3.1 Multiplexed temporal gradient imaging
A key contribution of our paper is the introduction of a second camera and the use of CEP in the gradient domain. To validate this, an adjusted version of MSRT is implemented, taking two consecutive coded-exposure snapshots from a single camera as input and predicting the frames corresponding to one coded snapshot. The result on 8× Set6 is 26.58 dB, an 8 dB decrease in PSNR. As shown in Fig. 12, existing CEP struggles to distinguish nonlinear motion cues from static textures of the scene, leading to a blurry prediction.18,54 In contrast, our dual-camera design resolves this ambiguity, enabled by the explicitly defined and extracted motion map.
Figure 12. Comparison with previous single-camera CEP.
5.3.2 Coding pattern selection
We recommend using random asymmetric coding patterns. As shown in Table 4, the ablation results are averaged over 10 different patterns to ensure statistical reliability. Each pattern is generated by randomly shuffling a binary sequence with half zeros and half ones. It can be seen that pattern selection has minimal impact on reconstruction performance,22 as the variations in PSNR are relatively small. However, symmetric patterns can introduce ambiguity in determining the motion direction and lead to blur.54
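Such patterns can be generated with a few lines of code. The palindromic-rejection rule below is one reading of “asymmetric” and is an illustrative assumption rather than the paper’s exact criterion.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_asymmetric_code(n):
    """Random binary pattern with exactly half ones and half zeros, rejecting
    time-symmetric (palindromic) patterns that make motion direction ambiguous."""
    while True:
        code = rng.permutation(np.repeat([0, 1], n // 2))
        if not np.array_equal(code, code[::-1]):    # reject symmetric patterns
            return code

print(random_asymmetric_code(8))   # e.g., [0 1 1 0 1 0 0 1]
```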
5.3.3 Noise robustness
Using the camera noise model,55 we train and test three MSRT variants, each considering different noise sources. As shown in Table 5, the results demonstrate that the proposed SpeedShot framework is resilient to common noise, maintaining relatively high performance and making it practical for real-world applications.
5.3.4 Mis-calibration robustness
To ensure reliable performance in practical settings, we evaluate SpeedShot’s robustness to hardware calibration errors. A key advantage of SpeedShot is its robustness to spatial calibration errors, as it does not require pixel-wise alignment between the two cameras. Global alignment, achieved via homography-based calibration, is sufficient. For temporal calibration errors, we introduce timing jitter as temporal noise by blending each captured high-speed frame with adjacent frames using random weights of 0% to 1% and 0% to 3% to mimic inaccurate shutter modulation or synchronization errors. The results show negligible degradation, with a PSNR reduction of less than 1 dB for 3% jitter, demonstrating robust performance against minor temporal noise.
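The jitter simulation described above can be sketched as follows; the blending of each frame with only its two immediate neighbors and the uniform weight distribution are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_timing_jitter(frames, max_blend=0.03):
    """Mimic shutter/synchronization jitter by blending each high-speed frame with its
    neighbors using small random weights (e.g., up to 1% or 3%, as in the ablation)."""
    T = len(frames)
    jittered = frames.copy()
    for t in range(T):
        w_prev = rng.uniform(0, max_blend) if t > 0 else 0.0
        w_next = rng.uniform(0, max_blend) if t < T - 1 else 0.0
        jittered[t] = ((1 - w_prev - w_next) * frames[t]
                       + w_prev * frames[max(t - 1, 0)]
                       + w_next * frames[min(t + 1, T - 1)])
    return jittered

frames = rng.random((8, 64, 64)).astype(np.float32)
noisy = add_timing_jitter(frames, max_blend=0.03)
print(float(np.abs(noisy - frames).max()))   # perturbation on the order of a few percent
```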
5.3.5 MSRT network design
We first analyze MSRT’s improvement over simple convolutional networks. We implement a U-Net model with residual 3D convolutional blocks, which has a similar number of parameters to our MSRT (14.9M). Its result on 8× Set6 is 31.65 dB, an over 3 dB decrease in PSNR. The MSRT ablation then involves retraining the network on a single scale and replacing the error map and the motion prior with zero tensors. The results in Table 6 show that the scale-recurrent unfolding structure yields the most significant improvement, enhancing PSNR by 0.6 dB. In addition, the motion guidance from the MGSCS block and the error-aware fusion from the ECSF block contribute PSNR increases of 0.4 and 0.2 dB, respectively, validating the effectiveness of the proposed network design.
6 Conclusion
This paper presents SpeedShot, a compressive high-speed imaging framework that employs dual coded-exposure cameras for multiplexed temporal gradient imaging. This approach offers a viable alternative to existing VFI and VSCI methods and is especially advantageous for popular dual-camera devices such as smartphones. The proposed method shows strong potential for real-life applications. In consumer-level scenarios such as surveillance, sports analysis, and video analytics, SpeedShot can lower the required capture frame rate, relaxing the demands on lighting conditions, data bandwidth, and storage. This makes high-speed video analysis more accessible on low-power and low-cost hardware. In high-speed scientific imaging, SpeedShot has the potential to boost the effective frame rate by tens of times, offering an affordable alternative to specialized high-speed cameras. Future work will focus on more challenging scenarios, such as handling back-and-forth motion and imaging in high-noise environments.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 62305184), the Basic and Applied Basic Research Foundation of Guangdong Province (Grant No. 2023A1515012932), the Science, Technology, and Innovation Commission of Shenzhen Municipality (Grant No. JCYJ20241202123919027), the Major Key Project of Pengcheng Laboratory (Grant No. PCL2024A1), the Science Fund for Distinguished Young Scholars of Zhejiang Province (Grant No. LR23F010001), the Research Center for Industries of the Future (RCIF) at Westlake University and the Key Project of Westlake Institute for Optoelectronics (Grant No. 2023GD007), the Zhejiang “Pioneer” and “Leading Goose” R&D Program (Grant Nos. 2024SDXHDX0006 and 2024C03182), the Ningbo Science and Technology Bureau “Science and Technology Yongjiang 2035” Key Technology Breakthrough Program (Grant No. 2024Z126), the Research Grants Council of the Hong Kong Special Administrative Region, China (Grant Nos. C5031-22G, CityU11310522, and CityU11300123), and the City University of Hong Kong (Grant No. 9610628).
Biographies of the authors are not available.
[32] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in Proc. Int. Conf. Machine Learning (2021).
[36] J. Pont-Tuset et al., “The 2017 DAVIS challenge on video object segmentation” (2017).
Yifei Zhang, Xing Liu, Lishun Wang, Ping Wang, Ganzhangqin Yuan, Mu Ku Chen, Kui Jiang, Xin Yuan, Zihan Geng, "High-speed video imaging via multiplexed temporal gradient snapshot," Adv. Photon. Nexus 4, 046017 (2025)
Category: Research Articles
Received: Mar. 26, 2025
Accepted: Jul. 4, 2025
Published Online: Aug. 15, 2025
The Author Email: Xin Yuan (xyuan@westlake.edu.cn), Zihan Geng (geng.zihan@sz.tsinghua.edu.cn)