Advanced Photonics Nexus, Volume 4, Issue 4, 046017 (2025)
High-speed video imaging via multiplexed temporal gradient snapshot
High-speed imaging is crucial for understanding the transient dynamics of the world, but conventional frame-by-frame video acquisition is limited by specialized hardware and substantial data storage requirements. We introduce “SpeedShot,” a computational imaging framework for efficient high-speed video imaging. SpeedShot features a low-speed dual-camera setup, which simultaneously captures two temporally coded snapshots. Cross-referencing these two snapshots extracts a multiplexed temporal gradient image, producing a compact and multiframe motion representation for video reconstruction. Recognizing the unique temporal-only modulation model, we propose an explicable motion-guided scale-recurrent transformer for video decoding. It exploits cross-scale error maps to bolster the cycle consistency between predicted and observed data. Evaluations on both simulated datasets and real imaging setups demonstrate SpeedShot’s effectiveness in video-rate up-conversion, with pronounced improvement over video frame interpolation and deblurring methods. The proposed framework is compatible with commercial low-speed cameras, offering a versatile low-bandwidth alternative for video-related applications, such as video surveillance and sports analysis.
1 Introduction
High-speed imaging has the advantage of giving direct visual information about transient processes such as fluid dynamics and sports movements, but its adoption is limited by complicated hardware, data volume, and high power needs.1
High-speed video reconstruction from low-speed, low-storage captures has been extensively studied. One major research direction focuses on software post-processing. Some methods, known as video frame interpolation (VFI), predict multiple high-speed frames from two consecutive low-speed frames. Based on the assumption of motion continuity, VFI spatially warps pixels to generate unseen frames. Although optical flow analysis can accurately recover transformations between low-speed frames, its performance is limited by the reliance on precise motion assumptions, which are often impractical for complex, large-scale movements. As a result, the video-rate up-conversion achievable with VFI is typically below 4×.
Advancements in computational imaging inspire researchers to develop novel hardware systems that capture temporal dynamics more effectively. Video snapshot compressive imaging (VSCI) multiplexes a sequence of video frames, each modulated with temporally varying masks, into a single-shot observation.5
In this work, we explore the opportunity of only using low-cost commercial hardware components for high-speed capture and reconstruction. We propose SpeedShot, a temporal-modulation-only framework for computational high-speed video acquisition. As shown in Fig. 1(a), we consider the inherent redundancy of video data and recognize the spatial sparsity of temporal gradient images. The temporal difference between neighboring high-speed video frames reveals the frame-wise object motion and provides crucial information for video reconstruction. On the other hand, the frame-wise dynamic is spatially sparse and affects only a few pixels in the temporal gradient image. Thus, several such images can be reasonably multiplexed with minimal information loss, as shown in Fig. 1(b). Observing this, SpeedShot employs a dual-camera system with coded exposure for multiplexed temporal gradient imaging. During one long exposure, the cameras’ shutters are programmed with a predefined modulation pattern that alternately routes the incoming video signal to one of the two cameras, producing two coded snapshots. The subtraction between the coded exposure images creates a compact multiframe motion representation for video reconstruction. In this way, SpeedShot offers data-efficient and hardware-friendly high-speed video capture. The proposed framework can be easily implemented with two low-speed cameras with built-in or external programmable shutters. We also expect its compatibility with other emerging camera systems, where a two-tap sensor enables the acquisition of two temporally weighted images within a snapshot.8,9
Figure 1. Working principles for SpeedShot. (a) Temporal gradient (TG) images are sparse motion representations. (b) SpeedShot multiplexes TG images for multiframe motion representations, which assists in high-speed video reconstruction. (c) The proposed framework is compatible with low-end commercial cameras with coded exposure photography.
The high-speed video reconstruction from SpeedShot measurements involves solving an ill-posed inverse problem. Although several pioneering methods have been proposed for video compressive sensing, they do not consider SpeedShot’s unique temporal-only modulation model. Inspired by the plug-and-play (PnP) framework, we introduce the motion-guided scale-recurrent transformer (MSRT) for effective video restoration. MSRT is structured around two innovative components: 1) motion-guided hybrid enhancement (MoGHE), which extracts and utilizes the motion cues from the observed data to adaptively refine the video restoration process, and 2) an error-aware scale-recurrent network, which operates on multiple scales to reconstruct the video, informed by the forward imaging model. This combination allows for the precise reconstruction of static and dynamic regions in the scene.
The key contributions of this paper are summarized as follows:
In addition, SpeedShot’s performance was rigorously evaluated through extensive experiments on various open video datasets. Comparative analysis with VFI, VSCI, and image deblurring methods highlights SpeedShot’s advantages in high-quality, hardware-efficient high-speed imaging.
2 Related Work
2.1 Compressive Video Acquisition
Numerous modalities have been developed to encode high-speed frames with hardware for low-speed embeddings, enhancing temporal resolution.10,11 Predominantly, these methods rely on specialized spatial-temporal modulation devices, such as VSCI and compressed ultrafast photography.12 The associated high costs underscore the importance of developing cost-effective alternatives.13,14 However, current methods still rely on spatial coding, leading to the heavy burden of alignment and calibration between the sensor and the modulator. This paper presents a novel perspective by demonstrating that spatial modulation is not indispensable for achieving high-quality video compressive imaging. We utilize a dual-camera setup, a common configuration in computational imaging,15 to realize temporal-only modulation without a spatial light modulator.
2.2 Coded Exposure Photography
Also known as flutter shutter photography, coded exposure photography (CEP) is a classic method in computational imaging.18
3 Multiplexed Temporal Gradient Imaging
3.1 Motivation
Our method aims to enhance the efficiency of video acquisition by leveraging a powerful perspective on video data: the temporal gradient (TG), which captures differences between successive frames. As illustrated in Fig. 1, TG images filter out static backgrounds, leaving only motion-related signals. Non-zero pixels in a TG image correspond to moving parts of objects, serving as indicators of the objects’ speed and direction. Previous work in video processing has used TG images as an additional modality to better describe and understand the transient motions.23
Unlike previous approaches that use TG to capture single motion information,8,23 SpeedShot captures multiplexed TG images for multiframe motion representation. Compared with conventional intensity images, TG images exhibit a high degree of spatial sparsity, making them highly compressible. Over short intervals, motion cues in TG images appear as sharp lines occupying only a few pixels, allowing multiple TG images from different time points to be multiplexed without compromising the separability of motion cues. This approach enables the multiplexed TG image to effectively capture frame-by-frame details of objects’ translations, rotations, and deformations, facilitating high-speed video reconstruction with enhanced accuracy.
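To make the sparsity argument concrete, the following NumPy sketch (the synthetic clip, threshold, and array shapes are illustrative choices, not taken from the paper) computes TG images for a small moving square and checks how little they overlap when summed into one multiplexed image.

```python
import numpy as np

def temporal_gradients(frames):
    """Frame-wise temporal gradient (difference) images of a video clip."""
    return frames[1:] - frames[:-1]

# Synthetic clip: a bright 8x8 square translating across a static background.
T, H, W = 9, 64, 64
frames = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    frames[t, 20:28, 4 + 6 * t: 12 + 6 * t] = 1.0

tg = temporal_gradients(frames)                     # shape (T-1, H, W)
sparsity = (np.abs(tg) > 1e-3).mean(axis=(1, 2))    # fraction of non-zero pixels per TG image
print("per-frame TG occupancy:", sparsity)          # only a few percent of the pixels

# Multiplexing: several TG images can share one spatial frame with little overlap.
multiplexed = tg.sum(axis=0)
overlap = (np.abs(tg) > 1e-3).sum(axis=0).max()
print("max number of TG images overlapping at any pixel:", overlap)
```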
3.2 Observation Formation Model
SpeedShot simultaneously captures two temporally coded observations for an $N$-frame video clip, reducing the required hardware frame rate to $1/N$ of the target video rate. This process can be expressed mathematically as

$$Y_1 = \sum_{t=1}^{N} c_t X_t, \qquad Y_2 = \sum_{t=1}^{N} (1 - c_t) X_t, \tag{1}$$

where $X_t$ denotes the $t$-th high-speed frame and $Y_1$, $Y_2$ are the two coded exposures.
The binary vector $c = [c_1, \ldots, c_N]$ defines the shutter states of the first camera for each frame, forming the temporal modulation pattern. In $c$, half of the elements are randomly set to “1”, indicating that the frame is exposed, whereas the other half are “0”, indicating it is blocked. The modulation patterns for the two cameras are complementary: each frame contributes to either $Y_1$ or $Y_2$, ensuring full coverage across the $N$ frames. The temporal gradient image $D$ of the video clip can be obtained by calculating the difference between $Y_1$ and $Y_2$:

$$D = Y_1 - Y_2 = \sum_{t=1}^{N} (2c_t - 1)\, X_t. \tag{2}$$
Figure 2. (a) Mathematical model for hardware encoding. (b) Visualization of SpeedShot’s splitting an uncoded long-exposure image into two coded exposures, $Y_1$ and $Y_2$.
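For reference, a minimal simulation of the encoding in Eqs. (1) and (2) can be written directly from the definitions above; the function and variable names here are illustrative and do not correspond to any released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def speedshot_encode(frames, code):
    """Split an N-frame clip into two complementary coded exposures, Eq. (1)."""
    c = code.reshape(-1, 1, 1).astype(frames.dtype)      # (N, 1, 1) binary shutter states
    y1 = (c * frames).sum(axis=0)                         # camera 1: exposed frames
    y2 = ((1.0 - c) * frames).sum(axis=0)                 # camera 2: complementary frames
    return y1, y2

N, H, W = 8, 64, 64
frames = rng.random((N, H, W)).astype(np.float32)

# Random asymmetric pattern: half ones, half zeros (see Sec. 5.3.2).
code = rng.permutation(np.repeat([0, 1], N // 2))

y1, y2 = speedshot_encode(frames, code)
tg = y1 - y2                                              # multiplexed temporal gradient, Eq. (2)

# Sanity check: the two exposures together cover every frame exactly once.
assert np.allclose(y1 + y2, frames.sum(axis=0))
```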
4 MSRT for Software Decoding
MSRT takes the dual observation as input and reconstructs multiframe videos. As shown in Fig. 3, MSRT follows a scale-recurrent structure. At each stage, the network mainly consists of an error-aware cross-scale fusion (ECSF) module and MoGHE blocks. Two simple modules, each built from three stacked Conv3D layers activated by ReLU, are used for shallow feature extraction (FE) and video restoration (VR), respectively.
Figure 3. Overview of MSRT. A three-level snapshot pyramid is first generated. For each pyramid level, shallow features are extracted, fused with the previous stage by the ECSF module, enhanced by MoGHE blocks, and decoded into a video estimate.
4.1 Scale-Recurrent Unfolding Network
The frame-wise temporal information is explicitly encoded in the dual observations and their difference image. To better use this information, it is crucial to enhance the consistency between the reconstruction and the dual observations so that the reconstructed video has a motion pattern similar to what is recorded in the temporal gradient image. Addressing this, a scale-recurrent unfolding network (SRUN) is proposed. Video estimation from the observations is treated as an optimization problem,24,25

$$\hat{X} = \arg\min_{X}\ \tfrac{1}{2}\left\| Y - \mathcal{A}_{c}(X) \right\|_2^2 + \lambda R(X), \tag{3}$$

where $Y$ stacks the dual observations and $\mathcal{A}_{c}$ denotes the coded-exposure forward operator defined by the pattern $c$. The first term controls data fidelity, i.e., the similarity between the prediction and the observations under the coding $c$. The second term is the prior regularization $R(X)$ with weight $\lambda$.
To solve Eq. (3), we unfold the optimization problem with the plug-and-play (PnP) framework,6,25,26 which leads to an alternating iterative update,

$$X^{(k+1)} = \mathcal{P}\!\left(\mathcal{D}\!\left(X^{(k)}, Y\right)\right), \tag{4}$$

where $\mathcal{D}$ and $\mathcal{P}$ are responsible for the data-fidelity and prior updates, respectively.
Traditional compressive sensing frameworks often necessitate a uniform input form throughout the iterative process. Breaking from this convention, MSRT facilitates a unique coarse-to-fine reconstruction approach. Owing to the absence of spatial modulation in its forward model, both $\mathcal{D}$ and $\mathcal{P}$ are spatially invariant. This allows them to be approximated across resolutions through upscaling and downscaling operations. Such flexibility enables MSRT to manage varying reconstruction levels with multiscale processing.
In MSRT, the input is down-sampled into a three-level pyramid, denoted as $\{Y^{(1)}, Y^{(2)}, Y^{(3)}\}$, for multiscale analysis. The update equation [Eq. (4)] becomes its scale-indexed counterpart,

$$X_{s}^{(k+1)} = \mathcal{P}\!\left(\mathcal{D}\!\left(X_{s}^{(k)}, Y^{(s)}\right)\right), \quad s = 1, 2, 3. \tag{5}$$
$\mathcal{D}$ and $\mathcal{P}$, unlike in other deep unfolding networks, are recurrently used across all stages with shared weights for parameter efficiency and training effectiveness. Addressing SpeedShot’s dual-input nature, we introduce forward and backward projection operators, $\mathcal{A}$ and $\mathcal{A}^{\top}$. These operators link the observation and video domains, assisting in initial prediction and data-consistency evaluation.
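The overall unfolding can be sketched as follows. This is a schematic illustration of the alternating update in Eq. (4) over a three-level pyramid, assuming a simple gradient-like data-fidelity step, an identity placeholder for the learned prior, and bilinear resampling; it does not reproduce the trained MSRT modules.

```python
import torch
import torch.nn.functional as F

def forward_project(video, code):
    """A: temporal-only coded-exposure forward model; sums coded frames into two observations."""
    c = code.view(1, -1, 1, 1)                          # (1, N, 1, 1) shutter states
    y1 = (c * video).sum(dim=1, keepdim=True)
    y2 = ((1 - c) * video).sum(dim=1, keepdim=True)
    return y1, y2

def backward_project(y1, y2, code):
    """A^T-like step: spread each observation back onto the frames it exposed."""
    c = code.view(1, -1, 1, 1)
    n1 = code.sum().clamp(min=1)
    n0 = (1 - code).sum().clamp(min=1)
    return c * y1 / n1 + (1 - c) * y2 / n0

def unfold(y1, y2, code, prior_net, scales=(0.25, 0.5, 1.0)):
    """Coarse-to-fine PnP-style iterations over a three-level observation pyramid."""
    x = None
    for scale in scales:
        y1_s = F.interpolate(y1, scale_factor=scale, mode="bilinear")
        y2_s = F.interpolate(y2, scale_factor=scale, mode="bilinear")
        if x is None:
            x = backward_project(y1_s, y2_s, code)      # initial prediction at the coarsest scale
        else:
            x = F.interpolate(x, scale_factor=2, mode="bilinear")
        p1, p2 = forward_project(x, code)               # re-project the current estimate
        residual = backward_project(y1_s - p1, y2_s - p2, code)
        x = x + residual                                # data-fidelity update (D)
        x = prior_net(x)                                # learned prior update (P), MoGHE in MSRT
    return x

code = torch.tensor([1., 0., 1., 0., 1., 0., 1., 0.])
video = torch.rand(1, 8, 64, 64)                        # (batch, frames, H, W)
y1, y2 = forward_project(video, code)
x_hat = unfold(y1, y2, code, prior_net=lambda x: x)     # identity stand-in for the prior network
print(x_hat.shape)                                       # torch.Size([1, 8, 64, 64])
```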
In the following subsections, we describe the key components of $\mathcal{D}$ and $\mathcal{P}$, namely, error-aware cross-scale fusion and motion-guided hybrid enhancement.
4.2 Error-Aware Cross-Scale Fusion (ECSF)
ECSF is designed to supervise data consistency in the feature domain, incorporating a cross-channel-attention mechanism to balance physics-based and learned priors across scales. As shown in Fig. 4, ECSF processes the shallow features of the current pyramid level and the upscaled features from the previous stage. These features are split and partially mixed via transposed attention, guided by a spatial error map that encodes the discrepancies between the inputs and the predictions from the last stage. This error map, after processing through a convolution layer, is combined with the upscaled features for refinement. The attention mechanism then uses the refined features to generate the query $Q$, and the shallow features to generate the key $K$ and value $V$. The mixed features, further processed with a feed-forward network (FFN) and concatenated with the unaltered halves of the splits, pass through a convolution to yield the fused output features.
Figure 4. Details of the ECSF module.
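A rough sketch of the error-guided transposed (channel-wise) attention at the core of ECSF is given below. The channel splitting, FFN, and final fusion convolution are omitted, and all layer sizes and the softmax scaling are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class TransposedAttention(nn.Module):
    """Channel-wise ("transposed") attention: attention is computed across channels
    rather than spatial positions, so the cost stays linear in image size."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.to_kv = nn.Conv2d(dim, dim * 2, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, feat_q, feat_kv):
        b, c, h, w = feat_q.shape
        q = self.to_q(feat_q).flatten(2)                                # (B, C, HW)
        k, v = self.to_kv(feat_kv).chunk(2, dim=1)
        k, v = k.flatten(2), v.flatten(2)
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)
        return self.proj(out)

# Error-guided fusion: the upscaled features from the previous stage are refined
# with the spatial error map, then queried against the current-level shallow features.
dim = 32
err_proj = nn.Conv2d(1, dim, 3, padding=1)
attn = TransposedAttention(dim)

shallow = torch.rand(1, dim, 64, 64)       # shallow features of the current pyramid level
upscaled = torch.rand(1, dim, 64, 64)      # upscaled features from the previous stage
error_map = torch.rand(1, 1, 64, 64)       # |observation - reprojection| from the last stage

refined = upscaled + err_proj(error_map)   # error-aware refinement
fused = attn(refined, shallow)             # query from refined, key/value from shallow
print(fused.shape)                         # torch.Size([1, 32, 64, 64])
```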
4.3 Motion-Guided Hybrid Enhancement
4.3.1 Motion-guided space-channel selection (MGSCS)
MGSCS enhances the motion-static relationship in video features. It uses the motion map extracted from the dual observations to adaptively select the spatial locations and feature channels that are most relevant to the frame-wise dynamics, refining moving regions while preserving static content.
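One plausible reading of this selection mechanism is sketched below: the motion map gates the features both spatially and channel-wise in a residual fashion. The specific gating layers are assumptions for illustration and are not the paper’s exact design.

```python
import torch
import torch.nn as nn

class MotionGuidedSelection(nn.Module):
    """Illustrative motion-guided space-channel selection: a motion map derived
    from the dual observations gates where (spatial) and what (channel) to enhance."""
    def __init__(self, dim):
        super().__init__()
        self.spatial_gate = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(1, dim, 1), nn.Sigmoid())

    def forward(self, feat, motion_map):
        s = self.spatial_gate(motion_map)    # (B, 1, H, W): emphasize moving pixels
        c = self.channel_gate(motion_map)    # (B, C, 1, 1): select motion-relevant channels
        return feat + feat * s * c           # residual modulation of the features

feat = torch.rand(1, 32, 64, 64)
motion_map = torch.rand(1, 1, 64, 64)        # e.g., a normalized |Y1 - Y2| map
out = MotionGuidedSelection(32)(feat, motion_map)
print(out.shape)                             # torch.Size([1, 32, 64, 64])
```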
4.3.2 Dense pyramid temporal self-attention (DPTSA)
DPTSA builds upon the temporal self-attention (TSA) mechanism.32,33 To overcome TSA’s limitation of a small receptive field, DPTSA incorporates atrous spatial pyramid pooling (ASPP)34 to enable hierarchical spatial interactions. An input feature is divided into three groups along the channel dimension, one group per pyramid level. At each level, the group is fused with the outputs from the previous levels via a convolution, followed by a dilated residual block. The dilation rates are set at [4, 2, 1] for the different scales, facilitating a broad range of spatial interactions. Subsequently, each level undergoes temporal self-attention and a 3D feed-forward operation via the TSA block. The process concludes with the concatenation and fusion of the features from all levels, again using a convolution for comprehensive feature integration.
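A compact sketch of this pyramid is shown below, assuming 1×1 fusion convolutions, plain (non-residual) dilated 3D convolutions, and a per-pixel temporal multi-head attention in place of the full TSA block with its 3D feed-forward operation; the group size and head count are illustrative.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Attention over the temporal axis only: every pixel attends to its own history."""
    def __init__(self, dim, heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

class DPTSA(nn.Module):
    """Sketch of dense pyramid temporal self-attention: three channel groups with
    dilation rates [4, 2, 1], each followed by temporal self-attention, then fused."""
    def __init__(self, dim, rates=(4, 2, 1)):            # dim must be divisible by 3
        super().__init__()
        g = dim // 3
        self.dilated = nn.ModuleList(
            nn.Conv3d(g, g, 3, padding=(1, r, r), dilation=(1, r, r)) for r in rates)
        self.tsa = nn.ModuleList(TemporalSelfAttention(g) for _ in rates)
        self.fuse = nn.ModuleList(nn.Conv3d(g * (i + 1), g, 1) for i in range(len(rates)))
        self.out = nn.Conv3d(g * 3, dim, 1)

    def forward(self, x):                                # x: (B, C, T, H, W)
        groups = x.chunk(3, dim=1)
        outs = []
        for g_in, conv, tsa, fuse in zip(groups, self.dilated, self.tsa, self.fuse):
            dense = fuse(torch.cat([g_in] + outs, dim=1))   # fuse with previous levels
            outs.append(tsa(conv(dense)))                    # dilated conv, then temporal attention
        return self.out(torch.cat(outs, dim=1))

x = torch.rand(1, 48, 8, 32, 32)                         # (batch, channels, frames, H, W)
print(DPTSA(48)(x).shape)                                 # torch.Size([1, 48, 8, 32, 32])
```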
5 Experiments
5.1 Simulated Experiments
We experimentally demonstrate the efficacy of the SpeedShot framework. The implementation is in PyTorch, with the data augmentation adopted from BIRNAT.35 Paired observations and ground truths are generated following Eq. (1). The networks are first trained on the DAVIS201736 dataset for 300 epochs using the Adam optimizer with a batch size of 4. The base model, aimed at 8× restoration, is trained with a cosine-annealing learning-rate schedule. Subsequently, the model is fine-tuned on Adobe24037 for 8×, 16×, and 32× restoration over 30 epochs with a reduced learning rate. The training objective is to minimize the mean absolute error (MAE).
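For orientation, the training recipe can be expressed as a short PyTorch configuration. The stand-in model, synthetic loader, and learning-rate values in this sketch are placeholders (the exact rates are not reproduced here); only the Adam optimizer, cosine-annealing schedule, 300-epoch budget, and MAE objective follow the description above.

```python
import torch
import torch.nn as nn

# Placeholder model and data; substitute MSRT and observation/ground-truth pairs from Eq. (1).
model = nn.Conv3d(1, 8, 3, padding=1)
loader = [(torch.rand(2, 1, 8, 64, 64), torch.rand(2, 8, 8, 64, 64)) for _ in range(4)]

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)           # placeholder initial rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-6)
criterion = nn.L1Loss()                                              # MAE objective

for epoch in range(3):                                               # 300 epochs in the paper
    for y, gt in loader:
        loss = criterion(model(y), gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
print("final learning rate:", scheduler.get_last_lr())
```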
5.1.1 Comparison with VFI in multiframe restoration
We evaluate SpeedShot’s video speed-up capability on two widely used video benchmarks, Adobe240 and GoPro.38 State-of-the-art image-only39 and event-assisted VFI methods are included for comparison.
Figure 5. Details of the motion-guided hybrid enhancement (MoGHE) block.
Figure 6. Example of SpeedShot’s reconstruction from Adobe240. With a pair of simultaneously taken observations, SpeedShot records and restores tens of frames of a dynamic scene with nonlinear motions. Interpolation methods often fail in such scenarios due to a lack of pixel correspondence between the first and the last frames. (a) Input. (b) Reconstructions.
Figure 7. Visual comparison of the GoPro dataset at 8× speed-up.
The computational efficiency of our MSRT and other VFI methods, in terms of model size and inference speed, is compared using an RTX 3090 GPU. Our MSRT contains 14.9 million parameters and takes 0.74 s per inference. The numbers of parameters and inference times of the other methods are listed in Table 1. Our MSRT offers superior reconstruction quality with a reasonably lightweight architecture, achieved by reusing the same network module across multiple iterations. Future improvements in real-time performance can be anticipated through model compression techniques such as pruning and quantization.
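The reported model size and runtime can be measured with a routine such as the following; the stand-in network, input size, and warm-up/repeat counts are arbitrary choices for illustration.

```python
import time
import torch
import torch.nn as nn

def count_parameters(model):
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def time_inference(model, x, warmup=3, runs=10):
    """Average inference time in seconds, with CUDA synchronization when on GPU."""
    for _ in range(warmup):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Stand-in model and input; replace with MSRT and a real coded observation pair.
model = nn.Sequential(nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 8, 3, padding=1))
x = torch.rand(1, 2, 256, 256)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
print(f"{count_parameters(model):.2f} M params, {time_inference(model, x):.4f} s / inference")
```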
Event cameras and SpeedShot are conceptually related. Event cameras record pixel intensity changes as sparse data, whereas SpeedShot multiplexes these changes into a single coded snapshot. Experimental results show that SpeedShot outperforms event-assisted methods by 1.6 dB. This improvement is attributed to the binary quantization in event cameras, which discards substantial information during event generation. In contrast, SpeedShot records intensity changes at full precision, preserving more detail.
5.1.2 Comparison with image deblurring
The SpeedShot system simultaneously captures two coded blurry exposures. Multiframe decoding can thus be interpreted as a more challenging variant of image deblurring. From this perspective, we compare SpeedShot’s performance with several SOTA deblurring methods.46
5.1.3 Comparison with VSCI
The proposed SpeedShot framework has a close relationship with VSCI.6,7,33,52 Both methods reconstruct multiple frames from a highly compressed measurement. Although VSCI employs both the spatial and temporal dimensions for video compressive encoding, SpeedShot works only in the temporal dimension. We test the 8× speed-up performance of SpeedShot on Set6,53 a widely used dataset for VSCI. The SOTA method, EfficientSCI,33 is chosen for comparison. As shown in Table 3, SpeedShot achieves comparable performance. As visualized in Fig. 8, both methods recover the details of motion, both qualitatively and quantitatively. SpeedShot even outperforms VSCI in scenes with linear motion, e.g., “Traffic.”
Figure 8. Selected reconstructions of Set6. EfficientSCI is the SOTA VSCI method.
5.2 Optical Prototype
To validate SpeedShot’s practicality, we construct two optical prototypes with commercial low-speed cameras. The first prototype, shown in Fig. 1(c), utilizes dual RGB cameras [OmniVision OV4689, 4MP, 60 frames per second (fps)] operating in multi-exposure mode. Each camera’s sensor combines four coded frames into a single observation at 7.5 fps. Spatial alignment is achieved using homography-based calibration with a checkerboard pattern, ensuring accurate registration of the paired observations. The MSRT network reconstructs a 60 fps video from these coded observations, as shown in Fig. 9. SpeedShot manages to recover the nonlinear moving trajectory of the bottle while reducing the motion blur in the observations.
Figure 9. Paired observations from our SpeedShot dual RGB camera prototype, and the corresponding 8× reconstructions. The temporal gradient image highlights the frame-wise object motion and reflects the trajectory of the movement.
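The homography-based spatial alignment described above can be prototyped with OpenCV roughly as follows; the checkerboard size, file names, and RANSAC choice are assumptions, not the calibration code used for the prototype.

```python
import cv2
import numpy as np

def align_with_checkerboard(img_ref, img_moving, pattern=(9, 6)):
    """Estimate a homography from checkerboard corners seen by both cameras,
    then warp the second camera's image onto the first camera's view."""
    gray_ref = cv2.cvtColor(img_ref, cv2.COLOR_BGR2GRAY)
    gray_mov = cv2.cvtColor(img_moving, cv2.COLOR_BGR2GRAY)
    ok1, corners_ref = cv2.findChessboardCorners(gray_ref, pattern)
    ok2, corners_mov = cv2.findChessboardCorners(gray_mov, pattern)
    if not (ok1 and ok2):
        raise RuntimeError("checkerboard not detected in both views")
    H, _ = cv2.findHomography(corners_mov, corners_ref, cv2.RANSAC)
    h, w = img_ref.shape[:2]
    return cv2.warpPerspective(img_moving, H, (w, h)), H

# Usage with hypothetical file names: warp camera 2's coded snapshot into camera 1's
# frame before computing the multiplexed temporal gradient Y1 - Y2.
# aligned, H = align_with_checkerboard(cv2.imread("cam1.png"), cv2.imread("cam2.png"))
```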
The second prototype, depicted in Fig. 10, employs an advanced optical setup with an external shutter (Fldiscovery F4320) for precise temporal encoding. The imaging system splits the light path into two using a beam splitter (Lbtek MBS1455-A, R:T = 5:5). The first path captures high-speed dynamic scenes directly on the Blurry Cam (HIKROBOT MV-CS020-10 UC) through a Nikon Macro lens (L1, 35 mm). The second path encodes the scenes using the reflective shutter. The encoded light passes through a Coolens WWK10-110-111 lens [L2, field of view (FOV): 18 mm] and a high-performance lens (L3) before being captured by the Coded Cam (HIKROBOT MV-CA050-20 UM). Both cameras are triggered simultaneously after encoding 16 shutter frames at a 960 Hz frame rate. This system is synchronized via a laptop-controlled Arduino microcontroller. Temporal calibration of this setup involves synchronization of the shutter and cameras, verified using a high-frequency LED signal to ensure precise modulation timing. This setup reconstructs a 960 fps video from 60 fps observations, as shown in Fig. 11. Although the captured image is severely blurred, the proposed MSRT can still reconstruct the intermediate frames of the abrupt process correctly with fine details.
Figure 10. Imaging prototype with an external shutter for optical modulation. It accelerates a 60 Hz camera to 960 Hz.
Figure 11. Inputs and reconstructions from the SpeedShot prototype with an external mechanical shutter at 960 Hz. One camera captures a temporally coded observation, whereas the other captures a blurry, uncoded image. Subtracting the coded observation from the blurry one yields an additional coded observation, enabling 960 fps video reconstruction.
5.3 Ablation Study
5.3.1 Multiplexed temporal gradient imaging
A key contribution of our paper is the introduction of a second camera and the use of CEP in the gradient domain. To validate this, an adjusted version of MSRT is implemented, taking two consecutive coded-exposure snapshots from a single camera as input and predicting the frames corresponding to one coded snapshot. The result on 8× Set6 is 26.58 dB, an 8 dB decrease in PSNR. As shown in Fig. 12, existing CEP struggles to distinguish nonlinear motion cues from static textures of the scene, leading to a blurry prediction.18,54 In contrast, our dual-camera design resolves this ambiguity, enabled by the explicitly defined and extracted motion map.
Figure 12. Comparison with previous single-camera CEP.
5.3.2 Coding pattern selection
We recommend using random asymmetric coding patterns. As shown in Table 4, the ablation results are averaged over 10 different patterns to ensure statistical reliability. Each pattern is generated by randomly shuffling a binary sequence with half zeros and half ones. It can be seen that pattern selection has minimal impact on reconstruction performance,22 as the variations in PSNR are relatively small. However, symmetric patterns can introduce ambiguity in determining the motion direction and lead to blur.54
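Such patterns can be generated with a few lines of code. The palindromic-rejection rule below is one reading of “asymmetric” and is an illustrative assumption rather than the paper’s exact criterion.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_asymmetric_code(n):
    """Random binary pattern with exactly half ones and half zeros, rejecting
    time-symmetric (palindromic) patterns that make motion direction ambiguous."""
    while True:
        code = rng.permutation(np.repeat([0, 1], n // 2))
        if not np.array_equal(code, code[::-1]):    # reject symmetric patterns
            return code

print(random_asymmetric_code(8))   # e.g., [0 1 1 0 1 0 0 1]
```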
5.3.3 Noise robustness
Using the camera noise model,55 we train and test three MSRT variants, each considering different noise sources. As shown in Table 5, the results demonstrate that the proposed SpeedShot framework is resilient to common noise, maintaining relatively high performance and making it practical for real-world applications.
5.3.4 Mis-calibration robustness
To ensure reliable performance in practical settings, we evaluate SpeedShot’s robustness to hardware calibration errors. A key advantage of SpeedShot is its robustness to spatial calibration errors, as it does not require pixel-wise alignment between the two cameras. Global alignment, achieved via homography-based calibration, is sufficient. For temporal calibration errors, we introduce timing jitter as temporal noise by blending each captured high-speed frame with adjacent frames using random weights of 0% to 1% and 0% to 3% to mimic inaccurate shutter modulation or synchronization errors. The results show negligible degradation, with a PSNR reduction of less than 1 dB for 3% jitter, demonstrating robust performance against minor temporal noise.
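The jitter simulation described above can be sketched as follows; the blending of each frame with only its two immediate neighbors and the uniform weight distribution are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_timing_jitter(frames, max_blend=0.03):
    """Mimic shutter/synchronization jitter by blending each high-speed frame with its
    neighbors using small random weights (e.g., up to 1% or 3%, as in the ablation)."""
    T = len(frames)
    jittered = frames.copy()
    for t in range(T):
        w_prev = rng.uniform(0, max_blend) if t > 0 else 0.0
        w_next = rng.uniform(0, max_blend) if t < T - 1 else 0.0
        jittered[t] = ((1 - w_prev - w_next) * frames[t]
                       + w_prev * frames[max(t - 1, 0)]
                       + w_next * frames[min(t + 1, T - 1)])
    return jittered

frames = rng.random((8, 64, 64)).astype(np.float32)
noisy = add_timing_jitter(frames, max_blend=0.03)
print(float(np.abs(noisy - frames).max()))   # perturbation on the order of a few percent
```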
5.3.5 MSRT network design
We first analyze MSRT’s improvement over simple convolutional networks. We implement a U-Net model with residual 3D convolutional blocks, which has a similar number of parameters to our MSRT (14.9M). Its result on 8× Set6 is 31.65 dB, an over 3 dB decrease in PSNR. The MSRT ablation then involves retraining the network on a single scale and replacing the error map and the motion prior with zero tensors. The results in Table 6 show that the scale-recurrent unfolding structure yields the most significant improvement, enhancing PSNR by 0.6 dB. In addition, the motion guidance from the MGSCS block and the error-aware fusion from the ECSF block contribute PSNR increases of 0.4 and 0.2 dB, respectively, validating the effectiveness of the proposed network design.
6 Conclusion
This paper presents SpeedShot, a compressive high-speed imaging framework that employs dual coded-exposure cameras for multiplexed temporal gradient imaging. This approach offers a viable alternative to existing VFI and VSCI methods and is especially advantageous for popular dual-camera devices such as smartphones. The proposed method shows strong potential for real-life applications. In consumer-level scenarios such as surveillance, sports analysis, and video analytics, SpeedShot can lower the required capture frame rate, relaxing the demands on lighting conditions, data bandwidth, and storage. This makes high-speed video analysis more accessible on low-power and low-cost hardware. In high-speed scientific imaging, SpeedShot has the potential to boost the effective frame rate by tens of times, offering an affordable alternative to specialized high-speed cameras. Future work will focus on more challenging scenarios, such as handling back-and-forth motion and imaging in high-noise environments.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 62305184), the Basic and Applied Basic Research Foundation of Guangdong Province (Grant No. 2023A1515012932), the Science, Technology, and Innovation Commission of Shenzhen Municipality (Grant No. JCYJ20241202123919027), the Major Key Project of Pengcheng Laboratory (Grant No. PCL2024A1), the Science Fund for Distinguished Young Scholars of Zhejiang Province (Grant No. LR23F010001), the Research Center for Industries of the Future (RCIF) at Westlake University and the Key Project of Westlake Institute for Optoelectronics (Grant No. 2023GD007), the Zhejiang “Pioneer” and “Leading Goose” R&D Program (Grant Nos. 2024SDXHDX0006 and 2024C03182), the Ningbo Science and Technology Bureau “Science and Technology Yongjiang 2035” Key Technology Breakthrough Program (Grant No. 2024Z126), the Research Grants Council of the Hong Kong Special Administrative Region, China (Grant Nos. C5031-22G, CityU11310522, and CityU11300123), and the City University of Hong Kong (Grant No. 9610628).
Biographies of the authors are not available.
[32] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in Proc. Int. Conf. Machine Learning (2021).
[36] J. Pont-Tuset et al., “The 2017 DAVIS challenge on video object segmentation” (2017).
Yifei Zhang, Xing Liu, Lishun Wang, Ping Wang, Ganzhangqin Yuan, Mu Ku Chen, Kui Jiang, Xin Yuan, Zihan Geng, "High-speed video imaging via multiplexed temporal gradient snapshot," Adv. Photon. Nexus 4, 046017 (2025)
Category: Research Articles
Received: Mar. 26, 2025
Accepted: Jul. 4, 2025
Published Online: Aug. 15, 2025
The Author Email: Xin Yuan (xyuan@westlake.edu.cn), Zihan Geng (geng.zihan@sz.tsinghua.edu.cn)