1. INTRODUCTION
Event cameras [1], also known as neuromorphic cameras or dynamic vision sensors, are an emerging modality for capturing dynamic scenes. Their ability to capture data at a much faster rate than conventional cameras has led to their application in high-speed navigation, augmented reality, and real-time 3D reconstruction [2]. Unlike a conventional CMOS camera that outputs intensity images at fixed intervals, an event camera detects brightness changes at each pixel asynchronously. When the brightness change at a pixel exceeds some threshold, an event is recorded. Each event consists of three elements: a timestamp, the spatial coordinates of the triggered pixel, and a binary polarity indicating whether the brightness increased or decreased.
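For concreteness, an event stream can be represented in software as a time-ordered list of such tuples. The minimal sketch below illustrates the data layout; the field names are our own and are not tied to any particular camera SDK.

```python
from typing import NamedTuple

class Event(NamedTuple):
    """A single event: timestamp, pixel coordinates, and polarity."""
    t: int  # timestamp in microseconds
    x: int  # pixel column
    y: int  # pixel row
    p: int  # +1 for a brightness increase, -1 for a decrease

# The sensor output is then simply a time-ordered sequence of such events.
stream = [Event(t=1000, x=320, y=240, p=+1), Event(t=1042, x=321, y=240, p=-1)]
```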
While this scheme allows event cameras to operate beyond conventional framerates, it makes them blind to the stationary components of a scene, which induce no brightness changes over time. This issue is especially prevalent when the camera is not moving and therefore provides no information about the static background. Even though event cameras are not designed to capture an intensity image, it is often needed for downstream applications such as initializing motion tracking algorithms [3]. To mitigate this issue, some event cameras include a conventional frame-based sensor in the pixel circuit to simultaneously image both events and traditional intensity images [4,5]. Alternatively, when circuit-level modification of the event camera is not possible, a frame-based camera can be installed in parallel, using either a beam splitter [6–8] or additional view registration [9]. A recent study built an opto-mechanical device that rotates a wedge prism to generate events on static scenes [10]. All of these solutions introduce additional hardware and increase cost, complexity, size, and power consumption, highlighting the challenge of capturing static scenes without adding significant overhead.
Fig. 1. Overview of Noise2Image. (a) An event pixel circuit compares the change of logarithmic light intensity and produces an event when the change passes the contrast threshold. (b)–(e) To illustrate the illuminance dependency of noise events on static scenes, we simulate the photon arrival process for low- or high-brightness scenes. The variance of the logarithmic intensity is higher for the low-brightness scene, and thus its deviations from the reference level that triggered the previous event are also larger, resulting in many noise events under low brightness. The dotted black lines in (d) indicate the contrast thresholds for positive and negative events. (f) The Noise2Image method takes only noise events to reconstruct the static scene intensity. Noise2Image relies on characterizing the relationship between noise events and scene intensity, and on learned priors to resolve ambiguities. For scenes with both static and dynamic components, signal events triggered by intensity changes can be fed into existing event-to-video reconstruction methods.
Our work leverages the fact that, even when the scene is static, event cameras still produce noise events. We focus on low- and moderate-brightness regimes (e.g., room light, outdoor sunset) [11–13] in which the dominant source of noise is photon noise—random fluctuations in the photon arrival process. In contrast, the high-brightness regime (e.g., outdoor daylight) includes leakage noise events [14], which we do not model here. While photon noise is well studied in the context of frame-based sensors, the events it triggers in an event camera are commonly treated as part of the sensor’s general background activity, to be filtered out.
In this work we propose a method called Noise2Image that can reconstruct a static scene from its event noise statistics, with no hardware modifications (see Fig. 1). First, we derive a statistical noise model describing how noise event generation correlates with scene intensity, which shows a good correspondence with our experimental measurements. Unlike in conventional sensors, where photon noise grows with the signal, we find that for event cameras, the number of events triggered by photon noise is mostly negatively correlated with the illuminance level due to the logarithmic sensitivity of the sensor. Imaging the static scene then amounts to inverting this intensity-to-noise process. However, the mapping is one-to-many, so not directly invertible; thus, we rely on a learned prior to resolve ambiguities. To train and validate our method, we experimentally collect a dataset of event recordings on static scenes. We demonstrate that Noise2Image can recover photo-realistic images from noise events alone, or can be used to recover the static parts of scenes with dynamics. In addition to testing on in-distribution data, we demonstrate the robustness of this approach with out-of-distribution testing data and live scenes.
2. RELATED WORK
A. Event-to-Video Reconstruction
Event-to-video reconstruction (E2VID) is a class of methods that recover high-framerate video from event recordings. These methods can capture high-speed dynamics without the motion blur that plagues traditional frame-based cameras. Because event cameras only capture scene changes, many E2VID approaches include one or two traditional frame-based images, using events to deblur [15–18], synthesize adjacent frames [19], or interpolate temporally [20–23].
In contrast, a more challenging class of E2VID reconstruction only uses events for video reconstruction. Because the initial scene intensity is unknown and only the relative intensity changes are measured, reconstruction requires either explicit modeling of spatiotemporal relationships [24] or deep neural networks as a data prior to fill in the missing information [25–31]. Because it is difficult to collect event recordings paired with the corresponding frame-based videos at scale, the data prior is obtained through synthetic data generation [25,26] using an event camera simulator [32]. While E2VID methods are effective for recordings with object motion or camera motion, the reconstruction of static scenes or regions is still out of reach since no events are thought to be triggered without motion. Our Noise2Image method can be considered complementary to E2VID; one might use Noise2Image to recover the static parts of the scene and E2VID to recover the dynamic parts.
B. Event Camera Noise Characterization
An event recording often contains a number of events not associated with intensity changes, which are termed noise events or background activity [12,33]. Noise events are attributed to two main sources: photon noise and leakage current [11,13,14]. Photon noise is the dominant source of noise in low-brightness conditions [13,34], while leakage noise events dominate in high-brightness conditions [13,14]. Although the noise model of event cameras is less studied than traditional sensors, in the lower-light regime, it is generally believed that noise events become less likely to trigger as intensity increases [35,36]. In event camera simulators, the events triggered by photon noise are modeled as a Poisson process, with the noise event rate linearly decreasing with intensity [36]. The intensity dependency of noise events has been experimentally measured in [13], and shows a non-monotonic relationship with intensity in low light. Our experimental results are consistent with this trend, and we derive a theoretical noise model that explains the relationship between intensity and noise events.
C. Event Denoising
While denoising methods for CMOS or CCD sensors often focus on building an accurate noise model, event camera denoising methods instead emphasize noise detection by identifying the “signal” events corresponding to changes in the scene and removing everything else. Because natural scene changes are inherently spatiotemporal, an event triggered by a real signal should be accompanied by a number of other events at neighboring spatial and temporal locations. With this observation, a background activity filter (BAF) (also called a nearest neighbor filter) identifies real events by checking the time difference between a new event and the most recent previous event in its spatial neighborhood, rejecting events when the time difference exceeds a threshold [37–39]. Similarly, noise events can also be found through spatiotemporal local plane fitting [40]. BAF is simple to compute and can be implemented in hardware with limited computational resources [41,42]. Guo and Delbruck further extended BAF by reducing its memory footprint and incorporating structural information from a data-driven signal-versus-noise classifier [11]. A recent study points out that events are often triggered simultaneously by noise and signal and therefore treats denoising as a regression problem to predict the noise likelihood [43]. Here, we assume access to a good event denoiser [35,44] that isolates noise events resulting from static components of a scene.
D. Event Camera Datasets
Unlike for frame-based cameras, the collection and curation of event camera datasets is challenging, largely due to less accessible hardware and the scarcity of online data repositories (e.g., Flickr or Google Images for frame-based images). Nevertheless, since large datasets are critical for machine learning [45,46], there have been a handful of recent efforts to systematically generate large-scale event camera datasets. One common approach is to display images on a monitor and record them using an event camera [47–50]. The motion in the scene can then be generated by either moving the displayed image on the monitor [47,49] or moving the camera [48,50]. Alternatively, event datasets can also be generated from frame-based video datasets by over-sampling in time and feeding into an event camera simulator [36,51,52]. In this work, we generate the NE2I dataset using experimental measurements and synthetic noise based on the model in Section 3.
3. MODELING NOISE EVENT STATISTICS
First, we develop a model of noise event statistics for later use in our synthetic dataset generation and image reconstruction steps. An event $e$ is triggered when a pixel $(x, y)$ detects a change in logarithmic intensity, $\log I(t)$, greater than the contrast threshold, $c$, at time $t$. The polarity $p = +1$ if the change is positive, and $p = -1$ if the change is negative. The triggering condition for an event can be written as

$p \left[ \log I(x, y, t) - \log I(x, y, t_0) \right] > c, \quad (1)$

where $t_0$ is the timestamp of the most recent event at the same pixel. The logarithmic sensitivity ensures that this triggering condition is adaptive to the brightness level, e.g., a tenfold increase from 1 lux to 10 lux would result in a similar response as a tenfold increase from 100 lux to 1000 lux. As pixels operate asynchronously, we will proceed by considering events at an arbitrary pixel and drop the $(x, y)$ notation for simplicity.
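As an illustrative sketch of this triggering rule for a single pixel, consider the function below; the threshold value used as a default is arbitrary, not a calibrated camera parameter.

```python
import math

def check_event(I_t: float, I_ref: float, c: float = 0.3):
    """Return the polarity of a triggered event, or None if no event fires.

    I_t   : current intensity at the pixel
    I_ref : intensity at the time of the pixel's last event (the reference level)
    c     : contrast threshold on the change in logarithmic intensity
    """
    delta = math.log(I_t) - math.log(I_ref)
    if delta > c:
        return +1   # positive event: brightness increased
    if delta < -c:
        return -1   # negative event: brightness decreased
    return None     # change below threshold: no event

# The logarithmic comparison makes the rule adaptive to brightness level:
# a 1 lux -> 10 lux step and a 100 lux -> 1000 lux step give the same delta.
assert check_event(10, 1) == check_event(1000, 100) == +1
```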
By design, there should be no events triggered by a static scene, because there will be no changes in intensity. However, the inherent randomness in the photon arrival process means that there will be fluctuations in the detected intensity and thus noise events triggered. Consider a static scene in which a pixel sees $k$ photons over a short period of time. $k$ follows a Poisson distribution with the average photon count $\lambda \propto I$, which can be further approximated as a Gaussian distribution for the light levels we work with in photography conditions ($\lambda > 10$). We similarly write the photon count during the last event trigger as $k_0$, and both $k, k_0 \sim \mathrm{i.i.d.}\ \mathcal{N}(\lambda, \lambda)$.
To trigger a positive event, we need $\log(k) - \log(k_0) > c$, which can be re-written as $k - k_0 e^{c} > 0$. Under the Gaussian approximation, $k - k_0 e^{c} \sim \mathcal{N}\left(\lambda(1 - e^{c}),\ \lambda(1 + e^{2c})\right)$. Thus, we can write the probability of triggering a positive event:

$P_e(\lambda) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\lambda(1 - e^{c})}{\sqrt{2\lambda(1 + e^{2c})}}\right)\right], \quad (2)$
where erf denotes the error function. The derivation for negative events is similar. This function is plotted in Fig. 2(a) (cyan curve) and reveals a monotonically decreasing trend between noise event probability and illuminance. In other words, as the illuminance increases, it is less likely to find $k, k_0$ that satisfy the triggering condition. This trend matches previous literature [12,13,35].
Fig. 2. (a) Theoretical noise event probability, $P_e$, versus the average photon count, $\lambda$, at different photoreceptor bias values, $b_{\mathrm{pr}}$ ($c$ is the contrast threshold). (b) Experimentally measured noise event rate versus illuminance (proportional to $\lambda$) matches well with our theoretical model after fitting the parameters $c$, $b_{\mathrm{pr}}$, and the illuminance-to-$\lambda$ scaling.
For low-light conditions, where $\lambda$ is relatively small, the relative signal fluctuation gets stronger, which can lead to more noise events. Consequently, the analog circuit of an event pixel is designed to have a photoreceptor bias voltage and source-follower buffer, which help stabilize the photoreceptor’s reading and filter out fluctuations beyond a certain bandwidth [12,33]. To account for this in our model, we approximate the filtering effect by adding a photoreceptor bias term, $b_{\mathrm{pr}}$, to the photon count before the logarithmic operation. As a result, the triggering condition of a positive event becomes $\log(k + b_{\mathrm{pr}}) - \log(k_0 + b_{\mathrm{pr}}) > c$, which can be re-written as $k + b_{\mathrm{pr}} - e^{c}(k_0 + b_{\mathrm{pr}}) > 0$. We can similarly write the probability of triggering an event, $P_e$, as a function of the average photon count (proportional to the illuminance):

$P_e(\lambda) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{(1 - e^{c})(\lambda + b_{\mathrm{pr}})}{\sqrt{2\lambda(1 + e^{2c})}}\right)\right]. \quad (3)$
Fig. 3. (a) Displayed static scene. (b), (c) Synthetic noise event count sampled from the Poisson distribution and generalized negative binomial distribution, respectively. (d) Noise event count from an experimental recording. The insets show the histogram of the noise event count. The frequency axis (y-axis) of the histogram is in log scale.
With the bias term, the model shows fewer noise events at lower illuminance levels [Fig. 2(a)]. As average photon count increases, the number of noise events will increase up to a point and then decrease. While this model is not monotonic, we will show that we can still recover illuminance from noise event counts.
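The following sketch evaluates this probability numerically under the Gaussian approximation in Eq. (3); the contrast threshold and bias values are illustrative, not fitted to our camera.

```python
import numpy as np
from scipy.special import erf

def noise_event_prob(lam, c=0.3, b_pr=50.0):
    """Probability that photon noise alone triggers a positive event.

    lam  : average photon count over the comparison interval (proportional to illuminance)
    c    : contrast threshold
    b_pr : photoreceptor bias term added before the logarithm
    """
    mean = (1.0 - np.exp(c)) * (lam + b_pr)   # mean of k + b_pr - e^c (k0 + b_pr)
    var = lam * (1.0 + np.exp(2.0 * c))       # variance under k, k0 ~ N(lam, lam)
    return 0.5 * (1.0 + erf(mean / np.sqrt(2.0 * var)))

lam = np.logspace(1, 5, 200)
p = noise_event_prob(lam)
# With b_pr > 0 the curve rises, peaks near lam ~ b_pr, then decays, as in Fig. 2(a);
# with b_pr = 0 it decreases monotonically with illuminance, as in Eq. (2).
print(lam[np.argmax(p)])
```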
To validate our derivation, we acquired experimental measurements of noise events at different illuminance levels [Fig. 2(b)]. The measured positive event rate matches our derived noise model after parameter fitting. Note that our formulation is a simplified noise model for the event circuit, which characterizes the events triggered by photon noise; the actual analog circuit has additional controllable bias parameters [12,33] that are beyond our formulation.
A. Sampling Noise Events
Once we know the probability of observing a single noise event, we can simulate the number of noise events over a fixed time window given the intensity of a scene, $I$. Viewing each potential noise event in this window as a Bernoulli trial with probability $P_e$, the noise event count has a binomial distribution with probability $P_e(\lambda)$ and $N$ trials. With the refractory period (i.e., minimum time between events) much smaller than the inter-event interval (i.e., the time interval between two consecutive noise events at the same pixel), $N$ will be large, and the binomial distribution can be approximated by a Poisson distribution with parameter $N \cdot P_e(\lambda)$.
However, we observe that the empirical noise event count [Fig. 3(d)] has a higher variance than the Poisson distribution [Fig. 3(b)], referred to as overdispersion. The overdispersion of the experimental data is likely caused by some inter-pixel variabilities, such as the variation of contrast threshold values [34]. Thus, we instead sample the noise count from a generalized negative binomial distribution—which is often used in lieu of Poisson when there is overdispersion—with a mean of $N \cdot P_e(\lambda)$ and an illuminance-dependent variance obtained empirically from the calibration data. The noise count sampled from the negative binomial distribution [Fig. 3(c)] resembles the experimental count.
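A minimal sketch of this sampling step is shown below. NumPy's negative binomial is parameterized by the number of successes and the success probability, so we convert from the desired mean and variance; the overdispersion factor in the example is illustrative, standing in for the illuminance-dependent variance measured from calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise_counts(mean_count, var_count, size=None):
    """Sample noise event counts from a negative binomial with a given mean and variance.

    NumPy parameterizes NB(n, p) with mean n(1-p)/p and variance n(1-p)/p**2,
    so n = mean**2 / (var - mean) and p = mean / var (requires var > mean).
    """
    mean_count = np.asarray(mean_count, dtype=float)
    var_count = np.asarray(var_count, dtype=float)
    n = mean_count**2 / (var_count - mean_count)
    p = mean_count / var_count
    return rng.negative_binomial(n, p, size=size)

# Example: a mean of N * P_e(lambda) = 20 events per pixel in the window, with an
# (illustrative) overdispersion factor of 3 relative to a Poisson model.
counts = sample_noise_counts(mean_count=20.0, var_count=60.0, size=(720, 1280))
print(counts.mean(), counts.var())
```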
4. RECONSTRUCTING THE SCENE
Once we have a model for the mapping between scene intensity and noise event count at each pixel, we can develop an algorithm for recovering the static scene. We first estimate the true noise event count $N \cdot P_e(\lambda)$ from the empirical noise event count that the camera measures, and then estimate the intensity $I$ by inverting Eq. (3). By doing this for every pixel, we can form an image for a static scene.
This inverse problem is non-trivial to solve for two reasons. First, the problem is ill-posed: for nonzero bias, Eq. (3) is one-to-many, as in Fig. 2(b), so the noise count alone is insufficient to solve for exact intensity values. Second, even though the empirical event count is the maximum likelihood estimator of the true event count, estimating it from a finite time window incurs statistical error, which leads to further downstream error after inverting $P_e(\lambda)$.
We approach the first challenge by counting the positive and negative polarities of noise events separately. We found that event cameras often set different contrast threshold values to trigger positive and negative events, since the leakage current of the event circuit causes positive events to be triggered more easily [14]. Because of potentially asymmetric behavior of the two polarities, the correlation between light intensity and noise events is also different for each polarity. As shown in Fig. S1, the corresponding light intensity becomes more uniquely defined when counting positive and negative events separately.
The second issue—inexact event count estimation—can also be mitigated through the use of data priors in the inverse problem. This will make the resulting reconstruction robust to small errors in the event rate. Combining these two, we train a neural network to map the estimated event count directly to the corresponding intensity image. The network accepts inputs with two spatial channels for positive and negative event counts. To further reduce the variance, we perform pixel binning across blocks of 2×2 pixels.
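A sketch of the corresponding input preparation is given below; it shows only the two-channel event-count aggregation and the 2×2 binning, using our sensor resolution as the default shape (the network itself is the modified U-Net cited in Section 5).

```python
import numpy as np

def events_to_input(events, height=720, width=1280, bin_factor=2):
    """Aggregate events into a 2-channel count image (positive, negative) and bin 2x2.

    events : iterable of (t, x, y, p) tuples with p in {+1, -1}
    returns: float32 array of shape (2, height // bin_factor, width // bin_factor)
    """
    counts = np.zeros((2, height, width), dtype=np.float32)
    for t, x, y, p in events:
        channel = 0 if p > 0 else 1
        counts[channel, y, x] += 1.0
    # 2x2 pixel binning reduces the per-pixel variance of the count estimate.
    h, w = height // bin_factor, width // bin_factor
    counts = counts[:, : h * bin_factor, : w * bin_factor]
    counts = counts.reshape(2, h, bin_factor, w, bin_factor).sum(axis=(2, 4))
    return counts

# The network then maps this (2, 360, 640) count image to the intensity image.
```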
A. Application on Dynamic Scenes
As in Fig. 1(f), the static scene reconstruction can also operate in parallel with the event-to-video reconstruction (E2VID) pipeline on recordings with both static and dynamic components, e.g., scenes with moving foreground and static background. We first separate signal events triggered by dynamic changes from other recorded events using an event denoising algorithm [35,44], and apply an E2VID method to reconstruct the dynamic components. The remaining events can be considered noise events triggered mostly by photon fluctuation, which can be fed into the proposed Noise2Image model for static background reconstruction. For a variable-length recording, we aggregate noise events using a moving window with a fixed temporal width. Lastly, we stitch the dynamic and static scene reconstructions together using a binary motion mask identified from the signal events.
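The following sketch summarizes this pipeline; `denoiser`, `e2vid`, and `noise2image` are placeholder callables standing in for an event denoiser [35,44], a pretrained E2VID model, and the trained Noise2Image model, and the mask threshold is illustrative.

```python
import numpy as np

def reconstruct_frame(events, denoiser, e2vid, noise2image, mask_threshold=2):
    """Sketch of the hybrid pipeline: E2VID on signal events, Noise2Image on noise events.

    The three callables are placeholders for an event denoiser, a pretrained E2VID
    model, and the trained Noise2Image model; each returns a 2D image array.
    """
    signal_events, noise_events = denoiser(events)   # split signal from noise events
    dynamic = e2vid(signal_events)                   # moving foreground
    static = noise2image(noise_events)               # static background

    # Binary motion mask from the per-pixel signal event count.
    signal_counts = np.zeros_like(static)
    for t, x, y, p in signal_events:
        signal_counts[y, x] += 1
    motion_mask = signal_counts >= mask_threshold

    # Stitch: foreground where motion was detected, background elsewhere.
    return np.where(motion_mask, dynamic, static)
```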
5. EXPERIMENTS
The existing event camera datasets all contain a significant amount of scene and/or camera motion, in which events triggered by intensity changes cannot be fully distinguished from noise events [43]; they are therefore not suitable for our goal of static scene reconstruction. Hence, we collect our own training/validation dataset, termed noise events-to-image (NE2I). NE2I contains pairs of high-resolution intensity images and noise event recordings from both experimental acquisition and synthetic noise based on the model presented in Section 3.A.
We image a 24.5-inch LCD monitor displaying static images. To calibrate the noise model, we display 256 grayscale values on the monitor and capture both their event response and light intensity. The illuminance of each grayscale value is measured using a light meter placed next to the camera. Our event camera is a monochromatic Prophesee Metavision EVK3-HD. The default bias parameters were used for the camera, and no denoising was performed on the raw data. While the event camera has a high pixel count (1280×720 pixels), it does not have an active pixel sensor (APS) that records frame-based intensity images. To register the event camera recording to the screen, we first imaged a standard checkerboard flashing on the monitor to induce events. The aggregated event count is used to establish a transformation matrix [53], which is applied to all subsequent event recordings to align them spatially with the monitor.
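One possible implementation of this registration step is sketched below with OpenCV; the checkerboard pattern size, square size, and corner ordering are illustrative assumptions and would need to match the displayed pattern.

```python
import cv2
import numpy as np

def register_events_to_screen(event_count, pattern_size=(9, 6), square_px=100):
    """Estimate a homography mapping event-camera pixels to monitor pixels.

    event_count : 2D array of aggregated event counts while a checkerboard flashes
    pattern_size: inner-corner grid of the checkerboard (illustrative value)
    square_px   : checkerboard square size in monitor pixels (illustrative value)
    """
    img = cv2.normalize(event_count.astype(np.float32), None, 0, 255,
                        cv2.NORM_MINMAX).astype(np.uint8)
    found, corners = cv2.findChessboardCorners(img, pattern_size)
    if not found:
        raise RuntimeError("checkerboard not detected in aggregated event count")
    # Corresponding corner locations on the monitor, in monitor pixel coordinates.
    grid = np.mgrid[0 : pattern_size[0], 0 : pattern_size[1]].T.reshape(-1, 2)
    screen_pts = (grid * square_px).astype(np.float32)
    H, _ = cv2.findHomography(corners.reshape(-1, 2), screen_pts, cv2.RANSAC)
    return H  # apply with cv2.warpPerspective to align recordings with the monitor
```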
The full NE2I dataset consists of an in-distribution set and an out-of-distribution test set. The in-distribution data contains 1004 high-resolution images of artistic human portraits from Unsplash, split into 754 images for training, 100 for validation, and 150 for testing. The out-of-distribution test set has 100 high-resolution images from the validation set of the DIV2K image super-resolution dataset [54], aiming to provide a variety of scenes much beyond the training data distribution. We intentionally chose a confined distribution for training data and a broad distribution for testing, so that the evaluation can test if the model learns local correlations instead of an image-level prior.
Given a sequence of noise events, we first aggregate them into a 2D matrix of event counts. To account for the positive and negative polarities, we aggregate them separately and store them as two channels. We train a U-Net to map an event count matrix into the corresponding intensity image; a modified U-Net from [55] is used for enhanced performance. The network is trained using either experimental training data (described above) or synthetic event counts generated by the statistical noise model in Section 3.A. For experimental data training, we augment the input noise count by choosing a random starting time for the 1-s window within each 10-s event recording. For synthetic data training, event counts are re-sampled on the fly at each iteration. After training, the model takes around 100 ms to recover a static scene on an NVIDIA RTX3090 GPU.
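The temporal augmentation can be sketched as follows (timestamps in microseconds); the selected events are then aggregated into the two-channel count image described above and paired with the ground-truth intensity image.

```python
import numpy as np

def sample_training_window(events_t_us, recording_len_s=10.0, window_s=1.0, rng=None):
    """Pick a random 1-s window within a 10-s recording (temporal data augmentation).

    events_t_us : 1D array of event timestamps in microseconds
    returns     : boolean mask selecting the events inside the window
    """
    rng = rng or np.random.default_rng()
    t0 = rng.uniform(0.0, recording_len_s - window_s) * 1e6
    return (events_t_us >= t0) & (events_t_us < t0 + window_s * 1e6)
```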
We use E2VID as a baseline method for the evaluation of static scene recovery even though it was not trained with such data. Rather than quantifying our performance gain, this comparison aims to test whether E2VID can identify the correlation between noise events and illuminance level. We compare to the pre-trained model implementation [56] for the original E2VID [25], E2VID+ [26], FireNet [27], FireNet+ [26], ET-Net [28], SPADE-E2VID [29], SSL-E2VID [30], and HyperE2VID [31]. A sequence of events within a 1-s window is fed into each method, and the last frame of predicted video (five predicted frames in total) is used for evaluation.
A. Metrics
All methods are evaluated using experimentally collected testing data. The quality of the recovered intensity images is computed for the in-distribution and out-of-distribution testing datasets using three common quantitative metrics [25]: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and perceptual similarity (LPIPS [57]). LPIPS is computed using a pre-trained AlexNet with image intensity values normalized to [−1, 1].
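A sketch of this evaluation using common open-source implementations (scikit-image for PSNR/SSIM and the lpips package for perceptual similarity) is shown below; inputs are assumed to be grayscale images in [0, 1].

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_alex = lpips.LPIPS(net="alex")  # pre-trained AlexNet backbone

def evaluate(pred, target):
    """Compute PSNR, SSIM, and LPIPS for a pair of grayscale images in [0, 1]."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    ssim = structural_similarity(target, pred, data_range=1.0)
    # LPIPS expects 3-channel tensors normalized to [-1, 1].
    to_t = lambda im: (torch.from_numpy(im).float()[None, None] * 2 - 1).repeat(1, 3, 1, 1)
    with torch.no_grad():
        lp = lpips_alex(to_t(pred), to_t(target)).item()
    return psnr, ssim, lp
```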
B. Results
For static-only scenes, the quantitative scores for Noise2Image as well as baseline methods using the NE2I dataset are reported in Table 1. As shown in Fig. 4, the Noise2Image model trained by either experimental or synthetic data recovers detailed intensity images from the count of noise events. Both Noise2Image models generalized well to out-of-distribution testing data, providing a good level of contrast and details, suggesting that Noise2Image learns the correlation between noise event and intensity. The Noise2Image model trained by experimental data outperformed the one trained by the synthetic data by 4.1 dB of PSNR. As the synthetic-data-trained model often recovered speckle-like patterns for the uniform image background, we speculate that this performance margin is caused by the variability of contrast thresholds across pixels, which is beyond our synthesis model but potentially captured by the experimental data training. Baseline E2VID methods did not perform well on static scene reconstruction for two reasons: they were only trained by scenes with various degrees of motion (not static scenes) and the training data of E2VID was generated without an intensity-dependent noise model [32].
Table 1. Quantitative Results for Static Scene Reconstruction with In-Distribution and Out-of-Distribution Testing Data
Fig. 4. Comparison between Noise2Image and baseline pre-trained event-to-video (E2VID) methods on noise event-to-intensity reconstruction. The first row shows the input event count (captured in experiment) aggregated over a 1-s window. Noise2Image is trained using either synthetic or experimental data as specified. Full comparison with all baseline methods can be found in Fig. S2.
Fig. 5. Real-world examples of Noise2Image taken outside of the laboratory setting. The Noise2Image model trained by synthetic data produces background artifacts on the indoor scene (second row, second column), presumably caused by the one-to-many relationship of Eq. (3). In contrast, the Noise2Image model trained by experimental data predicts the background correctly, hinting that there exist spatial structures in the experimental noise sensitivity beyond our spatially independent noise event synthesis. The reference images were taken with an iPhone 12 plus back camera.
Table 2. Ablation Study of Event Polarities
In addition to the testing datasets, we tested Noise2Image on real-world scenes, both indoor and outdoor. The Noise2Image model, trained on images displayed on a monitor, generalizes well to scenes outside of the laboratory setting, as shown in Fig. 5. We have also implemented a real-time demo running on a laptop computer.
Table S1 shows the effect of varying the aggregation window time for training and testing the Noise2Image method. As the aggregation duration shortens, fewer noise events will be triggered, and the estimation of the true event count becomes less accurate. Noise2Image reconstruction works even with short aggregation duration (0.1 s), although longer integration will result in better reconstruction quality. We also tested using only events from a single polarity, which resulted in a slight reduction in performance as shown in Table 2.
C. Reconstructing Scenes with Dynamic Components
Finally, we demonstrated that Noise2Image is complementary to E2VID reconstruction when the scene has both static and dynamic components. We imaged a fan rapidly moving in front of a static scene, and fed signal events into a pre-trained E2VID model [25]. While the E2VID method performed well on the moving object, it could not recover the background static scene [Fig. 6(c)]. To incorporate Noise2Image, we first identified a motion mask at each timepoint [Fig. 6(b)] by thresholding signal events. We then aggregated noise events and fed the event count at pixels without motion to the Noise2Image model. Stitching together the moving foreground from E2VID and the static background from Noise2Image, we obtained the final high-quality reconstruction as in Fig. 6(d) and Visualization 1.
Fig. 6. Dynamic scene reconstruction of a moving fan in front of a static scene (see Visualization 1). (a) Aggregated event count. The first row is at a timepoint with no motion and only faint noise, and the second row has motion. (b) Motion mask obtained from thresholding the signal events as determined by the event denoiser (YNoise denoiser from [44] implemented in [35]). (c) E2VID reconstruction using pre-trained model from [25]. (d) The dynamic foreground reconstructed by E2VID and the static background reconstructed by Noise2Image are stitched together.
6. DISCUSSION
Our experiments were done using a single event camera, and there might be differences in the noise characteristics between different event circuit designs. However, our finding in Section 3 is based on the photon arrival process, which is generalizable to other event camera hardware. This correlation between noise events and illuminance level was also independently reported in previous studies using different event cameras [12,13,35,36].
Denoising and camera bias parameters [59] can also affect this illuminance dependency. Our data was acquired without denoising and with the default bias parameters of our event camera. We also tested different high-pass filtering bias values, which affect the correlation between noise event rate and illuminance (Fig. S3). For this reason, the trained Noise2Image model is tied to the camera model and bias parameters used during training data collection. We hope that our modeling of noise events can be improved upon by incorporating the analog nature of event sampling and modeling additional event camera bias parameters, e.g., using non-parametric models to account for unknown camera behavior.
In our evaluation, Noise2Image trained by experimental data outperforms the one trained by synthetic data, which suggests a discrepancy between the captured noise distribution and our theoretical noise model. From our observations, there exist some spatial structures in the experimental noise event count that are beyond our theoretical noise model. We believe this is caused by the variability of the contrast threshold, a known issue related to the manufacturing of event sensors [34,60].
The dynamic range of Noise2Image is limited to the moderate-brightness regime, where noise events are abundant. Although our theoretical noise model still holds in the high-brightness regime, there are fewer noise events, providing little signal for static scene recovery. Our empirical study is further constrained by the dynamic range of the monitor we used, suggesting that a high-dynamic-range, high-brightness monitor or projector could enhance the dynamic range of the experimental training data. We hypothesize that our Noise2Image approach may still work for brighter scenes in the presence of leakage-current-induced noise events [14], and leave this for future research. For low-brightness scenes, it is worth further investigating the feasibility of the Noise2Image approach under extremely low-signal conditions, such as in fluorescence microscopy.
Another future direction is to incorporate the proposed noise event modeling into existing event camera simulators. This will help synthesize data with realistic noise events, which can be used for training event-to-video recovery (E2VID) models. Overall, the model similarities between Noise2Image and E2VID suggest that in the future, Noise2Image can be incorporated into the E2VID pipeline via more realistic training data, complementing each other’s goals.
In summary, we demonstrated imaging of static scene intensity using an event camera with no additional hardware. This is made possible by the fact that photon noise triggers events at a rate correlated with illuminance. We developed a statistical model of noise event generation and leveraged it to develop a strategy called Noise2Image, which maps noise events to an intensity image via a neural network that incorporates learned local priors. We demonstrate that our approach works well quantitatively and qualitatively on experimental event recordings, including ones taken in the wild. We also show that our method is compatible with scenes that have object dynamics. Our hope is that Noise2Image reduces the need for event cameras to have additional hardware for measuring static scenes.