Journal of Semiconductors, Volume 45, Issue 9, 092401 (2024)

Multiframe-integrated, in-sensor computing using persistent photoconductivity

Xiaoyong Jiang1, Minrui Ye4, Yunhai Li3, Xiao Fu2,3,*, Tangxin Li2, Qixiao Zhao2, Jinjin Wang2, Tao Zhang4, Jinshui Miao2, and Zengguang Cheng1,**
Author Affiliations
  • 1School of Microelectronics, Fudan University, Shanghai 200433, China
  • 2Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
  • 3University of Chinese Academy of Sciences, Beijing 100049, China
  • 4School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China

    The utilization of processing capabilities within the detector holds significant promise for addressing energy consumption and latency challenges, especially in dynamic motion recognition tasks, where the generation of extensive information and the need for frame-by-frame analysis necessitate substantial data transfers. Herein, we present a novel approach for dynamic motion recognition, leveraging a spatial-temporal in-sensor computing system rooted in multiframe integration with photodetectors. Our approach introduces a retinomorphic MoS2 photodetector device for motion detection and analysis. The device generates informative final states that nonlinearly embed both past and present frames. Subsequent multiply-accumulate (MAC) calculations are then efficiently performed by the classifier. When evaluating our devices on target detection and direction classification, we achieved a recognition accuracy of 93.5%. By eliminating the need for frame-by-frame analysis, our system not only achieves high precision but also enables energy-efficient in-sensor computing.


    Introduction

    The rapidly expanding volume of human daily data and the evolving machine vision systems powered by artificial intelligence have elevated the significance of dynamic motion recognition in diverse AI applications[1–3]. Conventional machine vision systems typically follow a dissociated sensing-and-computing approach: detector-generated data are stored frame by frame in memory units, and all computations are then executed in separate computing modules. This approach often leads to substantial energy consumption and latency issues[4]. For years, research on in-sensor computing has enabled some basic calculations, such as edge extraction and convolution, to be performed locally within the detectors[4–6]. While various in-sensor systems have demonstrated the capability to handle different tasks under specific conditions, including language recognition, image classification, and basic motion perception[7–11], they still adhere to the frame-by-frame paradigm, in which data are generated in each frame and frequently transferred to subsequent computing units. The image or feature map captured in a specific frame typically contains spatial information limited to that moment, as demonstrated in Fig. 1(a). Although this approach enhances the computing efficiency and speed of the overall system to a certain extent, it does not qualitatively reduce the amount of data required for computation. Generating more temporal information within a single image to minimize data analysis requirements can break the conventional frame-by-frame paradigm, which is crucial for further advancement[12].


    Figure 1. (Color online) (a) Schematic of a traditional frame-by-frame detecting system. The detector generates an output for subsequent computing in every single frame. (b) Schematic of multiframe-integrated, in-sensor computing using the persistent photoconductivity effect. The detector continuously detects multiple frames and generates only one final output state for analysis, which already memorizes the information of past and current frames, spatially and temporally. The final state is input to the subsequent linear classifier, serving as the readout layer.

    To shift away from the traditional frame-by-frame analysis model, detectors must be capable of capturing images that incorporate spatial-temporal data covering multiple frames[1, 9, 11, 13]. Recently, there has been a surge in research focusing on biological vision systems[14, 15]. Studies have revealed that these systems rely heavily on the spatial-temporal processing capabilities embedded within the detector unit. Drawing inspiration from these natural phenomena, devices exhibiting persistent photoconductivity (PPC) have garnered significant attention. Among them, 2D material-based devices hold great promise as candidates for enabling spatial-temporal in-sensor computing[5, 7, 16, 17] when paired with artificial neural networks (ANNs); their PPC effect offers a unique advantage in this regard. Within the realm of deep neural networks, reservoir computing (RC) stands out as a particularly suitable candidate for implementing an all-in-one paradigm rather than a frame-by-frame approach, which is particularly advantageous for time-dependent tasks.

    In this work, we employed a retinomorphic photodetector to develop a spatial-temporal in-sensor computing system, leveraging the principle of integrating multiple frames into one using reservoir computing (RC) networks (Fig. 1(b)). By harnessing the nonlinear persistent photoconductivity exhibited by 2D MoS2 photodetectors, which emulate synaptic functionalities, we simulated an optoelectronic reservoir layer. This layer detects light signals and maps them into an output vector in the form of photocurrent[18], which is then input to the subsequent linear classifier, serving as the readout layer. During operation, the photodetectors are continuously illuminated with light inputs, frame by frame, at a fixed frequency, encoding the location of the target. Photocurrent is generated and decays over time; unlike in prior works[1, 9, 19], only the final states of the devices, namely the residual photocurrent after the final frame, need to be recorded and fed into the readout layer. This approach embeds information from previous frames into the final one. By deploying our spatial-temporal in-sensor computing system to classify target moving directions (clockwise/anticlockwise), we achieved a recognition accuracy of 93.5%, surpassing the performance of a traditional fully-connected (FC) network by 40% while maintaining a smaller network scale.

    Experiment

    The Au/MoS2/Au retinomorphic photodetectors were fabricated on a CVD-grown MoS2 film on a sapphire substrate as follows: First, the photoresist S1818 is uniformly spun onto the MoS2/sapphire substrate, and the interdigitated pattern region is defined by lithography and development. Then, Au/Cr (45 nm/15 nm) is deposited by electron beam evaporation (EBE) to form interdigitated electrodes, and the remaining photoresist is removed to expose the MoS2/sapphire substrate. Finally, the sample is annealed on a hot plate at 200 °C for 10 min.

    To demonstrate the microstructural morphology characterization and the elemental composition of the retinomorphic MoS2 photodetector, scanning electron microscopy (SEM) imaging and wavelength-dispersive X-ray spectroscopy (WDS) measurements were performed using a field emission electron probe micro-analyzer (EPMA, JXA-8530F Plus).

    To demonstrate the atomic structure of MoS2 crystal, the cross-sectional MoS2 layers were analyzed using a Talos F200X transmission electron microscope (ThermoFisher Scientific).

    To estimate the chemical structure, phase, and morphology of the MoS2, Raman measurements were conducted using a Labram HR800 Raman spectrometer (Horiba Jobin Yvon) at 295 K with a 532 nm excitation wavelength.

    X-ray photoelectron spectroscopy (XPS) analysis was performed using an ESCALAB 250Xi instrument to assess chemical states and elemental content qualitatively.

    The persistent photoconductivity outputs of the retinomorphic MoS2 photodetector were assessed at room temperature using a Thorlabs-LP520-SF15 pigtailed laser diode and a Keysight B1500A semiconductor device analyzer.

    To demonstrate that the photodetectors can encode the optical information of several frames into one, we exposed one detector to different sequences of light pulses, each representing different information, to generate distinguishable photocurrents; the final photocurrent states were recorded by the semiconductor device analyzer.

    To demonstrate the feasibility of our in-sensor computing system for dynamic motion recognition, we used our MoS2 photodetector to simulate an 8 × 8 array with one device per pixel. We created an 8 × 8 map in which a car can move. We used light pulses to mimic the target location, continuously presented four frames depicting target motion in two possible directions (clockwise and anticlockwise) at varying speeds, and trained the network to discern the moving directions. For comparison, we performed the same task using a traditional FC network of comparable scale. The classification and the output of results were performed on a computer.

    Results and discussion

    Characteristics of the retinomorphic MoS2 photodetector

    Fig. 2(a) shows a schematic representation of the Au/MoS2/Au structure fabricated on a CVD-grown MoS2/sapphire substrate. Fig. 2(b) shows the scanning electron microscope (SEM) image of the Au/MoS2/Au structure with interdigitated electrodes (left side); the corresponding wavelength dispersive X-ray spectroscopy (WDS) mapping of molybdenum and sulfur atoms indicates the uniform coverage of MoS2 on the substrate. The Raman spectra in Fig. 2(c) show the characteristic peaks of MoS2, including E12g (383 cm−1) and A1g (408 cm−1)[20]. The large peak frequency difference of about 25 cm−1 indicates the formation of thick MoS2, consistent with the transmission electron microscope (TEM) image shown in the inset of Fig. 2(c)[20–22]. Fig. 2(d) shows the X-ray photoelectron spectroscopy (XPS) spectra of the sample. The peak at 235 eV in the Mo 3d region indicates slight oxidation of MoS2[23]. The peaks observed at 229 and 232 eV are attributed to the Mo4+ 3d5/2 and Mo4+ 3d3/2 spin-orbit split components, respectively[24].


    Figure 2. (Color online) (a) Schematic image of the retinomorphic MoS2 photodetector. (b) Scanning electron microscopy (SEM) image (left) and wavelength-dispersive X-ray spectroscopy (WDS) maps (sulfur and molybdenum are shown in red and green, respectively) of the retinomorphic MoS2 photodetector. Scale bar, 50 μm. (c) Raman spectra of the retinomorphic MoS2 photodetector. The inset shows a transmission electron microscopy (TEM) image of a cross-sectional MoS2 flake. Scale bar, 10 nm. (d) X-ray photoelectron spectroscopy (XPS) of the retinomorphic MoS2 photodetector. The inset shows an optical image of a 1 cm × 1 cm retinomorphic MoS2 photodetector array.

    Fig. 3(a) presents the current−voltage (I−V) characterization of the retinomorphic MoS2 photodetector. The device exhibits a linear I−V curve, shown in the inset of Fig. 3(a), indicating ohmic contact. When the device is illuminated with a green laser (520 nm), the external electric field from the applied bias voltage drives the photo-excited carriers, which are then collected by the interdigitated electrodes, demonstrating a good photoresponse. Fig. 3(b) shows the photoresponse of the retinomorphic MoS2 photodetector under laser pulses (520 nm, 10 mW). Moreover, the decay feature is independent of the light pulse power, as shown in the inset of Fig. 3(b); thus, the persistent photoconductivity effect allows the nonlinear decay process to encode and memorize the sequence of the input optical signal. As shown in Fig. 3(c), the integration of different laser pulse inputs allows the projection of optical information from multiple frames into a single frame with varying degrees of attenuation. This technique effectively reduces the imaging frequency through the use of a readout integrated circuit (ROIC).


    Figure 3. (Color online) (a) I−V characterization of the retinomorphic MoS2 photodetector on a logarithmic scale. The inset shows the I−V curve on a linear scale. (b) The persistent photoconductivity effects observed in the retinomorphic MoS2 photodetector illuminated under laser pulses (520 nm, 10 mW). Pink rectangle: light on; blue rectangle: light off. (Inset: photocurrent of the Au/MoS2/Au device measured under illumination by light pulses of different powers (λ = 520 nm; 3, 5, 10, 12, and 14 mW laser power).) (c) 3-bit light pulse inputs ranging from "000" to "111", each with a pulse width of 100 ms and an interval of 900 ms. (d) The resultant normalized photocurrent characteristics, including input−output feature extraction, analyzed using the retinomorphic MoS2 photodetector.
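    The pulse-accumulation-and-decay behavior described above can be captured by a minimal leaky-accumulation model. This is a sketch with assumed constants: the real MoS2 relaxation is more complex than a single exponential, and TAU and GAIN below are illustrative values, not measured ones.

```python
import math

TAU = 2.0    # illustrative decay time constant (arbitrary units, not measured)
GAIN = 1.0   # illustrative photocurrent increment per light pulse


def final_state(pulse_pattern, frame_dt=1.0):
    """Residual photocurrent after presenting a frame sequence.

    pulse_pattern: string such as "101"; '1' means a light pulse in that frame.
    """
    current = 0.0
    for bit in pulse_pattern:
        current *= math.exp(-frame_dt / TAU)  # persistent photocurrent decays between frames
        if bit == "1":
            current += GAIN                   # a pulse adds current on top of the residue
    return current
```

    Because each pulse's contribution shrinks by a fixed factor per elapsed frame, later pulses leave a larger residue: in this model `final_state("011")` exceeds `final_state("110")`, so the final state reflects pulse order as well as pulse count.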

    Persistent photocurrent effect for encoding temporal information

    Three continuous frames of light pulses were irradiated onto the detector at a fixed frequency (10 Hz in our experiment), as shown in Fig. 3(c). Photocurrent is generated and decays over time; we recorded the final photocurrent states after three frames (Fig. 3(d)) and found the expected differences in the residual current. In general, a larger number of received light pulses results in a higher current level, and when the number of pulses is unchanged, the order of the pulses also determines the final current. These results clearly prove that time-dependent input signals can be accurately identified using only the data of the final frame.
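    Under the same toy leaky-accumulation model (illustrative exponential decay and unit pulse gain, not fitted to the device), all eight 3-bit pulse sequences map to distinct final states, mirroring the separability observed in Fig. 3(d):

```python
import math
from itertools import product


def final_state(pattern, tau=2.0, dt=1.0):
    # Leaky accumulation: decay between frames, unit gain per pulse (toy values).
    current = 0.0
    for bit in pattern:
        current = current * math.exp(-dt / tau) + (1.0 if bit == "1" else 0.0)
    return current


codes = ["".join(bits) for bits in product("01", repeat=3)]
states = {c: final_state(c) for c in codes}

# Every 3-bit input history yields a unique residual current, so reading out
# only the final frame still identifies the full temporal sequence.
assert len({round(v, 6) for v in states.values()}) == 8
```

    In this toy model "111" gives the largest residue and "000" gives zero, with the six remaining codes spread between them according to both pulse count and pulse timing.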

    Dynamic motion recognition task

    Based on the optoelectronic characteristics above, we further simulated an 8 × 8 detector array to demonstrate the feasibility of our in-sensor computing system for dynamic motion recognition, as shown in Fig. 4(a). We applied a reservoir computing (RC) network for propagation. Typically, an RC network is structured with an input layer, a reservoir layer, and a readout layer[12]. Only the readout layer requires training, while the randomness inherent in the reservoir layer of the RC network offers a promising strategy to overcome the challenges posed by the heterogeneity of 2D material devices[25]. With the photoresponsivity and nonlinear decay feature of our detectors, the photodetector array serves as a hardware foundation to emulate the functionality of a reservoir, the recurrent part of the whole network. It offers a feature map, projected from the input and reservoir nodes, for the readout layer. The light and the photoresponsivity act as the input, and the decay feature acts as the iterative formula of the nodes in the reservoir. Traditionally, it is impossible to determine dynamic motion direction from a single frame due to the absence of temporal information. However, the final state of the reservoir nodes becomes an informative frame, encapsulating spatial-temporal information from multiple prior frames. Fig. 4(b) shows four heatmaps of the photocurrent of all devices after each frame; the darker a pixel is, the later a light pulse arrived at it. This means that only the final map is needed for training and propagation into the readout layer, which significantly reduces the amount of data required for computation without missing any information from previous frames.


    Figure 4. (Color online) (a) Schematic of the proposed task: the target can move in two directions (clockwise/anticlockwise), and light pulses irradiated onto the detector array (8 × 8 pixels in our simulation) represent the location of the target. (b) Schematic of four heatmaps of the photocurrent of all detectors after every single frame; the green one refers to clockwise and the blue one to anticlockwise. The darker a pixel is, the later a light pulse arrived at it. (c) Evolution of the accuracy rates of the multiframe-integrated RC system and the traditional FC network within 100 epochs.
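    A toy simulation of this array makes the order-encoding point concrete. The decay factor and the four corner positions below are assumptions for illustration; they are not the paper's trajectories. Clockwise and anticlockwise traversals of the very same four pixels produce different final-state maps:

```python
import math

SIZE = 8
DECAY = math.exp(-0.5)   # illustrative per-frame decay factor (not measured)


def final_map(path):
    """8 x 8 residual-photocurrent map after a frame sequence.

    path: list of (row, col) target positions, one per frame.
    """
    grid = [[0.0] * SIZE for _ in range(SIZE)]
    for r, c in path:
        for i in range(SIZE):
            for j in range(SIZE):
                grid[i][j] *= DECAY      # every pixel decays between frames
        grid[r][c] += 1.0                # the illuminated pixel gains photocurrent
    return grid


cw = [(0, 0), (0, 7), (7, 7), (7, 0)]    # hypothetical clockwise corner path
acw = list(reversed(cw))                 # same pixels, opposite visit order
m_cw, m_acw = final_map(cw), final_map(acw)

assert m_cw != m_acw                     # visit order is encoded in the residues
assert m_cw[7][0] > m_cw[0][0]           # later-visited pixels retain more current
```

    Summing binary frames instead would give identical maps for the two directions; it is the graded residues left by the persistent photoconductivity that let a single final frame carry the temporal information.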

    The readout layer contains 64 × 2 weights for classifying the two directions. We used 8000 training samples to train our in-sensor RC system and 4000 test samples for validation. Through simulation, we achieved a recognition accuracy of 93.5% after 100 epochs, surpassing the traditional FC network by 40% (Fig. 4(c)). Our spatial-temporal reservoir system boosts accuracy without expanding the network scale of the readout layer. The FC network simply accumulates the information of all frame images into the last frame, losing the timing information of the car's movement. Since the starting and ending positions are the same in this task, the result of overlaying all frames is almost identical regardless of whether the car moves clockwise or counterclockwise; therefore, the traditional FC network performs poorly on this task.

    Conclusion

    In summary, we successfully applied MoS2 photodetectors to implement in-sensor computing by integrating multiple frames into one via persistent photoconductivity for dynamic motion recognition. The inherent photoconductivity of the MoS2 photodetector allows the embedding of spatial-temporal information into a single frame, effectively reducing redundant data flow and simplifying dynamic visual tasks. Furthermore, unlike in traditional recurrent neural networks, the coupling weights in the reservoir are not trained; they are usually chosen randomly and globally scaled so that the network operates in a certain dynamical regime. When detector devices are applied as the nodes of a reservoir, the responsivity of each device acts as its input coefficient, and the weight of each node does not need to be the same, as long as it remains unchanged. Thus, homogeneity across all devices in a reservoir is not required. The scalability of the two-terminal structure and the tolerance for heterogeneity in 2D material devices facilitate the addition of more devices per pixel, potentially enriching the reservoir states and enhancing network training, leading to improved accuracy with minimal cost. In conclusion, this in-sensor computing system serves as a prototype for highly energy-efficient dynamic machine vision applications.


    Xiaoyong Jiang, Minrui Ye, Yunhai Li, Xiao Fu, Tangxin Li, Qixiao Zhao, Jinjin Wang, Tao Zhang, Jinshui Miao, Zengguang Cheng. Multiframe-integrated, in-sensor computing using persistent photoconductivity[J]. Journal of Semiconductors, 2024, 45(9): 092401

    Paper Information

    Category: Articles

    Received: Apr. 1, 2024

    Accepted: --

    Published Online: Oct. 11, 2024

    The Author Email: Fu Xiao (XFu), Cheng Zengguang (ZGCheng)

    DOI: 10.1088/1674-4926/24040002
