Journal of Semiconductors, Volume 45, Issue 1, 012301 (2024)

Optimized operation scheme of flash-memory-based neural network online training with ultra-high endurance

Yang Feng1, Zhaohui Sun1, Yueran Qi1, Xuepeng Zhan1, Junyu Zhang2, Jing Liu3, Masaharu Kobayashi4, Jixuan Wu1,*, and Jiezhi Chen1,**
Author Affiliations
  • 1School of Information Science and Engineering (ISE), Shandong University, Qingdao 266200, China
  • 2Neumem Co., Ltd, Hefei 230093, China
  • 3Key Laboratory of Microelectronic Devices and Integrated Technology, Institute of Microelectronics of Chinese Academy of Sciences, Beijing 100084, China
  • 4Institute of Industrial Science, The University of Tokyo, Tokyo, Japan

    With the rapid development of machine learning, the demand for highly efficient computing has become increasingly urgent. To break the bottleneck of the traditional Von Neumann architecture, computing-in-memory (CIM) has attracted increasing attention in recent years. In this work, to provide a feasible CIM solution for large-scale neural networks (NNs) that require continuous weight updating during online training, a flash-based CIM with high endurance (10⁹ cycles) and ultra-fast programming speed is investigated. On the one hand, the proposed programming scheme combining channel hot electron injection (CHEI) and hot hole injection (HHI) demonstrates high linearity and symmetric potentiation and depression processes, which help to improve training speed and accuracy. On the other hand, the low-damage programming scheme and memory window (MW) optimizations suppress cell degradation effectively and improve computing accuracy. Even after 10⁹ cycles, the leakage current (Ioff) of the cells remains below 10 pA, ensuring the large-scale computing capability of the memory. Further characterizations of read disturb demonstrate robust reliability. On CIFAR-10 tasks, ~90% accuracy can be achieved after 10⁹ cycles with both ResNet50 and VGG16 NNs. Our results suggest that flash-based CIM has great potential to overcome the limitations of traditional Von Neumann architectures and enable high-performance NN online training, paving the way for further development of artificial intelligence (AI) accelerators.

    1. Introduction

    To address the concerns of energy consumption and latency caused by frequent data shuttling, computing-in-memory (CIM) applications in neural networks (NNs) have attracted much attention. Previous works have demonstrated CIM-based NN inference[1−3], but it is still challenging to implement online training due to strict requirements on both performance (speed, power, array size, etc.) and reliability (endurance, stability, etc.). Though NNs using emerging memories, such as resistive random-access memory (RRAM)[4] and phase change memory (PCM)[5], have demonstrated great performance on CIM inference, memory performance for online training applications needs to be further improved. Owing to excellent electrical performance in terms of endurance and programming speed, RRAM has emerged as one of the most promising candidates for the synapse of NN online training[6, 7]; however, the reliability and non-linearity in large arrays still need further optimization. Flash-based CIM provides a more feasible and reliable solution because of its mature technology, ultra-high bit density, and capability to construct large arrays for matrix operations. So far, CIM architectures have been applied in NNs with software-combined offline training, while NN online training imposes strict requirements on the endurance and programming speed of CIM hardware. Thus, to further explore flash-based CIM as an online training NN accelerator, it is essential to overcome the obstacles of endurance and programming speed in flash cells.

    2. Background

    2.1. Flash-based CIM architecture

    Fig. 1 illustrates the architecture of the flash memory and the CIM scheme adopted in this work. Matrix-vector multiplication (MVM) is implemented with the matrix represented by the drain currents Id, which can be tuned through the threshold voltage Vth, and the vectors represented by the pulse widths of the gate voltage Vg. First, the matrix and vectors are mapped to the corresponding electrical parameters of the device array: the matrix is stored as the Vth states of the array, and the vectors are converted to pulse widths. The amount of accumulated charge then represents the output of the MVM, which can be described as Q = I·t. The great reliability and mature technology of flash memory ensure the computing accuracy of large-scale MVM.
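
    To make the mapping concrete, the following minimal Python sketch emulates this charge-domain MVM with ideal cells; the full-scale current and pulse-width values are hypothetical placeholders, not measured device parameters.

```python
import numpy as np

# Minimal sketch of the charge-domain MVM with ideal cells: weights map
# to cell currents, inputs map to Vg pulse widths, and each bitline
# accumulates Q = I * t. The full-scale values below are hypothetical.
I_MAX = 1e-6    # assumed full-scale cell current (1 uA)
T_MAX = 1e-7    # assumed full-scale pulse width (100 ns)

def charge_domain_mvm(W, x):
    """Approximate y = W @ x by accumulating charge per bitline."""
    I = W * I_MAX               # matrix stored as cell currents (A)
    t = x * T_MAX               # vector applied as pulse widths (s)
    Q = I @ t                   # charge collected on each bitline (C)
    return Q / (I_MAX * T_MAX)  # normalize back to numeric units

rng = np.random.default_rng(0)
W, x = rng.random((4, 8)), rng.random(8)
print(np.allclose(charge_domain_mvm(W, x), W @ x))  # True for ideal cells
```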

    Figure 1.(Color online) Schematics of the flash-based CIM architecture. The pulse time of Vg and the threshold voltage are mapped as the vector and the matrix, respectively; the amount of accumulated charge then represents the result of the MVM.

    2.2. Flash operation by CHEI and HHI

    The CHEI programming scheme and the HHI erasing scheme are adopted in this work. The separate word line (WL) and bit line (BL) of NOR flash allow single-cell selectivity and individual programming operations with CHEI. Different from the traditional Fowler−Nordheim (FN) tunneling operation, HHI can tune cells individually, benefitting from the independent WL and BL[8]. In addition, the applied voltage is much lower than that of FN tunneling. The scheme and energy band diagrams of CHEI and HHI are shown in Fig. 2. In the CHEI model, energetic electrons gain energy primarily from the lateral channel field (set by the applied drain bias Vd) to overcome the Si−SiO2 potential barrier and are injected into the floating gate, increasing Vth. The HHI scheme applies a positive voltage on the BL and a negative voltage on the WL, so that band-to-band tunneling (BTBT) occurs at the cell drain junction. Electrons generated by this process are collected by the drain contact. Under the large bias between the p-well and the drain junction, holes flowing towards the p-well can be injected into the floating gate by the high negative gate voltage. In this process, the injected holes recombine with electrons originally stored in the floating gate, thereby reducing the cell Vth.

    Figure 2.(Color online) (a) Schematic of adopted CHEI and HHI programming scheme. (b) The energy band diagram of CHEI and HHI programming scheme.

    2.3. ResNet50 and VGG16 neural networks

    We choose the representative ResNet50 and VGG16 neural networks to test the performance of the proposed online training architecture. The system frame diagrams of the two architectures are shown in Fig. 3. ResNet50 is a convolutional neural network architecture proposed to simplify the training of deeper networks and improve their speed and accuracy. Its key innovation is the introduction of residual blocks, which allow the network to learn identity mappings and avoid the degradation problem caused by adding more layers. ResNet50 is representative of the ResNet family and has achieved state-of-the-art performance on various image recognition tasks. VGG16 is another popular convolutional neural network architecture that also focuses on increasing network depth. One of its key improvements over previous architectures such as AlexNet is the use of smaller 3 × 3 convolutional filters instead of larger ones. The VGG16 model has achieved widespread adoption in multiple domains due to its ability to approximate complex functions while maintaining a manageable number of parameters, and it has also been shown to achieve high accuracy on image recognition tasks.
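
    As a reference for the residual-block idea described above, here is a minimal PyTorch-style sketch; the channel count and layer arrangement are illustrative, not the exact ResNet50 configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, so the layers only have
    to learn the residual F, easing the training of very deep networks."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut connection

x = torch.randn(1, 64, 32, 32)     # a CIFAR-10-sized feature map
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```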

    Figure 3.(Color online) The architecture of (a) ResNet50 and (b) VGG16 convolutional neural networks.

    3. Reliability analysis and discussion

    A 55-nm NOR flash array is used to construct the CIM matrix for large-scale online training NNs, wherein CHEI and HHI[9] are adopted for ultra-fast threshold voltage (Vth) tuning for weight updating (Fig. 4). In the conventional operation scheme for storage, the MW has to be large to keep the error bits as low as possible. For NN applications, which have a stronger tolerance to noise, the MW can instead be optimized to achieve faster operation and to suppress cell degradation with fewer pulses and lower programming biases.

    Figure 4.(Color online) (a) The proposed scheme to improve both endurance and speed by optimizing the operation scheme for NN online training. (b) The comparison of the Vth tuning speed of FN tunneling and the HHI. (c) The high linearity and symmetric potentiation and depression process using the CHEI and the HHI combined methods.

    Compared with the 7.7 V substrate voltage and −8 to −9.5 V gate voltage used in the FN tunneling operation, the 0 V substrate voltage of HHI makes the design of the peripheral circuit much simpler. More importantly, it is found that pulses as short as 10 ns can be adopted for Vth tuning. HHI achieves erasing 10⁴ times faster than FN tunneling, as shown in Fig. 4(b). This fast programming speed is essential for the frequent, fast weight updates of NN online training. Furthermore, it is necessary to ensure the linearity of the conductance−pulse curve[10] for online training applications. The proposed combined CHEI and HHI method shows high linearity in Fig. 4(c), as well as a symmetric potentiation and depression process during the short 10 ns programming and erasing pulses.
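
    The following sketch illustrates why update linearity matters, using a commonly assumed conductance-update model; the exponential nonlinearity factor and pulse counts are illustrative assumptions, not fitted to the measured cells. With a linear, symmetric device, equal numbers of potentiation and depression pulses cancel exactly.

```python
import numpy as np

# Illustrative pulse-based conductance-update model (assumed for this
# sketch, not extracted from the measured devices). nl = 0 gives the
# perfectly linear, symmetric update that Fig. 4(c) targets.
G_MIN, G_MAX, N_PULSES = 0.0, 1.0, 64
STEP = (G_MAX - G_MIN) / N_PULSES

def apply_pulse(g, direction, nl=0.0):
    """Apply one potentiation (direction=+1) or depression (-1) pulse."""
    frac = (g - G_MIN) if direction > 0 else (G_MAX - g)
    dg = STEP * np.exp(-nl * frac / (G_MAX - G_MIN))  # step shrinks near the rail when nl > 0
    return float(np.clip(g + direction * dg, G_MIN, G_MAX))

# With a linear, symmetric device, equal numbers of up and down pulses
# cancel exactly, so weight updates during training are history-free.
g = 0.5
for _ in range(32):
    g = apply_pulse(g, +1)
for _ in range(32):
    g = apply_pulse(g, -1)
print(abs(g - 0.5) < 1e-12)  # True for nl = 0
```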

    In addition to adjustments to the programming and erasing scheme, the trade-off between MW and endurance is investigated in this work for CIM applications, especially for NN online training. Impressively, by adopting the combined CHEI−HHI programming scheme and lowering the MW, flash cells can achieve record-high endurance, exceeding 10⁹ cycles with 0.2−0.5 V MW operations, which is sufficient for 1-bit/cell and even 2-bit/cell operations in CIM applications. In Fig. 5(a), the I−V curves demonstrate that even after 10⁹ cycles, the current remains stable enough for computing. Fig. 5(b) shows the balance between MW and endurance. Although the test results show that endurance decreases as the MW increases, it should be noted that even 4-bit/cell operation (1 V MW) can achieve 10⁷ cycles of endurance with the proposed programming scheme, which is sufficient for various large-scale NNs. However, many devices, including flash memory, exhibit state-dependent programming variation, and multi-bit flash memories additionally suffer from nonlinear behaviors[11]. Therefore, compared with the larger MW of the multi-level-cell (MLC) operation mode, the small MW does not degrade the prediction accuracy as long as it is sufficient for precise one-bit programming in the SLC operation mode.
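
    To make the MW versus bits-per-cell trade-off concrete, the sketch below computes the Vth level spacing implied by the MW/bit pairings quoted above and quantizes weights onto the corresponding levels; the uniform level spacing is an assumption for illustration.

```python
import numpy as np

# Hypothetical illustration of the MW vs. bits-per-cell relation: storing
# `bits` per cell requires 2**bits Vth levels inside the memory window,
# so the spacing (noise margin) between adjacent states shrinks.
def level_spacing(mw_volts, bits):
    """Vth separation between adjacent states for `bits`/cell in `mw_volts`."""
    return mw_volts / (2**bits - 1)

def quantize_weights(w, bits):
    """Map weights in [0, 1] onto the 2**bits discrete levels of one cell."""
    n_levels = 2**bits
    return np.round(np.asarray(w) * (n_levels - 1)) / (n_levels - 1)

# The MW / bits-per-cell pairings quoted in the text:
for mw, bits in ((0.2, 1), (0.5, 2), (1.0, 4)):
    print(f"MW = {mw:.1f} V, {bits} bit/cell -> "
          f"spacing = {level_spacing(mw, bits) * 1e3:.0f} mV")

print(quantize_weights([0.03, 0.48, 0.91], bits=2))  # [0. 0.33333333 1.]
```

    Smaller MWs hold fewer levels per cell but, per Fig. 5(b), yield far higher endurance, which is the trade this work exploits for online training.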

    After cycling, the subthreshold swing (SS) and the leakage current (Ioff) are also tested, since on/off ratio degradation would deteriorate computing accuracy. Although both the traditional FN tunneling and HHI increase the SS value, a larger SS is actually conducive to precise device programming for weight updating in CIM applications, because larger SS values result in larger memory windows between adjacent programming states. Therefore, only the degradation of Ioff significantly impacts the computational performance, through the deteriorated on/off ratio. Different from FN tunneling erasing, which causes serious Ioff degradation, HHI erasing keeps Ioff suppressed below 10 pA even after 10⁹ cycles (Fig. 5(d)). This can be understood because FN tunneling seriously degrades both the gate dielectric and the interface, while HHI mainly degrades the interface[12].

    This can be observed from the statistical data of SS and Ioff in Fig. 5. After 10⁹ cycles, the subthreshold region of the cells, with larger SS and sub-10 pA Ioff, allows a larger Vth tuning range and better immunity to fluctuations, which can be utilized to construct efficient large-scale online training NNs.
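
    A rough back-of-the-envelope check (all array-level values below are hypothetical) shows why sub-10 pA leakage matters at array scale: the summed Ioff of all unselected cells on a bitline must remain small compared with one signal-current step.

```python
# Hypothetical back-of-the-envelope check of why sub-10 pA Ioff matters:
# every unselected cell on a shared bitline leaks into the same summing
# node as the selected cells' signal current.
N_CELLS_PER_BL = 1024    # assumed number of cells sharing one bitline
I_OFF = 10e-12           # per-cell leakage after cycling, 10 pA
I_LSB = 100e-9           # assumed read current of one weight step, 100 nA

leak = N_CELLS_PER_BL * I_OFF
print(f"summed leakage = {leak * 1e9:.2f} nA = {leak / I_LSB:.1%} of one signal step")
# ~10% of one LSB across 1024 cells is tolerable for noise-robust NNs;
# nA-level per-cell leakage (as after FN-tunneling cycling) would not be.
```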

    4. Performance in neural networks

    The comparison between different programming schemes is then analyzed. For CIM applications, the test results of the 55 nm flash memory in Fig. 6(a) show that the traditional FN tunneling scheme is much slower and requires a higher tuning voltage than the proposed HHI scheme. Besides, Ioff after P/E cycles increases significantly under FN tunneling, which considerably degrades the on/off ratio of the device and ultimately affects the CIM accuracy.

    Figure 5.(Color online) (a) The I–V curves of the programmed/erased state before and after 10⁹ cycles. (b) Enhancements of endurance at lower MW, showing the trade-off between MW and endurance. (c) SS value and (d) Ioff for different MWs and cycle counts compared with the traditional programming scheme, wherein each box contains 15 different memory cells.

    Figure 6.(Color online) (a) Comparisons between the proposed scheme and the traditional scheme. (b) Read disturbance of different states after 10⁹ cycles. (c) Applications in CIFAR-10 using ResNet50 and VGG16. Even after 10⁹ cycles, ~90% accuracy can be achieved for the CIFAR-10 task.

      Table 1. The benchmark of this work and various non-volatile CIM devices.

      Ref | Cell type | On/off ratio | Pgm. speed | Endurance | DR (s)
      [8] | PCM | 10⁴ | − | 10⁸ | 10⁶
      [13] | RRAM | 10³ | 1 μs | 10⁴ | −
      [14] | FeFET | 10⁵ | 300 ns | 10⁵ | 10⁴
      [15] | 3D flash | 10⁵ | − | 10⁵ | 10⁵
      [16] | Flash | 10² | 10 μs | 10⁵ | −
      This work (optimized operation) | Flash | 10⁶ (before cycles) | 10 ns | 10⁹ | 10⁵

    The read disturb (RD) characteristic is then tested to verify the robust reliability of the flash cells. Well-controlled RD lasting for 1 ks can be observed in Fig. 6(b). After 10⁹ cycles, though the read current decreases to the nA level as a result of the increased SS, the RD remains quite stable. The long-term read disturb characteristics have a negligible impact on computing accuracy with respect to the cells' current fluctuations during NN online training, and the short-term current fluctuations will not degrade the accuracy of neural network calculations.

    To further evaluate the performance of the compact flash-based NN system, a chip tester is designed to characterize flash-based CIMs, supporting fast programming as well as matrix-vector multiplication (MVM) operations. To demonstrate the feasibility of flash CIM, the standard ResNet50 and VGG16 convolutional neural networks (CNNs) are implemented for image classification on the CIFAR-10 dataset with 10 object classes, as shown in Fig. 6(c). Simulations with the different NNs demonstrate the high accuracy of the proposed system. It should be noted that short-term RD is an important factor because online training requires continuous, ultra-fast weight updating. Impressively, over 90% recognition accuracy has been achieved after 10⁹ cycles with the proposed flash CIM. Comparisons with other related works are summarized in Table 1. The results indicate that the proposed programming scheme for NOR flash demonstrates superior characteristics in terms of on/off ratio, programming speed, endurance, and data retention (DR) compared to other emerging memories.
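
    As an illustration of how such an evaluation can be set up in software, the sketch below perturbs network weights with Gaussian noise before inference to mimic post-cycling Vth fluctuations; the noise model and its magnitude are assumptions, not the paper's measured distributions, and `cifar10_test_loader` is a hypothetical data loader.

```python
import copy
import torch

def evaluate_with_device_noise(model, loader, sigma=0.02, device="cpu"):
    """Estimate CIM accuracy by perturbing each weight with Gaussian noise
    (std = sigma * weight range), mimicking post-cycling Vth fluctuations."""
    noisy = copy.deepcopy(model).to(device).eval()
    with torch.no_grad():
        for p in noisy.parameters():
            span = p.max() - p.min()                  # per-tensor weight range
            p.add_(torch.randn_like(p) * sigma * span)
        correct = total = 0
        for images, labels in loader:
            preds = noisy(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.numel()
    return correct / total

# e.g. evaluate_with_device_noise(resnet50, cifar10_test_loader, sigma=0.02)
```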

    5. Conclusion

    This work shows the potential of flash-based computing-in-memory (CIM) devices to achieve high endurance (10⁹ cycles) and ultra-fast programming speed (10 ns) through the combined CHEI−HHI programming scheme and MW optimizations. Utilizing this optimized operation scheme, a compact flash-based NN online training CIM system is proposed. Our results demonstrate that even after 10⁹ cycles, a high accuracy of approximately 90% can be attained on CIFAR-10 tasks. Further characterizations of read disturb evidence robust reliability, highlighting the significant potential of the scheme for online training of actual NN tasks. This work provides a comprehensive assessment of a flash-based online training network, with potential implications for the advancement of AI accelerators.

    [2] W S Khwa, K Akarvardar, Y S Chen et al. MLC PCM techniques to improve neural network inference retention time by 10⁵X and reduce accuracy degradation by 10.8X. Proc IEEE Symp VLSI Technol, 1 (2020).

    [3] W Y Zhang, S C Wang, Y Li et al. Few-shot graph learning with robust and energy-efficient memory-augmented graph neural network (MAGNN) based on homogeneous computing-in-memory. 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 224 (2022).

    [8] T Ravsher, D Garbin, A Fantini et al. Enhanced performance and low-power capability of SiGeAsSe−GeSbTe 1S1R phase-change memory operated in bipolar mode. 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 312 (2022).

    [13] W Choi, M Kwak, S Heo et al. Hardware neural network using hybrid synapses via transfer learning: WOx nano-resistors and TiOx RRAM synapse for energy-efficient edge-AI sensor. 2021 IEEE International Electron Devices Meeting (IEDM), 23.1.1 (2021).

    Paper Information

    Category: Articles

    Received: Jul. 14, 2023

    Published Online: Mar. 13, 2024

    DOI: 10.1088/1674-4926/45/1/012301
