To maximize the potential of the AnDi architecture, several strategies have been proposed. The fine-grained dual-domain mapping divides weight matrices larger than the analogue CIM arrays into smaller blocks, which are then allocated to either digital or analogue cores according to their size. Low-parallelism weight blocks are assigned to digital cores, thereby maximizing the utilization of the digital multiply-accumulate (MAC) cores[6]. This approach improves system-level energy efficiency by avoiding further splitting of weights into additional blocks within the same layer, which reduces memory-management overhead. The NN feature-enhancing technique addresses the noise-accumulation issue in analogue CIM[7]. Lightweight enhancing layers are integrated into the network and deployed on the noise-free digital cores for inference, significantly improving the signal-to-noise ratio (SNR) of the feature maps. For example, in a seven-layer convolutional NN for CIFAR-10 image classification, the enhancing layers boost both the SNR and the accuracy of the network with minimal extra computational overhead. The dynamic scheduling strategy intelligently assigns tasks to the analogue CIM or digital cores according to accuracy and efficiency requirements, taking full advantage of the AnDi design to further optimize system performance.
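To make the mapping idea concrete, the sketch below mimics how such a fine-grained dual-domain mapping could be expressed in software. It is a minimal illustration under assumed parameters: the analogue array size, the parallelism threshold, and all names (`map_weight_matrix`, `ARRAY_ROWS`, `MIN_PARALLEL_ROWS`) are hypothetical and do not come from the original work[6].

```python
# Hypothetical sketch: tile a weight matrix into analogue-array-sized blocks and
# route low-parallelism blocks to digital MAC cores, the rest to analogue CIM arrays.
import numpy as np

ARRAY_ROWS, ARRAY_COLS = 128, 128   # assumed analogue CIM array size
MIN_PARALLEL_ROWS = 32              # assumed threshold defining a "low-parallel" block


def map_weight_matrix(weights: np.ndarray):
    """Split `weights` into array-sized blocks and assign each block to a domain."""
    placements = []
    rows, cols = weights.shape
    for r in range(0, rows, ARRAY_ROWS):
        for c in range(0, cols, ARRAY_COLS):
            block = weights[r:r + ARRAY_ROWS, c:c + ARRAY_COLS]
            # A block that activates only a few array rows would under-use the
            # analogue crossbar, so it is handled by a digital MAC core instead.
            domain = "digital" if block.shape[0] < MIN_PARALLEL_ROWS else "analogue"
            placements.append(((r, c), block.shape, domain))
    return placements


if __name__ == "__main__":
    toy_layer = np.random.randn(150, 200)   # larger than a single 128x128 array
    for origin, shape, domain in map_weight_matrix(toy_layer):
        print(f"block at {origin}, shape {shape} -> {domain} core")
```

The single row-count threshold here stands in for whatever utilization metric the real mapper applies; the point is simply that allocation is decided block by block rather than layer by layer.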
With the AnDi architecture, a memristor hardware system has been developed that combines analogue CIM arrays with a soft-core central processing unit (CPU) for hardware-implemented you-only-look-once (YOLO) inference. The performance of the system has been evaluated with an adaptive self-driving car and an object-detection NN. In the adaptive self-driving car, the AnDi architecture adaptively assigns tasks to the digital or analogue CIM cores depending on the road conditions, demonstrating a high degree of tunability between accuracy and efficiency. The system achieves a maximum pass rate of 99.9% and improves efficiency by 1.4 times, highlighting the flexibility of the architecture. For the YOLO object-detection task, the mean average precision (mAP@50) of the AnDi system increases from 0.27 to 0.724 after the feature-enhancing technique is applied. The system handles the wide dynamic range of the YOLO feature maps through FP calculations with the DDFP processor, and it shows a 23% improvement in system-level energy efficiency compared with a purely analogue CIM system.
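For intuition only, a toy version of such condition-dependent scheduling is sketched below; the road-condition labels, the accuracy numbers, and the names (`choose_core`, `ACCURACY_DEMAND`) are illustrative assumptions, not the control logic of the reported system[5].

```python
# Illustrative sketch: pick the efficient analogue CIM core when its accuracy
# suffices for the current scene, fall back to the digital core otherwise.
ACCURACY_DEMAND = {"straight": 0.90, "curve": 0.97, "intersection": 0.99}  # assumed
ANALOGUE_ACCURACY = 0.95                                                   # assumed


def choose_core(road_condition: str) -> str:
    """Return which core should run the next inference for this scene."""
    demand = ACCURACY_DEMAND.get(road_condition, max(ACCURACY_DEMAND.values()))
    return "analogue" if ANALOGUE_ACCURACY >= demand else "digital"


for scene in ("straight", "curve", "intersection"):
    print(scene, "->", choose_core(scene))
```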
The AnDi architecture marks a significant advancement in CIM systems, successfully tackling the challenges that analogue CIM faces in managing complex regression tasks while improving energy efficiency and accuracy. Notably, the architecture maintains high accuracy without traditional compensation techniques and achieves high parallelism and weight-storage density. Future studies should aim to further optimize the DDFP processor for lower energy consumption and higher performance. In particular, its quantization and dequantization algorithms deserve in-depth study: although the DDFP already achieves efficient computation, adopting more precise quantization could minimize precision loss during data conversion and thus enhance overall computational performance. From the circuit-design perspective, advanced low-power techniques can further reduce the energy consumption of the processor. In terms of hardware implementation, novel semiconductor materials and manufacturing processes can further decrease chip area and enhance integration density. Additionally, exploring collaborative modes between the DDFP processor and other specialized computing units can maximize their respective advantages, simultaneously ensuring computational accuracy and enhancing computational efficiency, and providing stronger support for complex NN inference. Exploring the scalability of the AnDi architecture to larger-scale NNs is another crucial direction. Furthermore, integrating the AnDi architecture with emerging memory technologies, such as resistive random-access memory[8], phase-change memory[9], and ferroelectric memory[10], could lead to significantly more efficient computing systems. In summary, the AnDi architecture lays the groundwork for the future development of high-performance computing systems for NN inference.
A novel analogue-digital unified CIM architecture (named AnDi) has been developed by integrating analogue and digital computing cores, aiming to address the above challenges of analogue CIM (Fig. 1)[5]. The key component is the dual-domain floating-point (DDFP) processor, built around a universal data format that represents both FP and integer (INT) numbers. This processor enables FP compatibility regardless of the native capabilities of the analogue CIM or digital cores. The DDFP data structure consists of an INT tensor for the feature map and an FP scale. The DDFP processor manages the data flow between the analogue and digital domains, performing the quantization and dequantization operations. This effectively decouples the NN algorithm from the underlying hardware, enabling more general NN computation: the top-level algorithm can be trained without considering the specific hardware architecture, as all data-scaling operations are handled at the hardware level.
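To make this data flow concrete, the following Python sketch models a DDFP-style container as described above: an INT tensor plus a single FP scale, with quantization before the integer MAC and dequantization afterwards. The class name `DDFPTensor`, the 8-bit width, and the symmetric rounding scheme are assumptions chosen for illustration, not the actual DDFP implementation[5].

```python
# Minimal sketch of a DDFP-style container: an INT tensor for the feature map
# plus an FP scale, with quantize/dequantize at the analogue/digital boundary.
from dataclasses import dataclass
import numpy as np

INT_BITS = 8  # assumed integer width of the MAC path


@dataclass
class DDFPTensor:
    int_tensor: np.ndarray   # integer feature map (what the MAC cores operate on)
    scale: float             # FP scale that restores the real value range

    @classmethod
    def quantize(cls, fp_tensor: np.ndarray) -> "DDFPTensor":
        """FP -> (INT tensor, FP scale) before dispatch to a computing core."""
        qmax = 2 ** (INT_BITS - 1) - 1
        scale = float(np.max(np.abs(fp_tensor))) / qmax or 1.0  # avoid a zero scale
        ints = np.clip(np.round(fp_tensor / scale), -qmax, qmax).astype(np.int8)
        return cls(ints, scale)

    def dequantize(self) -> np.ndarray:
        """(INT tensor, FP scale) -> FP after the MAC results are gathered."""
        return self.int_tensor.astype(np.float32) * self.scale


if __name__ == "__main__":
    x = np.random.randn(4, 4).astype(np.float32)
    ddfp = DDFPTensor.quantize(x)
    print("max reconstruction error:", float(np.abs(ddfp.dequantize() - x).max()))
```

Because the scale travels as a single FP number alongside the integer tensor, the heavy MAC work can run on cores without native FP support, which is what lets the same representation move freely between the analogue and digital domains.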
Developing efficient neural network (NN) computing systems is crucial in the era of artificial intelligence (AI). Traditional von Neumann architectures face both the "memory wall" and the "power wall", as frequent data transfer between memory and processing units limits performance and energy efficiency[1, 2]. Compute-in-memory (CIM) technologies, particularly analogue CIM based on memristor crossbars, are promising for NN computation because of their high energy efficiency, computational parallelism, and integration density[3]. In practical applications, analogue CIM excels in tasks such as speech recognition and image classification, revealing its unique advantages. For instance, it efficiently processes vast amounts of audio data in speech recognition, achieving high accuracy with minimal power consumption; in image classification, its high parallelism significantly speeds up feature extraction and reduces processing time. With the rapid development of AI applications, the demands on computational accuracy and task complexity keep rising. However, analogue CIM systems are limited in handling complex regression tasks that require precise floating-point (FP) calculations; they are primarily suited to classification tasks with low data precision and a limited dynamic range[4].
Figure 1. (Color online) The AnDi architecture based on a DDFP processor is designed to seamlessly integrate data from both the analogue and digital domains, enabling FP computation[5]. With this architecture, the hybrid mapper can dynamically allocate matrix-multiplication tasks across these domains with fine-grained control at the single-row and single-column level, tailored to the specific requirements of the task. The system outperforms a purely analogue CIM system in terms of floating-point compatibility, precision, and the efficient use of spatial array resources.