Scalable parallel photonic processing unit for various neural network accelerations

Shiyin Du; Jun Zhang; Hao Ouyang; Zilong Tao; Qiuquan Yan; Hao Hao; Junhu Zhou; Jie You; Yuhua Tang; Tian Jiang

doi:10.1364/PRJ.527940

1. INTRODUCTION

With the development of neural network applications, large-scale neural networks such as ChatGPT have become a trend in artificial intelligence [1,2]. The popularity of such models has given rise to a surge in matrix operations and nonlinear computations. The amount of computation required by neural networks doubles every 100 days, which is expected to increase by more than a million times in the next five years [3]. As Moore’s law slows down, it becomes challenging for traditional computing hardware to keep up with such rapid growth in computing demand [4]. Thanks to the low-loss, large bandwidth, and high parallelism properties of light, optical neural network processing units with ultra-fast computing speeds and ultra-low power overheads have made great progress in recent years [5,6]. Many excellent optical integrated architectures have been proposed and widely applied in computer vision [7 –10], natural language processing [11], medicine [12], and other applications, as they show excellent potential to improve energy consumption and computing speed by at least two orders of magnitude compared to electrical computing units [13].

Being viewed as the next generation of neuromorphic processors, the optical neural networks on chip contain different types of photonic architectures [6], including plane light conversion (PLC) [13 –17], Mach-Zehnder interferometers (MZIs) [11,18 –23], and wavelength division multiplexing (WDM) [10,24 –31]. These solutions can optically implement vector-matrix multiplication, convolution, and other linear operations, greatly improving the inference efficiency of neural networks. However, limited by optical integration technology and the performance of on-chip components, these excellent architectures still face remarkable challenges in practicality and versatility. To address these issues, many researchers are currently working towards large-scale photonic integration [32,33] and on-chip component optimization [34 –37]. Thanks to these groundbreaking efforts, silicon photonic chips have made rapid progress in recent years. Meanwhile, specially designed photonic architecture is also a prerequisite to maximize the utility of crucial components and minimize processing difficulty under hardware constraints. Even if the scale of on-chip components is fixed, the photonic architecture should have parallel processing capabilities to adapt to dynamically changing data sizes, preventing valuable on-chip components from being idle. Additionally, the photonic architecture is expected to own multitasking capability for different scenarios. Unfortunately, the existing research mainly focuses on a specific neural network structure, such as fully connected layers and convolutional layers, causing significant limitations in scenarios demanding different neural network structures. Such structure-specific architecture design is obviously not conducive to maximizing the utility of on-chip components. In terms of minimizing processing difficulty, optical architecture design should consider device failures or performance substandard caused by the yield of chip processing, improving process compatibility.

In this work, we propose and demonstrate a scalable parallel photonic processing unit (SPPU) for various neural network accelerations on a silicon photonic platform exploiting high-speed phase modulation to improve the versatility and practicality of optical neural network inference accelerators. Specifically, our photonic tensor processor contains eight optically independent computing channels, which can be grouped freely to process different computing tasks in parallel (e.g., operations with two sets of length-4 vectors or four sets of length-2 vectors), with each channel link consisting of two phase modulators (PMs), two microring resonators (MRRs), and a balanced photodetector (BPD). Two on-chip silicon PMs can simultaneously load two input data to support special network structures such as residual layers, pooling layers, and recurrent layers, or only use one PM to implement basic network structures such as convolutional layers and fully connected layers. Any MRR can be used to load weights, with one used for backup to reduce the loss of weight precision caused by process errors. The output results are derived from BPDs whose current direction visualizes the positive or negative sign. Equally important, the SPPU can realize both multiply-accumulate (MAC) and nonlinear operations required in neural network inference without additional time or space consumption. Comprehensive and multi-task experimental results show that three types of neural networks, namely, convolutional neural network (CNN), three-layer residual CNN, and recurrent neural network (RNN), can run well on our SPPU chip, with the recognition accuracies to be 97.9% of CNN for the MNIST dataset, 92.8% of three-layer residual CNN for eight signal modulation formats (RML201610a) [38], and 83.6% (92.3%) of RNN for the IMDB Movie Reviews dataset [39] (Amazon Fine Food Reviews dataset [40]). Owing to the high integration level and fast data loading speed, the energy efficiency of the photonic-core is significantly improved, reaching 0.83 TOPS/W, which is two to ten times higher than the previous work [9 –14,24]. This work takes an essential and significant step in bringing integrated photonic AI accelerators into the real world.

2. SPPU ARCHITECTURE AND PRINCIPLE

The proposed SPPU chip consisting of eight optically independent computing channels can support neural network inference with multiple structures, whose functionality and conceptual architecture are vividly shown in Figs. 1(a) and 1(b), respectively. As presented in Fig. 1(a), SPPU is able to implement four base operators, namely, $y = \pm w x$ , $y = \pm w (x_{1} + x_{2})$ , $y = \pm w \cos (x)$ , and $y = \pm w \cos (x_{1} + x_{2})$ , to support the inference of CNNs, residual CNNs, and RNNs. These operators allow to perform recognition and classification tasks over a wide range of data, such as handwritten digits, modulated signals, and reviews. From Fig. 1(b), one can see that the high-speed input data are encoded into the optical phase via on-chip silicon PMs in each channel, where a single PM or two PMs can be used for encoding in the same channel. The kernel weights are mapped to heater voltages in MRRs. Here, we employ one MRR to load the weights and another MRR as a backup, in order to minimize the impact of process deviations on system performance. By sampling the RF output signals from BPD, the computation results will be obtained. The eight channels adopt independent designs in the structure, which can avoid system failure caused by the failure of a single channel and support parallel data processing.

Figure 1.Conceptual architecture diagram of SPPU on chip. (a) Basic process of SPPU inference for neural networks with different structures. (b) Schematic illustration of SPPU chip. This chip includes eight independent computational channels, with each channel consisting of an edge coupler (EC), two phase modulators (PMs), two microring resonators (MRRs), a balanced photodetector (BPD), and two grating couplers (GCs).

Download full size

View all figures

Within a single channel, the PM is assumed to load a pulse amplitude modulation (PAM) digital signal, with the caused optical phase shift in a single symbol denoted as $φ (x, t)$ . At the same time, an alternative PAM digital signal is loaded on the heating electrodes of MRR, producing a weight change known as $w (t)$ . After coherent detection, the output of BPD is mathematically expressed as $I (t) = 4 A_{1} A_{2} w_{1} (t) w_{2} (t) \sin (φ_{1} (x_{1}, t) - φ_{2} (x_{2}, t) + Δ φ),$ (1)where $A_{1}$ and $A_{2}$ are the amplitudes of the upper and lower optical carriers, respectively, and $Δ φ$ represents the phase difference generated by the upper and lower optical transmissions. According to the signal loading methods, several different working principles are introduced as follows.

To begin with, we consider the case of signals loading with a single PM, which allows the simplification of Eq. (1): $I (t) = A w (t) \sin (φ_{1} (x_{1}, t) + Δ φ),$ (2)where $A$ represents the amplitude of the optical carrier. We introduce a bias in the PM that ensures $Δ φ = 2 k π (k \in Z)$ and approximates $\sin (φ_{1} (x_{1}, t)) = φ_{1} (x_{1}, t)$ , and then the output of this channel is approximated as $I (t) = A w (t) φ_{1} (x_{1}, t)$ . Considering the error control and PM encoding interval, we limit $φ_{1} (x_{1}, t)$ in the range of $[- 0.4, 0.4]$ to ensure that the error is less than 0.01. As a result, an MAC operation of length 8 can be achieved by superimposing the outputs of all channels. Of course, the vector length can also be changed according to the task requirements, and use it for parallel processing of multiple vectors. If the length of MAC operation exceeds eight, we use the divide-and-conquer strategy to break down the longer MAC calculation into a series of computations with a length less than eight, ensuring that the SPPU can flexibly adapt to different lengths of MAC operations. Particularly, we find that the input data $φ_{1} (x_{1}, t)$ is signed in the MAC operation. Besides, if we apply a bias such that $Δ φ = 2 k π + π / 2 (k \in Z)$ , the output will be $I (t) = A w (t) \cos (φ_{1} (x_{1}, t))$ . At this point, the signal superposition from eight channels can realize the MAC operation with cosine activation that is a nonlinear activation for outputs from the previous layer. Lastly, we set a bias to make sure $Δ φ = (2 k + 1) π (k \in Z)$ , which results in $I (t) = - A w (t) φ_{1} (x_{1}, t)$ . To this end, the representation of negative weights is completed. Of course, changing the applied bias can also achieve different linear degrees of activation, thus realizing the representation of negative weights. Since the bias and input data are loaded at the same time, the switching of modes does not affect the computing speed.

Next, we discuss signals loading with two PMs, in which case the BPD output can be described as $I (t) = A w (t) \sin (φ_{1} (x_{1}, t) - φ_{2} (x_{2}, t) + Δ φ) .$ (3)If we load the input data on PM1 and the bias on PM2, the situation will be the same as discussed earlier. In the following section, we will focus on the case where both PMs are loaded with input signals. By inverting the input signal of PM2, the normalized result is $I (t) = w (t) \sin (φ_{1} (x_{1}, t) + φ_{2} (x_{2}, t) + Δ φ) .$ (4)When a bias is applied to ensure $Δ φ = 2 k π (k \in Z)$ , Eq. (4) becomes $I (t) = w (t) (φ_{1} (x_{1}, t) + φ_{2} (x_{2}, t))$ . First, if the weight is set to 0.5, we can use SPPU to perform average pooling in CNN. Since the weights remain unchanged during average pooling, the slow MRR heat adjustment does not affect the computing speed. In addition, we find that in large-scale neural networks, many residual layers are needed to solve the gradient disappearance. Interestingly, the basic operation of the residual layer is $y = w (x_{1} + x_{2})$ , where $x_{1}$ and $x_{2}$ represent inputs from different layers. Naturally, SPPU can be efficiently employed to perform the operations of the residual layer. When the scale of SPPU is increased in the future, it is expected to serve large model inferences like ResNet-101 [41].

Additionally, RNNs with feedback structures are mostly utilized in contextually relevant scenarios to leverage historical information [42]. The basic operation in a simple RNN network is $y_{t} = σ (w x_{t} + u s_{t - 1})$ , where $σ$ represents the activation function, $s_{t - 1}$ means the state of RNN, and $w$ and $u$ are both weights. If we employ the bias such that $Δ φ = 2 k π + π / 2 (k \in Z)$ , the output will be $I (t) = w (t) \cos (φ_{1} (x_{1}, t) + φ_{2} (x_{2}, t))$ . In this case, SPPU can perform simple RNN efficiently, indicating that our proposed chip can be easily extended to more categories of neural networks. One can still control bias to implement negative weights [i.e., $Δ φ = 2 k π - π / 2 (k \in Z)$ ] and activation functions with different degrees of linearity.

To decode the calculation results, we need to sample, quantize, and reshape the output signals from the SPPU by electronic hardware. A high-speed analog-to-digital converter (ADC) is used to convert the analog signal to digital, followed by quantification of the output voltage based on a pre-established mapping table. Finally, we reshape these values to obtain the desired matrix format. Moving on now to consider the switching of computing modes and weight calibration, we can measure the photocurrent intensity from the monitoring PDs to calibrate the weights loaded onto the MRR. As for the switching of computing modes, we can obtain the bias voltage corresponding to different computing modes by scanning the relationship curve between bias voltage and BPD photocurrent intensity. Considering the SPPU state changes slowly with the environment, we will send out calibration signals at regular intervals, and calculate the error as a basis for whether calibration is needed or not (see Appendix B for more details).

In short, SPPU is able to continuously perform MAC and nonlinear operations in parallel. The specific mapping relationship between computing mode and bias is shown in Table 1. It can be seen from this table that each channel can perform at most one addition, one multiplication, and one cosine activation in a single computation. Thus, the computing speed is estimated to $3 f$ operations per second (OPS) for each channel, where $f$ represents the loading speed of input data. If all eight channels are utilized, the signal superposition of these channels is equivalent to seven additional operations, leading to a total computing speed of $8 \times 3 f + 7 f = 31 f$ OPS.Table 1.

Computing Paradigm of a Single SPPU Channel in Different Computing Modes

Bias	Single PM	Two PMs
$Δ φ = 2 k π$	$y = w x$	$y = w (x_{1} + x_{2})$
$Δ φ = 2 k π + π / 2$	$y = w \cos (x)$	$y = w \cos (x_{1} + x_{2})$
$Δ φ = 2 k π - π / 2$	$y = - w \cos (x)$	$y = - w \cos (x_{1} + x_{2})$
$Δ φ = (2 k + 1) π$	$y = - w x$	$y = - w (x_{1} + x_{2})$

3. EXPERIMENTS AND RESULTS

A. SPPU Fabrication and Characterization

The SPPU chip used in our work is fabricated by 90 nm CMOS-compatible processes in a standard 220 nm silicon-on-insulator (SOI) platform, whose optical micrograph is shown in Fig. 2(a). Precisely, optical signals are incident from the right side of a photonic chip via edge couplers, while the electric results are output from the bottom side. As for the top and left sides, there are RF inputs and DC inputs, respectively. The footprint of the whole photonic chip is about $16 mm \times 6 mm$ , with the core computing area occupying around $13 {mm}^{2}$ . Moreover, the photoelectric hybrid package module of the SPPU chip is vividly illustrated in Fig. 2(b). The SPPU chip is welded to the printed circuit board (PCB) using wire bonding, whereas RF interfaces adopt GPPOs and DC interfaces use flexible flat cables (FFCs). As the on-chip BPDs generate photocurrent, we also package trans-impedance amplifier (TIA) chips onto the PCB to convert current signals to voltage signals.

Figure 2.Optical micrograph of SPPU. (a) Microscopic image of the SPPU chip. (b) Image of SPPU unit after photoelectric hybrid package. Microscopic images of (c) phase modulator, (d) add-drop microring resonator, and (e) photodetector.

Download full size

View all figures

Optical microscope images of several key elements on the silicon photonic chip are described in Figs. 2(c)–2(e). Precisely, the phase modulators in Fig. 2(c) are operating based on the plasma dispersion effect and carrier-depletion schema. To acquire a large modulation bandwidth, traveling-wave electrodes terminated with on-chip $50 Ω$ resistors (left ports of PM) are utilized to meet the requirements of phase match and impedance match. Meanwhile, the add-drop MRRs in Fig. 2(d) are made using a 500 nm wide strip waveguide to support the single TE mode near 1550 nm, with each MRR containing two TiN microheater-based thermo-optical phase shifters inside. Moreover, the on-chip photodetectors in Fig. 2(e) are Ge-based vertical p-doped-intrinsic-n-doped (PIN) diodes operating around 1550 nm. Here, two identical photodetectors constitute BPD in the SPPU by differential connection.

After introducing the structure of key devices, their physical properties are comprehensively investigated (see Fig. 3). The measured S21 parameters of PM are shown in Fig. 3(a), revealing that its 3 dB bandwidth is about 43.2 GHz. Thus, the modulation speed of input data can reach more than 40 Gbaud. To demonstrate the relationship between weights and voltage, we scan the variation of MRR weights over the voltage range of 1 to 4 V with a step of 0.02 V, whose results are presented in Fig. 3(b). With the increase of voltage, the weight shows a trend similar to the Lorentzian peak shape of MRR. Considering the random drift in MRR, we set 250 voltage values within the range of 0 to 5 V and sample each voltage 50 times (12,500 samples in total). The weight comparison between measured and expected results is displayed in Fig. 3(c). All points are distributed near the diagonal line (theoretical case), and the deviation roughly follows a normal distribution with a standard deviation of approximately 0.006. According to statistical theory, we estimate that the weight precision of MRR is about $\log_{2} (\frac{1}{3 \times 0.006}) = 5.8 bits$ .

Figure 3.Characterization of key photonic components on chip. (a) Measured bandwidth of the on-chip silicon phase modulator. (b) Normalized weights and the voltage applied to the on-ring heater for one microring resonator. (c) Scatter plots of the measured weights versus calculated expectations. (d) Measured bandwidth of the on-chip photodetector. (e) Responsivity of the on-chip photodetector measured at $- 1 V$ bias. (f) Dark current measured by the on-chip photodetector at different bias voltages.

Download full size

View all figures

Further to that, it can find the measured S21 parameters of the photodetector in Fig. 3(d), where the 3 dB bandwidth is around 33.63 GHz. The relationship between incident light power and photocurrent of the photodetector at $- 1 V$ bias is described in Fig. 3(e). The responsivity of the photodetector is obtained from the slope of the line fitted by scatter points, which is about 0.83 A/W. The dark current of the photodetector at different bias voltages is also measured, as illustrated in Fig. 3(f). It can be easily derived from this figure that the dark current is about 2.02 nA at $- 1 V$ bias. Thereby, the photodetector can not only support resultant output rates of over 30 Gbaud, but also detect an optical power as low as $- 50 dBm$ . Considering the working speeds of the photodetector and PM, the computing speed of SPPU is estimated to be as high as $31 \times 30 \times 10^{9} = 0.93$ TOPS.

We turn now to the working flow of one SPPU channel, whose results are shown in Fig. 4. To start with, the computational process of optical neural network inference performed by an SPPU channel is presented in Fig. 4(a), including input data format conversion, weight encoding, and result acquisition. Based on this, we can acquire the sequence calculation results of an SPPU channel in Fig. 4(b), whose weight is one and input data vector is [1, 0.75, 0.5, 0.25, 0, $- 0.25$ , $- 0.5$ , $- 0.75$ , $- 1$ ]. Then, we send this input data vector at regular intervals to determine whether the SPPU channel needs to be calibrated (see Appendix B for details). Figure 4(c) introduces the waveform of a convolutional computation based on 4-bit coding. The orange and blue lines represent the ideal and experimental waveforms, respectively. From these plots, we observe that the experimental results are in good agreement with the theoretically calculated samples, confirming the effectiveness of SPPU.

Figure 4.Working flow of one SPPU channel. (a) Schematic of an SPPU channel performing optical neural network computation. (b) Sequence calculation results when the weight is one and the input data vector is [1, 0.75, 0.5, 0.25, 0, $- 0.25$ , $- 0.5$ , $- 0.75$ , $- 1$ ], which is used for SPPU channel calibration. (c) Ideal (orange line) and experimental (blue curve) convolution output waveform with 4-bit coding.

Download full size

View all figures

B. SPPU for Handwritten Digit Classification

To experimentally characterize the capability of the SPPU chip, a handwritten digits classification task is conducted. Here, we construct a convolutional neural network with two convolutional layers and a fully connected layer, as shown in Fig. 5(a). The input image is from the MNIST dataset, whose matrix size is $28 \times 28$ . The first convolutional layer convolutes an input image with 12 $2 \times 2$ kernels, and then the calculated results are treated with a cosine function as nonlinear activation. Meanwhile, to reduce computation, $2 \times 2$ average pooling is also used in the convolutional layer. After the average pooling, the size of the resulting feature map becomes $13 \times 13 \times 12$ , which means that we obtain 12 feature matrices of size $13 \times 13$ . Similarly, the second convolutional layer has the same structure as the first, except that it has five kernels and the resulting feature map is $6 \times 6 \times 5$ in size. In this case, the feature map from the second convolutional layer is serialized into a $180 \times 1$ vector as the input of the fully connected layer to acquire the prediction results. It is easy to see that all the sizes of the convolution kernels and pooling are set as multiples of eight. This ensures that the SPPU’s computing units can be grouped effectively, allowing for multiple convolutions or pooling operations to run in parallel, making sure those valuable SPPU computing units are not sitting idle.

Figure 5.SPPU for a CNN. (a) Structure of the CNN for handwritten digits classification, which consists of two convolutional layers and a fully connected layer. Here, each convolutional layer can be subdivided into convolutional operations and pooling operations. (b) Variation in calculation accuracy and loss of validation dataset during the training. (c) Confusion matrices of digit classification. The experimental accuracy (97.9%) is in good agreement with the calculated results (98.3%).

Download full size

View all figures

The standard back-propagation algorithm and the Adam optimizer are used to train the parameters in the CNN on a computer. The variation in calculation accuracy and loss of the validation dataset during the training is exhibited in Fig. 5(b). Particularly, we adopt the quantization-aware training technique to reduce the impact of SPPU precision on CNN performance (see Appendix C for more training details) [43]. All parameters in convolutional layers are 4-bit precision. In the inference, two convolutional layers will be converted into MAC operations and nonlinear operations running on the SPPU, while the fully connected layer is performed on a computer. Specifically, convolution operations are implemented in the form of MAC operations by controlling $Δ φ = k π (k \in Z)$ , while cosine activation and average pooling are implemented simultaneously by setting $w = 1 / 4$ and $Δ φ = 2 k π + π / 2 (k \in Z)$ . Note that all computations within the convolutional layers are conducted in parallel in the single PM mode. In future works, the fully connected layer is also achieved by SPPU if MRRs shift from thermal to electrical regulation. Here, we experimentally demonstrate a 10-class classification of 10,000 images, whose confusion matrix is indicated in Fig. 5(c). At the same time, we train and infer a network with the same structure on a computer for comparison, with the corresponding confusion matrix shown in Fig. 5(c). The experimental accuracy obtained on SPPU is 97.9%, while the electrical computer-calculated accuracy is 98.3%, revealing the high consistency between our photonic solution and the electrical calculation.

C. SPPU for Signal Modulation Format Classification

We also construct a residual CNN to verify the ability of SPPU in signal modulation format classification [see Fig. 6(a)]. The input signal comes from the RML201610a dataset with a size of $2 \times 128$ . After entering the residual CNN, the first average pooling layer performs down-sampling operations to reduce computation with the size of $1 \times 2$ . It can be implemented on a single channel of SPPU by setting $w = 1 / 2$ and $Δ φ = 2 k π (k \in Z)$ . The following three convolutional layers convolute the signal, and then the results are processed with a cosine function as nonlinear activation. In particular, the third convolutional layer has two inputs, which come from the first pooling layer and the second convolutional layer, incorporating an additional operation of information fusion through addition. Specifically, the convolution operation of the first convolutional layer is achieved by setting $Δ φ = k π (k \in Z)$ in a single PM mode. Then, by adjusting $Δ φ = k π + π / 2 (k \in Z)$ , the nonlinear activation of the first convolutional layer and the convolution operation of the second convolutional layer can be realized simultaneously. In contrast, the convolution operation for inputs from the pooling layer needs to be adjusted for $Δ φ = k π (k \in Z)$ , because no nonlinear activation is required. Importantly, information fusion of two input convolutions (i.e., additive operation), nonlinear activation of the third convolutional layer, and the second pooling layer are achieved simultaneously by controlling $Δ φ = k π + π / 2 (k \in Z)$ and $w = 1 / 2$ in the two PM modes of SPPU. Thus, all the convolutional layers and pooling layers are executed in parallel by the SPPU in different modes through smart grouping. As for the fully connected layers, the first layer uses the cosine function, while the second layer utilizes the softmax function as the nonlinear activation. The fully connected layers are implemented on a computer.

Figure 6.SPPU for a residual CNN. (a) Structure of the residual CNN for signal modulation format classification. The CNN contains a residual block and two fully connected layers, where the residual block can be subdivided into three convolutional layers. (b) Variation in calculation accuracy and loss of validation dataset during the training. (c) Confusion matrices of modulation format classification.

Download full size

View all figures

The parameters in the residual CNN are trained with a standard back-propagation algorithm and an Adam optimizer on a computer. The exploration of calculation accuracy and loss of the validation dataset during the training is exhibited in Fig. 6(b). Similar to handwritten digit recognition, we also use quantization-aware to train the network weight precision to 4-bit. Then, we feed 16,000 signals into SPPU to estimate the classification accuracy. The confusion matrix in Fig. 6(c) illustrates that the prediction accuracies obtained from the experiment and computer calculation are 92.8% and 96.8%, respectively. The deviation of 4% is mainly caused by the limited precision and noise. Unlike handwritten digit recognition whose images are in low precision, the signal can be distorted at low precision, resulting in confusion between signals in different modulation formats. Although the precision will limit the performance of the SPPU to some extent, the implementation of residual blocks by SPPU proves its application potential in large-scale neural networks. Currently, residual blocks are widely used in large-scale vision models such as ResNet-101, which indicates the great potentials of SPPU in aspects of target detection [44] and image recognition [45] when its computing scale and speed are further enhanced.

D. SPPU for Emotion Recognition of Reviews

To further evaluate the scalability of SPPU chips, we construct a simple RNN network for emotion recognition of reviews. The reviews are from both the Amazon Fine Food Reviews dataset and the IMDB Movie Reviews dataset, with the relevant results being illustrated in Fig. 7. The simple RNN in Fig. 7(a) consists of an embedding layer, a recurrent layer, and a fully connected layer. The embedding layer performs feature extraction on the word vectors, reducing its dimensionality. After the embedding layer, each review is represented as a dense matrix. The embedding layer is done on the computer and the size of output is $128 \times 16$ . Note that each review is represented using a $128 \times 16$ matrix. In the recurrent layer, the time parameter $t$ takes values from 1 to 128, while the input vector $X$ has a dimension of 16. Considering that the output dimension of the recurrent layer is 16, the sizes of weight matrices $U$ and $W$ are $16 \times 16$ . Naturally, the SPPU execution recurrent layer can be divided into three steps. In the first step, we decompose the $16 \times 16$ weight matrix into 32 vectors of a length 8 and complete the computation of $W X_{t}$ by MAC operation. Among them, the multiplication of two vectors is realized by controlling $Δ φ = k π (k \in Z)$ in a single PM mode. The superposition of two vector multiplications is done by controlling $w = 1$ and $Δ φ = 2 k π (k \in Z)$ in two PM modes. In the second step, we calculate $U S_{t - 1}$ in the same way as step 1. Note that $S_{0}$ is equal to zero. In the last step, we control $w = 1$ and $Δ φ = k π + π / 2 (k \in Z)$ in two PM modes to complete the superposition of the results in the first two steps and cosine nonlinear activation simultaneously. Finally, the fully connected layer is used as the output layer, and its activation function is a softmax function.

Figure 7.SPPU for a simple RNN. (a) Structure of the simple RNN for emotion recognition of reviews. It contains an embedding layer, a recurrent layer, and a fully connected layer. Variation in calculation accuracy and loss for both (b) Amazon Fine Food Reviews dataset and (c) IMDBMovie Reviews dataset, during the training. Confusion matrices of emotion recognition for the (d) Amazon Fine Food Reviews dataset and (e) IMDB Movie Reviews dataset.

Download full size

View all figures

We also use standard back-propagation algorithms and Adam optimization algorithms to train the parameters in RNN. Considering the MRR weight loading precision, the quantization-aware technique is also employed in the training process to obtain parameters with a 4-bit precision. In Figs. 7(b) and 7(c), we present the loss and accuracy changes in the validation set during the training, where both cases of the Amazon Fine Food Reviews dataset [Amazon, Fig. 7(b)] and IMDB Movie Reviews dataset [IMDB, Fig. 7(c)] are included. The RNN structure used for these two datasets is the same [see Fig. 7(a)]. The Amazon dataset shows some fluctuations in the loss and accuracy curves during training, which basically converges after 50 epochs. In contrast, the IMDB dataset varies more smoothly and basically converges after 80 epochs. The Amazon dataset contains about 500,000 reviews. We randomly select 105,162 items to form the test dataset, of which 16,449 are negative and 88,713 are positive. The confusion matrices obtained from RNN in the Amazon dataset are shown in Fig. 7(d). Positive reviews are identified with higher accuracy than negative reviews due to the relatively large number of positive reviews in the dataset. As a result, the experimental accuracy obtained on SPPU is 92.3% and the computer calculation accuracy is 92.5%, with results in good agreement, with the 0.2% deviation mainly due to noise. As for the IMDB dataset, it contains a total of 50,000 reviews, 25,000 of which are test samples. Unlike the Amazon dataset, the number of positive and negative reviews in the IMDB dataset is consistent. Hence, the identification accuracy of positive and negative reviews in the IMDB dataset is roughly the same as the confusion matrix in Fig. 7(e). The experimental accuracy obtained by SPPU and the calculation accuracy obtained by the computer are 83.6% and 84.3%, respectively, with a noise-induced bias of about 0.7%. When comparing these two datasets, the Amazon dataset has much better recognition results than the IMDB dataset, which may be caused by the fact that the Amazon dataset has more samples (10 times).

4. CONCLUSIONS

In this work, we have self-designed an SPPU chip utilizing the fast modulation speed of PM, wavelength selective property of MRR, and large detection rates of BPD, whose computational speed is up to 0.93 TOPS, exceeding previous on-chip solutions [9 –14,24]. Although the relatively narrow bandwidth MRR in our prototype chip will cause loss and distortion during high-speed signal loading, we can make compensation by using higher-order MRR combined with embedded MZI [24]. In this case, the computational speed of the SPPU chip can be further improved after photonic device optimization. Meanwhile, considering that SPPU contains eight channels, they can be combined in any way according to the scale of the data to achieve parallel data processing. This allows each on-chip component to maximize efficiency at the current hardware preparation level. Moreover, the power consumption of the computing system mainly comes from the silicon photonic chip and peripheral devices (e.g., light sources, FPGA, and TIA). Here, the power consumption of our silicon photonic chip is estimated to be 0.1398 W per channel, with all key devices such as PMs, MRRs, and BPDs included, whose energy efficiency is about 0.83 TOPS/W. For more details about power consumption and energy efficiency please see Appendix A. With the addition of peripheral devices, the estimated power consumption is 78.3434 W considering the components used in our measurement setup. As a prototype experiment, the vast majority of power overhead comes from the peripheral discrete devices. In future work, we will optimize the peripheral circuits and photonic devices with more advanced integration and packaging techniques. The expected power consumption will be approximately 2 W based on similar protocols in Refs. [13,24], and the expected energy efficiency for the SPPU system is about 0.465 TOPS/W.

As a neural network inference unit, SPPU is able to implement cosine nonlinear activation beyond linear computation. Currently, cosine nonlinear activation has been applied in scenarios such as computer vision and Helmholtz equation solving [46]. When compared with the commonly used activation functions (e.g., relu, tanh), the higher-order derivability property of a cosine function allows neural networks to characterize some fine features. In this study, we train three neural networks using a cosine activation function to complete tasks of handwritten digit recognition, modulation format recognition, and review emotion recognition. To verify the performance of the cosine activation function, we also construct three neural networks with the same structure but different functions as activation, whose comparative results are presented in Table 2. In two CNNs with different structures, we utilize the relu function as a comparison, which is the most common activation function in convolutional layers. We find that the relu function slightly outperforms the cosine function in the MNIST dataset, while the cosine function achieves an absolute victory in the RML201610a dataset. In RNN, we adopt the tanh function based on the same principles as a comparison, and find that the cosine function has advantages in both datasets. Therefore, the cosine nonlinear activation function implemented in the SPPU is able to replace the nonlinear activation function that is not easily implemented in optical systems (e.g., relu, tanh). Of course, the limitation of the cosine function is its non-convex function with multiple optimal points, so the training of neural networks with the cosine function tends to require a smaller learning rate, resulting in relatively slow convergence and increasing training overhead.Table 2.

Recognition Accuracy of Neural Networks with Different Activation Functions

	Accuracy (%)
Function	MNIST	RML201610a	Amazon	IMDB
Calculated (cos)	98.3	96.8	92.5	84.3
Experiment (cos)	97.9	92.8	92.3	83.6
Calculated (relu)	98.6	89.6	–	–
Calculated (tanh)	–	–	92.4	80.3

In summary, we have experimentally established a fully integrated SPPU chip utilizing high-speed phase modulation to accelerate neural network inference. The SPPU adopted a unique parallel channel structure design, which allows for rapid expansion and enhances system process compatibility. Three neural networks with different structures are implemented in this SPPU chip, namely, CNN, residual CNN, and RNN. The comprehensive experiments show that SPPU can achieve different computing modes and complete different structures of inference by controlling working points. Compared with previously reported schemes, the switching of SPPU’s computing mode is done simultaneously via data encoding, which is extremely fast and with simple structures, indicating that SPPU can be easily extended to other neural network structures without adding time or space overhead. In addition, the signs of SPPU weights and data are also done by data encoding, which abandons the scheme of making a difference between two outputs in the existing architecture and reduces the time and space overhead while ensuring completeness. On the other hand, it is experimentally proved that the cosine activation implemented in SPPU is a good working condition for multi-task scenarios. This work has furnished a novel photonic computing scheme to serve large-scale neural networks and drive the next generation of artificial intelligence platforms toward practicality and versatility under process constraints.

5. METHODS

All experiments with the self-developed SPPU chips are implemented using commercial optoelectronic components. The light source is an ultra-low-phase-noise NKT laser, which is capable of generating a single-frequency continuous optical signal at a wavelength of 1550.12 nm and a linewidth of less than 0.1 kHz. For device characterization and performance experiments, high-speed signals are generated and sampled by an arbitrary waveform generator and a real-time oscilloscope. Considering that the arbitrary waveform generator and oscilloscope do not have enough channels to support high-speed signal generation and sampling for all channels of the SPPU, we switch to using the XILINX ZCU111 FPGA with RFSoC for signal generation and sampling in the three proof-of-concept tasks. A single ZCU111 FPGA supports eight 12-bit 4.096 GSa/s ADCs and eight 14-bit 6.554 GSa/s DACs. A 128-channel multichannel power supply controlled by the FPGA encodes the weight vector for the on-chip MRRs. The multichannel power supply also provides the DC bias voltage for the photodetectors. A Keysight B2962B low-noise power source is used to generate and measure high-precision DC signals.

Acknowledgment

Acknowledgment. The authors thank Pinghu Intelligent Optoelectronics Research Institute for providing optoelectronic hybrid packaging.

Category: Silicon Photonics

Received: Apr. 22, 2024

Accepted: Sep. 2, 2024

Published Online: Nov. 1, 2024

The Author Email: Tian Jiang (tjiang@nudt.edu.cn)

DOI:10.1364/PRJ.527940

CSTR:32188.14.PRJ.527940

Architectures	Data Loading Rate (Gbaud)	Weight Precision (bits)	Energy Efficiency (TOPS/W)
MZI mesh [8]	25	1	0.48
Cascaded MZI [12]	$10^{- 7}$	6.6	–
MRRs [47]	0.047	$> 5.5$	–
MRR + PCM [10]	2	5	0.4
MRR + VOA [9]	–	–	0.07
AWG [14]	$10^{- 5}$	–	0.11
MRR + WDM [25]	20	4.83	0.06
MMI [13]	16.6	5	0.21
MRR + delay [24]	17	9	0.2
This work	30	5.8	0.83/0.465

微信扫一扫：分享