In recent years, integrated optical processing units (IOPUs) have demonstrated advantages in energy efficiency and computational speed for neural network inference. However, limited by optical integration technology, the practicality and versatility of IOPUs face serious challenges. In this work, a scalable parallel photonic processing unit (SPPU) for accelerating various neural networks, based on high-speed phase modulation, is proposed and implemented on a silicon-on-insulator platform. The SPPU supports parallel processing and can switch between multiple computational paradigms simply and without latency to infer different neural network structures, maximizing the utility of on-chip components. The SPPU adopts a scalable and process-friendly architecture, with a photonic-core energy efficiency of 0.83 TOPS/W, two to ten times higher than that of existing integrated solutions. In proof-of-concept experiments, a convolutional neural network (CNN), a residual CNN, and a recurrent neural network (RNN) are all implemented on our photonic processor to handle handwritten digit classification, signal modulation format recognition, and review emotion recognition. The SPPU achieves multi-task parallel processing capability, serving as a promising research route toward maximizing the utility of on-chip components under the constraints of integration technology, which helps make IOPUs more practical and universal.
1. INTRODUCTION
With the development of neural network applications, large-scale neural networks such as ChatGPT have become a trend in artificial intelligence [1,2]. The popularity of such models has given rise to a surge in matrix operations and nonlinear computations. The amount of computation required by neural networks doubles every 100 days and is expected to increase by more than a million times in the next five years [3]. As Moore’s law slows down, it becomes challenging for traditional computing hardware to keep up with such rapid growth in computing demand [4]. Thanks to the low-loss, large-bandwidth, and highly parallel nature of light, optical neural network processing units with ultra-fast computing speeds and ultra-low power overheads have made great progress in recent years [5,6]. Many excellent optical integrated architectures have been proposed and widely applied in computer vision [7–10], natural language processing [11], medicine [12], and other fields, as they show excellent potential to improve energy consumption and computing speed by at least two orders of magnitude compared with electrical computing units [13].
Viewed as the next generation of neuromorphic processors, on-chip optical neural networks encompass several types of photonic architectures [6], including plane light conversion (PLC) [13–17], Mach-Zehnder interferometers (MZIs) [11,18–23], and wavelength division multiplexing (WDM) [10,24–31]. These solutions can optically implement vector-matrix multiplication, convolution, and other linear operations, greatly improving the inference efficiency of neural networks. However, limited by optical integration technology and the performance of on-chip components, these architectures still face remarkable challenges in practicality and versatility. To address these issues, many researchers are currently working toward large-scale photonic integration [32,33] and on-chip component optimization [34–37]. Thanks to these groundbreaking efforts, silicon photonic chips have made rapid progress in recent years. Meanwhile, a specially designed photonic architecture is also a prerequisite for maximizing the utility of crucial components and minimizing processing difficulty under hardware constraints. Even if the scale of on-chip components is fixed, the photonic architecture should have parallel processing capabilities to adapt to dynamically changing data sizes, preventing valuable on-chip components from sitting idle. Additionally, the photonic architecture is expected to offer multitasking capability for different scenarios. Unfortunately, existing research mainly focuses on a specific neural network structure, such as fully connected layers or convolutional layers, causing significant limitations in scenarios demanding different neural network structures. Such structure-specific architecture design is clearly not conducive to maximizing the utility of on-chip components. In terms of minimizing processing difficulty, optical architecture design should account for device failures or substandard performance caused by limited fabrication yield, thereby improving process compatibility.
In this work, we propose and demonstrate a scalable parallel photonic processing unit (SPPU) for accelerating various neural networks on a silicon photonic platform, exploiting high-speed phase modulation to improve the versatility and practicality of optical neural network inference accelerators. Specifically, our photonic tensor processor contains eight optically independent computing channels, which can be grouped freely to process different computing tasks in parallel (e.g., operations with two sets of length-4 vectors or four sets of length-2 vectors), with each channel link consisting of two phase modulators (PMs), two microring resonators (MRRs), and a balanced photodetector (BPD). The two on-chip silicon PMs can simultaneously load two input data streams to support special network structures such as residual layers, pooling layers, and recurrent layers, or a single PM can be used to implement basic network structures such as convolutional layers and fully connected layers. Either MRR can be used to load weights, with one serving as a backup to reduce the loss of weight precision caused by process errors. The output results are derived from the BPDs, whose current direction visualizes the positive or negative sign. Equally important, the SPPU can realize both the multiply-accumulate (MAC) and nonlinear operations required in neural network inference without additional time or space consumption. Comprehensive multi-task experiments show that three types of neural networks, namely, a convolutional neural network (CNN), a three-layer residual CNN, and a recurrent neural network (RNN), run well on our SPPU chip, with recognition accuracies of 97.9% for the CNN on the MNIST dataset, 92.8% for the three-layer residual CNN on eight signal modulation formats (RML201610a) [38], and 83.6% (92.3%) for the RNN on the IMDB Movie Reviews dataset [39] (Amazon Fine Food Reviews dataset [40]). Owing to the high integration level and fast data loading speed, the energy efficiency of the photonic core is significantly improved, reaching 0.83 TOPS/W, two to ten times higher than previous works [9–14,24]. This work takes an essential step toward bringing integrated photonic AI accelerators into the real world.
2. SPPU ARCHITECTURE AND PRINCIPLE
The proposed SPPU chip, consisting of eight optically independent computing channels, can support neural network inference with multiple structures; its functionality and conceptual architecture are shown in Figs. 1(a) and 1(b), respectively. As presented in Fig. 1(a), the SPPU is able to implement four base operators, namely, $wx$, $w\cos x$, $w(x_1+x_2)$, and $w\cos(x_1+x_2)$, to support the inference of CNNs, residual CNNs, and RNNs. These operators allow the SPPU to perform recognition and classification tasks over a wide range of data, such as handwritten digits, modulated signals, and reviews. From Fig. 1(b), one can see that the high-speed input data are encoded into the optical phase via on-chip silicon PMs in each channel, where a single PM or two PMs can be used for encoding in the same channel. The kernel weights are mapped to heater voltages in the MRRs. Here, we employ one MRR to load the weights and another MRR as a backup, in order to minimize the impact of process deviations on system performance. By sampling the RF output signals from the BPD, the computation results are obtained. The eight channels adopt structurally independent designs, which avoids system failure caused by the failure of a single channel and supports parallel data processing.
Figure 1. Conceptual architecture diagram of SPPU on chip. (a) Basic process of SPPU inference for neural networks with different structures. (b) Schematic illustration of SPPU chip. This chip includes eight independent computational channels, with each channel consisting of an edge coupler (EC), two phase modulators (PMs), two microring resonators (MRRs), a balanced photodetector (BPD), and two grating couplers (GCs).
Within a single channel, the PM is assumed to load a pulse amplitude modulation (PAM) digital signal, with the resulting optical phase shift in a single symbol denoted as $\varphi_x$. At the same time, another PAM digital signal is loaded on the heating electrodes of the MRR, producing a weight denoted as $w$. After coherent detection, the output of the BPD is mathematically expressed as
$$I_{\mathrm{out}} \propto 2E_1E_2\,w\cos(\varphi_x+\Delta\varphi), \qquad (1)$$
where $E_1$ and $E_2$ are the amplitudes of the upper and lower optical carriers, respectively, and $\Delta\varphi$ represents the phase difference generated by the upper and lower optical transmissions. According to the signal loading methods, several different working principles are introduced as follows.
To begin with, we consider the case of signal loading with a single PM, which allows the simplification of Eq. (1) to
$$I_{\mathrm{out}} \propto 2E^2\,w\cos(\varphi_x+\varphi_b), \qquad (2)$$
where $E$ represents the amplitude of the optical carrier and $\varphi_b$ is a bias phase applied via the PM. We introduce a bias $\varphi_b=\pi/2$, which turns the cosine into $-\sin\varphi_x$, and with the approximation $\sin\varphi_x\approx\varphi_x$ the output of this channel is approximately proportional to $w\varphi_x$ (the fixed sign and scale are absorbed into the encoding). Considering error control and the PM encoding interval, we limit $\varphi_x$ to the range $[-\pi/8,\pi/8]$ to ensure that the approximation error is less than 0.01. As a result, a MAC operation of length 8 can be achieved by superimposing the outputs of all channels. The vector length can also be changed according to task requirements, and the channels can be grouped for parallel processing of multiple vectors. If the length of a MAC operation exceeds eight, we use a divide-and-conquer strategy to break the longer MAC calculation into a series of computations of length at most eight, ensuring that the SPPU flexibly adapts to MAC operations of different lengths. Notably, the input data are signed in the MAC operation, since $\varphi_x$ spans both positive and negative values. Besides, if we apply a bias such that $\varphi_b=0$, the output becomes $w\cos\varphi_x$. At this point, the signal superposition from the eight channels realizes a MAC operation with cosine activation, a nonlinear activation applied to the outputs of the previous layer. Lastly, setting the bias to $\varphi_b=\pi$ results in $-w\cos\varphi_x$, which completes the representation of negative weights. Changing the applied bias can also produce activations with different degrees of linearity. Since the bias and the input data are loaded at the same time, switching between computing modes does not affect the computing speed.
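To make these working points concrete, the following minimal numerical sketch models one channel according to Eq. (2); the input-to-phase mapping, normalization, and function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Toy model of one SPPU channel in single-PM mode, following Eq. (2):
# y ~ w * cos(phi_x + phi_b). Carrier power is dropped (normalized output).

PHI_MAX = np.pi / 8  # encoding range keeping the small-angle error below 0.01

def channel_single_pm(x, w, phi_b):
    """Normalized BPD output of one channel."""
    phi_x = x * PHI_MAX              # map input x in [-1, 1] to a phase shift
    return w * np.cos(phi_x + phi_b)

x, w = 0.6, 0.4
mac    = channel_single_pm(x, w, np.pi / 2)  # MAC mode: ~ -w * phi_x
cosact = channel_single_pm(x, w, 0.0)        # cosine-activation mode: w*cos(phi_x)
negw   = channel_single_pm(x, w, np.pi)      # negative-weight mode: -w*cos(phi_x)

# MAC mode recovers w*x up to the fixed sign/scale of the encoding:
print(-mac / PHI_MAX, w * x)                 # both ≈ 0.24

# A length-8 MAC is the superposition of eight channels in MAC mode:
xs, ws = np.random.rand(8), np.random.rand(8)
dot8 = sum(channel_single_pm(xi, wi, np.pi / 2) for xi, wi in zip(xs, ws))
print(-dot8 / PHI_MAX, xs @ ws)              # ≈ equal (small-angle error)
```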
Next, we discuss signal loading with two PMs, in which case the BPD output can be described as
$$I_{\mathrm{out}} \propto 2E^2\,w\cos(\varphi_{x1}-\varphi_{x2}+\varphi_b). \qquad (3)$$
If we load the input data on PM1 and the bias on PM2, the situation is the same as discussed earlier. In the following, we focus on the case where both PMs are loaded with input signals. By inverting the input signal of PM2, the normalized result is
$$y = w\cos(\varphi_{x1}+\varphi_{x2}+\varphi_b). \qquad (4)$$
When a bias is applied to ensure $\varphi_b=\pi/2$, Eq. (4) becomes $y\approx w(\varphi_{x1}+\varphi_{x2})$ up to a fixed sign and scale. First, if the weight is set to 0.5, we can use the SPPU to perform average pooling in a CNN. Since the weights remain unchanged during average pooling, the slow thermal tuning of the MRR does not affect the computing speed. In addition, large-scale neural networks require many residual layers to mitigate the vanishing-gradient problem. Interestingly, the basic operation of a residual layer is $y=x_1+x_2$, where $x_1$ and $x_2$ represent inputs from different layers. Naturally, the SPPU can be efficiently employed to perform the operations of the residual layer. When the scale of the SPPU is increased in the future, it is expected to serve large-model inference such as ResNet-101 [41].
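The two-PM modes of Eq. (4) can be illustrated in the same way; again, the scaling and the helper function below are toy assumptions consistent with the bias settings derived above.

```python
import numpy as np

# Toy model of the two-PM mode of Eq. (4): y = w * cos(phi_x1 + phi_x2 + phi_b),
# with the same assumed input-to-phase mapping as in the single-PM sketch.

PHI_MAX = np.pi / 8

def channel_two_pm(x1, x2, w, phi_b):
    return w * np.cos((x1 + x2) * PHI_MAX + phi_b)

x1, x2 = 0.8, 0.2

# Average pooling: phi_b = pi/2 linearizes the cosine, w = 0.5 halves the sum.
pool = -channel_two_pm(x1, x2, 0.5, np.pi / 2) / PHI_MAX
print(pool, (x1 + x2) / 2)   # ≈ 0.49 vs 0.50 (small-angle error)

# Residual-layer addition x1 + x2: same bias, w = 1.
res = -channel_two_pm(x1, x2, 1.0, np.pi / 2) / PHI_MAX
print(res, x1 + x2)          # ≈ 0.97 vs 1.00 (error grows near the range edge)
```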
Additionally, RNNs with feedback structures are mostly utilized in contextually relevant scenarios to leverage historical information [42]. The basic operation in a simple RNN is $h_t=f(Wx_t+Uh_{t-1})$, where $f$ represents the activation function, $h_t$ denotes the state of the RNN, and $W$ and $U$ are both weight matrices. If we employ a bias such that $\varphi_b=0$, the output becomes $w\cos(\varphi_{x1}+\varphi_{x2})$, so a weighted state update and its cosine activation are obtained in a single pass. In this case, the SPPU can perform a simple RNN efficiently, indicating that our proposed chip can be easily extended to more categories of neural networks. One can still control the bias to implement negative weights (i.e., $\varphi_b=\pi$) and activation functions with different degrees of linearity.
To decode the calculation results, the output signals from the SPPU are sampled, quantized, and reshaped by electronic hardware. A high-speed analog-to-digital converter (ADC) converts the analog signal to digital form, followed by quantization of the output voltage based on a pre-established mapping table. Finally, we reshape these values to obtain the desired matrix format. Turning now to computing-mode switching and weight calibration, we measure the photocurrent intensity from the monitoring PDs to calibrate the weights loaded onto the MRRs. As for the switching of computing modes, the bias voltage corresponding to each computing mode is obtained by scanning the relationship curve between the bias voltage and the BPD photocurrent intensity. Since the SPPU state drifts slowly with the environment, we send out calibration signals at regular intervals and use the calculated error as the basis for deciding whether recalibration is needed (see Appendix B for more details).
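As a rough illustration of this decode path, the sketch below quantizes sampled BPD voltages against a pre-measured voltage-to-value mapping table and reshapes them into a matrix; the code book and table values are mock placeholders, not the measured calibration data.

```python
import numpy as np

# Illustrative post-processing of sampled BPD voltages: nearest-level
# quantization against a calibration table, then reshaping to a matrix.

levels = np.linspace(-1.0, 1.0, 16)        # 4-bit code book (assumed)
table_volts = 0.21 * levels + 0.003        # mock measured volts per level

def decode(samples, rows, cols):
    # pick, for each sample, the closest entry in the mapping table
    idx = np.abs(samples[:, None] - table_volts[None, :]).argmin(axis=1)
    return levels[idx].reshape(rows, cols)

adc_samples = np.array([0.212, -0.108, 0.004, 0.196])  # mock ADC readout
print(decode(adc_samples, 2, 2))
```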
In short, the SPPU is able to continuously perform MAC and nonlinear operations in parallel. The specific mapping between computing mode and bias is shown in Table 1. As the table shows, each channel can perform at most one addition, one multiplication, and one cosine activation in a single computation. Thus, the computing speed is estimated to be $3\nu$ operations per second (OPS) for each channel, where $\nu$ represents the loading speed of the input data. If all eight channels are utilized, the signal superposition of these channels is equivalent to seven additional addition operations, leading to a total computing speed of $(8\times 3+7)\nu = 31\nu$ OPS.
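These throughput figures follow directly from the operation counting; as a quick sanity check with the numbers reported elsewhere in this work ($\nu$ = 30 GBaud):

```python
# Back-of-the-envelope throughput from the operation counting above.
v = 30e9                     # input loading rate nu (30 GBaud, from Table 3)
per_channel = 3 * v          # multiply + add + cosine activation per symbol
total = (8 * 3 + 7) * v      # eight channels plus seven summing additions
print(per_channel / 1e12)    # 0.09 TOPS per channel
print(total / 1e12)          # 0.93 TOPS in total
```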
Table 1. Computing Paradigm of a Single SPPU Channel in Different Computing Modes

Bias | Single PM | Two PMs
$\varphi_b=\pi/2$ | $w\varphi_x$ (MAC) | $w(\varphi_{x1}+\varphi_{x2})$ (addition/pooling)
$\varphi_b=0$ | $w\cos\varphi_x$ (MAC with cosine activation) | $w\cos(\varphi_{x1}+\varphi_{x2})$ (addition with cosine activation)
$\varphi_b=\pi$ | $-w\cos\varphi_x$ (negative weight) | $-w\cos(\varphi_{x1}+\varphi_{x2})$ (negative weight)
3. EXPERIMENTS AND RESULTS
A. SPPU Fabrication and Characterization
The SPPU chip used in our work is fabricated with 90 nm CMOS-compatible processes on a standard 220 nm silicon-on-insulator (SOI) platform; its optical micrograph is shown in Fig. 2(a). Optical signals enter from the right side of the photonic chip via edge couplers, while the electrical results are output from the bottom side. The top and left sides host the RF inputs and DC inputs, respectively. The footprint of the whole photonic chip is about , with the core computing area occupying around . Moreover, the photoelectric hybrid package module of the SPPU chip is illustrated in Fig. 2(b). The SPPU chip is attached to the printed circuit board (PCB) by wire bonding, whereas the RF interfaces adopt GPPO connectors and the DC interfaces use flexible flat cables (FFCs). As the on-chip BPDs generate photocurrent, we also package trans-impedance amplifier (TIA) chips onto the PCB to convert the current signals into voltage signals.
Figure 2. Optical micrograph of SPPU. (a) Microscopic image of the SPPU chip. (b) Image of SPPU unit after photoelectric hybrid packaging. Microscopic images of (c) phase modulator, (d) add-drop microring resonator, and (e) photodetector.
Optical microscope images of several key elements on the silicon photonic chip are shown in Figs. 2(c)–2(e). The phase modulators in Fig. 2(c) operate based on the plasma dispersion effect in a carrier-depletion scheme. To acquire a large modulation bandwidth, traveling-wave electrodes terminated with on-chip resistors (left ports of the PM) are utilized to meet the requirements of phase matching and impedance matching. Meanwhile, the add-drop MRRs in Fig. 2(d) are made of a 500 nm wide strip waveguide supporting the single TE mode near 1550 nm, with each MRR containing two TiN microheater-based thermo-optic phase shifters. Moreover, the on-chip photodetectors in Fig. 2(e) are Ge-based vertical p-doped-intrinsic-n-doped (PIN) diodes operating around 1550 nm. Two identical photodetectors in differential connection constitute the BPD in the SPPU.
After introducing the structure of the key devices, their physical properties are comprehensively investigated (see Fig. 3). The measured S21 parameters of the PM are shown in Fig. 3(a), revealing a 3 dB bandwidth of about 43.2 GHz. Thus, the modulation speed of the input data can reach more than 40 GBaud. To demonstrate the relationship between weights and voltage, we scan the MRR weights over the voltage range of 1 to 4 V with a step of 0.02 V, with the results presented in Fig. 3(b). As the voltage increases, the weight follows a trend similar to the Lorentzian peak shape of the MRR. Considering the random drift in the MRR, we set 250 voltage values within the range of 0 to 5 V and sample each voltage 50 times (12,500 samples in total). The comparison between measured and expected weights is displayed in Fig. 3(c). All points are distributed near the diagonal line (the theoretical case), and the deviation roughly follows a normal distribution with a standard deviation of approximately 0.006. According to statistical theory, we estimate the weight precision of the MRR to be about 5.8 bits (taking the inter-code spacing as three standard deviations, $\log_2[1/(3\times 0.006)]\approx 5.8$).
Figure 3. Characterization of key photonic components on chip. (a) Measured bandwidth of the on-chip silicon phase modulator. (b) Normalized weights versus the voltage applied to the on-ring heater for one microring resonator. (c) Scatter plots of the measured weights versus calculated expectations. (d) Measured bandwidth of the on-chip photodetector. (e) Responsivity of the on-chip photodetector measured at reverse bias. (f) Dark current of the on-chip photodetector at different bias voltages.
Furthermore, the measured S21 parameters of the photodetector are shown in Fig. 3(d), where the 3 dB bandwidth is around 33.63 GHz. The relationship between the incident light power and the photocurrent of the photodetector at reverse bias is described in Fig. 3(e). The responsivity of the photodetector, obtained from the slope of the line fitted to the scatter points, is about 0.83 A/W. The dark current of the photodetector at different bias voltages is also measured, as illustrated in Fig. 3(f); the dark current is about 2.02 nA at reverse bias. Thereby, the photodetector can not only support output rates of over 30 GBaud but also detect very weak optical powers. Considering the working speeds of the photodetector and the PM, the computing speed of the SPPU is estimated to be as high as 0.93 TOPS.
We turn now to the working flow of one SPPU channel, with the results shown in Fig. 4. To start with, the computational process of optical neural network inference performed by an SPPU channel is presented in Fig. 4(a), including input data format conversion, weight encoding, and result acquisition. Based on this, we acquire the sequence calculation results of an SPPU channel in Fig. 4(b), where the weight is one and the input data vector is [1, 0.75, 0.5, 0.25, 0, −0.25, −0.5, −0.75, −1]. This input data vector is sent at regular intervals to determine whether the SPPU channel needs to be calibrated (see Appendix B for details). Figure 4(c) shows the waveform of a convolutional computation based on 4-bit coding. The orange and blue lines represent the ideal and experimental waveforms, respectively. From these plots, we observe that the experimental results are in good agreement with the theoretically calculated samples, confirming the effectiveness of the SPPU.
Figure 4. Working flow of one SPPU channel. (a) Schematic of an SPPU channel performing optical neural network computation. (b) Sequence calculation results when the weight is one and the input data vector is [1, 0.75, 0.5, 0.25, 0, −0.25, −0.5, −0.75, −1], which is used for SPPU channel calibration. (c) Ideal (orange line) and experimental (blue curve) convolution output waveforms with 4-bit coding.
B. SPPU for Handwritten Digit Classification

To experimentally characterize the capability of the SPPU chip, a handwritten digit classification task is conducted. Here, we construct a convolutional neural network with two convolutional layers and a fully connected layer, as shown in Fig. 5(a). The input image is from the MNIST dataset, with a matrix size of 28 × 28. The first convolutional layer convolves the input image with 12 kernels, and the calculated results are then treated with a cosine function as the nonlinear activation. Meanwhile, to reduce computation, average pooling is also used in the convolutional layer. After the average pooling, the feature map is down-sampled, yielding 12 feature matrices. Similarly, the second convolutional layer has the same structure as the first, except that it has five kernels and produces a correspondingly smaller feature map. The feature map from the second convolutional layer is then serialized into a vector as the input of the fully connected layer to acquire the prediction results. Note that all the sizes of the convolution kernels and pooling windows are set to multiples of eight. This ensures that the SPPU's computing channels can be grouped effectively, allowing multiple convolution or pooling operations to run in parallel and making sure those valuable SPPU computing units do not sit idle (see the sketch below).
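As an illustration of this grouping, the sketch below lowers a 2D convolution to a series of length-8 dot products, the granularity at which the SPPU channels operate; the image and kernel sizes are placeholders rather than the network's actual dimensions.

```python
import numpy as np

# Lowering a 2D convolution to length-8 MACs so the eight SPPU channels
# stay busy. Each length-8 chunk corresponds to one parallel SPPU pass.

def conv2d_as_mac8(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    flat_k = kernel.ravel()
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw].ravel()
            acc = 0.0
            for s in range(0, patch.size, 8):      # divide-and-conquer
                acc += patch[s:s + 8] @ flat_k[s:s + 8]  # one SPPU pass
            out[i, j] = acc
    return out

img = np.random.rand(28, 28)
ker = np.random.rand(4, 4)     # 16 taps -> two length-8 MACs per patch
print(conv2d_as_mac8(img, ker).shape)   # (25, 25)
```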
Figure 5. SPPU for a CNN. (a) Structure of the CNN for handwritten digit classification, which consists of two convolutional layers and a fully connected layer. Here, each convolutional layer can be subdivided into convolutional operations and pooling operations. (b) Variation in calculation accuracy and loss of the validation dataset during training. (c) Confusion matrices of digit classification. The experimental accuracy (97.9%) is in good agreement with the calculated results (98.3%).
The standard back-propagation algorithm and the Adam optimizer are used to train the parameters of the CNN on a computer. The variation in calculation accuracy and loss on the validation dataset during training is exhibited in Fig. 5(b). In particular, we adopt the quantization-aware training technique to reduce the impact of the SPPU's precision on CNN performance (see Appendix C for more training details) [43]. All parameters in the convolutional layers have 4-bit precision. During inference, the two convolutional layers are converted into MAC operations and nonlinear operations running on the SPPU, while the fully connected layer is performed on a computer. Specifically, the convolution operations are implemented as MAC operations by controlling $\varphi_b=\pi/2$, while cosine activation and average pooling are implemented simultaneously by setting $\varphi_b=0$ and $w=0.5$, with the channel superposition completing the averaging. Note that all computations within the convolutional layers are conducted in parallel in the single-PM mode. In future work, the fully connected layer can also be realized by the SPPU if the MRRs shift from thermal to electrical tuning. Here, we experimentally demonstrate a 10-class classification of 10,000 images, with the confusion matrix indicated in Fig. 5(c). At the same time, we train and infer a network with the same structure on a computer for comparison, with the corresponding confusion matrix also shown in Fig. 5(c). The experimental accuracy obtained on the SPPU is 97.9%, while the computer-calculated accuracy is 98.3%, revealing the high consistency between our photonic solution and the electrical calculation.
C. SPPU for Signal Modulation Format Classification
We also construct a residual CNN to verify the ability of the SPPU in signal modulation format classification [see Fig. 6(a)]. The input signal comes from the RML201610a dataset. After entering the residual CNN, the first average pooling layer performs down-sampling to reduce the computation. It can be implemented on a single channel of the SPPU by setting $\varphi_b=\pi/2$ and $w=0.5$. The following three convolutional layers convolve the signal, and the results are then processed with a cosine function as the nonlinear activation. In particular, the third convolutional layer has two inputs, which come from the first pooling layer and the second convolutional layer, incorporating an additional information-fusion operation through addition. Specifically, the convolution operation of the first convolutional layer is achieved by setting $\varphi_b=\pi/2$ in the single-PM mode. Then, by adjusting $\varphi_b=0$, the nonlinear activation of the first convolutional layer and the convolution operation of the second convolutional layer are realized simultaneously. In contrast, the convolution operation for inputs from the pooling layer needs $\varphi_b=\pi/2$, because no nonlinear activation is required. Importantly, the information fusion of the two input convolutions (i.e., the additive operation), the nonlinear activation of the third convolutional layer, and the second pooling layer are achieved simultaneously by controlling $\varphi_b=0$ and $w=0.5$ in the two-PM mode (see the sketch below). Thus, all the convolutional layers and pooling layers are executed in parallel by the SPPU in different modes through smart grouping. As for the fully connected layers, the first layer uses the cosine function, while the second layer utilizes the softmax function as the nonlinear activation. The fully connected layers are implemented on a computer.
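For clarity, the fused third-layer step can be mimicked with the toy channel model from Section 2; the settings below follow the mode assignments described above and are purely illustrative.

```python
import numpy as np

# Fused residual step in the two-PM mode: branch addition, cosine activation,
# and the 0.5 pooling weight in a single pass (toy model, assumed scaling).

PHI_MAX = np.pi / 8

def channel_two_pm(x1, x2, w, phi_b):
    return w * np.cos((x1 + x2) * PHI_MAX + phi_b)

skip, conv_out = 0.3, -0.1            # pooled input and second-conv output
fused = channel_two_pm(skip, conv_out, 0.5, 0.0)
# One pass yields 0.5*cos(phi_skip + phi_conv): addition and activation fused;
# superposing two such channel outputs completes the 2-point average pooling.
print(fused)
```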
Figure 6.SPPU for a residual CNN. (a) Structure of the residual CNN for signal modulation format classification. The CNN contains a residual block and two fully connected layers, where the residual block can be subdivided into three convolutional layers. (b) Variation in calculation accuracy and loss of validation dataset during the training. (c) Confusion matrices of modulation format classification.
The parameters of the residual CNN are trained with the standard back-propagation algorithm and the Adam optimizer on a computer. The evolution of the calculation accuracy and loss on the validation dataset during training is exhibited in Fig. 6(b). As with handwritten digit recognition, we use quantization-aware training to constrain the network weights to 4-bit precision. Then, we feed 16,000 signals into the SPPU to estimate the classification accuracy. The confusion matrices in Fig. 6(c) show that the prediction accuracies obtained from the experiment and the computer calculation are 92.8% and 96.8%, respectively. The deviation of 4% is mainly caused by the limited precision and noise. Unlike handwritten digit images, which tolerate low precision, the signals can be distorted at low precision, resulting in confusion between signals of different modulation formats. Although the precision limits the performance of the SPPU to some extent, the implementation of residual blocks by the SPPU proves its application potential in large-scale neural networks. Residual blocks are widely used in large-scale vision models such as ResNet-101, which indicates the great potential of the SPPU for target detection [44] and image recognition [45] once its computing scale and speed are further enhanced.
D. SPPU for Emotion Recognition of Reviews
To further evaluate the scalability of SPPU chips, we construct a simple RNN for emotion recognition of reviews. The reviews come from both the Amazon Fine Food Reviews dataset and the IMDB Movie Reviews dataset, with the relevant results illustrated in Fig. 7. The simple RNN in Fig. 7(a) consists of an embedding layer, a recurrent layer, and a fully connected layer. The embedding layer performs feature extraction on the word vectors, reducing their dimensionality; it is executed on the computer, and each review is thereby represented as a dense $128\times 16$ matrix. In the recurrent layer, the time step runs from 1 to 128, while the input vector has a dimension of 16. Considering that the output dimension of the recurrent layer is also 16, the weight matrices $W$ and $U$ are both of size $16\times 16$. Naturally, the execution of the recurrent layer on the SPPU can be divided into three steps. In the first step, we decompose the weight matrix $W$ into 32 vectors of length 8 and complete the computation of $Wx_t$ by MAC operations. Here, the multiplication of two vectors is realized by controlling $\varphi_b=\pi/2$ in the single-PM mode, and the superposition of two vector multiplications is done by controlling $\varphi_b=\pi/2$ and $w=1$ in the two-PM mode. In the second step, we calculate $Uh_{t-1}$ in the same way as in step 1; note that $h_0$ is equal to zero. In the last step, we control $\varphi_b=0$ and $w=1$ in the two-PM mode to complete the superposition of the results of the first two steps and the cosine nonlinear activation simultaneously (see the sketch below). Finally, the fully connected layer is used as the output layer, with a softmax activation function.
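The three-step schedule can be summarized with a toy NumPy model (the phase-encoding scale of the hardware is omitted here; the matrix sizes follow the text, and the weights are random placeholders):

```python
import numpy as np

# Illustrative three-step schedule for one recurrent update
# h_t = cos(W @ x_t + U @ h_prev), with W and U of size 16x16 decomposed
# into 32 row segments of length 8 (two segments per output neuron).

rng = np.random.default_rng(0)
W, U = rng.normal(size=(16, 16)), rng.normal(size=(16, 16))
x_t, h_prev = rng.normal(size=16), rng.normal(size=16)

def mac8_rowsum(M, v):
    """Steps 1/2: per-row dot products built from length-8 MAC passes."""
    out = np.zeros(16)
    for r in range(16):
        out[r] = M[r, :8] @ v[:8] + M[r, 8:] @ v[8:]   # two SPPU passes + add
    return out

wx = mac8_rowsum(W, x_t)        # step 1
uh = mac8_rowsum(U, h_prev)     # step 2 (h_0 = 0 at t = 1)
h_t = np.cos(wx + uh)           # step 3: fused addition + cosine activation
print(h_t[:4])
```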
Figure 7. SPPU for a simple RNN. (a) Structure of the simple RNN for emotion recognition of reviews. It contains an embedding layer, a recurrent layer, and a fully connected layer. Variation in calculation accuracy and loss for the (b) Amazon Fine Food Reviews dataset and (c) IMDB Movie Reviews dataset during training. Confusion matrices of emotion recognition for the (d) Amazon Fine Food Reviews dataset and (e) IMDB Movie Reviews dataset.
We also use the standard back-propagation algorithm and the Adam optimizer to train the parameters of the RNN. Considering the MRR weight loading precision, the quantization-aware technique is again employed in the training process to obtain parameters with 4-bit precision. In Figs. 7(b) and 7(c), we present the loss and accuracy on the validation set during training for the Amazon Fine Food Reviews dataset [Amazon, Fig. 7(b)] and the IMDB Movie Reviews dataset [IMDB, Fig. 7(c)]. The RNN structure used for the two datasets is the same [see Fig. 7(a)]. The Amazon dataset shows some fluctuations in the loss and accuracy curves during training, which largely converge after 50 epochs. In contrast, the IMDB curves vary more smoothly and largely converge after 80 epochs. The Amazon dataset contains about 500,000 reviews. We randomly select 105,162 items to form the test dataset, of which 16,449 are negative and 88,713 are positive. The confusion matrix obtained from the RNN on the Amazon dataset is shown in Fig. 7(d). Positive reviews are identified with higher accuracy than negative reviews due to the relatively large number of positive reviews in the dataset. The experimental accuracy obtained on the SPPU is 92.3%, in good agreement with the computer-calculated accuracy of 92.5%; the 0.2% deviation is mainly due to noise. As for the IMDB dataset, it contains 50,000 reviews in total, 25,000 of which are test samples. Unlike the Amazon dataset, the numbers of positive and negative reviews in the IMDB dataset are balanced. Hence, the identification accuracies for positive and negative reviews are roughly the same, as shown by the confusion matrix in Fig. 7(e). The experimental accuracy obtained by the SPPU and the calculated accuracy obtained by the computer are 83.6% and 84.3%, respectively, with a noise-induced deviation of about 0.7%. Comparing the two datasets, the Amazon dataset yields much better recognition results than the IMDB dataset, which may be because the Amazon dataset has about ten times more samples.
4. CONCLUSIONS
In this work, we have designed an SPPU chip utilizing the fast modulation speed of PMs, the wavelength-selective property of MRRs, and the high detection rates of BPDs, with a computational speed of up to 0.93 TOPS, exceeding previous on-chip solutions [9–14,24]. Although the relatively narrow bandwidth of the MRRs in our prototype chip causes loss and distortion during high-speed signal loading, this can be compensated by using higher-order MRRs combined with embedded MZIs [24]. In that case, the computational speed of the SPPU chip can be further improved after photonic device optimization. Meanwhile, since the SPPU contains eight channels, they can be combined in any way according to the scale of the data to achieve parallel data processing. This allows each on-chip component to be used at maximum efficiency at the current level of hardware fabrication. Moreover, the power consumption of the computing system mainly comes from the silicon photonic chip and peripheral devices (e.g., light sources, FPGA, and TIA). The power consumption of our silicon photonic chip is estimated to be 0.1398 W per channel, with all key devices such as PMs, MRRs, and BPDs included, giving an energy efficiency of about 0.83 TOPS/W. For more details about the power consumption and energy efficiency, see Appendix A. With the addition of peripheral devices, the estimated power consumption is 78.3434 W for the components used in our measurement setup. As a prototype experiment, the vast majority of the power overhead comes from the peripheral discrete devices. In future work, we will optimize the peripheral circuits and photonic devices with more advanced integration and packaging techniques. The expected power consumption would then be approximately 2 W based on similar protocols in Refs. [13,24], and the expected energy efficiency of the SPPU system is about 0.465 TOPS/W.
As a neural network inference unit, the SPPU is able to implement cosine nonlinear activation beyond linear computation. Cosine nonlinear activation has been applied in scenarios such as computer vision and Helmholtz equation solving [46]. Compared with commonly used activation functions (e.g., ReLU, tanh), the infinite differentiability of the cosine function allows neural networks to characterize fine features. In this study, we train three neural networks using the cosine activation function to complete the tasks of handwritten digit recognition, modulation format recognition, and review emotion recognition. To verify the performance of the cosine activation function, we also construct three neural networks with the same structures but different activation functions, with comparative results presented in Table 2. In the two CNNs with different structures, we use the ReLU function, the most common activation function in convolutional layers, as the baseline. We find that the ReLU function slightly outperforms the cosine function on the MNIST dataset, while the cosine function clearly wins on the RML201610a dataset. In the RNN, we adopt the tanh function as the baseline on the same principle, and find that the cosine function has advantages on both datasets. Therefore, the cosine nonlinear activation implemented in the SPPU can replace nonlinear activation functions that are not easily implemented in optical systems (e.g., ReLU, tanh). Of course, the limitation of the cosine function is that it is non-convex with multiple optima, so training neural networks with the cosine function tends to require a smaller learning rate, resulting in relatively slow convergence and increased training overhead.
Table 2. Recognition Accuracy (%) of Neural Networks with Different Activation Functions

Function | MNIST | RML201610a | Amazon | IMDB
Calculated (cos) | 98.3 | 96.8 | 92.5 | 84.3
Experiment (cos) | 97.9 | 92.8 | 92.3 | 83.6
Calculated (ReLU) | 98.6 | 89.6 | – | –
Calculated (tanh) | – | – | 92.4 | 80.3
In summary, we have experimentally demonstrated a fully integrated SPPU chip utilizing high-speed phase modulation to accelerate neural network inference. The SPPU adopts a unique parallel-channel structure, which allows rapid expansion and enhances process compatibility. Three neural networks with different structures are implemented on this SPPU chip, namely, a CNN, a residual CNN, and an RNN. The comprehensive experiments show that the SPPU can realize different computing modes and complete inference for different structures by controlling the working points. Compared with previously reported schemes, the switching of the SPPU's computing mode is done simultaneously with data encoding, which is extremely fast and structurally simple, indicating that the SPPU can be easily extended to other neural network structures without adding time or space overhead. In addition, the signs of the SPPU's weights and data are also handled by data encoding, abandoning the existing scheme of taking the difference between two outputs and thus reducing the time and space overhead while ensuring completeness. On the other hand, it is experimentally shown that the cosine activation implemented in the SPPU works well in multi-task scenarios. This work furnishes a novel photonic computing scheme to serve large-scale neural networks and drive the next generation of artificial intelligence platforms toward practicality and versatility under process constraints.
5. METHODS
All experiments with the self-developed SPPU chips are implemented using commercial optoelectronic components. The light source is an ultra-low-phase-noise NKT laser, capable of generating a single-frequency continuous optical signal at a wavelength of 1550.12 nm with a linewidth of less than 0.1 kHz. For the device characterization and performance experiments, high-speed signals are generated and sampled by an arbitrary waveform generator and a real-time oscilloscope. Since the arbitrary waveform generator and oscilloscope do not have enough channels to support high-speed signal generation and sampling for all channels of the SPPU, we switch to a Xilinx ZCU111 RFSoC FPGA for signal generation and sampling in the three proof-of-concept tasks. A single ZCU111 supports eight 12-bit 4.096 GSa/s ADCs and eight 14-bit 6.554 GSa/s DACs. A 128-channel power supply controlled by the FPGA encodes the weight vectors for the on-chip MRRs and also provides the DC bias voltage for the photodetectors. A Keysight B2962B low-noise power source is used to generate and measure high-precision DC signals.
Acknowledgment
The authors thank Pinghu Intelligent Optoelectronics Research Institute for providing the optoelectronic hybrid packaging.
APPENDIX A: POWER CONSUMPTION AND ENERGY EFFICIENCY OF SPPU CHIP
As mentioned in the main text, the power consumption of the SPPU system comes mainly from the silicon photonic chip and peripheral devices, including modulator drivers, TIA chips, FPGAs, and lasers. On the silicon photonic chip, there are three types of active devices: PMs, MRRs, and BPDs. The power consumption of the on-chip PM comes mainly from the DC bias, which is used to adjust the operating point of the SPPU chip. The MRR uses built-in heater electrodes to drive the thermo-optic phase shifters for weight loading, which is the source of the MRR's power consumption. The BPD requires a reverse voltage to enhance carrier transport and improve the detection responsivity. Considering that the dark current of the BPD at reverse bias is 2.02 nA, we approximate the power consumption of the BPD as zero. Thus, the total power consumption of the silicon photonic chip is 1.1184 W (0.1398 W per channel) when all eight channels are considered.
Additional power consumption comes from peripheral devices. First, according to the user manual, the power consumption of the FPGA for encoding and converting the input data is typically 17 W. Since the signal power generated by the FPGA is very small, modulator drivers are required to amplify the signals; the measured power consumption of the drivers is approximately 38.4 W. The current signal detected by the BPD needs to be amplified and converted into a voltage signal by the TIA, which consumes approximately 0.825 W. Additionally, the laser source for optical signal generation consumes 4 W. As a result, the power consumption required for the entire SPPU system to complete the calculations is about 78.3434 W.
Obviously, the vast majority of the power consumption, 98.6%, comes from peripheral discrete devices. Benefiting from recent advances in integrated lasers, application-specific integrated circuits (ASICs), and analog-digital interface circuits (ADCs and DACs), it will eventually be possible to replace the peripheral discrete devices with fully integrated photonic devices and integrated circuits using hybrid and monolithic integration techniques in future work. The expected power consumption of the SPPU system can then be reduced to 2 W, calculated in line with similar protocols in Refs. [13] and [24].
In this work, the energy efficiency of the SPPU chip is estimated to be 0.83 TOPS/W, and the expected energy efficiency of the entire SPPU system is approximately 0.465 TOPS/W. Moreover, a comparison among several representative integrated photonic computing hardware implementations is summarized in Table 3. We primarily focus on three of the most important metrics in computing, namely, the data loading rate, the weight precision, and the energy efficiency. Thanks to its high data loading rate, the SPPU proposed in this work is more energy efficient than other integrated optical computing architectures. Meanwhile, the SPPU achieves a weight precision of 5.8 bits, compared with the 1-bit precision of the MZI mesh scheme of comparable energy efficiency, revealing that the SPPU is more suitable for real-world applications requiring higher accuracy.
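These efficiency figures are consistent with the speed and power numbers reported above, as the following quick check shows (all values taken from the text):

```python
# Consistency check of the reported energy-efficiency figures.
speed_tops   = 0.93              # 31 ops per symbol x 30 GBaud
chip_power_w = 8 * 0.1398        # eight channels at 0.1398 W each = 1.1184 W
print(speed_tops / chip_power_w)   # ≈ 0.83 TOPS/W (photonic core)
print(speed_tops / 2.0)            # ≈ 0.465 TOPS/W (projected 2 W system)
```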
Table 3. Comparison of Representative Integrated Photonic Computing Hardware
Architectures | Data Loading Rate (Gbaud) | Weight Precision (bits) | Energy Efficiency (TOPS/W)
MZI mesh [8] | 25 | 1 | 0.48
Cascaded MZI [12] | 6.6 | – | –
MRRs [47] | 0.047 | – | –
MRR + PCM [10] | 2 | 5 | 0.4
MRR + VOA [9] | – | – | 0.07
AWG [14] | – | – | 0.11
MRR + WDM [25] | 20 | 4.83 | 0.06
MMI [13] | 16.6 | 5 | 0.21
MRR + delay [24] | 17 | 9 | 0.2
This work | 30 | 5.8 | 0.83/0.465
APPENDIX B: SPPU CALIBRATION AND COMPUTING MODE SWITCHING
In order to obtain accurate linear and nonlinear results, it is important to ensure that the weights and working points (biases) loaded on the SPPU chip are accurate. To begin with, we use add-drop MRRs in the SPPU chip, where the drop port outputs the optical signal after the weights are loaded and the through port is used to monitor the weight loading, making sure the weight matrix is loaded correctly. Specifically, the through port connects to a grating coupler and a photodetector via an MMI. We can couple the optical signal from the through port to off-chip optical devices via the grating coupler, or read the optical signal intensity of the through port directly through the photodetector. In our experiments, we employ the photodetectors to monitor the MRR weight loading. Thereby, the weight calibration of the SPPU can be divided into three steps.
The calibration of the working points is performed next. According to Eq. (2) in the main text, a sinusoidal curve can be obtained from the relationship between the bias voltage and the current intensity at the BPD output. Based on this cosine-shaped curve, we can establish a lookup table between the PM bias voltage and the operating point, in which the bias voltages that make the BPD current maximum, minimum, and equal to zero correspond to the operating points $\varphi_b=0$, $\varphi_b=\pi$, and $\varphi_b=\pm\pi/2$ (the sign depends on the current trend), respectively. It is worth noting that altering the weights via the MRR also triggers a phase change, so a phase correction table also needs to be created to dynamically correct the bias based on the MRR voltage. Thus, we can select the appropriate working point from the lookup table to complete a specific calculation when encoding the input data. Since the switching of working points is done simultaneously with data encoding, the computing-mode switching of the SPPU is delay-free and incurs no extra space consumption compared with other schemes. Considering that the SPPU state changes slowly with the environment, we send out calibration signals at regular intervals and calculate the error as the basis for whether calibration is required. After calibration is completed, the bias voltage corresponding to the operating point is the encoding result of data 0 in the different calculation modes. Here, we choose the vector [1, 0.75, 0.5, 0.25, 0, −0.25, −0.5, −0.75, −1] as the calibration vector. Then, we encode the calibration vector in the different computational modes, keeping the weight equal to one, and record the output of the BPD as the standard output. After an interval, the calibration vector is sent again and the output of the BPD is recorded once more. Finally, we use the mean absolute percentage error (MAPE) to represent the error between the current BPD output and the standard output, whose mathematical expression is given by
$$\mathrm{MAPE}=\frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i-\hat{y}_i}{y_i}\right|,$$
where $y_i$ represents the standard output, $\hat{y}_i$ represents the current BPD output, and $N$ represents the number of inter-code intervals in the standard output. The calculated MAPE value is used as the basis for deciding whether the SPPU needs to be calibrated. When the MAPE value exceeds the specified threshold, we update the lookup table in terms of the operating points and the standard output.
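A minimal sketch of this periodic drift check is given below; the zero calibration code is skipped to avoid division by zero, and the threshold value is an assumption for illustration.

```python
import numpy as np

# Periodic drift check: compare the current BPD response to the calibration
# vector against the stored standard output via MAPE.

CAL_VEC = np.array([1, 0.75, 0.5, 0.25, 0, -0.25, -0.5, -0.75, -1.0])

def mape(standard, current):
    s, c = np.asarray(standard, float), np.asarray(current, float)
    mask = np.abs(s) > 1e-6                  # skip the zero calibration code
    return 100.0 * np.mean(np.abs((s[mask] - c[mask]) / s[mask]))

def needs_recalibration(current, standard=CAL_VEC, threshold=2.0):
    return mape(standard, current) > threshold   # threshold is illustrative

drifted = CAL_VEC + np.random.default_rng(1).normal(0.0, 0.02, CAL_VEC.size)
print(mape(CAL_VEC, drifted), needs_recalibration(drifted))
```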
APPENDIX C: 4-BIT QUANTIZATION OF NEURAL NETWORKS
According to existing reports, the inference precision of optical neural networks is generally no more than 8 bits. This means that neural networks trained on computers with 32-bit or higher precision need to be quantized for deployment to optical hardware. A good quantization strategy can reduce the performance degradation caused by the loss of computational precision. The commonly used quantization strategies are post-training quantization and quantization-aware training [43]. Post-training quantization is usually the first choice because it is simple and efficient to implement and requires no additional training with labeled data; it directly converts a trained 32-bit (or 64-bit) network to 8 bits (or fewer), without the training process taking precision into account. Naturally, most current optical architectures adopt this strategy. However, when aiming for low-bit quantization of optical neural networks, such as 4 bits and below, post-training techniques may not suffice to mitigate the large quantization error. In this case, we resort to quantization-aware training to achieve lower-precision optical neural network inference and to improve the noise and error tolerance of the optical hardware. Quantization-aware training simulates the quantization of neural network inference during the training process, in order to find better solutions than post-training quantization. Based on the TensorFlow Model Optimization extension package, we trained the neural networks with 4-bit precision, even though the precision of the MRR weights in the SPPU is 5.8 bits. With 4-bit coding, the inter-code spacing for the weights (1/16 ≈ 0.0625) reaches roughly 10 times the standard deviation of the weight error (0.006), increasing the noise tolerance of the SPPU.
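For reference, the following is a generic sketch of the fake-quantization step at the heart of quantization-aware training; it is not the authors' exact recipe (which relies on the TensorFlow Model Optimization toolkit), and the straight-through gradient handling is only indicated in a comment.

```python
import numpy as np

# Minimal fake-quantization used in quantization-aware training: the forward
# pass sees 4-bit weights while gradients flow through unchanged
# (straight-through estimator). Generic sketch, framework-agnostic.

def fake_quant(w, bits=4):
    """Snap w in [-1, 1] onto a uniform grid of 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round((np.clip(w, -1, 1) + 1) / 2 * levels) / levels * 2 - 1

w = np.array([0.013, -0.72, 0.5, 0.999])
print(fake_quant(w))   # values snapped to the 4-bit grid
# In a training loop, the backward pass would apply dL/dw_q directly to w
# (straight-through), so the optimizer learns weights that survive quantization.
```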
[38] T. J. O’Shea, N. West. Radio machine learning dataset generation with GNU Radio. GNU Radio Conference (2016).
[39] M. R. Haque, S. A. Lima, S. Z. Mishu. Performance analysis of different neural networks for sentiment analysis on IMDb movie reviews. 3rd International Conference on Electrical, Computer & Telecommunication Engineering, 161-164(2019).
[40] S. Yarkareddy, T. Sasikala, S. Santhanalakshmi. Sentiment analysis of amazon fine food reviews. 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), 1242-1247(2022).
[41] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
[42] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning (MIT Press, 2016).