Chinese Optics Letters, Volume 19, Issue 8, 082501(2021)
Optical tensor core architecture for neural network training based on dual-layer waveguide topology and homodyne detection
We propose an optical tensor core (OTC) architecture for neural network training. The key computational components of the OTC are the arrayed optical dot-product units (DPUs). The homodyne-detection-based DPUs can conduct the essential computational work of neural network training, i.e., matrix-matrix multiplication. Dual-layer waveguide topology is adopted to feed data into these DPUs with ultra-low insertion loss and cross talk. Therefore, the OTC architecture allows a large-scale dot-product array and can be integrated into a photonic chip. The feasibility of the OTC and its effectiveness on neural network training are verified with numerical simulations.
1. Introduction
Deep learning has become a milestone strategy of modern machine learning[1,2].
Recently, optical neural networks (ONNs) have been proposed as an alternative to overcome electronic bottlenecks, such as the clock rate limit and the energy dissipation of data movement[8].
Here, we propose an optical tensor core (OTC) architecture that can be integrated into photonic chips for neural network training. In this architecture, matrix-matrix multiplication is conducted by dot-product units (DPUs) meshed on a two-dimensional (2D) plane. The principle of the DPUs is based on the homodyne detection and electron accumulation (HDEA) process, i.e., multiplications are fulfilled by homodyne detection, and the summation is completed by electron accumulation. The components of the DPUs are optical waveguide devices so that the DPU array can be integrated. Besides, the input data are fed into the DPU array through dual-layer waveguides. Given that waveguide crossings are inevitable if the data-feeding waveguides and the DPU array are deployed on a single 2D plane, the dual-layer waveguide topology mitigates the insertion loss and crosstalk of such crossings. The sub-millidecibel (mdB) insertion loss per crossing[16,18] therefore enables a large-scale DPU array that can be integrated into a photonic chip.
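To make the mapping concrete, the following behavioral sketch (our illustration, with the array size, data ordering, and timing chosen for simplicity rather than taken from the proposed design) shows how an M × N array of DPUs computes a matrix-matrix product when one matrix is streamed along the rows and the other along the columns over S clock cycles.

```python
import numpy as np

def otc_matmul(A, B):
    """Behavioral sketch of an OTC-style dot-product-unit (DPU) array.

    A: (M, S) matrix streamed along the rows of the array.
    B: (S, N) matrix streamed along the columns of the array.
    Each DPU (m, n) accumulates one element of A @ B over S clock cycles,
    mimicking homodyne multiplication followed by electron accumulation.
    """
    M, S = A.shape
    S2, N = B.shape
    assert S == S2, "inner dimensions must match"

    acc = np.zeros((M, N))      # accumulated charge on each DPU capacitor
    for i in range(S):          # one optical pulse pair per clock cycle
        # pulse amplitudes broadcast to row m and column n of the array
        acc += np.outer(A[:, i], B[i, :])
    return acc                  # sampled voltages proportional to A @ B

# quick check against a direct matrix product
A = np.random.randn(3, 8)
B = np.random.randn(8, 3)
assert np.allclose(otc_matmul(A, B), A @ B)
```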
2. Principle
No matter how the neural network structure varies (fully connected, convolutional, or recurrent), the basic mathematical model of neural network training comprises matrix-matrix multiplications and nonlinear activation functions[17].
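To see why this statement covers training and not only inference, the short sketch below (our illustration; the layer shapes and the symbols X, W, and dY are arbitrary) verifies that the forward pass of a fully connected layer and the gradients needed for its update are all matrix-matrix multiplications.

```python
import numpy as np

# Forward and backward passes of one fully connected layer, written
# explicitly as matrix-matrix multiplications.
batch, n_in, n_out = 32, 784, 512
X = np.random.randn(batch, n_in)      # layer input
W = np.random.randn(n_in, n_out)      # layer weights
Y = X @ W                             # forward pass: matrix product
dY = np.random.randn(batch, n_out)    # gradient arriving from the next layer
dW = X.T @ dY                         # weight gradient: matrix product
dX = dY @ W.T                         # input gradient: matrix product
print(Y.shape, dW.shape, dX.shape)    # (32, 512) (784, 512) (32, 784)
```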
Figure 1.(a) Schematic of the OTC. An example scale of 3 × 3 is depicted. ILC, inter-layer coupler; Mod array, modulator array. (b) Detailed schematic of a DPU. A portion of light is dropped from the bus waveguides. PS, phase shifter; DC, directional coupler. (c) Impulse response of the HDEA. The time constant τ of the circuit is defined as the time for the voltage to decay to 1/e. (d) An example of electron accumulation. Optical pulses arrive at the HDEA with an interval of 1/fm. The accumulation time is T.
The principle of the HDEA is described here. Suppose a pair of incident optical pulses have amplitudes of $x_i$ and $w_i$ (the $i$th elements of the input vectors), respectively, and arrive at the 3 dB directional coupler at the same time. Because of optical interference, the upper and lower detectors of the balanced photodetector (BPD) generate current pulses. The subtracted current pulse is accumulated on the capacitor in the form of electrons or charges. When the initial phase difference of the incident optical pulses is 0, the number of accumulated electrons on the anode panel is proportional to the product of the amplitudes of the incident optical pulses, i.e., $x_i w_i$. When the initial phase difference is $\pi$, the electrons on the anode drift away. The phase inversion takes place at the push–pull modulation rather than at the phase shifter. All phase shifters stay static once the calibrations are completed. To calibrate a DPU, one sets all modulators to their maximal transmission and adjusts the phase shifters to reach the maximal output current of every BPD. After multiple ($i$ from 1 to $S$) optical pulse pairs are fed and their electrical pulses are accumulated on the capacitor, the dot product of the vectors, $\sum_{i=1}^{S} x_i w_i$, is completed. The result is acquired by sampling the voltage on the anode panel. Once the voltage is sampled, the trigger switches on to discharge the electrons in preparation for the next dot product. Note that the accumulated electrons are not permanent: if the input vectors are very long, the electrons accumulated at the beginning start to leak spontaneously. Figure 1(c) illustrates the impulse response of the HDEA. The electron leakage time constant is critical to the electron accumulation. As illustrated in Fig. 1(d), if the time constant $\tau$ of the electron leakage is shorter than the accumulation time $T$, the accumulation result is distorted. Numerically, the final sampled voltage can be modeled as a leakage-weighted sum of the pulse contributions.
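A minimal explicit form of this model, assuming each accumulated charge packet decays exponentially with the time constant $\tau$ between its arrival and the sampling instant (an idealization of the circuit behavior, not necessarily the exact expression used in the original analysis), is

$$V_{\mathrm{out}} \propto \sum_{i=1}^{S} x_i w_i \exp\!\left(-\frac{S-i}{f_m \tau}\right),$$

where $f_m$ is the clock rate of the optical pulses. When $\tau \gg S/f_m \approx T$, the exponential factors approach unity and the ideal dot product $\sum_{i=1}^{S} x_i w_i$ is recovered.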
With fixed input vector length $S$ and clock rate $f_m$, a larger time constant $\tau$ leads to better dot-product results. We perform a simulation program with integrated circuit emphasis (SPICE) simulation to evaluate the rise time and leakage time constant of the HDEA. The equivalent circuit model of the photodetector is adopted from Ref. [19]. With an accumulation capacitance of 10 pF, the rise time is 47.7 ps, and the leakage time constant is 109.1 ns. The results indicate that the sampling rate of the electronic acquisition (the trigger) can be as slow as 10 MHz. Given that the clock rate of the optical pulses is at the level of dozens of gigahertz (GHz), the HDEA process easily supports dot-product calculations with vector lengths over 1000. Note that a larger junction resistance is often considered in photodetector models. Together with the progress on high-speed pulsed lasers and modulators, larger input lengths are expected. According to Ref. [15], a large vector length implies that the required optical power of each DPU is low. If the vector length surpasses 1000, a milliwatt-level pulsed laser has the potential to support an OTC with $10^5$ DPUs. Note that the laser efficiency, detector efficiency, modulator efficiency, coupling loss, and waveguide loss are highly relevant to the feasible DPU scale. These degrading factors should be carefully considered in OTC design and fabrication.
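As a back-of-the-envelope check, the sketch below (our simplification: a single lumped RC node rather than the full equivalent circuit of Ref. [19]) estimates the implied leakage resistance and the worst-case attenuation of the earliest pulse in a 1000-element dot product at a 50 GHz clock rate.

```python
import math

# Values reported in the text (assumed simple RC leakage behavior).
C_acc = 10e-12          # accumulation capacitance, 10 pF
tau_leak = 109.1e-9     # leakage time constant, 109.1 ns
f_clk = 50e9            # optical pulse clock rate, 50 GHz
S = 1000                # dot-product vector length

R_leak = tau_leak / C_acc                  # implied leakage resistance of the node
T_acc = S / f_clk                          # accumulation time for S pulses
worst_case = math.exp(-T_acc / tau_leak)   # remaining weight of the first pulse's charge

print(f"implied leakage resistance ~ {R_leak/1e3:.1f} kOhm")   # ~10.9 kOhm
print(f"accumulation time          ~ {T_acc*1e9:.1f} ns")      # 20.0 ns
print(f"earliest-pulse weight      ~ {worst_case:.2f}")        # ~0.83
```

The earliest pulse thus retains roughly 83% of its charge at the sampling instant, which illustrates why $\tau$ must be much longer than the accumulation time.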
3. Results
To validate the effectiveness of the OTC architecture, neural network training is simulated. In the simulation, we adopt two network models [fully connected (FC) and convolutional] to conduct the image classification task on the modified National Institute of Standards and Technology (MNIST) handwritten digits. Figure 2 illustrates the network models in detail. The input images of both the FC network and the convolutional network are from the MNIST dataset. In the four-layer FC network [Fig. 2(a)], the image is flattened to a vector in the first layer and propagates by matrix multiplications through the cascading layers. The numbers of neurons in the four layers are set to 784, 512, 86, and 10, respectively. The second and third layers adopt rectified linear units (ReLU) as the activation function, and the last layer uses the softmax function to yield the one-hot classification vector. As illustrated in Fig. 2(b), the convolutional network comprises two convolutional layers, two max pooling layers, and two FC layers. The kernel size of the convolutional layers (the first and third layers) is set to , and the output channel numbers are 16 and 32, respectively. The activation function used in the convolutional network is the ReLU, except for the last layer, which uses the softmax function.
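For concreteness, a PyTorch sketch of the two models is given below. The convolutional kernel size and the widths of the convolutional network's two FC layers are not stated above, so the 3 × 3 kernels (with padding) and the 128-neuron FC layer used here are placeholder assumptions, not the reported settings.

```python
import torch
import torch.nn as nn

# FC network from Fig. 2(a): layer widths 784-512-86-10, ReLU on the middle
# layers, softmax at the output (usually folded into the cross-entropy loss).
fc_net = nn.Sequential(
    nn.Flatten(),                 # 28 x 28 image -> 784-element vector
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 86), nn.ReLU(),
    nn.Linear(86, 10),            # followed by softmax
)

# Convolutional network from Fig. 2(b): two conv layers (16 and 32 output
# channels), two max-pooling layers, two FC layers. Kernel size and the
# 128-wide FC layer are placeholder assumptions.
conv_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),              # 28 x 28 -> 14 x 14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),              # 14 x 14 -> 7 x 7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),           # followed by softmax
)

x = torch.randn(1, 1, 28, 28)     # one MNIST-sized input
print(fc_net(x).shape, conv_net(x).shape)   # torch.Size([1, 10]) twice
```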
Figure 2.(a) FC network. The matrix multiplications are implemented on OTC. ReLU after each layer is conducted in auxiliary electronics. The output is the one-hot classification vector given by the softmax function. (b) The convolutional network. Convolutions are conducted on OTC. Max pooling layers shrink the image size by half. All layers are ReLU-activated except for the pooling layers and the last layer.
The OTC is simulated to conduct all matrix multiplications of the FC layers and the generalized matrix multiplications (GeMMs) of the convolutional layers. Auxiliary electronics, including analog-to-digital converters (ADCs) and digital processors, are utilized for the nonlinear operations. Specifically, max pooling, image flattening, nonlinear activation functions, and data rearrangement are executed by the auxiliary electronics. Note that the temporal accumulation of the optical pulses lowers the required sampling speed by about 1000 times, so low-speed ADCs and digital processors can be utilized in the neural network training. Detailed discussions of the auxiliary electronics can be found in Ref. [15], where auxiliary electronics are similarly utilized. In the simulation, the optical pulses are assumed to be push–pull modulated with no phase shift. The clock rate (i.e., the repetition rate of the optical pulses) is set to 50 GHz. The accumulation time $T$ depends on the length of the input vectors: for vectors longer than 100, $T$ is set to 25 ns; otherwise, $T$ is set to 2.5 ns to mitigate the impact of electron leakage. The leakage time constant (109.1 ns) obtained in the SPICE simulation is used in the neural network training. We also set the insertion loss of a waveguide crossing to 1 mdB/crossing, while the crosstalk is neglected for its minor influence on the results. We adopt the mini-batch technique during training: the batch size of the FC network training is 50, and the batch size of the convolutional network training is 120. The mini-batch gradient descent (MBGD) algorithm is applied to update the network parameters. Sixty-five epochs are executed in total. The learning rate is 0.02 during the initial 50 epochs and decreases to 0.004 from the 51st to the 65th epoch.
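A minimal sketch of how such non-idealities might be folded into the simulated matrix multiplications is given below. The leakage weighting and the crossing-loss model (the j-th column waveguide is assumed to pass j crossings of 1 mdB each, and crosstalk is neglected) are our simplifications rather than the authors' exact simulation.

```python
import numpy as np

MDB = 1e-3  # 1 mdB expressed in dB

def otc_matmul_nonideal(A, B, f_clk=50e9, tau=109.1e-9, loss_mdb_per_xing=1.0):
    """Matrix product A @ B with simple OTC non-idealities.

    - Electron leakage: the charge of the i-th pulse pair decays by
      exp(-(S - 1 - i) / (f_clk * tau)) before the voltage is sampled.
    - Crossing loss: the j-th column waveguide passes j crossings, each with
      loss_mdb_per_xing (in mdB), attenuating the optical amplitude.
    """
    M, S = A.shape
    _, N = B.shape

    # leakage weight per clock cycle (earliest pulses decay the most)
    leak = np.exp(-(S - 1 - np.arange(S)) / (f_clk * tau))              # shape (S,)

    # amplitude attenuation after j crossings of loss_mdb_per_xing each
    xing_loss = 10 ** (-(np.arange(N) * loss_mdb_per_xing * MDB) / 20)  # shape (N,)

    acc = np.zeros((M, N))
    for i in range(S):
        acc += leak[i] * np.outer(A[:, i], B[i, :] * xing_loss)
    return acc

A = np.random.randn(4, 512)
B = np.random.randn(512, 4)
err = np.abs(otc_matmul_nonideal(A, B) - A @ B).max()
print(f"max deviation from the ideal product: {err:.3f}")
```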
Figure 3 shows the training procedures of the FC network and the convolutional network. As shown in Fig. 3(a), the loss function of the FC network drops with the growth of the training epochs. For reference, we draw the loss function of the standard MBGD algorithm conducted on a 64-bit digital computer. The OTC-trained loss drops along with that of the standard MBGD algorithm, converging to a very small value. The corresponding prediction accuracy of the FC network is illustrated in Fig. 3(b). The training accuracy is calculated via 10,000 randomly picked inferences on the training set of MNIST, and the testing accuracy is calculated via 10,000 inferences on the test set. The initial parameters of the OTC training and the standard training are the same. It can be found that the accuracies of the OTC training increase alongside those of the standard MBGD algorithm. Finally, the training accuracy of the OTC reaches 100%, and the testing accuracy is around 98%, thus verifying the effectiveness of the OTC on FC network training. Figure 3(c) shows the loss function of the convolutional network during training: the OTC-trained loss function almost overlaps with the standard-trained reference. From the prediction accuracy results in Fig. 3(d), we also observe that the training of the convolutional network on the OTC is effective. The training accuracy is around 99%, and the testing accuracy is around 98%. These results validate the feasibility of OTC training for both the FC network and the convolutional network.
Figure 3.(a) Loss functions of the FC network during training. Results of the standard MBGD algorithm (Std. train) and the on-OTC training are illustrated. (b) The prediction accuracy of the FC network during training. The training accuracy and the testing accuracy of the standard MBGD algorithm are depicted without marks. The on-OTC training is depicted with marks. (c) Loss functions of the convolutional network during training. (d) The prediction accuracy of the convolutional network during training.
We visualize the trained parameters in Fig. 4 to study the impact of the OTC on the neural network training. The parameters of the OTC training and the standard training are initialized with the same random seeds so that they converge to similar optima. The standard-trained and the OTC-trained parameters of the fourth layer of the FC network are illustrated in Fig. 4(a); these parameters form the weight matrix connecting the 86-neuron layer to the 10-neuron output layer. It is found that the OTC-trained parameters have small deviations compared with the standard-trained parameters. The absolute deviations (magnified five times) are also depicted. The statistical distributions of the parameters and deviations in the second and third FC layers are shown in Figs. 4(b) and 4(c). We observe that the distribution of the OTC-trained parameters overlaps with that of the standard-trained ones. The deviations are small and concentrate at zero. It is inferred that the OTC has a fairly minor impact on FC network training. Figure 4(d) shows the convolutional kernels of the first convolutional layer, trained by the OTC and the standard MBGD, respectively. The deviations between these two sets of kernels are unnoticeable. In Figs. 4(e) and 4(f), the parameter distributions and deviation distributions of the cascading FC layers are depicted. The deviations also concentrate at zero, implying that the OTC has a minor impact on convolutional network training. It is worth noting that the OTC-trained inference accuracy is the same as the standard-trained one (as shown in Fig. 3). Therefore, the minor impact imposed by the OTC does not cause noticeable deterioration of the effectiveness of neural network training.
Figure 4.Parameter visualization of the trained neural networks. (a) Trained parameters of the fourth layer in the FC network model. The standard-trained parameters are provided for reference, and the normalized deviation is depicted. (b) and (c) Distributions of trained parameters and deviations of the second and third layers of the FC network. The counts are normalized by the maximal counts. (d) Trained kernels of the first convolutional layer in the convolutional network. (e) and (f) Distributions of trained parameters and deviations of the first and second FC layers of the convolutional network. (b), (c), (e), and (f) share the same figure legends.
4. Conclusion
In summary, an OTC architecture is proposed for neural network training. The linear operations of neural network training are conducted by a DPU array, where all optical components are waveguide-based for photonic integration. Owing to the HDEA principle, the OTC architecture adopts high-speed optical components for the linear operations and low-speed electronic devices for the nonlinear operations of neural networks. According to the results of the SPICE circuit simulation, the large electronic leakage time constant (over 100 ns) allows the dot-product calculation of massive vectors (length over 1000) to be conducted by the HDEA. To solve the problems of insertion loss and crosstalk at the data-feeding waveguide crossings, a dual-layer waveguide topology is applied for the data feeding. The ultra-low crossing loss and crosstalk enable a large-scale dot-product array. The 2D planar design of the OTC eliminates the need for a third spatial dimension or lens structures, potentially featuring high compactness and immunity to aberration. Simulation results show that neural network training with the OTC is effective, and the accuracies are equivalent to those of the standard training processes on digital computers. By analyzing the trained parameters, we observe that the OTC training leaves minor deviations in the parameters compared with the standard processes, without any apparent accuracy deterioration. In practice, the optical and electro-optic components, including push–pull modulators, splitters, ILCs, waveguides, and photodetectors, suffer from fabrication deviations. These deviations affect the numerical accuracy of the OTC and may result in performance degradation of the trained neural networks. However, the OTC training is an in-situ training scheme, of which the training results are potentially robust to hardware imperfections, as recently demonstrated in in-memory computing research[20].
[1] Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, 521, 436(2015).
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362, 1140(2018).
[3] J. Shi, F. Zhang, D. Ben, S. Pan. Photonic-assisted single system for microwave frequency and phase noise measurement. Chin. Opt. Lett., 18, 092501(2020).
[4] R. Wang, S. Xu, J. Chen, W. Zou. Ultra-wideband signal acquisition by use of channel-interleaved photonic analog-to-digital converter under the assistance of dilated fully convolutional network. Chin. Opt. Lett., 18, 123901(2020).
[5] S. Xu, X. Zou, B. Ma, J. Chen, L. Yu, W. Zou. Deep-learning-powered photonic analog-to-digital conversion. Light: Sci. Appl., 8, 66(2019).
[6] L. Yu, W. Zou, X. Li, J. Chen. An X- and Ku-band multifunctional radar receiver based on photonic parametric sampling. Chin. Opt. Lett., 18, 042501(2020).
[8] M. Horowitz. Computing’s energy problem (and what we can do about it). IEEE International Solid-state Circuits Conference, 10(2014).
[9] M. A. Nahmias, T. F. Lima, A. N. Tait, H. Peng, B. J. Shastri, P. R. Prucnal. Photonic multiply-accumulate operations for neural networks. IEEE J. Sel. Top. Quantum Electron., 26, 7701518(2020).
[10] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, M. Soljačić. Deep learning with coherent nanophotonic circuits. Nat. Photon., 11, 441(2017).
[11] S. Xu, J. Wang, R. Wang, J. Chen, W. Zou. High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays. Opt. Express, 27, 19778(2019).
[12] V. Bangari, B. A. Marquez, H. Miller, A. N. Tait, M. A. Nahmias, T. Lima, H. Peng, P. R. Prucnal, B. J. Shastri. Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNs). IEEE J. Sel. Top. Quantum Electron., 26, 7701213(2020).
[13] Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y. Chen, P. Chen, G. Jo, J. Liu, S. Du. All-optical neural network with nonlinear activation functions. Optica, 6, 1132(2019).
[14] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, A. Ozcan. All-optical machine learning using diffractive deep neural networks. Science, 361, 1004(2018).
[15] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, D. Englund. Large-scale optical neural networks based on photoelectric multiplication. Phys. Rev. X, 9, 021032(2019).
[16] J. Chiles, S. Buckley, N. Nader, S. Nam, R. P. Mirin, J. M. Shainline. Multi-planar amorphous silicon photonics with compact interplanar couplers, cross talk mitigation, and low crossing loss. APL Photon., 2, 116101(2017).
[17] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer. cuDNN: efficient primitives for deep learning(2014).
[18] J. Chiles, S. M. Buckley, S. Nam, R. P. Mirin, J. M. Shainline. Design, fabrication, and metrology of 10 × 100 multi-planar integrated photonic routing manifolds for neural networks. APL Photon., 3, 106101(2018).
[19] J. Lee, S. Cho, W. Choi. An equivalent circuit model for a Ge waveguide photodetector on Si. IEEE Photon. Technol. Lett., 28, 2435(2016).
[20] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, H. Qian. Fully hardware-implemented memristor convolutional neural network. Nature, 577, 641(2020).
Shaofu Xu, Weiwen Zou, "Optical tensor core architecture for neural network training based on dual-layer waveguide topology and homodyne detection," Chin. Opt. Lett. 19, 082501 (2021)
Category: Optoelectronics
Received: Sep. 29, 2020
Accepted: Dec. 21, 2020
Published Online: Apr. 20, 2021
The Author Email: Weiwen Zou (wzou@sjtu.edu.cn)