Silicon photonics convolution accelerator based on coherent chips with sub-1 pJ/MAC power consumption

Ying Zhu; Lu Xu; Xin Hua; Kailai Liu; Yifan Liu; Ming Luo; Jia Liu; Ziyue Dang; Ye Liu; Min Liu; Hongguang Zhang; Daigao Chen; Lei Wang; Xi Xiao; Shaohua Yu

doi:10.1364/PRJ.536939

1. INTRODUCTION

Artificial intelligence (AI) has made significant advances in recent years in domains such as computer vision [1,2], natural language processing [3,4], speech recognition [5], and healthcare [6]. However, the slowdown of Moore’s law and the energy implications of AI have become the elephant in the room [7]. Consequently, the need for high-speed and energy-efficient AI hardware and systems has become increasingly urgent. Extensive research has been conducted, including exploring emerging computing materials and devices (e.g., phase change materials (PCMs) [8], memristors [9], and ferroelectric field-effect transistors [10,11]), architectures (e.g., near-memory [12] and in-memory computing [13,14]), computing mechanisms (e.g., TrueNorth [15] and BrainScaleS [16]), and software–hardware co-design [17–20]. Among them, optical computing hardware is a promising platform.

Optical hardware offers multiple merits over its electronic counterparts, including low power consumption, high speed, large bandwidth, and massive parallelism [21,22]. It can perform linear transformations at light speed and detect data at rates exceeding 100 GHz while consuming minimal power [23–25]. Its inherent optical nonlinearities enable direct processing of nonlinear operations in optical neural networks (ONNs), providing diverse options for neural network (NN) algorithm design [26]. Furthermore, ONNs operating in the analog domain consume less energy and time because data need not be shuttled back and forth between the arithmetic logic unit (ALU) and memories, thereby circumventing the von Neumann bottleneck [27]. In addition, the abundant bandwidth resources in the optical domain enable simultaneous signal processing and calculations at multiple wavelengths [28,29]. Moreover, the maturing silicon photonic process, which is compatible with complementary metal oxide semiconductor (CMOS) technology, holds promise for achieving large-scale integration of optical and electronic materials and devices, enabling an ultra-high computation density [30,31].

To date, many optoelectronic and optical NNs have emerged, among which optical reservoir NNs and optical feedforward NNs are the most closely followed. Reservoir NNs utilize the dynamic behaviors and memory effects of a physical system to process sequential and temporal data [32]. Several optical reservoir NNs have been developed [33–36]. In contrast to reservoir NNs, optical feedforward NNs primarily accelerate NN multiply-accumulate (MAC) operations as light propagates through the optical system. These systems generally fall into three categories according to their working principles [22]. The first category is the cascaded Mach–Zehnder interferometer (MZI) grid. The transform matrix of the grid is programmed by tuning its phase shifters, so matrix–vector multiplications are completed as data-carrying light propagates through it. An MZI-based ONN can complete arbitrary matrix–vector multiplications within 0.01 ns [37]. The second category utilizes wavelength-division multiplexing (WDM) to realize MAC operations with incoherent light. A photonic accelerator can achieve 11 tera-operations per second (TOPS) by utilizing fiber chromatic dispersion to accumulate optical intensities at different wavelengths [38]. A PCM crossbar-based integrated photonic tensor core can also achieve TOPS speeds [39]. A silicon photonic–electronic NN can solve fiber nonlinearity compensation over a distance of 10,080 km, comparable to a software-based NN processed in a 32-bit graphic processing unit (GPU) [26]. An optoelectronic neuromorphic accelerator with only one dual-polarization IQ modulator and one balanced photodetector (BPD) can achieve over 500 giga-operations per second (GOPS) [40]. A SiPh neuromorphic accelerator (SiPh-NA) based on integrated coherent transmit–receive optical sub-assemblies (IC-TROSAs) also achieves over 1 TOPS and has the potential to reach several peta-OPS (POPS) with optical frequency combs (OFCs) [41]. The third category completes the computing with spatial diffraction and can achieve over POPS. While early-stage works utilize unconfigurable diffraction modules during data processing [42,43], a recent work leveraging heaters on the diffraction cores realizes in situ training and programming for diffractive photonic computing [44]. Furthermore, a novel large-scale photonic chiplet exploring a distributed diffractive-interference hybrid photonic computing architecture is reported to achieve 160 TOPS/W [45]; to the best of our knowledge, it is one of the largest-scale hybrid photonic chiplets, and the models it can process are among the most complex that a photonic accelerator can handle. However, these photonic NNs encounter many challenges in integration, scalability, flexibility, and reconfiguration. Additionally, the existing photonic devices, primarily designed for communication and interconnections, do not meet computing requirements such as quantization resolution and linearity range, which impacts the computing performance of the optical hardware. Therefore, clear guidelines for the design and optimization of computing devices and circuits are important.

We demonstrate a novel silicon photonics convolution accelerator (SiPh-CA) based on Mach–Zehnder modulators (MZMs) and simplified coherent receivers with a broadcasting parallelism structure to accelerate convolution, one of the most frequently processed operations in NNs. The to-be-convolved data and the convolution kernels are encoded as the amplitudes of electronic signals at multiple orthogonal frequencies, which modulate carriers split from one laser source. The modulated to-be-convolved signals are broadcast into different channels, each of which comprises signals from the data and one kernel. They propagate through simplified coherent receivers to complete MAC operations. The utilized devices are commercially available, silicon photonic integrated, and CMOS compatible. By adjusting amplitude values and orthogonal frequencies in the electronic domain, the accelerator provides flexibility for optimal computing precision, speed, and programmability for convolutions, matrix–vector multiplications, and neural networks. Experimentally evaluated on sub-integrated coherent transmit–receive optical sub-assemblies (sub-IC-TROSAs) for convolutions and handwritten digit recognition, the SiPh-CA achieves 1.024 TOPS/channel and can in principle achieve tens of TOPS with broadcasting parallelism. It also reaches accuracies of 96.22%, 95.33%, and 93.00% as a one-layer fully connected NN, a one-layer convolutional NN, and a convolution layer of a multi-layer forward NN, respectively, competitive with GPU-run NNs. Compared to the IC-TROSA-based SiPh-NA [41], the proposed sub-IC-TROSA-based SiPh-CA occupies 1/3 less footprint and consumes 37.01% less power for the same performance, and it avoids the inter-channel crosstalk, intra-channel noise, and added complexity of processing the ultra-broad spectrum that come with WDM technology. In addition, we discuss factors limiting the accelerator’s speed, precision, and power efficiency, proposing solutions and optimization requirements for improved performance.

2. PRINCIPLE OF THE ARCHITECTURE

The inspiration for the SiPh-CA design is the convolution theorem: convolution in one domain (e.g., the frequency domain) equals point-wise multiplication in the other domain (e.g., the time domain) [46]. Element-wise multiplication can be realized through coherent modulation and detection. For example, two electronic signals $S = A_{1} \cos 2 \pi (f_{A} + f_{0}) t$ and $L = B_{1} \cos 2 \pi (f_{B} + f_{0}) t$ simultaneously modulate two carriers split from one single-wavelength source via two MZMs, where $A_{1}$ and $B_{1}$ are the amplitudes of the two signals, $f_{A}$ and $f_{B}$ are the signal start frequencies, $f_{0}$ is the frequency interval, and $t$ is the time. Afterward, the two optical signals enter the two input ports of a 180° hybrid connected to a BPD. The BPD outputs a current $I$ with an amplitude proportional to $A_{1} B_{1}$ at frequency $f = | f_{B} - f_{A} |$. When another two electronic signals, $A_{2} \cos 2 \pi (f_{A} + 2 f_{0}) t$ and $B_{2} \cos 2 \pi (f_{B} + 2 f_{0}) t$, modulate the two carriers together with the previous signals, the amplitudes of the BPD output $I$ spectrum at $f = | f_{B} - f_{A} - f_{0} |$, $| f_{B} - f_{A} |$, and $| f_{B} - f_{A} + f_{0} |$ are proportional to $| A_{2} B_{1} |$, $| A_{1} B_{1} + A_{2} B_{2} |$, and $| A_{1} B_{2} |$, respectively. Generally, the electronic signals

$$S_{e} = \sum_{i = 1}^{N} A_{i} \cos 2 \pi (f_{A} + i f_{0}) t, \quad L_{e} = \sum_{j = 1}^{M} B_{j} \cos 2 \pi (f_{B} + j f_{0}) t \tag{1}$$

modulate the carriers via MZMs, yielding the optical signals

$$S_{o} = \sum_{i = 1}^{N} A_{i} \cos 2 \pi (f_{A} + i f_{0}) t \cos 2 \pi f_{c} t, \quad L_{o} = \sum_{j = 1}^{M} B_{j} \cos 2 \pi (f_{B} + j f_{0}) t \cos 2 \pi f_{c} t, \tag{2}$$

where $f_{c}$ is the carrier frequency. After the coherent detection with a 180° hybrid and a BPD, the value of the $I$ amplitude spectrum at $f_{k} = | f_{B} - f_{A} + k f_{0} |$, where $k = j - i$, is

$$A (f_{k}) = \alpha \left| \sum_{i = \max (1, 1 - k)}^{\min (N, M - k)} A_{i} B_{i + k} \right|, \tag{3}$$

where $\alpha$ is a proportional parameter determined by the system status (e.g., the photodetector responsivity and driver voltages). Importantly, $\forall i, j$, the difference frequencies $| f_{B} - f_{A} + (j - i) f_{0} |$ and the sum frequencies $| f_{B} + f_{A} + (i + j) f_{0} |$ should be different, which can be easily satisfied by setting $f_{A}$, $f_{B}$, and $f_{0}$ appropriately (e.g., $f_{B} \geq f_{A} + N f_{0}$ and $f_{A} \geq \frac{1}{2} M f_{0}$). Otherwise, if a frequency $f = | f_{B} - f_{A} + (j_{1} - i_{1}) f_{0} | = | f_{B} - f_{A} + (j_{2} - i_{2}) f_{0} | = | f_{A} + f_{B} + (i_{3} + j_{3}) f_{0} |$ exists, the amplitude at this frequency is $\alpha | \sum_{i} A_{i} B_{i + j_{1} - i_{1}} + \sum_{i} A_{i} B_{i + j_{2} - i_{2}} + \sum_{i} A_{i} B_{(j_{3} - i_{3}) - i + 2 f_{A} / f_{0}} |$, which is meaningless for convolutions.
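As a numerical illustration of Eqs. (1)–(3), the following is a minimal Python sketch, not the authors' implementation. It assumes an idealized, noise-free system in which the balanced photocurrent is simply proportional to the product of the two electronic waveforms (zero path difference, perfect carrier suppression), so that the proportional parameter becomes $\alpha = 1/2$. The function name `photonic_correlation`, the sampling rate `fs`, and the default frequencies (those of the convolution experiment in Section 3.A for $n = 4$) are illustrative choices. Note that the index pattern $A_{i} B_{i + k}$ in Eq. (3) is that of a cross-correlation, which equals a convolution with one vector reversed, so the sketch checks the result against `numpy.correlate`.

```python
import numpy as np

def photonic_correlation(A, B, f_a=5.0, f_b=18.0, f0=1.0, fs=128.0):
    """Idealized model of Eqs. (1)-(3); frequencies in GHz, time in ns.

    Returns the lags k = j - i and the amplitudes read from the spectrum
    of the photocurrent at f_k = |f_b - f_a + k * f0|.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    N, M = len(A), len(B)

    # One symbol of duration 1/f0, so every tone spans an integer number of periods.
    n_samples = int(round(fs / f0))
    t = np.arange(n_samples) / fs

    # Eq. (1): amplitudes encoded on orthogonal tones (indices start at 1).
    i = np.arange(1, N + 1)[:, None]
    j = np.arange(1, M + 1)[:, None]
    s_e = (A[:, None] * np.cos(2 * np.pi * (f_a + i * f0) * t)).sum(axis=0)
    l_e = (B[:, None] * np.cos(2 * np.pi * (f_b + j * f0) * t)).sum(axis=0)

    # Assumption: the balanced photocurrent is proportional to the product
    # of the two baseband waveforms (ideal coherent detection).
    current = s_e * l_e

    # Single-sided amplitude spectrum; bin m corresponds to frequency m * f0.
    spectrum = 2 * np.abs(np.fft.rfft(current)) / n_samples

    k = np.arange(-(N - 1), M)
    bins = np.rint(np.abs(f_b - f_a + k * f0) / f0).astype(int)
    return k, spectrum[bins]


k, amplitudes = photonic_correlation([1, 1, 1, 1], [1, 1, 1, 1])
reference = 0.5 * np.abs(np.correlate([1, 1, 1, 1], [1, 1, 1, 1], mode="full"))
print(k)
print(amplitudes)
print(np.allclose(amplitudes, reference))
```

With all-ones inputs, the extracted amplitudes follow the triangular profile expected for $1_{n} \otimes 1_{n}$, scaled by the assumed $\alpha = 1/2$.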

Within the process mentioned above, $2 M N$ operations are completed. When the symbol rate is $R$, the computing speed is

$$P = 2 R \cdot M N \ \mathrm{OPS}, \tag{4}$$

where the maximum value of $R$ is $R_{\max} = f_{0}$, which guarantees that every orthogonal frequency component spans an integer number of periods within each symbol. Note that the amplitude extraction should be implemented in the analog domain so that the computing speed is maintained.

Furthermore, since one image is usually convolved with multiple kernels in convolutional NNs (CNNs), a broadcasting parallelism structure is implemented to realize parallel computing and reduce the processing time for one image. The image modulation signal is equally divided by power splitters and convolved with the kernel signal in every channel, as in Fig. 1. When there are $K$ channels, the parallel computing speed is

$$P_{\mathrm{parallel}} = 2 R \cdot M N K \ \mathrm{OPS}. \tag{5}$$

More importantly, in the broadcasting parallelism structure, the power consumption is $K + 1$ times the modulation power, whereas it is $2 K$ times the modulation power if the image is modulated and convolved with the $K$ kernels individually. Therefore, the modulation power falls to $(K + 1) / (2 K)$ of that of individual modulation, a saving of $(K - 1) / (2 K)$ that approaches 50% for large $K$.
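The speed and power expressions above can be evaluated with a short Python sketch. The function names are illustrative and not from the paper; the power ratio counts only modulation power, as in the text, and the example values ($R = f_{0} = 31.25$ MHz, $N = M = 128$) are those of the $n = 128$ row of Table 1.

```python
def parallel_speed_ops(symbol_rate_hz, n, m, channels=1):
    """Eq. (5): operations per second for K broadcast channels (K = 1 gives Eq. (4))."""
    return 2 * symbol_rate_hz * n * m * channels


def broadcast_power_ratio(channels):
    """Modulation power with broadcasting, relative to modulating the image K times."""
    return (channels + 1) / (2 * channels)


# One channel at R = f0 = 31.25 MHz with N = M = 128.
print(parallel_speed_ops(31.25e6, 128, 128) / 1e12, "TOPS")

# The ratio tends toward 1/2 as the number of channels grows.
for K in (1, 4, 16):
    print(K, broadcast_power_ratio(K))
```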

3. EXPERIMENT AND RESULTS

A complete-structure IC-TROSA consists of a silicon photonic integrated circuit (PIC), a driver chip, and two trans-impedance amplifier (TIA) chips [47]. Only some parts of the complete-structure IC-TROSA, i.e., the sub-IC-TROSA, are necessary to construct the SiPh-CA. They are two MZMs for image and kernel data modulation and a one-channel coherent detection structure, as presented in Fig. 2(a). In the chip, two input ports and one output port exist. Input port 1, ${in}_{1}$ , can be used either as the port for the source laser to the integrated coherent transmitter (ICT) part or for the local light receive port to the integrated coherent receiver (ICR); input port 2, ${in}_{2}$ , is for the signal receiving. The output port ${out}_{1}$ is for the two modulated signals from the ICT. In the ICT part, the source laser is first equally distributed to two MZMs (four MZMs actually in the IC-TROSA, but two of them are switched off for the sub-IC-TROSA), via which the data are modulated into the coherent carriers. The two signals are then polarization multiplexed through a polarization combining device, i.e., the Pol.MUX, and output through the output port ${out}_{1}$ . In the ICR, the signal and local lights are incident into a 90° hybrid after a polarization demultiplexer (Pol.deMUX) and a splitter, respectively, and detected by a BPD.

The PIC was fabricated on an 8-in. silicon-on-insulator (SOI) wafer using a 180 nm lithography process. The footprint of the sub-IC-TROSA PIC can be as small as $2.9 mm \times 6.9 mm$ when all significant device blocks are integrated into one chip with proper layout and routing since the complete-structure IC-TROSA PIC occupies $5.8 mm \times 6.9 mm$ , which includes four MZMs, four-channel coherent detection blocks, and other blocks, as shown in Fig. 2(b). The component utilizes a ball grid array (BGA) packaging form, allowing it to integrate with microprocessors using microelectronic-compatible packaging techniques. The complete-structure IC-TROSA is as shown in Fig. 2(c); thereby the sub-IC-TROSA requires 1/2 fewer devices, providing the advantages of low power consumption and a small footprint. The 3 dB EO and OE bandwidths of the MZM and PDs are around 45 GHz with an approximately flat frequency response range from 0 to 40 GHz [41], which supports the SiPh-CA to process signals up to 40 GHz. A higher computing speed is possible since transmitters and receivers with higher bandwidths have been realized [48,49].

Based on an ICT and an ICR from two individual sub-IC-TROSAs, we built a SiPh-CA prototype with one computing channel, as shown in Fig. 3 (see Appendix A for system constructions). An arbitrary waveform generator (AWG) generates the waveforms of $S$ and $L$ , as in Eq. (1). These waveforms then modulate a laser light via two individual MZMs with drivers in the ICT from the sub-IC-TROSA 1. The modulated $S$ and $L$ signals as in Eq. (2) are combined by the Pol.MUX and output in $X$ and $Y$ polarizations simultaneously. The two signals are amplified by an erbium-doped fiber amplifier (EDFA) to compensate for the coupling power loss and insertion loss from the Pol.MUX and Pol.deMUX, which can be removed in future work. Afterward, $S$ and $L$ are separated by a polarization beam splitter (PBS) and eventually received by the input ports of the ICR with TIA in the sub-IC-TROSA 2. Note that while the ICR uses a 90° hybrid, a 180° hybrid is enough to support the function, whose structure and footprint are smaller than the former one. The TIA output current, which includes the waveforms with corresponding amplitudes and frequencies, is sampled by a digital storage oscilloscope (DSO). To identify the starting point of meaningful data from the sampled currents, two preamble patterns are added before the modulation signals $S_{e}$ and $L_{e}$ , respectively. The first preamble pattern is a positive high-voltage signal occupying the symbol period. In contrast, the second preamble pattern is a signal with a positive high-voltage level for the first half of the symbol period and a negative high-voltage level for the second half. In the post-processing stage, we analyze the output samples and employ a counter: when the sampled data exhibits a pattern of high voltages sustained for half the symbol period followed by low voltages for the remaining half, it signifies the start of the meaningful data.
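The counter-based frame detection described above might be sketched as follows. This is an assumption-laden illustration rather than the authors' post-processing code: the function name `find_data_start`, the sliding-window search, and the fixed `threshold` are hypothetical, and the synthetic trace is noise-free.

```python
import numpy as np

def find_data_start(samples, samples_per_symbol, threshold):
    """Return the index of the first sample after the half-high/half-low preamble.

    Returns None if the pattern is not found.
    """
    samples = np.asarray(samples, dtype=float)
    half = samples_per_symbol // 2
    for start in range(len(samples) - samples_per_symbol + 1):
        first = samples[start:start + half]
        second = samples[start + half:start + samples_per_symbol]
        # High for half a symbol, then low for the remaining half.
        if np.all(first > threshold) and np.all(second < -threshold):
            return start + samples_per_symbol
    return None


sps = 8  # samples per symbol in this toy trace
trace = np.concatenate([
    np.zeros(7),                  # idle samples before the frame
    np.ones(sps),                 # first preamble: high for a full symbol
    np.ones(sps // 2),            # second preamble: high for the first half...
    -np.ones(sps // 2),           # ...then low for the second half
    0.3 * np.sin(np.arange(32)),  # placeholder for the meaningful data
])
print(find_data_start(trace, sps, threshold=0.5))
```

In a real capture the threshold would have to be chosen relative to the noise level, and a tolerance on the number of consecutive samples may be needed.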

Additionally, the prototype system uses several fibers to connect the chips and devices. Their limited length accuracy results in a path-length difference between $S$ and $L$ and introduces phase deviations into Eq. (2). The concern also applies to fully integrated chips. To solve the problem, we use the path-difference calibration method of Ref. [41]. By setting up a digital twin system in software and minimizing the output difference between the experimental hardware system and the digital twin, we obtain a path difference of 0.0163 m in the prototype and pre-compensate it in the modulation waveforms. The calibration is required only once, when the system is established, as long as the path difference remains unchanged. Furthermore, the EDFA, PC, and PBS between the ICT and ICR can be removed when the necessary devices are connected with waveguides and integrated into one chip, as shown in Fig. 1. In future work, we will design analog circuits to replace the AWG in generating mixed orthogonal frequency signals with data vectors as input. A photonic-assisted channelization scheme based on coherent dual optical frequency combs will replace the DSO to directly extract the amplitudes at the corresponding frequencies in the analog domain without reducing the data throughput or the computing speed [50–52]; in our prototype, this extraction is performed by a fast Fourier transform.

Figure 1. SiPh-CA conceptual diagram.

Figure 2. SiPh IC-TROSA. (a) The sub-silicon photonic coherent transceiver structure, in which only the devices labeled with dark colors function. (b) The microscope image of a silicon photonic coherent transceiver chip. (c) The highly compact SiPh IC-TROSA integrates a silicon coherent transceiver chip, a driver chip, and two TIA chips inside.

Figure 3. Experimental diagram. AWG, arbitrary waveform generator; ICT, integrated coherent transmitter; EDFA, erbium-doped fiber amplifier; PC, polarization controller; PBS, polarization beam splitter; ICR, integrated coherent receiver; TIA, trans-impedance amplifier; DSO, digital storage oscilloscope; $\tau$, symbol time width. The ICT and ICR are the photonic integrated chips, which are co-packaged with drivers and TIAs on the printed circuit boards (green solid boxes). After removing the polarization multiplexer from the ICT, the function of the components in the gray dashed box can be realized in only one photonic integrated chip without the EDFA, PC, and PBS.

To showcase the SiPh-CA functions, we conducted two experiments. The first experiment utilizes the SiPh-CA to perform convolutions between different length vectors. The second experiment employs it to process four NNs for image recognition.

A. SiPh-CA for Convolutions

The prototype parameters are as follows: the start frequencies of $A$ and $B$ are $f_{A} = 5\,\mathrm{GHz}$ and $f_{B} = 18\,\mathrm{GHz}$, respectively, and the occupied bandwidths satisfy $N f_{0} \leq 4\,\mathrm{GHz}$ and $M f_{0} \leq 4\,\mathrm{GHz}$ to avoid overlap among the information frequency spectra caused by the carrier-suppressed double-sideband (CS-DSB) modulation and the incompletely suppressed carrier. In addition, they should not exceed the available analog bandwidths of the AWG, modulators, and BPDs. The frequency interval and the symbol rate are set to their maximum values, i.e., $f_{0} = 4 / n\ \mathrm{GHz}$ and $R = f_{0}$, where $N = M = n$ without loss of generality, to achieve a high computing speed. The convolution speed is $2 f_{0} n^{2} = 2 \cdot (4 / n) n^{2} = 8 n$ GOPS, so the larger the vector dimension $n$ is, the higher the convolution speed. However, as $n$ increases, the optical signal-to-noise ratio (OSNR) of each data element drops, since the total signal power of all data is bounded by the maximum optical input power of the ICT and ICR. The deteriorated OSNR reduces the SiPh-CA computing precision. Therefore, to demonstrate the performance upper limits of the SiPh-CA for convolutions, we test convolutions between $1_{n}$ and $1_{n}$ and convolutions between $R_{n}$ and $R_{n}$ under different $n$, where $1_{n} = [1, 1, \dots, 1] \in \mathbb{R}^{n}$ is an $n$-dimensional all-ones vector and $R_{n} \in \mathbb{R}^{n}$ is an $n$-dimensional randomly generated nonnegative real-valued vector.
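The frequency plan above can be checked against the collision condition of Section 2 with a small Python sketch. The function name `plan_is_valid` is hypothetical, and frequencies are passed as integers in a common unit (kHz here) to avoid floating-point comparisons. The check covers only the difference-frequency and sum-frequency products named in Section 2; it does not model the residual carrier or hardware bandwidth limits.

```python
def plan_is_valid(f_a, f_b, f0, n, m):
    """Check a frequency plan; all frequencies are integers in a common unit (e.g., kHz).

    Valid means (1) every lag k = j - i maps to its own difference frequency and
    (2) no difference frequency coincides with a sum frequency.
    """
    lags = range(-(n - 1), m)                       # k = j - i for i in 1..n, j in 1..m
    diff = [abs(f_b - f_a + k * f0) for k in lags]
    summ = {f_a + f_b + s * f0 for s in range(2, n + m + 1)}   # s = i + j
    no_folding = len(set(diff)) == len(diff)
    no_overlap = summ.isdisjoint(diff)
    return no_folding and no_overlap


# Prototype settings for n = 128: f_A = 5 GHz, f_B = 18 GHz, f_0 = 31.25 MHz.
print(plan_is_valid(5_000_000, 18_000_000, 31_250, 128, 128))

# Counter-example: equal start frequencies make lags +k and -k share one frequency.
print(plan_is_valid(5_000_000, 5_000_000, 31_250, 128, 128))
```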

For the convolution $1_{n} \otimes 1_{n}$, whose theoretical output is $[1, 2, \dots, n, \dots, 2, 1]$, the similarity between the experimental results and the theoretical outputs is assessed by the correlation coefficient, as shown in Table 1 (Corrcoef. $1_{n}$). The results are also presented visually in Fig. 4. As observed, the SiPh-CA provides a convolution speed of 640 GOPS if the lower limit of the correlation coefficient is set to 0.9700, as in our previous work on the SiPh-NA [41] (1.024 TOPS at $Corrcoef. = 0.9700$). The SiPh-CA cannot reach the same correlation coefficient at that speed because its optical signals are amplitude-modulated via one MZM, whereas the signals in the SiPh-NA are IQ-modulated via two MZMs with a higher OSNR. Nonetheless, for $n = 128$, where the convolution speed is 1.024 TOPS, the SiPh-CA still outputs a spectrum similar to the theoretical one.

Table 1. Performance of the SiPh-CA for Convolutions

$n$	$f_{0}$ (GHz)	Corrcoef. ( $1_{n}$ )	Avg. Corrcoef. ( $R_{n}$ )	Avg. std.E ( $R_{n}$ )	Eff.B (bit)	Speed (GOPS)
4	1	0.9989	0.9856	0.0487	4.36	32
8	0.5	0.9980	0.9858	0.0520	4.27	64
16	0.25	0.9930	0.9774	0.0613	4.03	128
20	0.2	0.9928	0.9775	0.0612	4.03	160
32	0.125	0.9871	0.9700	0.0691	3.85	256
40	0.1	0.9816	0.9660	0.0732	3.77	320
64	0.0625	0.9784	0.9497	0.0887	3.50	512
80	0.05	0.9738	0.9373	0.0997	3.33	640
100	0.04	0.9698	0.9275	0.1070	3.22	800
128	0.03125	0.9587	0.9095	0.1181	3.08	1024

Figure 4. Output normalized amplitude spectrum of the SiPh-CA processing $1_{n} \otimes 1_{n}$.

In the tests for $R_{n} \otimes R_{n}$, 100 trials are conducted for each value of $n$. The SiPh-CA performance is presented in Table 1 as averaged correlation coefficients [Avg. Corrcoef.( $R_{n}$ )], averaged error standard deviations [Avg. std.E( $R_{n}$ )], and effective bit precision (Eff.B) [53]. In Fig. 5, the blue dots represent the experimental outputs, while the ideal outputs lie on the gray line $y = x$. Although the dots spread over a gradually wider range as the vector dimension $n$ increases, they remain close to the gray line, with an effective bit precision of about 3 bits even in the $n = 128$, $P = 1.024$ TOPS case. The experimental results confirm that the SiPh-CA can complete nonnegative real-valued vector convolutions.
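The three metrics in Table 1 can be computed along the following lines. The paper takes its definition of effective bit precision from Ref. [53] without restating it; the relation used here, Eff.B $= -\log_{2}(\text{std.E})$, is an inference from the tabulated values (it reproduces the Eff.B column from the std.E column) and should be treated as an assumption. The normalization of the error by the maximum ideal output is likewise an assumption, and the function names are illustrative.

```python
import numpy as np

def effective_bits(error_std):
    """Effective bit precision inferred from Table 1: -log2 of the error standard deviation."""
    return -np.log2(error_std)


def convolution_metrics(measured, ideal):
    """Correlation coefficient, error standard deviation, and effective bits.

    Assumption: the error is scaled by the maximum ideal value, so the
    ideal outputs lie in [0, 1] before the error is computed.
    """
    measured = np.asarray(measured, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    scale = ideal.max()
    error = (measured - ideal) / scale
    corrcoef = np.corrcoef(measured, ideal)[0, 1]
    error_std = error.std()
    return corrcoef, error_std, effective_bits(error_std)


print(round(effective_bits(0.0487), 2))  # std.E from the n = 4 row of Table 1
print(round(effective_bits(0.1181), 2))  # std.E from the n = 128 row of Table 1
```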

Figure 5. One trial of the measured results of the SiPh-CA processing $R_{n} \otimes R_{n}$.

In addition, according to the working principle expressed in Eqs. (2) and (3), the SiPh-CA can in principle process convolutions between two real-valued vectors or two complex-valued vectors when the output phases are considered. The SiPh-CA has already performed convolutions between two real-valued vectors, including those with negative values, although the outputs are absolute values because of the amplitude extraction; this will be demonstrated in the following NN experiments. In future work, we will use the SiPh-CA to realize real-valued and complex-valued convolutions with output phases included.

B. SiPh-CA for NNs

To demonstrate the effectiveness of the SiPh-CA for neural networks, we test four NNs to process handwritten digits with $8 \times 8$ pixels from the sklearn library [54]. They are two one-layer fully connected NNs, ${FC}_{10 \times 32}$ and ${FC}_{10 \times 64}$ ; a one-layer CNN, ${Conv}_{1 \times 424}$ ; and a forward CNN with a convolution layer and a fully connected layer, ${CF}_{3,16}$ . Although no general convolutions exist in fully connected NNs, the matrix–vector multiplications are a particular convolution whose stride is the matrix column dimension.

Since the SiPh-CA processes one-dimensional data, the $8 \times 8$ images are flattened into $1 \times 64$ image vectors; for ${FC}_{10 \times 32}$, the $8 \times 8$ pixel images are first downsampled to $4 \times 8$ and then flattened into $1 \times 32$ vectors. The weight matrix sizes are (10, 32) and (10, 64) for ${FC}_{10 \times 32}$ and ${FC}_{10 \times 64}$, respectively. ${Conv}_{1 \times 424}$ is an uncommon CNN in which the image vector is convolved with a (1, 424) kernel vector that is longer than the image vector, with a stride of 40. ${CF}_{3,16}$ is a general CNN in which the image vectors are first convolved with 16 kernels of size (1, 3) and then processed by a max-pooling function with a stride of 2 and a fully connected layer with a (512, 10) weight matrix. The experiment includes two steps: offline training and online testing. The dataset is split with a ratio of 3:1 for training and testing. In the offline training, the cost function is the cross-entropy, and stochastic gradient descent (SGD) is used for optimization, implemented with PyTorch [55] on a digital computer. Afterward, the trained weight matrices/kernels and the test images are loaded into the SiPh-CA for online testing, as shown in Fig. 6(a). Due to the limited availability of silicon photonic chips, the convolutions between the image vector and the 16 kernels are currently processed sequentially, as shown in Fig. 6(a). However, in future work, we plan to integrate multiple modulators onto a single chip to enable parallel computing, as depicted in Fig. 1.
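The statement that a matrix–vector multiplication is a particular convolution whose stride equals the matrix column dimension can be verified numerically. The sketch below is a software illustration only; the function name and the random stand-in data are assumptions, and the absolute value mimics the amplitude extraction of the SiPh-CA.

```python
import numpy as np

def mvm_by_strided_correlation(W, x):
    """Compute |W @ x| from a 1D sliding correlation sampled with stride len(x)."""
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    rows, cols = W.shape
    if cols != len(x):
        raise ValueError("x must have as many entries as W has columns")
    w_flat = W.reshape(-1)                            # row-major flattening, length rows * cols
    sliding = np.correlate(w_flat, x, mode="valid")   # lags 0 .. (rows - 1) * cols
    # Lags that are multiples of `cols` align x with exactly one row of W.
    return np.abs(sliding[::cols])


rng = np.random.default_rng(0)
W = rng.standard_normal((10, 64))   # stand-in for trained FC weights
x = rng.random(64)                  # stand-in for a flattened 8x8 image
y = mvm_by_strided_correlation(W, x)
print(y.shape)
print(np.allclose(y, np.abs(W @ x)))
```

Only every 64th lag is needed, which is why the useful outputs of the fully connected NNs appear at a regular spacing in the frequency spectrum.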

Figure 6. Image recognition demonstration. (a) Demonstration for the SiPh-CA processing different NNs. (b) Inference accuracy comparison among the electronic digital computer, SiPh-NA, and SiPh-CA. (c) Confusion matrices for the NNs processed in the SiPh-CA.

To avoid overlap among the information spectra and to improve the processing efficiency per image, the frequency settings are $f_{A} = 16.00\,\mathrm{GHz}$ for matrices and $f_{B} = 7.00\,\mathrm{GHz}$ for image data in ${FC}_{10 \times 64}$ and ${FC}_{10 \times 32}$, and $f_{A} = 12.00\,\mathrm{GHz}$ and $f_{B} = 5.00\,\mathrm{GHz}$ in ${Conv}_{1 \times 424}$ and ${CF}_{3,16}$. The frequency intervals for these four NNs are 0.01 GHz, 0.02 GHz, 0.01 GHz, and 0.0625 GHz, respectively. The corresponding output positions in the frequency spectra run from 14.76 to 9.00 GHz with an interval of 0.64 GHz for the two fully connected NNs, from 10.60 to 7.00 GHz with an interval of 0.40 GHz for ${Conv}_{1 \times 424}$, and from 6.875 to 1.0938 GHz with an interval of 0.0625 GHz for ${CF}_{3,16}$. The computing results indicate that the SiPh-CA provides NN inference accuracies competitive with those of the electronic digital computer, as shown in Figs. 6(b) and 6(c).
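The output positions quoted for the fully connected NNs and for ${Conv}_{1 \times 424}$ follow from the difference-frequency rule of Section 2 if each useful output is spaced by the stride times the frequency interval. The sketch below is a consistency check under that reading, not the authors' procedure: the helper name is hypothetical, the intervals are assumed to be listed in the order the networks are named, and the ${CF}_{3,16}$ range is not reproduced because the text does not give enough detail about its frequency layout.

```python
def output_frequencies_ghz(f_kernel, f_image, f0, stride, n_outputs):
    """Difference frequencies (GHz) at which the strided outputs are expected."""
    base = abs(f_kernel - f_image)
    return [round(base + r * stride * f0, 4) for r in range(n_outputs)]


# FC_{10x64}: 10 rows of 64 weights, f0 = 0.01 GHz.
fc64 = output_frequencies_ghz(16.00, 7.00, 0.01, stride=64, n_outputs=10)
# FC_{10x32}: 10 rows of 32 weights, f0 = 0.02 GHz.
fc32 = output_frequencies_ghz(16.00, 7.00, 0.02, stride=32, n_outputs=10)
# Conv_{1x424}: stride 40, (424 - 64) / 40 + 1 = 10 outputs, f0 = 0.01 GHz.
conv = output_frequencies_ghz(12.00, 5.00, 0.01, stride=40, n_outputs=10)

print(fc64[0], fc64[-1])
print(fc32[0], fc32[-1])
print(conv[0], conv[-1])
```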

It should be noted that only the amplitudes at the corresponding output frequencies are extracted, without considering the phases. Therefore, the convolution results cannot represent negative values, and the amplitude extraction, i.e., taking the absolute value, serves as the nonlinear operation in our experiments and in the implemented NNs. In addition, in the first three tested NNs, no nonlinear operation other than taking the absolute value is applied in the inference phase, since the softmax function included in the cross-entropy during training does not change the position of the maximum value in multi-category classification tasks. However, in general scenarios, commonly used nonlinear functions, e.g., ReLU, cannot be realized in the SiPh-CA physical structure and require extra digital circuits. Nevertheless, the SiPh MZMs can apply trigonometric nonlinear operations to the operands, improving the NN expression capability.

4. DISCUSSION

We compare the SiPh-CA performance and cost with those of the SiPh-NA [41], as shown in Table 2. To achieve the same convolution speed within one computing channel, the SiPh-CA uses only two of the four MZMs, which occupy most of the PIC footprint, and therefore needs only half of the power-hungry drivers. As a result, roughly 1/3 to 1/2 of the PIC footprint and 37.01% of the power consumption (from 1.494 [41] to 0.941 pJ/MAC) can be saved, as shown in Fig. 2 and analyzed in Appendix B. With advanced technologies, the total energy efficiency has the potential to reach 4.581 pJ/MAC, a reduction of 16.48% from the 5.485 pJ/MAC of the SiPh-NA. However, the SiPh-CA computing precision is lower than that of the SiPh-NA [41] because of a smaller OSNR, with a lower signal power and a higher noise. This is because the splitters before and after the MZMs in the $I$ and $Q$ paths limit the maximum signal power, and residual carrier and noise remain in the grayed paths even though the grayed MZMs in Fig. 2(a) nominally do not function in the ICT. Therefore, removing the unnecessary devices can improve the OSNR and raise the SiPh-CA computing precision above that of the SiPh-NA, as shown by the simulations in Appendix C; this will be verified in future work. For the NN computing performance, although the effective bit precision is smaller, the inference accuracies for ${FC}_{10 \times 64}$, ${FC}_{10 \times 32}$, and ${Conv}_{1 \times 424}$ obtained from the SiPh-CA are better, which benefits from the NN robustness. Instead of using OFCs and WDM technology to scale the architecture, we developed the power-splitter-based broadcasting scheme for the SiPh-CA to enhance the parallelism. In this scheme, the image vector needs to be modulated only once, saving power as mentioned in Section 2, and the power splitters do not introduce the inter-channel crosstalk and intra-channel signal power variations caused by WDM multiplexer filters. Moreover, the difficulty of processing the output spectrum remains manageable as the parallelism grows, since the input data spectra are not expanded by OFC combination. Furthermore, while the SiPh-CA can only process one-dimensional vectors of limited length, e.g., $n = 128$ in our experiments, we demonstrate a scalability method for large high-order tensors in Appendix D. Finally, compared to other optical computing frameworks and the NVIDIA A100 GPU [56], whose performances have been summarized and compared with the SiPh-NA in Ref. [41], our SiPh-CA also achieves a competitive computing speed, energy efficiency, and NN inference accuracy.

Table 2. Cost and Performance Comparison of the SiPh-NA [41] and the SiPh-CA

	SiPh-NA	SiPh-CA
Devices in one cell	4 MZMs, 4 drivers, 1 180° hybrid, 1 BPD, 1 TIA, 2 PS, 4 splitters	2 MZMs, 2 drivers, 1 180° hybrid, 1 BPD, 1 TIA
Power consumption with advanced technologies	5.616 W	4.690 W
Bit precision	3.71	3.08
Convolution speed	1.024 TOPS	1.024 TOPS
max. MVM matrix	$10 \times 64$	$10 \times 64$
max. NN	${FC}_{10,64}$	${FC}_{10,64}$
Inference accuracy	95.78%	96.22%
Scalability	$\sqrt{N}$	$N$
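The relative savings quoted in this section can be rechecked from the stated figures with a few lines of Python. The helper name is illustrative; the inputs are the values given in the text and in Table 2.

```python
def relative_saving(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


pic_saving = relative_saving(1.494, 0.941)     # PIC energy, pJ/MAC
total_saving = relative_saving(5.485, 4.581)   # projected total energy, pJ/MAC
power_saving = relative_saving(5.616, 4.690)   # Table 2 power, W

print(f"{pic_saving:.2%}, {total_saving:.2%}, {power_saving:.2%}")
```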

Although the SiPh-CA can accelerate convolutions and MVMs for NNs, the computing precision is unsatisfactory and deteriorates as the input vector size and the computing speed increase. This results from the limited extinction ratio and linear modulation range of the SiPh-CA. The limited extinction ratio means that noise from the incompletely suppressed carrier exists, and the limited linear modulation range constrains the total signal power. Both degrade the OSNR as the input vector size increases. Therefore, to improve the SiPh-CA performance, modulators with a higher extinction ratio and a broader linear modulation range should be developed at the device and material levels. In addition, during the experiments, the computing precision increases when the output signals are averaged across multiple repetition periods. However, the symbol rate $R$ then drops, and so does the computing speed. To maintain the computing speed while improving the computing precision, we could feed the same vectors into multiple computing channels and average their outputs at the architecture level. Furthermore, the computing precision can be further improved when the impacts of the nonflat EO and OE frequency responses of the IC-TROSA are reduced by measuring them, pre-compensating the modulation amplitudes, and tuning the output amplitudes at the corresponding frequencies. The process is similar to frequency equalization in transmission systems [57] and should be realized in hardware instead of by software-based calibration.
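The benefit of averaging over repetition periods or over parallel channels can be illustrated with a toy Monte Carlo sketch. It assumes independent, zero-mean Gaussian noise on each reading, which is a simplification of the real system; under that assumption the error standard deviation falls roughly as $1 / \sqrt{m}$ for $m$ averaged readings. The function name and the noise level are illustrative.

```python
import numpy as np

def averaged_error_std(noise_std, repeats, trials=2000, seed=0):
    """Empirical standard deviation of the error after averaging `repeats` noisy readings."""
    rng = np.random.default_rng(seed)
    noise = noise_std * rng.standard_normal((trials, repeats))
    return noise.mean(axis=1).std()


for m in (1, 4, 16):
    print(m, averaged_error_std(0.1, m))
```

Averaging in time trades symbol rate for precision, whereas averaging across channels keeps the symbol rate at the cost of additional hardware.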

5. CONCLUSION

In this work, a novel scalable, integrated, and flexible SiPh-CA is proposed and realized with sub-IC-TROSAs. The one-channel SiPh-CA prototype reaches a convolution speed of 1.024 TOPS, and the implemented NN inference accuracies are comparable to those of the NNs processed by a digital computer. The coherent PIC consumes less than 1 pJ/MAC. The energy efficiency of the SiPh-CA can reach below 5 pJ/MAC when it is fabricated with advanced CMOS and silicon photonic technology. These merits are competitive with those of the latest optical computing frameworks and superior to our previous SiPh-NA architecture. Furthermore, although the scalability is inferior to that of the SiPh-NA, by leveraging the splitter-based broadcasting scheme, the SiPh-CA can be linearly scaled to process convolutions between an image and multiple kernels in parallel, maintaining manageable complexity for optoelectronic systems and avoiding the nonideal elements associated with OFCs and WDM technologies. In future work, by optimizing the device and circuit design together with software/hardware codesign, we believe the SiPh-CA can achieve a faster computing speed, higher computing precision, and better energy efficiency.

Category: Silicon Photonics

Received: Jul. 18, 2024

Accepted: Dec. 2, 2024

Published Online: Feb. 10, 2025

The Author Email: Xi Xiao (xiaoxi@noeic.com)

CSTR:32188.14.PRJ.536939