Photonics Research, Volume. 13, Issue 2, 497(2025)

Silicon photonics convolution accelerator based on coherent chips with sub-1 pJ/MAC power consumption

Ying Zhu1, Lu Xu1, Xin Hua1, Kailai Liu2, Yifan Liu1, Ming Luo2, Jia Liu1, Ziyue Dang1, Ye Liu1, Min Liu1, Hongguang Zhang1, Daigao Chen1, Lei Wang3, Xi Xiao1,3,*, and Shaohua Yu3
Author Affiliations
  • 1National Information Optoelectronic Innovation Center, China Information and Communication Technologies Group Corporation (CICT), Wuhan 430074, China
  • 2State Key Laboratory of Optical Communication Technologies and Networks, China Information and Communication Technologies Group Corporation (CICT), Wuhan 430074, China
  • 3Peng Cheng Laboratory, Shenzhen 518055, China

    Artificial intelligence (AI), owing to its substantial computing demands, necessitates computing hardware that offers both high speed and high power efficiency. A silicon photonic integrated circuit shows promise as a hardware solution due to its high power efficiency, low latency, large bandwidth, and complementary metal–oxide–semiconductor (CMOS) compatibility. Here, we propose a silicon photonic convolution accelerator (SiPh-CA) and experimentally realize a prototype with sub-integrated coherent transmit–receive optical sub-assemblies (sub-IC-TROSAs). Compared to a previous IC-TROSA-based convolution accelerator, the prototype achieves almost the same performance, namely 1.024 TOPS/channel and 96.22% inference accuracy when processing neural networks for image recognition, while using half the number of modulators and drivers, which reduces the chip footprint by over 1/3 and the power consumption by 37.01%. By incorporating a broadcasting scheme based on splitters and combiners, the approach can efficiently process multiple convolutions in parallel, achieving several tera operations per second. This scalability allows the SiPh-CA to process complex AI and high-performance computing tasks.

    1. INTRODUCTION

    In recent years, artificial intelligence (AI) has made significant advancements in various domains such as computer vision [1,2], natural language processing [3,4], speech recognition [5], and healthcare [6]. However, the slowing of Moore’s law and the energy implications associated with AI have become the elephant in the room [7]. Consequently, the need for high-speed and energy-efficient AI hardware and systems has become increasingly urgent. Extensive research has been conducted, including exploring emerging computing materials and devices (e.g., phase change materials (PCMs) [8], memristors [9], and ferroelectric field-effect transistors [10,11]), architectures (e.g., near-memory [12] and in-memory computing [13,14]), computing mechanisms (e.g., TrueNorth [15] and BrainScaleS [16]), and software–hardware co-design [17–20]. Among them, optical computing hardware is a promising platform.

    Optical hardware offers multiple merits over its electronic counterparts, including low power consumption, high speed, large bandwidth, and massive parallelism [21,22]. It can perform linear transformations at light speed and detect data at rates exceeding 100 GHz while consuming minimal power [23–25]. Its inherent optical nonlinearities enable direct processing of nonlinear operations in optical neural networks (ONNs), providing diverse options for neural network (NN) algorithm design [26]. Furthermore, ONNs operating in the analog domain consume less energy and time because data need not be shuttled back and forth between the arithmetic logic unit (ALU) and memories, thereby circumventing the von Neumann bottleneck [27]. In addition, the abundant bandwidth resources in the optical domain enable simultaneous signal processing and calculations at multiple wavelengths [28,29]. Moreover, the maturing silicon photonic process, which is compatible with complementary metal–oxide–semiconductor (CMOS) technology, holds promise for large-scale integration of optical and electronic materials and devices, enabling an ultra-high computation density [30,31].

    To date, many optoelectronic and optical NNs have emerged, among which optical reservoir NNs and optical feedforward NNs attract particular attention. Reservoir NNs utilize the dynamic behaviors and memory effects of a physical system to process sequential and temporal data [32]. Several optical reservoir NNs have been developed [33–36]. In contrast to reservoir NNs, optical feedforward NNs primarily accelerate NN multiply-accumulate (MAC) operations as light propagates through the optical system. These systems generally fall into three categories according to their working principles [22]. The first category is cascaded Mach–Zehnder interferometer (MZI) grids. The transform matrices of the grids are programmed by tuning the phase shifters in the grids, so that matrix–vector multiplications are completed as the data-carrying light propagates. An MZI-based ONN can complete arbitrary matrix–vector multiplications within 0.01 ns [37]. The second category utilizes wavelength-division multiplexing (WDM) to realize MAC operations with incoherent light. A photonic accelerator can achieve 11 tera-operations per second (TOPS) by utilizing fiber chromatic dispersion to accumulate optical intensities at different wavelengths [38]. A PCM crossbar-based integrated photonic tensor core can also achieve TOPS speeds [39]. A silicon photonic–electronic NN can perform fiber nonlinearity compensation over a distance of 10,080 km with performance comparable to a software-based NN processed in a 32-bit graphics processing unit (GPU) [26]. An optoelectronic neuromorphic accelerator with only one dual-polarization IQ modulator and one balanced photodetector (BPD) can achieve over 500 giga-operations per second (GOPS) [40]. A SiPh neuromorphic accelerator (SiPh-NA) based on integrated coherent transmit–receive optical sub-assemblies (IC-TROSAs) also achieves over 1 TOPS and has the potential to reach several peta-OPS (POPS) with optical frequency combs (OFCs) [41]. The third category comprises several novel photonic neural networks that compute with spatial diffraction and can achieve over POPS. While early-stage works utilize non-reconfigurable diffraction modules during data processing [42,43], a recent work leveraging heaters on the diffraction cores realizes in situ training and programming for diffractive photonic computing [44]. Furthermore, a novel large-scale photonic chiplet exploring a distributed diffractive-interference hybrid photonic computing architecture is reported to achieve 160 TOPS/W [45]; to the best of our knowledge, it is one of the largest-scale hybrid photonic chiplets, and the models it can process are among the most complex that a photonic accelerator has handled. However, these photonic NNs encounter many challenges in integration, scalability, flexibility, and reconfiguration. Additionally, existing photonic devices, primarily designed for communication and interconnection, do not meet computing requirements such as quantization resolution and linearity range, which impacts the computing performance of the optical hardware. Therefore, clear guidance on the design and optimization of devices and circuits for computing is important.

    We demonstrate a novel silicon photonics convolution accelerator (SiPh-CA) based on Mach–Zehnder modulators (MZMs) and simplified coherent receivers with a broadcasting parallelism structure to accelerate convolution, one of the most frequently processed operations in NNs. The to-be-convolved data and the convolution kernels are encoded in the amplitudes of electronic signals at multiple orthogonal frequencies, which modulate carriers split from one laser source. The modulated to-be-convolved signals are broadcast into different channels, each of which carries the signals of the data and of one kernel. They propagate through simplified coherent receivers to complete the MAC operations. The utilized devices are commercially available, silicon photonic integrated, and CMOS compatible. By adjusting amplitude values and orthogonal frequencies in the electronic domain, the accelerator provides flexibility in computing precision, speed, and programmability for convolutions, matrix–vector multiplications, and neural networks. Experimentally evaluated on sub-integrated coherent transmit–receive optical sub-assemblies (sub-IC-TROSAs) for convolutions and handwritten digit recognition, the SiPh-CA achieves 1.024 TOPS/channel and can in principle achieve tens of TOPS with broadcasting parallelism. It also reaches accuracies of 96.22%, 95.33%, and 93.00% as a one-layer fully connected NN, a one-layer convolutional NN, and a convolution layer of a multi-layer forward NN, respectively, competitive with GPU-run NNs. Compared to the IC-TROSA-based SiPh-NA [41], the proposed sub-IC-TROSA-based SiPh-CA occupies 1/3 less footprint and consumes 37.01% less power for the same performance, and it avoids the inter-channel crosstalk, intra-channel noise, and added complexity and difficulty of processing the ultra-broad spectrum that come with WDM technology. In addition, we discuss factors limiting the accelerator’s speed, precision, and power efficiency, and propose solutions and optimization requirements for improved performance.

    2. PRINCIPLE OF THE ARCHITECTURE

    The SiPh-CA design is inspired by the convolution theorem: convolution in one domain (e.g., the frequency domain) equals point-wise multiplication in the other domain (e.g., the time domain) [46]. Element-wise multiplication can be realized through coherent modulation and detection. For example, two electronic signals $S = A_1\cos 2\pi(f_A+f_0)t$ and $L = B_1\cos 2\pi(f_B+f_0)t$ simultaneously modulate two carriers split from one single-wavelength source via two MZMs, where $A_1$ and $B_1$ are the amplitudes of the two signals, $f_A$ and $f_B$ are the signal start frequencies, $f_0$ is the frequency interval, and $t$ is the time. Afterward, the two optical signals enter the two input ports of a 180° hybrid connected to a BPD, respectively. The BPD outputs a current $I$ with an amplitude proportional to $A_1B_1$ at frequency $f = |f_B - f_A|$. When another two electronic signals, $A_2\cos 2\pi(f_A+2f_0)t$ and $B_2\cos 2\pi(f_B+2f_0)t$, modulate the two carriers together with the previous electronic signals, respectively, the amplitudes of the BPD output spectrum at $f = |f_B-f_A-f_0|$, $|f_B-f_A|$, and $|f_B-f_A+f_0|$ are proportional to $|A_2B_1|$, $|A_1B_1+A_2B_2|$, and $|A_1B_2|$, respectively. Generally, the electronic signals
    $$S_e = \sum_{i=1}^{N} A_i\cos 2\pi(f_A+if_0)t, \quad L_e = \sum_{j=1}^{M} B_j\cos 2\pi(f_B+jf_0)t \tag{1}$$
    modulate the carriers via MZMs, giving the optical signals
    $$S_o = \sum_{i=1}^{N} A_i\cos 2\pi(f_A+if_0)t\,\cos 2\pi f_c t, \quad L_o = \sum_{j=1}^{M} B_j\cos 2\pi(f_B+jf_0)t\,\cos 2\pi f_c t, \tag{2}$$
    where $f_c$ is the carrier frequency. After coherent detection with a 180° hybrid and a BPD, the amplitude of the $I$ spectrum at $f_k = |f_B-f_A+kf_0|$, where $k = j-i$, is
    $$A(f_k) = \alpha\left|\sum_{i=\max(1,\,1-k)}^{\min(N,\,M-k)} A_iB_{i+k}\right|, \tag{3}$$
    where $\alpha$ is a proportional parameter determined by the system status (e.g., the photodetector responsivity and driver voltages). Importantly, for all $i$ and $j$, the frequencies $|f_B-f_A+(j-i)f_0|$ and $|f_B+f_A+(i+j)f_0|$ should be different, which can be easily satisfied by setting $f_A$, $f_B$, and $f_0$ appropriately (e.g., $f_B \geq f_A+Nf_0$ and $f_A \geq \tfrac{1}{2}Mf_0$). Otherwise, if a frequency $f = |f_B-f_A+(j_1-i_1)f_0| = |f_B-f_A+(j_2-i_2)f_0| = |f_A+f_B+(i_3+j_3)f_0|$ exists, the amplitude at this frequency is $\alpha\left|\sum_i A_iB_{i+j_1-i_1} + \sum_i A_iB_{i+j_2-i_2} + \sum_i A_iB_{(j_3-i_3)-i+2f_A/f_0}\right|$, which is meaningless for convolutions.
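    The readout described by Eq. (3) can be checked numerically with a minimal sketch. It assumes an idealized model that is not taken from the paper: the balanced detector is taken to deliver only the product of the two baseband tone combs (with $\alpha = 1$), the carrier is omitted, and the function name `coherent_correlate`, the sampling rate, and the default frequencies are illustrative choices. Note that the sum $\sum_i A_iB_{i+k}$ corresponds to NumPy's `correlate`; reversing one input vector turns it into a convolution.

```python
import numpy as np

def coherent_correlate(A, B, f_a=4.0, f_b=12.0, f0=1.0, fs=256.0):
    """Amplitudes at f_k = f_b - f_a + k*f0 for k = 1-N .. M-1 (idealized model)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    N, M = len(A), len(B)
    # Frequency plan from the text: f_b >= f_a + N*f0 and f_a >= M*f0/2.
    assert f_b >= f_a + N * f0 and f_a >= M * f0 / 2
    n_samp = int(round(fs / f0))      # one symbol lasts 1/f0: integer periods per tone
    t = np.arange(n_samp) / fs
    i = np.arange(1, N + 1)[:, None]
    j = np.arange(1, M + 1)[:, None]
    S = (A[:, None] * np.cos(2 * np.pi * (f_a + i * f0) * t)).sum(axis=0)
    L = (B[:, None] * np.cos(2 * np.pi * (f_b + j * f0) * t)).sum(axis=0)
    current = S * L                   # assumption: detector output is the cross term only
    amp = 2.0 * np.abs(np.fft.rfft(current)) / n_samp   # single-sided amplitude spectrum
    k = np.arange(1 - N, M)
    bins = np.rint((f_b - f_a + k * f0) / f0).astype(int)  # FFT bin spacing equals f0
    return 2.0 * amp[bins]            # cos*cos = (1/2)[difference + sum] terms

A = [1.0, 2.0, 3.0, 4.0]
B = [0.5, 1.0, 0.0, 2.0]
measured = coherent_correlate(A, B)
expected = np.abs(np.correlate(B, A, mode="full"))
print(np.allclose(measured, expected))
```

    Because only amplitudes are read, the sketch returns absolute values, matching the modulus in Eq. (3).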

    Within the process mentioned above, $2MN$ operations are completed. When the symbol rate is $R$, the computing speed is
    $$P = 2R\cdot MN\ \mathrm{OPS}, \tag{4}$$
    where the maximum value of $R$ is $R_{\max} = f_0$, which guarantees that every orthogonal-frequency modulation signal spans an integer number of periods in each symbol. Note that the amplitude extraction should be implemented in the analog domain so that this computing speed is maintained.

    Furthermore, since one image is usually convolved with multiple kernels in convolutional NNs (CNNs), a broadcasting parallelism structure is implemented to compute in parallel and reduce the processing time per image. The image modulation signal is divided equally by power splitters and convolved with a different kernel signal in every channel, as in Fig. 1. With $K$ channels, the parallel computing speed reaches
    $$P_{\mathrm{parallel}} = 2R\cdot MNK\ \mathrm{OPS}. \tag{5}$$
    More importantly, in the broadcasting parallelism structure the power consumption is $K+1$ times the modulation power, whereas it is $2K$ times the modulation power if the image is modulated and convolved with the $K$ kernels individually. The modulation power consumption is therefore reduced to a fraction $(K+1)/(2K)$ of that of individual modulation.
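    The speed and power expressions above are simple enough to evaluate directly. The following sketch uses hypothetical helper names and plugs in the $n = 128$, $f_0 = 4/128$ GHz operating point reported later in the experiments; the choice of $K = 16$ channels is only an illustration.

```python
def computing_speed_ops(symbol_rate_hz, n, m, channels=1):
    """P = 2 * R * M * N * K operations per second."""
    return 2 * symbol_rate_hz * n * m * channels

def broadcast_power_fraction(channels):
    """Modulation power with broadcasting (K + 1) relative to individual modulation (2K)."""
    return (channels + 1) / (2 * channels)

f0_hz = 4e9 / 128                      # symbol rate R = f0 for n = 128
single_channel = computing_speed_ops(f0_hz, 128, 128)
sixteen_channels = computing_speed_ops(f0_hz, 128, 128, channels=16)
print(single_channel / 1e12, "TOPS for one channel")
print(sixteen_channels / 1e12, "TOPS for 16 channels")
print(broadcast_power_fraction(16), "of the individual-modulation power")
```

    For a single channel ($K = 1$) the fraction is 1, i.e., broadcasting saves nothing; as $K$ grows, it approaches 1/2.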

    3. EXPERIMENT AND RESULTS

    A complete-structure IC-TROSA consists of a silicon photonic integrated circuit (PIC), a driver chip, and two trans-impedance amplifier (TIA) chips [47]. Only some parts of the complete-structure IC-TROSA, i.e., the sub-IC-TROSA, are necessary to construct the SiPh-CA: two MZMs for image and kernel data modulation and a one-channel coherent detection structure, as presented in Fig. 2(a). The chip has two input ports and one output port. Input port 1, in1, can be used either as the port for the source laser of the integrated coherent transmitter (ICT) or as the local light receiving port of the integrated coherent receiver (ICR); input port 2, in2, receives the signal. The output port, out1, outputs the two modulated signals from the ICT. In the ICT, the source laser is first equally distributed to two MZMs (the IC-TROSA actually contains four MZMs, but two of them are switched off for the sub-IC-TROSA), via which the data are modulated onto the coherent carriers. The two signals are then polarization multiplexed by a polarization combining device, i.e., the Pol.MUX, and output through out1. In the ICR, the signal and local lights enter a 90° hybrid after a polarization demultiplexer (Pol.deMUX) and a splitter, respectively, and are detected by a BPD.

    The PIC was fabricated on an 8-in. silicon-on-insulator (SOI) wafer using a 180 nm lithography process. The complete-structure IC-TROSA PIC, which includes four MZMs, four-channel coherent detection blocks, and other blocks, occupies 5.8 mm × 6.9 mm, as shown in Fig. 2(b); the footprint of the sub-IC-TROSA PIC can therefore be as small as 2.9 mm × 6.9 mm when all significant device blocks are integrated into one chip with proper layout and routing. The component uses ball grid array (BGA) packaging, allowing it to be integrated with microprocessors using microelectronic-compatible packaging techniques. The complete-structure IC-TROSA is shown in Fig. 2(c); the sub-IC-TROSA requires half as many devices, providing the advantages of low power consumption and a small footprint. The 3 dB EO and OE bandwidths of the MZM and PDs are around 45 GHz with an approximately flat frequency response from 0 to 40 GHz [41], which allows the SiPh-CA to process signals up to 40 GHz. A higher computing speed is possible since transmitters and receivers with higher bandwidths have been realized [48,49].

    Based on an ICT and an ICR from two individual sub-IC-TROSAs, we built a SiPh-CA prototype with one computing channel, as shown in Fig. 3 (see Appendix A for the system construction). An arbitrary waveform generator (AWG) generates the waveforms of S and L, as in Eq. (1). These waveforms then modulate the laser light via two individual MZMs with drivers in the ICT of sub-IC-TROSA 1. The modulated S and L signals, as in Eq. (2), are combined by the Pol.MUX and output simultaneously in the X and Y polarizations. The two signals are amplified by an erbium-doped fiber amplifier (EDFA) to compensate for the coupling loss and the insertion loss of the Pol.MUX and Pol.deMUX; the EDFA can be removed in future work. Afterward, S and L are separated by a polarization beam splitter (PBS) and received by the input ports of the ICR with TIA in sub-IC-TROSA 2. Note that although the ICR uses a 90° hybrid, a 180° hybrid, which has a simpler structure and a smaller footprint, is sufficient to support the function. The TIA output current, which contains the waveforms with the corresponding amplitudes and frequencies, is sampled by a digital storage oscilloscope (DSO). To identify the starting point of the meaningful data in the sampled currents, two preamble patterns are added before the modulation signals $S_e$ and $L_e$, respectively. The first preamble pattern is a positive high-voltage signal occupying the whole symbol period, whereas the second is a positive high-voltage level for the first half of the symbol period followed by a negative high-voltage level for the second half. In the post-processing stage, we analyze the output samples with a counter: a pattern of high voltages sustained for half the symbol period followed by low voltages for the remaining half marks the start of the meaningful data.
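    The counter-based preamble search can be sketched as follows. This is only an illustration under assumed details: the function name `find_data_start`, the symmetric threshold, and the exhaustive sample-by-sample scan are hypothetical and are not described in the paper.

```python
import numpy as np

def find_data_start(samples, samples_per_symbol, threshold):
    """Return the index of the first sample after the half-high/half-low preamble.

    A match requires every sample in the first half-symbol to exceed +threshold
    and every sample in the second half-symbol to fall below -threshold.
    Returns None when no such pattern is found.
    """
    samples = np.asarray(samples, dtype=float)
    half = samples_per_symbol // 2
    for start in range(len(samples) - samples_per_symbol + 1):
        first = samples[start:start + half]
        second = samples[start + half:start + samples_per_symbol]
        if np.all(first > threshold) and np.all(second < -threshold):
            return start + samples_per_symbol
    return None

# Synthetic trace: idle samples, an 8-sample preamble symbol, then data.
trace = [0.05, -0.02, 0.01] + [1.0] * 4 + [-1.0] * 4 + [0.3, -0.2, 0.1, 0.4]
data_start = find_data_start(trace, samples_per_symbol=8, threshold=0.5)
print(data_start)
```

    In practice the threshold would have to be chosen relative to the noise level of the sampled output.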

    Additionally, the prototype system uses several fibers to connect the chips and devices. Their limited length resolution results in a path-length difference between S and L and introduces phase deviations into Eq. (2). The same concern applies to fully integrated chips. To solve the problem, we use the path-difference calibration method of Ref. [41]. By setting up a digital twin system in software and minimizing the output difference between the experimental hardware system and the software-based digital twin, we obtain a path difference of 0.0163 m in the prototype and pre-compensate for it in the modulation waveforms. The calibration is required only once, when the system is established, as long as the path difference remains unchanged. Furthermore, the EDFA, PC, and PBS between the ICT and ICR can be removed when the necessary devices are connected with waveguides and integrated into one chip, as shown in Fig. 1. In future work, we will design analog circuits to replace the AWG in generating mixed orthogonal-frequency signals with data vectors as input. In our prototype, the amplitudes at the corresponding frequencies are extracted by a fast Fourier transform of the DSO samples; a photonic-assisted channelization scheme based on coherent dual optical frequency combs will substitute for the DSO to extract these amplitudes directly in the analog domain without reducing the data throughput or the computing speed [50–52].

    Figure 1. SiPh-CA conceptual diagram.

    Figure 2. SiPh IC-TROSA. (a) The sub-silicon photonic coherent transceiver structure, in which only the devices labeled with dark colors function. (b) The microscope image of a silicon photonic coherent transceiver chip. (c) The highly compact SiPh IC-TROSA integrates a silicon coherent transceiver chip, a driver chip, and two TIA chips inside.

    Figure 3. Experimental diagram. AWG, arbitrary waveform generator; ICT, integrated coherent transmitter; EDFA, erbium-doped fiber amplifier; PC, polarization controller; PBS, polarization beam splitter; ICR, integrated coherent receiver; TIA, trans-impedance amplifier; DSO, digital storage oscilloscope; τ, symbol time width. Among them, the ICT and ICR are photonic integrated chips co-packaged with drivers and TIAs on printed circuit boards (green solid boxes). After removing the polarization multiplexer from the ICT, the function of the components in the gray dashed box can be realized in a single photonic integrated chip without the EDFA, PC, and PBS.

    To showcase the SiPh-CA functions, we conducted two experiments. The first experiment utilizes the SiPh-CA to perform convolutions between different length vectors. The second experiment employs it to process four NNs for image recognition.

    A. SiPh-CA for Convolutions

    The prototype parameters are as follows: the start frequencies of A and B are $f_A = 5$ GHz and $f_B = 18$ GHz, respectively, and the occupied bandwidths satisfy $Nf_0 \leq 4$ GHz and $Mf_0 \leq 4$ GHz to avoid overlap among the information frequency spectra caused by the carrier-suppressed double-sideband (CS-DSB) modulation and the incompletely suppressed carrier. In addition, the signal frequencies should not exceed the available analog bandwidths of the AWG, modulators, and BPDs. To achieve a high computing speed, the frequency interval and the symbol rate are set to their maximum values, i.e., $f_0 = 4/n$ GHz and $R = f_0$, where $N = M = n$ without loss of generality. The convolution speed is then $2f_0n^2 = 2\cdot(4/n)\,n^2 = 8n$ GOPS, so the larger the vector dimension $n$ is, the higher the convolution speed can be. However, as $n$ increases, the optical signal-to-noise ratio (OSNR) of each data element drops, since the total signal power of all data elements is bounded by the maximum optical input power of the ICT and ICR. The deteriorated OSNR reduces the SiPh-CA computing precision. Therefore, to demonstrate the performance upper limits of the SiPh-CA for convolutions, we test the convolutions $\mathbf{1}_n \otimes \mathbf{1}_n$ and $R_n \otimes R_n$ for different $n$, where $\mathbf{1}_n = [1, 1, \ldots, 1] \in \mathbb{R}^n$ is the $n$-dimensional all-ones vector and $R_n \in \mathbb{R}^n$ is an $n$-dimensional randomly generated nonnegative real-valued vector.

    For the convolution $\mathbf{1}_n \otimes \mathbf{1}_n$, whose theoretical output is $[1, 2, \ldots, n, \ldots, 2, 1]$, the similarity between the experimental results and the theoretical output is assessed by the correlation coefficient, as listed in Table 1 [Corrcoef. ($\mathbf{1}_n$)]. The output spectra are also presented in Fig. 4. As observed, the SiPh-CA provides a convolution speed of 640 GOPS if the lower limit of the correlation coefficient is set to 0.9700, as in our previous SiPh-NA work [41] (1.024 TOPS at Corrcoef. = 0.9700). The SiPh-CA cannot reach the same correlation coefficient at that speed because its optical signals are amplitude-modulated via one MZM, whereas the signals in the SiPh-NA are IQ-modulated via two MZMs with a higher OSNR. Nonetheless, for $n = 128$, where the convolution speed is 1.024 TOPS, the SiPh-CA still outputs a spectrum similar to the theoretical one.
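    As an illustration of the metric, the sketch below builds the theoretical triangular output and compares it with a noisy copy using the Pearson correlation coefficient. The noise is synthetic (a fixed-seed Gaussian with an arbitrary standard deviation), not measured data, and the helper names are hypothetical.

```python
import numpy as np

def triangle_reference(n):
    """Theoretical output of the all-ones convolution: [1, 2, ..., n, ..., 2, 1]."""
    ones = np.ones(n)
    return np.convolve(ones, ones)

def correlation_coefficient(measured, reference):
    """Pearson correlation coefficient between two equal-length vectors."""
    return np.corrcoef(measured, reference)[0, 1]

rng = np.random.default_rng(0)
reference = triangle_reference(8)
# Synthetic stand-in for a measured spectrum; the 0.3 noise level is arbitrary.
noisy = reference + rng.normal(scale=0.3, size=reference.size)
print(reference)
print(correlation_coefficient(noisy, reference))
```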

    Table 1. Performance of the SiPh-CA for Convolutions

    n      f0 (GHz)   Corrcoef. (1_n)   Avg. Corrcoef. (R_n)   Avg. std.E (R_n)   Eff.B (bit)   Speed (GOPS)
    4      1          0.9989            0.9856                 0.0487             4.36          32
    8      0.5        0.9980            0.9858                 0.0520             4.27          64
    16     0.25       0.9930            0.9774                 0.0613             4.03          128
    20     0.2        0.9928            0.9775                 0.0612             4.03          160
    32     0.125      0.9871            0.9700                 0.0691             3.85          256
    40     0.1        0.9816            0.9660                 0.0732             3.77          320
    64     0.0625     0.9784            0.9497                 0.0887             3.50          512
    80     0.05       0.9738            0.9373                 0.0997             3.33          640
    100    0.04       0.9698            0.9275                 0.1070             3.22          800
    128    0.03125    0.9587            0.9095                 0.1181             3.08          1024

    Figure 4. Output normalized amplitude spectrum of the SiPh-CA processing $\mathbf{1}_n \otimes \mathbf{1}_n$.

    In the tests for $R_n \otimes R_n$, 100 trials are conducted for each value of $n$. The SiPh-CA performance is presented in Table 1 as averaged correlation coefficients [Avg. Corrcoef. ($R_n$)], averaged error standard deviations [Avg. std.E ($R_n$)], and effective bit precision (Eff.B) [53]. In Fig. 5, the blue dots are the experimental outputs, while the ideal outputs lie on the gray line $y = x$. Although the blue dots spread over a gradually expanding range as the vector dimension $n$ increases, they remain around the gray line, with an effective bit precision above 3 bits even in the $n = 128$, $P = 1.024$ TOPS case. The experimental results confirm that the SiPh-CA can complete nonnegative real-valued vector convolutions.
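    The tabulated effective bit precision appears to follow from the error standard deviation as $-\log_2(\text{std.E})$ for outputs normalized to a unit range. This relation is inferred here from the Table 1 values and is an assumption; the exact definition is the one given in Ref. [53]. A quick check against the table:

```python
import numpy as np

# Avg. std.E (R_n) and Eff.B columns copied from Table 1.
std_err = np.array([0.0487, 0.0520, 0.0613, 0.0612, 0.0691,
                    0.0732, 0.0887, 0.0997, 0.1070, 0.1181])
eff_bits_table = np.array([4.36, 4.27, 4.03, 4.03, 3.85,
                           3.77, 3.50, 3.33, 3.22, 3.08])

# Assumed relation: effective bits = -log2(error standard deviation).
eff_bits_est = -np.log2(std_err)
max_gap = np.max(np.abs(eff_bits_est - eff_bits_table))
print(np.round(eff_bits_est, 2))
print(max_gap)
```

    The estimate and the table agree to within about 0.01 bit, which is the rounding resolution of the tabulated inputs.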

    Figure 5. One trial of the measured results of the SiPh-CA processing $R_n \otimes R_n$.

    In addition, according to the working principle in Eqs. (2) and (3), the SiPh-CA can in principle process convolutions between two real-valued vectors or two complex-valued vectors when the output phases are considered. The SiPh-CA has performed convolutions between two real-valued vectors, including those with negative values, although the outputs are absolute values because of the amplitude extraction; this will be demonstrated in the following NN experiments. In future work, we will use the SiPh-CA to realize real-valued and complex-valued convolutions with the output phases included.

    B. SiPh-CA for NNs

    To demonstrate the effectiveness of the SiPh-CA for neural networks, we test four NNs to process handwritten digits with 8×8 pixels from the sklearn library [54]. They are two one-layer fully connected NNs, FC10×32 and FC10×64; a one-layer CNN, Conv1×424; and a forward CNN with a convolution layer and a fully connected layer, CF3,16. Although no general convolutions exist in fully connected NNs, the matrix–vector multiplications are a particular convolution whose stride is the matrix column dimension.
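    The statement that a matrix–vector multiplication is a convolution whose stride equals the matrix column dimension can be illustrated with a short sketch. It uses the correlation form of Eq. (3), flattens the weight matrix row by row, and keeps every 64th output; the function name and the random test data are illustrative assumptions.

```python
import numpy as np

def mvm_by_strided_correlation(W, x):
    """Compute W @ x by sliding x along the row-major flattened W.

    The sliding sum at offset k is sum_n W_flat[n + k] * x[n]; offsets that are
    multiples of the column count align x with one full row of W.
    """
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    cols = W.shape[1]
    sliding = np.correlate(W.ravel(), x, mode="valid")
    return sliding[::cols]

rng = np.random.default_rng(0)
W = rng.random((10, 64))     # same shape as the FC weight matrix in the text
x = rng.random(64)           # flattened 8x8 image
y = mvm_by_strided_correlation(W, x)
print(np.allclose(y, W @ x))
```

    Only 10 of the 577 sliding outputs are used, which is why the outputs of the fully connected networks are spaced far apart in frequency.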

    Since the SiPh-CA processes one-dimensional data, the 8×8 images are flattened into 1×64 image vectors; for FC10×32, the 8×8 pixel images are first downsampled to 4×8 and then flattened into 1×32 vectors. The weight matrix sizes are (10, 32) and (10, 64) for FC10×32 and FC10×64, respectively. Conv1×424 is an uncommon CNN in which the image vector is convolved with a (1, 424) kernel vector that is longer than the image vector, with a stride of 40. CF3,16 is a general CNN in which the image vectors are first convolved with 16 kernels of size [1, 3] and then processed by a max-pooling function with a stride of 2 and a fully connected layer with a (512, 10) weight matrix. The experimental procedure includes two steps: offline training and online testing. The dataset is split at a ratio of 3:1 for training and testing. In the offline training, the cost function is the cross-entropy, and the optimization uses the stochastic gradient descent (SGD) algorithm, implemented with PyTorch [55] on a digital computer. Afterward, the trained weight matrices/kernels and the test images are loaded into the SiPh-CA for online testing, as shown in Fig. 6(a). Due to the limited availability of silicon photonic chips, the convolutions between the image vector and the 16 kernels are currently processed sequentially, as shown in Fig. 6(a). In future work, we plan to integrate multiple modulators onto a single chip to enable parallel computing, as depicted in Fig. 1.

    Figure 6. Image recognition demonstration. (a) Demonstration for the SiPh-CA processing different NNs. (b) Inference accuracy comparison among the electronic digital computer, SiPh-NA, and SiPh-CA. (c) Confusion matrices for the NNs processed in the SiPh-CA.

    To avoid overlap among the information spectra and to improve the per-image processing efficiency, the frequency settings are fA=16.00 GHz for matrices and fB=7.00 GHz for image data in FC10×64 and FC10×32, and fA=12.00 GHz and fB=5.00 GHz in Conv1×424 and CF3,16. The frequency intervals for these four NNs are 0.01 GHz, 0.02 GHz, 0.01 GHz, and 0.0625 GHz, respectively. The corresponding output positions in the frequency spectra run from 14.76 to 9.00 GHz with an interval of 0.64 GHz for the two fully connected NNs, from 10.60 to 7.00 GHz with an interval of 0.40 GHz for Conv1×424, and from 6.875 to 1.0938 GHz with an interval of 0.0625 GHz for CF3,16. The results indicate that the SiPh-CA provides NN inference accuracies competitive with those of the electronic digital computer, as shown in Figs. 6(b) and 6(c).
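    The output positions quoted for the two fully connected NNs and for Conv1×424 can be reproduced from the frequency settings with a small sketch. It assumes that the $r$-th output collects the products whose weight index leads the image index by $r$ strides, so that it appears at $(f_A - f_B) + r \cdot \text{stride} \cdot f_0$; the helper name is hypothetical, and the CF3,16 positions are not modeled here.

```python
import numpy as np

def output_frequencies_ghz(f_weights, f_image, stride_elems, f0, n_outputs):
    """Beat frequencies of the useful outputs, lowest first (all values in GHz)."""
    return (f_weights - f_image) + np.arange(n_outputs) * stride_elems * f0

fc64 = output_frequencies_ghz(16.0, 7.0, stride_elems=64, f0=0.01, n_outputs=10)
fc32 = output_frequencies_ghz(16.0, 7.0, stride_elems=32, f0=0.02, n_outputs=10)
conv = output_frequencies_ghz(12.0, 5.0, stride_elems=40, f0=0.01, n_outputs=10)
print(fc64)
print(fc32)
print(conv)
```

    Both fully connected settings give the same 0.64 GHz spacing because halving the image length is offset by doubling the frequency interval.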

    It should be noted that only the amplitudes at the corresponding output frequencies are extracted, without considering the phases. Therefore, the convolution results cannot represent negative values, and the amplitude extraction, i.e., taking absolute values, serves as the nonlinear operation in our experiments and the implemented NNs. In addition, in the first three tested NNs, no nonlinear operation other than taking absolute values is applied in the inference phase, since the softmax function included in the cross-entropy during training does not change the position of the maximum value in multi-category classification tasks. However, in general scenarios, commonly used nonlinear functions, e.g., ReLU, cannot be realized in the SiPh-CA physical structure and require extra digital circuits. Nevertheless, the SiPh MZMs can apply trigonometric nonlinear operations to the operands, improving the NN expression capability.

    4. DISCUSSION

    We compare the SiPh-CA performance and cost with the SiPh-NA [41], as shown in Table 2. To achieve the same convolution speed within one computing channel, the SiPh-CA uses only two of the four MZMs, which occupy most of the PIC footprint, and therefore needs half the number of drivers, which consume much of the power. As a result, between 1/3 and 1/2 of the PIC footprint and 37.01% of the power consumption (from 1.494 [41] to 0.941 pJ/MAC) can be saved, as shown in Fig. 2 and analyzed in Appendix B. With advanced technologies, the total energy efficiency has the potential to reach 4.581 pJ/MAC, a reduction of 16.48% from the 5.485 pJ/MAC of the SiPh-NA. However, the computing precision of the SiPh-CA is lower than that of the SiPh-NA [41] because of a smaller OSNR, i.e., a lower signal power and a higher noise. The splitters before and after the MZMs in the I and Q paths limit the maximum signal power, and incompletely suppressed carrier and noise remain in the grayed paths, although the grayed MZMs in Fig. 2(a) theoretically do not function in the ICT. Therefore, removing the unnecessary devices can improve the OSNR and raise the SiPh-CA computing precision above that of the SiPh-NA, as shown by the simulations in Appendix C; this will be verified in future work. For the NN computing performance, although the effective bit precision is lower, the inference accuracies obtained from the SiPh-CA for FC10×64, FC10×32, and Conv1×424 are better, which benefits from the NN robustness. Instead of using OFCs and WDM technology to scale the architecture, we developed the power-splitter-based broadcasting scheme for the SiPh-CA to enhance the parallelism. In this scheme, the image vector needs to be modulated only once, saving power as mentioned in Section 2, and the power splitters do not introduce the inter-channel crosstalk and intra-channel signal power variations caused by WDM multiplexer filters. Moreover, the difficulty of processing the output spectrum remains manageable as the parallelism grows, since the input data spectra are not expanded by OFC combination. Furthermore, while the SiPh-CA can only process one-dimensional vectors of limited length, e.g., n=128 in our presented experiments, we demonstrate a scalability method for large-size, high-order tensors in Appendix D. Finally, compared to other optical computing frameworks and the NVIDIA A100 GPU [56], whose performances have been summarized and compared with the SiPh-NA in Ref. [41], our SiPh-CA also achieves a competitive computing speed, energy efficiency, and NN inference accuracy.
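    The percentage savings quoted in this comparison follow directly from the stated energy figures, as the short arithmetic sketch below shows. It also divides the Table 2 power values by the 1.024 TOPS convolution speed, which is an assumed reading of how the "advanced technologies" energy figures relate to the tabulated powers; the helper name is hypothetical.

```python
def percent_reduction(before, after):
    """Relative saving in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before

measured_saving = percent_reduction(1.494, 0.941)      # pJ/MAC, SiPh-NA -> SiPh-CA
projected_saving = percent_reduction(5.485, 4.581)     # pJ/MAC with advanced technologies
print(round(measured_saving, 2))    # 37.01
print(round(projected_saving, 2))   # 16.48

# Power (W) divided by speed (tera-operations per second) gives pJ per operation.
energy_na = 5.616 / 1.024
energy_ca = 4.690 / 1.024
print(energy_na, energy_ca)
```

    The last two values match the quoted 5.485 and 4.581 figures to within rounding.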

    Table 2. Cost and Performance Comparison of the SiPh-NA [41] and the SiPh-CA

    | | SiPh-NA | SiPh-CA |
    |---|---|---|
    | Devices in one cell | 4 MZMs, 4 drivers, 1 180° hybrid, 1 BPD, 1 TIA, 2 PS, 4 splitters | 2 MZMs, 2 drivers, 1 180° hybrid, 1 BPD, 1 TIA |
    | Power consumption with advanced technologies | 5.616 W | 4.690 W |
    | Bit precision | 3.71 | 3.08 |
    | Convolution speed | 1.024 TOPS | 1.024 TOPS |
    | Max. MVM matrix | 10×64 | 10×64 |
    | Max. NN | FC10,64 | FC10,64 |
    | Inference accuracy | 95.78% | 96.22% |
    | Scalability | N | N |

    Although the SiPh-CA can accelerate convolutions and MVMs for NNs, its computing precision is still unsatisfactory and deteriorates as the input vector size and the computing speed increase. This results from the limited extinction ratio and linear modulation range of the SiPh-CA. The limited extinction ratio leaves noise from the incompletely suppressed carrier, and the limited linear modulation range constrains the total signal power. Both degrade the OSNR as the input vector size increases. Therefore, to improve the SiPh-CA performance, the modulators should be optimized at the device and material levels for a higher extinction ratio and a broader linear modulation range. In addition, during the experiments, the computing precision increases when the output signals are averaged across multiple repetition periods. However, the symbol rate R then drops, and so does the computing speed. To maintain the computing speed while improving the computing precision, the same vectors could be fed into multiple computing channels and their outputs averaged at the architecture level. Furthermore, the computing precision can be improved further by reducing the impact of the nonflat EO and OE frequency responses of the IC-TROSA, that is, by testing and pre-compensating the modulation amplitudes and tuning the output amplitudes at the corresponding frequencies. The process is similar to frequency equalization in transmission systems [57] and should be realized in the hardware implementation rather than through software-based calibration.
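    The trade-off between averaging and speed can be illustrated numerically. The sketch below is not the authors' procedure: it assumes zero-mean Gaussian noise that is independent between repetitions, in which case averaging k repetitions lowers the noise standard deviation by a factor of sqrt(k), which corresponds to about 0.5·log2(k) extra effective bits, while the effective symbol rate falls to R/k. The function names and the unit noise level are illustrative assumptions.

```python
import math
import random


def averaged_noise_std(noise_std, repeats, trials=5000, seed=0):
    """Empirical std of the mean of `repeats` independent Gaussian noise samples."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, noise_std) for _ in range(repeats)) / repeats
        for _ in range(trials)
    ]
    centre = sum(means) / trials
    return math.sqrt(sum((m - centre) ** 2 for m in means) / trials)


def precision_gain_bits(repeats):
    """Ideal gain in effective bits when the noise std drops by sqrt(repeats)."""
    return 0.5 * math.log2(repeats)


for k in (1, 4, 16):
    std = averaged_noise_std(1.0, k)
    print(f"k={k:2d}: noise std ~ {std:.3f}, "
          f"ideal gain = {precision_gain_bits(k):.1f} bit, "
          f"effective symbol rate = R/{k}")
```

    Averaging across parallel channels instead of across repetition periods would keep the same noise reduction, under the same independence assumption, without dividing the symbol rate.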

    5. CONCLUSION

    In this work, a novel scalable, integrated, and flexible SiPh-CA is proposed and realized with sub-IC-TROSAs. The one-channel SiPh-CA prototype reaches a convolution speed of 1.024 TOPS, and the inference accuracies of the implemented NNs are comparable to those of the same NNs processed on a digital computer. The coherent PIC consumes less than 1 pJ/MAC. The energy efficiency of the SiPh-CA can fall below 5 pJ/MAC when it is fabricated with advanced CMOS and silicon photonic technology. These merits are competitive with those of the latest optical computing frameworks and superior to our previous SiPh-NA architecture. Furthermore, although its scalability is inferior to that of the SiPh-NA, by leveraging the splitter-based broadcasting scheme, the SiPh-CA can be linearly scaled to process convolutions between an image and multiple kernels in parallel, maintaining manageable complexity for optoelectronic systems and avoiding the nonideal elements associated with OFCs and WDM technologies. In future work, by optimizing the device and circuit design together with software/hardware codesign, we believe the SiPh-CA can achieve a faster computing speed, higher computing precision, and higher energy efficiency.

    APPENDIX A: SYSTEM CONSTRUCTION

    We use a 1550 nm laser source (N7714A, Agilent Technologies) as the carrier; its output is set to 16 dBm and amplified to 20 dBm by an EDFA (AEDFA-C-EX-DWDM-23-8FA, Amonics). The carrier is then modulated by the MZMs in the silicon photonic IC-TROSA, which are driven by the electronic signals from an AWG (8194A, Keysight). Afterward, the modulated signals are amplified by another EDFA of the same type. These EDFAs compensate for the insertion losses of the chip couplers and can be removed when all necessary structures are integrated into one chip to reduce footprint and costs. The IC-TROSAs are powered by a power supply (IT8322, ITECH). The IC-TROSA output is sampled by an oscilloscope (UXR0594A, Keysight).

    APPENDIX B: POWER CONSUMPTION ANALYSIS

    In the experimental system, the SiPh-CA core consists of a laser source, an integrated coherent transmitter with two drivers, and an integrated coherent receiver with a trans-impedance amplifier. Together they consume 2.456 W, of which the laser source accounts for 0.398 W with the latest technology and laser quantum efficiency [42], and the sub-IC-TROSA accounts for 2.058 W. [The four-channel ICT and ICR consume 4.50 W, including the driver chip at 0.65 W/channel, the TIA chip at 0.192 W/channel, and the photonic integrated circuit (PIC) at 1.132 W. For the two-channel ICT and one-channel ICR, the driver and TIA chips consume 1.3 W and 0.192 W, respectively, and the PIC consumes no more than 0.566 W.] No extra TEC is used, which reduces the total power consumption. The EDFAs, which provide 9 dB gain to compensate for the coupling losses between the ICT/ICR and other separate devices, can be removed once all the necessary device blocks are integrated within one chip. Thus, the overall energy efficiency is 2.398 pJ/MAC, and that of the photonic chips is 0.941 pJ/MAC. If the analog optoelectronic post-processing [41] is included, the total power consumption of the fully integrated chip based on the latest technologies can reach

    $$
    \begin{aligned}
    P_{\mathrm{total}} &= P_{\mathrm{laser,ICT/ICR}} + P_{\mathrm{2\,DACs}} + P_{\mathrm{2\,Drivers}} + P_{\mathrm{1\,TIA}} + P_{\mathrm{1\,Driver}} + P_{\mathrm{OFCs}} + 2(2N-1)P_{\mathrm{BPDs}} + (2N-1)P_{\mathrm{TIA}} \\
    &= 0.964 + 0.176\times 2 + 0.180\times 2 + 0.0112 + 0.18 + 1 + 2\times 255\times 3\times 50\times 10^{-9} + 255\times 0.00715 \\
    &= 4.690\ \mathrm{W},
    \end{aligned}
    $$

    where 2N−1 is the result size. The power values come from the laser source, ICT, and ICR of our experiments, the DAC [58], the drivers [59], the TIAs with a GHz-level bandwidth [60] and a MHz-level bandwidth [61], the OFC source [62], and BPDs with a supply voltage of 3 V and a dark current of 50 nA. Note that the power of the OFC source is only an indicative power level based on the reference; in a practical implementation it may be larger and may be further increased by amplifiers. The total system energy efficiency can therefore potentially reach 4.581 pJ/MAC.
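    The budget above can be re-added term by term. The sketch below uses only the numbers printed in this appendix; N = 128 is inferred from the result size 2N−1 = 255, and the pJ/MAC values are obtained by dividing each power by the 1.024 TOPS convolution speed, which is our reading of how the quoted figures arise rather than a formula stated in the text. The dictionary keys and the helper energy_pj are illustrative names.

```python
N = 128                  # inferred vector length: 2N - 1 = 255 result points
result_size = 2 * N - 1

# Power terms of the fully integrated chip, in watts, as listed in the equation.
terms_w = {
    "laser + ICT/ICR": 0.964,
    "2 DACs": 0.176 * 2,
    "2 drivers": 0.180 * 2,
    "1 TIA": 0.0112,
    "1 driver": 0.18,
    "OFC source": 1.0,
    "2(2N-1) BPDs": 2 * result_size * 3 * 50e-9,   # 3 V supply, 50 nA dark current
    "(2N-1) TIAs": result_size * 0.00715,
}
total_w = sum(terms_w.values())

OPS_PER_SECOND = 1.024e12   # 1.024 TOPS convolution speed


def energy_pj(power_w):
    """Energy per operation in picojoules for a given power draw."""
    return power_w / OPS_PER_SECOND * 1e12


print(f"Total power, fully integrated: {total_w:.2f} W")
print(f"Fully integrated system:       {energy_pj(total_w):.3f} pJ/MAC")
print(f"Experimental core (2.456 W):   {energy_pj(2.456):.3f} pJ/MAC")
print(f"Photonic chips (0.964 W):      {energy_pj(0.964):.3f} pJ/MAC")
```

    The BPD term contributes well under a milliwatt, so the budget is dominated by the post-processing TIAs, the OFC source, and the laser with the coherent PIC.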

    APPENDIX C: PERFORMANCE IMPROVEMENT

    As mentioned in the main text, the sub-IC-TROSA-based SiPh-CA obtains a lower computing precision than the IC-TROSA-based SiPh-NA, possibly because of the noise from the unnecessary MZM paths. To verify this, we use Lumerical Interconnect to simulate the prototype system, as shown in Fig. 7. Table 3 presents the simulation system settings. The three systems are (1) the IC-TROSA-based SiPh-NA, where four MZMs exist and the signals are single-sideband modulated [Fig. 7(a)]; (2) the experimental sub-IC-TROSA-based SiPh-CA, where four MZMs exist but only two work and the signals are double-sideband modulated [Fig. 7(b)]; and (3) the expected MZM-based SiPh-CA, where only two MZMs exist and work and the signals are double-sideband modulated [Fig. 7(c)]. Their simulation results are presented in Figs. 8(a)–8(c), respectively. We perform the convolution between two four-dimensional vectors, [1,1,1,1]∗[1,1,1,1]=[1,2,3,4,3,2,1], in the three simulation systems. By comparing Figs. 8(a) and 8(b), we can conclude that the extra unnecessary paths introduce noise into the computing, reducing the computing precision. The improvement from Figs. 8(a) and 8(b) to Fig. 8(c) shows that the SiPh-CA can surpass the SiPh-NA with the equivalent structure by removing the unnecessary paths and modulating all the light coupled into the PIC input port. Note that we build the simulation systems in the same way as our experiments to mimic practical scenarios. In future work, however, the unnecessary polarization multiplexer (Pol.MUX), EDFA, polarization controller (Pol. Controller), and polarization demultiplexer (Pol.deMUX) will be removed from the integrated chips for further performance improvements.
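    The test convolution used in the three simulations can be reproduced digitally as a reference. The function below is a plain full linear convolution written for illustration; it is not part of the simulation setup, and its name is ours.

```python
def full_convolution(a, b):
    """Full linear convolution of two sequences (output length len(a) + len(b) - 1)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y   # each product lands at the sum of its indices
    return out


reference = full_convolution([1, 1, 1, 1], [1, 1, 1, 1])
print(reference)  # [1, 2, 3, 4, 3, 2, 1]
```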

    Table 3. Simulation System Setting

    | Parameter | Sampling Rate | Number of Samples | Source Frequency | Linewidth | RIN | Extinction Ratio | Insertion Loss |
    |---|---|---|---|---|---|---|---|
    | Value | 100 GHz | 101 | 193 THz | 50 kHz | −140 dB | 30 dB | 15 dB |

    Figure 7. Simulation diagrams. (a) Diagram of the IC-TROSA-based SiPh-NA. (b) Diagram of the sub-IC-TROSA-based SiPh-CA. (c) Diagram of the MZM-based SiPh-CA.

    Figure 8. Simulation results. (a) IC-TROSA-based SiPh-NA output spectrum. (b) IC-TROSA-based SiPh-CA output spectrum. (c) MZM-based SiPh-CA output spectrum.

    APPENDIX D: SCALABILITY METHOD FOR CONVOLUTIONS OF LARGE-SIZE TENSORS OF HIGH ORDER

    Generally, the photonic or electronic computing core can process convolutions between tensors of limited sizes. To process convolutions between large-size tensors of high order, convolution decomposition is a commonly employed method in digital computers, in addition to utilizing multiple computing cores to enhance parallelism. Herein, we present the convolution decomposition method for the SiPh-CA.

    Figure 9. Convolution decomposition for large-size tensors of high order.

    Assume the SiPh-CA can complete a convolution between an $N_0$-length vector and $K_0$ $M_0$-length vectors with an acceptable bit resolution. To complete a convolution between one image tensor $T_I^{N_1,N_2,C}$ of size $N_1\times N_2\times C$ and $K$ kernel tensors $T_k^{M_1,M_2,C}$ of size $M_1\times M_2\times C$ in the CNN, where, without loss of generality, we assume $K>K_0$, $N_1\cdot N_2>N_0$, and $M_1>M_2>M_0$, as shown in Fig. 9, the following decompositions are performed.

    Note: in Steps 3 and 4, zero-padding may be required at the beginning or end of each row of $T_{I,c}^{N_1,N_2}$ according to the CNN structure, and it is applied before and after the kernel tensors to align the positions. In Step 5, the convolution data from the SiPh-CA outputs should be shifted before summation to obtain the complete results. For simplicity, the zero-padding and data shifts are omitted from this description.
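    The shift-then-sum idea in the note can be illustrated in one dimension. The sketch below is a generic overlap-add decomposition and not necessarily the authors' exact steps: it covers only the length limit on the input vector, and it does not handle kernel splitting, multiple kernels, channels, or two-dimensional zero-padding. The parameter max_len plays the role of N0, and full_convolution is a digital stand-in for one call to the accelerator; both names are assumptions made for this example.

```python
def full_convolution(a, b):
    """Stand-in for one accelerator call: full linear convolution of two short vectors."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def blockwise_convolution(signal, kernel, max_len):
    """Convolve a long signal using only calls on blocks of at most `max_len` samples."""
    out = [0] * (len(signal) + len(kernel) - 1)
    for start in range(0, len(signal), max_len):
        block = signal[start:start + max_len]
        partial = full_convolution(block, kernel)
        for offset, value in enumerate(partial):
            # Shift each partial result by its block position before summing.
            out[start + offset] += value
    return out


signal = list(range(1, 11))   # length-10 input, longer than the assumed block limit
kernel = [1, 0, -1]
blocked = blockwise_convolution(signal, kernel, max_len=4)
direct = full_convolution(signal, kernel)
print(blocked == direct)
```

    Because convolution is linear, the shifted partial results add up to the direct result for any block length, which is why the outputs must be aligned before summation.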

    [2] M. Tan, Q. Le. EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning, 6105-6114(2019).

    [3] A. Vaswani, N. Shazeer, N. Parmar. Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000-6010(2017).

    [4] T. Brown, B. Mann, N. Ryder. Language models are few-shot learners. Proceedings of the 34th International Conference on Neural Information Processing Systems, 33, 1877-1901(2020).

    [5] A. Baevski, W.-N. Hsu, A. Conneau. Unsupervised speech recognition. Proceedings of the 35th International Conference on Neural Information Processing Systems, 34, 27826-27839(2021).

    [7] C.-J. Wu, R. Raghavendra, U. Gupta. Sustainable AI: environmental implications, challenges and opportunities. Proc. Mach. Learn. Syst., 4, 795-813(2022).

    [10] M. Jerry, P.-Y. Chen, J. Zhang. Ferroelectric FET analog synapse for acceleration of deep neural network training. IEEE International Electron Devices Meeting (IEDM), 6.2.1-6.2.4(2017).

    [13] K. Roy, I. Chakraborty, M. Ali. In-memory computing in emerging memory technologies for machine learning: an overview. 57th ACM/IEEE Design Automation Conference (DAC), 1-6(2020).

    [16] S. Schmitt, J. Klähn, G. Bellec. Neuromorphic hardware in the loop: training a deep spiking network on the brainscales wafer-scale system. International Joint Conference on Neural Networks (IJCNN), 2227-2234(2017).

    [18] S. Han. Efficient methods and hardware for deep learning(2017).

    [19] Y. Zhu, G. L. Zhang, T. Wang. Statistical training for neuromorphic computing using memristor-based crossbars considering process variations and noise. Design, Automation & Test in Europe Conference & Exhibition (DATE), 1590-1593(2020).

    [20] P. Spilger, E. Müller, A. Emmel. hxtorch: PyTorch for brainscales-2: perceptrons on analog neuromorphic hardware. IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning, 189-200(2020).

    [28] J. Gu, C. Feng, Z. Zhao. Efficient on-chip learning for optical neural networks through power-aware sparse zeroth-order optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35, 7583-7591(2021).

    [29] Y. Zhu, M. Liu, L. Xu. Multi-wavelength parallel training and quantization-aware tuning for wdm-based optical convolutional neural networks considering wavelength-relative deviations. Proceedings of the 28th Asia and South Pacific Design Automation Conference, 384-389(2023).

    [40] Y. Zhu, X. Zhang, X. Hua. Optoelectronic neuromorphic accelerator at 523.27 GOPS based on coherent optical devices. Optical Fiber Communication Conference, M2J-4(2023).

    [46] C. D. McGillem, G. R. Cooper. Continuous and Discrete Signal and System Analysis(1991).

    [47] X. Xiao, L. Wang, M. Luo. High baudrate silicon photonics for the next-generation optical communications. European Conference on Optical Communication (ECOC), 1-4(2022).

    [48] X. Hu, D. Wu, H. Zhang. Ultrahigh-speed silicon-based modulators/photodetectors for optical interconnects. Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2023).

    [54] F. Pedregosa, G. Varoquaux, A. Gramfort. Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 12, 2825-2830(2011).

    [55] A. Paszke, S. Gross, F. Massa. PyTorch: an imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 32, 8026-8037(2019).

    [59] S. Nakano, M. Nagatani, K. Tanaka. A 180-mW linear MZM driver in CMOS for single-carrier 400-Gb/s coherent optical transmitter. European Conference on Optical Communication (ECOC), 1-3(2017).

    [61] A. D. Güngördü, G. Dündar, M. B. Yelten. A high performance TIA design in 40 nm CMOS. IEEE International Symposium on Circuits and Systems (ISCAS), 1-5(2020).

    Ying Zhu, Lu Xu, Xin Hua, Kailai Liu, Yifan Liu, Ming Luo, Jia Liu, Ziyue Dang, Ye Liu, Min Liu, Hongguang Zhang, Daigao Chen, Lei Wang, Xi Xiao, Shaohua Yu, "Silicon photonics convolution accelerator based on coherent chips with sub-1 pJ/MAC power consumption," Photonics Res. 13, 497 (2025)

    Paper Information

    Category: Silicon Photonics

    Received: Jul. 18, 2024

    Accepted: Dec. 2, 2024

    Published Online: Feb. 10, 2025

    The Author Email: Xi Xiao (xiaoxi@noeic.com)

    DOI:10.1364/PRJ.536939
