Efficient stochastic parallel gradient descent training for on-chip optical processor

Yuanjian Wan; Xudong Liu; Guangze Wu; Min Yang; Guofeng Yan; Yu Zhang; Jian Wang

doi:10.29026/oea.2024.230182

Introduction

The rapid development of the internet and artificial intelligence places ever-increasing demands on the data capacity of fiber and free-space optical communications. Space-division multiplexing (SDM)^{1, 2} employs multiple orthogonal spatial channels to carry data simultaneously for capacity scaling, and has been realized in few-mode fibers (FMFs)^{3, 4}, multi-core fibers^{5, 6}, ring-core fibers^{7, 8}, and free-space communication systems^{9, 10}. However, many SDM schemes suffer from inevitable channel crosstalk because the channels mix while they are transmitted together, which degrades the signal quality at the receiver. Therefore, digital signal processing (DSP) is usually required to mitigate the channel crosstalk and recover the original signal in the electrical domain^{11-13}. Unfortunately, high-speed DSP chips in the electrical domain are highly complex, difficult to design, and power-hungry¹⁴. In this scenario, a laudable goal would be to undo the spatial channel mixing in the optical domain¹. Additionally, an optical switching function is highly desirable in SDM optical communication systems to facilitate flexible data management^{15-17}.

In recent years, integrated reconfigurable optical processors have been shown to perform linear transformations such as filtering^{18-20}, switching^{21, 22}, temporal integration²³, temporal differentiation²⁴, Hilbert transformation²⁵ and matrix multiplication²⁶, and have been widely used in photonic microwave filters²⁷, channel equalization²⁸, optical switching²⁹, multi-task photonic signal processors^{20, 30-32}, quantum computing³³ and optical computing^{34-37}. Moreover, they have been reported to automatically undo the crosstalk caused by mode mixing in a multi-mode waveguide³⁸, an FMF^{39, 40} or a Mach−Zehnder interferometer (MZI) mesh²⁹ and to obtain better signal quality after optimization. However, most existing integrated linear optical processors or optical neural networks (ONNs) are implemented by pre-training the weights of the matrix on an electronic computer and then adjusting the phase shifter voltages according to the training result, which is very inefficient^{16, 30}. Additionally, due to external disturbance, thermal crosstalk inside the chip and the fabrication tolerance of the complementary metal oxide semiconductor (CMOS) process, there is always a certain difference between the actual matrix of the integrated photonic chip and the target matrix, and this difference increases geometrically as the size of the chip increases. To cope with this problem, researchers have proposed on-chip training schemes for ONNs, including the gradient descent (GD) algorithm and its variants⁴¹, such as the numerical gradient descent algorithm²⁹, as well as the genetic algorithm (GA)^{42, 43}, the particle swarm optimization (PSO) algorithm⁴⁴, the bacterial foraging optimization algorithm (BFOA)⁴⁵ and so on. However, the GD algorithm usually requires a relatively explicit mathematical model of the system, which is difficult to establish for ONN training. Although swarm intelligence algorithms such as GA, PSO and BFOA do not depend on a mathematical model of the network, they require a sufficiently large population to guarantee reliable optimization results, which increases complexity. Furthermore, since the strategy of swarm intelligence algorithms is based on guessing and searching, it is difficult to determine their complexity, and their results are not stable enough. In general, the stochastic parallel gradient descent (SPGD) algorithm has a smaller computational load than the GA, PSO and BFOA algorithms.

In this paper, we experimentally train the integrated optical processor with the stochastic parallel gradient descent (SPGD) algorithm, which is widely used in adaptive optics systems^{46, 47} because of its few parameters, low complexity, high convergence efficiency and good stability. Moreover, we compare it with the GD, GA, and PSO algorithms and conclude that training with the SPGD algorithm needs less computation than the other algorithms, especially when the matrix is large. Lastly, using the fabricated on-chip optical processor trained by SPGD, we implement 6-channel self-configurable optical switching and channel descrambling with favorable performance, which may find applications in fiber-optic communication systems based on mode-division multiplexing (MDM), a subset of SDM, as illustrated in Fig. 1(a).

Figure 1. (a) Conceptual diagram of the on-chip optical processor for optical switching and channel descrambling in MDM communication systems. (b) Schematic configuration of the integrated reconfigurable optical processor. θ and ϕ denote the phase shifts of the phase shifters. MDM: mode-division multiplexing; MUX: multiplexer; DEMUX: demultiplexer.

Concept and principle

The concept and principle of the integrated optical processor are shown in Fig. 1(b). The processor can be divided into three parts. Parts (1) and (3) are unitary optical processors, representing the unitary matrices $U$ and $V^{†}$, respectively, and part (2) represents the diagonal matrix $Σ$. An arbitrary matrix $M$ can be decomposed into two unitary matrices $U$, $V$ and a diagonal matrix $Σ$ as $M = U Σ V^{†}$ by the singular value decomposition (SVD)^{48, 49}. When the power split ratio is ρ=0.5 and the phase shifts of the phase shifters are $θ$ and $ϕ$, the transfer matrix of a Mach−Zehnder interferometer (MZI) can be expressed as:

$U_{MZI} = \frac{1}{2} \begin{pmatrix} e^{i ϕ} (e^{i θ} - 1) & i (1 + e^{i θ}) \\ i e^{i ϕ} (1 + e^{i θ}) & 1 - e^{i θ} \end{pmatrix} = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix} .$ （1）
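Eq. (1) can be checked numerically. The following is a minimal sketch in Python with NumPy (a choice made for illustration, not taken from the paper). It builds $U_{MZI}$ for given phases, tests unitarity for a few random phase pairs, and evaluates the power splitting at θ = 0 and θ = π. The names `mzi_matrix` and `is_unitary` are hypothetical helpers.

```python
import numpy as np

def mzi_matrix(theta: float, phi: float) -> np.ndarray:
    """2x2 transfer matrix of Eq. (1) for a 50:50 MZI with phases theta and phi."""
    e_t = np.exp(1j * theta)
    e_p = np.exp(1j * phi)
    return 0.5 * np.array([
        [e_p * (e_t - 1), 1j * (1 + e_t)],
        [1j * e_p * (1 + e_t), 1 - e_t],
    ])

def is_unitary(u: np.ndarray, tol: float = 1e-12) -> bool:
    """True if u times its conjugate transpose is the identity."""
    return np.allclose(u @ u.conj().T, np.eye(u.shape[0]), atol=tol)

# Unitarity holds for arbitrary phases (checked here on random samples)
rng = np.random.default_rng(0)
all_unitary = all(
    is_unitary(mzi_matrix(theta, phi))
    for theta, phi in rng.uniform(0.0, 2.0 * np.pi, size=(5, 2))
)

# Power splitting |u_jk|^2 at the two extreme settings of theta
cross = np.abs(mzi_matrix(0.0, 0.0)) ** 2    # all power crosses to the other port
bar = np.abs(mzi_matrix(np.pi, 0.0)) ** 2    # all power stays in the same port
```

In this sketch θ sets the power splitting between the two outputs, while ϕ only changes the relative phase of the first input column.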

It can be seen that $U_{MZI} \cdot U_{MZI}^{†} = E$, where $E$ is the identity matrix, so the MZI can represent any rotational operation in a two-dimensional unitary space. In other words, by changing $θ$ and $ϕ$ through the applied voltages, we can realize any second-order unitary matrix $U(2)$. It can be proven that any N×N unitary matrix $U(N)$ can be decomposed into a product of a series of $U(2)$ matrices as:

$U(N) = R_{12} R_{13} \cdots R_{l m} \cdots R_{n (n - 1)} .$ （2）

Therefore, we can use a cascaded MZI mesh to construct any N×N unitary matrix and, together with the diagonal part from the SVD, any real matrix.
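The cascade of $U(2)$ blocks can be sketched numerically as follows. Each MZI of Eq. (1) is embedded into an N×N identity on two adjacent waveguides, and the embedded matrices are multiplied. The rectangular layout with alternating columns used here is an assumption made for illustration; the actual arrangement on the chip is the one drawn in Fig. 1(b). The names `embed` and `mesh_matrix` are hypothetical.

```python
import numpy as np

def mzi_matrix(theta, phi):
    """2x2 MZI transfer matrix of Eq. (1)."""
    e_t, e_p = np.exp(1j * theta), np.exp(1j * phi)
    return 0.5 * np.array([[e_p * (e_t - 1), 1j * (1 + e_t)],
                           [1j * e_p * (1 + e_t), 1 - e_t]])

def embed(n, m, block):
    """Act with a 2x2 block on waveguides m and m+1; leave the others untouched."""
    r = np.eye(n, dtype=complex)
    r[m:m + 2, m:m + 2] = block
    return r

def mesh_matrix(n, phases):
    """Product of embedded MZIs for an assumed rectangular layout with n columns."""
    u = np.eye(n, dtype=complex)
    k = 0
    for col in range(n):
        # Even columns couple pairs (0,1), (2,3), ...; odd columns couple (1,2), (3,4), ...
        for m in range(col % 2, n - 1, 2):
            theta, phi = phases[k]
            u = embed(n, m, mzi_matrix(theta, phi)) @ u
            k += 1
    return u

n = 6
n_mzi = n * (n - 1) // 2                      # 15 MZIs, i.e. N(N-1) = 30 phase shifts
rng = np.random.default_rng(0)
phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_mzi, 2))
u = mesh_matrix(n, phases)
power = np.abs(u) ** 2                        # light power distribution matrix
```

Because every factor is unitary, the product is unitary as well, so each column of `power` sums to one in this lossless model.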

Optical switching in optical networks can be described by a matrix. For example, the matrix means the switching state. Besides, crosstalk between channels in an SDM communication system can be represented by a matrix $\tilde{U}$, so descrambling amounts to training the optical processor so that its transmission matrix $U$ satisfies $U \tilde{U} = E$, which means that the data mixed into other channels is returned to the original channel. Thus, we generate a unitary matrix randomly and use

$J = \sum_{j = 1}^{N} \sum_{k = 1}^{N} {| U_{j k}^{sim} - E_{j k} |}^{2},$ （3）

as the loss function ($U^{sim}$ is $U \tilde{U}$ and $E$ is the identity matrix) to train the optical processor with the stochastic parallel gradient descent (SPGD) algorithm, which is well suited to optimizing complex systems that have many control variables and for which an accurate mathematical model cannot be established. The flow chart is shown in Fig. 2. Before optimization begins, we set the random perturbation $Δ u$, the learning rate $γ$ and the total number of iterations $T$, and randomly generate $U$; we then reset the voltages $u_{i}$ applied to all phase shifters and calculate the evaluation function $J [u_{i}]$. Next, we generate a random disturbance voltage $Δ u_{i}$ and apply $u_{i} + Δ u_{i}$ and $u_{i} - Δ u_{i}$ in turn to calculate $J [u_{i} + Δ u_{i}]$ and $J [u_{i} - Δ u_{i}]$. Then we obtain $Δ J = J [u_{i} + Δ u_{i}] - J [u_{i} - Δ u_{i}]$ and update the voltages by $u_{i + 1} = u_{i} + γ Δ u_{i} Δ J$. Finally, we check whether the iteration index $i$ is greater than $T$: if it is, we end the loop; otherwise, we continue with the next iteration. Compared with the GD algorithm, SPGD changes all variables together before evaluating $J$, rather than changing each variable one by one and evaluating the function after every change within an iteration, which greatly reduces the amount of computation.
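The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code. The perturbation is drawn as ±δ on every control (a common SPGD choice that the text does not specify). The update is written with a minus sign, which is equivalent to the update rule in the text with a negative $γ$, so that $J$ decreases. A toy quadratic function stands in for the measured evaluation function. The name `spgd_minimize` and all parameter values are hypothetical.

```python
import numpy as np

def spgd_minimize(loss, u0, gamma, delta, iters, rng):
    """Minimize loss(u) by perturbing all entries of u in parallel."""
    u = np.array(u0, dtype=float)
    history = [loss(u)]
    for _ in range(iters):
        # Random +/-delta perturbation applied to every control at once
        du = delta * rng.choice([-1.0, 1.0], size=u.shape)
        # Two-sided evaluation: two settings of the whole voltage vector
        dj = loss(u + du) - loss(u - du)
        # Third setting: the updated voltages (minus sign for descent)
        u = u - gamma * dj * du
        history.append(loss(u))
    return u, history

# Toy stand-in for the measured J: squared distance to a hidden optimum (assumption)
rng = np.random.default_rng(0)
optimum = rng.uniform(0.0, 1.0, size=10)

def toy_loss(u):
    return float(np.sum((u - optimum) ** 2))

u_final, history = spgd_minimize(
    toy_loss, np.zeros(10), gamma=2.0, delta=0.1, iters=500, rng=rng
)
```

Each pass through the loop sets the full voltage vector three times (plus perturbation, minus perturbation, update), regardless of how many phase shifters there are. On real hardware `loss` would be replaced by a routine that applies the voltages and measures the transmission matrix.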

Figure 2. Flow chart of the stochastic parallel gradient descent (SPGD) algorithm.

To verify the optimization effect of the algorithm, we use an electronic computer to train the on-chip optical processor for optical switching, optical channel descrambling, and both together (i.e. optical channel descrambling and switching), with the obtained results shown in Fig. 3. Fig. 3(a−c) show the results for an optical switching matrix. The light power distribution matrix is calculated from the cascade of unitary matrices mentioned above. Since six inputs can be mapped one-to-one onto six outputs in 6! = 720 ways, there are 720 6×6 optical switching matrices in total, all of which are encoded as 1–720 and can be set as targets for optimization. The emulated light power distributions after training are shown in Fig. 3(a) for the switching state $I_{1} - O_{2}$, $I_{2} - O_{1}$, $I_{3} - O_{5}$, $I_{4} - O_{6}$, $I_{5} - O_{3}$, $I_{6} - O_{4}$; the corresponding normalized light intensity distributions are shown in Fig. 3(b), and the evaluation function versus the iteration rounds of optimization is shown in Fig. 3(c). During the optical channel descrambling process, we randomly generate a set of phases in part (1) of our chip to emulate the crosstalk matrix $\tilde{U}$ and optimize the phase shifts of the MZIs in part (1) of our chip to minimize the evaluation function $J$. Fig. 3(d) shows the normalized light intensity distributions before descrambling and Fig. 3(e) shows the normalized light intensity distributions after training. Fig. 3(f) shows the evaluation function versus the iteration rounds. On this basis, we set the optical switching matrix T as the target and optimize the phase shifts to minimize the evaluation function $J$, which means that the optical switching and channel descrambling functions are achieved on the same chip. Fig. 3(g) shows the normalized light intensity distributions before training with crosstalk when the switching state is $I_{1} - O_{5}$, $I_{2} - O_{3}$, $I_{3} - O_{2}$, $I_{4} - O_{4}$, $I_{5} - O_{1}$, $I_{6} - O_{6}$. Fig. 3(h) shows the normalized light intensity distributions after training with crosstalk in the same switching state and Fig. 3(i) shows the evaluation function versus the iteration rounds.
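The count of 720 switching matrices can be reproduced by enumerating the permutations of six ports. The sketch below does this in Python and builds the target matrix for the switching state of Fig. 3(a). The lexicographic numbering used to assign codes 1 to 720 and the convention that rows index outputs and columns index inputs are assumptions, since the text does not say how the states are encoded.

```python
from itertools import permutations
import numpy as np

N = 6
# states[c - 1][i] is the zero-based output port assigned to input port i
states = list(permutations(range(N)))

def switching_matrix(state):
    """Permutation matrix with rows = output ports and columns = input ports."""
    t = np.zeros((len(state), len(state)))
    for inp, out in enumerate(state):
        t[out, inp] = 1.0
    return t

# I1-O2, I2-O1, I3-O5, I4-O6, I5-O3, I6-O4 written with zero-based indices
state = (1, 0, 4, 5, 2, 3)
code = states.index(state) + 1      # position in the assumed lexicographic numbering
T = switching_matrix(state)
```

Every row and every column of `T` contains exactly one 1, which is what distinguishes a valid switching state from an arbitrary 0/1 matrix.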

Figure 3. Training results in electronic computer for optical switching, optical channel descrambling, and optical channel descrambling and switching. (a) Emulated light power distributions and (b) normalized light intensity distributions after training when the switching state is I₁−O₂, I₂−O₁, I₃−O₅, I₄−O₆, I₅−O₃, I₆−O₄. (d, e) Normalized light intensity distributions (d) before and (e) after training when randomly generating a set of phases in the part (1) of our chip to emulate crosstalk. (g, h) Normalized light intensity distributions (g) before and (h) after training with crosstalk when the switching state is I₁−O₅, I₂−O₃, I₃−O₂, I₄−O₄, I₅−O₁, I₆−O₆. (c, f, i) The evaluation function changing with iteration rounds.

Experimental results

The experimental setup for online training is depicted in Fig. 4(a). The silicon photonic chip functions as a 6×6 matrix. Continuous-wave (CW) light at a wavelength of 1550 nm is routed to 6 paths and coupled into the photonic chip in turn through a 1×6 optical micro-electro-mechanical systems (MEMS) switch controlled by a computer. For each input, the optical power distribution $U_{i}$ at the six output ports is detected by an optical power meter array, and the matrix $U_{\exp} = (U_{1}, U_{2}, U_{3}, U_{4}, U_{5}, U_{6})$ is the transmission matrix of the chip. Polarization controllers (PCs) are used in each path to adjust the polarization of the light and ensure maximum coupling efficiency between the fibers and the optical waveguides. The phase shifters in the chip are driven by a multi-output regulated power supply controlled by the same computer. Fig. 4(b) shows the microscopy image of the optical processor, which is fabricated on a standard 220-nm silicon-on-insulator (SOI) wafer using 248 nm deep ultraviolet (DUV) lithography and inductively coupled plasma (ICP) etching at United Microelectronics Center (CUMEC). The silicon photonic chip is packaged on a PCB, and light is coupled into the chip through a fiber array.
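The column-by-column measurement described above can be sketched as follows. The callback `read_output_powers` is a hypothetical stand-in for "select one input port with the MEMS switch and read the power meter array"; here it is replaced by a simulated lossless chip whose unitary matrix is drawn at random, which is an assumption made only so that the example runs without hardware.

```python
import numpy as np

def measure_power_matrix(read_output_powers, n_ports):
    """Assemble the power transmission matrix one input port at a time."""
    columns = [
        np.asarray(read_output_powers(port), dtype=float)
        for port in range(n_ports)
    ]
    # Column i holds the output power distribution U_i for input port i
    return np.column_stack(columns)

# Simulated chip: the Q factor of a random complex matrix is unitary
rng = np.random.default_rng(0)
n = 6
a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
u_chip, _ = np.linalg.qr(a)

def simulated_readout(port):
    """Launch unit power into one input port and return the six output powers."""
    field_in = np.zeros(n, dtype=complex)
    field_in[port] = 1.0
    return np.abs(u_chip @ field_in) ** 2

p_meas = measure_power_matrix(simulated_readout, n)
```

One full matrix measurement therefore costs six switch settings and six power readings, which is why the number of matrix measurements per iteration matters for the training time.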

Figure 4. (a) Schematic of experimental configuration. (b) Microscopy image of optical processor. VSA: voltage source array; PD: photodetector array.

In the process of training, we randomly generate an arbitrary switching matrix T as the target of optimization and use the SPGD algorithm to optimize the configuration of the matrix elements online. The obtained results are shown in Fig. 5. Fig. 5(a) shows the evaluation function versus the iteration rounds, where the X-axis represents the iteration round and the Y-axis represents the evaluation function defined by:

$J = \sum_{j = 1}^{N} \sum_{k = 1}^{N} {| U_{j k}^{\exp} - T_{j k} |}^{2} .$ （4）

Figure 5. Online training results for optical switching at a wavelength of 1550 nm. (a) The evaluation function changing with iteration rounds when the switching state is I₁−O₃, I₂−O₁, I₃−O₄, I₄−O₆, I₅−O₂, I₆−O₅. The insets show the light power distributions when the round of iteration equals 50, 300, and 600, respectively. (b) The measured light power distributions after training. (c) The normalized light intensity distributions of measured results. (d, e) The measured light power distributions and normalized light intensity distributions when the switching state is I₁−O₃, I₂−O₆, I₃−O₄, I₄−O₂, I₅−O₁, I₆−O₅.

The insets in Fig. 5(a) show the light power distributions when the round of iteration equals 50, 300, and 600, respectively. The light power and normalized intensity distributions after training are shown in Fig. 5(b) and Fig. 5(c), respectively, and the crosstalk of all channels is less than −12.5 dB. In order to reveal the reconfigurability of our chip, we choose another switching state as the target, and the light power and normalized intensity distributions are shown in Fig. 5(d) and Fig. 5(e), respectively.
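As an illustration of how the evaluation function of Eq. (4) and a crosstalk figure in dB can be computed from a normalized power matrix, consider the sketch below. The crosstalk definition used here (strongest unwanted output relative to the wanted output, per input port) is an assumption, since the text does not define it, and the 3×3 numbers are invented for the example and are not measured data.

```python
import numpy as np

def evaluation_function(p_meas, target):
    """Sum of squared element-wise differences, as in Eq. (4)."""
    diff = np.asarray(p_meas, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(np.abs(diff) ** 2))

def crosstalk_db(p_meas, target):
    """Per input port: strongest leaked power relative to the wanted output, in dB."""
    p = np.asarray(p_meas, dtype=float)
    wanted = np.asarray(target) > 0.5
    signal = np.where(wanted, p, 0.0).sum(axis=0)   # power at the intended output
    leak = np.where(wanted, 0.0, p).max(axis=0)     # strongest unintended output
    return 10.0 * np.log10(leak / signal)

# Invented example: 94 % of the power reaches the target port, 3 % leaks to each other port
target = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=float)
p_example = 0.94 * target + 0.03 * (1.0 - target)

j_value = evaluation_function(p_example, target)
xt = crosstalk_db(p_example, target)
```

With this definition, a lower (more negative) crosstalk value and a smaller $J$ both indicate that the power matrix is closer to the target switching state.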

In addition to optical switching, we also perform online automatic configuration of arbitrary unitary matrices to mitigate crosstalk between channels in MDM communication systems. Before the experiment starts, we randomly generate a set of voltages and apply them to the phase shifters in part (1) of the chip, which is equivalent to constructing a unitary matrix that emulates the mixing between six modes in an FMF. During the experiment, we use the SPGD algorithm to configure the phase shifter voltages so as to minimize the evaluation function defined by

$J = \sum_{j = 1}^{N} \sum_{k = 1}^{N} {| U_{j k}^{\exp} - T_{j k} |}^{2} .$ （5）

The evaluation function versus the iteration rounds is shown in Fig. 6(a), and the insets show the light power distributions when the round of iteration equals 1, 300, and 600, respectively. The light power distributions of the 6 outputs before descrambling are shown in Fig. 6(b) and the distributions after descrambling are shown in Fig. 6(c). The crosstalk clearly decreases as the number of iterations increases, and it is less than −11 dB after descrambling. Fig. 6(d) and Fig. 6(e) show the descrambling results of training when generating another crosstalk matrix.

Figure 6. Online training results for optical channel descrambling at a wavelength of 1550 nm. (a) The evaluation function changing with iteration rounds. The insets show the light power distributions when the round of iteration equals 1, 300, and 600, respectively. (b) The light power distributions before training. (c) The light power distributions after training. (d, e) The results of training when generating another matrix $\tilde{U}$.

Moreover, we also implement both optical switching and channel descrambling (two important modules in the MDM system) together on the same chip. As in the descrambling experiment, we randomly generate a unitary matrix $\tilde{U}$ to emulate the crosstalk and use part (2) of the chip to configure the matrix $U$ so that $\tilde{U} U = E$. The obtained results are shown in Fig. 7. Fig. 7(a) shows the training process, and the insets show the light power distributions when the round of iteration equals 1, 100, and 400, respectively. The light power distributions before and after training are shown in Fig. 7(b) and Fig. 7(c), respectively, when the switching state is $I_{1} - O_{4}$, $I_{2} - O_{1}$, $I_{3} - O_{5}$, $I_{4} - O_{6}$, $I_{5} - O_{3}$, $I_{6} - O_{2}$. Fig. 7(d) and Fig. 7(e) show the results of training for another crosstalk matrix with the switching state $I_{1} - O_{5}$, $I_{2} - O_{3}$, $I_{3} - O_{1}$, $I_{4} - O_{6}$, $I_{5} - O_{2}$, $I_{6} - O_{4}$.
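For reference, the matrix that the training is trying to reach can be written in closed form when the crosstalk matrix is known. The sketch below assumes that the light first experiences the crosstalk $\tilde{U}$ and then the processor $U$, so that the overall matrix is $U \tilde{U}$ (the text writes the product in both orders), and that the goal is a switching matrix $T$ rather than $E$. Under these assumptions $U = T \tilde{U}^{†}$. In the experiment $\tilde{U}$ is not known to the optimizer, so this only illustrates the target and is not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Random unitary crosstalk matrix: the Q factor of a random complex matrix
a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
u_tilde, _ = np.linalg.qr(a)

# Switching state I1-O4, I2-O1, I3-O5, I4-O6, I5-O3, I6-O2 (zero-based outputs)
state = np.array([3, 0, 4, 5, 2, 1])
t = np.zeros((n, n))
t[state, np.arange(n)] = 1.0        # rows = outputs, columns = inputs

# Ideal processor setting under the assumed ordering U @ U_tilde = T
u_ideal = t @ u_tilde.conj().T

power_before = np.abs(u_tilde) ** 2            # scrambled power distribution
power_after = np.abs(u_ideal @ u_tilde) ** 2   # equals the switching matrix
```

Since `u_ideal` is a product of unitary matrices it is itself unitary, so it lies within the set of matrices a lossless MZI mesh can represent.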

Figure 7. Online training results for optical channel descrambling and switching at a wavelength of 1550 nm. (a) The evaluation function changing with iteration rounds when the switching state is I₁−O₄, I₂−O₁, I₃−O₅, I₄−O₆, I₅−O₃, I₆−O₂. The insets show the light power distributions when the round of iteration equals 1, 100, and 400, respectively. (b) The light power distributions before training. (c) The light power distributions after training. (d, e) The results of training when generating another matrix $\tilde{U}$ and the switching state is I₁−O₅, I₂−O₃, I₃−O₁, I₄−O₆, I₅−O₂, I₆−O₄.

Finally, in order to compare the signal quality before and after descrambling and to verify the application of our photonic chip in fiber-optic communication systems with high-speed advanced modulation signals, we set up the experimental configuration shown in Fig. 8(a). A 16-ary quadrature amplitude modulation (16-QAM) signal at 20 Gbaud is modulated onto an optical carrier by a lithium niobate in-phase and quadrature (IQ) modulator driven by an arbitrary waveform generator (AWG). The light is then split into six branches by a 1×6 power coupler and passes through six fiber delay lines that decorrelate the channels. The output of the chip is connected to a coherent receiver, where the received 16-QAM signal and the local oscillator (LO) are mixed for coherent detection. The detected signals are collected by an optical real-time oscilloscope (OSC) and sent to a computer to obtain the constellations and the bit-error rate (BER) by off-line DSP. An erbium-doped fiber amplifier (EDFA) is used to amplify the signal light. Meanwhile, a variable optical attenuator (VOA) is used to change the optical signal-to-noise ratio (OSNR) of the output signal for measuring the BER performance. The obtained results are shown in Fig. 8(b−f). The BER performance under four conditions is shown in Fig. 8(b). At the same OSNR, the BER after descrambling is nearly one-tenth of that before descrambling, so the signal quality is clearly improved after descrambling in our chip. The back-to-back constellation is shown in Fig. 8(c), while the constellation of the signal passing through the chip without crosstalk is shown in Fig. 8(d). Fig. 8(e) is the constellation of the signal with random perturbations applied by part (1) of the chip and Fig. 8(f) is the constellation after descrambling. Comparing Fig. 8(e) and Fig. 8(f) shows that the signal quality is greatly improved and that the designed and fabricated on-chip optical processor assisted by SPGD training has a significant effect on signal descrambling.

Figure 8. Experimental setup and measured results for optical channel descrambling. (a) Experimental setup for the 6×6 optical descrambling systems. (b) The measured BER performance for back-to-back, optimization without crosstalk, before optimization with crosstalk, and after optimization with crosstalk systems. (c) The measured constellation chart at the back-to-back. (d) The measured constellation chart without crosstalk. (e) The measured constellation chart before optimization with crosstalk. (f) The measured constellation chart after optimization with crosstalk. PC: polarization controller; AWG: arbitrary waveform generator; EDFA: erbium-doped fiber amplifier; VOA: variable optical attenuator; OSC: oscilloscope; DSP: digital signal processing.

Discussion

At present, the scale of neural networks based on MZI meshes has reached 64×64⁵⁰. The online learning of such large-scale matrices is very time-consuming. To obtain the transmission matrix, we need to switch the optical input port in turn through a 1×6 MEMS optical switch and collect the light intensity distribution through the photodetector array. Besides, the photocurrents must be converted into digital signals, and the updated voltages of the phase shifters must be converted into analog signals. However, commonly available ADC and DAC chips on the market typically have a bandwidth not exceeding 1 GHz, and MEMS optical switch speeds typically range from kilohertz to megahertz, with some advanced designs achieving switching times in the nanosecond range; these steps account for most of the time consumed by online matrix training. The traditional GD algorithm first generates a set of variables $u_{1}, u_{2},..., u_{n}$ (n is the number of variables) randomly and calculates the evaluation function $J [u_{1}, u_{2},..., u_{n}]$. Then it applies a perturbation $Δ u$ to each variable in turn and measures the transmission matrix of the photonic chip to calculate the evaluation function $J_{k} [u_{1}, u_{2},..., u_{k} + Δ u,..., u_{n}] (k = 1,2,..., n)$. The partial derivative with respect to $u_{k}$ is calculated by $J'_{k} = \frac{J_{k} - J}{Δ u}$ and the gradient of the evaluation function J can be expressed as $\nabla J = (J'_{1}, J'_{2},..., J'_{n})$. Finally, the variables are updated for the next round, and so on until the evaluation function is less than the threshold. Therefore, in order to train an N×N unitary matrix with N(N−1) phase shifts, the GD algorithm needs to update the voltage N(N−1) times, measure the transmission matrix N(N−1) times, and calculate the evaluation function N(N−1) times in one iteration round. If the number of iterations is T, we need to update the variables and measure the transmission matrix N(N−1)×T times.

Additionally, swarm intelligence algorithms such as GA and PSO need to generate a population, in which each individual represents a set of variables to be optimized. During training, all individuals in the population must be evaluated and the best individual selected. Therefore, we need to calculate the transmission matrix and evaluation function M times in one iteration round (M represents the population size, typically ranging from 10 to 100), so we need to update the variables and measure the transmission matrix M×T times. In contrast, the SPGD algorithm used in our experiment applies the perturbation $Δ u$ to all variables together. Consequently, it only needs to update the voltages of the phase shifters three times per iteration round. As a result, we only need to update the variables and measure the transmission matrix 3×T times during the optimization. This significantly reduces the time and power consumption of the optimization. To further verify this conclusion, we numerically study the performance of the GD, GA, PSO and SPGD algorithms in training the same 6×6, 10×10, 16×16, and 32×32 matrices. Specifically, we generate a random unitary matrix as crosstalk and use the above four algorithms to optimize the on-chip processor so as to minimize ${| \tilde{U} U - E |}^{2}$. A threshold is set so that the program exits the loop automatically when the evaluation function falls below it, and the number of iterations is recorded. Because of the randomness of the SPGD algorithm, we average the calculated results to ensure their reliability. It is worth mentioning that both the GA and PSO algorithms need a sufficiently large population to ensure the optimization performance, but an overly large population wastes computation. Therefore, we train the photonic chip with different population sizes and take the best result. The results are shown in Table 1.
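The update counts stated above can be collected in a small cost model. The sketch below only encodes the three formulas from the text (N(N−1)×T for GD, M×T for GA and PSO, 3×T for SPGD); the function names are hypothetical, and the iteration counts passed in are examples chosen so that the GD results coincide with the GD row of Table 1.

```python
def updates_gd(n: int, t: int) -> int:
    """GD perturbs each of the N(N-1) phase shifters once per round."""
    return n * (n - 1) * t

def updates_swarm(m: int, t: int) -> int:
    """GA and PSO evaluate every one of the M individuals once per round."""
    return m * t

def updates_spgd(t: int) -> int:
    """SPGD sets all voltages three times per round, independent of N."""
    return 3 * t

# Per-round cost for growing matrix size: GD grows quadratically, SPGD stays at 3
per_round = {n: (updates_gd(n, 1), updates_spgd(1)) for n in (6, 10, 16, 32)}

# Example iteration counts consistent with the GD row of Table 1
gd_n6 = updates_gd(6, 23)      # 30 * 23
gd_n32 = updates_gd(32, 94)    # 992 * 94
```

The per-round cost of GD is 30, 90, 240 and 992 updates for N = 6, 10, 16 and 32, while SPGD needs 3 in every case, so SPGD can afford many more rounds before its total cost catches up.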

From Table 1, it can be seen that the SPGD algorithm requires much less computation than the GD, GA and PSO algorithms. When the matrix sizes are 6×6, 10×10 and 16×16, the difference in the total amount of computation between the four algorithms is relatively small. However, when the matrix size increases to 32×32, the number of updates for the SPGD algorithm is close to only one fifth of that for the traditional GD algorithm. This can be explained by the fact that the number of updates for the GD algorithm is N(N−1)×T, which grows quadratically with N. Consequently, it is foreseeable that the difference will become more apparent as the size of the matrix expands further. Therefore, the SPGD algorithm has significant implications for larger-scale optical matrix computation acceleration chips.

Table 1. Performance of different algorithms (number of updates for each matrix size N×N).

Algorithm   Number of updates   N=6     N=10      N=16    N=32
GD          N(N−1)×T            690     3870      13200   93248
GA          M×T                 1048    9046.67   39732   171200
PSO         M×T                 1024    5912      31056   116145
SPGD        3×T                 297.9   1092.6    4752.6  18053.1
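The relative cost of SPGD can be read off Table 1 directly. The short sketch below copies the table values and computes the ratio of SPGD updates to those of each other algorithm; the variable names are illustrative.

```python
sizes = [6, 10, 16, 32]
table = {
    "GD":   [690, 3870, 13200, 93248],
    "GA":   [1048, 9046.67, 39732, 171200],
    "PSO":  [1024, 5912, 31056, 116145],
    "SPGD": [297.9, 1092.6, 4752.6, 18053.1],
}

# ratios[name][i] = SPGD updates / updates of `name` at matrix size sizes[i]
ratios = {
    name: [spgd / other for spgd, other in zip(table["SPGD"], values)]
    for name, values in table.items()
    if name != "SPGD"
}

for name, values in ratios.items():
    print(name, [round(v, 3) for v in values])
```

For the 32×32 case the SPGD-to-GD ratio is 18053.1 / 93248 ≈ 0.19, which is the "close to one fifth" figure quoted in the discussion.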

Conclusion

We propose and demonstrate an integrated reconfigurable optical processor trained by the SPGD algorithm and use it for optical switching and channel descrambling. On this basis, we apply it to an optical communication system with 20-Gbaud 16-QAM signals and measure the constellations and BER performance, from which we conclude that the 6×6 integrated optical processor has application prospects in robust optical switching and channel descrambling for MDM-based fiber-optic communication systems. In addition, it can also be applied to various scenarios in SDM-based fiber and free-space optical communication systems. Lastly, we compare SPGD with the traditional GD algorithm as well as the GA and PSO algorithms and show its distinct advantages, especially when the matrix size is large. These results suggest that the SPGD algorithm has the potential to train large-scale optical matrix computation acceleration chips, and thereby to accelerate the complex optical matrix computations that are fundamental to optical processing applications such as signal processing and optical computing.

Category: Research Articles

Received: Sep. 30, 2023

Accepted: Feb. 4, 2024

Published Online: Jul. 19, 2024

The Author Email: Jian Wang (JWang)

DOI:10.29026/oea.2024.230182
