Advanced Photonics, Volume 7, Issue 4, 046003 (2025)
Highly integrated all-optical nonlinear deep neural network for multi-thread processing
Optical neural networks have emerged as feasible alternatives to their electronic counterparts, offering significant benefits such as low power consumption, low latency, and high parallelism. However, the realization of ultra-compact nonlinear deep neural networks and multi-thread processing remain crucial challenges for optical computing. We present a monolithically integrated all-optical nonlinear diffractive deep neural network (AON-D2NN) chip for the first time. The all-optical nonlinear activation function is implemented using germanium microstructures, which provide low loss and are compatible with the standard silicon photonics fabrication process. Assisted by the germanium activation function, the classification accuracy is improved by 9.1% for four-classification tasks. In addition, the chip's reconfigurability enables multi-task learning in situ via an innovative cross-training algorithm, yielding two task-specific inference results with accuracies of 95% and 96%, respectively. Furthermore, leveraging the wavelength-dependent response of the chip, the multi-thread nonlinear optical neural network is implemented for the first time, capable of handling two different tasks in parallel. The proposed AON-D2NN contains three hidden layers with a footprint of only 0.73 mm². It can achieve ultra-low latency (172 ps), paving the way for realizing high-performance optical neural networks.
1 Introduction
The rapid advancement of artificial intelligence has significantly increased the demand for computing resources. However, the continued performance improvements of microelectronic chips are constrained by the deceleration of Moore’s law.1
Although diffractive optical neural networks (ONNs) have made remarkable strides, two fundamental challenges, the absence of optical nonlinear activation functions and the lack of multi-threaded processing capabilities, continue to impede further performance advancements. Currently, most ONNs achieve nonlinearity through microelectronic chips,11,31,32 necessitating frequent optical-to-electrical and electrical-to-optical conversions. This approach incurs significant penalties in terms of reduced operational bandwidth and increased power consumption, underscoring the critical need for high-performance, on-chip optical activation functions. Efforts to address this challenge can be broadly categorized into all-optical and optical-electrical-optical schemes. All-optical activation functions are hindered by high loss or complex fabrication processes, making them impractical for implementing deep neural networks.33
Therefore, finding suitable materials and architectures for implementing optical nonlinearity and multi-threaded processing is of great importance. To mitigate these challenges, we propose an all-optical nonlinear diffractive deep neural network (AON-D2NN) chip for the first time, supporting multi-thread processing. Linear and nonlinear operations are implemented by integrating thermal phase shifters and germanium microstructures, respectively, onto a diffractive slab waveguide. Benefiting from the low-loss germanium activation function, the chip achieves a compact design with three hidden layers within a footprint of only 0.73 mm². Compared with a linear ONN of similar configuration, it delivers a 9.1% improvement in accuracy on four-class classification tasks. The chip's reconfigurability enables multi-task learning in situ via an innovative cross-training algorithm, yielding two task-specific inference results with accuracies of 95% and 96%, respectively. Furthermore, the wavelength-dependent response of the chip facilitates multi-thread parallel processing, demonstrated by the simultaneous execution of two binary classification tasks on a single chip with accuracies of 92.5% and 90.5%. This novel architecture offers significant advantages in integration density, multi-thread inference, scalability of input size, and network depth, paving the way for densely integrated, multi-thread, nonlinear ONNs.
2 Results
2.1 Principle of the Proposed AON-D2NN
The conceptual design of the proposed chip is illustrated in Fig. 1(a), showcasing its ability to process multiple classification tasks simultaneously via multiple "cores" embedded within the chip. Unlike conventional microelectronic chips, where cores are physically separated in space, the AON-D2NN exploits the wavelength dimension to implement multiple cores within a shared physical structure. By utilizing wavelength-dependent light propagation, the chip enables distinct operations to be executed simultaneously at different wavelengths.
Figure 1. Conceptual diagram and structural schematic of the AON-D2NN.
Utilizing these wavelength-dependent properties, we design the AON-D2NN chip illustrated in Fig. 1(b). The chip simultaneously receives input light at multiple wavelengths. Input data are preprocessed by reshaping and encoding before being converted into voltage signals applied to the thermal phase shifters. Through the thermo-optic effect, temperature variations induce corresponding phase shifts of the input light, enabling phase encoding of the data. The light then propagates through the diffractive slab waveguide, where the computational operations are performed. By modulating the voltages applied to the thermal phase shifters, the local effective refractive index of the waveguide can be dynamically adjusted, steering the light propagation paths to execute arbitrary matrix operations.42 The nonlinear activation function is realized through the nonlinear absorption response of the germanium material: as light passes through regions covered with germanium microstructures, the activation function is inherently applied. Finally, the intensity distribution at the output ports is analyzed to extract the inference results, completing the computational process. A minimal numerical sketch of this pipeline is given below.
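To make the propagation-as-computation pipeline concrete, the following sketch simulates a phase-encoded forward pass in NumPy. It is a toy model under stated assumptions: the slab-waveguide segments are stood in for by random passive transfer matrices, the germanium response is approximated by a generic intensity-dependent absorption, and only the port counts (64 inputs, 10 outputs, three hidden layers) mirror the fabricated chip; none of this is the calibrated device model.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_OUT, N_LAYERS = 64, 10, 3  # ports and hidden layers, matching the chip

def transfer_matrix(n_out, n_in):
    """Stand-in passive transfer matrix for one diffractive slab segment."""
    m = rng.normal(size=(n_out, n_in)) + 1j * rng.normal(size=(n_out, n_in))
    q, _ = np.linalg.qr(m.conj().T)  # orthonormal columns -> no optical gain
    return q.conj().T

def ge_activation(field, p_sat=1.0):
    """Toy germanium nonlinearity: absorption grows with local intensity."""
    transmission = 1.0 / (1.0 + np.abs(field) ** 2 / p_sat)  # assumed form
    return field * np.sqrt(transmission)

def forward(phases):
    """Phase-encode the inputs, propagate through the layers, read intensities."""
    field = np.exp(1j * phases)  # thermo-optic phase encoding of the data
    for _ in range(N_LAYERS):
        field = transfer_matrix(N_IN, N_IN) @ field  # linear diffractive mixing
        field = ge_activation(field)                 # all-optical activation
    out = transfer_matrix(N_OUT, N_IN) @ field       # fan-in to output ports
    return np.abs(out) ** 2                          # photodetected intensities

intensities = forward(rng.uniform(0.0, np.pi, N_IN))
print("predicted class:", int(intensities.argmax()))  # brightest port wins
```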
2.2 Chip Fabrication and Characterization
The chip was fabricated using a standard silicon photonics foundry process. Detailed views of the chip's microscopic structure and the packaged device are shown in Figs. 2(a) and 2(b), respectively. The chip architecture comprises an input layer, three hidden layers, and an output layer. The input layer consists of 64 straight waveguides, each equipped with a thermal phase shifter for phase encoding. The hidden layers are constructed from a diffractive slab waveguide covered with thermal phase shifters and germanium microstructures; specifically, they contain seven columns of thermal phase shifters and two columns of germanium microstructures to support the computational tasks. The output layer is formed by 10 output waveguides. Additional details regarding the chip structure are provided in Note 1 in the Supplementary Material.
Figure 2. Fabrication and characterization of the AON-D2NN chip.
Figure 2(c) depicts the variation in output optical power at a specific port as a function of the electrical power applied to a thermal phase shifter. The experimental measurement shows that an electrical power of 48.3 mW is sufficient to induce a 2π phase shift. The thermal phase shifter exhibits a resistance of 650 Ω, allowing phase shifts from 0 to π with an applied voltage ranging from 0 to 4 V. The chip's nonlinear response curve, shown in Fig. 2(d), records the output optical power as input light is injected into the input waveguide. The results reveal a relatively high nonlinear threshold, attributable to the lateral diffusion of light during propagation, which reduces the power density. When the input light intensity exceeds this threshold, the response curve displays distinctly nonlinear behavior.
To further validate the feasibility of the proposed nonlinear activation function, a dedicated test structure was designed and fabricated. As illustrated in Fig. 2(e), the structure comprises a germanium block positioned above a single-mode straight waveguide. Figure 2(f) presents the response curve of this structure and the simulated optical propagation fields under varying input optical power. When light propagates through the region covered by the germanium block, it couples into the germanium material. As the input light intensity increases, the absorption intensifies, resulting in a nonlinear input-output relationship, consistent with findings in our previous studies.43,44 The measured insertion loss of the test structure overstates the nonlinear loss: because the waveguide supports only the fundamental mode, higher-order modes induced by the germanium microstructures dissipate, so the actual loss attributable to germanium absorption is lower. A simple phenomenological model of this response is given below.
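The measured trend in Fig. 2(f) can be summarized by a simple phenomenological transmission model; the specific functional form below is an illustrative assumption for intensity-induced absorption, not the fitted curve from the experiment:

$$ P_{\mathrm{out}} = T(P_{\mathrm{in}})\,P_{\mathrm{in}}, \qquad T(P_{\mathrm{in}}) = \frac{T_0}{1 + P_{\mathrm{in}}/P_{\mathrm{th}}}, $$

where $T_0$ is the low-power (linear) transmission and $P_{\mathrm{th}}$ is the nonlinear threshold power. For $P_{\mathrm{in}} \ll P_{\mathrm{th}}$ the response is essentially linear, whereas above $P_{\mathrm{th}}$ the growing absorption bends the input-output curve, producing the nonlinear behavior observed in the measurements.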
To evaluate the chip's insertion loss, we measured the light intensity at all output ports with no voltage applied to the thermal electrodes. The total insertion loss is 22.14 dB, consisting of 4.8 dB from the germanium material and 17.34 dB of diffraction loss, primarily due to higher-order-mode losses caused by mode mismatch as light transitions from the diffraction structure to the output waveguides. Notably, the total insertion loss does not increase with the length of the diffraction structure. Fabricated on a silicon-on-insulator platform using a standard complementary metal-oxide-semiconductor (CMOS) process, the chip features a simple structure with high tolerance to fabrication errors, making it well suited for large-scale manufacturing. Its compact footprint of only 0.73 mm² demonstrates a high level of integration density.
2.3 Performance of the Proposed AON-D2NN
Without nonlinearity, even deep neural networks would be confined to solving only linearly separable problems, as illustrated in Fig. 3(a). Nonlinear activation functions introduce essential nonlinearity to the model, enabling neural networks to model complex data distributions and tackle more advanced deep learning tasks. To demonstrate the effectiveness of the germanium-based activation function, two comparative experiments are carried out with and without activation functions. A detailed description of the experimental system is provided in Note 2 in the Supplementary Material.
Figure 3. Experimental results of the AON-D2NN.
We randomly select 200 images from the handwritten digit dataset for the training set and 120 images for the test set. Limited by the number of input ports, dimensionality reduction of the data is necessary. Using the bilinear interpolation algorithm, the images are resized directly to 8 × 8 pixels and then flattened into 64-element vectors, eliminating the need for additional feature-extraction steps. The data are linearly mapped to phases between 0 and π and loaded onto the input layer via the thermal phase shifters. After propagation through the network, the optical intensity distribution at the output ports is detected. Each output port corresponds to a classification category, and the category of the port exhibiting the highest intensity is chosen as the classification result. During training, the cross-entropy is employed as the loss function, defined as follows:

$$ L = -\sum_{i=1}^{C} y_i \log \hat{y}_i, $$

where $C$ is the number of categories, $y_i$ is the one-hot ground-truth label, and $\hat{y}_i$ is the softmax-normalized intensity of the $i$-th output port. A sketch of the preprocessing and loss computation follows.
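The snippet below illustrates this preprocessing and loss computation, assuming a standard 28 × 28 handwritten-digit input; the 8 × 8 target size follows from the 64 input waveguides, while the exact phase range and normalization used on the chip are assumptions.

```python
import numpy as np

def preprocess(img28, phase_max=np.pi):
    """Bilinear-resize a 28x28 image to 8x8, flatten, and map to input phases."""
    src = np.asarray(img28, dtype=float)
    xs = np.linspace(0.0, 27.0, 8)
    # Separable bilinear interpolation: first along width, then along height.
    rows = np.array([np.interp(xs, np.arange(28), r) for r in src])
    small = np.array([np.interp(xs, np.arange(28), c) for c in rows.T]).T
    flat = small.flatten()  # 64-element vector, one value per input waveguide
    return phase_max * (flat - flat.min()) / (np.ptp(flat) + 1e-12)

def cross_entropy(intensities, label):
    """Softmax-normalize the output-port intensities and score the true class."""
    p = np.exp(intensities - np.max(intensities))
    p /= p.sum()
    return -np.log(p[label] + 1e-12)
```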
We test both the AON-D2NN chip and a linear D2NN using the same datasets and compare the training results in Figs. 3(b)–3(f). Figure 3(b) presents the evolution of the loss function during the training iterations. With the activation function, the loss drops more rapidly and converges to a lower minimum. Figure 3(c) illustrates the variation in accuracy: the average test-set accuracy reaches 75.6% for the linear D2NN and 84.7% for the AON-D2NN, a 9.1% improvement due to the activation function. Figures 3(d) and 3(e) present the confusion matrices for both conditions, highlighting that the introduction of nonlinearity significantly enhances the chip's ability to distinguish the digit "8." These experimental results demonstrate that the AON-D2NN can effectively approximate more complex classifiers. The inference results for a subset of the data are shown in Fig. 3(f), demonstrating that the nonlinear activation function significantly enhances the chip's classification confidence and accuracy. In particular, the digit "8" is misclassified in the absence of the nonlinear activation function but correctly identified once it is introduced. Even on highly nonlinear datasets, the nonlinear activation function exhibits similar benefits; for further details, please refer to Note 4 in the Supplementary Material.
2.4 Multi-task Learning in situ of the AON-D2NN
Multi-task learning in neural networks refers to training multiple tasks on the same chip, so that the network produces the corresponding inference result for each input task. This approach significantly reduces hardware requirements and the need for repetitive model training. Research in computer science has explored various methods to facilitate multi-task learning. For instance, repeated pruning can eliminate neurons that have minimal impact on the network's output, and the remaining redundant neurons can then be used to train additional tasks, enabling multiple distinct tasks to be completed within a single network.47
Due to the sparsity inherent in the AON-D2NN, modifying the voltage values of certain thermal electrodes within a specific range has minimal impact on the output results. Consequently, although redundant neurons cannot be removed via traditional pruning techniques, the chip's reconfigurability enables multi-task learning in situ through an innovative cross-training algorithm. We choose two distinct binary classification tasks, handwritten digit recognition and handwritten letter classification, for training. Initially, the model is trained on the first task. Once the expected performance is achieved, we calculate the derivative of the loss function with respect to the voltage applied to each thermal phase shifter. Based on these derivatives, we identify the phase shifters with minimal impact on the loss function and use them to train the second task. When the accuracy of the first task on the test set falls below a predefined threshold, or when the accuracy of the second task reaches a predefined target, we return to training the first task. This alternating procedure ensures that both tasks ultimately perform well. Detailed training methods are provided in Note 5 in the Supplementary Material; a minimal sketch of the cross-training loop is given below.
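The following Python sketch illustrates the alternating logic of the cross-training algorithm under stated assumptions: gradients are estimated by finite differences, as in in-situ training where only input-output measurements of the chip are available, and the thresholds, learning rate, and fraction of shifters handed to the second task are placeholders rather than the experimental values.

```python
import numpy as np

def cross_train(v, loss, acc, acc_floor=0.90, acc_target=0.95,
                lr=0.05, eps=1e-3, share=0.5, rounds=20, steps=50):
    """Alternately train two tasks on one shared set of phase-shifter voltages.

    v:    initial voltage vector (one entry per thermal phase shifter)
    loss: [loss_task1, loss_task2], callables mapping voltages -> scalar loss
    acc:  [acc_task1, acc_task2], callables mapping voltages -> test accuracy
    """
    trainable = np.arange(v.size)  # task 1 initially uses every shifter
    task = 0
    for _ in range(rounds):
        grad = np.zeros_like(v)
        for _ in range(steps):  # a burst of SGD steps on the current task
            for i in trainable:  # in-situ finite-difference gradient estimate
                dv = np.zeros_like(v)
                dv[i] = eps
                grad[i] = (loss[task](v + dv) - loss[task](v - dv)) / (2 * eps)
            v[trainable] -= lr * grad[trainable]
        if task == 0:
            # Hand the shifters with the smallest |dL1/dV| over to task 2.
            trainable = np.argsort(np.abs(grad))[: int(share * v.size)]
            task = 1
        elif acc[0](v) < acc_floor or acc[1](v) >= acc_target:
            # Task 1 degraded (or task 2 converged): switch back to task 1.
            trainable = np.arange(v.size)
            task = 0
    return v
```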
The training results are shown in Fig. 4. As the task switches, the number of phase shifters used for training gradually decreases, and the test-set accuracy of both tasks eventually stabilizes, as seen in Fig. 4(a). Without altering the operational state of the chip, it can provide the corresponding inference result for each input task. Figure 4(b) presents the confusion matrices for the two tasks at various stages of training. The classification accuracies for the two tasks reach 95% and 96%, respectively, indicating that multiple tasks can be handled on a single chip without retraining or manufacturing multiple optical chips. This reduces both training and manufacturing costs, demonstrating the versatility of the chip.
Figure 4. Experimental results of multi-task learning. (a) Cross-training accuracy as a function of the number of iterations. (b) Confusion matrices of the two tasks during the iterations.
2.5 Multi-thread Parallel Processing of the AON-D2NN
Multi-thread parallel processing is a crucial strategy for enhancing chip performance, enabling the simultaneous execution of multiple distinct tasks. Because the diffraction structure exhibits distinct transmission behaviors at different wavelengths, the chip can process tasks in parallel in the wavelength domain.
To verify this, we simultaneously launch two light signals, at wavelengths of 1550 and 1560 nm, into the AON-D2NN. The specific experimental setup is shown in Fig. 5(a). Two different classification tasks are executed concurrently on the same chip. Because of the constraints of the input-layer design, it is not possible to load different data independently for each wavelength; therefore, a single input layer is shared across both tasks, with the data loaded onto the two wavelength channels simultaneously. Fundamentally, this approach is similar to using two different datasets for two classification tasks, as both scenarios require the same chip to execute two distinct computational operations. The outputs are then passed through filters to compute the results for each wavelength channel. In future work, independent input layers could be designed to accommodate multi-dataset processing. Task 1 distinguishes the digits "0," "5" from "1," "8," whereas task 2 distinguishes "0," "1" from "5," "8."
Figure 5. Experimental results of the multi-thread parallel processing of the AON-D2NN.
In the experiment, light exiting the output ports is directed into a wavelength-selective switch, which separates the optical signals at the two wavelengths. This enables the extraction of two output vectors, each representing the output values across all ports at one wavelength. Different wavelengths correspond to different tasks, and the loss functions for the two tasks are computed separately from their respective output vectors. The total training objective is then obtained by summing the individual loss values. In this experiment, the loss function remains the cross-entropy, and training is performed using the parallel stochastic gradient descent algorithm, consistent with the methodology described in Sec. 2.3; a minimal sketch follows.
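The sketch below illustrates this joint objective together with a parallel (simultaneous-perturbation) gradient update; the forward models, perturbation size, and learning rate are illustrative placeholders rather than the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def combined_loss(v, forwards, labels):
    """Sum the per-wavelength cross-entropy losses over both tasks.

    forwards: {wavelength: callable mapping voltages -> output-port intensities}
    labels:   {wavelength: true class index for the current sample}
    """
    total = 0.0
    for wl, fwd in forwards.items():
        out = fwd(v)                       # intensities filtered at this wavelength
        p = np.exp(out - np.max(out))      # softmax normalization
        p /= p.sum()
        total += -np.log(p[labels[wl]] + 1e-12)
    return total

def spgd_step(v, loss, lr=0.2, sigma=0.02):
    """One parallel-SGD update: perturb all voltages at once and step along
    the gradient estimated from the change in the summed loss."""
    delta = sigma * rng.choice([-1.0, 1.0], size=v.size)  # simultaneous perturbation
    dl = loss(v + delta) - loss(v - delta)
    return v - lr * dl * delta / (2.0 * sigma**2)
```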
The experimental results, shown in Fig. 5(b), reveal that the maximum classification accuracies for the two parallel tasks are 92.5% and 91.5%, respectively. Despite performing two tasks in parallel, the accuracy penalty is minimal: only about 2% and 2.5%, respectively, compared with processing each task sequentially. Figure 5(c) presents the output distribution for a subset of the data. For task 1, a higher light intensity at port 1 indicates that the input belongs to the "0"/"5" category; for task 2, a higher intensity at port 1 signifies that the input belongs to the "0"/"1" category. These findings validate that the proposed architecture effectively supports multi-threaded operation.
3 Discussions and Conclusions
The proposed AON-D2NN leverages the propagation-as-computation characteristic, enabling inference with both low power consumption and ultra-low latency. The power consumption arises from the thermal phase shifters and the external control equipment required to sustain the chip's operational state. The latency, defined as the time delay between the input layer and the output layer, is experimentally demonstrated to be only 172 ps (more details are presented in Note 6 in the Supplementary Material).
Table 1 presents a comparison of the key performance indicators of the AON-D2NN with other state-of-the-art nonlinear deep ONNs. Thanks to the low-loss nonlinear activation function, the proposed architecture accommodates three hidden layers within a compact footprint. In addition, the architecture supports multi-thread processing, further broadening its range of potential applications.
Table 1. Key performance indicators of the AON-D2NN compared with state-of-the-art nonlinear deep ONNs.
The performance of the AON-D2NN can be further optimized at multiple levels. Strategies such as expanding the chip size, fine-tuning the dimensions of the thermal phase shifters, adjusting the spacing between layers, and increasing the number of parallel wavelengths can all contribute to enhanced performance. Note that the maximum number of threads is limited by the structural scale; in this study, the experimental chip supports a maximum of four threads within the C-band. A detailed discussion can be found in Note 7 in the Supplementary Material. In operation, the primary power consumption of the AON-D2NN arises from the thermal phase shifters on the diffractive slab waveguide, which are necessary to maintain the chip's active state. Replacing these thermal phase shifters with non-volatile phase-change materials could enable zero static power consumption, leading to a significant improvement in the chip's energy efficiency.49
Increasing the number of cascaded layers is also a key strategy for enhancing network performance. In optical-electrical-optical (OEO) schemes, regenerating the optical signal at each layer theoretically enables unlimited cascading, a key advantage of such systems. In fully all-optical approaches, however, inherent losses and the presence of a nonlinear threshold fundamentally limit the number of cascaded layers, making infinitely deep architectures impractical. Nevertheless, germanium offers distinct advantages over other nonlinear materials, including lower loss and compatibility with standard fabrication processes, thereby allowing a greater number of cascaded layers. As shown in Fig. 2(f), at the current germanium length, light undergoes two coupling periods as it propagates through the germanium-covered region. Reducing the germanium length decreases the number of coupling periods, thereby mitigating the loss associated with the nonlinear activation function and enabling the AON-D2NN to support more hidden layers.
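As a rough back-of-envelope bound (our illustration, not an analysis from the paper), the attainable depth follows from a simple power budget: each layer must still deliver enough power to cross the activation threshold, so

$$ N_{\max} \approx \frac{P_{\mathrm{in}}\,[\mathrm{dBm}] - P_{\mathrm{th}}\,[\mathrm{dBm}]}{\alpha\,[\mathrm{dB/layer}]}, $$

where $P_{\mathrm{in}}$ is the launch power, $P_{\mathrm{th}}$ is the nonlinear threshold of the germanium activation, and $\alpha$ is the per-layer loss. Lowering $\alpha$, for instance by shortening the germanium region as discussed above, directly increases $N_{\max}$.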
Throughput is typically limited by the response speed of the nonlinear activation function. Recently, an all-optical nonlinear activation function designed and fabricated using two-dimensional materials demonstrated an impressive response speed of 2.08 THz.52 Integrating such high-speed nonlinear activation functions into the proposed architecture could unlock ultra-high throughput in the future.
In inference tasks, diffractive architectures offer distinct advantages over Mach-Zehnder interferometer (MZI) networks and microring arrays: they achieve higher integration density than MZI networks and exhibit greater robustness against disturbances than microring arrays. However, diffractive architectures also have limitations, including relatively high diffraction loss due to higher-order-mode losses caused by mode mismatch as light transitions from the diffraction structure to the output waveguides. In addition, they lack rigorous mathematical proofs for constructing arbitrary matrices, relying primarily on numerical analysis to validate their functionality, which makes them less suitable for precise matrix operations. By contrast, MZI networks and microring arrays do not face these challenges, as they exhibit low loss and can accurately implement arbitrary matrices. Nevertheless, in inference tasks an exact mathematical formulation of the system is often unnecessary, because algorithmic training can effectively optimize the system for specific task requirements; this is precisely where diffractive architectures excel.
In summary, based on the low-loss germanium activation function, we design and fabricate an AON-D2NN chip for the first time, achieving dense integration with three hidden layers. The architecture excels in classification tasks within a compact footprint, occupying only 0.73 mm², and is fully compatible with standard CMOS processes. We also develop a cross-training algorithm that enables two distinct tasks to be trained on a single chip, allowing task-specific inference based on the input data. Furthermore, the AON-D2NN supports multi-threaded parallel processing, handling two classification tasks simultaneously, and achieves ultra-low latency (172 ps). These features underscore the significant advantages of the AON-D2NN in integration density, multi-threaded inference, input scalability, and network depth, paving the way for densely integrated, multi-thread nonlinear ONNs.
Jialong Zhang is a master's candidate at Huazhong University of Science and Technology (HUST), China. He currently works on on-chip diffractive neural networks. His work aims to optimize the diffractive structure and explore integrated sensing and computing.
Bo Wu is a PhD student at Huazhong University of Science and Technology, China. He currently works on the optimization of optical matrices and the monolithic integration of various photonic computing architectures, including optical convolutional neural networks, optical recurrent neural networks, optical Ising machines, and so on. His work aims to improve the integration density and energy efficiency of optical intelligent computing.
Shiji Zhang is a PhD student at Huazhong University of Science and Technology, China. He currently works on optical convolution and high-performance optical activation functions. His work aims to explore high-performance optical systems.
Junwei Cheng received his undergraduate training in optics engineering from HUST and his PhD from HUST in 2024. He specializes in developing optical neural networks based on the silicon photonics platform. His work aims to provide efficient computing hardware for artificial intelligence applications.
Yilun Wang received his PhD in electronics science and technology from Huazhong University of Science and Technology (HUST) in 2021, and then worked as a postdoctoral researcher at the Wuhan National Laboratory for Optoelectronics, HUST. Currently, he works as an associate professor at National University of Defense Technology. His research interests focus on silicon-based optoelectronic heterogeneous integrated devices and their applications in optical communications and computing.
Hailong Zhou received his undergraduate training in optics engineering from HUST and his PhD from HUST in 2017. He now works as an associate professor at the Wuhan National Laboratory for Optoelectronics, HUST. He is mainly engaged in research on optical matrix computation and its applications, including optical neural computation and optical logic computation. In recent years, he has published more than 30 SCI-indexed academic papers.
Jianji Dong is a professor at Huazhong University of Science and Technology, where he received his PhD in 2008. He then worked as a postdoctoral research fellow in the Department of Engineering at the University of Cambridge until 2010. His current research focuses on optical computing. In 2024, he was supported by the National Natural Science Foundation for Distinguished Young Scholars of China. He serves as the executive editor of Frontiers of Optoelectronics, organizes photonics open courses, and has edited a special issue on optoelectronic computing.
Xinliang Zhang received his PhD in physical electronics from the HUST, Wuhan, China, in 2001. He is currently the president of Xidian University and a professor at the Wuhan National Laboratory for Optoelectronics and School of Optical and Electronic Information, HUST. He is the author or coauthor of more than 400 journal articles. His current research interest is optoelectronic devices and integration. In 2016, he was elected as an OSA fellow.
Jialong Zhang, Bo Wu, Shiji Zhang, Junwei Cheng, Yilun Wang, Hailong Zhou, Jianji Dong, Xinliang Zhang, "Highly integrated all-optical nonlinear deep neural network for multi-thread processing," Adv. Photon. 7, 046003 (2025)
Category: Research Articles
Received: Feb. 6, 2025
Accepted: Apr. 17, 2025
Posted: Apr. 17, 2025
Published Online: May 19, 2025
The Author Email: Hailong Zhou (hailongzhou@hust.edu.cn), Jianji Dong (jjdong@hust.edu.cn)