Advanced Photonics, Volume 7, Issue 4, 046003 (2025)
Highly integrated all-optical nonlinear deep neural network for multi-thread processing
Optical neural networks have emerged as feasible alternatives to their electronic counterparts, offering significant benefits such as low power consumption, low latency, and high parallelism. However, the realization of ultra-compact nonlinear deep neural networks and multi-thread processing remain crucial challenges for optical computing. We present a monolithically integrated all-optical nonlinear diffractive deep neural network (AON-D2NN) chip for the first time. The all-optical nonlinear activation function is implemented using germanium microstructures, which provide low loss and are compatible with the standard silicon photonics fabrication process. Assisted by the germanium activation function, the classification accuracy is improved by 9.1% for four-classification tasks. In addition, the chip's reconfigurability enables multi-task learning in situ via an innovative cross-training algorithm, yielding two task-specific inference results with accuracies of 95% and 96%, respectively. Furthermore, leveraging the wavelength-dependent response of the chip, the multi-thread nonlinear optical neural network is implemented for the first time, capable of handling two different tasks in parallel. The proposed AON-D2NN contains three hidden layers with a footprint of only 0.73 mm². It can achieve ultra-low latency (172 ps), paving the way for realizing high-performance optical neural networks.
1 Introduction
The rapid advancement of artificial intelligence has significantly increased the demand for computing resources. However, the continued performance improvements of microelectronic chips are constrained by the deceleration of Moore’s law.1
Although diffractive optical neural networks (ONNs) have made remarkable strides, two fundamental challenges, the absence of optical nonlinear activation functions and the lack of multi-threaded processing capabilities, continue to impede further performance advancements. Currently, most ONNs achieve nonlinearity through microelectronic chips,11,31,32 necessitating frequent optical-to-electrical and electrical-to-optical conversions. This approach incurs significant penalties in terms of reduced operational bandwidth and increased power consumption, underscoring the critical need for high-performance, on-chip optical activation functions. Efforts to address this challenge can be broadly categorized into all-optical and optical-electrical-optical schemes. All-optical activation functions are hindered by high loss or complex fabrication processes, making them impractical for implementing deep neural networks.33
Therefore, finding suitable materials and architectures for implementing optical nonlinearity and multi-threaded processing is of great importance. To mitigate these challenges, we propose an all-optical nonlinear diffractive deep neural network (AON-D2NN) chip for the first time, supporting multi-thread processing. Linear and nonlinear operations are implemented by integrating thermal phase shifters and germanium microstructures, respectively, onto a diffractive slab waveguide. Benefiting from the low-loss germanium activation function, the chip achieves a compact design with three hidden layers within a footprint of only 0.73 mm². Compared with a linear ONN of similar configuration, it delivers a 9.1% improvement in accuracy on four-class classification tasks. The chip's reconfigurability enables multi-task learning in situ via an innovative cross-training algorithm, yielding two task-specific inference results with accuracies of 95% and 96%, respectively. Furthermore, the wavelength-dependent response of the chip facilitates multi-thread parallel processing, demonstrated by the simultaneous execution of two binary classification tasks on a single chip with accuracies of 92.5% and 90.5%. This novel architecture offers significant advantages in integration density, multi-thread inference, scalability of input size, and network depth, paving the way for densely integrated, multi-thread, nonlinear ONNs.
2 Results
2.1 Principle of the Proposed AON-D2NN
The conceptual design of the proposed chip is illustrated in Fig. 1(a), showcasing its ability to process multiple classification tasks simultaneously via multiple "cores" embedded within the chip. Unlike conventional microelectronic chips, where cores are physically separated in space, the AON-D2NN exploits the wavelength dimension to implement multiple cores within a shared physical structure. By utilizing wavelength-dependent light propagation, the chip enables distinct operations to be executed simultaneously at different wavelengths.
Figure 1. Conceptual diagram and structural schematic of the AON-D2NN.
Utilizing these wavelength-dependent properties, we design the AON-D2NN chip illustrated in Fig. 1(b). The chip simultaneously receives input light at multiple wavelengths. Input data are preprocessed by reshaping and encoding before being converted into voltage signals applied to the thermal phase shifters. Through the thermo-optic effect, temperature variations induce corresponding phase shifts of the input light, enabling phase encoding of the data. The light then propagates through the diffractive slab waveguide, where the computational operations are performed. By modulating the voltages applied to the thermal phase shifters, the local effective refractive index of the waveguide can be dynamically adjusted, steering the light propagation paths to execute arbitrary matrix operations.42 The nonlinear activation function is realized through the nonlinear absorption response of the germanium material: as light passes through regions covered with germanium microstructures, the activation function is inherently applied. Finally, the intensity distribution at the output ports is analyzed to extract the inference results, completing the computational process. A minimal numerical sketch of this pipeline is given below.
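To make the propagation-as-computation pipeline concrete, the following sketch simulates a phase-encoded forward pass in NumPy. It is a toy model under stated assumptions: the slab-waveguide segments are stood in for by random passive transfer matrices, the germanium response is approximated by a generic intensity-dependent absorption, and only the port counts (64 inputs, 10 outputs, three hidden layers) mirror the fabricated chip; none of this is the calibrated device model.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_OUT, N_LAYERS = 64, 10, 3  # ports and hidden layers, matching the chip

def transfer_matrix(n_out, n_in):
    """Stand-in passive transfer matrix for one diffractive slab segment."""
    m = rng.normal(size=(n_out, n_in)) + 1j * rng.normal(size=(n_out, n_in))
    q, _ = np.linalg.qr(m.conj().T)  # orthonormal columns -> no optical gain
    return q.conj().T

def ge_activation(field, p_sat=1.0):
    """Toy germanium nonlinearity: absorption grows with local intensity."""
    transmission = 1.0 / (1.0 + np.abs(field) ** 2 / p_sat)  # assumed form
    return field * np.sqrt(transmission)

def forward(phases):
    """Phase-encode the inputs, propagate through the layers, read intensities."""
    field = np.exp(1j * phases)  # thermo-optic phase encoding of the data
    for _ in range(N_LAYERS):
        field = transfer_matrix(N_IN, N_IN) @ field  # linear diffractive mixing
        field = ge_activation(field)                 # all-optical activation
    out = transfer_matrix(N_OUT, N_IN) @ field       # fan-in to output ports
    return np.abs(out) ** 2                          # photodetected intensities

intensities = forward(rng.uniform(0.0, np.pi, N_IN))
print("predicted class:", int(intensities.argmax()))  # brightest port wins
```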
2.2 Chip Fabrication and Characterization
The chip was fabricated using a standard silicon photonics foundry process. Detailed views of the chip's microscopic structure and the packaged device are shown in Figs. 2(a) and 2(b), respectively. The chip architecture comprises an input layer, three hidden layers, and an output layer. The input layer consists of 64 straight waveguides, each equipped with a thermal phase shifter for phase encoding. The hidden layers are constructed from a diffractive slab waveguide covered with thermal phase shifters and germanium microstructures; specifically, they contain seven columns of thermal phase shifters and two columns of germanium microstructures to support the computational tasks. The output layer is formed by 10 output waveguides. Additional details regarding the chip structure are provided in Note 1 in the Supplementary Material.
Figure 2. Fabrication and characterization of the AON-D2NN chip.
Figure 2(c) depicts the variation in output optical power at a specific port as a function of the electrical power applied to a thermal phase shifter. The experimental measurement shows that an electrical power of 48.3 mW is sufficient to induce a 2π phase shift. The thermal phase shifter exhibits a resistance of 650 Ω, allowing phase shifts from 0 to π with an applied voltage ranging from 0 to 4 V. The chip's nonlinear response curve, shown in Fig. 2(d), records the output optical power as input light is injected into the input waveguide. The results reveal a relatively high nonlinear threshold, attributable to the lateral diffusion of light during propagation, which reduces the power density. When the input light intensity exceeds this threshold, the response curve displays distinctly nonlinear behavior.
To further validate the feasibility of the proposed nonlinear activation function, a dedicated test structure was designed and fabricated. As illustrated in Fig. 2(e), the structure comprises a germanium block positioned above a single-mode straight waveguide. Figure 2(f) presents the response curve of this structure and the simulated optical propagation fields under varying input optical power. When light propagates through the region covered by the germanium block, it couples into the germanium material. As the input light intensity increases, the absorption intensifies, resulting in a nonlinear input-output relationship, consistent with findings in our previous studies.43,44 The measured insertion loss of the test structure overstates the nonlinear loss: because the waveguide supports only the fundamental mode, higher-order modes induced by the germanium microstructures dissipate, so the actual loss attributable to germanium absorption is lower. A simple phenomenological model of this response is given below.
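The measured trend in Fig. 2(f) can be summarized by a simple phenomenological transmission model; the specific functional form below is an illustrative assumption for intensity-induced absorption, not the fitted curve from the experiment:

$$ P_{\mathrm{out}} = T(P_{\mathrm{in}})\,P_{\mathrm{in}}, \qquad T(P_{\mathrm{in}}) = \frac{T_0}{1 + P_{\mathrm{in}}/P_{\mathrm{th}}}, $$

where $T_0$ is the low-power (linear) transmission and $P_{\mathrm{th}}$ is the nonlinear threshold power. For $P_{\mathrm{in}} \ll P_{\mathrm{th}}$ the response is essentially linear, whereas above $P_{\mathrm{th}}$ the growing absorption bends the input-output curve, producing the nonlinear behavior observed in the measurements.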
To evaluate the chip's insertion loss, we measured the light intensity at all output ports with no voltage applied to the thermal electrodes. The total insertion loss is 22.14 dB, consisting of 4.8 dB from the germanium material and 17.34 dB of diffraction loss, primarily due to higher-order-mode losses caused by mode mismatch as light transitions from the diffraction structure to the output waveguides. Notably, the total insertion loss does not increase with the length of the diffraction structure. Fabricated on a silicon-on-insulator platform using a standard complementary metal-oxide-semiconductor (CMOS) process, the chip features a simple structure with high tolerance to fabrication errors, making it well suited for large-scale manufacturing. Its compact footprint of only 0.73 mm² demonstrates a high level of integration density.
2.3 Performance of the Proposed AON-D2NN
Without nonlinearity, even deep neural networks would be confined to solving only linearly separable problems, as illustrated in Fig. 3(a). Nonlinear activation functions introduce essential nonlinearity to the model, enabling neural networks to model complex data distributions and tackle more advanced deep learning tasks. To demonstrate the effectiveness of the germanium-based activation function, two comparative experiments are carried out with and without activation functions. A detailed description of the experimental system is provided in Note 2 in the Supplementary Material.
Figure 3. Experimental results of the AON-D2NN.
We randomly select 200 images from the handwritten digit dataset for the training set and 120 images for the test set. Limited by the number of input ports, dimensionality reduction of the data is necessary. Using the bilinear interpolation algorithm, the images are resized directly to 8 × 8 pixels and then flattened into 64-element vectors, eliminating the need for additional feature-extraction steps. The data are linearly mapped to phases between 0 and π and loaded onto the input layer via the thermal phase shifters. After propagation through the network, the optical intensity distribution at the output ports is detected. Each output port corresponds to a classification category, and the category of the port exhibiting the highest intensity is chosen as the classification result. During training, the cross-entropy is employed as the loss function, defined as follows:

$$ L = -\sum_{i=1}^{C} y_i \log \hat{y}_i, $$

where $C$ is the number of categories, $y_i$ is the one-hot ground-truth label, and $\hat{y}_i$ is the softmax-normalized intensity of the $i$-th output port. A sketch of the preprocessing and loss computation follows.
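The snippet below illustrates this preprocessing and loss computation, assuming a standard 28 × 28 handwritten-digit input; the 8 × 8 target size follows from the 64 input waveguides, while the exact phase range and normalization used on the chip are assumptions.

```python
import numpy as np

def preprocess(img28, phase_max=np.pi):
    """Bilinear-resize a 28x28 image to 8x8, flatten, and map to input phases."""
    src = np.asarray(img28, dtype=float)
    xs = np.linspace(0.0, 27.0, 8)
    # Separable bilinear interpolation: first along width, then along height.
    rows = np.array([np.interp(xs, np.arange(28), r) for r in src])
    small = np.array([np.interp(xs, np.arange(28), c) for c in rows.T]).T
    flat = small.flatten()  # 64-element vector, one value per input waveguide
    return phase_max * (flat - flat.min()) / (np.ptp(flat) + 1e-12)

def cross_entropy(intensities, label):
    """Softmax-normalize the output-port intensities and score the true class."""
    p = np.exp(intensities - np.max(intensities))
    p /= p.sum()
    return -np.log(p[label] + 1e-12)
```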
We test both the AON-D2NN chip and a linear D2NN using the same datasets and compare the training results in Figs. 3(b)–3(f). Figure 3(b) presents the evolution of the loss function during the training iterations. With the activation function, the loss drops more rapidly and converges to a lower minimum. Figure 3(c) illustrates the variation in accuracy: the average test-set accuracy reaches 75.6% for the linear D2NN and 84.7% for the AON-D2NN, a 9.1% improvement due to the activation function. Figures 3(d) and 3(e) present the confusion matrices for both conditions, highlighting that the introduction of nonlinearity significantly enhances the chip's ability to distinguish the digit "8." These experimental results demonstrate that the AON-D2NN can effectively approximate more complex classifiers. The inference results for a subset of the data are shown in Fig. 3(f), demonstrating that the nonlinear activation function significantly enhances the chip's classification confidence and accuracy. In particular, the digit "8" is misclassified in the absence of the nonlinear activation function but correctly identified once it is introduced. Even on highly nonlinear datasets, the nonlinear activation function exhibits similar benefits; for further details, please refer to Note 4 in the Supplementary Material.
2.4 Multi-task Learning in situ of the AON-D2NN
Multi-task learning in neural networks refers to training multiple tasks on the same chip, so that the network produces the corresponding inference result for each input task. This approach significantly reduces hardware requirements and the need for repetitive model training. Research in computer science has explored various methods to facilitate multi-task learning. For instance, repeated pruning can eliminate neurons that have minimal impact on the network's output, and the remaining redundant neurons can then be used to train additional tasks, enabling multiple distinct tasks to be completed within a single network.47
Due to the sparsity inherent in the AON-D2NN, modifying the voltage values of certain thermal electrodes within a specific range has minimal impact on the output results. Consequently, although redundant neurons cannot be removed via traditional pruning techniques, the chip's reconfigurability enables multi-task learning in situ through an innovative cross-training algorithm. We choose two distinct binary classification tasks, handwritten digit recognition and handwritten letter classification, for training. Initially, the model is trained on the first task. Once the expected performance is achieved, we calculate the derivative of the loss function with respect to the voltage applied to each thermal phase shifter. Based on these derivatives, we identify the phase shifters with minimal impact on the loss function and use them to train the second task. When the accuracy of the first task on the test set falls below a predefined threshold, or when the accuracy of the second task reaches a predefined target, we return to training the first task. This alternating procedure ensures that both tasks ultimately perform well. Detailed training methods are provided in Note 5 in the Supplementary Material; a minimal sketch of the cross-training loop is given below.
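The following Python sketch illustrates the alternating logic of the cross-training algorithm under stated assumptions: gradients are estimated by finite differences, as in in-situ training where only input-output measurements of the chip are available, and the thresholds, learning rate, and fraction of shifters handed to the second task are placeholders rather than the experimental values.

```python
import numpy as np

def cross_train(v, loss, acc, acc_floor=0.90, acc_target=0.95,
                lr=0.05, eps=1e-3, share=0.5, rounds=20, steps=50):
    """Alternately train two tasks on one shared set of phase-shifter voltages.

    v:    initial voltage vector (one entry per thermal phase shifter)
    loss: [loss_task1, loss_task2], callables mapping voltages -> scalar loss
    acc:  [acc_task1, acc_task2], callables mapping voltages -> test accuracy
    """
    trainable = np.arange(v.size)  # task 1 initially uses every shifter
    task = 0
    for _ in range(rounds):
        grad = np.zeros_like(v)
        for _ in range(steps):  # a burst of SGD steps on the current task
            for i in trainable:  # in-situ finite-difference gradient estimate
                dv = np.zeros_like(v)
                dv[i] = eps
                grad[i] = (loss[task](v + dv) - loss[task](v - dv)) / (2 * eps)
            v[trainable] -= lr * grad[trainable]
        if task == 0:
            # Hand the shifters with the smallest |dL1/dV| over to task 2.
            trainable = np.argsort(np.abs(grad))[: int(share * v.size)]
            task = 1
        elif acc[0](v) < acc_floor or acc[1](v) >= acc_target:
            # Task 1 degraded (or task 2 converged): switch back to task 1.
            trainable = np.arange(v.size)
            task = 0
    return v
```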
The training results are shown in Fig. 4. As the task switches, the number of phase shifters used for training gradually decreases, and the test-set accuracy of both tasks eventually stabilizes, as seen in Fig. 4(a). Without altering the operational state of the chip, it can provide the corresponding inference result for each input task. Figure 4(b) presents the confusion matrices for the two tasks at various stages of training. The classification accuracies for the two tasks reach 95% and 96%, respectively, indicating that multiple tasks can be handled on a single chip without retraining or manufacturing multiple optical chips. This reduces both training and manufacturing costs, demonstrating the versatility of the chip.
Figure 4. Experimental results of multi-task learning. (a) Cross-training accuracy as a function of the number of iterations. (b) Confusion matrices of the two tasks during the iterations.
2.5 Multi-thread Parallel Processing of the AON-D2NN
Multi-thread parallel processing is a crucial strategy for enhancing chip performance, enabling the simultaneous execution of multiple distinct tasks. Because the diffraction structure exhibits distinct transmission behaviors at different wavelengths, the chip can process tasks in parallel in the wavelength domain.
To verify this, we simultaneously launch two light signals, at wavelengths of 1550 and 1560 nm, into the AON-D2NN. The specific experimental setup is shown in Fig. 5(a). Two different classification tasks are executed concurrently on the same chip. Because of the constraints of the input-layer design, it is not possible to load different data independently for each wavelength; therefore, a single input layer is shared across both tasks, with the data loaded onto the two wavelength channels simultaneously. Fundamentally, this approach is similar to using two different datasets for two classification tasks, as both scenarios require the same chip to execute two distinct computational operations. The outputs are then passed through filters to compute the results for each wavelength channel. In future work, independent input layers could be designed to accommodate multi-dataset processing. Task 1 distinguishes the digits "0," "5" from "1," "8," whereas task 2 distinguishes "0," "1" from "5," "8."
Figure 5. Experimental results of the multi-thread parallel processing of the AON-D2NN.
In the experiment, light exiting the output ports is directed into a wavelength-selective switch, which separates the optical signals at the two wavelengths. This enables the extraction of two output vectors, each representing the output values across all ports at one wavelength. Different wavelengths correspond to different tasks, and the loss functions for the two tasks are computed separately from their respective output vectors. The total training objective is then obtained by summing the individual loss values. In this experiment, the loss function remains the cross-entropy, and training is performed using the parallel stochastic gradient descent algorithm, consistent with the methodology described in Sec. 2.3; a minimal sketch follows.
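The sketch below illustrates this joint objective together with a parallel (simultaneous-perturbation) gradient update; the forward models, perturbation size, and learning rate are illustrative placeholders rather than the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def combined_loss(v, forwards, labels):
    """Sum the per-wavelength cross-entropy losses over both tasks.

    forwards: {wavelength: callable mapping voltages -> output-port intensities}
    labels:   {wavelength: true class index for the current sample}
    """
    total = 0.0
    for wl, fwd in forwards.items():
        out = fwd(v)                       # intensities filtered at this wavelength
        p = np.exp(out - np.max(out))      # softmax normalization
        p /= p.sum()
        total += -np.log(p[labels[wl]] + 1e-12)
    return total

def spgd_step(v, loss, lr=0.2, sigma=0.02):
    """One parallel-SGD update: perturb all voltages at once and step along
    the gradient estimated from the change in the summed loss."""
    delta = sigma * rng.choice([-1.0, 1.0], size=v.size)  # simultaneous perturbation
    dl = loss(v + delta) - loss(v - delta)
    return v - lr * dl * delta / (2.0 * sigma**2)
```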
The experimental results, shown in Fig. 5(b), reveal that the maximum classification accuracies for the two parallel tasks are 92.5% and 91.5%, respectively. Despite performing two tasks in parallel, the accuracy penalty is minimal: only about 2% and 2.5%, respectively, compared with processing each task sequentially. Figure 5(c) presents the output distribution for a subset of the data. For task 1, a higher light intensity at port 1 indicates that the input belongs to the "0"/"5" category; for task 2, a higher intensity at port 1 signifies that the input belongs to the "0"/"1" category. These findings validate that the proposed architecture effectively supports multi-threaded operation.
3 Discussions and Conclusions
The proposed AON-D2NN leverages the propagation-as-computation characteristic, enabling inference with both low power consumption and ultra-low latency. The power consumption arises from the thermal phase shifters and the external control equipment required to sustain the chip's operational state. The latency, defined as the time delay between the input layer and the output layer, is experimentally demonstrated to be only 172 ps (more details are presented in Note 6 in the Supplementary Material).
Table 1 presents a comparison of the key performance indicators of the AON-D2NN with other state-of-the-art nonlinear deep ONNs. Thanks to the low-loss nonlinear activation function, the proposed architecture accommodates three hidden layers within a compact footprint. In addition, the architecture supports multi-thread processing, further broadening its range of potential applications.
Table 1. Key performance indicators of the AON-D2NN compared with state-of-the-art nonlinear deep ONNs.
The performance of the AON-D2NN can be further optimized at multiple levels. Strategies such as expanding the chip size, fine-tuning the dimensions of the thermal phase shifters, adjusting the spacing between layers, and increasing the number of parallel wavelengths can all contribute to enhanced performance. Note that the maximum number of threads is limited by the structural scale; in this study, the experimental chip supports a maximum of four threads within the C-band. A detailed discussion can be found in Note 7 in the Supplementary Material. In operation, the primary power consumption of the AON-D2NN arises from the thermal phase shifters on the diffractive slab waveguide, which are necessary to maintain the chip's active state. Replacing these thermal phase shifters with non-volatile phase-change materials could enable zero static power consumption, leading to a significant improvement in the chip's energy efficiency.49
Increasing the number of cascaded layers is also a key strategy for enhancing network performance. In optical-electrical-optical (OEO) schemes, regenerating the optical signal at each layer theoretically enables unlimited cascading, a key advantage of such systems. In fully all-optical approaches, however, inherent losses and the presence of a nonlinear threshold fundamentally limit the number of cascaded layers, making infinitely deep architectures impractical. Nevertheless, germanium offers distinct advantages over other nonlinear materials, including lower loss and compatibility with standard fabrication processes, thereby allowing a greater number of cascaded layers. As shown in Fig. 2(f), at the current germanium length, light undergoes two coupling periods as it propagates through the germanium-covered region. Reducing the germanium length decreases the number of coupling periods, thereby mitigating the loss associated with the nonlinear activation function and enabling the AON-D2NN to support more hidden layers.
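As a rough back-of-envelope bound (our illustration, not an analysis from the paper), the attainable depth follows from a simple power budget: each layer must still deliver enough power to cross the activation threshold, so

$$ N_{\max} \approx \frac{P_{\mathrm{in}}\,[\mathrm{dBm}] - P_{\mathrm{th}}\,[\mathrm{dBm}]}{\alpha\,[\mathrm{dB/layer}]}, $$

where $P_{\mathrm{in}}$ is the launch power, $P_{\mathrm{th}}$ is the nonlinear threshold of the germanium activation, and $\alpha$ is the per-layer loss. Lowering $\alpha$, for instance by shortening the germanium region as discussed above, directly increases $N_{\max}$.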
Throughput is typically limited by the response speed of the nonlinear activation function. Recently, an all-optical nonlinear activation function designed and fabricated using two-dimensional materials demonstrated an impressive response speed of 2.08 THz.52 Integrating such high-speed nonlinear activation functions into the proposed architecture could unlock ultra-high throughput in the future.
In inference tasks, diffractive architectures offer distinct advantages over Mach-Zehnder interferometer (MZI) networks and microring arrays: they achieve higher integration density than MZI networks and exhibit greater robustness against disturbances than microring arrays. However, diffractive architectures also have limitations, including relatively high diffraction loss due to higher-order-mode losses caused by mode mismatch as light transitions from the diffraction structure to the output waveguides. In addition, they lack rigorous mathematical proofs for constructing arbitrary matrices, relying primarily on numerical analysis to validate their functionality, which makes them less suitable for precise matrix operations. By contrast, MZI networks and microring arrays do not face these challenges, as they exhibit low loss and can accurately implement arbitrary matrices. Nevertheless, in inference tasks an exact mathematical formulation of the system is often unnecessary, because algorithmic training can effectively optimize the system for specific task requirements; this is precisely where diffractive architectures excel.
In summary, based on the low-loss germanium activation function, we design and fabricate an AON-D2NN chip for the first time, achieving dense integration with three hidden layers. The architecture excels in classification tasks within a compact footprint, occupying only 0.73 mm², and is fully compatible with standard CMOS processes. We also develop a cross-training algorithm that enables two distinct tasks to be trained on a single chip, allowing task-specific inference based on the input data. Furthermore, the AON-D2NN supports multi-threaded parallel processing, handling two classification tasks simultaneously, and achieves ultra-low latency (172 ps). These features underscore the significant advantages of the AON-D2NN in integration density, multi-threaded inference, input scalability, and network depth, paving the way for densely integrated, multi-thread nonlinear ONNs.
Jialong Zhang is a master's candidate at Huazhong University of Science and Technology (HUST), China. He currently works on on-chip diffractive neural networks. His work aims to optimize the diffractive structure and explore integrated sensing and computing.
Bo Wu is a PhD student at Huazhong University of Science and Technology, China. He currently works on the optimization of optical matrices and the monolithic integration of various photonic computing architectures, including optical convolutional neural networks, optical recurrent neural networks, optical Ising machines, and so on. His work aims to improve the integration density and energy efficiency of optical intelligent computing.
Shiji Zhang is a PhD student at Huazhong University of Science and Technology, China. He currently works on optical convolution and high-performance optical activation functions. His work aims to explore high-performance optical systems.
Junwei Cheng received his undergraduate training in optics engineering from HUST and his PhD from HUST in 2024. He specializes in developing optical neural networks based on the silicon photonics platform. His work aims to provide efficient computing hardware for artificial intelligence applications.
Yilun Wang received his PhD in electronics science and technology from Huazhong University of Science and Technology (HUST) in 2021, and then worked as a postdoctoral researcher at the Wuhan National Laboratory for Optoelectronics, HUST. Currently, he works as an associate professor at National University of Defense Technology. His research interests focus on silicon-based optoelectronic heterogeneous integrated devices and their applications in optical communications and computing.
Hailong Zhou received his undergraduate training in optics engineering from HUST and his PhD from HUST in 2017. He now works as an associate professor at the Wuhan National Laboratory for Optoelectronics, HUST. He is mainly engaged in research on optical matrix computation and its applications, including optical neural computation and optical logic computation. In recent years, he has published more than 30 SCI-indexed academic papers.
Jianji Dong is a professor at Huazhong University of Science and Technology, where he received his PhD in 2008. He then worked as a postdoctoral research fellow in the Department of Engineering at the University of Cambridge until 2010. His current research focuses on optical computing. In 2024, he was supported by the National Natural Science Foundation for Distinguished Young Scholars of China. He serves as the executive editor of Frontiers of Optoelectronics, organizes photonics open courses, and has edited a special issue on optoelectronic computing.
Xinliang Zhang received his PhD in physical electronics from the HUST, Wuhan, China, in 2001. He is currently the president of Xidian University and a professor at the Wuhan National Laboratory for Optoelectronics and School of Optical and Electronic Information, HUST. He is the author or coauthor of more than 400 journal articles. His current research interest is optoelectronic devices and integration. In 2016, he was elected as an OSA fellow.
Jialong Zhang, Bo Wu, Shiji Zhang, Junwei Cheng, Yilun Wang, Hailong Zhou, Jianji Dong, Xinliang Zhang, "Highly integrated all-optical nonlinear deep neural network for multi-thread processing," Adv. Photon. 7, 046003 (2025)
Category: Research Articles
Received: Feb. 6, 2025
Accepted: Apr. 17, 2025
Posted: Apr. 17, 2025
Published Online: May 19, 2025
The Author Email: Hailong Zhou (hailongzhou@hust.edu.cn), Jianji Dong (jjdong@hust.edu.cn)