1Northwestern Polytechnical University, School of Physical Science and Technology, Ministry of Industry and Information Technology, Key Laboratory of Light Field Manipulation and Information Acquisition, Xi’an, China
2Southeast University, School of Automation, Nanjing, China
3Ministry of Education, Key Laboratory of Measurement and Control of Complex Systems of Engineering, Nanjing, China
Analog neural networks, which mimic numerical computations through energy-efficient physical transformations in hardware architectures, typically achieve lower accuracies than digital neural networks. We explore the potential of photonic neural networks to outperform digital counterparts. Unlike traditional analog computing, our extreme learning machine (ELM)-based photonic neural network operates with physical synaptic connections without relying on mathematical descriptions. Noteworthy accuracy enhancements are achieved through photonic multi-synaptic connections, going beyond conventional notions of network depth or nonlinearity. Experimental results on MNIST, Fashion-MNIST, and CIFAR-10 datasets demonstrate classification accuracies up to 99.79%, 98.26%, and 90.29%, respectively, outperforming digital counterparts and most reported hardware architectures. This underscores the transformative impact of photonic neural networks, especially with the pivotal role of photonic multi-synapses, in advancing intelligent devices and signal processing.
Artificial intelligence (AI) has seen extensive applications in recent decades, spanning from facial recognition to autonomous driving and natural language processing.1 However, the exponential surge in data poses challenges for traditional digital computing, leading to issues such as the “memory wall” and high energy consumption in AI tasks. The waning effect of Moore’s Law further highlights the limitations of traditional digital computing.2 Leveraging light as an information carrier presents a promising alternative, sparking strong interest in integrating photonics into AI applications.3
Traditional artificial neural networks on electronic computers execute mathematical algorithms through numerical calculations in digital form. Efforts have been made to convert these calculations into more energy-efficient analog forms via physical transformations, which can be electronic or photonic. Many photonic neural networks have adopted deep learning models, such as diffractive deep neural networks,4–8 spatial convolution networks,9 and on-chip networks based on Mach-Zehnder interferometers (MZIs)10–12 or micro-ring resonators.13–17 These networks perform synaptic connections and nonlinear activations, for which matrix operations are executed through photonic transformations. Feature extraction is accomplished by optimizing connection strengths, akin to digital neural networks, with weights computed on computers and then applied to physical devices. High experimental test accuracies of diffractive deep neural networks have been reported on the MNIST dataset,4,18,19 and a recent hybridization of a diffractive network with subsequent electronic digital networks achieved a classification test accuracy of 97.1%.20 In on-chip networks, the MZI architecture typically employs singular value decomposition for matrix operations, achieving a classification test accuracy of 90.5% on the MNIST dataset.11 The micro-ring architecture can implement matrix operations using a crossbar configuration via coupling relationships. A phase-change material-based micro-ring network has demonstrated an accuracy of 95.3% on the MNIST dataset.15
Experimental accuracy of hardware-architected networks is usually lower than that of digital computation due to errors in the mathematical description of physical transformations, device fabrication, and weight quantization. Most hardware architectures are designed for analog computing based on an isomorphic mathematics relationship with digital computing. Analog computing is generally considered less accurate than digital computing because digital computing operates with discrete values that can be precisely represented, whereas analog computing deals with continuous signals that are more susceptible to noise and other sources of error. Achieving software-equivalent accuracy will require improvement in both devices and hardware-aware training techniques for analog accelerators.21
However, what if an artificial neural network employs a physical transformation without relying on mathematical descriptions? The so-called physical neural network employs physics-aware training based on a hybrid in situ-in silico algorithm. A physical–digital hybrid architecture employing broadband optical second harmonic generation performed the MNIST task with 97% test accuracy.22 Extreme learning machines (ELMs),23–25 featuring an untrained hidden layer of random neurons and a trained output layer, can adopt real physical synapses without requiring mathematical descriptions, and the training is naturally physics-aware. Unlike software-architected numerical synapses, such physical synaptic connections are not analog computation, warranting exploration for accuracy performance compared with digital synapses. Photonic ELM networks26–28 harnessing free-space optical diffraction or multiple optical scattering have achieved accuracies of 92.18%27 or even close to 98%28 on the MNIST dataset.
This study delves into the accuracy performance of physical synapses compared with digital synapses in ELM neural networks. Free-space optical diffraction establishes the physical synaptic connections between input neurons and hidden neurons in the photonic neural network, whereas in the digital ELM networks, synaptic connections are realized using a random number template (RNT) and the angular-spectrum diffraction model. Tests on the MNIST,29 Fashion-MNIST,30 and CIFAR-1031 datasets reveal that the photonic network outperforms digital ELM networks. A strategy of photonic multi-synapses is demonstrated to show significant enhancement in accuracy. An input image is duplicated into multiple copies in a spatial array and projected onto the sensing layer of a camera via free-space diffraction. A hidden neuron thus receives multi-synaptic connections from an input neuron through many pathways. Experimental results show classification accuracies of 99.79% for MNIST, 98.26% for Fashion-MNIST, and 90.29% for CIFAR-10. These results are comparable to state-of-the-art digital deep learning algorithms and surpass most hardware architectures while also outperforming their digital counterparts. In addition, the photonic architecture achieves fast training speeds in the order of seconds, 2.89 TOPs/s computing speed, and low optical energy consumption at the attojoule-per-MAC (multiply-accumulate) level. The photonic multi-synapse neural network offers a simple and high-performance hardware architecture. By utilizing real physical synaptic connections, the system incorporates numerous real-world factors into the inference and training processes, bringing new opportunities for enhancing network accuracy. Photonic multi-synapses present a mechanism beyond nonlinearity and network depth and have the potential to play an important role in advancing intelligent devices and signal processing.
2 Results
2.1 Photonic Multi-synapse Neural Network Based on Input Duplicates
As illustrated in Fig. 1(a), the photonic multi-synapse neural network implements random hidden neurons via optical diffraction between the input layer and the hidden layer within the ELM framework. Photonic multi-synapses are established from an input neuron $x_j$ to a hidden neuron $h_i$ through multiple pathways, such that the synaptic connection weight is the sum of the weights across all paths, $w_{ij} = \sum_k w_{ij}^{(k)}$. This setup involves duplicating an input image using a spatial light modulator to generate multiple copies that reach the camera through distinct spatial paths, as shown in Fig. 1(b). Through optoelectronic conversion, the optical information is transformed into the digital domain, leading to hidden neurons producing output $h_i = g(\sum_j w_{ij} x_j)$, with $g$ representing the nonlinear activation function. Subsequent processes involve down-sampling of the hidden neurons to a reduced set using a mean method, followed by matrix multiplication with the tensor core $\boldsymbol{\beta}$, and cognitive decision-making through digital computation. In the experimental schematic illustrated in Fig. 1(c), a grayscale input image is encoded as a phase image in the duplicate array, which is read by the laser light on the spatial light modulator. The region of interest (ROI) window defines the portion of the diffraction pattern on the camera used for further processing. Detailed experimental and training procedures are outlined in Methods, Supplementary Note 3, and Fig. S5 in the Supplementary Material.
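The duplicate-array construction can be sketched numerically. The following Python snippet is a hypothetical illustration, not the authors' code; the function name, interface, and zero-valued gap pixels are our own assumptions. It tiles an input image into a square spatial array with an optional blank interval between copies:

```python
import numpy as np

def make_duplicate_array(img, reps, interval=0):
    """Tile a 2D input image into a reps x reps spatial array.

    `interval` is the blank (zero-phase) gap, in pixels, inserted between
    neighboring copies; the paper found a null interval to perform best.
    """
    h, w = img.shape
    # pad one gap onto each copy, tile, then trim the trailing gap
    padded = np.zeros((h + interval, w + interval), dtype=img.dtype)
    padded[:h, :w] = img
    tiled = np.tile(padded, (reps, reps))
    return tiled[: reps * h + (reps - 1) * interval,
                 : reps * w + (reps - 1) * interval]
```

With a null interval, for example, a 28-pixel-wide MNIST image tiled into a 3-by-3 array yields an 84-pixel-wide phase pattern for the SLM.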
Figure 1.Photonic multi-synapse neural network. (a) Schematic of network architecture. (b) Photonic multi-synaptic connection implemented by input duplicates and multi-pathway diffractive propagation. (c) Experimental schematic.
Three datasets, MNIST, Fashion-MNIST, and CIFAR-10, were employed for the image classification experiments. Binary images in MNIST and Fashion-MNIST were phase-encoded using 0 and $\pi$, whereas grayscale images in CIFAR-10 were linearly mapped within a fixed phase range, as illustrated in Fig. 2(a). The diffractive projections of all images in the datasets were captured by a camera and subsequently processed for training and classification by computers. Original images, formatted as $n \times n$ pixels (with $n$ being 28 for MNIST and Fashion-MNIST, and 32 for CIFAR-10), were duplicated into spatial arrays for establishing photonic multi-synaptic connections. Increasing the duplicate interval [Fig. 2(b)] led to reduced test accuracy, as seen in Fig. 2(c), prompting the adoption of null intervals in subsequent investigations. The reason for this effect is explained in Supplementary Note 4 and Fig. S11 in the Supplementary Material. In addition, for grayscale images as illustrated in Fig. 2(d), optimal performance was achieved with the widest phase encoding range, leading to the subsequent adoption of the maximum achievable phase modulation. The impact of varying the ROI size of the camera on test accuracy is highlighted in Fig. 2(e), with the ROI size then fixed for further experiments. Camera exposure gain, affecting the nonlinear activation functions, was set to 100% for optimization purposes, as depicted in Fig. 2(f). Furthermore, the introduction of color channels in the original CIFAR-10 dataset demonstrated enhanced test accuracy, validated by aggregating projection results for RGB channels, as shown in Figs. 2(g) and 5(e).
Figure 2. Experimental image classification. (a) Image examples from the MNIST, Fashion-MNIST, and CIFAR-10 (grayscale) datasets in phase-encoded form. (b) Schematic diagram of the input duplicate array. (c) Test accuracy versus duplicate interval. (d) Reduction rate of classification error with respect to mono-synaptic connections. (e) Impact of the ROI window on test accuracy. (f) Test accuracy under different camera exposure settings, with an exposure time of 3 ms and exposure gains of 100% and 4000%. (g) Comparison of test accuracy between grayscale and color images considering joint training of RGB channels for CIFAR-10. (h) Test accuracies for the three datasets in the three duplicate array formats with 10,000 hidden neurons. (i)–(k) Confusion matrices for the three datasets in the largest duplicate array format with 10,000 hidden neurons. (l)–(n) Test accuracy under different numbers of hidden neurons for the three datasets. CIFAR-10 (grayscale) is used in panels (c)–(f), and CIFAR-10 (RGB) is used in panels (h), (k), and (n).
Ultimately, image classification experiments using multi-synaptic connections were carried out on the three datasets [Fig. 2(h)]. The introduction of photonic multi-synaptic connections via input duplicate arrays led to a notable enhancement in test accuracy compared with mono-synaptic connections (a single input copy), with a more pronounced effect observed for more complex images. Test accuracies of 99.79%, 98.26%, and 90.29% were achieved for MNIST, Fashion-MNIST, and CIFAR-10, respectively, by employing the largest input duplicate array, with their respective confusion matrices shown in Figs. 2(i)–2(k). Furthermore, as the number of hidden neurons increased, the test accuracy gradually improved and approached saturation [Figs. 2(l)–2(n)]. A network comprising 900 hidden neurons demonstrated stable performance, with training times on the order of seconds. Notably, employing multi-synaptic connections with input duplicates on the CIFAR-10 dataset enhanced test accuracy by 19.94% compared with using mono-synaptic connections with 10,000 hidden neurons, surpassing traditional digital ELM networks. These image classification results are comparable to state-of-the-art deep digital networks and outperform most hardware-architected networks.
ELMs facilitate the random generation of hidden neurons without deliberate control over their connection strengths from input neurons. In traditional digital ELM networks, an RNT is utilized to define synaptic connection strengths for projecting an input neuron to hidden neurons. Conversely, photonic neural networks leverage diffractive optical propagation for physical synaptic connections, enabling the direct implementation of a physical transformation without the need for mathematical representation. On the other hand, mathematical models such as the angular-spectrum diffraction model can be employed to construct digital ELM networks by simulating diffractive optical propagation numerically. For the three datasets, the model-based and the RNT-based networks exhibit similar test accuracy. A noteworthy finding from Figs. 3(a)–3(c) indicates that physical synapses outperform their numerical counterparts, particularly evident with more complex image datasets. This suggests that the mathematical transformation introduces a loss of information fidelity due to errors inherent in the numerical method and digital computation. These errors act as noise impacting feature extraction within the hidden layer. By contrast, the physical transformation is immune to all these issues, facilitating high-fidelity feature extraction.
Figure 3.Test accuracy comparison. (a)–(c) Test accuracies on the three datasets for networks with mono-synaptic connections. (d) Test accuracies on the CIFAR-10 (grayscale images) dataset comparing the multi-synaptic connection effects.
The diffractive propagation responsible for synaptic connections involves transforming the spatial frequency spectrum from the input plane to the camera sensing plane. However, because the input is limited in area, crosstalk noise can arise in the frequency spectrum. Arranging the input into a duplicate array helps alleviate this crosstalk noise, as elaborated in Supplementary Note 4 and Fig. S10 in the Supplementary Material. This could explain the improvement in accuracy observed with multi-synaptic connections, demonstrated by both photonic experimentation and optical model-based numerical calculations in Fig. 3(d), but not with RNT-based numerical synaptic connections. With multi-synaptic connections, the optical model-based network achieves higher test accuracy than the RNT-based network [Figs. S1(d) and S3(c) in the Supplementary Material]. This observation suggests that optical model-based numerical synapses excel at feature extraction compared with RNT-based synapses. Although model-based numerical synaptic connections show improved accuracy with an increased number of duplicates, further increments do not yield additional benefits due to numerical implementation errors. Multi-synaptic connections offer more consistent and reliable feature extraction compared with mono-synaptic connections, with physical implementation once again demonstrating superior performance in enhancing accuracy through multi-synaptic configurations (Fig. S7 in the Supplementary Material).
3 Discussion
Our findings reveal that physical synapses outperform digital synapses in accuracy within ELM neural networks. The hardware-architected physical transformation, involved in both training and inference, circumvents the numerical errors and device parameter issues encountered in hardware-mimicking mathematical transformation. By introducing photonic multi-synapses in the physical transformation configuration, feature mapping capabilities can be significantly boosted. This mechanism, extending beyond traditional network depth and nonlinearity concepts, can further enhance accuracy performance through physical transformation. Through the input duplicates scheme implementing multi-synaptic connections, crosstalk noise in the spatial frequency spectrum is suppressed. Achieving classification accuracies of 99.79%, 98.26%, and 90.29% on the benchmark datasets MNIST, Fashion-MNIST, and CIFAR-10, respectively, this photonic architecture competes effectively with deep digital networks using a single layer and outperforms its digital ELM counterparts (Table S4 in the Supplementary Material). Operating at a computing speed of 2.89 TOPs/s, with a direct energy efficiency of 1.53 POPs/J and training times in seconds, our hardware-architected photonic multi-synapse neural network showcases its efficiency. Optical computing dominates over 99% of the computational composition, featuring optical energy consumption at the attojoule-per-MAC level (Supplementary Note 5 and Table S5 in the Supplementary Material). This innovative architecture proves to be hardware-friendly, enabling feature extraction solely through diffraction propagation.
Enhancing the accuracy of neural networks is crucial for practical applications. Neural networks based on physical transformation bridge the simulation-reality gap, addressing limitations inherent in analog computation. Controllable physical systems hold significant potential for executing AI tasks. The strategy of multi-synaptic connections achieves high accuracy performance in photonic neural networks, with lower complexity in physical implementation compared with digital computation. We anticipate that multi-synaptic connections will serve as a versatile strategy applicable to neural networks implemented through physical transformation. The proposed architecture is poised to inspire application-oriented AI development, drive innovation in modern computing paradigms, and cater to various fields requiring advanced computational capabilities.
4 Appendix: Methods
4.1 Experimental Setup
The experimental setup of the photonic neural network, as illustrated in Fig. 4, utilizes a continuous laser operating at a wavelength of 532 nm. The light beam was set to horizontal polarization using a polarizer, with the light power adjusted using an attenuator. After spot shaping by a spatial filter, the light was reflected by a beam splitter onto the surface of a spatial light modulator (SLM, E19×12-500-1200, Meadowlark), a reflective liquid-crystal-on-silicon phase-type SLM whose modulation units have a fixed pixel pitch and 8-bit phase modulation precision. The SLM was calibrated to guarantee a linear phase response over its full modulation range at a wavelength of 532 nm. An input image was written on the SLM in a phase-encoded single or duplicate array format, and after passing through the SLM, the light carrying the phase-encoded field underwent 18-cm diffractive propagation before reaching the camera (CMOS, E3ISPM09000KPB, TOUPCAM). The selection of the diffraction distance and the duplicate array formats is described in Supplementary Note 3 and shown in Figs. S6 and S9 in the Supplementary Material.
Figure 4.Experimental setup for photonic neural network.
The ELM’s random hidden neurons were generated through diffraction propagation in the optical path and optoelectronic conversion by the camera. In the neural network, the former functioned as the matrix operation and the latter as the nonlinear activation. The camera was operated with a fixed resolution and ROI window, an exposure time of 3 ms, and an exposure gain of 100%. The image data collected by the camera were transmitted as a hidden layer matrix to the subsequent digital network. The acquisition time for all samples in one dataset is approximately 4 h, whereas the fastest acquisition time is determined by the maximum refresh rate of the SLM, which is 60 Hz. The digital network, performing image downsampling, matrix operation, and network training, was run on a computer equipped with an Intel(R) Core(TM) i9-14900KF CPU @ 3.20 GHz processor and 64 GB RAM.
4.2 Neural Network Architectures
4.2.1 General ELM neural network
A general ELM is a feedforward neural network with a single hidden layer employing random and untrained neurons [Fig. 5(a)]. For $N$ arbitrary distinct samples, the input of the $i$th sample can be represented as $\mathbf{x}_i$ and its label as $\mathbf{t}_i$. The hidden layer output is $\mathbf{h}_i = g(\mathbf{W}\mathbf{x}_i + \mathbf{b})$, where $\mathbf{W}$ is the hidden layer weight matrix, $\mathbf{b}$ is the hidden layer bias vector, and $g$ is the nonlinear activation function. The output can be represented as $\mathbf{y}_i = \mathbf{h}_i\boldsymbol{\beta}$. In ELM neural networks, $\mathbf{W}$ and $\mathbf{b}$ are randomly generated and untrained, with only the output layer weight matrix $\boldsymbol{\beta}$ being trained. By adjusting $\boldsymbol{\beta}$, the network optimization goal is to minimize the error between the predicted labels and their target labels for all samples. The output layer matrix for $N$ samples with $L$ hidden neurons is defined as $\mathbf{H} = [\mathbf{h}_1^{\mathsf T}, \mathbf{h}_2^{\mathsf T}, \ldots, \mathbf{h}_N^{\mathsf T}]^{\mathsf T} \in \mathbb{R}^{N \times L}$.
Figure 5.Neural network architectures. (a) General ELM neural network. (b) RNT-based neural network. (c) Optical model-based neural network. (d) Photonic neural network for grayscale image classification. (e) Photonic neural network for color image classification.
Given the target label matrix $\mathbf{T} = [\mathbf{t}_1^{\mathsf T}, \ldots, \mathbf{t}_N^{\mathsf T}]^{\mathsf T}$, training corresponds to solving the problem $\min_{\boldsymbol{\beta}}\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|$. The least-square solution to the problem is $\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}$, where $\mathbf{H}^{\dagger}$ is the Moore–Penrose generalized inverse of $\mathbf{H}$. Unlike the iterative training approach in deep learning, the training weights are determined analytically by solving a linear system in ELMs, empowering fast training. To make the solution more robust and achieve better generalization performance, the learning parameter is determined by applying ridge regression theory to the training through a regularization coefficient $C$: $\boldsymbol{\beta} = (\mathbf{H}^{\mathsf T}\mathbf{H} + \mathbf{I}/C)^{-1}\mathbf{H}^{\mathsf T}\mathbf{T}$ and $\boldsymbol{\beta} = \mathbf{H}^{\mathsf T}(\mathbf{H}\mathbf{H}^{\mathsf T} + \mathbf{I}/C)^{-1}\mathbf{T}$, where $\mathbf{I}$ is the identity matrix.
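As a minimal numerical sketch of this training step (assuming NumPy; the function names are ours, not the authors'), the ridge-regularized output weights can be computed in one linear solve:

```python
import numpy as np

def train_elm_output(H, T, C=1.0):
    """Solve the ELM output weights by ridge regression.

    H : (N, L) hidden-layer matrix, T : (N, classes) target matrix.
    Uses beta = (H^T H + I/C)^{-1} H^T T, the form suited to N > L.
    """
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + np.eye(L) / C, H.T @ T)

def predict(H, beta):
    """Class decision: index of the largest output-layer response."""
    return np.argmax(H @ beta, axis=1)
```

For $N < L$, the dual form $\boldsymbol{\beta} = \mathbf{H}^{\mathsf T}(\mathbf{H}\mathbf{H}^{\mathsf T} + \mathbf{I}/C)^{-1}\mathbf{T}$ is cheaper, since the matrix to invert is then $N \times N$.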
4.2.2 Random number template (RNT)-based neural network
As shown in Fig. 5(b), random projection is achieved by convolving the 2D input matrix with an RNT that maps the connection strengths from an input neuron to hidden neurons; no bias is involved. After convolution, a square function serves as the activation function for nonlinear mapping to generate the hidden neurons. Downsampling is executed using a mean method to reduce the matrix size of the hidden neurons to $5\times5$, $10\times10$, $15\times15$, $30\times30$, $60\times60$, $80\times80$, and $100\times100$, corresponding to 25, 100, 225, 900, 3600, 6400, and 10,000 neurons, respectively. A mean method for downsampling is applied to adjust the size of a matrix (see Downsampling method). The reduced matrix is flattened into 1D format and then connected to the fully connected output layer. Network parameters for the RNT-based neural network are provided in Fig. S2 in the Supplementary Material.
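A compact sketch of the RNT projection follows (our own Python illustration, not the published MATLAB code; an FFT-based linear convolution stands in for the 2D convolution, and the Gaussian template statistics are an assumption):

```python
import numpy as np

def conv2d_full(a, b):
    """Full 2D linear convolution of two real arrays via zero-padded FFTs."""
    s0 = a.shape[0] + b.shape[0] - 1
    s1 = a.shape[1] + b.shape[1] - 1
    return np.fft.irfft2(np.fft.rfft2(a, (s0, s1)) * np.fft.rfft2(b, (s0, s1)),
                         (s0, s1))

def rnt_hidden_layer(img, rnt):
    """Random projection with a random number template, followed by the
    square nonlinearity used as the activation function."""
    return conv2d_full(img, rnt) ** 2
```

For MNIST-sized inputs, for instance, `rnt = np.random.randn(28, 28)` would serve as one random number template; the resulting hidden map is then mean-downsampled as described above.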
4.2.3 Optical model-based neural network
As shown in Fig. 5(c), the network structure is similar to the RNT-based neural network, except that the transform relation based on the angular spectrum diffraction theory is adopted to replace the RNT for projection. It involves a numerical implementation of a mathematical transformation based on an optical model and thus is a digital neural network as well.
In the angular spectrum diffraction theory,32 the spatial frequency spectrum of an incident optical field $u_0(x, y)$ is given by the Fourier transform $U_0(f_x, f_y) = \mathcal{F}\{u_0(x, y)\}$, where $f_x$ and $f_y$ are the spatial frequencies. The diffraction transfer function for free-space propagation is $H(f_x, f_y) = \exp\left(i2\pi z\sqrt{1/\lambda^2 - f_x^2 - f_y^2}\right)$, where $z$ is the diffraction distance and $\lambda$ is the wavelength. In the spatial frequency domain, the spatial frequency spectrum at distance $z$ can be calculated as $U_z(f_x, f_y) = U_0(f_x, f_y)\,H(f_x, f_y)$.
The distribution of the diffraction field at distance $z$ is calculated by the inverse Fourier transform of the spatial frequency spectrum, $u_z(x, y) = \mathcal{F}^{-1}\{U_z(f_x, f_y)\}$, resulting in an intensity distribution $I_z(x, y) = |u_z(x, y)|^2$.
However, discrete numerical implementation of the above mathematical transformation is more complicated, being governed by the Nyquist sampling theorem to avoid aliasing:33 $\frac{1}{2\pi}\left|\frac{\partial \phi_H}{\partial f_x}\right| \le \frac{1}{2\Delta f_x}$, where $\phi_H$ is the phase of $H(f_x, f_y)$ and $\Delta f_x$ is the sampling interval of $f_x$. According to Eq. (9), the number of sampling points $N$ should satisfy $N \ge \lambda z/\Delta x^2$, where $\Delta x$ is the sampling interval of $x$. Our calculation adopts the experimental values of the wavelength, diffraction distance, and sampling interval; accordingly, the number of samples for the incident field is increased by padding the surrounding of the 2D input matrix with zeros to satisfy Eq. (10).
The discrete Fourier transform of the incident field is $U_0[p, q] = \sum_{m=0}^{N-1}\sum_{n=0}^{N-1} u_0[m, n]\exp\left[-i2\pi\left(\frac{mp}{N} + \frac{nq}{N}\right)\right]$, where $\Delta x$ and $\Delta f_x = 1/(N\Delta x)$ are the sampling intervals in the spatial and spatial frequency domains, and $m$, $n$, $p$, $q = 0, 1, \ldots, N-1$. The discretized transfer function is represented as $H[p, q] = \exp\left(i2\pi z\sqrt{1/\lambda^2 - (p\Delta f_x)^2 - (q\Delta f_y)^2}\right)$, and the spatial frequency spectrum of the diffraction field at distance $z$ can be expressed as $U_z[p, q] = U_0[p, q]\,H[p, q]$.
The diffraction field can be obtained through the inverse discrete Fourier transform, $u_z[m, n] = \frac{1}{N^2}\sum_{p=0}^{N-1}\sum_{q=0}^{N-1} U_z[p, q]\exp\left[i2\pi\left(\frac{mp}{N} + \frac{nq}{N}\right)\right]$.
Based on the above formalism, the fast Fourier transform (FFT) algorithm is applied to calculate the diffraction field, $u_z = \mathrm{IFFT}\{\mathrm{FFT}\{u_0\}\cdot H\}$, where FFT and IFFT represent the fast Fourier transform operation and the inverse fast Fourier transform operation, respectively. The intensity of the diffraction field is further adopted to calculate the hidden neurons in the ELM network. The subsequent procedure is identical to that of the RNT-based neural network. Network parameters for the optical model-based neural network are provided in Fig. S4 in the Supplementary Material.
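The discrete angular-spectrum propagation above can be sketched in Python (our own illustration, not the authors' code; the padding factor and interface are assumptions, and evanescent components are simply truncated):

```python
import numpy as np

def angular_spectrum(u0, wavelength, z, dx, pad=2):
    """Propagate a sampled field u0 over distance z via the angular-spectrum
    method: zero-pad to reduce aliasing, multiply the spectrum by the
    free-space transfer function, and transform back."""
    n0 = u0.shape[0]
    n = pad * n0
    u = np.zeros((n, n), dtype=complex)
    s = (n - n0) // 2
    u[s:s + n0, s:s + n0] = u0                      # zero-padded input
    fx = np.fft.fftfreq(n, d=dx)                    # spatial frequencies
    FX, FY = np.meshgrid(fx, fx)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * z) * (arg > 0)             # drop evanescent waves
    uz = np.fft.ifft2(np.fft.fft2(u) * H)
    return uz[s:s + n0, s:s + n0]                   # crop back to input size
```

The hidden neurons of the model-based network would then be `np.abs(uz)**2`, mirroring the intensity detection of the camera.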
4.2.4 Photonic neural network
In the network structures shown in Figs. 5(d) and 5(e) for classifying grayscale and color image samples, respectively, the projection from input neurons to hidden neurons is obtained solely through experimental measurements; this physical transformation does not rely on mathematical formalism, which distinguishes it from the above numerical models. The hidden neurons are experimentally represented by the intensity of the diffraction field captured in the camera’s ROI window. This completes the optical computation of the photonic neural network, whereas the structure of the backend digital network remains the same as in the RNT-based neural network. It should be noted that for the network classifying RGB images, after downsampling the diffraction intensities of the RGB data individually, the intensities of the three channels are summed to obtain the hidden layer matrix. Network parameters for the photonic neural networks are provided in Fig. S8 in the Supplementary Material.
4.3 Downsampling Method
To reduce the matrix size for hidden neurons, a simple mean interpolation method is implemented as the MATLAB “ds” function (https://github.com/zacgogogo/-ds-function), which compresses a matrix from one size to another. This is a custom image compression method, whereas other algorithms, such as various pooling methods, can also be used for downsampling.
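Since the MATLAB ds function itself is linked above, here is only a simplified Python equivalent for the common case of integer block ratios (a sketch under that assumption, not the published code, which handles general sizes):

```python
import numpy as np

def mean_downsample(x, out_shape):
    """Compress a 2D matrix to out_shape by block averaging.

    Assumes each input dimension is an integer multiple of the
    corresponding output dimension.
    """
    m, n = out_shape
    p, q = x.shape
    assert p % m == 0 and q % n == 0, "non-integer block ratio"
    return x.reshape(m, p // m, n, q // n).mean(axis=(1, 3))
```

For example, a camera ROI of 600 by 600 pixels can be reduced to a 30 by 30 hidden-neuron matrix by averaging 20 by 20 blocks.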
4.4 Training and Testing for ELM Neural Networks
In this work, we employed the complete datasets of MNIST, Fashion-MNIST, and CIFAR-10 for training and testing. Both the MNIST and Fashion-MNIST datasets contain 60,000 training samples and 10,000 testing samples, whereas the CIFAR-10 dataset contains 50,000 training samples and 10,000 testing samples of color images. The grayscale images of CIFAR-10 used in this work are obtained by weighted summation of the RGB pixel values.
In our network structure, the hidden neurons are flattened into 1D form after downsampling. The hidden layer matrix is constructed from the flattened hidden neuron vectors of all training samples, and Eq. (3) is applied for network training. The exponent of the regularization coefficient is an integer ranging from −10 to 10, yielding 21 candidate regularization coefficients for optimization. Ultimately, we selected the output weight matrix corresponding to the best-performing regularization coefficient as the final result of the network training. The selected regularization coefficients are listed in Tables S1–S3 in the Supplementary Material. For testing, the testing samples are fed into the trained network to perform inference. In addition, the training was conducted on a computer with the configuration mentioned in the experimental setup, with the “tic-toc” command used for measuring code execution time in MATLAB.
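The coefficient selection can be sketched as a simple sweep (Python; the base-2 grid C = 2**p is an assumption — the paper specifies only 21 integer exponents — and selecting by held-out accuracy is our own simplification of "best-performing"):

```python
import numpy as np

def sweep_regularization(H_train, T_train, H_val, y_val,
                         exponents=range(-10, 11)):
    """Try C = 2**p over a grid of exponents and keep the output weights
    with the best validation accuracy. Returns (accuracy, C, beta)."""
    L = H_train.shape[1]
    G = H_train.T @ H_train          # reused across all C values
    HtT = H_train.T @ T_train
    best = (-1.0, None, None)
    for p in exponents:
        C = 2.0 ** p
        beta = np.linalg.solve(G + np.eye(L) / C, HtT)
        acc = np.mean(np.argmax(H_val @ beta, axis=1) == y_val)
        if acc > best[0]:
            best = (acc, C, beta)
    return best
```

Precomputing the Gram matrix once makes the 21-coefficient sweep nearly as cheap as a single training run, consistent with the seconds-scale training times reported above.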
Zhuonan Jia is a PhD student in optical engineering at the School of Physical Science and Technology, Northwestern Polytechnical University. His research interest focuses on optical neural networks.
Haopeng Tao is a PhD student in optical engineering at the School of Physical Science and Technology, Northwestern Polytechnical University. His research interest focuses on optical neural networks.
Guang-Bin Huang is a chair professor at Southeast University, China, and the founder of Mind PointEye, Singapore. He was a full professor at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His two works on extreme learning machines (ELMs) were listed by Google Scholar in 2017 as second and seventh, respectively, among the top ten in artificial intelligence in its “Classic Papers: Articles that Have Stood the Test of Time.” He has been listed by Thomson Reuters as a “Highly Cited Researcher” (in two fields: engineering and computer science) since 2014. He received the Best Paper Award from IEEE Transactions on Neural Networks and Learning Systems (2013).
Ting Mei is a professor at the School of Physical Science and Technology, Northwestern Polytechnical University. He received his BS and MS degrees in optical engineering from Zhejiang University and his PhD in electrical engineering from the National University of Singapore. Previously, he held the position of tenured associate professor at Nanyang Technological University and served as the director of the Institute of Optoelectronic Materials and Technology at South China Normal University. His research specializes in nanophotonics and semiconductor optoelectronics.