Modern machine-learning applications rely on huge artificial neural networks that demand substantial computational power and memory. Light-based platforms promise ultrafast and energy-efficient hardware, which may help realize next-generation data-processing devices. However, current photonic networks are limited by the number of input-output nodes that can be processed in a single shot. This restricted network capacity prevents their application to relevant large-scale problems such as natural language processing. Here, we realize a photonic processor for supervised learning with a capacity exceeding $1.5\times {10}^{10}$ optical nodes, more than one order of magnitude larger than any previous implementation, which enables photonic large-scale text encoding and classification. By exploiting the full three-dimensional structure of the optical field propagating in free space, we overcome the interpolation threshold and reach the over-parameterized regime of machine learning, a condition that allows high-performance sentiment analysis with a minimal fraction of training points. Our results provide a novel solution for scaling up light-driven computing and open the route to photonic natural language processing.

1. INTRODUCTION

Advanced artificial intelligence systems are becoming extremely demanding in terms of training time and energy consumption [1,2]. An ever-larger number of trainable parameters is required to exploit over-parameterization and achieve state-of-the-art performances [3,4]. Large-scale, energy-efficient computational hardware is becoming a subject of intense interest.

Photonic neuromorphic computing systems [5–7] offer large throughput [8] and energetically efficient [9] hardware accelerators based on integrated silicon photonic circuits [10–14], engineered meta-materials [15], or 3D-printed diffractive masks [16,17]. Other light-based computing architectures leverage reservoir computing (RC) [18] and extreme learning machine (ELM) [19] computational paradigms, where the input data is mapped into a feature space through a fixed set of random weights and training is performed only on the linear readout layer. Optical reservoir computers [20–33] and photonic extreme learning machines (PELMs) [34–37] apply successfully to various learning tasks, ranging from time series prediction [38,39] to image classification [40,41]. Despite the remarkable performances achieved by these architectures, their impact on large-scale problems is limited, mainly due to size constraints.

Here, we demonstrate photonic machine learning at an ultralarge scale by mapping the optical output layer onto the full three-dimensional (3D) structure of the optical field. This original approach gives the optical device inherent scalability and effortless access to the over-parameterized learning regime. The 3D-PELM that we implement can process simultaneously up to 250,000 total input-output nodes via spatial light modulation, featuring a total network capacity one order of magnitude larger than existing optical reservoir computers for supervised learning. Our large-scale photonic network allows us to optically implement a massive text classification problem, the sentiment analysis of the Internet Movie Database (IMDb) [44], and enables, for the first time to our knowledge, the observation of the double descent phenomenon [45–47] on a photonic computing device. We demonstrate that, thanks to its huge number of optical output nodes, the 3D-PELM successfully classifies text using only a limited number of training points, realizing energy-efficient, large-scale text processing.


2. RESULTS

A. Three-Dimensional PELM

A PELM classifies a dataset with $N$ points $\mathbf{X}=[{X}_{1},\dots ,{X}_{N}]$ by mapping each input sample ${X}_{i}\in {\mathbb{R}}^{L}$ into a high-dimensional feature space through a nonlinear transformation performed by a photonic platform [34]. To perform the classification, the output matrix $\mathbf{H}$, which contains the measured optical signals, is linearly combined with a set of $M$ trainable readout weights $\beta$ (see Appendix A). In all previous PELM realizations [25,34,35,39,41], the intensities stored in $\mathbf{H}$ only contain information on the optical field at a single spatial or temporal plane. Here, to scale up the optical network, we exploit the entire three-dimensional optical field. Figure 1(B) shows a schematic of our 3D-PELM: the photonic network uses the full 3D structure of the optical field propagating in free space as its output nodes. The transformation of the input data sample into an optical intensity volume occurs through linear wave mixing by coherent free-space propagation, followed by intensity detection. Specifically, the 3D optical mapping reads as $${\mathbf{H}}_{3\mathrm{D}}=\sum _{j}{\mathbf{H}}_{j},\tag{1}$$ $${\mathbf{H}}_{j}={G}_{j}[{\mathbf{M}}_{j}\,\mathrm{exp}\,i(\mathbf{X}+\mathbf{W})],\tag{2}$$ where ${\mathbf{M}}_{j}$ is the transfer matrix that models coherent light propagation to the $j$th detection plane, e.g., the discretized diffraction operator. Since ${\mathbf{M}}_{j}$ has complex coefficients, the photonic scheme implements a complex-valued neural network [42,43]. In Eq. (2) the input dataset is encoded by phase modulation, while $\mathbf{W}$ is a pre-selected random matrix that embeds the input signal and provides the network biases. ${G}_{j}$ is the response function of the $j$th detector to the impinging light field.
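As a rough numerical sketch (not the authors' code), the mapping of Eq. (2) can be emulated with random complex matrices standing in for the free-space transfer operators ${\mathbf{M}}_{j}$; here the per-plane intensities ${\mathbf{H}}_{j}$ are concatenated so that every plane contributes output nodes, and all sizes are illustrative toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

L, M_per_plane, n_planes = 64, 128, 3   # toy sizes (the experiment uses far larger ones)

X = rng.uniform(0, np.pi, size=L)       # phase-encoded input sample
W = rng.uniform(0, np.pi, size=L)       # fixed random embedding providing network biases

H_planes = []
for j in range(n_planes):
    # M_j: random complex matrix standing in for propagation to the j-th plane
    M_j = (rng.normal(size=(M_per_plane, L))
           + 1j * rng.normal(size=(M_per_plane, L))) / np.sqrt(L)
    field = M_j @ np.exp(1j * (X + W))  # coherent linear wave mixing
    H_planes.append(np.abs(field) ** 2) # G_j: intensity detection (quadratic response)

H_3D = np.concatenate(H_planes)         # stack the planes into one output layer
print(H_3D.shape)                       # (384,)
```

Concatenating the planes makes explicit that the sum over $j$ in Eq. (1) acts as a union of output-node sets rather than a pointwise addition of intensities.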

Figure 1.Three-dimensional PELM for language processing. (A) The text database entry is a paragraph of variable length. Text pre-processing: a sparse representation of the input paragraph is mapped into a Hadamard matrix with phase values in $[0,\pi ]$. (B) The mask is encoded into the optical wavefront by a phase-only SLM. Free-space propagation of the optical field maps the input data into a 3D intensity distribution (speckle-like volume). (C) Sampling the propagating laser beam in multiple far-field planes enables upscaling the feature space. Intensities picked from all the spatial modes form the output layer ${\mathbf{H}}_{3\mathrm{D}}$ that undergoes training via ridge regression. By using three planes ($j=3$), we get a network capacity $C{>10}^{10}$. (D) The example shows a binary text classification problem for large-scale rating.

In experiments, to collect uncorrelated output data ${\mathbf{H}}_{j}$, we employ multiple cameras that simultaneously image the intensity in distinct far-field planes (see Appendix B). These speckle-like distributions [Fig. 1(C)] are mutually uncorrelated, so each plane contributes non-redundant information. Importantly, exploiting the intensity of multiple optical planes is crucial for increasing the effective number of readout channels. In fact, the intensity on a single camera has a finite correlation length and nearby pixels are necessarily correlated, which makes the number of independent channels lower than the number of available sensor pixels. Our 3D scheme circumvents this limitation and allows us to achieve a huge number of output nodes.

B. Optical Encoding of Natural Language Sentences

The natural language processing (NLP) task we consider is the classification of the IMDb dataset, constructed by Maas et al. [44]. To encode natural language on the optical setup, we need to develop an optics-specific representation of the text data. Since text classification tasks have remained elusive in photonic neuromorphic computing, we first introduce the problem. A similar issue is also crucial in digital NLP, where the text encoding problem consists in obtaining a numerical representation of the input that is convenient for digital learning architectures [48]. Specifically, given a vocabulary, i.e., the set of all unique words in the corpus, a basic encoding method is the one-hot technique, in which each word corresponds to a Boolean vector of vocabulary size $V$. Paragraphs of different lengths can then be represented by Boolean vectors of size $V$, where the element ${x}_{ki}=1$ indicates the presence of the $k$th vocabulary word in the paragraph ${X}_{i}$ [Fig. 1(A)]. An input dataset with $N$ samples, in which each database entry is a text paragraph [see Fig. 1(A)], thus becomes an $N\times V$ matrix. The dimension of a single entry ${X}_{i}$ scales with the number of words $V$ in the vocabulary (${10}^{4}$–${10}^{5}$). Photonic text processing hence necessitates an optical platform capable of encoding large-size input data. Given the large number of input modes $L$ supported by spatial light modulation, our 3D-PELM is the most convenient scheme for this purpose.
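A minimal sketch of the one-hot step, with a hypothetical five-word toy vocabulary and a helper `one_hot_paragraph` invented for illustration:

```python
import numpy as np

vocab = ["awful", "boring", "great", "movie", "plot"]   # toy vocabulary, V = 5
word_index = {w: k for k, w in enumerate(vocab)}

def one_hot_paragraph(paragraph, word_index):
    """Boolean vector of vocabulary size: x_k = 1 iff word k occurs in the paragraph."""
    x = np.zeros(len(word_index), dtype=int)
    for word in paragraph.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

x = one_hot_paragraph("Great movie great plot", word_index)
print(x)   # [0 0 1 1 1]
```

Note that paragraphs of any length map to the same fixed size $V$, which is what makes the representation compatible with a fixed number of SLM input modes.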

Optical encoding of natural language in the 3D-PELM requires using a spatial light modulator (SLM) with a fixed number of input modes for sentences of variable length. We follow the one-hot method employed in digital NLP. We use an extension of one-hot encoding, the so-called tf-idf representation, and construct the input representation ${\mathbf{X}}_{\text{tfidf}}$. In information retrieval, the $\mathrm{tf\text{-}idf}(k,j)$ statistic reflects how important a word $k$ is to a document $j$ (term frequency ${n}_{kj}$) in a collection of documents. It is defined as $\mathrm{tf\text{-}idf}(k,j)=\frac{{n}_{kj}}{{\ell}_{j}}\cdot {\mathrm{log}}_{10}(\frac{N}{{S}_{k}})$, where ${\ell}_{j}$ is the length of paragraph $j$ and ${S}_{k}$ is the number of sentences containing the $k$th word. Within the tf-idf representation, each paragraph becomes a sparse, real vector [see Fig. 1(A)] whose non-zero elements are the tf-idf values of the words composing it. To optically encode these sparse data, we apply to ${\mathbf{X}}_{\text{tfidf}}$ the Walsh–Hadamard transform (WHT) [49], which decomposes a discrete function into a superposition of Walsh functions. The transformed dataset ${\mathbf{X}}_{\mathrm{WHT}}$ is a dense matrix [Fig. 1(A)], which is displayed on the SLM as a phase mask within the $[0,\pi]$ range [$\mathbf{X}={\mathbf{X}}_{\mathrm{WHT}}$ in Eq. (2)]. By exploiting the large number of available input pixels, we encode paragraphs drawn from a vocabulary with a massive size $V$ of hundreds of thousands of words.
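The tf-idf and WHT steps can be sketched as follows. This toy pipeline uses the tf-idf variant given in the text (term frequency normalized by paragraph length, base-10 idf) and `scipy.linalg.hadamard` for the transform; the three-document corpus is invented for illustration:

```python
import numpy as np
from scipy.linalg import hadamard

docs = [["great", "movie"], ["boring", "movie"], ["great", "plot"]]
vocab = sorted({w for d in docs for w in d})
N, V = len(docs), len(vocab)

# Document frequency S_k: number of paragraphs containing word k
S = np.array([sum(w in d for d in docs) for w in vocab])

# tf-idf(k, j) = (n_kj / paragraph length) * log10(N / S_k), as in the text
X_tfidf = np.zeros((N, V))
for j, d in enumerate(docs):
    for k, w in enumerate(vocab):
        X_tfidf[j, k] = d.count(w) / len(d) * np.log10(N / S[k])

# Walsh-Hadamard transform densifies the sparse rows (length padded to a power of two)
P = 1 << (V - 1).bit_length()
X_pad = np.pad(X_tfidf, ((0, 0), (0, P - V)))
X_wht = X_pad @ hadamard(P)

# Normalize into [0, pi] for display as a phase mask on the SLM
X_phase = (X_wht - X_wht.min()) / (X_wht.max() - X_wht.min()) * np.pi
```

The WHT mixes every vocabulary entry into every output coefficient, so the sparse tf-idf rows become dense phase patterns that use all available SLM pixels.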

C. Observation of the Photonic Double Descent

Machine learning models with a very large number of trainable parameters show unique features with respect to smaller models, such as the double descent phenomenon. This effect is a resonance in the neural network performance, ruled by the ratio between the number of trainable parameters $M$ and the number of training points ${N}_{\text{train}}$. In the under-parameterized regime ($M<{N}_{\text{train}}$), good performances are obtained by balancing model bias and variance. As $M$ grows, models tend to overfit training data, and prediction performance gets worse until the so-called interpolation threshold ($M={N}_{\text{train}}$) is reached. At this point the model optimally interpolates the training data and the prediction error is maximum. Beyond this resonance, in the over-parameterized regime, the model keeps interpolating training points, but performances on the test set reach the global optimum [50].

We experimentally implement photonic NLP and investigate the double descent effect on our large-scale 3D-PELM. Specifically, we analyze the classification accuracy for the IMDb task (Appendix D) as a function of the number of features $M$ and the training set size ${N}_{\text{train}}$. In Fig. 2(A) we report the observation of the double descent. We observe a dip in test accuracy as the number of channels $M$ reaches the number of training points ${N}_{\text{train}}$. Beyond this resonance (interpolation threshold), we find the over-parameterized region in which maximum accuracy on the training set is achieved. The behavior is obtained via training on ${N}_{\text{train}}\simeq 1.2\times {10}^{4}$ examples with a fixed train/test split ratio of 0.67/0.33. In this case, the highest classification accuracy (0.77) is found in the under-parameterized region. Conversely, in Fig. 2(B), we consider a much smaller number of training points, ${N}_{\text{train}}\simeq {10}^{3}$. Remarkably, we reach the same optimal accuracy despite using only a fraction of the available training points. This observation reveals a particularly favorable learning region of the 3D-PELM in which only a reduced number of training points is required to reach high accuracy levels. From the operational standpoint, it is much more efficient to measure a large number of modes in parallel than to sequentially process many training examples. In Fig. 2(C) we report the full dynamics of the double descent phenomenon by continuously varying both $M$ and ${N}_{\text{train}}$. We observe that the accuracy dip shifts along the $M={N}_{\text{train}}$ line as the training set grows. This result shows the existence of an optimal learning region that is accessible on the 3D-PELM thanks to its large number of photonic nodes.

Figure 2.Photonic sentiment analysis. (A), (B) Training and test accuracy of the 3D-PELM on the IMDb dataset as a function of the number of output channels. The shaded area corresponds to the over-parameterized region. The configuration in (B) allows us to reach very high accuracy in the over-parameterized region with a dataset limited to ${N}_{\text{train}}=1186$ training points. In (A), the same accuracy is reached in the under-parameterized region with ${N}_{\text{train}}=12\text{,}278$. Black horizontal lines correspond to the maximum test accuracy achieved (0.77). (C) IMDb classification accuracy by varying the number of features $M$ and training dataset size ${N}_{\text{train}}$. The boundary between the under and over-parameterized region (interpolation threshold), ${N}_{\text{train}}=M$, is characterized by a sharp accuracy drop (cyan contour line).

The observed operational advantage of the over-parameterized region indicates that, by further increasing the number of output modes, one can increase training effectiveness and/or performance [3]. In our 3D-PELM we reached $M=120{,}000$ readout channels that are independent of each other. We also considered larger input spaces, extending the vocabulary to include the whole set of words in the corpus ($V\simeq {2}^{17}$). Figure 3(A) shows that the test accuracy reaches a plateau as $M$ increases in the over-parameterized region. Saturation indicates that all the essential information encoded within the optical field has been extracted through the available channels. However, we observe a change in the performance as we employ more input features $L$ [Figs. 3(B) and 3(C)]. To estimate the rate at which performance improves as more channels are used for training, we estimate the angular coefficient $m$ of the linear growth that precedes the plateau. Although the onset of saturation can be estimated using diverse criteria, and $m$ varies depending on the measured range, its trend when increasing $L$ remains unaltered. We note a relevant enhancement of $m$ for $L=362\times 362$, which indicates that PELMs featuring larger input spaces are able to reach optimal performances with a lower number of parameters. Figure 3(D) reports the accuracy as a function of the training-test split ratio (keeping ${N}_{\text{train}}=6.7\times {10}^{3}$ fixed) for $M=0.8\times {10}^{5}$ and $M=1.2\times {10}^{5}$. Importantly, a limited number of examples ($\sim 20\%$) is enough to reach maximum accuracy in the over-parameterized region.

Figure 3.Performances at ultralarge scale. (A)–(C) Test accuracy as a function of $M$ for different input sizes $L$. In all cases, the 3D-PELM performance saturates in the over-parameterized region, reaching a plateau. A linear fit of the data preceding the plateau shows that the onset of the saturation is faster for datasets with a larger input space. The corresponding angular coefficient $m$ is inset in each panel. (D) Test accuracy varying the training set size for $M=0.8\times {10}^{5}$ and $M=1.2\times {10}^{5}$.

To establish a comparison among the various photonic neuromorphic computing devices, we introduce the optical network capacity as a figure of merit related to over-parameterization. We define the capacity $\mathcal{C}$ of a generic optical neural network as the product of the number of input and output nodes that can be processed in a single iteration, $\mathcal{C}=L\times M$. This quantity gives direct information on the kind of large-scale problems that can be implemented on the optical setup. It depends only on the number of controllable nodes, which is the quantity believed to play the main role in big data processing [3]. Specifically, $L$ sets the size of the dataset that can be encoded, while $M$ reflects the embedding capacity of the network, as larger datasets require larger feature spaces to be learned. Moreover, $\mathcal{C}$ also indicates how far the over-parameterized regime is for a given task: useful over-parameterized conditions can be reached only if $\mathcal{C}\gg N\times L$. We remark that the capacity is not a measure of the processor accuracy. It is instead a useful quantity for comparing the scalability of different photonic computing devices.
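For concreteness, the capacity and the over-parameterization condition can be checked with the figures quoted in the text ($L=362\times 362$, $M=120{,}000$, ${N}_{\text{train}}\simeq 1.2\times {10}^{4}$):

```python
L = 362 * 362            # SLM input modes
M = 120_000              # independent camera readout channels
C = L * M                # optical network capacity

N_train = 12_278         # training points used in Fig. 2(A)
over_parameterized = C > N_train * L   # equivalent to M > N_train
print(C, over_parameterized)           # 15725280000 True
```

Since $L$ cancels, the condition $\mathcal{C}>{N}_{\text{train}}\times L$ reduces to $M>{N}_{\text{train}}$, which is exactly the interpolation-threshold criterion of the previous section.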

In Table 1 we report the optical network capacity for various photonic processors that have been recently demonstrated. We focus on photonic platforms that exploit the RC and ELM paradigms, since these devices are suitable for big data processing and may present the so-called photonic advantage at a large scale [51]. Our 3D-PELM has a record capacity $\mathcal{C}=362\times 362\times 120{,}000\simeq 1.5\times {10}^{10}$, more than one order of magnitude larger than any other optical processor demonstrated for supervised learning. Moreover, while increasing the capacity is challenging on many optical reservoir computers, our 3D-PELM can be scaled up further by enlarging the measured volume of the optical field. For example, the optical processor in Ref. [52], which is exploited for randomized numerical linear algebra, suggests that our scheme can reach a larger capacity by exploiting large-area cameras and multiple optical scattering.

Table 1. Maximum Network Capacity of Current Photonic Neuromorphic Computing Hardware for Supervised Learning

| Working Principle | $M$ | $L$ | $\mathcal{C}$ | Machine Learning Task | Ref. |
| --- | --- | --- | --- | --- | --- |
| Time-multiplexed cavity | 1400 | 7129 | ${10}^{7}$ | Regression | [39] |
| Amplitude modulation | 16,384 | 2000 | ${10}^{8}$ | Human action recognition | [27] |
| Frequency multiplexing | 200 | 640 | ${10}^{5}$ | Time series recovery | [41] |
| Optical multiple scattering | 50,000 | 64 | ${10}^{6}$ | Chaotic series prediction | [38] |
| Amplitude Fourier filtering | 1024 | 43,263 | ${10}^{7}$ | Image classification | [30] |
| Multimode fiber | 240 | 240 | ${10}^{5}$ | Classification, regression | [35] |
| Free-space propagation | 6400 | 784 | ${10}^{6}$ | Classification, regression | [34] |
| 3D optical field | 120,000 | 131,044 | ${10}^{10}$ | Natural language processing | 3D-PELM |

F. Comparison with Digital Machine Learning

To further validate the capability of our photonic setup for text processing, we compare the accuracy of our 3D-PELM with various general-purpose digital neural networks on the IMDb task. To underline the overall impact of over-parameterization on the performance, we consider two opposite parameterization regimes, $M=1\times {10}^{3}$ and $M=4\times {10}^{4}$, and two distinct conditions for the dataset size, ${N}_{\text{train}}=6700$ [Fig. 4(A)] and ${N}_{\text{train}}=1500$ [Fig. 4(B)]. Since the learning principle of the 3D-PELM is based on kernel methods (see Ref. [34]), we implement a support vector machine (SVM) and a ridge regression based on nonlinear random projections (RP) [25] as representative models of the kernel-algorithm class. The input for the SVM and the RP is the tf-idf–encoded dataset, but no significant differences are found using the Hadamard dataset. We also simulate the 3D-PELM, i.e., we evaluate our photonic scheme by using Eq. (2) to generate the output data. A convolutional neural network (CNN) with the same number of trainable weights is used as an additional benchmark model. Details on the various digital models are reported in Appendix E. The results for a split ratio ${N}_{\text{train}}/{N}_{\mathrm{test}}=0.67/0.33$ are shown in Fig. 4. Overall, we observe that the photonic device performs sentiment analysis on the IMDb with an accuracy comparable to standard digital methods. The SVM sets the maximum expected accuracy (0.86). The device's over-parameterized accuracy (0.75 for ${N}_{\text{train}}=6700$) is mainly limited by the finite precision of the optical components and by noise, and it could be enhanced up to 0.83 (3D-PELM numerics) by using a 64-bit camera. In fact, when operating with 8-bit precision, both the simulated 3D-PELM and the RP models achieve performance that agrees well with the experiments. Interestingly, Fig. 4(B) indicates that, for a limited set of training samples, the 3D-PELM surpasses both the SVM and the CNN. This points out an additional advantage that the photonic setup achieves thanks to over-parameterization: not only is accuracy improved with respect to the standard under-parameterized regime, but the device operates effectively in conditions where digital models are less accurate.

Figure 4.Analysis of the IMDb accuracy. (A), (B) The comparison reports the accuracy for the experimental device (3D-PELM device), the simulated device (3D-PELM numerics), the random projection method with ridge regression (RP), the support vector machine (SVM), and a convolutional neural network (CNN) in both the under-parameterized ($M=1\times {10}^{3}$) and over-parameterized ($M=4\times {10}^{4}$) regimes, for (A) ${N}_{\text{train}}=6700$ and (B) ${N}_{\text{train}}=1500$. 8-bit numerical results, when applicable, refer to the over-parameterized regime.

We have reported the first, to our knowledge, photonic implementation of natural language processing by performing sentiment analysis on the IMDb dataset. Our results demonstrate the feasibility of modern large-scale computing tasks in photonic hardware. They can potentially be improved both in terms of performance, via low-noise, high-resolution cameras and more complex encoding/training algorithms, and in terms of applications. Further developments include the implementation of advanced tasks such as multiclass sentiment analysis, which allows discriminating positive and negative sentences into different levels. Moreover, using different datasets such as the Stanford Sentiment Treebank (SST-5), we could train our 3D-PELM to provide ratings. On the other hand, our versatile optical setting may allow the implementation of alternative strategies for photonic NLP, including dimensionality reduction of the input data via optical random projection [53]. An optical scheme that preserves the sequential nature of language may also be conceived by introducing recurrent mechanisms into our 3D-PELM. Another interesting direction is developing specific encoding schemes for representing text in the optical domain. Viable approaches include adapting bidimensional text representations such as quick response (QR) codes. Furthermore, by following the concept of word embedding used in digital NLP [54], photonic hardware may be used to learn a purely optical word embedding, including semantics, word ordering, and syntax.

In conclusion, modern machine learning architectures rely on over-parameterization to reach the global optimum of objective functions. In photonic computing, achieving over-parameterization for large-scale problems is challenging. Here, we realize a novel photonic computing device that, thanks to its huge network capacity, naturally operates in the over-parameterized region. The 3D-PELM is a very promising solution for developing a fully scalable photonic machine-learning platform. Our photonic processor enables the first link between NLP and optical computing hardware, opening a new vibrant research stream toward fast and energy-efficient artificial intelligence tools.

Acknowledgment

We thank I. MD Deen and F. Farrelly for technical support in the laboratory. D.P. and C.C. conceived the research. C.M.V. proposed the application to NLP. D.P. developed the 3D photonic network. I.G. and D.P. carried out the experimental measurements. C.M.V. and I.G. performed the data analysis. D.P. and C.C. co-supervised the project. All authors contributed to the manuscript.

APPENDIX A: PELM FRAMEWORK

In ELMs, a dataset with $N$ data points $\mathbf{X}=[({X}_{1},{y}_{1}),\dots ,({X}_{N},{y}_{N})]$, with $({X}_{i},{y}_{i})\in {\mathbb{R}}^{L}\times \mathbb{R}$, is mapped into a higher-dimensional feature space through a nonlinear transformation $g(\mathbf{X})$, yielding the hidden-layer output matrix $\mathbf{H}=[g({X}_{1}),\dots ,g({X}_{N})]$, with $g({X}_{i})\in {\mathbb{R}}^{M}$. A set of weights $\beta$ is learned via ridge regression such that the linear combination $\tilde{\mathbf{Y}}=\mathbf{H}\beta$ well approximates the true label vector $\mathbf{Y}=[{y}_{1},\dots ,{y}_{N}]$. An explicit solution is $\beta ={({\mathbf{H}}^{T}\mathbf{H}+\lambda \mathbf{I})}^{-1}{\mathbf{H}}^{T}\mathbf{Y}$, where $\lambda$ is the regularization parameter and $\mathbf{I}$ is the identity. The use of ridge regression keeps training costs to a minimum, which is crucial in applications requiring fast and reconfigurable learning. In a free-space PELM, the nonlinear mapping of input data is realized by a combination of free-space optical propagation and intensity detection [34]. The speckle-like pattern detected by a camera forms the nonlinear features of the input sample. The optical mapping reads as $\mathbf{H}=G[\mathbf{M}\,\mathrm{exp}\,i(\mathbf{X}+\mathbf{W})]$, where the input dataset $\mathbf{X}$ is phase-encoded via spatial light modulation, $\mathbf{W}$ is the embedding matrix, $G$ is the detector response function, and $\mathbf{M}$ is the discrete Fourier transform that models propagation to the focal plane. For more details on the PELM architecture and its working principle, see Ref. [34].
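The ridge-regression readout can be written in a few lines. The sketch below assumes a generic feature matrix $\mathbf{H}$ (toy sizes, random data) and solves the normal equations directly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hidden-layer output H (N samples x M features) from any fixed nonlinear map g
N, M, lam = 200, 50, 1e-3
H = rng.normal(size=(N, M))
Y = rng.integers(0, 2, size=N).astype(float)    # toy binary labels

# Closed-form ridge solution: beta = (H^T H + lam I)^{-1} H^T Y
beta = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ Y)
Y_hat = H @ beta                                # linear readout approximating Y
```

Using `np.linalg.solve` on the regularized normal equations avoids forming an explicit matrix inverse, which is both faster and numerically safer.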

APPENDIX B: EXPERIMENTAL SETUP

The optical setup of the 3D-PELM is sketched in Fig. 1(B). A CW laser beam (532 nm, 100 mW) is expanded and collimated onto a phase-only SLM (Hamamatsu X13138, $1280\times 1024$ pixels, 60 Hz, 8 bits), which encodes data on the optical wavefront. The SLM operates in the $[0, 2\pi]$ interval with a linear response function. Both the input and the embedding signal are encoded within the $[0,\pi]$ range and superimposed. After free-space propagation through a lens ($f=150\ \mathrm{mm}$), the far-field three-dimensional intensity distribution is accessed simultaneously by three separate imaging systems and detected by three cameras (Basler Ace 2 Pro, $1920\times 1200$ pixels, 12 bits). The speckle-like field spatially decorrelates during propagation; the transverse intensity is transformed in a random way from one plane to the next, provided that these planes are separated by distances larger than the longitudinal field correlation length. To carry out the measurements, the propagating beam is divided by a series of beam splitters into separate imaging paths with different optical lengths. The far-field planes are selected so as to sample three speckle-like distributions having negligible mutual intensity correlation, as confirmed by intensity correlation measurements. Each additional plane contains additional information: in fact, keeping $M$ fixed, we find an improvement in the classification accuracy when using three far-field planes instead of a single intensity distribution. In this condition, each camera maps a distinct portion of the high-dimensional feature space into which the input field is projected. Importantly, speckle grains typically extend over adjacent pixels [Fig. 1(C)]. To avoid spatial correlations within a single speckle pattern, pixels are grouped into square modes (channels), and the average intensity on each channel forms the output signals. The intensity correlation length on the camera cannot be reduced to a single pixel because the minimal size of the input modes is limited by the SLM pixel pitch. For this reason, the number of output channels cannot be increased by simply using all the sensor pixels; multiple optical planes are necessary. The measured set of speckle-like patterns is processed on a conventional computer.
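The channel-binning step can be sketched as averaging camera intensities over non-overlapping square blocks; `bin_into_channels` is a hypothetical helper written for illustration:

```python
import numpy as np

def bin_into_channels(frame, block):
    """Average camera intensities over non-overlapping block x block squares."""
    h, w = frame.shape
    frame = frame[: h - h % block, : w - w % block]   # crop to a multiple of block
    h2, w2 = frame.shape
    return frame.reshape(h2 // block, block, w2 // block, block).mean(axis=(1, 3))

frame = np.arange(16, dtype=float).reshape(4, 4)      # toy 4 x 4 "camera frame"
print(bin_into_channels(frame, 2))                    # [[ 2.5  4.5] [10.5 12.5]]
```

Choosing `block` comparable to the speckle grain size makes each channel approximately statistically independent, which is the point of the grouping described above.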

APPENDIX C: 3D-PELM TRAINING

Training operates by loading the randomly ordered input dataset on the SLM and measuring three speckle-like intensity distributions for each input sample (paragraph). From the acquired signals, we randomly select $M$ of the available channels, and we use the corresponding $M$ intensity values to form the output dataset. This dataset is split randomly into training and test datasets using a split ratio ${N}_{\text{train}}/{N}_{\mathrm{test}}=0.67/0.33$, which is kept fixed throughout the analysis. All the classification accuracies we obtained refer to this hyperparameter. The training dataset is the output matrix $\mathbf{H}$ used for training (see Appendix A). Since our task corresponds to a binary classification problem, the target labels are ${y}_{i}\in \{0,1\}$. According to the PELM framework, training by ridge regression consists in solving the equation for the $\beta$ vector. This matrix inversion is performed using the Python machine-learning library scikit-learn. The regularization parameter $\lambda$ is varied in the $[{10}^{-4},{10}^{2}]$ range. As its value does not considerably influence the testing accuracy, we use $\lambda ={10}^{-4}$. Given a predicted output ${\widehat{y}}_{i}$, the predicted label is ${\tilde{y}}_{i}=\mathrm{\Theta}({\widehat{y}}_{i}-0.5)$, where $\mathrm{\Theta}$ is the Heaviside step function. Text classification performance is evaluated via the accuracy ${\sum}_{i}[{y}_{i}={\tilde{y}}_{i}]/N$, where $[\cdot]$ is the Iverson bracket, defined as $[P]=1$ if $P$ is true and 0 otherwise.
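A hedged sketch of this training procedure on synthetic intensities, using scikit-learn's `Ridge` (as the text does) with $\lambda ={10}^{-4}$, thresholding at 0.5, and the accuracy metric above; the data and sizes are invented:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)

# H: measured channel intensities (N samples x M channels), y: binary labels
N, M = 300, 100
H = rng.random((N, M))
y = (H[:, :10].sum(axis=1) > 5).astype(int)   # toy labels tied to the first channels

split = int(0.67 * N)                          # fixed 0.67/0.33 train/test split
model = Ridge(alpha=1e-4)                      # alpha plays the role of lambda
model.fit(H[:split], y[:split])

y_hat = model.predict(H[split:])
y_pred = (y_hat > 0.5).astype(int)             # Heaviside thresholding at 0.5
accuracy = np.mean(y_pred == y[split:])        # Iverson-bracket accuracy
```

Note that ridge regression treats the 0/1 labels as real targets; the Heaviside step converts the continuous prediction back into a class label.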

APPENDIX D: NLP TASK, TEXT PRE-PROCESSING, AND OPTICAL ENCODING

We consider the IMDb dataset [44], a widely used benchmark for polarity-based text classification [55]. The dataset consists of $5\times {10}^{4}$ movie reviews, each one forming the $i$th input data sample. Each review is a paragraph consisting of multiple sentences and expresses an overall positive or negative sentiment (target variable). This sentiment analysis task hence consists in a large-scale binary classification. To implement the problem, the whole set of IMDb sentences is pre-processed by removing special characters (e.g., html), lowercasing, removing stop words, and lemmatizing. The length of the resulting vector can be varied by neglecting words that appear in the corpus with a frequency lower than a given threshold. To compute the WHT and obtain a dense text representation, zero-padding is used to reach a vector length $d={2}^{\lceil {\mathrm{log}}_{2}V\rceil}$. For example, results in Fig. 2 refer to data with $V={10}^{4}$ that are padded to $L={2}^{14}=16{,}384$ and reshaped as a $128\times 128$ square matrix, while the full vocabulary ($V=131{,}044$) requires zero-padding, and paragraphs are arranged on $362\times 362$ input modes (Fig. 3). After WHT computation, the values are normalized to the $[0,\pi]$ range and loaded on the SLM. Each element of the input ${\mathbf{X}}_{\mathrm{WHT}}$ is displayed on a square block of SLM pixels. The embedding matrix $\mathbf{W}$ is a uniformly distributed random matrix for all the experiments. Its values ${w}_{ij}$ are selected in $[0,\pi]$ before the training and kept fixed for a given photonic computation.
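The zero-padding length required by the WHT is the smallest power of two not below $V$; a small helper (invented for illustration) reproduces the two cases quoted above:

```python
import math

def wht_length(V):
    """Smallest power of two >= V: the input length required by the WHT."""
    return 1 << math.ceil(math.log2(V))

print(wht_length(10_000))    # 16384, reshaped to 128 x 128 SLM modes
print(wht_length(131_044))   # 131072
```

When $V$ is already a power of two, `wht_length` returns it unchanged and no padding is needed.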

APPENDIX E: DIGITAL MACHINE LEARNING

The 3D-PELM is numerically simulated following the optical mapping model of Eqs. (1) and (2) (see also Ref. [34]). Data encoding and training operate in the same way as in the photonic device, and the output matrix is computed by using Eq. (2) with $G(z)={|z|}^{2}$ (quadratic nonlinearity). In the RP model, the data projection into the feature space is obtained by matrix multiplication with a real-valued, uniformly distributed random matrix followed by a quadratic activation function. Training occurs by ridge regression. To simulate the 8-bit machines, the input data and the output values forming the linear readout layer are discretized to 8-bit precision. The SVM is a binary classifier, and thus it is especially suited for the IMDb task. We use the scikit-learn built-in function on the tf-idf–processed dataset, and the over-parameterized regime is evaluated by using $M={10}^{3}$. The CNN is composed of four convolutional layers with nonlinear activation and a fully connected layer. The CNN scheme is designed to have $M$ total nodes, and its accuracy does not depend significantly on the network details and hyperparameters.
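A sketch of the RP model and the 8-bit discretization, assuming a uniform random projection, a quadratic activation, and uniform 256-level quantization (details the text leaves open):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_projection_features(X, M):
    """RP model: real-valued uniform random projection + quadratic activation."""
    L = X.shape[1]
    W = rng.uniform(-1, 1, size=(L, M))   # fixed random projection matrix
    return (X @ W) ** 2

def quantize_8bit(H):
    """Discretize features to 256 uniform levels, mimicking an 8-bit readout."""
    lo, hi = H.min(), H.max()
    levels = np.round((H - lo) / (hi - lo) * 255)
    return lo + levels / 255 * (hi - lo)

X = rng.normal(size=(10, 6))              # toy input dataset
H = random_projection_features(X, M=16)   # nonlinear random features
H8 = quantize_8bit(H)                     # 8-bit version used for the comparison
```

Ridge regression is then trained on `H` (or `H8` for the 8-bit comparison) exactly as in Appendix A.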

[1] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15 (2021).

[25] A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Dremeau, S. Gigan, F. Krzakala. Random projections through multiple optical scattering: approximating kernels at the speed of light. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6215-6219 (2016).

[44] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts. Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142-150 (2011).

[47] S. Mei, A. Montanari. The generalization error of random features regression: precise asymptotics and double descent curve. Commun. Pure Appl. Math., 75, 667-766 (2020).

[52] D. Hesslow, A. Cappelli, I. Carron, L. Daudet, R. Lafargue, K. Müller, R. Ohana, G. Pariente, I. Poli. Photonic co-processors in HPC: using LightOn OPUs for randomized numerical linear algebra (2021).