Large-scale photonic natural language processing

Carlo M. Valensise; Ivana Grecco; Davide Pierangeli; Claudio Conti

doi:10.1364/PRJ.472932

1. INTRODUCTION

Advanced artificial intelligence systems are becoming extremely demanding in terms of training time and energy consumption [1,2]. An ever-larger number of trainable parameters is required to exploit over-parameterization and achieve state-of-the-art performances [3,4]. Large-scale, energy-efficient computational hardware is becoming a subject of intense interest.

Photonic neuromorphic computing systems [5–7] offer large throughput [8] and energetically efficient [9] hardware accelerators based on integrated silicon photonic circuits [10–14], engineered meta-materials [15], or 3D-printed diffractive masks [16,17]. Other light-based computing architectures leverage reservoir computing (RC) [18] and extreme learning machine (ELM) [19] computational paradigms, where the input data is mapped into a feature space through a fixed set of random weights and training is performed only on the linear readout layer. Optical reservoir computers [20–33] and photonic extreme learning machines (PELMs) [34–37] apply successfully to various learning tasks, ranging from time series prediction [38,39] to image classification [40,41]. Despite the remarkable performances achieved by these architectures, their impact on large-scale problems is limited, mainly due to size constraints.

Here, we demonstrate photonic machine learning at an ultralarge scale by mapping the optical output layer in the full three-dimensional (3D) structure of the optical field. This original approach offers inherent scalability to the optical device and effortless access to the over-parameterized learning region. The 3D-PELM that we implement can process simultaneously up to 250,000 total input-output nodes via spatial light modulation, featuring a total network capacity one order of magnitude larger than existing optical reservoir computers for supervised learning. Our large-scale photonic network allows us to optically implement a massive text classification problem, the sentiment analysis of the Internet Movie Database (IMDb) [44], and enables, for the first time to our knowledge, the observation of the double descent phenomenon [45–47] on a photonic computing device. We demonstrate that, thanks to its huge number of optical output nodes, the 3D-PELM successfully classifies text by using only a limited number of training points, realizing energy-efficient, large-scale text processing.

2. RESULTS

A. Three-Dimensional PELM

A PELM classifies a dataset with $N$ points $X = [X_{1}, \dots, X_{N}]$ by mapping the input sample $X_{i} \in \mathbb{R}^{L}$ into a high-dimensional feature space through a nonlinear transformation that is performed by a photonic platform [34]. To perform the classification, the output matrix $H$, which contains the measured optical signals, is linearly combined with a set of $M$ trainable readout weights $\beta$ (see Appendix A). In all previous PELM realizations [25,34,35,39,41], the intensities stored in $H$ only contain information on the optical field at a single spatial or temporal plane. Here, to scale up the optical network, we exploit the entire three-dimensional optical field. Figure 1(B) shows a schematic of our 3D-PELM: the photonic network uses as output nodes the full 3D structure of the optical field propagating in free space. The transformation of the input data sample into an optical intensity volume occurs through linear wave mixing by coherent free-space propagation and intensity detection. Specifically, the 3D optical mapping reads as

$$H_{3D} = \sum_{j} H_{j}, \tag{1}$$

$$H_{j} = G_{j} [M_{j} \exp i (X + W)], \tag{2}$$

where $M_{j}$ is the transfer matrix that models coherent light propagation to the $j$th detection plane, e.g., the discretized diffraction operator. The matrix $M_{j}$ has complex coefficients, and thus the photonic scheme implements a complex-valued neural network [42,43]. In Eq. (2) the input dataset is encoded by phase modulation, while $W$ is a pre-selected random matrix that embeds the input signal and provides network biases. $G_{j}$ is the response function of the $j$th detector to the impinging light field.
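
The mapping of Eqs. (1) and (2) can be sketched numerically. The snippet below is a minimal illustration, not the authors' simulation code: the transfer matrices $M_{j}$ are replaced by generic random complex matrices, the detector response $G_{j}$ is modeled as square-law intensity detection with an optional saturation level, and the collection over planes in Eq. (1) is interpreted as stacking the per-plane intensities into one feature vector. All function and variable names are hypothetical.

```python
import numpy as np

def pelm_3d_features(X, W, transfer_matrices, saturation=None):
    """Stack intensity features from several detection planes.

    X: (N, L) real array of input phases.
    W: (L,) fixed random phase mask (embedding and biases).
    transfer_matrices: list of (M_j, L) complex arrays, one per plane
        (assumption: random matrices stand in for the diffraction operator).
    saturation: optional clip level modelling a saturating detector.
    """
    field_in = np.exp(1j * (X + W))            # phase-only encoding, exp[i(X + W)]
    planes = []
    for M_j in transfer_matrices:
        field_out = field_in @ M_j.T           # linear coherent propagation
        intensity = np.abs(field_out) ** 2     # square-law detection (assumed G_j)
        if saturation is not None:
            intensity = np.minimum(intensity, saturation)
        planes.append(intensity)
    # Assumption: the planes are concatenated to form the output layer.
    return np.concatenate(planes, axis=1)

rng = np.random.default_rng(0)
N, L, M_per_plane, n_planes = 5, 64, 100, 3
X = rng.uniform(0.0, np.pi, size=(N, L))
W = rng.uniform(0.0, 2.0 * np.pi, size=L)
Ms = [
    (rng.normal(size=(M_per_plane, L)) + 1j * rng.normal(size=(M_per_plane, L)))
    / np.sqrt(2 * L)
    for _ in range(n_planes)
]
H3d = pelm_3d_features(X, W, Ms)
print(H3d.shape)  # (5, 300)
```

Each additional plane adds `M_per_plane` output nodes without changing the input encoding, which is the scaling mechanism described above.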

Figure 1. Three-dimensional PELM for language processing. (A) The text database entry is a paragraph of variable length. Text pre-processing: a sparse representation of the input paragraph is mapped into a Hadamard matrix with phase values in $[0, \pi]$. (B) The mask is encoded into the optical wavefront by a phase-only SLM. Free-space propagation of the optical field maps the input data into a 3D intensity distribution (speckle-like volume). (C) Sampling the propagating laser beam in multiple far-field planes enables upscaling the feature space. Intensities picked from all the spatial modes form the output layer $H_{3D}$ that undergoes training via ridge regression. By using three planes ($j = 3$), we get a network capacity $C > 10^{10}$. (D) The example shows a binary text classification problem for large-scale rating.

In experiments, to collect uncorrelated output data $H_{j}$, we employ multiple cameras that simultaneously image the intensity in distinct far-field planes (see Appendix B). These speckle-like distributions [Fig. 1(C)] are uncorrelated with each other, so each plane contributes non-redundant information. Importantly, exploiting the intensity of multiple optical planes is crucial for increasing the effective number of readout channels. In fact, the intensity on a single camera has a finite correlation length and neighboring pixels are necessarily correlated, which makes the number of independent channels lower than the number of available sensor pixels. Our 3D scheme circumvents this limitation and allows us to achieve a huge number of output nodes.

B. Optical Encoding of Natural Language Sentences

The natural language processing (NLP) task we considered is the classification of the IMDb dataset, constructed by Maas et al. [44]. To encode natural language on the optical setup, we need to develop an optics-specific representation of the text data. Since text classification tasks have remained elusive in photonic neuromorphic computing, we here introduce the problem. A similar issue is crucial also in digital NLP, where the text encoding problem consists of obtaining a numerical representation of the input that is convenient for digital learning architectures [48]. Specifically, given a vocabulary, i.e., the set of all unique words in the corpus, a basic encoding method is the one-hot technique, in which each word corresponds to a Boolean vector with vocabulary size $V$. Paragraphs of different lengths can be represented by Boolean vectors of size $V$, with the element $x_{ki} = 1$ indicating the presence of the $k$th vocabulary word in the paragraph $X_{i}$ [Fig. 1(A)]. An input dataset with $N$ samples, in which each database entry is a text paragraph [see Fig. 1(A)], thus becomes an $N \times V$ matrix. The dimension of the single entry $X_{i}$ scales with the number of words $V$ in the vocabulary ($10^{4}$–$10^{5}$). Photonic text processing hence necessitates an optical platform capable of encoding large-size input data. Given the large number of input modes $L$ supported by spatial light modulation, our 3D-PELM is the most convenient scheme for this purpose.
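
As a small illustration of the Boolean representation described above, the following sketch builds the $N \times V$ presence matrix for a toy corpus. The tokenization (lower-casing and whitespace splitting) and the helper names are assumptions made for the example and are not taken from the experiment.

```python
import numpy as np

def build_vocabulary(paragraphs):
    """Map every unique word in the corpus to a column index."""
    words = sorted({w for p in paragraphs for w in p.lower().split()})
    return {w: k for k, w in enumerate(words)}

def presence_matrix(paragraphs, vocab):
    """Return the N x V Boolean matrix with x[i, k] = 1 if word k occurs in paragraph i."""
    X = np.zeros((len(paragraphs), len(vocab)), dtype=np.uint8)
    for i, paragraph in enumerate(paragraphs):
        for word in paragraph.lower().split():
            if word in vocab:
                X[i, vocab[word]] = 1
    return X

# Toy corpus, invented for illustration only.
paragraphs = ["a great movie", "a dull and long movie", "great acting"]
vocab = build_vocabulary(paragraphs)
X = presence_matrix(paragraphs, vocab)
print(X.shape)  # (3, 7)
```

Paragraphs of different lengths all map to rows of the same width $V$, which is what allows a modulator with a fixed number of input modes to display them.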

Optical encoding of natural language in the 3D-PELM requires using a spatial light modulator (SLM) with a fixed number of input modes for sentences with variable lengths. We follow the one-hot method employed in digital NLP and use an extension of it, the so-called tf-idf representation, to construct the input representation $X_{tfidf}$. In information retrieval, the $\text{tf-idf}(k, j)$ statistic reflects how important a word $k$ is to a document $j$ (in which it occurs with frequency $n_{kj}$) in a collection of documents. It is defined as $\text{tf-idf}(k, j) = \frac{n_{kj}}{V} \cdot \log_{10} \left(\frac{N}{S_{k}}\right)$, where $V$ is the length of a paragraph and $S_{k}$ is the number of sentences containing the $k$th word. Within the tf-idf representation, each paragraph becomes a sparse, real vector [see Fig. 1(A)] whose non-zero elements are the tf-idf values of the words composing it. To optically encode these sparse data, we applied to $X_{tfidf}$ the Walsh–Hadamard transform (WHT) [49], which decomposes a discrete function into a superposition of Walsh functions. The transformed dataset $X_{WHT}$ is a dense matrix [Fig. 1(A)], which is displayed on the SLM as a phase mask within the $[0, \pi]$ range [$X = X_{WHT}$ in Eq. (2)]. By exploiting the large number of available input pixels, we encode paragraphs with a massive size $V$ containing hundreds of thousands of words.
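
The encoding chain (tf-idf, Walsh–Hadamard transform, phase mask) can be sketched as follows. This is a minimal illustration under stated assumptions: the term frequency is normalized by the paragraph length, as in the definition above; the vector is zero-padded to the next power of two so that the WHT is defined; and the transformed values are rescaled linearly to $[0, \pi]$, a choice the text does not specify. The function names are hypothetical.

```python
import numpy as np

def tfidf_matrix(token_lists, vocab):
    """tf-idf(k, j) = n_kj / (paragraph length) * log10(N / S_k)."""
    N, V = len(token_lists), len(vocab)
    counts = np.zeros((N, V))
    for j, tokens in enumerate(token_lists):
        for w in tokens:
            counts[j, vocab[w]] += 1
    lengths = np.array([len(t) for t in token_lists], dtype=float)
    doc_freq = (counts > 0).sum(axis=0)       # S_k: paragraphs containing word k
    idf = np.log10(N / doc_freq)              # assumes every vocab word occurs at least once
    return counts / lengths[:, None] * idf

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^p vector."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def to_phase_mask(x_tfidf):
    """Zero-pad, apply the WHT, and rescale to [0, pi] (assumed linear rescaling)."""
    V = x_tfidf.shape[0]
    n = 1 << (V - 1).bit_length()             # next power of two >= V
    padded = np.zeros(n)
    padded[:V] = x_tfidf
    y = fwht(padded)
    span = y.max() - y.min()
    if span == 0:
        return np.zeros(n)
    return np.pi * (y - y.min()) / span

# Toy corpus, invented for illustration only.
token_lists = [["a", "great", "great", "movie"], ["a", "dull", "movie"], ["great", "acting"]]
vocab = {w: k for k, w in enumerate(sorted({w for t in token_lists for w in t}))}
X_tfidf = tfidf_matrix(token_lists, vocab)
phase = to_phase_mask(X_tfidf[0])
print(X_tfidf.shape, phase.shape)  # (3, 5) (8,)
```

The sparse tf-idf row has only a few non-zero entries, whereas the transformed vector is generally dense, so the whole modulator area carries information about the paragraph.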

C. Observation of the Photonic Double Descent

Machine learning models with a very large number of trainable parameters show unique features with respect to smaller models, such as the double descent phenomenon. This effect is a resonance in the neural network performance, ruled by the ratio between the number of trainable parameters $M$ and the number of training points $N_{train}$. In the under-parameterized regime ($M < N_{train}$), good performances are obtained by balancing model bias and variance. As $M$ grows, models tend to overfit training data, and prediction performance gets worse until the so-called interpolation threshold ($M = N_{train}$) is reached. At this point the model optimally interpolates the training data and the prediction error is maximum. Beyond this resonance, in the over-parameterized regime, the model keeps interpolating training points, but performances on the test set reach the global optimum [50].
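
The role of the ratio $M / N_{train}$ can be explored with a small numerical experiment. The sketch below is a digital toy model, not the photonic experiment: it uses ReLU random features on synthetic Gaussian data with noisy labels and a minimum-norm least-squares readout (the ridge-less limit). All names and parameter values are assumptions chosen for illustration. In such settings one typically expects the test accuracy to degrade near $M = N_{train}$, although the depth of the dip depends on the noise level and on the random seed.

```python
import numpy as np

def random_features(X, W):
    """Fixed random projection followed by a ReLU nonlinearity."""
    return np.maximum(X @ W, 0.0)

def fit_min_norm(H, y):
    """Minimum-norm least-squares readout (pseudo-inverse solution)."""
    return np.linalg.pinv(H) @ y

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 500, 20
w_true = rng.normal(size=d)

def make_data(n):
    """Synthetic binary labels from a noisy linear rule (toy data)."""
    X = rng.normal(size=(n, d))
    y = np.sign(X @ w_true + 0.5 * rng.normal(size=n))
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

W_all = rng.normal(size=(d, 400))      # fixed, untrained weights
results = {}
for M in (10, 25, 50, 100, 200, 400):
    H_tr = random_features(X_tr, W_all[:, :M])
    H_te = random_features(X_te, W_all[:, :M])
    beta = fit_min_norm(H_tr, y_tr)    # only the readout is trained
    train_mse = np.mean((H_tr @ beta - y_tr) ** 2)
    test_acc = np.mean(np.sign(H_te @ beta) == y_te)
    results[M] = (train_mse, test_acc)
    print(f"M={M:4d}  M/N_train={M / n_train:4.2f}  "
          f"train MSE={train_mse:.2e}  test acc={test_acc:.3f}")
```

For $M$ well above $N_{train}$ the training error drops to numerical zero, which is the interpolating regime described above; the test accuracy column shows how generalization behaves on either side of the threshold.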

We experimentally implement photonic NLP and investigate the double descent effect on our large-scale 3D-PELM. Specifically, we analyze the classification accuracy for the IMDb task (Appendix D) as a function of the number of features $M$ and the training set size $N_{train}$. In Fig. 2(A) we report the observation of the double descent. We observe a dip in test accuracy as the number of channels $M$ reaches the number of training points $N_{train}$. Beyond this resonance (interpolation threshold), we find the over-parameterized region in which maximum accuracy on the training set is achieved. The behavior is obtained via training on $N_{train} \simeq 1.2 \times 10^{4}$ examples and using a fixed train/test split ratio of 0.67/0.33. In this case, the largest classification accuracy is found in the under-parameterized region (0.77). Conversely, in Fig. 2(B), we consider a much smaller number of training points, $N_{train} \simeq 10^{3}$. Remarkably, we reach the same optimal accuracy despite using only a fraction of the available training points. This observation reveals a particularly favorable learning region of the 3D-PELM in which only a reduced number of training points are required to reach high accuracy levels. From the operational standpoint, it is much more efficient to measure in parallel a large number of modes rather than sequentially processing many training examples. In Fig. 2(C) we report the full dynamics of the double descent phenomenon by continuously varying both $M$ and $N_{train}$. We observe that the accuracy dip shifts with a constant velocity. This result shows the existence of an optimal learning region that is accessible on the 3D-PELM thanks to its large number of photonic nodes.

Figure 2. Photonic sentiment analysis. (A), (B) Training and test accuracy of the 3D-PELM on the IMDb dataset as a function of the number of output channels. The shaded area corresponds to the over-parameterized region. The configuration in (B) allows us to reach very high accuracy in the over-parameterized region with a dataset limited to $N_{train} = 1186$ training points. In (A), the same accuracy is reached in the under-parameterized region with $N_{train} = 12{,}278$. Black horizontal lines correspond to the maximum test accuracy achieved (0.77). (C) IMDb classification accuracy by varying the number of features $M$ and training dataset size $N_{train}$. The boundary between the under- and over-parameterized regions (interpolation threshold), $N_{train} = M$, is characterized by a sharp accuracy drop (cyan contour line).

D. Sentiment Analysis at Ultralarge Scale

The observed operational advantage of the over-parameterized region indicates that, by further increasing the number of output modes, one can increase training effectiveness and/or performances [3]. In our 3D-PELM we reached $M = 120{,}000$ readout channels that are independent of each other. We also considered larger input spaces, extending the vocabulary to include the whole set of words in the corpus ($V \simeq 2^{17}$). Figure 3(A) shows that the test accuracy reaches a plateau as $M$ increases in the over-parameterized region. Saturation indicates that all the essential information encoded within the optical field has been extracted through the available channels. However, we observe a change in the performance as we employ more input features $L$ [Figs. 3(B) and 3(C)]. To estimate the rate at which performance improves as more channels are used for training, we estimate the angular coefficient $m$ of the linear growth that precedes the plateau. Although the onset of saturation can be estimated using diverse criteria, and $m$ varies depending on the measured range, its trend when increasing $L$ remains unaltered. We note a relevant enhancement of $m$ for $L = 362 \times 362$, which indicates that PELMs featuring larger input spaces are able to reach optimal performances with a lower number of parameters. Figure 3(D) reports the accuracy as a function of the training-test split ratio (keeping fixed $N_{train} = 6.7 \times 10^{3}$) for $M = 0.8 \times 10^{5}$ and $M = 1.2 \times 10^{5}$. Importantly, a limited number of examples ($\sim 20\%$) are enough to reach maximum accuracy in the over-parameterized region.
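
The estimate of the angular coefficient $m$ amounts to a first-order polynomial fit restricted to the points that precede the plateau. The sketch below shows one way to do this with `numpy.polyfit`; the accuracy curve is synthetic and invented purely to exercise the code (it is not measured data), and the plateau onset is passed in by hand, reflecting the remark above that its choice is not unique.

```python
import numpy as np

def slope_before_plateau(M_values, accuracy, plateau_onset):
    """Fit accuracy = m * M + q on the points with M <= plateau_onset."""
    M_values = np.asarray(M_values, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    mask = M_values <= plateau_onset
    m, q = np.polyfit(M_values[mask], accuracy[mask], deg=1)
    return m, q

# Synthetic, illustrative curve: linear growth followed by saturation.
M_values = np.linspace(1e4, 1.2e5, 12)
accuracy = np.minimum(0.60 + 2e-6 * M_values, 0.74)

m, q = slope_before_plateau(M_values, accuracy, plateau_onset=6e4)
print(f"m = {m:.2e} per channel, intercept = {q:.3f}")
```

Changing `plateau_onset` changes the fitted range and therefore the value of $m$, which is why only the trend of $m$ across input sizes is compared.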

Figure 3. Performances at ultralarge scale. (A)–(C) Test accuracy as a function of $M$ for different input sizes $L$. In all cases, the 3D-PELM performance saturates in the over-parameterized region, reaching a plateau. A linear fit of the data preceding the plateau shows that the onset of the saturation is faster for datasets with a larger input space. The corresponding angular coefficient $m$ is inset in each panel. (D) Test accuracy varying the training set size for $M = 0.8 \times 10^{5}$ and $M = 1.2 \times 10^{5}$.

E. Optical Network Capacity

To establish a comparison among the various photonic neuromorphic computing devices, we introduce the optical network capacity, a quantity related to the over-parameterization context. We define the capacity $C$ of a generic optical neural network as the product of the numbers of input and output nodes that can be processed in a single iteration, $C = L \times M$. This quantity gives direct information on the kind of large-scale problems that can be implemented on the optical setup. It depends only on the number of controllable nodes, which is the quantity that is believed to play the main role in big data processing [3]. Specifically, $L$ sets the size of the dataset that can be encoded, while $M$ reflects the embedding capacity of the network, as larger datasets necessitate larger feature spaces to be learned. Moreover, $C$ also furnishes an indication of how far the over-parameterized regime is for a given task. Useful over-parameterized conditions can be reached only if $C \gg N \times L$, i.e., $M \gg N$. We remark that the capacity is not a measure of the processor accuracy. It is instead a useful quantity to compare the scalability of different photonic computing devices.

In Table 1 we report the optical network capacity for various photonic processors that have been recently demonstrated. We focus on photonic platforms that exploit RC and ELM paradigms, since these devices are suitable for big data processing and may present the so-called photonic advantage at a large scale [51]. Our 3D-PELM has a record capacity $C = 362 \times 362 \times 120{,}000 \simeq 1.5 \times 10^{10}$, more than one order of magnitude larger than any other optical processor that has been demonstrated on supervised learning. Moreover, while increasing the capacity is challenging on many optical reservoir computers, our 3D-PELM can be scaled up further by enlarging the measured volume of the optical field. For example, the optical processor in Ref. [52], which is exploited for randomized numerical linear algebra, suggests that our scheme can reach a larger capacity by exploiting large-area cameras and multiple optical scattering.

Table 1. Maximum Network Capacity of Current Photonic Neuromorphic Computing Hardware for Supervised Learning

Working Principle	$M$	$L$	$C$	Machine Learning Task	Ref.
Time-multiplexed cavity	1400	7129	$10^{7}$	Regression	[39]
Amplitude modulation	16,384	2000	$10^{8}$	Human action recognition	[27]
Frequency multiplexing	200	640	$10^{5}$	Time series recovery	[41]
Optical multiple scattering	50,000	64	$10^{6}$	Chaotic series prediction	[38]
Amplitude Fourier filtering	1024	43,263	$10^{7}$	Image classification	[30]
Multimode fiber	240	240	$10^{5}$	Classification, regression	[35]
Free-space propagation	6400	784	$10^{6}$	Classification, regression	[34]
3D optical field	120,000	131,044	$10^{10}$	Natural language processing	3D-PELM
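
The capacity $C = L \times M$ can be recomputed directly from the $M$ and $L$ columns of Table 1. The short sketch below does so; since the table reports only orders of magnitude, the printed products should be expected to agree with the $C$ column only roughly. The helper names are hypothetical, and the ratio $M / N_{train}$ in the last lines uses the values $M = 120{,}000$ and $N_{train} \simeq 1.2 \times 10^{4}$ quoted in the text.

```python
def capacity(L, M):
    """Optical network capacity: input nodes times output nodes."""
    return L * M

# (working principle, M, L) as listed in Table 1.
rows = [
    ("Time-multiplexed cavity", 1400, 7129),
    ("Amplitude modulation", 16384, 2000),
    ("Frequency multiplexing", 200, 640),
    ("Optical multiple scattering", 50000, 64),
    ("Amplitude Fourier filtering", 1024, 43263),
    ("Multimode fiber", 240, 240),
    ("Free-space propagation", 6400, 784),
    ("3D optical field", 120000, 362 * 362),
]

for name, M, L in rows:
    print(f"{name:30s} C = {capacity(L, M):.2e}")

# Over-parameterization requires C >> N * L, i.e. M >> N.
M_3d, N_train = 120000, 12000
print(f"M / N_train = {M_3d / N_train:.1f}")
```

Because $L$ cancels in the condition $C \gg N \times L$, the last line compares only the number of readout channels with the number of training points.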

F. Comparison with Digital Machine Learning

To further validate the capability of our photonic setup for text processing, we compare the accuracy of our 3D-PELM with various general-purpose digital neural networks on the IMDb task. To underline the overall impact of over-parameterization on the performance, we consider two opposite parameterization regimes, $M = 1 \times 10^{3}$ and $M = 4 \times 10^{4}$, and two distinct conditions for the dataset size, $N_{train} = 6700$ [Fig. 4(A)] and $N_{train} = 1500$ [Fig. 4(B)]. Since the learning principle of the 3D-PELM is based on kernel methods (see Ref. [34]), we implement a support vector machine (SVM) and a ridge regression based on nonlinear random projections (RP) [25] as representative models of the kernel algorithms class. The input for the SVM and the RP is the tf-idf–encoded dataset, but no significant differences are found using the Hadamard dataset. We also simulate the 3D-PELM, i.e., we evaluate our photonic scheme by using Eq. (2) to generate the output data. A convolutional neural network (CNN) with the same number of trainable weights is used as an additional benchmark model. Details on the various digital models are reported in Appendix E. The results for a split ratio $N_{train} / N_{test} = 0.67 / 0.33$ are shown in Fig. 4. Overall, we observe that the photonic device performs sentiment analysis on the IMDb with an accuracy comparable with standard digital methods. The SVM sets the maximum expected accuracy (0.86). The device over-parameterized accuracy (0.75 for $N_{train} = 6700$) is mainly limited by the finite precision of the optical components and by noise, and it could be enhanced up to 0.83 (3D-PELM numerics) by using a 64-bit camera. In fact, when operating with 8-bit precision, both the simulated 3D-PELM and the RP models achieve performance that agrees well with the experiments. Interestingly, Fig. 4(B) indicates that, for a limited set of training samples, the 3D-PELM surpasses the SVM and also the CNN. This points out an additional advantage that the photonic setup achieves thanks to over-parameterization. Not only is accuracy improved with respect to the standard under-parameterized regime, but the device also operates effectively in conditions where digital models are less accurate.
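
The effect of finite detector precision can be emulated in a simulation by quantizing the computed intensities before training the readout. The following sketch shows a uniform quantizer of this kind; it is an illustrative assumption about how an 8-bit camera could be modeled, not the procedure described in Appendix E, and the names are hypothetical.

```python
import numpy as np

def quantize_intensity(H, bits=8, full_scale=None):
    """Uniformly quantize non-negative intensities to 2**bits levels.

    full_scale: intensity mapped to the top level; defaults to H.max().
    Values above full_scale are clipped, as in a saturating sensor.
    """
    H = np.asarray(H, dtype=float)
    if full_scale is None:
        full_scale = H.max()
    if full_scale <= 0:
        return np.zeros_like(H)
    levels = 2 ** bits - 1
    normalized = np.clip(H / full_scale, 0.0, 1.0)
    return np.round(normalized * levels) / levels * full_scale

rng = np.random.default_rng(0)
H = rng.exponential(size=(4, 1000))   # toy speckle-like intensity statistics
H8 = quantize_intensity(H, bits=8)
print(len(np.unique(H8)) <= 256)
print(np.max(np.abs(H8 - H)) <= H.max() / (2 * 255) + 1e-12)
```

Feeding the quantized features to the same readout training as the unquantized ones gives a direct way to separate the contribution of detector resolution from that of other noise sources.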

Figure 4. Analysis of the IMDb accuracy. (A), (B) The comparison reports the accuracy for the experimental device (3D-PELM device), the simulated device (3D-PELM numerics), the random projection method with ridge regression (RP), the support vector machine (SVM), and a convolutional neural network (CNN) in both the under-parameterized ($M = 1 \times 10^{3}$) and over-parameterized ($M = 4 \times 10^{4}$) regimes, for (A) $N_{train} = 6700$ and (B) $N_{train} = 1500$. 8-bit numerical results, when applicable, refer to the over-parameterized regime.

3. DISCUSSION

We have reported the first, to our knowledge, photonic implementation of natural language processing by performing sentiment analysis on the IMDb dataset. Our results demonstrate the feasibility of modern large-scale computing tasks in photonic hardware. They can be potentially improved both in terms of performance, via low-noise high-resolution cameras and more complex encoding/training algorithms, and in terms of applications. Further developments include the implementation of advanced tasks such as multiclass sentiment analysis, which allows discriminating positive and negative sentences into different levels. Moreover, using different datasets such as the Stanford Sentiment Treebank (SST-5), we could train our 3D-PELM for providing ratings. On the other hand, our versatile optical setting may allow implementation of alternative strategies for photonic NLP, including dimensionality reduction of the input data via optical random projection [53]. An optical scheme that preserves the sequential nature of language may also be conceived by introducing recurrent mechanisms in our 3D-PELM. Another interesting direction is developing specific encoding schemes for representing text in the optical domain. Viable approaches include adapting bidimensional text representations such as quick response (QR) codes. Furthermore, by following the concept of word embedding used in digital NLP [54], photonic hardware may be used to learn a purely optical word embedding, including semantics, word ordering, and syntax.

In conclusion, modern machine learning architectures rely on over-parameterization to reach the global optimum of objective functions. In photonic computing, achieving over-parameterization for large-scale problems is challenging. Here, we realize a novel photonic computing device that, thanks to its huge network capacity, naturally operates in the over-parameterized region. The 3D-PELM is a very promising solution for developing a fully scalable photonic machine-learning platform. Our photonic processor enables the first link between NLP and optical computing hardware, opening a new vibrant research stream toward fast and energy-efficient artificial intelligence tools.

Acknowledgment

We thank I. MD Deen and F. Farrelly for technical support in the laboratory. D.P. and C.C. conceived the research. C.M.V. proposed application to NLP. D.P. developed the 3D photonic network. I.G. and D.P. carried out experimental measurements. C.M.V. and I.G. performed data analysis. D.P. and C.C. co-supervised the project. All authors contributed to the manuscript.

Category: Optical Devices

Received: Aug. 10, 2022

Accepted: Oct. 8, 2022

Published Online: Nov. 24, 2022

The Author Email: Davide Pierangeli (davide.pierangeli@roma1.infn.it)
