1. Introduction
Facial expression recognition is of great importance in the field of computer vision[1–5], since facial expressions are widely used to express emotions, exchange information, and convey psychological states. For example, it has been experimentally demonstrated that facial expressions account for as much as 55% of emotional expression[6]. Along with the rapid development of artificial intelligence technology[7–10], the applications of facial expression recognition have been extended to human-computer interaction, sentiment analysis, education, healthcare, security, and virtual reality[11–14]. However, achieving real-time, high-accuracy facial expression recognition generally requires an expensive imaging system[15,16] (to record high-quality images) and a powerful computer system (to process a huge amount of data). For instance, recording a high-quality facial-expression picture at a long distance generally requires an expensive large-aperture imaging system. It therefore remains a great challenge to realize efficient facial expression recognition in a cost-effective way.
A typical facial expression recognition process extracts representative features from an image, enabling accurate identification of different expressions. Here, feature extraction determines the ability of the neural network to capture the useful expression-related information in an image. Previous works mainly focus on spatial-domain recognition, in which the features are analyzed according to the geometry and gradients of the images[17]. Recently, frequency-domain recognition has attracted increasing research interest[18,19]. In the frequency domain[20–24], different frequency components reflect spatial variations of different scales in the image. For instance, the low-frequency components correspond to large-scale structures such as facial contours, while the high-frequency components are related to small-scale details such as facial features and textures[25–27]. After converting an image into the spectral domain by the fast Fourier transform (FFT), the information of different spatial scales is distributed in different locations of the frequency spectrum, which makes it feasible to identify and modulate the large-scale and small-scale structures in the image. Recent research has demonstrated multiple functions of frequency-domain information[28,29]. However, quantitatively analyzing the contributions of various frequency components in facial expression recognition remains unexplored.
In this work, we investigate the impact of frequency-domain filtering on spatial-domain facial expression recognition in a quantitative way. The facial expression images in the spatial domain are first converted to the frequency domain through the FFT. After performing frequency filtering using a series of filters of different ranges, we create image datasets containing the structural details of different spatial scales by the inverse fast Fourier transform (IFFT). Subsequently, we use these images as input data to train a convolutional neural network for facial expression recognition and analyze the impact of different frequency components on facial expression recognition in the spatial domain. The results based on the Fer2013 dataset show that filtering out 82.64% of the high-frequency components results in only a 3.85% decrease in recognition accuracy. Clearly, the low-frequency components of an image are the key factors determining the accuracy of facial expression recognition. Since the high-frequency components of an image are not essential, one can considerably reduce the information capacity of an image (especially small-scale detailed structures) in facial expression recognition, thereby improving the effectiveness and efficiency. This allows us to utilize an economical imaging system with a small numerical aperture for real-time recognition at a long distance.
2. Methods
In the experiment, we use the facial expression dataset Fer2013, which was captured in real natural scenes. Seven basic emotions (i.e., surprise, fear, disgust, happy, sad, angry, and neutral) are included. It comprises 35,887 different facial expression images, of which 28,709 images form the training set and 3,589 images each form the validation and test sets. Each image has a size of 48 × 48 pixels. The composition of the training, validation, and test sets is illustrated in Fig. 1. In the frequency domain, different frequency components of an image correspond to structural variations of different scales in the spatial domain. High-frequency components generally correspond to small details, including textures and edges, reflecting rapid changes in the image. In contrast, low-frequency components are related to slow variations, such as the overall structure and the background, associated with global changes of the image. The frequency spectrum of an image exhibits a symmetric distribution. In this work, the zero-frequency component is set at the center of the spectrum. The low-frequency components surround the center, while the high-frequency ones are distributed in the edge area of the spectrum. Figure 2 shows typical images of the seven categories of facial expressions. We use the FFT to transform a spatial-domain image into a frequency-domain one, both having 48 × 48 pixels.
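The centered spectrum arrangement described above can be reproduced with NumPy's FFT utilities; the 48 × 48 input below is synthetic stand-in data, not a Fer2013 image:

```python
import numpy as np

def centered_spectrum(image: np.ndarray) -> np.ndarray:
    """2D FFT with the zero-frequency component shifted to the spectrum center."""
    return np.fft.fftshift(np.fft.fft2(image))

# Synthetic 48 x 48 "image": a constant gray level with weak noise.
rng = np.random.default_rng(0)
img = 0.5 + 0.01 * rng.standard_normal((48, 48))

spec = centered_spectrum(img)
# For a nearly constant image, the dominant (zero-frequency) component
# sits at the center of the 48 x 48 spectrum, i.e., index (24, 24).
peak = np.unravel_index(np.argmax(np.abs(spec)), spec.shape)
```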

Figure 1. Composition of the Fer2013 dataset.

Figure 2. Seven facial expression categories and their corresponding spectra. Here, the numbers 0–6 represent angry, disgust, fear, happy, sad, surprise, and neutral, respectively.
By utilizing the symmetric distribution of the frequency spectrum, we perform frequency filtering with a series of masks in the frequency domain and then obtain understandable spatial-domain images through the IFFT. To investigate the different impacts of high-frequency and low-frequency components on recognition accuracy, we design a two-part experiment with high-pass filtering and low-pass filtering. In the low-pass experiment, we use a series of square frames as the masks, which keep a central low-frequency range of (48 − 2N) × (48 − 2N) pixels. Here, N defines the width of the frame masks, which increases from 1 to 20 with a step of 1. In the high-pass experiment, we use square masks of N × N pixels, which filter out the central low-frequency components of the spectrum. Figure 3 shows a set of example images, including the original image, the reconstructed images after low-pass and high-pass filtering in the frequency domain, as well as their corresponding frequency spectra.
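A minimal NumPy sketch of the two filtering schemes: a frame-shaped low-pass mask of width N that keeps the central low-frequency block, and an N × N central block mask for the high-pass case. The 48 × 48 size matches Fer2013; the frame-width convention is inferred from the reported 82.64% figure at N = 14:

```python
import numpy as np

SIZE = 48  # Fer2013 images (and spectra) are 48 x 48 pixels

def lowpass_mask(n: int, size: int = SIZE) -> np.ndarray:
    """Square-frame mask: zero out a border frame of width n and keep the
    central (size - 2n) x (size - 2n) low-frequency region."""
    mask = np.zeros((size, size))
    mask[n:size - n, n:size - n] = 1.0
    return mask

def highpass_mask(n: int, size: int = SIZE) -> np.ndarray:
    """Block the central n x n low-frequency region and keep the rest."""
    mask = np.ones((size, size))
    lo = (size - n) // 2
    mask[lo:lo + n, lo:lo + n] = 0.0
    return mask

def filter_image(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """FFT -> shift -> mask -> inverse shift -> IFFT, returning a real image."""
    spec = np.fft.fftshift(np.fft.fft2(image))
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
```

A filtered dataset is then produced by applying `filter_image` to every image for each mask size.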

Figure 3. Examples of (a) low-pass and (b) high-pass filtering.
We build a deep convolutional neural network model based on Keras by utilizing the VGG architecture[30]. The VGG model has been extensively applied in computer vision tasks such as image classification and object detection. Figure 4 presents the scheme of our network model. The numbers of filters in the convolutional layer stacks are 64, 128, 256, and 512, respectively. The kernel size is 3 × 3. In order to enhance the stability and convergence speed of the network, we add a batch normalization layer after each convolutional layer. The activation function is ReLU. Max-pooling is performed over a 2 × 2 pixel window with stride 2 for the first four layers and stride 1 in the last layer. The fully connected layers use the ReLU activation function and perform regularization with a dropout rate of 0.5.
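The model described above can be sketched with Keras. The filter counts, batch normalization, ReLU activations, pooling strides, and dropout rate follow the text; the number of convolutions per stack (two, in the VGG style) and the width of the hidden dense layer (512) are our assumptions, since the text does not specify them:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(48, 48, 1), num_classes=7):
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    # Four convolutional stacks with 64, 128, 256, and 512 filters (3 x 3 kernels);
    # each convolution is followed by batch normalization and ReLU.
    for filters in (64, 128, 256, 512):
        for _ in range(2):  # two convolutions per stack (assumed, VGG style)
            model.add(layers.Conv2D(filters, 3, padding="same"))
            model.add(layers.BatchNormalization())
            model.add(layers.Activation("relu"))
        model.add(layers.MaxPooling2D(pool_size=2, strides=2))
    # Final max-pooling with stride 1, as described in the text.
    model.add(layers.MaxPooling2D(pool_size=2, strides=1))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))  # hidden width assumed
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```

Compiling with the optimizer settings from the training section would read `model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), loss="sparse_categorical_crossentropy", metrics=["accuracy"])`.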

Figure 4. Architecture of the network model.
We utilize the stochastic gradient descent (SGD) optimizer, applied with an initial learning rate of 0.01, a momentum of 0.9, and weight decay. First, the model is trained on the original Fer2013 dataset, achieving a recognition accuracy of 66.65%. Then, we perform frequency filtering according to the steps in Fig. 3 and create a series of datasets with different frequency ranges. The images in each dataset are fed into the neural network for training and testing. Finally, we obtain the recognition accuracies of 20 low-pass experiments and 20 high-pass experiments for comparison.
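The SGD update rule used above (momentum plus weight decay) can be written out explicitly. This is a generic NumPy illustration of the optimizer's update, not the authors' training code, and the toy quadratic objective is our own choice:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0):
    """One SGD step with momentum and optional L2 weight decay:
    v <- momentum * v - lr * (grad + weight_decay * w);  w <- w + v.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Toy check: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, w, v)
```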
3. Experimental Results
Figure 5 shows the results of 20 low-pass filtering experiments. We use square frames as the masks, which keep the central (48 − 2N) × (48 − 2N) pixels of the spectrum. As N increases from 1 to 20 [Fig. 5(a)], the recognition accuracy decreases at an increasing speed. In Figs. 5(b)–5(e), we show four typical confusion matrices for N = 5, 12, 15, and 18, respectively. The measured points in Fig. 5(a) are well fitted by an exponential relationship A = a exp(bN) + c. Here, A represents the accuracy, N represents the frame width, and a, b, and c are fitting parameters. To find the critical value of N (i.e., the boundary between slow and fast declines in recognition accuracy), we define a normalization coefficient from the accuracies at N = 0 and N = 20, which are 66.65% and 51.88%, respectively, and use it to locate the critical point. Correspondingly, the critical value of N is calculated to be 14. The results indicate that the accuracy of facial expression recognition in the spatial domain maintains a high level even when only the central 20 × 20 pixels of the 48 × 48 frequency spectrum are kept. Alternatively, the reduction of high-frequency components (82.64% of the total spectrum) leads to an accuracy decrease of only 3.85% [calculated by 66.65% (at N = 0) − 62.80% (at N = 14)] in recognizing the images in the Fer2013 dataset. This helps reduce the dependence of facial expression recognition on expensive optical instruments. For example, the spatial-frequency range that can be collected by an imaging system is proportional to its aperture. Our results show that one can filter out about 80% of the high-frequency components without severely reducing the accuracy (for the Fer2013 dataset). This indicates that one can use an economical imaging system with an aperture size only about 1/5 of the original one and still obtain reasonable accuracies.
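The 82.64% figure can be checked directly from the mask geometry: a frame of width N = 14 leaves a central 20 × 20 block of the 48 × 48 spectrum, so the removed fraction is 1 − 20²/48². A quick sketch (the frame-width convention is inferred from the reported numbers):

```python
SIZE = 48  # Fer2013 spectrum size

def removed_fraction(n: int, size: int = SIZE) -> float:
    """Fraction of the spectrum filtered out by a square frame of width n,
    which leaves a central (size - 2n) x (size - 2n) low-frequency block."""
    kept = (size - 2 * n) ** 2
    return 1.0 - kept / size**2

frac = removed_fraction(14)  # the critical frame width N = 14
```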

Figure 5. (a) Dependence of recognition accuracy on low-pass filtering. (b)–(e) Confusion matrices for N = 5, 12, 15, and 18, respectively.
Next, we present 20 high-pass filtering experiments in Fig. 6. Here, we use square masks of N × N pixels to block the central low-frequency components. As the value of N increases, the accuracy reduction in Fig. 6(a) exhibits a completely different profile compared to that in the low-pass experiment [Fig. 5(a)]. In the first step (i.e., N = 1), we observe a sharp decrease in recognition accuracy from 66.65% to 60.65%. After that, the accuracy decreases more slowly, satisfying a linear relationship A = aN + b, where a and b are fitting parameters. In Figs. 6(b)–6(e), we present the confusion matrices for N = 4, 8, 12, and 16, respectively. These high-pass experiments also verify that the low-frequency components have significant impacts on the recognition accuracy. Notably, certain feature information still exists in the remaining frequency components, leading to a nonzero accuracy even at N = 20 for both low-pass filtering and high-pass filtering. In addition, we have compared the results using the ResNet and ViT architectures. As shown in Fig. 7, the results present curves similar to those using the VGG. This also supports our conclusion that the accuracy of facial expression recognition mainly depends on low-frequency features.
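The linear trend after the initial drop can be fitted with an ordinary least-squares line. The accuracy values below are hypothetical placeholders (the paper's fitted parameter values are not reproduced here); only the fitting procedure is illustrated:

```python
import numpy as np

# Hypothetical (N, accuracy in %) pairs lying exactly on a line, for illustration.
n_vals = np.arange(1, 21)
acc = 61.0 - 0.4 * n_vals

# Ordinary least-squares fit of the linear model A = a * N + b.
a, b = np.polyfit(n_vals, acc, deg=1)
```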

Figure 6. (a) Dependence of recognition accuracy on high-pass filtering. (b)–(e) Confusion matrices for N = 4, 8, 12, and 16, respectively.

Figure 7. (a) and (b) show the dependence of recognition accuracy on low-pass and high-pass filtering, respectively, using ResNet. (c) and (d) present the dependence of recognition accuracy on low-pass and high-pass filtering, respectively, using ViT.
4. Conclusion
In this work, we utilize the frequency filtering method to investigate the contributions of high-frequency and low-frequency components to facial expression recognition in the spatial domain. The experimental results show that the low-frequency components have a more significant impact on the recognition accuracy than the high-frequency components. Specifically, in facial expression recognition using the Fer2013 dataset, filtering out 82.64% of the high-frequency components leads to a decrease of only 3.85% in accuracy. Such a substantial reduction in the frequency domain indicates that extremely high-quality images are not necessary to achieve high recognition accuracy. This helps reduce the data amount in the neural network as well as relieve the reliance on high-end imaging systems. For instance, an economical image acquisition system could work well at a long distance. To further enhance the recognition accuracy, possible routes include the improvement of the dataset, the optimization of the network architecture, and the introduction of physical losses and regularization[31,32]. Our work provides a potential solution for realizing real-time, long-distance, and low-cost facial expression recognition.