1 Introduction
Pain has been recognized internationally as the fifth vital sign because of the harm it causes patients [1]. Assessing pain plays a key role in real-time healthcare applications. Currently, there are three primary methods for evaluating pain intensity in medical institutions: 1) measuring bioelectrical signals (e.g., electroencephalogram, electro-oculography, and electrocardiogram) [2,3]; 2) self-report by patients; and 3) observation of patients’ behavioral responses by clinicians (e.g., facial expression, body movements, and sounds of moaning or crying) [4,5]. While physiology-based pain assessment is effective, it also has disadvantages, such as signal noise and discomfort caused by long-term monitoring. Self-report is considered the “gold standard” for pain assessment, so clinicians and healthcare organizations primarily rely on patients’ self-reports to gauge pain intensity. However, this approach is not always feasible for specific populations, such as infants and toddlers, or patients with postoperative coma or aphasia. Observational assessments require continuous and uninterrupted care from family members or medical staff, placing a heavy burden on caregivers and clinicians [6]. To address these challenges, it is crucial to develop more objective, alternative pain assessment methods. With advancements in image monitoring technology and artificial intelligence, behavior-based indicators can be continuously monitored using contactless recording devices (such as digital cameras and surveillance cameras). Therefore, a pain assessment system based on behavioral parameters has potential value in clinical applications.
As we know, movements of facial pain-related action units (AUs) convey most of the pain-related information compared with other behavioral indicators. With the remarkable advancements in machine learning and computer vision [7–9], studies have shown that pain can be detected from facial expressions [10], and considerable progress has been made in automatically recognizing pain from facial images. In earlier studies, local descriptors emerged as an effective method for capturing local feature attributes of images. Consequently, numerous researchers integrated local descriptors with various classifiers for pain recognition from facial images, such as the discrete cosine transform (DCT) [11], the local binary pattern (LBP) [12], and the histogram of oriented gradients (HOG) [13]. However, a significant shortcoming of these approaches is that they are extremely time-consuming, because they require careful selection and extraction of pain features. To overcome the inadequacies of previous methods, deep learning algorithms [14,15], such as the convolutional neural network (CNN) [16,17] and the recurrent CNN (RCNN) [18,19], have been applied to pain recognition in recent years, owing to their advantages such as automatic feature learning and strong representation ability. Some existing pain estimation methods also made use of specific global or local features of facial images. Both global and local expression features are vital for pain assessment: local features can capture subtle characteristics of facial pain, while global features help to extract contextual information and semantic features. Nevertheless, current research has scarcely attempted to merge local and global features for a comprehensive pain assessment.
To fill this gap, in this paper, we propose a CNN with a global and local attention mechanism (GLA-CNN) for facial pain assessment. When clinicians assess pain, they make judgments based on overall facial observation and careful examination of the distortions in AUs associated with pain, such as brow lowering, orbital tightening, levator contraction, and eye closure. Inspired by this clinical observation strategy, we build a combined global and local attention network. The feature map of the last convolutional layer of the backbone network is decomposed, and the decomposed patches are then input into the local attention network (LANet) to extract the patch features of interest. A local attention module is embedded to select representative local facial patches. Given that facial contortions occur during pain, it is important to note that the AUs of facial pain do not exist independently but are related to each other. Therefore, the global attention network (GANet) is used to extract associations between pain units. Similar to LANet, a global attention module is incorporated into GANet. Finally, the features extracted from both pathways are integrated to estimate pain intensity.
In summary, the main contributions of this paper are threefold:
1) We propose GLA-CNN, a CNN model with global and local attention mechanisms, to assess pain intensity. GLA-CNN obtains robust global and local features to achieve better performance in pain estimation.
2) Experimental visualization analyses show that LANet can extract subtle local pain features from the face with the help of the local attention module, and GANet can provide global complementary information between facial patches.
3) Experimental results demonstrate the advantage of the proposed GLA-CNN over other methods on the UNBC-McMaster Shoulder Pain database.
2 Related works
2.1 Methods towards pain intensity assessment from facial images
The facial action coding system (FACS) [20] maps the relationship between different facial muscle movements and various expressions through 44 independent AUs, demonstrating that changes in facial expression are useful clues for pain expression recognition. Prkachin and Solomon [21] proposed the Prkachin and Solomon pain intensity (PSPI) metric based on FACS. They identified four core AUs most strongly associated with pain: brow lowering, orbital tightening, levator contraction, and eye closure. The facial pain score is the sum of the scores of these four core AUs. The PSPI metric can identify 16 discrete pain levels on a frame-level basis, providing a theoretical basis for the early and timely assessment of the facial expression of pain. The publication of the UNBC-McMaster Shoulder Pain database [22] has further advanced research in this field.
Firstly, relevant research on local facial features for pain assessment is summarized. Over the past years, researchers have pushed forward automatic pain assessment with different methods. Yang et al. [23] indicated that local descriptors show a strong ability to capture discriminative features from pain patches. Encoding methods for local facial features include histograms, statistics, orthogonal transforms, and other forms, which have shown good performance in many tasks. Ahmed et al. [11] used the discrete cosine transform (DCT) for feature extraction from images, and then applied the nearest neighbor method for pain classification. Ashraf et al. [24] employed the active appearance model to analyze the facial shape of pain according to key points related to the face and pain. Hammal and Cohn [18] divided pain intensity into a four-level scale by extending the transferable belief model for automatic pain assessment. Kaltwang et al. [25] estimated continuous pain intensity on the 16-grade PSPI measure by combining shape features, LBP, and DCT.
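For readers unfamiliar with this classical pipeline, the sketch below illustrates how hand-crafted local descriptors can be paired with a simple classifier. It is a minimal illustration using scikit-image and scikit-learn, not the code of the cited works, and the descriptor parameters are arbitrary choices.

```python
# Minimal sketch of a classical local-descriptor pipeline (LBP + HOG features fed to a
# simple classifier), in the spirit of the methods cited above; not the cited authors' code.
import numpy as np
from skimage.feature import local_binary_pattern, hog
from sklearn.neighbors import KNeighborsClassifier

def describe_face(gray_face):
    """Concatenate a uniform-LBP histogram and a HOG descriptor for one grayscale face crop."""
    lbp = local_binary_pattern(gray_face, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    hog_vec = hog(gray_face, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2))
    return np.concatenate([lbp_hist, hog_vec])

# Illustrative usage (X_train/X_test: aligned grayscale face crops, y_train: pain labels):
# clf = KNeighborsClassifier(n_neighbors=1)
# clf.fit(np.stack([describe_face(f) for f in X_train]), y_train)
# y_pred = clf.predict(np.stack([describe_face(f) for f in X_test]))
```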
Next, relevant studies on pain assessment by extracting global features of facial images are introduced. Recently, features extracted by CNNs have been applied successfully to computer vision tasks. A CNN can acquire subtle information in image patches through a set of filters and construct global representations through successive layers with different activation and pooling functions. Rodriguez et al. [26] used the visual geometry group network 16 (VGG-16) [27] and the long short-term memory (LSTM) network [17] to extract facial features for pain assessment. Zhou et al. [19] implemented an RCNN as a prediction framework to predict pain intensity. Tavakolian et al. [28] proposed a 3D CNN for dynamic spatiotemporal representation of faces in videos.
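The sketch below illustrates the general frame-level CNN plus recurrent-model pattern used by several of these works (e.g., VGG-16 features followed by an LSTM, as in Ref. [26]); it is a schematic under assumed dimensions, not the cited authors' implementation.

```python
# Schematic of a frame-wise CNN encoder followed by an LSTM over a face sequence,
# analogous in spirit to the VGG-16 + LSTM pipeline of Ref. [26] (illustrative only).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class CnnLstmPain(nn.Module):
    def __init__(self, num_levels=4, hidden=256):
        super().__init__()
        self.encoder = vgg16(weights=None).features   # per-frame convolutional features
        self.pool = nn.AdaptiveAvgPool2d(1)            # 512-d vector per frame
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_levels)

    def forward(self, clips):                          # clips: (B, T, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.pool(self.encoder(clips.flatten(0, 1))).flatten(1)  # (B*T, 512)
        seq, _ = self.lstm(feats.view(b, t, -1))       # temporal modelling over the sequence
        return self.head(seq[:, -1])                   # pain score from the last time step
```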
From the above, the models proposed in existing works extracted either local or global features of facial pain images; few studies have combined local and global features to estimate pain intensity. In this paper, the GLA-CNN method is proposed to estimate pain intensity from facial images.
2.2 CNN with attention
Humans have the ability to quickly orient their attention toward salient objects in a chaotic visual scene; in other words, we can rapidly move our gaze toward objects of interest in the scene [29]. In recent years, the attention mechanism has been applied in many fields, such as visual question answering [30,31] and computer vision [32,33].
In Ref. [34], a CNN with an attention mechanism for facial expression recognition was proposed, which focuses on the most discriminative facial regions. Experiments on multiple facial expression datasets showed that this method greatly improves the accuracy of facial expression recognition [35–37]. Regarding feature map attention, the spatial transformer network (STN) [38] allows the network to learn automatically, realizing the spatial transformation of data with various deformations and automatically selecting features from the region of interest. The STN used in Ref. [16] helps to focus on face-related areas, thus improving experimental results.
From the above work, a clear trend is that the attention mechanism is widely used in computer vision tasks to direct deep networks toward important information. However, it has rarely been applied to pain estimation. Therefore, two types of attention modules are designed in this work: one is implemented in LANet to instruct the model to learn key pain patches, while the other is utilized in GANet to enable it to concentrate on pain-related areas.
3 Proposed model
In this section, the proposed method is described in detail. Firstly, the preprocessing method of facial images is introduced, and then GLA-CNN is illustrated, including the details of the LANet module and the GANet module.
3.1 Preprocessing of facial image
Research has shown that face alignment and masking contribute to the accuracy of facial expression recognition algorithms [19,26]. Therefore, a preprocessing pipeline as shown in Fig. 1 is employed to obtain the area of the input image. The preprocessing method consists of three steps, namely facial localization, facial alignment, and facial cropping.

Figure 1. Overall pre-processing pipeline: (a) original image, (b) facial localization, (c) facial alignment, and (d) facial cropping.
The multi-task cascaded convolutional network (MTCNN) [39] is employed to detect the face in the original image, as it has been trained on millions of face images and provides strong results. When MTCNN detects a face, it returns five landmark coordinates, including those of the left and right eyes. The angle between the eye coordinates is then calculated as
$ \theta =\tan^{-1}\dfrac{\mathrm{d}y}{\mathrm{d}x} $ (1)
where $ \theta $ represents the rotation angle of the face, and $ \mathrm{d}x $ and $ \mathrm{d}y $ are the horizontal and vertical differences between the coordinates of the left and right eyes, respectively. After obtaining the rotation angle, an affine transformation matrix is employed to achieve face alignment. Furthermore, once the face is localized by MTCNN, it is cropped out of the aligned image using the bounding box data. Finally, the cropped images are resized to 224×224×3 pixels.
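A minimal sketch of this alignment-and-crop step is given below, assuming the eye landmarks and bounding box have already been returned by MTCNN; the use of OpenCV for the affine transformation is illustrative.

```python
# Sketch of the alignment-and-crop step of Fig. 1, assuming the MTCNN detector has already
# returned a bounding box and the left/right eye coordinates (illustrative only).
import cv2
import numpy as np

def align_and_crop(image, box, left_eye, right_eye, out_size=224):
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    theta = np.degrees(np.arctan2(dy, dx))                 # Eq. (1), converted to degrees
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, theta, 1.0)      # affine rotation matrix
    aligned = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))
    x1, y1, x2, y2 = [int(v) for v in box]                 # MTCNN bounding box
    face = aligned[max(y1, 0):y2, max(x1, 0):x2]
    return cv2.resize(face, (out_size, out_size))          # 224x224x3 network input
```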
3.2 Framework overview
GLA-CNN is constructed to extract pain features from the face for assessing pain intensity, as shown in Fig. 2. VGG-16 is employed as the backbone of our proposed model and is combined with two neural network modules, namely GANet and LANet. LANet is responsible for extracting representative pain features from local facial patches, but it may neglect some associated information between patches. Therefore, GANet is designed to extract pain features from the global face region. By integrating the advantages of GANet and LANet, GLA-CNN can extract pain features from local patches as well as global context cues of facial images, which improves the performance of pain assessment.

Figure 2. Framework of the proposed GLA-CNN.
GLA-CNN takes facial images as input and uses VGG-16 as the backbone network to encode the image. For LANet, the last feature map of the backbone network is cut into 25 local patches through region decomposition, and the representative patches are extracted after being input into LANet. GANet extracts features associated with facial pain AUs. Finally, the features of LANet and GANet are combined to achieve graded pain assessment.
3.3 Local attention network
The proposed GLA-CNN pain assessment model is inspired by the clinical physician’s assessment strategy based on the patient’s facial expression. Generally speaking, physicians assess pain primarily by observing the details of facial AUs associated with pain; that is, pain intensity can be assessed from specific facial AUs. Therefore, LANet is designed to focus on representative patches of the facial expression of pain. LANet consists of two core steps: region decomposition and perception of pain features.
As shown in the overall framework in Fig. 2, the image is fed into VGG-16 and represented as feature maps of size 512×25×25. The feature map is uniformly decomposed into 25 local regions, each of size 512×5×5; note that the feature map is divided into 25 local patches without overlap.
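A minimal sketch of this non-overlapping decomposition, assuming a PyTorch tensor layout, is shown below.

```python
# Minimal sketch of the region-decomposition step: the 512x25x25 backbone feature map is
# split into a 5x5 grid of non-overlapping 512x5x5 patches (25 patches in total).
import torch

feat = torch.randn(1, 512, 25, 25)                       # last VGG-16 feature map (batch of 1)
patches = feat.unfold(2, 5, 5).unfold(3, 5, 5)            # (1, 512, 5, 5, 5, 5)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 25, 512, 5, 5)
print(patches.shape)                                      # torch.Size([1, 25, 512, 5, 5])
```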
An identity addition component is introduced along with a “skip connection” from the input patches, which helps avoid vanishing gradients during network training. To enable LANet to automatically perceive and focus mainly on the patches containing rich and subtle pain information, LANet contains a local attention module, whose detailed structure is illustrated in the top purple dashed rectangle in Fig. 2. The local attention module consists of two branches. The first branch encodes the input feature maps as vector-shaped local features. The second branch estimates a scalar weight that denotes the importance of each local patch. Specifically, after the 512×5×5 feature map is input into the local attention module, a 128×3×3 feature map is obtained after a max-pooling operation and two 3×3 convolution operations. After ReLU and BatchNorm2d operations, the weight value is calculated using the sigmoid function. Finally, the calculated weights are used to weight the local features. The mathematical expression of the local attention module is
$ {\mathbf{w}}_{i}=\sigma \left({\omega }_{i}\right)\times {\boldsymbol{\beta}}_{i} $ (2)
where the ith local patch uses $ \sigma \left({\omega }_{i}\right) $ to weight the local feature $ {\boldsymbol{\beta}}_{i} $ ($ \sigma $ denotes the sigmoid activation), and $ {\mathbf{w}}_{i} $ represents the output after the ith patch passes through the local attention module. Twenty-five weighted vectors are obtained, denoted as $ \{{\mathbf{w}}_{1}, {\mathbf{w}}_{2}, \cdots, {\mathbf{w}}_{25}\} $. $ {f}_{L}\left(\omega \right) $ describes the extracted features of the local pathway, as follows
$ {f}_{L}\left(\omega \right)={\displaystyle\cup }_{i=1}^{25}{\mathbf{w}}_{i} $ (3)
where ∪ represents the concatenation operation. $ {f}_{L}\left(\omega \right) $ is then mapped to 1024 dimensions through a fully connected layer.
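A hedged PyTorch sketch of the local attention module described above is given below. The layer widths, the pooling configuration that yields the 128×3×3 map, and the way the scalar weight and the vector feature $ {\boldsymbol{\beta}}_{i} $ are produced are not fully specified in the text, so those choices are assumptions.

```python
# Hedged sketch of the LANet local attention module (Eqs. (2)-(3)). Kernel/stride choices,
# the channel widths, and the heads producing the scalar weight and the vector feature
# beta_i are assumptions where the text does not state them.
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    def __init__(self, in_ch=512):
        super().__init__()
        # Second branch: estimate a scalar importance weight sigma(omega_i) per patch.
        self.weight_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1),        # 512x5x5 -> 512x3x3 (assumed)
            nn.Conv2d(in_ch, 256, 3, padding=1),
            nn.Conv2d(256, 128, 3, padding=1),            # -> 128x3x3 as described
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(128),
            nn.AdaptiveAvgPool2d(1),                      # reduction to a scalar (assumed)
            nn.Flatten(),
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )
        # First branch: encode the patch as a vector-shaped local feature beta_i (assumed: GAP).
        self.feature_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, patch):                             # patch: (B, 512, 5, 5)
        beta = self.feature_branch(patch)                 # (B, 512)
        w = self.weight_branch(patch)                     # (B, 1), in [0, 1]
        return w * beta                                   # Eq. (2): w_i = sigma(omega_i) * beta_i

# With `patches` of shape (B, 25, 512, 5, 5) from the decomposition above:
# module = LocalAttention()
# f_L = torch.cat([module(p) for p in patches.unbind(1)], dim=1)   # Eq. (3), then an FC to 1024-d
```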
3.4 Global attention network
There is an interaction between pain sensation and the autonomic nervous system [40]; in other words, when pain occurs, a connection exists between the neurons involved. The AUs associated with pain are controlled by specific neurons, so when pain is experienced, the pain-related facial AUs are triggered together and distort the facial expression. For example, when the eyes close, the eyebrows drop. This suggests that the facial AUs associated with pain do not occur in isolation. The local patches in LANet may therefore miss association information between facial pain units, and GANet aims to extract this association information between facial pain AUs.
As shown in Fig. 2, the 512×25×25 feature maps extracted from the last convolutional layer of the VGG-16 network are taken as the input of GANet. The feature maps are then fed into the proposed global attention module after a max-pooling operation.
To better extract the correlation information between facial pain units, this study proposes a global attention module in which important features in the feature map are enhanced while useless features are suppressed. The structure of the global attention module is illustrated in the green rectangle at the bottom of Fig. 2. Suppose the input to the global attention module is $ \mathbf{M}=[{m}_{1}, {m}_{2}, \cdots, {m}_{25}]\in {\mathbb{R}}^{512\times 12\times 12} $. The global average pooling (GAP) operation is first executed to obtain global information for each channel, and the global information obtained for the kth channel is given as
$ y_k=\dfrac{1}{h\times w}\displaystyle\sum_{i=1}^{h}\sum_{j=1}^{w}m_k(i,\ j) $ (4)
where h and w represent the height and width of the feature map, respectively. To enhance the correlation between channels, a 1×1 convolution operation is performed after GAP. Then, the sigmoid function maps the weight values to [0, 1], and the resulting weight vector is denoted as $ \mathbf{V}\in {\mathbb{R}}^{512\times 1\times 1} $. Meanwhile, the feature map $ \mathbf{M} $ is passed through a convolution layer, max-pooling, and ReLU, and the result is denoted as $ {\mathbf{M}}' \in {\mathbb{R}}^{512\times 12\times 12} $. After calculating the weight vector and the feature maps, the weighted feature map $ {\mathbf{M}}'' $ is obtained as
$ {\mathbf{M}}''=\mathbf{M}\oplus \left(\mathbf{V}\otimes {\mathbf{M}}'\right) $ (5)
where $ \oplus $ represents element-wise addition and $ \otimes $ represents element-wise multiplication.
After the global attention module, the weighted feature map $ {\mathbf{M}}'' $ with a size of 512×12×12 is obtained. The last fully connected layer of GANet is replaced with a GAP layer, because GAP achieves dimensionality reduction and helps prevent overfitting. The weighted feature map is therefore reduced to an appropriate dimension through GAP, and the resulting global feature is denoted as $ {f}_{G}\left(m\right) $.
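A hedged PyTorch sketch of the global attention module is shown below. The channel width of the 1×1 convolution and the exact kernel sizes on the $ \mathbf{M}\rightarrow{\mathbf{M}}' $ path are not specified in the text, so those values are assumptions.

```python
# Hedged sketch of the GANet global attention module (Eqs. (4)-(5)). The 1x1-convolution
# width and the kernel sizes of the M -> M' path are assumptions.
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, ch=512):
        super().__init__()
        # Channel weights V: GAP (Eq. (4)) -> 1x1 conv -> sigmoid, giving a 512x1x1 vector.
        self.channel_weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.Sigmoid(),
        )
        # M -> M': convolution, max-pooling, ReLU, keeping the 512x12x12 shape (assumed padding).
        self.transform = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, m):                        # m: (B, 512, 12, 12)
        v = self.channel_weight(m)               # (B, 512, 1, 1)
        m_prime = self.transform(m)              # (B, 512, 12, 12)
        m_out = m + v * m_prime                  # Eq. (5): M'' = M (+) (V (x) M')
        return m_out.mean(dim=(2, 3))            # final GAP gives the global feature f_G(m)
```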
Finally, the local feature $ {f}_{L}\left(\omega \right) $ and the global feature $ {f}_{G}\left(m\right) $ are concatenated and fed into a single fully connected layer. The softmax function classifies pain into four levels, namely “No pain”, “Weak pain”, “Mild pain”, and “Strong pain”.
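A minimal sketch of this fusion head is given below; the 512-dimensional global feature follows from GAP over 512 channels, and the remaining details are assumptions.

```python
# Sketch of the fusion head: the 1024-d local feature and the 512-d global feature are
# concatenated and passed through a single fully connected layer with softmax over the
# four pain levels (illustrative only).
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, local_dim=1024, global_dim=512, num_levels=4):
        super().__init__()
        self.fc = nn.Linear(local_dim + global_dim, num_levels)

    def forward(self, f_local, f_global):
        logits = self.fc(torch.cat([f_local, f_global], dim=1))
        # For training with nn.CrossEntropyLoss the raw logits are used; softmax here mirrors
        # the description of the four-level classification ("No pain" ... "Strong pain").
        return torch.softmax(logits, dim=1)
```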
3.5 Loss function
Since our goal is to grade pain based on static facial images, this task is inherently a classification problem. Therefore, a cross-entropy loss function is employed in this work. The mathematical expression of the cross-entropy loss is given as
$ {L}_{k}=-\dfrac{1}{N}\displaystyle\sum _{i=0}^{N-1}\log\dfrac{{e}^{{\mathbf{W}}_{{y}_{i}}^{\left(k\right)\mathrm{T}}{v}_{i}^{\left(k\right)}+{b}_{{y}_{i}}^{\left(k\right)}}}{\displaystyle\sum _{j=0}^{C-1}{e}^{{\mathbf{W}}_{j}^{\left(k\right)\mathrm{T}}{v}_{i}^{\left(k\right)}+{b}_{j}^{\left(k\right)}}} $ (6)
where $ N $ is the number of samples; $ C $ is the number of pain levels; $ {\mathbf{W}}^{\left(k\right)} $ is the weight matrix of the fully connected (FC) layer; $ {b}^{\left(k\right)} $ is the bias term of the FC layer; $ {v}_{i}^{\left(k\right)} $ is the FC input of the ith sample; and $ {y}_{i} $ is the label.
The proposed end-to-end GLA-CNN automatic pain assessment model is implemented in PyTorch. The stochastic gradient descent (SGD) optimizer is used to optimize the parameters of the model. The initial learning rate is set to $10^{-3}$, with the momentum and weight decay set to 0.9 and $10^{-4}$, respectively. The model is trained on a GPU server with an NVIDIA TITAN XP 12 GB GPU and two Intel Xeon E5-2673 v3 CPUs, and the batch size is set to 32.
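The training configuration above can be summarized in the following PyTorch sketch; `model`, `train_loader`, and the number of epochs are placeholders, since only the optimizer settings and the batch size are stated.

```python
# Training configuration as described above: SGD with lr 1e-3, momentum 0.9, weight decay
# 1e-4, batch size 32, and the cross-entropy loss of Eq. (6). `model`, `train_loader`, and
# `num_epochs` are placeholders (illustrative sketch).
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=20, device="cpu"):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()                 # expects raw class scores (logits)
    model.to(device).train()
    for _ in range(num_epochs):
        for images, labels in train_loader:           # DataLoader with batch_size=32
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
```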
4 Experimental results
4.1 Dataset
The UNBC-McMaster Shoulder Pain database is used to validate the proposed model. This database includes 200 video sequences from 25 subjects, with a total of 48398 images, and each frame is annotated with a PSPI score. Fig. 3 shows some images from this database and the corresponding PSPI scores.

Figure 3. Sample frames and their corresponding PSPI scores at the UNBC-McMaster Shoulder Pain database.
The database divides pain into 16 levels according to the PSPI score. A PSPI score of 0 stands for “No pain”, while a PSPI score of 15 indicates the most intense pain; in other words, pain becomes increasingly intense as the PSPI score increases. As shown in Fig. 4, the number of images without pain is far greater than the number of images with pain. In addition, there are large differences in the number of images at different pain levels. This imbalance presents a significant challenge for model training.

Figure 4. Number of pictures per PSPI code class at the UNBC-McMaster Shoulder Pain database.
To alleviate this imbalance, the method in Ref. [41] is applied to balance the number of samples per category: sequences that contain only no-pain (PSPI=0) frames are removed, and some no-pain frames at the beginning and end of the remaining sequences are also removed. Consequently, a total of 11290 images are selected. According to the PSPI score, pain is reclassified into four levels: “No pain” (PSPI=0), “Weak pain” (PSPI=1), “Mild pain” (PSPI=2), and “Strong pain” (PSPI≥3), as shown in Table 1.

Table 1. Number of images for the four pain levels at the UNBC-McMaster Shoulder Pain database according to the PSPI score.

PSPI score | Pain level | Number of images
0 | No pain | 2928
1 | Weak pain | 2909
2 | Mild pain | 2351
≥3 | Strong pain | 3102
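The reclassification in Table 1 can be expressed as a simple mapping from frame-level PSPI scores to the four class labels:

```python
# Mapping of frame-level PSPI scores to the four pain levels used in Table 1.
def pspi_to_level(pspi):
    if pspi == 0:
        return 0      # "No pain"
    if pspi == 1:
        return 1      # "Weak pain"
    if pspi == 2:
        return 2      # "Mild pain"
    return 3          # "Strong pain" (PSPI >= 3)
```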
4.2 Performance of the model
The proposed GLA-CNN model is validated for pain level classification under 5-fold cross-validation. Accuracy, precision, recall, and F1-score are employed to evaluate its performance. The baseline network (VGG-16), GANet, and LANet are trained and evaluated, and the results for each network are shown in Table 2. The baseline achieves an accuracy of 51.72%, whereas GANet and LANet reach 55.51% and 55.53%, improvements of 3.79% and 3.81% over the baseline, respectively. The improved performance of LANet validates its ability to learn discriminative and representative pain features from local patches, while the results of GANet indicate that incorporating correlation information between facial pain AUs is beneficial for estimating pain intensity. Notably, the accuracy of GLA-CNN is 56.45%, surpassing VGG-16 by 4.73%. This improvement demonstrates that GLA-CNN effectively integrates the strengths of GANet and LANet, making it the most effective model for distinguishing pain levels of different intensities. In addition, an extra experiment is conducted in an attempt to enhance contextual richness and semantic comprehension: the features extracted by GLA-CNN are fed into a BiLSTM network, designated GLA-CNN-BiLSTM, to leverage the combined strengths of both models. Unfortunately, GLA-CNN-BiLSTM does not outperform GLA-CNN. This could be attributed to the fact that GANet and LANet disrupt the dependency relationship among features during the feature extraction process, ultimately degrading the performance of GLA-CNN-BiLSTM.
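The per-fold metrics in Table 2 can be computed as in the following sketch; macro averaging over the four classes is an assumption, since the averaging scheme is not stated.

```python
# Sketch of the evaluation metrics reported in Table 2 (accuracy, precision, recall,
# F1-score), computed with scikit-learn for one cross-validation fold. Macro averaging
# over the four pain levels is an assumption.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def fold_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1-score for one fold."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return acc, prec, rec, f1
```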

Table 2. Comparison of the average performance of the proposed GLA-CNN with VGG-16 and the separate local and global modules under 5-fold cross-validation.

Model | Accuracy | F1-score | Recall | Precision
Baseline (VGG-16) | 51.72% | 36.65% | 34.55% | 44.36%
LANet | 55.53% | 35.38% | 33.78% | 42.92%
GANet | 55.51% | 37.22% | 33.91% | 43.03%
GLA-CNN | 56.45% | 36.52% | 34.08% | 43.23%
GLA-CNN-BiLSTM | 54.54% | 33.15% | 31.07% | 42.02%
Fig. 5 illustrates the confusion matrix of the proposed model, revealing its performance across different pain levels. The proposed model achieves higher accuracy for “No pain” and “Strong pain”, while it performs worse on “Weak pain” and “Mild pain”. This is attributed to the subtle differences between these pain levels: “Weak pain” and “Mild pain” are more easily confused with other levels because their facial expressions show an intermediate degree, and even clinicians have difficulty distinguishing them accurately.

Figure 5. Confusion matrix based on the proposed GLA-CNN model.
4.3 Comparison with other models
The comparison results of GLA-CNN with nine existing models are presented in Table 3. To maintain consistency and fairness, all models are retrained and tested on the same database, the best parameters are adopted, and the best performance is used for comparison. Following the methodology described in Ref. [16], the database is divided into different partitions: the training set consists of images from 10 subjects, the validation set includes images from 5 subjects, and the test set comprises images from the remaining 10 subjects. This partitioning allows us to evaluate model performance in terms of accuracy.

Table 3. Comparison of the proposed model with the state-of-the-art at the UNBC-McMaster Shoulder Pain database.

Ref. | Method | Level | Accuracy
[42] | Head analysis | 4 | 28.10%
[43] | LBP | 4 | 34.70%
[43] | LPQ | 4 | 23.40%
[43] | BSIF | 4 | 23.70%
[44] | CNN | 4 | 35.30%
[16] | CNN | 4 | 51.10%
[45] | MRN | 4 | 46.18%
[46] | MA-NET | 4 | 55.16%
[47] | Swin transformer | 4 | 54.40%
[48] | POSTER V2 | 4 | 52.24%
This work | GLA-CNN | 4 | 56.45%
As shown in Table 3, conventional methods such as those in Refs. [42] and [43] provide accuracy around the chance level (25% for four categories), indicating that traditional methods are weak at extracting facial pain features. Deep learning models, such as those in Refs. [16,44–48], largely improve the performance; among them, GLA-CNN shows the highest accuracy because it integrates the global and local features of the face.
4.4 Ablation analysis
To better understand the benefits of the attention mechanisms, we conduct quantitative evaluations of different network structures with and without the attention mechanisms.
We conduct a comparative analysis of various network structures, including VGG-16, LNet (LANet without local attention), and GNet (GANet without global attention), against LANet, GANet, and the proposed GLA-CNN. As shown in Table 4, the inclusion of the attention mechanisms improves the accuracy of both GANet and LANet by over 2%. Specifically, the local attention module in LANet enhances the focus on local patches associated with pain features, while the global attention module in GANet strengthens the utilization of global context information. Compared with GANet and LANet, GLA-CNN, which leverages both the globally weighted features and the locally attentive patches, demonstrates a superior ability to capture the information relevant to pain level classification.

Table 4. Comparison of the performance of different networks with/without attention mechanisms (GA and LA stand for global attention and local attention, respectively).

Model | GA | LA | Accuracy | F1-score | Recall | Precision
VGG-16 | × | × | 51.72% | 36.65% | 34.55% | 44.36%
GNet | × | × | 53.42% | 33.06% | 30.18% | 41.32%
GANet | √ | × | 55.51% | 37.22% | 33.91% | 43.03%
LNet | × | × | 53.28% | 35.76% | 33.07% | 43.77%
LANet | × | √ | 55.53% | 35.38% | 33.78% | 42.92%
GLA-CNN | √ | √ | 56.45% | 36.52% | 34.08% | 43.23%
4.5 Visualization
To explain the roles of the local, global, and attention modules in the network structure, the feature maps of VGG-16, GNet, GANet, LNet, LANet, and GLA-CNN are visualized using the method described in Ref. [49]. This method adopts gradient-weighted class activation mapping (Grad-CAM) to visualize a wide variety of CNN model families. Fig. 6 shows the attention maps of images with different pain levels (from “No pain” to “Strong pain”) on the test set under different models. As shown in Fig. 6, VGG-16 extracts only a few patches of facial features for pain estimation. LNet provides small local facial feature details over the whole face; compared with LNet, LANet finds more representative local discriminative patches, which is useful for pain estimation. GNet also provides large patches, similar to VGG-16, whereas GANet can effectively extract the associated features between facial pain AUs using the global attention module. GLA-CNN can not only accurately perceive facial AUs related to pain but also extract the association information between different AUs. Moreover, the visualization results indicate that pain features are predominantly distributed around the eyes, nose, and mouth, which is consistent with the core AUs in previous pain studies [14,19].
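The sketch below shows the general Grad-CAM recipe of Ref. [49] applied to a model's last convolutional layer; it is an illustrative re-implementation of the published technique, not the authors' visualization code.

```python
# Minimal Grad-CAM-style sketch (Ref. [49]) for inspecting which facial regions drive a
# prediction; hooks are placed on the last convolutional layer of the backbone
# (illustrative only, not the authors' code).
import torch

def grad_cam(model, target_layer, image, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]       # score of the chosen pain level
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # channel-wise gradient weights
    cam = torch.relu((weights * acts["a"]).sum(dim=1))    # weighted sum of activations
    return cam / (cam.max() + 1e-8)                       # normalized heat map
```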

Figure 6. Samples of attention maps from “No Pain” to “Strong Pain” by different models at the UNBC-McMaster Shoulder Pain database.
5 Conclusion
In the real clinical environment, pain assessment and management are an important part of pain control and a crucial reference in medical diagnosis. Effective pain assessment and management can alleviate patients’ fear of pain, avoid drug abuse, and offer numerous other benefits. For doctors, accurate pain assessment leads to a deeper understanding of patients’ pain conditions, enabling them to prescribe medications tailored to achieve optimal therapeutic outcomes. In this study, we propose an auxiliary method for clinical pain assessment that utilizes facial expression as a reliable indicator.
This study leverages contemporary artificial intelligence techniques to assess pain intensity from facial expression images. To this end, we introduce the GLA-CNN facial pain assessment model, which is grounded in the attention mechanism. GLA-CNN is an integration of GANet and LANet, designed to harness their respective strengths. Drawing inspiration from clinical physicians’ evaluation of patients’ facial expressions, LANet has been developed based on regional decomposition and incorporates a local attention module to identify facial regions typically associated with pain. Our experimental findings demonstrate that LANet surpasses previous methods that rely solely on extracting local features and combining them with a classifier [42,43]. Furthermore, an efficient GANet has been developed to extract pertinent information from pain patches. The experimental results further reveal that the performance of LANet and GANet in pain assessment is comparable, underscoring the significance of the information extracted by both pathways. Nevertheless, GLA-CNN has its limitations and requires a more streamlined design to better suit clinical applications; we will endeavor to make pertinent improvements in the future.
Experiments on the UNBC-McMaster Shoulder Pain database demonstrate that the GLA-CNN model outperforms other state-of-the-art methods. Visualization analyses show that GLA-CNN can not only extract discriminative local pain features but also extract the correlations between local features to retain global context cues. However, although our proposed model is superior to other methods, the overall accuracy of pain assessment from facial expressions remains unsatisfactory, for the following reasons: 1) The sample size of the dataset is quite small (only 25 subjects), which leads to low generalization ability. 2) There are only small differences between the facial pain features of “Weak pain” and “Mild pain”, and even clinicians have difficulty distinguishing them, so the accuracy of the ground truth is questionable; by contrast, the facial features of “No pain” and “Strong pain” are more obvious, so their accuracies are better than those of “Weak pain” and “Mild pain”. 3) The database consists only of facial expression information, whereas in clinical applications doctors always assess pain intensity with multidimensional information such as facial expression, voice, activities, and inquiry, which greatly improves accuracy. Despite this, the study of pain assessment based on facial expression is meaningful, particularly for infants and patients with aphasia or language inability. To make artificial intelligence better serve medical and clinical pain assessment, it is imperative to establish larger datasets with more diverse patients and improved generalization capabilities. Additionally, there is a pressing need to develop video-based pain assessment methods that integrate facial expression with multimodal information such as voice and activities, thereby enhancing the accuracy and reliability of pain assessments.
Institutional Review Board Statement
The study was approved by researchers at McMaster University and the University of Northern British Columbia. In addition, we signed the requisite agreement, ensuring that all methods employed in this research were conducted in accordance with relevant guidelines and regulations.
Informed Consent Statement
In this article, the facial images used in Figs. 1, 3, and 6 are part of the publicly available UNBC-McMaster Shoulder Pain Expression Archive Database. When presenting these images, the faces are partially mosaicked to safeguard the privacy and anonymity of the individuals.
Disclosures
The authors declare no conflicts of interest.