1 Introduction
Intelligent transportation systems (ITSs) have been a major research focus in recent years, as they play a key role in advancing smart cities. As an essential part of ITSs, traffic flow prediction provides data support for optimizing the allocation of urban traffic resources and can help city managers deal with various urban traffic situations more effectively. Moreover, it can ease the daily travel inconvenience for the public and has great significance for intelligent traffic management and road network planning.
Traffic flow prediction aims to forecast future traffic flow data from historical traffic flow data. Long-term prediction outputs the traffic flow for a series of consecutive time steps rather than for a single time step. For example, assuming that each time step spans five minutes, long-term urban traffic flow prediction uses the traffic flow data of the 12 time steps of the previous hour (or of even more hours) to predict the corresponding traffic flow data of the 12 time steps of the next hour. The main challenge to achieving accurate predictions is how to model the complex dynamic spatiotemporal dependency that exists in the traffic flow data. To solve this problem, researchers have developed spatiotemporal graph neural network (GNN) models, which have outperformed previous models in extracting spatiotemporal information and making predictions. However, some problems still need to be solved.
First, most models overlook the semantic spatial similarity between distant regions when mining the spatial dependency of urban traffic data. But in cities, the traffic conditions of different areas depend on their locations and the functions of the nearby buildings; i.e., traffic data recorded by two sensors that are far apart in distance may also show great similarity [1]. Specifically, most current approaches use graph convolutional networks (GCNs) [2–6] and fail to capture the spatial dependency in the traffic flow data. These models can only focus on the spatial relations between each sensor and its neighbors. By stacking multilayer networks, a GCN model can capture spatial information between sensors that are slightly farther away. However, it is still hard to learn semantic spatial relations at a longer distance due to the limitation of the adjacency matrix representation. Moreover, stacked multilayer networks are prone to over-smoothing, which leads to significant degradation of model performance. Therefore, Jiang et al. [1] proposed the concept of semantic spatial similarity. To differentiate it from traditional distance-based spatial similarity, they added two modules to their prediction model to exploit two kinds of spatial dependency based on distance and semantics, respectively. The model achieved more accurate results than previous models in the final prediction task.
Second, there is still room for improvement in the performance of existing methods in long-term predictions because they neglect the temporal dependency between predicted and unpredicted time steps. Most methods are based on the transformer model and only use a spatiotemporal encoder to analyze the input time series data and combine the data through several layers of fully connected layers to obtain the predicted time series at once [1,7–10]. Specifically, existing methods usually focus only on the correlation between the traffic flow at the time step to be predicted and the input traffic flow, ignoring the implicit temporal correlation with the already predicted traffic flow, which results in the model losing some temporal information. Moreover, these methods only use the encoder to mine the temporal dependency of the input sequence traffic data. To extract deep temporal correlations, several iterations are required to make the model fully learn the complex temporal correlations at each time step, which makes their models run less efficiently. Therefore, based on the existing spatiotemporal encoder, this paper proposes adding a spatiotemporal decoder to form an encoder-decoder structure. This structure can predict the traffic flow at each time step by using information of the previously predicted time steps and take advantage of the temporal dependency between them to improve the accuracy of long-term predictions.
Finally, recent models with a decoder structure can benefit from improvements in how to perform data embedding. Current approaches based on the encoder-decoder structure typically utilize a data embedding layer only in the encoder. However, for some complex datasets, embedding data only in the encoder may prevent the model from fully learning the contextual information implied by the time series data and the periodical and spatial dependency of the time step to be predicted. Adding data embedding processing to the decoder could further extract implicit information and enhance the modeling of periodical and spatial dependency in the decoder input. Thus, this deserves further investigation to improve a model’s prediction performance.
This paper proposes a novel multi-scale persistent spatiotemporal transformer (MSPSTT) for long-term urban traffic flow prediction to address the limitations of prior studies. The model adopts an encoder-decoder structure. The spatiotemporal decoder uses contextual information of the input data provided by the encoder and combines the information with the transformer-based persistent perception time module. Thus, the model enables each prediction of the next time step to be predicted by learning the information of all previous time steps and utilizing the feedback to update the already predicted time steps. Moreover, two spatial masking matrices based on geographic and semantic distance are introduced in both the encoder and decoder. The masking matrix of the geographic space module masks distant spatial information and retains proximal information, while the semantic space module does the opposite to help the module better distinguish between the effects of spatial dependency between distant and proximal distances. Furthermore, full data embedding processing is introduced to model the complex urban traffic flow characteristics better and assist the spatiotemporal encoder and decoder in dynamically extracting temporal, periodical, geospatial, and semantic spatial dependency in the traffic flow data. In summary, this paper makes the following main contributions.
1) An encoder-decoder architecture is employed to design a persistent perceptual temporal attention module. This module considers and models the temporal, geospatial, and semantic spatial dependency of the traffic flow data to provide a comprehensive embedding of the data.
2) The MSPSTT model is proposed, which can achieve persistent perception of temporal information in the prediction results to enhance the accuracy of long-term prediction. The model can also capture complex information such as temporal, geospatial, and semantic spatial information in urban traffic data.
3) Experiments are conducted on four real public traffic flow datasets, and the experimental results show that the proposed model outperforms state-of-the-art prediction methods. The impact of each module on the model performance is further evaluated through ablation experiments.
The structure of this paper is organized as follows: Section 2 reviews related works. Section 3 defines the problem of traffic flow prediction. Then, Section 4 presents the designed transformer-based multi-scale persistent perceptual spatiotemporal attention network model. Section 5 reports the results of the experimental evaluation. Finally, Section 6 draws a conclusion.
2 Related works
Traffic flow prediction is an important task for ITS, which has been studied for several years, and a number of achievements have been made in research and development. By analyzing time series characteristics of the traffic flow data, early researchers applied classical time series analysis methods based on statistical learning methods to create prediction models, such as using the autoregressive integrated moving average [11] and vector autoregressive model [12]. However, these methods are poorly adapted to nonlinear data and thus lead to poor predictions.
With the development of machine learning, approaches based on support vector regression (SVR) [13] and k-nearest neighbors (KNN) [14] were found to provide more accurate traffic flow predictions than previous methods. However, SVR relies on manually selected kernel functions and their parameters, and parameter tuning is time-consuming. On the other hand, KNN requires selecting an appropriate value of k and converges slowly. Thus, it is unsuitable for today’s complex traffic scenarios requiring real-time updates.
Deep learning provides new solutions to such problems and can better handle complex spatiotemporal features with high dimensionality and nonlinearity. For instance, convolutional neural networks (CNNs) can extract spatial dependency from grid traffic flow data using convolutional and pooling layers, while recurrent neural networks (RNNs) are often used to model temporal dependency in traffic flow and learn long-term dependency of data from time series. Zhang et al. [15] designed the deep spatiotemporal residual networks (ST-ResNet) architecture based on residual convolution units to extract spatiotemporal features from traffic data for urban pedestrian flow prediction. Yao et al. [16] integrated CNNs with long short-term memory (LSTM) networks, a type of RNN, to capture both temporal and spatial dependency. Although these techniques are widely used in the field of traffic flow prediction, actual traffic data is not based on planar or three-dimensional Euclidean space but on non-Euclidean spatially structured graph data, which cannot be effectively handled by traditional deep learning methods. For example, ST-ResNet [15] restricts the input to grid data, which cannot be used for traffic prediction problems on graph-structured urban road networks. Recently, GNNs have been shown to handle non-Euclidean traffic data and have proven effective in modeling the complex spatial dependency of the traffic data [17]. Wu et al. [18] divided GNNs broadly into four types, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatiotemporal GNNs. Since traffic prediction is a spatiotemporal problem, GNNs used in this field can all be classified as spatiotemporal GNNs. Li et al. [19] combined the graph structure with a convolutional network and proposed a diffusion convolution operation to capture spatial dependency. Yu et al. [20] proposed an architecture consisting of multiple spatiotemporal graph convolution blocks. The convolutional blocks consist of a combination of graph convolutional layers and convolutional sequence learning layers for modeling temporal and spatial dependency. Compared with traditional machine learning methods, various deep learning models are able to achieve better performance without relying on human intervention. However, despite their effectiveness for traffic flow prediction tasks, these models often face serious problems of error accumulation and data lag when performing long-term predictions, which still need further exploration and research.
The attention mechanism has proven very effective in modeling the dynamic dependency of the traffic flow data [10,21–23]. The transformer [24] network architecture, which is based on the attention mechanism and is thus able to access any part of a sequence regardless of its distance to the target [25–27], has also become one of the main models for traffic flow prediction and has broad applications. Jiang et al. [1] added a self-attention module to the transformer’s encoder to mine geospatial dependency and semantic spatial dependency and distinguish between spatial information that is near and far, but they ignored the temporal dependency between predicted and unpredicted time steps. Guo et al. [2] used the encoder-decoder structure of the transformer network and constructed a spatiotemporal self-attention module to capture the spatiotemporal dependency in traffic data, but without considering the semantic similarity between urban areas at long distances. Ye et al. [23] proposed a framework for integrating meta-learning into a transformer to capture spatiotemporal heterogeneity in traffic prediction tasks. These studies have demonstrated that transformer-based models can obtain excellent results in traffic flow prediction. Therefore, in this paper, we adopt a complete encoder-decoder architecture based on the transformer deep learning model, combined with a GNN model that has the ability to handle non-Euclidean traffic data, and design three modules to jointly explore the temporal, geographic spatial, and semantic spatial dependency in traffic data to attain more accurate traffic flow prediction. Owing to the complete encoder-decoder structure, the model can learn all the above information, including the predicted time steps, to predict the traffic flow data of the next time step and also update the traffic flow data of the predicted time steps. Compared with the existing models, MSPSTT achieves an improvement of up to 9.5% on three common metrics.
3 Problem Definition
The problem of traffic flow prediction is defined as follows [1].
Definition 1. (Transportation Network) A transportation network is represented as a directed or undirected graph $ G=(V{\rm{,}}\;E{\rm{,}}\;\mathbf{A}) $, where V is the set of $ N\;\left(=\left|V\right|\right) $ nodes, each representing a sensor on the roadway, such as a traffic detector or an observatory; E is the set of $ M\left(=\left|E\right|\right) $ edges, each representing pairs of sensors at both ends of a roadway that are adjacent to each other in the real traffic network, and $ \mathbf{A} $ is the adjacency matrix of the graph G, which can be calculated by E.
Definition 2. (Traffic Flow) The traffic flow $ {\mathrm{X}}_{t}\in {\mathbb{R}}^{N\times C} $ is defined as the values observed by the N nodes of the traffic network at time t. In general, urban traffic conditions are complex, with a variety of traffic data such as traffic flow, density, and speed. In this paper, we assume that C such quantities are observed at each node. The notation $ \mathbf{X}=\left({\mathrm{X}}_{{t}_{1}}{\rm{,}}\;{\mathrm{X}}_{{t}_{2}}{\rm{,}}\; \cdots {\rm{,}}\;{\mathrm{X}}_{{t}_{T}}\right)\in {\mathbb{R}}^{T\times N\times C} $ denotes the set of traffic flow matrices for all N nodes with C observations at T time steps.
Definition 3. (Traffic Flow Prediction) Given the spatial relationship of the N nodes described by the traffic network G and the set of historical traffic flow matrices of T time steps, the goal of traffic flow prediction is to find a suitable mapping f that exploits the spatiotemporal dependency to predict the set of traffic flow matrices for the following T' time steps:
$ \left[{\mathrm{X}}_{t-T+1}{\rm{,}}\;{\mathrm{X}}_{t-T+2}{\rm{,}}\; \cdots {\rm{,}}\;{\mathrm{X}}_{t}{\mathrm{;}}\;G\right]\stackrel{f}{\to }\left[{\mathrm{X}}_{t+1}{\rm{,}}\;{\mathrm{X}}_{t+2}{\rm{,}}\; \cdots {\rm{,}}\;{\mathrm{X}}_{\left(t+T'\right)}\right] {\mathrm{.}} $ (1)
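The mapping in Eq. (1) implies that training samples are pairs of consecutive windows: T input steps followed by T' target steps. The following is a minimal sketch of how such pairs could be cut from one long recording. The function name `make_windows` and the toy data are illustrative assumptions, not part of the proposed model.

```python
import numpy as np

def make_windows(series, T, T_prime):
    """Slice a (steps, N, C) array into (input, target) pairs.

    Each input covers T consecutive steps and each target covers the
    following T_prime steps, matching the mapping in Eq. (1).
    """
    inputs, targets = [], []
    last_start = series.shape[0] - T - T_prime
    for start in range(last_start + 1):
        inputs.append(series[start:start + T])
        targets.append(series[start + T:start + T + T_prime])
    return np.stack(inputs), np.stack(targets)

# Toy data: 30 five-minute steps, N = 4 sensors, C = 1 feature.
series = np.arange(30 * 4 * 1, dtype=float).reshape(30, 4, 1)
X_in, X_out = make_windows(series, T=12, T_prime=12)
print(X_in.shape, X_out.shape)
```

With 30 recorded steps and T = T' = 12, the sketch yields seven overlapping samples, each pairing one hour of history with the following hour.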
4 Methods
This paper proposes a novel traffic flow prediction model based on a multi-head attention mechanism to solve the long-term traffic flow prediction problem in urban areas. The model’s architecture, depicted in Fig. 1, consists of three modules. Both the encoder and the decoder are composed of three types of multi-head attention mechanisms that handle the temporal, geographic spatial, and semantic spatial dimensions, respectively, and l represents the number of stacked layers of the encoder and decoder. Among them, the temporal attention mechanism mainly models and learns the temporal correlation between different time steps, the geographic spatial attention mechanism focuses on capturing spatial correlation between different geographical locations, and the semantic spatial attention mechanism focuses on extracting semantic information from traffic data. Then, a spatiotemporal heterogeneous fusion mechanism is introduced, by which the outputs of the three types of multi-head attention are weighted, combined, and mapped to the same feature space to obtain the multi-head attention output. The encoder output is fed into the decoder’s temporal attention module, providing the decoder with rich contextual information to learn the temporal feature representation persistently and circularly. Finally, recursive multi-step prediction is performed to obtain the prediction results for future time steps.

Figure 1. Structure of the proposed traffic flow prediction model with multi-head attention: (a) data embedding layer, (b) spatiotemporal encoder, and (c) spatiotemporal decoder.
4.1 Data embedding layer
The data embedding layer’s structure is shown in Fig. 2. This module first transforms the C-dimensional traffic flow vector into a d-dimensional embedding vector through the fully connected layer. Then, it performs data embedding to extract the relevant feature representation for integration into the model for prediction. According to the characteristics of different features, the data embedding layer is further divided into three kinds of data embedding: Positional, spatial graph, and temporal periodic embedding, which are defined as follows.

Figure 2. Data embedding layer.
1) Positional embedding: Influenced by people’s travel and lifestyle, urban traffic flow is usually sensitive to time series location information, which can help the model to perceive the future traffic flow trend more accurately according to the traffic condition and thus improve the prediction accuracy. Therefore, this paper refers to positional embedding [24], which embeds the time location information into the input sequence. Specifically, the embedding vector has a dimension size d, and the positional embedding matrix $ {\mathbf{X}}_{\mathrm{t}\mathrm{p}\mathrm{e}}\in {\mathbb{R}}^{T\times d} $ is defined as follows:
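$ {\mathbf{X}}_{\mathrm{t}\mathrm{p}\mathrm{e}}\left(pos{\rm{,}}\;j\right)=\left\{\begin{array}{ll}\sin \left(pos/{10000}^{2i/d}\right){\rm{,}} & j=2i{\rm{,}}\\ \cos \left(pos/{10000}^{2i/d}\right){\rm{,}} & j=2i+1{\rm{,}}\end{array}\right. $ (2)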
where pos denotes the current position and i is the current dimension.
2) Spatial graph embedding: In this paper, the graph Laplacian eigenvectors proposed by Belkin et al. [28] are used to embed the node adjacencies, characterize the spatial correlation between nodes based on the geographic distance between them, and preserve the global graph structure information. First, the normalized Laplacian matrix $ \mathbf{L}=\mathbf{I}-{\mathbf{D}}^{-1/2}\mathbf{A}{\mathbf{D}}^{-1/2} $ is calculated, where A is the adjacency matrix, I is the identity matrix, D is the degree matrix, and $ {\mathbf{D}}_{i{\mathrm{,}}i}=\displaystyle\sum _{j=1}^{N}{\mathbf{A}}_{i{\mathrm{,}}j} $. After the eigenvalue decomposition $ \mathbf{L}{\mathbf{v}}_{i}=\lambda _{i}{\mathbf{v}}_{i} $, the Laplacian eigenvectors $ {\mathbf{v}}_{i} $ and eigenvalues $ \lambda _i $ are obtained, and for the N nodes in the graph, the Laplacian embedding $ {\mathbf{X}}_{\mathrm{s}\mathrm{p}\mathrm{e}}\in {\mathbb{R}}^{N\times d} $ is obtained by applying a linear projection to the k smallest nontrivial eigenvectors.
3) Temporal periodic embedding: In the previously described positional embedding, only the traffic flow characteristics of the input time step are considered. However, urban traffic generally has obvious periodicity, such as morning and evening rush hours and non-working days [1], and the impact of that temporal periodicity on the prediction accuracy cannot be ignored. For example, the traffic flow at 9:00 AM on Monday is influenced not only by the traffic flow in the time period near 8:00 AM but also by the traffic flow in the same time period in the past days and weeks. Therefore, two time embeddings $ {t}_{w\left(t\right)}{\rm{,}}\;{t}_{d\left(t\right)}\in {\mathbb{R}}^{d} $ are designed to represent weekly and daily periods, where $ w\left(t\right) $ and $ d\left(t\right) $ map time t to the day-of-week index (1 to 7) and the minute-of-day index (1 to 1440), respectively. The time period embeddings $ {\mathbf{X}}_{w}{\rm{,}}\;{\mathbf{X}}_{d}\in {\mathbb{R}}^{T\times d} $ are obtained by concatenating the d-dimensional time embedding vectors of the T time slices of the input sequence data.
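The three embeddings above can be illustrated with a short NumPy sketch. It rests on several assumptions that are not stated in the text: the positional embedding is taken to be the standard sinusoidal form of [24] with an even d, the normalized Laplacian is taken in its standard form I − D^{-1/2}AD^{-1/2} on a connected undirected graph, Monday is mapped to index 1, and a random matrix `W` stands in for the learnable linear projection. All function names are hypothetical.

```python
from datetime import datetime
import numpy as np

def positional_embedding(T, d):
    """Fixed (T, d) sinusoidal matrix; d is assumed to be even."""
    pos = np.arange(T)[:, None]                 # (T, 1)
    i = np.arange(d // 2)[None, :]              # (1, d/2)
    angle = pos / np.power(10000.0, 2 * i / d)
    X_tpe = np.zeros((T, d))
    X_tpe[:, 0::2] = np.sin(angle)              # even dimensions
    X_tpe[:, 1::2] = np.cos(angle)              # odd dimensions
    return X_tpe

def laplacian_eigenvectors(A, k):
    """k smallest nontrivial eigenpairs of the normalized Laplacian of A."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvals[1:k + 1], eigvecs[:, 1:k + 1]   # skip the trivial zero mode

def period_indices(t):
    """Return (day-of-week in 1..7, minute-of-day in 1..1440)."""
    return t.isoweekday(), t.hour * 60 + t.minute + 1

X_tpe = positional_embedding(T=12, d=8)

# Toy road network: a 4-node path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
vals, vecs = laplacian_eigenvectors(A, k=2)
W = np.random.default_rng(0).normal(size=(2, 8))   # stand-in for the learned projection
X_spe = vecs @ W                                    # (N, d)

w_idx, d_idx = period_indices(datetime(2024, 1, 1, 9, 0))   # a Monday, 9:00 AM
```

In a trained model, the two indices returned by `period_indices` would select rows of learnable lookup tables to form the weekly and daily embeddings; that lookup is omitted here.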
Finally, the above embedding vectors are aggregated to obtain the output of the data embedding layer. The input data of the encoder $ {\mathbf{X}}_{\mathrm{e}\mathrm{n}\mathrm{c}\_\mathrm{d}\mathrm{a}\mathrm{t}\mathrm{a}} $ and of the decoder $ {\mathbf{X}}_{\mathrm{d}\mathrm{e}\mathrm{c}\_\mathrm{d}\mathrm{a}\mathrm{t}\mathrm{a}} $ are processed by their respective fully connected layers and then combined with the embeddings to obtain the processed data of the encoder and decoder, $ {\mathbf{X}}_{\mathrm{e}\mathrm{n}\mathrm{c}\_\mathrm{e}\mathrm{m}\mathrm{b}} $ and $ {\mathbf{X}}_{\mathrm{d}\mathrm{e}\mathrm{c}\_\mathrm{e}\mathrm{m}\mathrm{b}} $, which share the same data embedding structure. The positional embedding matrix is fixed and does not change during model training, while the other embedding matrices are updated during training. The final embedding representations of both sides are as follows:
$ {\mathbf{X}}_{\mathrm{e}\mathrm{n}\mathrm{c}\_\mathrm{e}\mathrm{m}\mathrm{b}}={\mathbf{X}}_{\mathrm{e}\mathrm{n}\mathrm{c}\_\mathrm{d}\mathrm{a}\mathrm{t}\mathrm{a}}+{\mathbf{X}}_{\mathrm{s}\mathrm{p}\mathrm{e}}+{\mathbf{X}}_{w}+{\mathbf{X}}_{d}+{\mathbf{X}}_{\mathrm{t}\mathrm{p}\mathrm{e}}{\mathrm{,}} $ (3a)
$ {\mathbf{X}}_{\mathrm{d}\mathrm{e}\mathrm{c}\_\mathrm{e}\mathrm{m}\mathrm{b}}={\mathbf{X}}_{\mathrm{d}\mathrm{e}\mathrm{c}\_\mathrm{d}\mathrm{a}\mathrm{t}\mathrm{a}}+{\mathbf{X}}_{\mathrm{s}\mathrm{p}\mathrm{e}}+{\mathbf{X}}_{w}+{\mathbf{X}}_{d}+{\mathbf{X}}_{\mathrm{t}\mathrm{p}\mathrm{e}} {\mathrm{.}} $ (3b)
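The terms summed in Eqs. (3a) and (3b) have different shapes: the spatial embedding has one row per node, whereas the periodic and positional embeddings have one row per time step. The sketch below shows one plausible reading, in which each term is broadcast along its missing axis. Random arrays stand in for all learned quantities, and the broadcasting scheme itself is an assumption, since the text does not spell it out.

```python
import numpy as np

T, N, C, d = 12, 4, 1, 8
rng = np.random.default_rng(0)

X_raw = rng.normal(size=(T, N, C))     # raw traffic flow window
W_fc = rng.normal(size=(C, d))         # stand-in for the fully connected layer
X_data = X_raw @ W_fc                  # (T, N, d)

X_spe = rng.normal(size=(N, d))        # spatial graph embedding, one row per node
X_w = rng.normal(size=(T, d))          # weekly embedding, one row per time step
X_d = rng.normal(size=(T, d))          # daily embedding, one row per time step
X_tpe = rng.normal(size=(T, d))        # positional embedding (fixed in the paper)

# Node-wise terms broadcast over time; time-wise terms broadcast over nodes.
X_emb = (X_data
         + X_spe[None, :, :]
         + X_w[:, None, :] + X_d[:, None, :] + X_tpe[:, None, :])
```

The same aggregation would be applied on the decoder side with its own fully connected layer.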
4.2 Spatiotemporal encoder
The spatiotemporal encoder comprises four modules: the geographic spatial self-attention module, which consists of delay-aware feature transformation and geographic spatial attention; the semantic spatial self-attention module, which performs semantic spatial attention; the temporal self-attention module, which performs temporal self-attention; and the spatiotemporal heterogeneous fusion module, which concatenates the multi-head outputs, as illustrated in Fig. 3.

Figure 3. Spatiotemporal encoder.
4.2.1 Geographic spatial self-attention module
Traffic flow dynamics propagate along the structure of the road network, change over time, and are influenced by adjacent road segments. For instance, traffic congestion often spreads from one road segment and affects the traffic flow of adjacent road segments. Therefore, a graph-based multi-head spatial self-attention mechanism is designed to capture the spatial dynamics of traffic flow. In addition, a traffic pattern self-attention mechanism based on the k-shape clustering algorithm is designed with reference to the delay-aware feature transformation proposed by Jiang et al. [1]. The input sequences are clustered to extract representative key historical traffic patterns, and the self-attention weighted output of the pattern features is calculated and added to the key matrix of the geographic spatial self-attention module. The traffic pattern self-attention mechanism helps the model to learn the traffic flow variation patterns better, reduce the influence of data noise, and improve the accuracy and stability of the model to a certain extent.
First, in data preprocessing, the traffic flows of all nodes in the historical time steps are selected for k-shape clustering, and the center of each cluster is used as the most representative historical traffic pattern of that cluster. Then, the historical traffic patterns and the representative traffic patterns obtained after clustering are converted into high-dimensional feature embeddings by a linear transformation, and the attention weights are calculated according to the similarity scores between them. Unlike the general self-attention mechanism, which constructs the query, key, and value matrices from the same input, the pattern self-attention uses the original historical patterns $ {\mathbf{X}}_{p} $ to construct the query matrix and the clustered historical patterns $ {\mathbf{X}}_{k} $ to construct the key and value matrices. Finally, the self-attention output $ {\mathbf{P}\mathbf{S}\mathbf{A}}_{t}^{P} $ of historical key traffic patterns is calculated as follows:
$ {\mathbf{Q}}_{t}^{P}={\mathbf{X}}_{t\colon\colon }{\mathbf{W}}_{Q}^{P}{\rm{,}}\;{\mathbf{K}}_{t}^{P}={\mathbf{X}}_{t\colon\colon }{\mathbf{W}}_{K}^{P}{\rm{,}}\;{\mathbf{V}}_{t}^{P}={\mathbf{X}}_{t\colon\colon }{\mathbf{W}}_{V}^{P}{\mathrm{,}} $ (4a)
$ {\mathbf{P}\mathbf{S}\mathbf{A}}_{t}^{P}=\text{softmax}\left(\frac{{\mathbf{Q}}_{t}^{P}{\left({\mathbf{K}}_{t}^{P}\right)}^{{\mathrm{T}}}}{\sqrt{{d}_{\mathrm{h}\mathrm{e}\mathrm{a}\mathrm{d}}}}\right){\mathbf{V}}_{t}^{P}{\mathrm{,}} $ (4b)
where $ {\mathbf{Q}}_{t}^{P} $ is the query matrix constructed based on the historical patterns, $ {\mathbf{K}}_{t}^{P} $ and $ {\mathbf{V}}_{t}^{P} $ are the key and value matrices constructed based on the clustered representative patterns, $ {d}_{\mathrm{h}\mathrm{e}\mathrm{a}\mathrm{d}} $ determines the dimensionality of the query, key, and value matrices, and $ {\mathbf{W}}_{Q}^{P} $, $ {\mathbf{W}}_{K}^{P} $, and $ {\mathbf{W}}_{V}^{P} $ are the corresponding mapping matrices.
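As a rough illustration of the pattern self-attention, the sketch below follows the textual description: queries come from the embedded historical patterns and keys and values come from the embedded cluster centers. The k-shape clustering step is not reproduced; random arrays stand in for the embedded patterns, the cluster centers, and the mapping matrices, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pattern_attention(X_p, X_k, W_Q, W_K, W_V):
    """Attend from per-node history (queries) to cluster centers (keys, values)."""
    Q = X_p @ W_Q                               # (N, d_head)
    K = X_k @ W_K                               # (n_clusters, d_head)
    V = X_k @ W_V                               # (n_clusters, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # (N, n_clusters)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, n_clusters, d, d_head = 5, 3, 8, 4
X_p = rng.normal(size=(N, d))                   # embedded historical patterns
X_k = rng.normal(size=(n_clusters, d))          # embedded cluster centers
W_Q, W_K, W_V = (rng.normal(size=(d, d_head)) for _ in range(3))
PSA, weights = pattern_attention(X_p, X_k, W_Q, W_K, W_V)
```

Each row of `weights` sums to one, so every node receives a convex combination of the representative patterns.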
At time t, the query, key, and value matrices $ {\mathbf{Q}}_{t}^{S}{\rm{,}}\;{\mathbf{K}}_{t}^{S}{\rm{,}}\;\mathrm{a}\mathrm{n}\mathrm{d}\;{\mathbf{V}}_{t}^{S} $ are constructed from the same input sequences. The pattern self-attention output obtained from the previous clustering is then added to the key matrix, and this sum is used to compute the geographic spatial self-attention output $ \mathbf{G}\mathbf{e}\mathbf{o}\mathbf{S}\mathbf{A}\in {\mathbb{R}}^{B\times T\times N\times D} $, where B is the batch size:
$ {\mathbf{Q}}_{t}^{S}={\mathbf{X}}_{t\colon\colon }{\mathbf{W}}_{Q}^{S}{\rm{,}}\;{\mathbf{K}}_{t}^{S}={\mathbf{X}}_{t\colon\colon }{\mathbf{W}}_{K}^{S}{\rm{,}}\;{\mathbf{V}}_{t}^{S}={\mathbf{X}}_{t\colon\colon }{\mathbf{W}}_{V}^{S}{\mathrm{,}} $ (5a)
$ \widetilde {{\mathbf{K}}_{t}^{G}}={\mathbf{K}}_{t}^{S}+{\mathbf{P}\mathbf{S}\mathbf{A}}_{t}^{P}{\mathrm{,}} $ (5b)
$ \mathbf{G}\mathbf{e}\mathbf{o}\mathbf{S}\mathbf{A}=\text{softmax}\left(\frac{{\mathbf{Q}}_{t}^{S}{\left(\widetilde{{\mathbf{K}}_{t}^{G}}\right)}^{{\mathrm{T}}}}{\sqrt{{d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}}}\odot {\bf{M}}_{\mathrm{g}\mathrm{e}\mathrm{o}}\right){\mathbf{V}}_{t}^{S}{\mathrm{,}} $ (5c)
where $ {\mathbf{W}}_{Q}^{S}{\rm{,}}\;{\mathbf{W}}_{K}^{S}{\rm{,}}\;{\mathrm{a}\mathrm{n}\mathrm{d}\;\mathbf{W}}_{V}^{S} $ are the corresponding mapping matrices, determining the dimensions of $ {\mathbf{Q}}_{t}^{S}{\rm{,}}\;{\mathbf{K}}_{t}^{S}{\rm{,}}\;\mathrm{a}\mathrm{n}\mathrm{d}\;{\mathbf{V}}_{t}^{S}{\rm{;}}\;\widetilde{{\mathbf{K}}_{t}^{G}} $ is the key matrix that incorporates historical traffic pattern information; $ {\bf{M}}_{\mathrm{g}\mathrm{e}\mathrm{o}} $ is the geographic shielding matrix constructed based on the geographic adjacency relationship between nodes whose shortest distance is less than a geographic threshold. This latter matrix is utilized to shield against the influence of distant nodes and avoid the interference of distant semantic spatial information on the mining process of proximal geographic spatial dependency.
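The masked attention of Eq. (5c) can be sketched as follows. One implementation choice here is an assumption: masked positions are set to negative infinity before the softmax so that they receive exactly zero weight, which is a common way to realize such a mask but is not necessarily how the elementwise product in Eq. (5c) is implemented. The distance matrix, the threshold, and the random stand-ins for the query, key, value, and pattern attention output are purely illustrative.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention; positions where mask is False get zero weight."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

# Hypothetical shortest-path distances (km) between N = 4 sensors.
dist = np.array([[0.0, 1.0, 4.0, 9.0],
                 [1.0, 0.0, 2.0, 7.0],
                 [4.0, 2.0, 0.0, 3.0],
                 [9.0, 7.0, 3.0, 0.0]])
M_geo = dist < 3.5          # keep only proximal nodes (illustrative threshold)

rng = np.random.default_rng(0)
d_model = 8
Q, K, V = (rng.normal(size=(4, d_model)) for _ in range(3))
PSA = rng.normal(size=(4, d_model))     # stand-in for the pattern attention output
GeoSA, w = masked_attention(Q, K + PSA, V, M_geo)
```

Because each node is within the threshold of itself, every row of the mask keeps at least one entry and the softmax stays well defined.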
4.2.2 Semantic spatial self-attention module
Given the spatial heterogeneity of traffic flows, even at the same time, there may be significant differences in spatial characteristics and different traffic patterns between neighboring areas. Also, similar traffic patterns may exist between some distant but semantically similar areas. For example, there are significant differences in traffic patterns between residential and commercial areas, while commercial areas have similar traffic patterns due to their similar functional and planning structures. To capture the dynamic semantic spatial dependency of traffic flows, the dynamic time warping algorithm [29] is used to calculate the semantic similarity between nodes. The top l nodes with the highest similarity are selected for each node as its neighbor nodes to obtain the semantic spatial adjacency matrix, and the semantic spatial self-attention output $ \mathbf{S}\mathbf{e}\mathbf{m}\mathbf{S}\mathbf{A}\in {\mathbb{R}}^{B\times T\times N\times D} $ is calculated as
$ \mathbf{S}\mathbf{e}\mathbf{m}\mathbf{S}\mathbf{A}=\text{softmax}\left(\frac{{\mathbf{Q}}_{t}^{S}{\left(\widetilde{{\mathbf{K}}_{t}^{S}}\right)}^{{\mathrm{T}}}}{\sqrt{{d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}}}\odot {\mathbf{M}}_{\mathrm{s}\mathrm{e}\mathrm{m}}\right){\mathbf{V}}_{t}^{S}{\mathrm{,}} $ (6)
where $ {\mathbf{M}}_{\mathrm{s}\mathrm{e}\mathrm{m}} $ is the semantic masking matrix with only the distant semantic adjacencies retained, and the weight of semantic spatial information of distant nodes will be enhanced by masking the influence of proximity nodes.
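A minimal sketch of how a semantic neighbor mask could be derived from dynamic time warping is given below. It uses the textbook quadratic-time DTW recurrence with an absolute-difference cost and keeps the `top_l` most similar other nodes for each node. The additional step of discarding geographically proximal nodes is not shown, and the function names and toy series are assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def semantic_mask(series, top_l):
    """Boolean (N, N) mask keeping, for each node, its top_l most similar other nodes."""
    N = len(series)
    dist = np.array([[dtw_distance(series[i], series[j]) for j in range(N)]
                     for i in range(N)])
    np.fill_diagonal(dist, np.inf)              # exclude self when ranking
    mask = np.zeros((N, N), dtype=bool)
    nearest = np.argsort(dist, axis=1)[:, :top_l]
    mask[np.arange(N)[:, None], nearest] = True
    return mask

# Nodes 0 and 2 share a shape (shifted by one step); node 1 is flat.
series = np.array([[0, 1, 2, 1, 0, 0],
                   [1, 1, 1, 1, 1, 1],
                   [0, 0, 1, 2, 1, 0]], dtype=float)
M_sem = semantic_mask(series, top_l=1)
```

In this toy case, nodes 0 and 2 select each other because warping aligns their shifted peaks, even though a pointwise comparison would see them as different.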
4.2.3 Temporal self-attention module
In the time dimension, the traffic flows of different nodes vary across different time slices with different temporal dependency in terms of proximity, periodicity, and trend. To capture this dynamic variation, the query, key, and value matrices, $ {\mathbf{Q}}_{n}^{T}{\rm{,}}\;{\mathbf{K}}_{n}^{T}{\rm{,}}\;{\mathrm{a}\mathrm{n}\mathrm{d}\;\mathbf{V}}_{n}^{T} $, are first constructed to calculate the attention output of node n in all time slices. On the one hand, this allows for the dynamic extraction of traffic patterns from each node over time. On the other hand, the self-attention mechanism has a larger receptive field and can access contextual information across the whole input time range, thus capturing the relationship between different time steps in the sequence. This enables the model to extract the short-term and long-term temporal dependency that changes dynamically over time, thus obtaining a more comprehensive and accurate time series embedding. The temporal self-attention output $ \mathbf{T}\mathbf{e}\mathbf{m}\mathbf{S}\mathbf{A}\in {\mathbb{R}}^{B\times T\times N\times D} $ is calculated as follows:
$ {\mathbf{Q}}_{n}^{T}={\mathbf{X}}_{{\mathrm{:}}n{\mathrm{:}}}{\mathbf{W}}_{Q}^{T}{\rm{,}}\;{\mathbf{K}}_{n}^{T}={\mathbf{X}}_{{\mathrm{:}}n{\mathrm{:}}}{\mathbf{W}}_{K}^{T}{\rm{,}}\;{\mathbf{V}}_{n}^{T}={\mathbf{X}}_{{\mathrm{:}}n{\mathrm{:}}}{\mathbf{W}}_{V}^{T}{\mathrm{,}} $ (7a)
$ \mathbf{T}\mathbf{e}\mathbf{m}\mathbf{S}\mathbf{A}=\text{softmax}\left(\frac{{\mathbf{Q}}_{n}^{T}{\left({\mathbf{K}}_{n}^{T}\right)}^{{\mathrm{T}}}}{\sqrt{{d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}}}\right){\mathbf{V}}_{n}^{T}{\mathrm{,}} $ (7b)
where $ {\mathbf{W}}_{Q}^{T}{\rm{,}}\;{\mathbf{W}}_{K}^{T}{\rm{,}}\;{\mathbf{W}}_{V}^{T}\in {\mathbb{R}}^{d\times {d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}} $ are the learnable parameters and $ {d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}} $ determines the dimensionality of Q, K, and V.
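The temporal self-attention of Eqs. (7a) and (7b) differs from the spatial variants mainly in the axis it attends over: for each node, the T time slices attend to one another. A single-head NumPy sketch with random stand-ins for the input and the learnable matrices is shown below. The batch dimension is omitted, and the function name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(X, W_Q, W_K, W_V):
    """X has shape (T, N, d); attention runs over the T axis separately per node."""
    Xn = np.transpose(X, (1, 0, 2))                  # (N, T, d)
    Q, K, V = Xn @ W_Q, Xn @ W_K, Xn @ W_V           # each (N, T, d_model)
    scores = Q @ np.transpose(K, (0, 2, 1)) / np.sqrt(Q.shape[-1])   # (N, T, T)
    weights = softmax(scores, axis=-1)
    out = weights @ V                                # (N, T, d_model)
    return np.transpose(out, (1, 0, 2)), weights     # back to (T, N, d_model)

rng = np.random.default_rng(0)
T, N, d, d_model = 12, 4, 8, 8
X = rng.normal(size=(T, N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_model)) for _ in range(3))
TemSA, weights = temporal_self_attention(X, W_Q, W_K, W_V)
```

The attention weights form one T × T matrix per node, so a change in one node's history does not affect another node's temporal output in this module.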
4.2.4 Spatiotemporal heterogeneous fusion mechanism
To integrate the outputs of the three different multi-head self-attention mechanisms at the geographic spatial (GeoSA), semantic spatial (SemSA), and temporal (TemSA) levels and to reduce the computational complexity of the model, this paper further introduces a spatiotemporal heterogeneous fusion mechanism. This mechanism sequentially projects the results of these self-attention heads into the same feature space and outputs them, enabling the model to simultaneously integrate geographic spatial, semantic spatial, and temporal information. The multi-head self-attention output of the spatiotemporal encoder is obtained as follows:
$ \text{STEncSelfAttn}= \oplus \left({\bf{H}}_{1{\mathrm{,}} 2{\mathrm{,}} \cdots {\mathrm{,}}{h}_{\mathrm{g}\mathrm{e}\mathrm{o}}}^{\text{geo}}{\mathrm{,}}\;{\bf{H}}_{1{\mathrm{,}} 2{\mathrm{,}} \cdots {\mathrm{,}}{h}_{\mathrm{s}\mathrm{e}\mathrm{m}}}^{\text{sem}}{\mathrm{,}}\;{\bf{H}}_{1{\mathrm{,}} 2{\mathrm{,}} \cdots {\mathrm{,}}{h}_{\mathrm{t}\mathrm{e}\mathrm{m}}}^{\text{tem}}\right){\mathbf{W}}^{{\mathbf{O}}_{\mathrm{e}\mathrm{n}\mathrm{c}}}{\mathrm{,}} $ (8)
where $ {\bf{H}}_{1{\mathrm{,}} 2{\mathrm{,}}\cdots {\mathrm{,}}{h}_{\mathrm{g}\mathrm{e}\mathrm{o}}}^{\text{geo}}{\mathrm{,}}\;{\bf{H}}_{1{\mathrm{,}} 2{\mathrm{,}}\cdots {\mathrm{,}}{h}_{\mathrm{s}\mathrm{e}\mathrm{m}}}^{\text{sem}}{\mathrm{,}}\;{\text{and}\;\bf{H}}_{1{\mathrm{,}} 2{\mathrm{,}}\cdots {\mathrm{,}}{h}_{\mathrm{t}\mathrm{e}\mathrm{m}}}^{\text{tem}} $ denote the three multi-head self-attention outputs for the geographic spatial, semantic spatial, and temporal levels, respectively; $ {h}_{\mathrm{g}\mathrm{e}\mathrm{o}}{\mathrm{,}}\;{h}_{\mathrm{s}\mathrm{e}\mathrm{m}}{\mathrm{,}}\;{\mathrm{a}\mathrm{n}\mathrm{d}\;h}_{\mathrm{t}\mathrm{e}\mathrm{m}} $ denote the numbers of heads in the three self-attention modules; $ \oplus $ denotes the matrix direct sum, i.e., the splicing operation; and $ {\mathbf{W}}^{{\mathbf{O}}_{\mathrm{e}\mathrm{n}\mathrm{c}}} $ is the output layer mapping matrix. Finally, the results are fed into the feedforward neural network for regularization, and layer normalization and residual connections are added to improve the generalization of the model.
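The fusion in Eq. (8) can be sketched as follows, reading the splicing operation as concatenation along the feature axis followed by a linear projection. The head counts, the toy values, and the identity matrix standing in for the learned output mapping are assumptions made for illustration only.

```python
def concat_heads(*head_groups):
    """Splice head outputs along the feature axis.

    Each group is a list of heads; each head is a list of rows (one per token).
    """
    heads = [head for group in head_groups for head in group]
    n_rows = len(heads[0])
    return [[value for head in heads for value in head[r]] for r in range(n_rows)]

def project(m, w):
    """Right-multiply m by the output mapping matrix w."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in m]

# Hypothetical head counts: 2 geographic, 1 semantic, 1 temporal.
# Every head holds 2 tokens with 2 features each.
geo = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
sem = [[[0.1, 0.2], [0.3, 0.4]]]
tem = [[[9.0, 9.5], [8.0, 8.5]]]

fused = concat_heads(geo, sem, tem)   # 2 tokens x 8 features
# Identity matrix used as a stand-in for the learned output mapping matrix.
w_o = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
output = project(fused, w_o)
```

Because all heads are placed in one feature vector before the projection, a single matrix multiplication mixes geographic, semantic, and temporal information for each token.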
4.3 Spatiotemporal decoder
The spatiotemporal decoder consists of a geographic spatial self-attention module, a semantic spatial self-attention module, a persistent perceptual temporal attention module that runs temporal self-attention and temporal attention, and a spatiotemporal heterogeneous fusion module that concatenates the multi-head outputs, as shown in Fig. 4. Unlike the encoder, the spatiotemporal decoder computes the predicted value of the current time step from the traffic flow data of all previously predicted time steps. Furthermore, the decoder combines this with the output of the encoder and dynamically updates the predicted values of the predicted time steps. The spatiotemporal decoder is thus run T’ times to obtain the traffic flow forecast for the next T’ time steps.
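The step-by-step decoding procedure can be sketched as a simple loop. This is a simplified illustration, not the MSPSTT decoder: `toy_decoder_step`, the start token, and the scalar values are hypothetical stand-ins, and the sketch only appends new predictions rather than also updating earlier ones.

```python
def autoregressive_forecast(encoder_output, decoder_step, start_token, horizon):
    """Run the decoder `horizon` (T') times, feeding back earlier predictions."""
    generated = [start_token]
    for _ in range(horizon):
        # The decoder sees every step produced so far plus the encoder output.
        next_value = decoder_step(encoder_output, generated)
        generated.append(next_value)
    return generated[1:]   # drop the start token

def toy_decoder_step(encoder_output, generated):
    """Hypothetical decoder: average the encoder context with the latest step."""
    context = sum(encoder_output) / len(encoder_output)
    return 0.5 * (context + generated[-1])

forecast = autoregressive_forecast(
    [10.0, 12.0, 14.0], toy_decoder_step, start_token=14.0, horizon=4
)
```

The loop makes the cost of this design visible: producing T’ steps requires T’ decoder passes, whereas an encoder-only model emits all steps in one pass.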

Figure 4. Spatiotemporal decoder.
Calculating the attention weights for the output of the spatiotemporal encoder and the output of the temporal self-attention module through multi-head attention helps the decoder integrate the most relevant information of the input time steps provided by the encoder and the predicted time steps. This allows the decoder to consider the temporal association of all time steps before the current time step, including the predicted time steps. The temporal self-attention module is the same as that proposed in the encoder. Then, the output $ {\mathrm{T}\mathrm{e}\mathrm{m}\mathrm{S}\mathrm{A}}_{\mathrm{d}\mathrm{e}\mathrm{c}} $ of this temporal self-attention module and the output $ {\mathbf{O}}_{\mathrm{e}\mathrm{n}\mathrm{c}} $ of the spatiotemporal encoder are used to construct the query, key, and value matrices $ {\mathbf{Q}}_{\mathrm{T}\mathrm{G}}{\mathrm{,}}\;{\mathbf{K}}_{\mathrm{T}\mathrm{G}}{\mathrm{,}}\;{\mathrm{a}\mathrm{n}\mathrm{d}\;\mathbf{V}}_{\mathrm{T}\mathrm{G}} $, for computing the temporal attention weights $ \mathbf{T}\mathbf{e}\mathbf{m}\mathbf{P}\mathbf{e}\mathbf{r}\mathbf{S}\mathbf{A}\in {\mathbb{R}}^{B\times T\times N\times D} $:
$ {\mathbf{Q}}_{\mathrm{T}\mathrm{G}}=\mathbf{T}\mathbf{e}\mathbf{m}\mathbf{S}{\mathbf{A}}_{\mathrm{d}\mathrm{e}\mathrm{c}}{\mathbf{W}}_{Q}^{\mathrm{T}\mathrm{G}}{\rm{,}}\;{\mathbf{K}}_{\mathrm{T}\mathrm{G}}={\mathbf{O}}_{\mathrm{e}\mathrm{n}\mathrm{c}}{\mathbf{W}}_{K}^{\mathrm{T}\mathrm{G}}{\rm{,}}\;{\mathbf{V}}_{\mathrm{T}\mathrm{G}}={\mathbf{O}}_{\mathrm{e}\mathrm{n}\mathrm{c}}{\mathbf{W}}_{V}^{\mathrm{T}\mathrm{G}}{\mathrm{,}} \tag{9a}$
$ \mathbf{T}\mathbf{e}\mathbf{m}\mathbf{P}\mathbf{e}\mathbf{r}\mathbf{S}\mathbf{A}=\text{softmax}\left(\frac{{\mathbf{Q}}_{\mathrm{T}\mathrm{G}}{\left({\mathbf{K}}_{\mathrm{T}\mathrm{G}}\right)}^{{\mathrm{T}}}}{\sqrt{{d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}}}\right){\mathbf{V}}_{\mathrm{T}\mathrm{G}}{\mathrm{,}} \tag{9b}$
where $ {\mathbf{W}}_{Q}^{\mathrm{T}\mathrm{G}}\in {\mathbb{R}}^{d\times {d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}} $ and $ {\mathbf{W}}_{K}^{\mathrm{T}\mathrm{G}}{\rm{,}}\;{\mathbf{W}}_{V}^{\mathrm{T}\mathrm{G}}\in {\mathbb{R}}^{{d}_{\mathrm{O}}\times {d}_{\mathrm{m}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{l}}} $ are the learnable parameters.
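To make the shapes in Eqs. (9a) and (9b) concrete, the sketch below builds the queries from decoder states and the keys and values from encoder states, which may have a different length and feature size. All names, dimensions, and weight values are illustrative assumptions for a single node and a single head.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_attention(dec_states, enc_states, w_q, w_k, w_v):
    """Eqs. (9a)-(9b): queries from the decoder, keys and values from the encoder."""
    q = matmul(dec_states, w_q)                      # T' x d_model
    k = matmul(enc_states, w_k)                      # T  x d_model
    v = matmul(enc_states, w_v)                      # T  x d_model
    scale = math.sqrt(len(w_q[0]))
    scores = matmul(q, [list(c) for c in zip(*k)])   # T' x T
    weights = []
    for row in scores:
        scaled = [s / scale for s in row]
        peak = max(scaled)
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v), weights

# Toy shapes: T' = 2 decoder steps (d = 2), T = 3 encoder steps (d_O = 3), d_model = 2.
dec = [[1.0, 0.0], [0.0, 1.0]]
enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = cross_attention(dec, enc, w_q, w_k, w_v)
```

The output has one row per decoder step, while each row of `weights` spans the encoder steps, which is how the decoder selects the most relevant input time steps.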
Finally, the outputs of multi-head attention are spliced together to obtain the output of the multi-head attention of the spatiotemporal decoder:
$ \text{STDecSelfAttn}= \oplus \left({\bf{H}}_{1{\rm{,}} 2{\rm{,}} \cdots {\rm{,}}{h}_{\mathrm{g}\mathrm{e}\mathrm{o}}}^{\text{geo\_dec}}{\rm{,}}\;{\bf{H}}_{1{\rm{,}} 2{\rm{,}} \cdots {\rm{,}}{h}_{\mathrm{s}\mathrm{e}\mathrm{m}}}^{\text{sem\_dec}}{\rm{,}}\;{\bf{H}}_{1{\rm{,}} 2{\rm{,}} \cdots {\rm{,}}{h}_{\mathrm{t}\mathrm{e}\mathrm{m}}}^{\text{tem\_dec}}\right){\mathbf{W}}^{{\mathbf{O}}_{\mathrm{d}\mathrm{e}\mathrm{c}}}{\mathrm{,}} $ (10)
where $ {\bf{H}}_{1{\rm{,}} 2{\rm{,}}\cdots {\rm{,}}{h}_{\mathrm{g}\mathrm{e}\mathrm{o}}}^{\text{geo\_dec}}{\rm{,}}\; {\bf{H}}_{1{\rm{,}} 2{\rm{,}} \cdots {\rm{,}}{h}_{\mathrm{s}\mathrm{e}\mathrm{m}}}^{\text{sem\_dec}}{\rm{,}}\; \mathrm{a}\mathrm{n}\mathrm{d}\;{\bf{H}}_{1{\rm{,}} 2{\rm{,}} \cdots {\rm{,}}{h}_{\mathrm{t}\mathrm{e}\mathrm{m}}}^{\text{tem\_dec}} $ denote the outputs of the three multi-head attention modules for the geographic spatial, semantic spatial, and temporal levels, respectively, and $ {\mathbf{W}}^{{\mathbf{O}}_{\mathrm{d}\mathrm{e}\mathrm{c}}} $ is the decoder output layer mapping matrix.
5 Experimental analysis
5.1 Experimental data
To validate the effectiveness of MSPSTT, experiments were conducted on four real-world, publicly available traffic datasets: two graph-based highway traffic flow datasets (PeMSD4 and PeMSD8 [8]) and two grid-based city-wide traffic datasets (NYCTaxi [30] and CHIBike [31]). Table 1 lists the size and data volume of these datasets.

Table 1. Datasets information.
| Dataset | Nodes | Edges | Time steps | Time interval | Time range |
| --- | --- | --- | --- | --- | --- |
| PeMSD4 | 307 | 340 | 16992 | 5 min | 01/01/2018−02/28/2018 |
| PeMSD8 | 170 | 295 | 17856 | 5 min | 07/01/2016−08/31/2016 |
| NYCTaxi | 75 | 484 | 17520 | 30 min | 01/01/2014−12/31/2014 |
| CHIBike | 270 | 1966 | 4416 | 30 min | 07/01/2020−09/30/2020 |
5.2 Baseline models
To evaluate the performance of the proposed MSPSTT model, the following baseline models were selected for comparison.
1) LSTM [32]: A typical recurrent neural network. This model mitigates the gradient vanishing and exploding problems of traditional RNN algorithms and can better learn information from long sequences by using a gating mechanism to control the transmission and forgetting of information.
2) Graph WaveNet [33]: This model is based on the WaveNet structure for spatiotemporal feature extraction that combines causal convolution with graph convolution.
3) Spatiotemporal graph convolutional neural network (STGCN) [20]: The model uses scaled Laplacian matrices to model spatial adjacencies, captures spatial features using Chebyshev polynomial approximation graph convolution, and captures temporal features along the temporal dimension using one-dimensional convolution.
4) Attention based spatial-temporal graph convolutional networks (ASTGCN) [21]: STGCN with an attention mechanism, which helps the model to learn dynamic temporal and spatial information.
5) Diffusion convolution recurrent neural network (DCRNN) [19]: A sequence-to-sequence architecture based on the gated recurrent unit (GRU) is used to capture temporal information, and the matrix multiplication in the GRU is replaced by diffusion convolution to capture spatial information.
6) Propagation delay-sensitive spatiotemporal self-attentive networks (PDFormer) [1]: A spatiotemporal encoder based on the multi-head self-attention mechanism is implemented to extract spatiotemporal dependency dynamically. Moreover, a propagation delay-aware module is proposed to explicitly model the time delays in the spatial information propagation process.
7) Attention-based spatiotemporal graph neural network (ASTGNN) [2]: In the temporal dimension, a self-attention mechanism that can utilize local contextual information is designed to retain a global receptive field while capturing dynamic temporal associations. In the spatial dimension, dynamic spatial dependency is extracted using dynamic graph convolution and a self-attention mechanism.
MSPSTT was compared with all the baselines on the PeMSD4 and PeMSD8 datasets, and only with PDFormer [1] on the two grid datasets (NYCTaxi and CHIBike).
5.3 Experimental setup and evaluation metrics
The proposed MSPSTT model was implemented with the PyTorch framework and run on an NVIDIA GeForce RTX 3060 GPU with 12 GB of memory. For dataset preprocessing, the graph datasets, PeMSD4 and PeMSD8, were divided into training, validation, and test sets at a ratio of 6∶2∶2, while the grid datasets, NYCTaxi and CHIBike, were divided at a ratio of 7∶1∶2. For PeMSD4 and PeMSD8, 12-step future prediction sequences were created from the historical 12-step traffic flows. In contrast, 6-step future prediction sequences were created from the historical 6-step traffic flows for the two grid datasets. The batch size was set to 8, and the maximum number of epochs was set to 50.
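The preprocessing described above can be sketched as follows. The helper names and the toy series are hypothetical, and the sketch assumes a chronological split with windows built inside each subset; the text does not state either detail.

```python
def chronological_split(series, ratios):
    """Split a time-ordered sequence into train/validation/test parts by ratio."""
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return series[:train_end], series[train_end:val_end], series[val_end:]

def sliding_windows(series, history, horizon):
    """Pair each `history`-step input window with the `horizon` steps that follow."""
    samples = []
    for start in range(len(series) - history - horizon + 1):
        inputs = series[start:start + history]
        targets = series[start + history:start + history + horizon]
        samples.append((inputs, targets))
    return samples

# Toy series of 100 readings standing in for the flow values of one sensor.
series = list(range(100))
train, val, test = chronological_split(series, (6, 2, 2))      # graph datasets
windows = sliding_windows(train, history=12, horizon=12)       # 12 steps in, 12 out
```

For the grid datasets, the same helpers would be called with `(7, 1, 2)` and `history=6, horizon=6`.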
In terms of model parameters and training strategy, the AdamW optimizer was used for training with a learning rate of 0.001. In addition, an early stopping mechanism was used: when a model’s performance did not improve after several consecutive training epochs, the training was terminated early. This strategy was applied to avoid overfitting due to excessive model training.
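A minimal sketch of such an early stopping rule is given below. The class name, the patience value of 3, and the validation losses are assumptions for illustration; the number of epochs tolerated in the actual experiments is not stated.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses; the patience value is an assumption.
stopper = EarlyStopping(patience=3)
losses = [20.0, 19.0, 19.5, 19.2, 19.1, 18.0]
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this toy run, training halts after three consecutive epochs without improvement, so the later, lower loss is never reached; this is the usual trade-off when choosing the patience.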
The mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) were used as metrics for model evaluation, and MAE was used as the loss function as well. The three metrics are calculated as follows:
$ \mathrm{M}\mathrm{A}\mathrm{E}=\frac{1}{n}\displaystyle\sum _{i=1}^{n}\left|{y}_{i}-\widehat{{y}_{i}}\right|{\mathrm{,}} \tag{11a}$
$ \mathrm{R}\mathrm{M}\mathrm{S}\mathrm{E}=\sqrt{\frac{1}{n}\displaystyle\sum _{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{\mathrm{,}} \tag{11b}$
$ \mathrm{M}\mathrm{A}\mathrm{P}\mathrm{E}=\frac{100}{n}\displaystyle\sum _{i=1}^{n}\frac{\left|{y}_{i}-\widehat{{y}_{i}}\right|}{{y}_{i}}{\mathrm{,}} \tag{11c}$
where n is the number of samples, $ {y}_{i} $ is the true value of the ith sample, and $ \widehat{{y}_{i}} $ is the predicted value of the ith sample.
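The three metrics in Eqs. (11a) to (11c) translate directly into code. The sketch below uses plain Python and made-up values; it follows the formulas as written, so MAPE is undefined when a true value is zero. How such entries are handled in the experiments is not stated here.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (11a)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (11b)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Eq. (11c)."""
    return 100.0 / len(y_true) * sum(abs(t - p) / t for t, p in zip(y_true, y_pred))

# Made-up flow values for illustration.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

RMSE penalizes the single 20-unit error more heavily than MAE does, which is why the two metrics can rank models differently.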
5.4 Validity experiments and performance comparison
The comparison results in Table 2 indicate that methods without an attention mechanism tend to underperform. LSTM only considers temporal information and cannot handle complex spatiotemporal correlations, which leads to poor results. DCRNN is unable to circumvent the problem of vanishing and exploding gradients when dealing with long time series due to the GRU structure. Both STGCN and Graph WaveNet are typical spatiotemporal GCNs. Compared with STGCN, Graph WaveNet introduces an adaptive spatial adjacency matrix and dilated causal convolution, which retains long-term dependency well; consequently, Graph WaveNet performs better.

Table 2. Experimental results on the PeMSD4 and PeMSD8 graph datasets.
| Dataset | Index | LSTM | DCRNN | STGCN | ASTGCN | Graph WaveNet | ASTGNN | PDFormer | MSPSTT | MSPSTT_nde |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PeMSD4 | MAE | 26.84 | 23.67 | 22.31 | 22.48 | 19.35 | 19.28 | 18.75 | 18.77 | 18.45 |
| PeMSD4 | RMSE | 40.76 | 37.12 | 35.05 | 34.81 | 31.74 | 31.27 | 30.26 | 30.29 | 30.04 |
| PeMSD4 | MAPE (%) | 22.39 | 16.12 | 14.43 | 15.89 | 13.32 | 13.16 | 13.00 | 12.50 | 12.34 |
| PeMSD8 | MAE | 22.19 | 18.25 | 18.09 | 18.92 | 15.07 | 15.96 | 14.75 | 14.16 | 14.13 |
| PeMSD8 | RMSE | 33.63 | 28.32 | 27.98 | 28.56 | 23.85 | 24.92 | 24.18 | 23.66 | 23.49 |
| PeMSD8 | MAPE (%) | 18.77 | 11.59 | 11.23 | 12.58 | 9.51 | 10.88 | 9.50 | 9.53 | 9.30 |

*Bold is for the best result, and underline is for the second best.
Among the models with an attention mechanism, ASTGCN employs an adaptive attention mechanism, which dynamically adjusts attention weights according to contextual information. However, some important information is easily ignored because the focus is only on one subspace. ASTGNN, PDFormer, and the proposed MSPSTT model all use a multi-class attention mechanism. ASTGNN introduces a local time-trend-aware self-attention mechanism in the construction of the adjacency matrix, which models the graph adjacency relations in a more detailed and dynamic way but ignores the semantic spatial dependency. In contrast, PDFormer utilizes dedicated multi-head self-attention modules for the temporal, geographic spatial, and semantic spatial levels, respectively. Moreover, it introduces delay-aware feature transformation in the spatial module to further incorporate the time-delayed characteristics in the spatial information propagation process. As a result, PDFormer can capture complex spatiotemporal dependency over long distances more effectively than ASTGNN and performs the best among all baseline models.
Unlike the above methods, the proposed MSPSTT model relies on an encoder-decoder structure, in which the encoder and the decoder each have their own spatiotemporal attention modules. The spatial module considers not only the geographic spatial dependency but also the semantic spatial dependency, which enhances the model’s ability to represent the spatial dependency of distance and proximity. At the same time, the historical representative traffic patterns are integrated into the geographic spatial self-attention module to fully consider the influence of historical traffic patterns. In addition, a persistent perceptual temporal attention module is designed in the decoder. This module receives contextual information from the encoder and the temporal information extracted by the temporal self-attention module, and extracts the deep dynamic temporal association based on the information of the input steps and the predicted steps, so that the model can make accurate and stable long-term predictions. Therefore, the method proposed in this paper performs well on graph data. The variant model MSPSTT with no decoder data embedding (MSPSTT_nde), which removes the spatial embedding and temporal periodic embedding from the decoder data embedding layer of MSPSTT, performs the best, as analyzed in Section 5.5.2.
Table 3 shows the comparison results between MSPSTT and the best-performing baseline, PDFormer, which is also applicable to grid datasets. The results show that the best performance on each grid dataset is achieved by MSPSTT (NYCTaxi) or its variant MSPSTT_nde (CHIBike). This demonstrates that MSPSTT and its variant capture complex spatiotemporal dependency effectively and generalize well across different dataset types. We can also conclude that choosing an appropriate spatiotemporal embedding strategy for each dataset is key to improving prediction performance.

Table 3. Experimental results on the grid datasets.
| Dataset | Index | PDFormer | MSPSTT | MSPSTT_nde |
| --- | --- | --- | --- | --- |
| NYCTaxi | MAE | 13.47 | 12.99 | 13.26 |
| NYCTaxi | MAPE (%) | 26.03 | 23.57 | 23.90 |
| NYCTaxi | RMSE | 25.33 | 24.55 | 25.44 |
| CHIBike | MAE | 2.45 | 2.42 | 2.33 |
| CHIBike | MAPE (%) | 54.72 | 53.82 | 51.93 |
| CHIBike | RMSE | 4.26 | 4.21 | 4.05 |

*Bold is for the best result, and underline is for the second best.
5.5 Ablation experiments and analysis
5.5.1 Effect of the decoder positional masking matrix
Traditional transformer-based models usually apply a masking matrix when calculating the self-attention weights. In the temporal self-attention module, a positional masking matrix is added to the attention matrix computed from the query and key matrices. Because the masking matrix filters out the data about the predicted future time steps, this information cannot adversely affect the model’s generalization performance. However, in this paper, the information on the predicted future steps is not entirely detrimental to the model’s improvement. Fig. 5 compares the prediction results of MSPSTT and MSPSTT_nde with those of the corresponding models, MSPSTT_pm and MSPSTT_nde_pm, that incorporate the decoder positional masking matrix. From the comparison results on the four publicly available datasets, it can be concluded that MSPSTT and MSPSTT_nde perform better on most of the metrics without the decoder positional masking matrix. To further explore the impact of the positional masking matrix in the persistent perceptual temporal attention module, Fig. 6 shows the training process after adding the masking matrix on each dataset. It can be seen that the error curve changes more smoothly after adding the masking matrix, but the errors are generally larger. This indicates that removing the masking matrix enables the model to access more valuable reference information and to extract dynamic temporal dependency better, and that predicted future step information can enhance prediction performance on most of the public datasets. Therefore, the masking matrix of the temporal self-attention module is removed from the decoder of the proposed MSPSTT model.
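The effect of a positional masking matrix on the attention weights can be illustrated with a small sketch. It assumes the common additive form of the mask, with negative infinity at future positions; the function names and the uniform toy scores are hypothetical and not taken from the MSPSTT code.

```python
import math

def causal_mask(size):
    """Additive mask: 0 where a step may attend, -inf for later (future) positions."""
    return [[0.0 if key <= query else float("-inf") for key in range(size)]
            for query in range(size)]

def masked_softmax(scores, mask=None):
    """Row-wise softmax of attention scores, optionally after adding a mask."""
    weights = []
    for i, row in enumerate(scores):
        if mask is not None:
            row = [s + m for s, m in zip(row, mask[i])]
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform toy scores for three decoder time steps.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
with_mask = masked_softmax(scores, causal_mask(3))
without_mask = masked_softmax(scores)
```

With the mask, the first step can only attend to itself; without it, every step distributes weight over all steps, including later predicted ones, which corresponds to the unmasked configuration adopted in MSPSTT.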

Figure 5. Results of the ablation experiments: (a) MAE on PeMSD4, (b) MAPE on PeMSD4, (c) RMSE on PeMSD4, (d) MAE on PeMSD8, (e) MAPE on PeMSD8, (f) RMSE on PeMSD8, (g) MAE on NYCTaxi, (h) MAPE on NYCTaxi, (i) RMSE on NYCTaxi, (j) MAE on CHIBike, (k) MAPE on CHIBike, and (l) RMSE on CHIBike.
5.5.2 Effect of spatiotemporal data embedding in the decoder side
As mentioned above, the proposed MSPSTT model incorporates positional, spatial, and temporal periodic embedding in both the encoder and the decoder. However, introducing overly complex data embedding may increase the complexity and computational effort of the model, resulting in the degradation of the overall performance. Therefore, to investigate the effect of the spatiotemporal embedding before the decoder, two variant models were designed: one removes the spatiotemporal embedding, and the other removes both the spatiotemporal embedding and the positional masking matrix. More precisely, removing the spatiotemporal embedding means removing the temporal periodic embedding of days and weeks and the spatial graph embedding from the decoder data embedding layer. From the results in Fig. 5, it can be seen that MSPSTT and MSPSTT_nde show different results on different datasets. MSPSTT_nde, with the spatiotemporal embedding removed, performs better on the PeMSD4, PeMSD8, and CHIBike datasets, while MSPSTT, with the spatiotemporal embedding, performs better on NYCTaxi. Combined with the fact that NYCTaxi is a grid dataset with fewer nodes, the analysis suggests that this dataset has a weak spatiotemporal correlation, and adding spatiotemporal embedding can enhance the model’s ability to explore the correlation and improve the prediction accuracy. This shows that the decoder’s spatiotemporal embedding has different effects depending on the dataset. Furthermore, the results in Fig. 6 show that the model with spatiotemporal embedding is more error-prone in the early stage, while the model with the spatiotemporal embedding removed shows more error fluctuations during training; both fit the data quickly as training proceeds. For the NYCTaxi and PeMSD8 datasets, the removal of the spatiotemporal embedding leads to poor performance as the model cannot further extract the potential spatiotemporal correlation between the nodes.
For the other two datasets, the spatiotemporal data embedding provided by the encoder can already sufficiently extract and characterize the complex spatiotemporal correlation between the nodes. At this time, adding repeated spatiotemporal embedding in the decoder may cause the model to overfit and thus increase the error.

Figure 6. Loss curves of the ablation experiments on (a) PeMSD4, (b) PeMSD8, (c) NYCTaxi, and (d) CHIBike.
In addition, Fig. 6 shows the training process of PDFormer, in which the loss curve mainly follows a stepwise decrease and has the slowest convergence rate. In some datasets, such as NYCTaxi, there is a more obvious loss oscillation phenomenon. This is mainly attributed to the encoder-output structure of PDFormer, where the absence of a decoder prevents the model from further exploiting the complex temporal dependency and contextual information contained in the target sequence. This hinders the model’s ability to capture dynamic and long-term dependency. In contrast, the different variants of the proposed MSPSTT model converge quickly on each dataset and fit the true values better, demonstrating the validity of the model’s structure.
5.6 Time complexity analysis
Table 4 shows the computational efficiency of MSPSTT, MSPSTT_nde, and PDFormer, whose performance is second only to that of our models. Both MSPSTT and MSPSTT_nde require more training time than PDFormer on the four datasets. Owing to the introduction of the decoder, our models are able to make full use of the influence of the predicted time steps on the subsequent time steps. However, precisely because the decoder produces the result of only one step at a time, our models need to make multiple predictions to obtain the results for all time steps, which leads to a large increase in the training time. With the reduction of data size, the training time is also greatly reduced. Considering the performance improvement of the proposed MSPSTT method on the four datasets, it is promising for applications, and future work will focus on optimizing the model so that the training speed can be greatly improved.

Table 4. Training time of three methods (s/epoch).
| Dataset | PDFormer | MSPSTT | MSPSTT_nde |
| --- | --- | --- | --- |
| PeMSD4 | 313.83 | 2754.37 | 2535.21 |
| PeMSD8 | 256.36 | 2272.16 | 2144.90 |
| NYCTaxi | 81.70 | 379.37 | 1770.47 |
| CHIBike | 35.87 | 171.80 | 165.75 |
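To quantify the overhead reported in Table 4, the per-epoch training time of MSPSTT can be divided by that of PDFormer. The short sketch below does this arithmetic on the values copied from the table; the dictionary layout and the `slowdown` helper are illustrative choices.

```python
# Seconds per epoch copied from Table 4.
training_time = {
    "PeMSD4": {"PDFormer": 313.83, "MSPSTT": 2754.37, "MSPSTT_nde": 2535.21},
    "PeMSD8": {"PDFormer": 256.36, "MSPSTT": 2272.16, "MSPSTT_nde": 2144.90},
    "NYCTaxi": {"PDFormer": 81.70, "MSPSTT": 379.37, "MSPSTT_nde": 1770.47},
    "CHIBike": {"PDFormer": 35.87, "MSPSTT": 171.80, "MSPSTT_nde": 165.75},
}

def slowdown(dataset, model):
    """Per-epoch training time of `model` relative to PDFormer on `dataset`."""
    times = training_time[dataset]
    return times[model] / times["PDFormer"]

ratios = {name: round(slowdown(name, "MSPSTT"), 2) for name in training_time}
```

MSPSTT takes nearly nine times as long per epoch as PDFormer on the two graph datasets, which use a 12-step horizon, and under five times as long on the two grid datasets, which use a 6-step horizon. This pattern is consistent with the cost of running the decoder once per predicted step, although other factors such as dataset size also differ.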
6 Conclusion
This paper proposes an effective model named MSPSTT, which is based on the transformer architecture and leverages an attention mechanism to capture complex spatiotemporal dependency for traffic flow prediction. The model comprehensively embeds the input data and applies multiple self-attention modules to capture the temporal, periodic, geographic spatial, and semantic spatial dependency, where the latter two are based on distance and semantics, respectively. Moreover, a decoder module is integrated to capture further temporal dependency based on the predicted time steps, thus improving long-term prediction accuracy.
Experiments on four publicly available traffic datasets have shown that the proposed model can handle both graph and grid datasets and outperforms state-of-the-art models on all datasets. This demonstrates the superiority of the model in capturing dynamic multi-scale spatiotemporal correlations. The contribution of each module to the model’s prediction performance and the effectiveness of the model’s structure were also evaluated by conducting ablation experiments and analyzing the experimental results. In future work, MSPSTT will be applied to other datasets that include more information, such as weather and holidays. With this information added to the data embedding layer, we hope to further improve the performance of the model. In addition to spatiotemporal dependency, more implicit dependency needs to be explored to improve model accuracy.
Disclosures
The authors declare no conflicts of interest.