Chinese Journal of Liquid Crystals and Displays, Vol. 39, Issue 9, 1223 (2024)

Hourglass attention and progressive hybrid Transformer for image classification

Yanfei PENG, Yun CUI*, Kun CHEN, and Yongxin LI
Author Affiliations
  • School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, China
    Figures & Tables (9)
    Fig. 1. Overall architecture of HAPHFormer. (a) ConvStem module; (b) Embed module; (c) PatchEmbed module.
    Fig. 2. General architecture of Transformer (a) and structure of P-LocalMLP module (b).
    Fig. 3. Down-top sample hourglass self-attention module.
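
    The figure list above names the model's building blocks, but the page does not reproduce their equations. For orientation only, below is a minimal PyTorch sketch of one plausible reading of the "down-top sample hourglass self-attention" (DTSA) in Fig. 3: keys and values are spatially pooled (the narrow waist of the hourglass) while queries keep full resolution, so the output stays at the input size. The class name, pooling ratio, and head count are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn


class HourglassAttention(nn.Module):
    """Sketch of a pooled-key/value self-attention: queries keep full
    resolution, keys and values are spatially downsampled.
    Illustrative only; not the paper's DTSA."""

    def __init__(self, dim, num_heads=4, pool_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.pool = nn.AvgPool2d(pool_ratio, pool_ratio)  # the hourglass "waist"
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Down path: pool the token map before forming keys and values.
        x_small = self.pool(x.transpose(1, 2).reshape(B, C, H, W))
        x_small = x_small.flatten(2).transpose(1, 2)              # (B, N', C)
        kv = self.kv(x_small).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                          # (B, h, N', d) each
        attn = (q @ k.transpose(-2, -1)) * self.scale             # (B, h, N, N')
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)                                     # full-resolution output


tokens = torch.randn(2, 14 * 14, 64)
print(HourglassAttention(64)(tokens, 14, 14).shape)  # torch.Size([2, 196, 64])
```
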
    • Table 1. Ablation experiment on CIFAR100 dataset

      Model                      Top-1 accuracy/%
      ViT                        73.81
      ViT+τ                      74.35
      ViT+M                      74.34
      ViT+τ+M                    74.87
      ViT+τ+M+Stem               78.24
      ViT+τ+M+Stem+DTSA          83.45
      ViT+τ+M+Stem+DTSA+PLM      83.76
    • Table 2. Top-1 accuracy comparison of downsampling module on CIFAR100 dataset

      Downsampling module    Top-1 accuracy/%
      Patch Emb              81.71
      Patch Stem             78.45
      ConvStem               83.76
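
    Table 2 shows the ConvStem downsampling module outperforming both patch-embedding variants on CIFAR100. As a point of reference only, the sketch below shows a generic convolutional stem of the kind such ablations compare against single-shot patch projection; the depth and channel widths are assumed, not taken from the paper.

```python
import torch
import torch.nn as nn

# Generic convolutional stem: a stack of stride-2 3x3 convolutions replaces
# the single large-stride patch projection of a vanilla ViT. Channel widths
# and depth here are placeholders, not the paper's ConvStem configuration.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(32),
    nn.GELU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(64),
    nn.GELU(),
)

img = torch.randn(1, 3, 224, 224)
print(conv_stem(img).shape)  # torch.Size([1, 64, 56, 56]), a 4x-downsampled map
```
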
    • Table 3. Top-1 accuracy comparison of P-LocalMLP module (%)

      Model                 T-ImageNet    CIFAR10
      BN+FN                 69.39         94.64
      DW+LN                 71.51         96.97
      DW+LN+skip connect    72.09         97.32
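
    Table 3's progression (BN+FN → DW+LN → DW+LN+skip connect) suggests the P-LocalMLP augments the feed-forward block with a depth-wise convolution, LayerNorm, and a skip connection around the local branch. The following PyTorch sketch assembles those three ingredients in one plausible order; it illustrates the ablated components, not the paper's code, and the expansion ratio is assumed.

```python
import torch
import torch.nn as nn


class PLocalMLP(nn.Module):
    """Feed-forward block built from the Table 3 ingredients: depth-wise
    conv (DW) for local mixing, LayerNorm (LN), and a skip connection
    around the local branch. A sketch, not the paper's P-LocalMLP."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion                      # expansion ratio is assumed
        self.fc1 = nn.Linear(dim, hidden)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.norm = nn.LayerNorm(hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, _ = x.shape
        h = self.fc1(x)
        # Local depth-wise branch plus skip (the "+skip connect" row of Table 3).
        local = self.dw(h.transpose(1, 2).reshape(B, -1, H, W))
        h = h + local.flatten(2).transpose(1, 2)
        return self.fc2(self.act(self.norm(h)))


x = torch.randn(2, 7 * 7, 64)
print(PLocalMLP(64)(x, 7, 7).shape)  # torch.Size([2, 49, 64])
```
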
    • Table 4. Top-1 accuracy comparison of attention module on CIFAR100 dataset

      Attention module    M+τ    Top-1 accuracy/%
      DTSA                ×      83.18
      DTSA                √      83.76
      EMSAv2              ×      80.59
      EMSAv2              √      82.29
      Linear SRA          ×      79.58
      Linear SRA          √      82.52
    • Table 5. Top-1 accuracy comparison of overall architecture

      Architecture                                      Top-1 accuracy/%
      Attention, Attention, Attention, Attention        73.69
      Pool, Pool, Pool, Pool                            62.09
      SpatialMLP, SpatialMLP, SpatialMLP, SpatialMLP    61.73
      Pool, Pool, SpatialMLP, SpatialMLP                77.83
      Pool, Pool, Attention, Attention                  83.76
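
    Table 5 supports the "progressive hybrid" design: cheap pooling token mixers in the two early, high-resolution stages and self-attention in the two later stages. The sketch below wires up that Pool, Pool, Attention, Attention schedule in MetaFormer style; the fixed channel width, single block per stage, and absence of between-stage downsampling are simplifications for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn


class PoolMixer(nn.Module):
    """PoolFormer-style token mixer: local average pooling minus identity."""

    def __init__(self):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)

    def forward(self, x):                  # x: (B, C, H, W)
        return self.pool(x) - x


class AttnMixer(nn.Module):
    """Plain multi-head self-attention over the flattened token map."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)   # (B, N, C)
        t, _ = self.attn(t, t, t)
        return t.transpose(1, 2).reshape(B, C, H, W)


# The winning Table 5 layout: Pool, Pool, Attention, Attention.
stages = nn.ModuleList([PoolMixer(), PoolMixer(), AttnMixer(64), AttnMixer(64)])

x = torch.randn(1, 64, 14, 14)
for mixer in stages:
    x = x + mixer(x)                       # residual token mixing per stage
print(x.shape)                             # torch.Size([1, 64, 14, 14])
```
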
    • Table 6. Top-1 accuracy comparison of different models on small-size datasets

      Model           Params/M    FLOPs/G    T-ImageNet/%    CIFAR10/%    CIFAR100/%    SVHN/%
      ConvNeXt-T      29          4.46       67.51           94.77        76.87         96.17
      SwinV2-T        28          4.35       50.41           78.22        68.97         82.19
      PVTv2-B2        26          4.04       63.93           93.58        73.29         97.01
      MViTv2-T        24          3.99       53.94           74.44        73.16         85.99
      ResTv2-T        30          4.10       68.59           95.57        80.57         97.17
      HAPHFormer-T    25          3.41       72.09           97.32        83.76         97.42
      ConvNeXt-S      50          8.69       67.88           95.82        81.33         96.21
      SwinV2-S        50          8.45       58.97           80.06        70.32         85.03
      PVTv2-B3        45          6.92       66.85           94.27        73.44         97.41
      MViTv2-S        35          6.08       56.92           76.76        74.13         86.34
      ResTv2-S        40          5.97       69.70           95.91        80.89         97.23
      HAPHFormer-S    35          5.11       72.29           97.37        83.88         97.77
      ConvNeXt-B      88          15.36      68.69           96.04        83.30         96.33
      SwinV2-B        87          14.99      60.14           80.38        70.51         85.51
      PVTv2-B4        62          10.14      67.96           94.94        77.25         97.65
      MViTv2-B        52          8.88       62.43           77.18        74.51         86.68
      ResTv2-B        55          7.88       70.10           96.21        81.04         97.26
      HAPHFormer-B    49          6.84       72.55           97.38        84.12         97.89
      PVTv2-B5        81          11.75      68.37           95.14        78.42         97.88
      ResTv2-L        86          13.83      72.34           96.37        81.74         97.34
      HAPHFormer-L    72          11.27      72.80           97.46        84.71         98.16
    Citation

    Yanfei PENG, Yun CUI, Kun CHEN, Yongxin LI. Hourglass attention and progressive hybrid Transformer for image classification[J]. Chinese Journal of Liquid Crystals and Displays, 2024, 39(9): 1223

    Paper Information

    Received: Oct. 25, 2023

    Accepted: --

    Published Online: Nov. 13, 2024

    The Author Email: Yun CUI (1727015916@qq.com)

    DOI: 10.37188/CJLCD.2023-0338
