Opto-Electronic Engineering, Volume 51, Issue 12, 240237-1 (2024)

Multi-scale feature enhanced Transformer network for efficient semantic segmentation

Yan Zhang, Chunming Ma, Shudong Liu, and Yemei Sun
Author Affiliations
  • College of Computer and Information Engineering, Tianjin Chengjian University, Tianjin 300380, China
    Figures & Tables (13)
    An efficient Transformer-based semantic segmentation network enhanced by multi-scale features. (a) Multi-scale pooling self-attention module; (b) Cross-spatial feed-forward network module
    Multi-scale pooling operation
    Cross-spatial attention
    Visualization of segmentation results of different algorithms on the ADE20K dataset
    • Table 1. Detailed settings of the MFE-Former. Output size is the resolution of the output at each stage, and operation lists the operations used in the i-th stage. The parameter settings give the channel reduction ratio r and the pooling factors p_l used by each MFE-Transformer block


      Stage | Output size | Operation | Parameter settings
      1 | 128×128×48 | Patch Embed, MFE-Transformer block ×2 | r=1, {p1=8, p2=16, p3=24, p4=32, p5=40}
      2 | 64×64×96 | Patch Embed, MFE-Transformer block ×2 | r=2, {p1=4, p2=8, p3=12, p4=16, p5=20}
      3 | 32×32×260 | Patch Embed, MFE-Transformer block ×6 | r=4, {p1=2, p2=4, p3=6, p4=8, p5=10}
      4 | 16×16×384 | Patch Embed, MFE-Transformer block ×3 | r=8, {p1=1, p2=2, p3=3, p4=4, p5=5}
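The pooling factors in Table 1 parameterize the multi-scale pooling that shortens the key/value sequence seen by the attention module. A minimal NumPy sketch of this idea (the helper name is hypothetical, and for simplicity it assumes average pooling with kernel = stride = p and factors that divide the spatial size):

```python
import numpy as np

def multi_scale_avg_pool_tokens(x, factors):
    """Average-pool x of shape (H, W, C) at each factor p (kernel = stride = p),
    flatten every pooled map into tokens, and concatenate the results."""
    H, W, C = x.shape
    tokens = []
    for p in factors:
        h, w = H // p, W // p
        # reshape into (h, p, w, p, C) blocks and average over each p×p block
        pooled = x[:h * p, :w * p].reshape(h, p, w, p, C).mean(axis=(1, 3))
        tokens.append(pooled.reshape(h * w, C))
    return np.concatenate(tokens, axis=0)

# Stage-3 resolution from Table 1, with a divisible subset of its factors:
x = np.random.rand(32, 32, 64)
kv = multi_scale_avg_pool_tokens(x, [2, 4, 8])   # (16*16 + 8*8 + 4*4, 64) = (336, 64)
```

Attending over the shortened key/value sequence (336 tokens here instead of 32×32 = 1024) is what reduces the attention cost relative to full self-attention.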
    • Table 2. Model evaluation results of different segmentation models on the ADE20K dataset


      Method | #Param/M | FLOPs/G | mIoU/%
      FCN[19] | 9.8 | 39.6 | 19.7
      ResNet18[11] | 15.5 | 32.2 | 32.9
      PSPNet[9] | 13.7 | 52.2 | 29.6
      DeepLabV3+[6] | 15.4 | 25.9 | 38.1
      ViT[13] | 10.2 | 24.6 | 37.4
      PVT-Tiny[17] | 17.0 | 33.2 | 35.7
      PoolFormer-S12[42] | 15.7 | — | 37.2
      Conv-PVT-Tiny[43] | 16.4 | — | 37.2
      EdgeViT-XS[44] | 10.3 | 27.7 | 41.4
      Swin-Tiny[14] | 31.9 | 46.0 | 41.5
      PVT-Large[17] | 65.1 | 79.6 | 42.1
      Segformer-B1[29] | 13.7 | 15.9 | 42.2
      PoolFormer-M48[42] | 77.1 | — | 42.7
      Twins-SVT-S[45] | 28.3 | 37.0 | 43.2
      Xcit-T12/16[46] | 33.7 | — | 43.5
      TBFormer-T[47] | 20.5 | 24.3 | 42.8
      SCTNet-B[48] | 17.4 | — | 43.0
      MFE-Former (Ours) | 15.9 | 31.1 | 44.1
    • Table 3. The model evaluation results on the Cityscapes dataset (the FLOPs test was performed at a resolution of 1024×2048)


      Method | #Param/M | FLOPs/G | mIoU/%
      FCN[19] | 9.8 | 317 | 61.5
      PSPNet[9] | 13.7 | 423 | 70.2
      DeepLabV3+[6] | 15.4 | 555 | 75.2
      SwiftNetRN[49] | 11.8 | 104 | 75.5
      EncNet[50] | 55.1 | 1748 | 76.9
      PVT-Tiny[17] | 17.0 | — | 71.7
      MLT[51] | 20.1 | — | 77.4
      RTFormer-Base[52] | 16.8 | — | 79.3
      SFNet(ResNet-18)[53] | 12.87 | 247.0 | 78.9
      DDRNet-39[54] | 32.3 | 281.2 | 80.4
      PIDNet-L[55] | 36.9 | 275.8 | 80.6
      MFE-Former (Ours) | 15.9 | 238.6 | 80.6
    • Table 4. Comparison with state-of-the-art models on the COCO-stuff dataset (tested at an input resolution of 512×512)


      Method | #Param/M | FLOPs/G | mIoU/%
      PSPNet[9] | 13.7 | 52.9 | 30.1
      DeepLabV3+[6] | 15.4 | 25.9 | 29.9
      LR-ASPP[56] | 2.3 | 7 | 25.2
      MaskFormer[57] | 41 | 53 | 37.1
      TBFormer-T[47] | 20.5 | 37.5 | 37.9
      SCTNet-B[48] | 17.4 | — | 35.9
      MFE-Former (Ours) | 15.9 | 31.1 | 38.0
    • Table 5. The segmentation performance metrics of 3 models on the ADE20K, Cityscapes, and COCO-stuff datasets


      Method | ADE20K mIoU/aAcc/mAcc/mDice (%) | Cityscapes mIoU/aAcc/mAcc/mDice (%) | COCO-stuff mIoU/aAcc/mAcc/mDice (%)
      Segformer-B1 | 42.1 / 79.6 / 52.7 / 56.3 | 78.5 / 95.8 / 83.4 / 85.4 | —
      SCTNet-B | 43.0 / 80.2 / 56.1 / 57.2 | 80.1 / 96.4 / 86.6 / 88.3 | 35.9 / 67.1 / 48.5 / 47.9
      MFE-Former (Ours) | 44.1 / 81.0 / 56.1 / 57.4 | 80.6 / 96.5 / 87.7 / 87.5 | 38.0 / 69.1 / 49.1 / 48.1
    • Table 6. Ablation results for different pooling methods


      Pooling method | FLOPs/G | mIoU/%
      Max pooling | 30.7 | 40.8
      Multi-scale max pooling | 31.1 | 43.1
      Average pooling | 30.7 | 42.5
      Multi-scale average pooling | 31.1 | 44.1
    • Table 7. Ablation results for the multi-scale pooling factor settings on the ADE20K dataset


      Pooling factor p_l | R | FLOPs/G | mIoU/%
      4 | 16 | 31.5 | 42.3
      8 | 64 | 30.9 | 42.5
      16 | 256 | 30.6 | 41.9
      24 | 576 | 30.5 | 41.3
      32 | 1024 | 30.5 | 40.9
      40 | 1600 | 30.5 | 40.4
      {8, 16, 24} | 47 | 31.1 | 43.7
      {8, 16, 24, 32, 40} | 43 | 31.1 | 44.1
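In the single-factor rows of Table 7, R equals p_l², i.e. the ratio between the H×W input tokens and the (H/p_l)(W/p_l) pooled tokens. The multi-scale rows are consistent with reading R as the overall token-reduction ratio HW / Σ_l(HW/p_l²) — an inferred interpretation, not stated in the caption — which can be checked at the 128×128 stage-1 resolution:

```python
def reduction_ratio(factors, hw=128 * 128):
    """Ratio of input tokens to the total number of pooled key/value tokens."""
    pooled = sum(hw / (p * p) for p in factors)
    return hw / pooled

r_single = reduction_ratio([4])                   # 16.0, i.e. R = p^2
r_three = reduction_ratio([8, 16, 24])            # ~47.0
r_five = reduction_ratio([8, 16, 24, 32, 40])     # ~43.7
```

Truncating these values reproduces the tabulated R of 16, 47, and 43.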
    • Table 8. Ablation results for different components of cross-spatial attention on the ADE20K dataset


      Setting | #Param/M | FLOPs/G | mIoU/%
      Without cross-spatial attention | 15.8 | 30.3 | 43.1
      With channel-interaction branch only | 15.8 | 30.3 | 43.3
      With cross-spatial attention | 15.9 | 31.1 | 44.1
    • Table 9. Ablation study on the hyperparameter N in the CSA module on the ADE20K dataset


      N | #Param/M | FLOPs/G | mIoU/%
      4 | 16.3 | 31.8 | 43.9
      8 | 15.9 | 31.1 | 44.1
      16 | 15.8 | 30.8 | 43.3
    Paper Information

    Category: Article

    Received: Oct. 10, 2024

    Accepted: Nov. 19, 2024

    Published Online: Feb. 21, 2025

    DOI:10.12086/oee.2024.240237
