Multi-scale feature enhanced Transformer network for efficient semantic segmentation

Yan Zhang; Chunming Ma; Shudong Liu; Yemei Sun

doi:10.12086/oee.2024.240237

[1] Goodfellow I, Pouget-Abadie J, Mirza M et al. Generative adversarial networks[J]. Commun ACM, 63, 139-144(2020).

[2] Kingma D P, Welling M. Auto-encoding variational Bayes[Z](2013). https://arxiv.org/abs/1312.6114

[3] Jiang W T, Dong R, Zhang S C. Global pooling residual classification network guided by local attention[J]. Opto-Electron Eng, 51, 240126(2024).

[4] He F T, Wu Q Q, Yang Y et al. Research on laser speckle image recognition technology based on transfer learning[J]. Laser Technol, 48, 443-448(2024).

[5] Zhang C, Huang Y P, Guo Z Y et al. Real-time lane detection method based on semantic segmentation[J]. Opto-Electron Eng, 49, 210378(2022).

[6] Chen L C, Papandreou G, Kokkinos I et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Trans Pattern Anal Mach Intell, 40, 834-848(2018).

[7] Lin T Y, Maire M, Belongie S et al. Microsoft COCO: common objects in context[C], 740-755(2014). https://doi.org/10.1007/978-3-319-10602-1_48

[8] He K M, Gkioxari G, Dollár P et al. Mask R-CNN[C], 2980-2988(2017). https://doi.org/10.1109/ICCV.2017.322

[9] Zhao H S, Shi J P, Qi X J et al. Pyramid scene parsing network[C], 6230-6239(2017). https://doi.org/10.1109/CVPR.2017.660

[10] Chen L C, Papandreou G, Kokkinos I et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs[Z](2014). https://arxiv.org/abs/1412.7062

[11] He K M, Zhang X Y, Ren S Q et al. Deep residual learning for image recognition[C], 770-778(2016). https://doi.org/10.1109/CVPR.2016.90

[12] Vaswani A, Shazeer N, Parmar N et al. Attention is all you need[C], 6000-6010(2017).

[13] Dosovitskiy A, Beyer L, Kolesnikov A et al. An image is worth 16x16 words: transformers for image recognition at scale[C](2021).

[14] Liu Z, Lin Y T, Cao Y et al. Swin transformer: hierarchical vision transformer using shifted windows[C](2021).

[15] Strudel R, Garcia R, Laptev I et al. Segmenter: transformer for semantic segmentation[C], 7242-7252(2021). https://doi.org/10.1109/ICCV48922.2021.00717

[16] Zheng S X, Lu J C, Zhao H S et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers[C], 6877-6886(2021). https://doi.org/10.1109/CVPR46437.2021.00681

[17] Wang W H, Xie E Z, Li X et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C], 548-558(2021). https://doi.org/10.1109/ICCV48922.2021.00061

[18] Lin T Y, Dollár P, Girshick R et al. Feature pyramid networks for object detection[C], 936-944(2017). https://doi.org/10.1109/CVPR.2017.106

[19] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C], 3431-3440(2015).

[20] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C], 7132-7141(2018). https://doi.org/10.1109/CVPR.2018.00745

[21] Wang Q L, Wu B G, Zhu P F et al. ECA-Net: efficient channel attention for deep convolutional neural networks[C], 11531-11539(2020). https://doi.org/10.1109/CVPR42600.2020.01155

[22] Wang X L, Girshick R, Gupta A et al. Non-local neural networks[C], 7794-7803(2018). https://doi.org/10.1109/CVPR.2018.00813

[23] Fu J, Liu J, Tian H J et al. Dual attention network for scene segmentation[C], 3141-3149(2019). https://doi.org/10.1109/CVPR.2019.00326

[24] Zhao H S, Qi X J, Shen X Y et al. ICNet for real-time semantic segmentation on high-resolution images[C], 418-434(2018). https://doi.org/10.1007/978-3-030-01219-9_25

[25] Mehta S, Rastegari M, Caspi A et al. ESPNet: efficient spatial pyramid of dilated convolutions for semantic segmentation[C], 561-580(2018). https://doi.org/10.1007/978-3-030-01249-6_34

[26] Ranftl R, Bochkovskiy A, Koltun V. Vision transformers for dense prediction[C], 12159-12168(2021). https://doi.org/10.1109/ICCV48922.2021.01196

[27] Liu X Y, Peng H W, Zheng N X et al. EfficientViT: memory efficient vision transformer with cascaded group attention[C], 14420-14430(2023). https://doi.org/10.1109/CVPR52729.2023.01386

[28] Cheng B W, Misra I, Schwing A G et al. Masked-attention mask transformer for universal image segmentation[C], 1280-1289(2022). https://doi.org/10.1109/CVPR52688.2022.00135

[29] Xie E Z, Wang W H, Yu Z D et al. SegFormer: simple and efficient design for semantic segmentation with transformers[C], 924(2021).

[30] Wan Q, Huang Z L, Lu J C et al. SeaFormer: squeeze-enhanced axial transformer for mobile semantic segmentation[C](2023).

[31] Zhang Q L, Yang Y B. ResT: an efficient transformer for visual recognition[C], 1185(2021).

[32] Wu H P, Xiao B, Codella N et al. CvT: introducing convolutions to vision transformers[C], 22-31(2021). https://doi.org/10.1109/ICCV48922.2021.00009

[33] Mehta S, Rastegari M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer[Z](2021). https://arxiv.org/abs/2110.02178

[34] Chen Y P, Dai X Y, Chen D D et al. Mobile-Former: bridging MobileNet and transformer[C], 5260-5269(2022). https://doi.org/10.1109/CVPR52688.2022.00520

[35] Wang W H, Xie E Z, Li X et al. PVT v2: improved baselines with pyramid vision transformer[J]. Comp Visual Media, 8, 415-424(2022).

[36] Yuan L, Chen Y P, Wang T et al. Tokens-to-token ViT: training vision transformers from scratch on ImageNet[C], 558-567(2021).

[37] Zhou B L, Zhao H, Puig X et al. Scene parsing through ADE20K dataset[C], 5122-5130(2017). https://doi.org/10.1109/CVPR.2017.544

[38] Cordts M, Omran M, Ramos S et al. The Cityscapes dataset for semantic urban scene understanding[C], 3213-3223(2016). https://doi.org/10.1109/CVPR.2016.350

[39] Caesar H, Uijlings J, Ferrari V. COCO-Stuff: thing and stuff classes in context[C], 1209-1218(2018). https://doi.org/10.1109/CVPR.2018.00132

[40] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark[EB/OL]. https://github.com/open-mmlab/mmsegmentation

[41] Loshchilov I, Hutter F. Decoupled weight decay regularization[C](2019).

[42] Yu W H, Luo M, Zhou P et al. MetaFormer is actually what you need for vision[C], 10809-10819(2022). https://doi.org/10.1109/CVPR52688.2022.01055

[43] Zhang X, Zhang Y. Conv-PVT: a fusion architecture of convolution and pyramid vision transformer[J]. Int J Mach Learn Cyber, 14, 2127-2136(2023).

[44] Pan J T, Bulat A, Tan F W et al. EdgeViTs: competing light-weight CNNs on mobile devices with vision transformers[C], 294-311(2022). https://doi.org/10.1007/978-3-031-20083-0_18

[45] Chu X X, Tian Z, Wang Y Q et al. Twins: revisiting the design of spatial attention in vision transformers[C], 716(2021).

[46] El-Nouby A, Touvron H, Caron M et al. XCiT: cross-covariance image transformers[C], 1531(2021).

[47] Wei C, Wei Y. TBFormer: three-branch efficient transformer for semantic segmentation[J]. Signal Image Video Process, 18, 3661-3672(2024).

[48] Xu Z Z, Wu D Y, Yu C Q et al. SCTNet: single-branch CNN with transformer semantic information for real-time segmentation[C], 6378-6386(2024). https://doi.org/10.1609/aaai.v38i6.28457

[49] Oršić M, Krešo I, Bevandić P et al. In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images[C], 12599-12608(2019). https://doi.org/10.1109/CVPR.2019.01289

[50] Zhang H, Dana K, Shi J P et al. Context encoding for semantic segmentation[C], 7151-7160(2018). https://doi.org/10.1109/CVPR.2018.00747

[51] Zhou Q, Sun Z H, Wang L J et al. Mixture lightweight transformer for scene understanding[J]. Comput Electr Eng, 108, 108698(2023).

[52] Wang J, Gou C H, Wu Q M et al. RTFormer: efficient design for real-time semantic segmentation with transformer[C], 539(2022).

[53] Li X T, You A S, Zhu Z et al. Semantic flow for fast and accurate scene parsing[C]. European Conference on Computer Vision, 775-793(2020). https://doi.org/10.1007/978-3-030-58452-8_45

[54] Pan H H, Hong Y D, Sun W C et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes[J]. IEEE Trans Intell Transp Syst, 24, 3448-3460(2023).

[55] Xu J C, Xiong Z X, Bhattacharyya S P. PIDNet: a real-time semantic segmentation network inspired by PID controllers[C], 19529-19539(2023). https://doi.org/10.1109/CVPR52729.2023.01871

[56] Howard A, Sandler M, Chen B et al. Searching for MobileNetV3[C], 1314-1324(2019). https://doi.org/10.1109/ICCV.2019.00140

[57] Cheng B W, Schwing A G, Kirillov A. Per-pixel classification is not all you need for semantic segmentation[C], 1367(2021).