2SWUNet: small window SWinUNet based on transformer for building extraction from high-resolution remote sensing images

Jiamin YU; Sixian CHAN; Yanjing LEI; Wei WU; Yuan WANG; Xiaolong ZHOU

doi:10.1007/s11801-024-3179-1

Optoelectronics Letters, Volume 20, Issue 10, 599 (2024)

Jiamin YU¹, Sixian CHAN¹,²,³,*, Yanjing LEI¹, Wei WU¹,³, Yuan WANG¹, and Xiaolong ZHOU⁴

Author Affiliations

¹College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

²Key Laboratory of Meteorological Disaster (KLME), Ministry of Education & Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters (CIC-FEMD), Nanjing University of Information Science & Technology, Nanjing 210044, China

³College of Geographic Information Modern Industry, Zhejiang University of Technology, Hangzhou 310023, China

⁴College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China

Models designed to capture long-range dependencies often exhibit degraded performance when transferred to remote sensing images. The vision transformer (ViT) is a new paradigm in computer vision that uses multi-head self-attention (MSA) rather than convolution as its main computational module, which gives it global modeling capability. However, its performance on small datasets is usually far inferior to that of convolutional neural networks (CNNs). In this work, we propose a small window SWinUNet (2SWUNet) for building extraction from high-resolution remote sensing images. Firstly, the 2SWUNet is built on the Swin Transformer by designing a fully symmetric encoder-decoder U-shaped architecture. Secondly, to construct a reasonable U-shaped architecture for building extraction from high-resolution remote sensing images, different forms of patch expansion are explored to simulate up-sampling operations and recover the feature map resolution. Then, a small window-based multi-head self-attention (W-MSA) is designed to reduce the computational and memory burden, and it is better suited to the characteristics of remote sensing images. Meanwhile, a pre-training mechanism is proposed to compensate for the lack of decoder parameters. Finally, comparison experiments with other mainstream CNNs and ViTs validate the superiority of the proposed model.
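As a rough illustration of why a smaller attention window reduces the computational burden, the sketch below evaluates the per-layer cost estimates given in the original Swin Transformer paper for global MSA and for W-MSA. The token-map size (56 x 56), channel width (96), and the window sizes compared are illustrative assumptions, not values taken from this article, and the function names are hypothetical.

```python
def msa_flops(h, w, c):
    """Estimated cost of global MSA on an h x w token map with c channels.

    Uses the Swin Transformer paper's estimate: 4*h*w*c^2 + 2*(h*w)^2*c.
    The second term grows quadratically with the number of tokens.
    """
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def wmsa_flops(h, w, c, m):
    """Estimated cost of W-MSA with non-overlapping m x m windows.

    Uses the Swin Transformer paper's estimate: 4*h*w*c^2 + 2*m^2*h*w*c.
    For a fixed window size m, the cost is linear in the number of tokens.
    """
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


if __name__ == "__main__":
    # Assumed example setting (not from the article): 56 x 56 tokens, 96 channels.
    h = w = 56
    c = 96
    print(f"global MSA : {msa_flops(h, w, c):,}")
    for m in (14, 7, 4):  # hypothetical window sizes, largest to smallest
        print(f"W-MSA, M={m:<2}: {wmsa_flops(h, w, c, m):,}")
```

Only the attention term depends on the window size, so shrinking the window lowers that term while the projection cost (the `4*h*w*c^2` part) stays fixed. When the window covers the whole token map, the two estimates coincide.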

YU Jiamin, CHAN Sixian, LEI Yanjing, WU Wei, WANG Yuan, ZHOU Xiaolong. 2SWUNet: small window SWinUNet based on transformer for building extraction from high-resolution remote sensing images[J]. Optoelectronics Letters, 2024, 20(10): 599

Paper Information

Received: Aug. 29, 2023

Accepted: Apr. 3, 2024

Published Online: Sep. 20, 2024

The Author Email: Sixian CHAN (sxchan@zjut.edu.cn)
