Laser & Optoelectronics Progress, Volume 62, Issue 2, 0200001 (2025)

From U-Net to Transformer: Progress in the Application of Hybrid Models in Medical Image Segmentation

Yixiao Yin*, Jingang Ma, Wenkai Zhang, and Liang Jiang
Author Affiliations
  • School of Medical Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan 250355, Shandong, China
    Figures & Tables (19)
    • Fig. 1. Different types of medical images
    • Fig. 2. U-Net structure diagram[12]
    • Fig. 3. Transformer structure diagram[55]
    • Fig. 4. Conceptual structure diagram of single-scale serial combination
    • Fig. 5. TransUNet structure diagram[62]
    • Fig. 6. Conceptual structure diagram of multi-scale serial combination
    • Fig. 7. CoTr structure diagram[70]
    • Fig. 8. Conceptual structure diagram of alternating serial combination
    • Fig. 9. UTNet structure diagram[79]
    • Fig. 10. Conceptual structure diagram of overall parallel combination
    • Fig. 11. DuPNet structure diagram[84]
    • Fig. 12. Conceptual structure diagram of hierarchical parallel combination
    • Fig. 13. UconvTrans structure diagram[88]
    • Table 1. Comparison of five imaging methods

      | Imaging type | Imaging mechanism | Imaging objective | Advantage | Disadvantage | Common datasets |
      | X-ray | Electromagnetic radiation penetrating body tissue | Bones, lungs, heart | Fast, easy to operate, low cost, good negative contrast | Radiation exposure; limited soft-tissue imaging; 2D images only | COVID-19[33] |
      | CT | Rotating X-ray source producing in vivo cross-sectional images | Any part of the body | High-resolution images without overlap between tissue structures and lesions | Radiation exposure; prone to artifacts; average soft-tissue resolution | LiTS[34] |
      | MRI | Strong magnetic fields and radio waves | Soft tissues, brain, joints | Non-invasive; no radiation risk; high soft-tissue resolution; can image slices directly in any orientation | Time-consuming; costly; confined scanning space; limited by metal implants | BraTS[35] |
      | US | High-frequency sound waves | Soft tissues, blood flow, organs | Non-invasive; no radiation risk; real-time imaging | Poor imaging of gas and bony structures; highly operator-dependent | EchoNet-Dynamic[36] |
      | Digital pathology | High-resolution scanning under an optical microscope | Tissues, cells | High-resolution images; digital storage | Requires prior tissue sampling, which is costly and time-consuming | SEED[37] |
    • Table 2. Segmentation performance evaluation indexes and mathematical description

      | Evaluation index | Mathematical description |
      | Dice | $\mathrm{Dice}=\dfrac{2|A\cap B|}{|A|+|B|}$ |
      | IoU | $\mathrm{IoU}=\dfrac{|A\cap B|}{|A\cup B|}$ |
      | ACC | $\mathrm{ACC}=\dfrac{R_{\mathrm{TP}}+R_{\mathrm{TN}}}{R_{\mathrm{TP}}+R_{\mathrm{TN}}+R_{\mathrm{FP}}+R_{\mathrm{FN}}}$ |
      | SE | $\mathrm{SE}=\dfrac{R_{\mathrm{TP}}}{R_{\mathrm{TP}}+R_{\mathrm{FN}}}$ |

      Here $A$ is the predicted segmentation region, $B$ the ground-truth region, and $R_{\mathrm{TP}}$, $R_{\mathrm{TN}}$, $R_{\mathrm{FP}}$, $R_{\mathrm{FN}}$ the numbers of true-positive, true-negative, false-positive, and false-negative pixels.
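      The indexes in Table 2 are straightforward to compute from binary masks. A minimal NumPy sketch (the function names are ours for illustration, not from any cited model):

      ```python
      import numpy as np

      def dice(pred, gt):
          # Dice = 2|A ∩ B| / (|A| + |B|)
          inter = np.logical_and(pred, gt).sum()
          return 2 * inter / (pred.sum() + gt.sum())

      def iou(pred, gt):
          # IoU = |A ∩ B| / |A ∪ B|
          inter = np.logical_and(pred, gt).sum()
          union = np.logical_or(pred, gt).sum()
          return inter / union

      def accuracy(pred, gt):
          # ACC = (R_TP + R_TN) / (R_TP + R_TN + R_FP + R_FN),
          # i.e. the fraction of pixels classified correctly
          return (pred == gt).mean()

      def sensitivity(pred, gt):
          # SE = R_TP / (R_TP + R_FN): fraction of ground-truth foreground recovered
          tp = np.logical_and(pred, gt).sum()
          fn = np.logical_and(~pred, gt).sum()
          return tp / (tp + fn)
      ```

      For instance, a prediction and ground truth of 3 foreground pixels each, overlapping on 2, gives Dice = 2/3 but IoU = 1/2, which is why the two indexes are not interchangeable when comparing the models in Tables 5 and 6.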
    • Table 3. Summary of different combination strategies

      | Combination strategy | Typology | Main idea | Advantage | Disadvantage | Model complexity |
      | Serial | Single-scale serial | CNN and Transformer are combined serially at a single scale | Easy to implement and understand | Information is lost between the CNN and the Transformer | Low |
      | Serial | Multi-scale serial | The CNN is combined serially with Transformers at different scales | Multi-scale information becomes available | Managing multiple scales adds complexity, and detail may be lost when resolution is reduced | Middle |
      | Serial | Alternating serial | CNN and Transformer blocks alternate throughout the network | Balances the capabilities of both networks and improves information flow at each stage | Interaction layers must be managed, which adds complexity and can cause information bottlenecks or redundancy if not carefully designed | High |
      | Parallel | Global parallelism | CNN and Transformer process the image in parallel, and the features are fused at the end | Exploits both networks fully; spatial and sequence information are captured efficiently | An effective feature-fusion mechanism must be designed, increasing computation | High |
      | Parallel | Hierarchical parallelism | CNN and Transformer run in parallel at multiple levels of the model, with feature fusion at each level | Both networks learn features fully at each level, enhancing generalization | An effective feature-fusion mechanism must be designed; the structure is complex and hard to tune | High |
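      The serial and parallel strategies summarized in Table 3 can be sketched in framework-free NumPy, with a small convolution standing in for the CNN branch and single-head self-attention for the Transformer branch. This is a conceptual toy under our own naming, not any cited model: all weights are fixed rather than learned.

      ```python
      import numpy as np

      def conv2d_same(x, kernel):
          """CNN branch: 'same'-padded 2D convolution capturing local structure."""
          kh, kw = kernel.shape
          xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
          out = np.empty_like(x, dtype=float)
          for i in range(x.shape[0]):
              for j in range(x.shape[1]):
                  out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
          return out

      def self_attention(tokens):
          """Transformer branch: single-head self-attention (identity Q/K/V projections)."""
          d = tokens.shape[1]
          scores = tokens @ tokens.T / np.sqrt(d)
          w = np.exp(scores - scores.max(axis=1, keepdims=True))
          w /= w.sum(axis=1, keepdims=True)   # softmax over keys
          return w @ tokens                   # every token mixes global context

      def serial_hybrid(img, kernel):
          """Single-scale serial: CNN first, then attention over the CNN feature map."""
          feat = conv2d_same(img, kernel)
          tokens = feat.reshape(-1, 1)        # one token per pixel
          return self_attention(tokens).reshape(img.shape)

      def parallel_hybrid(img, kernel):
          """Global parallel: both branches see the image; features fused by addition."""
          local = conv2d_same(img, kernel)
          global_ = self_attention(img.reshape(-1, 1)).reshape(img.shape)
          return local + global_              # stand-in for a learned fusion module
      ```

      Real hybrids replace the fixed kernel and identity projections with learned layers, and use dedicated fusion modules (e.g., the BiFusion module of TransFuse[82]) rather than plain addition.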
    • Table 4. Ideas, advantages, and disadvantages of selected improved models

      | Method | Typology | Model | Improvement idea | Advantage | Disadvantage |
      | U-Net | | UNet++[15] | A dense skip-connection strategy with a deep-supervision mechanism encourages the network to better learn and use hierarchical features | Deeper, multi-scale feature extraction and fusion improve segmentation performance | Large number of parameters and high computational complexity |
      | U-Net | | Attention U-Net[46] | An attention-gate mechanism removes noise and irrelevant information from skip connections and extracts key features | Effectively captures important local and global information, improving segmentation accuracy and performance | Segmentation performance improves only modestly |
      | Transformer | | Swin-Unet[58] | Replaces U-Net's 2D convolutional blocks with Swin Transformer modules | Multi-scale feature fusion; parameter count and computational efficiency are optimized | Lacks extraction of local feature information |
      | Transformer | | C2Former[59] | Designs a cross-convolutional self-attention mechanism to improve semantic feature understanding | Integrates multi-scale feature information and filters out interference | Loses feature information |
      | U-Net+Transformer | Single-scale serial | TransUNet[62] | A CNN extracts local features, and a Transformer module chained after the CNN extracts global contextual information | Perceives global information, processes sequential data directly, and adapts to multi-scale tasks | High computational complexity |
      | U-Net+Transformer | Single-scale serial | MultiIB-TransUNet[67] | A single Transformer layer extracts global features to reduce the parameter count, and multiple IB blocks compress noise to improve robustness | Fewer model parameters | Compressing the relevant features slightly reduces segmentation accuracy |
      | U-Net+Transformer | Multi-scale serial | CoTr[70] | The Transformer in the encoder receives CNN feature maps at multiple scales and uses a deformable self-attention mechanism that attends to key sampling points | Reduced computational and spatial complexity; multi-scale processing of 3D feature maps | Insufficient generalizability |
      | U-Net+Transformer | Multi-scale serial | HTUNet[73] | A 2D U-Net with an MSCAT module in the skip connections extracts intra-frame features, while a 3D Transformer U-Net extracts inter-frame features | Intra- and inter-frame feature fusion improves segmentation performance | Cannot delineate the nodule region from the gland; computationally expensive |
      | U-Net+Transformer | Alternating serial | HCTNet[76] | TEBlocks are designed to learn global context and are combined with CNN blocks to extract features | Combines the CNN inductive bias for spatial modeling with the Transformer's long-range dependency modeling | Boundary details are segmented poorly |
      | U-Net+Transformer | Alternating serial | UTNet[79] | Replaces the last convolution at each resolution of U-Net with a Transformer module | Combines convolution and self-attention; easy to understand and use | Poor segmentation performance on large-scale training tasks |
      | U-Net+Transformer | Global parallelism | TransFuse[82] | CNN and Transformer extract features in parallel, and a BiFusion module fuses the features of both branches | Captures global information while remaining sensitive to low-level context; strong fused representation | Transformer layers are less efficient; insufficient generalization |
      | U-Net+Transformer | Global parallelism | HSNet[86] | Encoder and decoder are connected through an interactive attention mechanism, and the decoder has a two-branch structure | Generates discriminative long-range dependencies, recovers detail features, and generalizes well | High model complexity |
      | U-Net+Transformer | Hierarchical parallelism | UconvTrans[88] | Each level uses a two-branch structure, and a feature-fusion module passes the fused features to the next level | Few parameters and little computation, balancing segmentation accuracy and efficiency | Not applicable to 3D medical image segmentation, where slice information is richer |
      | U-Net+Transformer | Hierarchical parallelism | RMTF-Net[93] | The encoder combines Mix Transformer and RCNN structures, and the decoder's GFI module re-fuses the features extracted by the encoder | Stronger boundary encoding; locally and globally balanced features at each encoder level | Not extended to 3D image segmentation |
    • Table 5. Comparison of segmentation methods based on U-Net and Transformer

      | Method | Model | Segmentation task | Image type | Dice /% | IoU /% | ACC /% |
      | U-Net | UNet++[15] | Nucleus/colon polyp/liver/pulmonary nodule | Microscopy/RGB video/CT/CT | - | 92.52/32.10/82.90/77.21 | - |
      | U-Net | UNet 3+[43] | Liver/spleen | CT/CT | 96.75/96.20 | - | - |
      | U-Net | Attention U-Net[46] | Pancreas | CT | 81.48 | - | - |
      | U-Net | R2U-Net[48] | Retinal vessel/skin lesion/pulmonary nodule | Fundus image/dermoscopy/CT | - | -/94.21/99.18 | 97.12/94.24/99.18 |
      | Transformer | Swin-Unet[58] | Abdominal multi-organ/cardiac | CT/MRI | 79.13/90.00 | - | - |
      | Transformer | C2Former[59] | Abdominal multi-organ/cardiac/skin cancer | CT/MRI/dermoscopy | 83.22/91.42/86.78 | - | - |
      | Transformer | MISSFormer[60] | Abdominal multi-organ/cardiac | CT/MRI | 81.96/90.86 | - | - |
      | Transformer | DS-TransUNet[61] | Polyp/skin lesion/gland/nucleus | Endoscopy/dermoscopy/pathology/microscopy | 93.5/-/87.19/- | 88.9/85.23/78.45/86.12 | - |
    • Table 6. Comparison of segmentation methods based on U-Net+Transformer

      | Combination strategy | Typology | Model | Segmentation task | Image type | Dice /% | IoU /% | ACC /% |
      | Serial | Single-scale serial | TransUNet[62] | Multi-organ/cardiac | CT/MRI | 77.48/89.71 | - | - |
      | | | TransUNet+[63] | Multi-organ/gland/cardiac | CT/pathology/MRI | 81.57/90.42/90.47 | -/82.69/- | - |
      | | | PKRT-Net[64] | Optic cup/optic disc | Fundus image | 91.20/97.66 | - | - |
      | | | TU-Net[65] | Vessel | US | - | 92.00/85.00/67.00 | - |
      | | | GL-Segnet[66] | Rectal adenocarcinoma cell/skin lesion/glioma/thoracic organ | Pathology/dermoscopy/CT/X-ray | 93.10/91.50/93.10/96.90 | 87.30/85.80/87.70/94.20 | - |
      | | | MultiIB-TransUNet[67] | Mammary gland | US/CT | -/81.83 | 67.75/- | - |
      | | | UNETR[68] | Abdominal multi-organ/brain tumor/spleen | CT/MRI/CT | 89.10/71.10/96.40 | - | - |
      | | | UMSTC[69] | Microscopy image | Microscopy | 82.50 | 76.50 | - |
      | | Multi-scale serial | CoTr[70] | Cranial vault multi-organ | CT | 85.00 | - | - |
      | | | Multi-compound Transformer[71] | Cell/colon polyp/skin lesion | Microscopy/endoscopy/dermoscopy | 68.40/92.30/90.35 | - | - |
      | | | TFNet[72] | Mammary gland | US | 87.90 | 78.40 | - |
      | | | HTUNet[73] | Thyroid | US | 98.59 | 97.26 | - |
      | | | TransHRNet[74] | Abdominal multi-organ/brain tumor/spleen | CT/MRI/CT | 86.70/72.50/97.40 | - | - |
      | | | Swin UNETR[75] | Brain tumor | MRI | 91.30 | - | - |
      | | Alternating serial | HCTNet[76] | Mammary gland | US | 97.23 | 94.63 | 97.41 |
      | | | Feature integration network[77] | Abdominal multi-organ/prostate | CT/MRI | 92.45/81.63 | - | - |
      | | | SWTRU[78] | Liver and tumor/skin lesion/glioma | CT/dermoscopy/MRI | 97.20/90.40/89.70 | 94.90/-/- | - |
      | | | UTNet[79] | Cardiac | MRI | 88.30 | - | - |
      | | | nnFormer[80] | Brain tumor/abdominal multi-organ/cardiac | MRI/CT/MRI | 86.40/86.57/92.06 | - | - |
      | | | SwinBTS[81] | Brain tumor | MRI | 81.15 | 81.10 | - |
      | Parallel | Global parallelism | TransFuse[82] | Polyp/skin lesion/hip/prostate | Endoscopy/dermoscopy/X-ray/MRI | 94.20/87.20/-/- | 89.70/-/-/- | -/94.40/-/- |
      | | | TransFusionNet[83] | Liver tumor/vessel | CT/CT | 96.10/90.10 | 92.70/85.40 | - |
      | | | DuPNet[84] | Rectal cancer | MRI | 98.22 | 89.34 | - |
      | | | CTC-Net[85] | Multi-organ/cardiac | CT/MRI | 78.41/90.77 | - | - |
      | | | HSNet[86] | Polyp | Endoscopy | 92.60 | 87.70 | - |
      | | | TDD-UNet[87] | COVID-19 pneumonia | CT/X-ray | 78.94 | - | 96.34 |
      | | Hierarchical parallelism | UconvTrans[88] | Cardiac | MRI | 89.60 | - | - |
      | | | Swin Unet3D[89] | Brain tumor | MRI | 90.50 | - | - |
      | | | P-TransUNet[90] | Polyp/nucleus/gland | Endoscopy/microscopy/pathology | 93.52/93.63/95.93 | 88.93/88.75/91.42 | - |
      | | | PCT[91] | Parotid tumor | US | 91.51 | 84.34 | - |
      | | | TransConver[92] | Brain tumor | MRI | 86.32 | - | - |
      | | | RMTF-Net[93] | Brain tumor | MRI | 93.50 | 88.20 | - |
    Citation: Yixiao Yin, Jingang Ma, Wenkai Zhang, Liang Jiang. From U-Net to Transformer: Progress in the Application of Hybrid Models in Medical Image Segmentation[J]. Laser & Optoelectronics Progress, 2025, 62(2): 0200001
    Paper Information

    Category: Reviews

    Received: Mar. 12, 2024

    Accepted: Jun. 3, 2024

    Published Online: Jan. 9, 2025


    DOI:10.3788/LOP240875

    CSTR:32186.14.LOP240875
