Laser & Optoelectronics Progress, Volume 62, Issue 2, 0200001 (2025)

From U-Net to Transformer: Progress in the Application of Hybrid Models in Medical Image Segmentation

Yixiao Yin*, Jingang Ma, Wenkai Zhang, and Liang Jiang
Author Affiliations
  • School of Medical Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan 250355, Shandong, China
    Figures & Tables (19)
    • Fig. 1. Different types of medical images
    • Fig. 2. U-Net structure diagram[12]
    • Fig. 3. Transformer structure diagram[55]
    • Fig. 4. Conceptual structure diagram of single-scale serial combination
    • Fig. 5. TransUNet structure diagram[62]
    • Fig. 6. Conceptual structure diagram of multi-scale serial combination
    • Fig. 7. CoTr structure diagram[70]
    • Fig. 8. Conceptual structure diagram of alternating serial combination
    • Fig. 9. UTNet structure diagram[79]
    • Fig. 10. Conceptual structure diagram of overall parallel combination
    • Fig. 11. DuPNet structure diagram[84]
    • Fig. 12. Conceptual structure diagram of hierarchical parallel combination
    • Fig. 13. UconvTrans structure diagram[88]
    • Table 1. Comparison of five imaging methods

      | Imaging type | Imaging mechanism | Imaging objective | Advantage | Disadvantage | Common datasets |
      | X-ray | Electromagnetic radiation penetrating body tissue | Bones, lungs, heart | Fast, easy to operate, low cost, good negative contrast | Radiation exposure; limited soft-tissue imaging; 2D images only | COVID-19[33] |
      | CT | Rotating X-ray source producing in vivo cross-sectional images | Any part of the body | High-resolution images without overlap between tissue structures and lesions | Radiation exposure; prone to artifacts; average soft-tissue resolution | LiTS[34] |
      | MRI | Strong magnetic fields and radio waves | Soft tissues, brain, joints | Non-invasive; no radiation risk; high soft-tissue resolution; can image slices directly in any orientation | Time-consuming; costly; confined scanning space; limited by metal implants | BraTS[35] |
      | US | High-frequency sound waves | Soft tissues, blood flow, organs | Non-invasive; no radiation risk; real-time imaging | Poor imaging of gas and bony structures; highly operator-dependent | EchoNet-Dynamic[36] |
      | Digital pathology | High-resolution scanning under an optical microscope | Tissues, cells | High-resolution images; digital storage | Requires prior tissue sampling, which is costly and time-consuming | SEED[37] |
    • Table 2. Segmentation performance evaluation indexes and mathematical description

      | Evaluation index | Mathematical description |
      | Dice | $\mathrm{Dice}=\dfrac{2|A\cap B|}{|A|+|B|}$ |
      | IoU | $\mathrm{IoU}=\dfrac{|A\cap B|}{|A\cup B|}$ |
      | ACC | $\mathrm{ACC}=\dfrac{R_{\mathrm{TP}}+R_{\mathrm{TN}}}{R_{\mathrm{TP}}+R_{\mathrm{TN}}+R_{\mathrm{FP}}+R_{\mathrm{FN}}}$ |
      | SE | $\mathrm{SE}=\dfrac{R_{\mathrm{TP}}}{R_{\mathrm{TP}}+R_{\mathrm{FN}}}$ |

      Here $A$ is the predicted segmentation region, $B$ the ground-truth region, and $R_{\mathrm{TP}}$, $R_{\mathrm{TN}}$, $R_{\mathrm{FP}}$, $R_{\mathrm{FN}}$ the numbers of true-positive, true-negative, false-positive, and false-negative pixels.
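      The indexes in Table 2 are straightforward to compute from binary masks. A minimal NumPy sketch (the function names are ours for illustration, not from any cited model):

      ```python
      import numpy as np

      def dice(pred, gt):
          # Dice = 2|A ∩ B| / (|A| + |B|)
          inter = np.logical_and(pred, gt).sum()
          return 2 * inter / (pred.sum() + gt.sum())

      def iou(pred, gt):
          # IoU = |A ∩ B| / |A ∪ B|
          inter = np.logical_and(pred, gt).sum()
          union = np.logical_or(pred, gt).sum()
          return inter / union

      def accuracy(pred, gt):
          # ACC = (R_TP + R_TN) / (R_TP + R_TN + R_FP + R_FN),
          # i.e. the fraction of pixels classified correctly
          return (pred == gt).mean()

      def sensitivity(pred, gt):
          # SE = R_TP / (R_TP + R_FN): fraction of ground-truth foreground recovered
          tp = np.logical_and(pred, gt).sum()
          fn = np.logical_and(~pred, gt).sum()
          return tp / (tp + fn)
      ```

      For instance, a prediction and ground truth of 3 foreground pixels each, overlapping on 2, gives Dice = 2/3 but IoU = 1/2, which is why the two indexes are not interchangeable when comparing the models in Tables 5 and 6.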
    • Table 3. Summary of different combination strategies

      | Combination strategy | Typology | Main idea | Advantage | Disadvantage | Model complexity |
      | Serial | Single-scale serial | CNN and Transformer are combined serially at a single scale | Easy to implement and understand | Information is lost between the CNN and the Transformer | Low |
      | Serial | Multi-scale serial | The CNN is combined serially with Transformers at different scales | Multi-scale information becomes available | Managing multiple scales adds complexity, and detail may be lost when resolution is reduced | Middle |
      | Serial | Alternating serial | CNN and Transformer blocks alternate throughout the network | Balances the capabilities of both networks and improves information flow at each stage | Interaction layers must be managed, which adds complexity and can cause information bottlenecks or redundancy if not carefully designed | High |
      | Parallel | Global parallelism | CNN and Transformer process the image in parallel, and the features are fused at the end | Exploits both networks fully; spatial and sequence information are captured efficiently | An effective feature-fusion mechanism must be designed, increasing computation | High |
      | Parallel | Hierarchical parallelism | CNN and Transformer run in parallel at multiple levels of the model, with feature fusion at each level | Both networks learn features fully at each level, enhancing generalization | An effective feature-fusion mechanism must be designed; the structure is complex and hard to tune | High |
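      The serial and parallel strategies summarized in Table 3 can be sketched in framework-free NumPy, with a small convolution standing in for the CNN branch and single-head self-attention for the Transformer branch. This is a conceptual toy under our own naming, not any cited model: all weights are fixed rather than learned.

      ```python
      import numpy as np

      def conv2d_same(x, kernel):
          """CNN branch: 'same'-padded 2D convolution capturing local structure."""
          kh, kw = kernel.shape
          xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
          out = np.empty_like(x, dtype=float)
          for i in range(x.shape[0]):
              for j in range(x.shape[1]):
                  out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
          return out

      def self_attention(tokens):
          """Transformer branch: single-head self-attention (identity Q/K/V projections)."""
          d = tokens.shape[1]
          scores = tokens @ tokens.T / np.sqrt(d)
          w = np.exp(scores - scores.max(axis=1, keepdims=True))
          w /= w.sum(axis=1, keepdims=True)   # softmax over keys
          return w @ tokens                   # every token mixes global context

      def serial_hybrid(img, kernel):
          """Single-scale serial: CNN first, then attention over the CNN feature map."""
          feat = conv2d_same(img, kernel)
          tokens = feat.reshape(-1, 1)        # one token per pixel
          return self_attention(tokens).reshape(img.shape)

      def parallel_hybrid(img, kernel):
          """Global parallel: both branches see the image; features fused by addition."""
          local = conv2d_same(img, kernel)
          global_ = self_attention(img.reshape(-1, 1)).reshape(img.shape)
          return local + global_              # stand-in for a learned fusion module
      ```

      Real hybrids replace the fixed kernel and identity projections with learned layers, and use dedicated fusion modules (e.g., the BiFusion module of TransFuse[82]) rather than plain addition.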
    • Table 4. Ideas, advantages, and disadvantages of selected improved models

      | Method | Typology | Model | Improvement idea | Advantage | Disadvantage |
      | U-Net | | UNet++[15] | A dense skip-connection strategy with a deep-supervision mechanism encourages the network to better learn and use hierarchical features | Deeper, multi-scale feature extraction and fusion improve segmentation performance | Large number of parameters and high computational complexity |
      | U-Net | | Attention U-Net[46] | An attention-gate mechanism removes noise and irrelevant information from skip connections and extracts key features | Effectively captures important local and global information, improving segmentation accuracy and performance | Segmentation performance improves only modestly |
      | Transformer | | Swin-Unet[58] | Replaces U-Net's 2D convolutional blocks with Swin Transformer modules | Multi-scale feature fusion; parameter count and computational efficiency are optimized | Lacks extraction of local feature information |
      | Transformer | | C2Former[59] | Designs a cross-convolutional self-attention mechanism to improve semantic feature understanding | Integrates multi-scale feature information and filters out interference | Loses feature information |
      | U-Net+Transformer | Single-scale serial | TransUNet[62] | A CNN extracts local features, and a Transformer module chained after the CNN extracts global contextual information | Perceives global information, processes sequential data directly, and adapts to multi-scale tasks | High computational complexity |
      | U-Net+Transformer | Single-scale serial | MultiIB-TransUNet[67] | A single Transformer layer extracts global features to reduce the parameter count, and multiple IB blocks compress noise to improve robustness | Fewer model parameters | Compressing the relevant features slightly reduces segmentation accuracy |
      | U-Net+Transformer | Multi-scale serial | CoTr[70] | The Transformer in the encoder receives CNN feature maps at multiple scales and uses a deformable self-attention mechanism that attends to key sampling points | Reduced computational and spatial complexity; multi-scale processing of 3D feature maps | Insufficient generalizability |
      | U-Net+Transformer | Multi-scale serial | HTUNet[73] | A 2D U-Net with an MSCAT module in the skip connections extracts intra-frame features, while a 3D Transformer U-Net extracts inter-frame features | Intra- and inter-frame feature fusion improves segmentation performance | Cannot delineate the nodule region from the gland; computationally expensive |
      | U-Net+Transformer | Alternating serial | HCTNet[76] | TEBlocks are designed to learn global context and are combined with CNN blocks to extract features | Combines the CNN inductive bias for spatial modeling with the Transformer's long-range dependency modeling | Boundary details are segmented poorly |
      | U-Net+Transformer | Alternating serial | UTNet[79] | Replaces the last convolution at each resolution of U-Net with a Transformer module | Combines convolution and self-attention; easy to understand and use | Poor segmentation performance on large-scale training tasks |
      | U-Net+Transformer | Global parallelism | TransFuse[82] | CNN and Transformer extract features in parallel, and a BiFusion module fuses the features of both branches | Captures global information while remaining sensitive to low-level context; strong fused representation | Transformer layers are less efficient; insufficient generalization |
      | U-Net+Transformer | Global parallelism | HSNet[86] | Encoder and decoder are connected through an interactive attention mechanism, and the decoder has a two-branch structure | Generates discriminative long-range dependencies, recovers detail features, and generalizes well | High model complexity |
      | U-Net+Transformer | Hierarchical parallelism | UconvTrans[88] | Each level uses a two-branch structure, and a feature-fusion module passes the fused features to the next level | Few parameters and little computation, balancing segmentation accuracy and efficiency | Not applicable to 3D medical image segmentation, where slice information is richer |
      | U-Net+Transformer | Hierarchical parallelism | RMTF-Net[93] | The encoder combines Mix Transformer and RCNN structures, and the decoder's GFI module re-fuses the features extracted by the encoder | Stronger boundary encoding; locally and globally balanced features at each encoder level | Not extended to 3D image segmentation |
    • Table 5. Comparison of segmentation methods based on U-Net and Transformer

      | Method | Model | Segmentation task | Image type | Dice /% | IoU /% | ACC /% |
      | U-Net | UNet++[15] | Nucleus/colon polyp/liver/pulmonary nodule | Microscopy/RGB video/CT/CT | - | 92.52/32.10/82.90/77.21 | - |
      | U-Net | UNet 3+[43] | Liver/spleen | CT/CT | 96.75/96.20 | - | - |
      | U-Net | Attention U-Net[46] | Pancreas | CT | 81.48 | - | - |
      | U-Net | R2U-Net[48] | Retinal vessel/skin lesion/pulmonary nodule | Fundus image/dermoscopy/CT | - | -/94.21/99.18 | 97.12/94.24/99.18 |
      | Transformer | Swin-Unet[58] | Abdominal multi-organ/cardiac | CT/MRI | 79.13/90.00 | - | - |
      | Transformer | C2Former[59] | Abdominal multi-organ/cardiac/skin cancer | CT/MRI/dermoscopy | 83.22/91.42/86.78 | - | - |
      | Transformer | MISSFormer[60] | Abdominal multi-organ/cardiac | CT/MRI | 81.96/90.86 | - | - |
      | Transformer | DS-TransUNet[61] | Polyp/skin lesion/gland/nucleus | Endoscopy/dermoscopy/pathology/microscopy | 93.5/-/87.19/- | 88.9/85.23/78.45/86.12 | - |
    • Table 6. Comparison of segmentation methods based on U-Net+Transformer

      | Combination strategy | Typology | Model | Segmentation task | Image type | Dice /% | IoU /% | ACC /% |
      | Serial | Single-scale serial | TransUNet[62] | Multi-organ/cardiac | CT/MRI | 77.48/89.71 | - | - |
      | | | TransUNet+[63] | Multi-organ/gland/cardiac | CT/pathology/MRI | 81.57/90.42/90.47 | -/82.69/- | - |
      | | | PKRT-Net[64] | Optic cup/optic disc | Fundus image | 91.20/97.66 | - | - |
      | | | TU-Net[65] | Vessel | US | - | 92.00/85.00/67.00 | - |
      | | | GL-Segnet[66] | Rectal adenocarcinoma cell/skin lesion/glioma/thoracic organ | Pathology/dermoscopy/CT/X-ray | 93.10/91.50/93.10/96.90 | 87.30/85.80/87.70/94.20 | - |
      | | | MultiIB-TransUNet[67] | Mammary gland | US/CT | -/81.83 | 67.75/- | - |
      | | | UNETR[68] | Abdominal multi-organ/brain tumor/spleen | CT/MRI/CT | 89.10/71.10/96.40 | - | - |
      | | | UMSTC[69] | Microscopy image | Microscopy | 82.50 | 76.50 | - |
      | | Multi-scale serial | CoTr[70] | Cranial vault multi-organ | CT | 85.00 | - | - |
      | | | Multi-compound Transformer[71] | Cell/colon polyp/skin lesion | Microscopy/endoscopy/dermoscopy | 68.40/92.30/90.35 | - | - |
      | | | TFNet[72] | Mammary gland | US | 87.90 | 78.40 | - |
      | | | HTUNet[73] | Thyroid | US | 98.59 | 97.26 | - |
      | | | TransHRNet[74] | Abdominal multi-organ/brain tumor/spleen | CT/MRI/CT | 86.70/72.50/97.40 | - | - |
      | | | Swin UNETR[75] | Brain tumor | MRI | 91.30 | - | - |
      | | Alternating serial | HCTNet[76] | Mammary gland | US | 97.23 | 94.63 | 97.41 |
      | | | Feature integration network[77] | Abdominal multi-organ/prostate | CT/MRI | 92.45/81.63 | - | - |
      | | | SWTRU[78] | Liver and tumor/skin lesion/glioma | CT/dermoscopy/MRI | 97.20/90.40/89.70 | 94.90/-/- | - |
      | | | UTNet[79] | Cardiac | MRI | 88.30 | - | - |
      | | | nnFormer[80] | Brain tumor/abdominal multi-organ/cardiac | MRI/CT/MRI | 86.40/86.57/92.06 | - | - |
      | | | SwinBTS[81] | Brain tumor | MRI | 81.15 | 81.10 | - |
      | Parallel | Global parallelism | TransFuse[82] | Polyp/skin lesion/hip/prostate | Endoscopy/dermoscopy/X-ray/MRI | 94.20/87.20/-/- | 89.70/-/-/- | -/94.40/-/- |
      | | | TransFusionNet[83] | Liver tumor/vessel | CT/CT | 96.10/90.10 | 92.70/85.40 | - |
      | | | DuPNet[84] | Rectal cancer | MRI | 98.22 | 89.34 | - |
      | | | CTC-Net[85] | Multi-organ/cardiac | CT/MRI | 78.41/90.77 | - | - |
      | | | HSNet[86] | Polyp | Endoscopy | 92.60 | 87.70 | - |
      | | | TDD-UNet[87] | COVID-19 pneumonia | CT/X-ray | 78.94 | - | 96.34 |
      | | Hierarchical parallelism | UconvTrans[88] | Cardiac | MRI | 89.60 | - | - |
      | | | Swin Unet3D[89] | Brain tumor | MRI | 90.50 | - | - |
      | | | P-TransUNet[90] | Polyp/nucleus/gland | Endoscopy/microscopy/pathology | 93.52/93.63/95.93 | 88.93/88.75/91.42 | - |
      | | | PCT[91] | Parotid tumor | US | 91.51 | 84.34 | - |
      | | | TransConver[92] | Brain tumor | MRI | 86.32 | - | - |
      | | | RMTF-Net[93] | Brain tumor | MRI | 93.50 | 88.20 | - |
    Citation: Yixiao Yin, Jingang Ma, Wenkai Zhang, Liang Jiang. From U-Net to Transformer: Progress in the Application of Hybrid Models in Medical Image Segmentation[J]. Laser & Optoelectronics Progress, 2025, 62(2): 0200001
    Paper Information

    Category: Reviews

    Received: Mar. 12, 2024

    Accepted: Jun. 3, 2024

    Published Online: Jan. 9, 2025


    DOI:10.3788/LOP240875

    CSTR:32186.14.LOP240875
