Lu et al. [3] | GLDP12k | Ghost-convolution-enlightened transformer | Accuracy: 98.14% | Proposed a transformer-based network in which Ghost convolutions generate the intermediate feature maps cheaply (a generic Ghost module is sketched after this table). |
Li et al. [4] | Two small-scale datasets | DCT | Accuracy: 89.19% and 93.92% | Proposed a DCT model with a transformer as its backbone. |
Lu et al. [5] | Washington State University (WSU) Roza Experimental Orchards, Prosser, WA | Swin-T-YOLOv5 | mAP: 97% | Developed Swin-T-YOLOv5 by architecturally integrating a Swin Transformer into the YOLOv5 detector. |
Praveen et al. [6] | Plant Village Dataset | YOLO-X with SE, ECA, and CBAM attention | Precision: 89.77%, Recall: 86.97%, F1-score: 85.91%, mAP: 88.96% | Combined attention mechanisms such as CBAM, SE, and ECA with YOLO to improve GLDD (a minimal SE block is sketched after this table). |
Xie et al. [7] | GLDD dataset | Faster DR-IACNN | mAP: 81.1% | Enhanced the Faster R-CNN model with Inception-v1, Inception-ResNet-v2 modules, and SE blocks to improve feature extraction. |
Kong et al. [8] | Jilin Agricultural University | YOLOv5 network and transformer module | mAP@0.5: 84.3% | Combined the YOLOv5 network with a transformer module for target detection. |
Shaheed et al. [9] | PlantVillage repository | Efficient RMT-Net | Accuracy: 97.65% and 99.12% | Proposed RMT-Net, which combines a Vision Transformer with ResNet-50 and extracts distinct features through the CNN backbone. |
Xia et al. [10] | Research Institute of Pomology of Chinese Academy of Agricultural Sciences in Xingcheng and the Beijing Vocational College of Agriculture in Beijing, China | MTYOLOX | AP50: 83.4%, AR50: 93.3% | Developed the DAT-Darknet and ST-PAFPN modules, embedded into the backbone and neck of the network, respectively, based on multiple self-attention mechanisms. |
Leng et al. [11] | NLB dataset | YOLOv5 | mAP@0.5: 87.5% | Introduced the feature restructuring and fusion module, which focuses on retaining critical information during downsampling. |
Lu et al. [12] | WGISD | CMA-YOLO | Precision: 89.6%, F1-score: 86.5%, AP: 90.2% | Introduced a YOLOv5-based model that integrates dual-stream data loading, mosaic augmentation, global self-attention, and a CMA-C3 module to enhance grape fruit detection accuracy. |
Feng et al. [13] | Xiaotangshan National Precision Agriculture Demonstration Base | YOLOv5s+BiCMT | Accuracy: 99.23%, Precision: 97.37%, Sensitivity: 97.54%, Specificity: 99.54% | Proposed YOLOv5 for region detection and a BiCMT classifier for feature fusion. |
Li et al. [14] | Dangshan County, Suzhou City, Anhui Province, China | YOLOv5s-FP | AP: 96.12% | Developed YOLOv5s-FP, which utilizes a modified CSP module with a transformer encoder for global feature extraction and attentional feature fusion. |
Jiang et al. [15] | / | Efficient LC3Net model | AP: 92.29% | Proposed the Retinex algorithm for contrast enhancement and the LC3Net model, with image normalization and reduced down-sampling frequency (a single-scale Retinex sketch follows the table). |
Sun et al. [16] | PlantVillage dataset | SE-VIT hybrid network | Accuracy: 97.26% | Developed the SE-VIT hybrid network, in which the SE attention module enhances inter-channel weight learning in ResNet-18. |
Huang et al. [17] | / | YOLO-EP algorithm, based on YOLOv5 | AP@0.5: 88.6%, Precision: 85.1%, Recall: 82.6% | Introduced the YOLO-EP algorithm, utilizing transposed convolution and attention mechanisms. |
Thai et al. [18] | Cassava Leaf Disease Dataset | Least important attention pruning (LeIAP) algorithm | / | Developed the LeIAP algorithm to select the most critical attention heads in each layer of the transformer model. |
Chen et al. [19] | / | ESP-YOLO | mAP: 98.3% | Integrated YOLO with advanced techniques like ELSAN, SE, and PConv to improve the accuracy and efficiency of table grape detection. |
Liu et al. [20] | RGB grape dataset, North China | FTR-YOLO | mAP: 90.67% | Developed FTR-YOLO, a real-time, lightweight model for detecting grape diseases. |
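To make the Ghost convolution used by Lu et al. [3] concrete, the following PyTorch sketch implements a generic Ghost module in the spirit of GhostNet: a small primary convolution produces the intrinsic feature maps, and a cheap depthwise convolution derives the remaining "ghost" maps. The channel ratio and kernel size are common defaults, not the settings reported in [3].

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Generic Ghost convolution: a cheap depthwise convolution derives
    'ghost' feature maps from a small primary convolution, reducing FLOPs
    versus a full convolution. Hyperparameters are illustrative defaults."""

    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2, dw_kernel: int = 3):
        super().__init__()
        init_ch = math.ceil(out_ch / ratio)      # intrinsic feature maps
        ghost_ch = init_ch * (ratio - 1)         # cheap ghost feature maps
        self.out_ch = out_ch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),   # depthwise convolution
            nn.BatchNorm2d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        # concatenate intrinsic and ghost maps, then trim to out_ch channels
        return torch.cat([y, self.cheap(y)], dim=1)[:, :self.out_ch]
```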
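Several of the surveyed models ([4], [5], [14], [17]) attach squeeze-and-excitation (SE) attention to a CNN backbone. A minimal PyTorch sketch of a standard SE block follows; the reduction ratio of 16 is the usual default and may differ from the surveyed implementations.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel attention: global average
    pooling squeezes each channel to a scalar, a two-layer bottleneck MLP
    produces per-channel gates, and the input is reweighted channel-wise."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                        # excitation: gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # reweight feature channels
```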
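Jiang et al. [15] apply Retinex-based contrast enhancement before detection. The sketch below implements single-scale Retinex with OpenCV as one plausible variant: the log of the image minus the log of its Gaussian-blurred illumination estimate. The sigma value is an illustrative assumption, since the exact formulation used in [15] is not given in the table.

```python
import cv2
import numpy as np

def single_scale_retinex(img: np.ndarray, sigma: float = 80.0) -> np.ndarray:
    """Single-scale Retinex contrast enhancement (illustrative variant).
    The Gaussian blur estimates scene illumination; subtracting its log
    from the log image leaves the reflectance component."""
    img = img.astype(np.float32) + 1.0                    # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)   # kernel derived from sigma
    reflectance = np.log(img) - np.log(illumination)
    # stretch the reflectance back to a displayable 8-bit range
    reflectance = cv2.normalize(reflectance, None, 0, 255, cv2.NORM_MINMAX)
    return reflectance.astype(np.uint8)
```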