1 Introduction
Object detection is one of the most critical tasks in computer vision. The task becomes harder when the object of interest is small, since camera resolution is limited. The primary objective of search and rescue (SAR) operations is to locate missing persons who may be harmed. People often go missing because of sickness or disorientation, and SAR efforts must also deliver medical care to participants in adventure sports who stay in isolated areas such as deserts. A SAR operation therefore aims to explore the area as thoroughly and as quickly as possible to locate the missing or injured individual. As time passes, the chances of finding a missing individual decrease, while the search area grows exponentially [1]. SAR teams continuously strive to modernize their routine operations by developing methods for swiftly discovering individuals lost in such settings. Traditionally, trained dogs have been used to find lost people, but the time and the number of dogs needed for such activities are prohibitive. An alternative is to rely on human searchers, but this is also time-consuming and requires the coordination of many professionals and volunteers [2].
Unmanned aerial vehicles (UAVs), which can help find missing individuals more quickly and lower the cost of SAR missions, have been embraced to increase survival rates. UAVs are now routinely used in search operations that require scanning and photographing the surroundings. However, detecting humans in UAV imagery is challenging because of positional variations, weather conditions, the small size of people in the frame, and environmental camouflage; moreover, the terrain itself is often hard to reach for ground search teams. With current research and growth in UAV-based detection techniques, an automatic object detection system is considered essential.
A considerable number of image analysis techniques have been reported in the literature to improve object detection accuracy and speed up the analysis. The most recent detectors described in the literature are either deep learning networks, such as you only look once (YOLO), RetinaNet, the faster region-based convolutional neural network (R-CNN), and cascade R-CNN, or hybrid ones. The datasets available for SAR operations are constrained, since few of them depict situations similar to those encountered in real-world missions. Accuracy, speed, and the number of false positives during testing are the factors typically included in the performance evaluation of these detectors.
In an SAR scenario, a person is the primary target of the search but is observed or recorded from aerial platforms. Such camera images or recordings are largely absent from the vast datasets used to train well-known object detectors. To achieve the highest level of detection accuracy, the dataset on which the model is trained should exhibit environmental attributes comparable to those found in real-world scenes. As a result, it is necessary to train the model on real-world aerial images or recordings.
2 Literature review
Low visibility at different elevations, positional variation, surroundings camouflaged by trees and bushes, and the need to process high-resolution photos all complicate object detection in aerial images [3]. Although typical ground imagery has yielded acceptable object detection results, detection from aerial images is still considered a serious challenge [4]. A person will likely still recognize an object that has been rotated, translated, scaled, or partially hidden in an image; for computer vision systems, however, the task is significantly more difficult because of the way computers address the problem. Thus, human detection from aerial imagery may produce a substantial number of false positives.
To recognize humans in aerial images, Baker et al. used image segmentation, contrast enhancement, and the single shot multibox detector (SSD) [5]. Yun et al. proposed an approach that employs a support vector machine (SVM) classifier to search for persons caught in an avalanche [6]. In Ref. [7], the global positioning system (GPS) signal is utilized in SAR missions; however, the solution assumes that the person carries a GPS device that is turned on, so that the location can be calculated by combining the UAV position with the received signal strength. In Ref. [8], the tiny you only look once version 3 (YOLOv3) structure was used to create a model that identifies a person in the water. The model was trained on the Microsoft common objects in context (MS COCO) dataset and on high-definition (HD) imagery captured by a UAV with a GoPro camera.
Many deep learning approaches that attempt object detection in SAR missions have been reported in the literature to accelerate further image analysis. Many deep learning structures [9] perform object detection and bounding box regression after first determining the region of interest (ROI). Such detectors, including faster R-CNN [10], RetinaNet [11], and cascade R-CNN [12], are frequently considered more accurate in terms of localization and classification, but they process data more slowly than single-stage detectors such as YOLO [13] and SSD [14]. The use of single-stage detectors for real-time object detection is commonly reported, e.g., in Ref. [15].
Motivated by efficiency and aspect ratio constraints, a group of Google researchers created a one-stage detector called EfficientDet [16], which proved much more efficient than two-stage detectors. The approach uses a single compound coefficient to scale the depth, width, and resolution dimensions evenly. EfficientNet-B7 [17] obtained an accuracy of 84.4% on ImageNet [18], and, combined with the bidirectional feature pyramid network, an average precision of 52.2% was reached on the MS COCO dataset [19].
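To illustrate the compound-scaling idea in code, the following sketch scales a baseline network's depth, width, and resolution with a single coefficient φ; the baseline values and the α, β, γ constants (taken from the EfficientNet paper) are assumptions used only for illustration, not parameters of any model evaluated here.

```python
# Minimal sketch of EfficientNet-style compound scaling (illustrative only).
# ALPHA, BETA, GAMMA are the grid-searched constants reported in the
# EfficientNet paper; treat them as assumptions in this context.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution factors

def compound_scale(base_depth: int, base_width: int, base_resolution: int, phi: int):
    """Scale depth/width/resolution evenly with one coefficient phi."""
    depth = round(base_depth * ALPHA ** phi)            # number of layers
    width = round(base_width * BETA ** phi)             # number of channels
    resolution = round(base_resolution * GAMMA ** phi)  # input image size
    return depth, width, resolution

# Example: scaling a hypothetical B0-like baseline by phi = 3.
print(compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=3))
```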
Transfer learning is a machine learning approach that is especially beneficial for remote sensing when sufficient data is not available. Using transfer learning, Rostami et al. [20] suggested a network model for classifying SAR scenes that does not need a large labelled dataset. Zhang et al. [21] proposed a transfer learning-based multiscale feature network in which intrinsic pyramidal features and low-resolution features are merged with high-resolution features. Also using transfer learning, Sambolek et al. devised a deep network model for SAR image categorization that does not require a labelled dataset [22].
The reliability of the RetinaNet, faster R-CNN, YOLOv4, and cascade R-CNN detectors has been examined using the VisDrone benchmark and a customized dataset built to mimic rescue scenes [23]. The detection outcomes were compared after the models were trained on the selected datasets. Dorrer and Tolmacheva proposed a strategy in which image sequences are used as the system input [24]. The strategy exploits the idea that a person detected in numerous successive images is likely a positive detection, while a person detected in only one image is more likely a false positive.
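As a rough illustration of the consecutive-frame idea in Ref. [24] (the exact matching rule used there is not reproduced here), the sketch below keeps a detection only if a nearby detection also appeared in the previous frame; the box format and the distance threshold are assumptions.

```python
# Sketch of consecutive-frame filtering: a person box in the current frame is
# kept only if a nearby box was also present in the previous frame, which
# suppresses one-off false positives. Thresholds and formats are assumed.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def center(b: Box) -> Tuple[float, float]:
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def confirm_detections(prev_boxes: List[Box], curr_boxes: List[Box],
                       max_center_dist: float = 30.0) -> List[Box]:
    """Keep current boxes whose center lies near some previous-frame box center."""
    kept = []
    for b in curr_boxes:
        cx, cy = center(b)
        if any((cx - px) ** 2 + (cy - py) ** 2 <= max_center_dist ** 2
               for px, py in map(center, prev_boxes)):
            kept.append(b)
    return kept
```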
Based on the discussion above, two main gaps can be identified. The image sequences employed in current studies largely depict everyday-life scenes, which are unrealistic from an SAR standpoint. The first area where existing methods need improvement is detection performance; the second is the speed of training and testing. One of the most prominent methods, faster R-CNN, processes an image in about 0.2 seconds. For instance, in Ref. [25], the mask R-CNN architecture is compared with YOLOv3: the mask R-CNN accuracy is shown to be much higher than that of YOLOv3, though YOLOv3 surpasses mask R-CNN in detection speed. YOLOv4 has been compared with SSD and faster R-CNN, and the results show YOLOv4's accuracy to be much greater than that of faster R-CNN and SSD, while the detection speed of SSD was found to be significantly higher than that of YOLOv3 and faster R-CNN. SSD consists of two components: a backbone model and the SSD head. The backbone model, acting as a feature extractor, is a pre-trained image classification network, often a ResNet with the final fully connected classification layer removed. The SSD head is an additional set of convolutional layers added to this backbone, whose outputs are interpreted as the bounding boxes and classes of objects at the spatial positions of the final layer activations.
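To make the backbone-plus-head description of SSD concrete, the following is a minimal, illustrative sketch of an SSD-style head attached to a truncated ResNet backbone; the channel counts, anchor count, and input size are assumptions and do not correspond to any model evaluated in this work.

```python
# Minimal, illustrative SSD-style detector: a pre-trained classification
# backbone with its fully connected layer removed, plus convolutional head
# layers whose outputs are read as per-location box offsets and class scores.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class TinySSDHead(nn.Module):
    def __init__(self, in_channels: int = 2048, num_classes: int = 2, num_anchors: int = 4):
        super().__init__()
        self.extra = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # 4 box offsets and num_classes scores per anchor at every spatial position.
        self.loc = nn.Conv2d(256, num_anchors * 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)

    def forward(self, feats):
        x = self.extra(feats)
        return self.loc(x), self.cls(x)

# Pre-trained ResNet-50 with avgpool and fc removed acts as the feature extractor.
backbone = nn.Sequential(*list(resnet50(weights=ResNet50_Weights.DEFAULT).children())[:-2])
head = TinySSDHead()
boxes, scores = head(backbone(torch.randn(1, 3, 300, 300)))  # feature map -> predictions
```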
3 Proposed methodology
The proposed real-time human detection pipeline for SAR missions is shown in Fig. 1. In contrast to prior developments that employed the DarkNet framework, the YOLOv5 implementation investigated here uses a machine learning framework based on the Torch library (PyTorch). This framework makes the model easy to understand, train, and deploy. Unlike ROI-based networks, YOLO models locate and recognize objects in a single pass.

Figure 1. Proposed approach using YOLOv5.
You only look once version X (YOLOX), a high-performance object detection model, was released in 2021, but in terms of implementation, you only look once version 5 small (YOLOv5s) is a fast and simple model with cutting-edge results. In Ref. [26], YOLOv5s is reported to outperform YOLOX in speed for detecting and identifying persons (pedestrians) in related fields. Gold YOLO was introduced in 2020 [27], but its performance has only been evaluated for object identification in ground videos; its speed on aerial datasets for detecting humans in varied poses has yet to be studied. The investigation of human detection from aerial videos and the writing of this work took place in late 2022; you only look once version 6 (YOLOv6) and you only look once version 7 (YOLOv7), released in mid-2022, and you only look once version 8 (YOLOv8), released in 2023, are therefore not investigated in this research.
The YOLOv5 algorithm, being a single-stage object detection model, was selected for its run speed. The YOLOv5 architecture uses a CSPDarknet53 backbone. Its focus layer has the advantage of using less memory and fewer layers while speeding up forward and backward propagation. This structure aids in detecting objects ranging from small to large.
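For reference, the focus operation at the front of the YOLOv5 backbone slices the input into four pixel-interleaved patches and concatenates them along the channel axis before a convolution; the following is a simplified re-implementation rather than the exact Ultralytics module, with illustrative channel counts.

```python
# Simplified re-implementation of the YOLOv5 Focus layer (illustrative only):
# the input is sliced into four pixel-interleaved patches, halving spatial
# resolution while quadrupling channels, then passed through one convolution.
import torch
import torch.nn as nn

class Focus(nn.Module):
    def __init__(self, in_channels: int = 3, out_channels: int = 64, k: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, k, stride=1, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 4C, H/2, W/2): no information is lost, only rearranged.
        patches = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1
        )
        return self.conv(patches)

out = Focus()(torch.randn(1, 3, 640, 640))
print(out.shape)  # torch.Size([1, 64, 320, 320])
```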
The YOLOv5 family has a total of five models [28]. YOLOv5 nano is the smallest and fastest YOLOv5 model. The you only look once version 5 large (YOLOv5L) model is useful when datasets contain tiny objects, which suits the current research target of detecting humans in SAR applications. Like YOLOv3 and YOLOv4, the head in YOLOv5 generates distinct feature outputs for multiscale prediction. The image is first passed to CSPDarknet53 for feature extraction before being given to the path aggregation network (PANet), and the YOLO layer ultimately produces the results. According to Ref. [29], YOLOv5 is considered more accurate than earlier versions, whereas YOLOv3 has a faster detection speed than later versions, while YOLOv4 and YOLOv5 have the same speed.
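As a usage sketch (assuming the public Ultralytics YOLOv5 hub entry point and a placeholder image path), the pretrained YOLOv5L model can be loaded and run on an aerial frame in a few lines before being fine-tuned on the SAR dataset.

```python
# Sketch: loading the pretrained YOLOv5L model via torch.hub and running it on
# one aerial frame. The image path is a placeholder; fine-tuning on the SAR
# dataset is done separately with the Ultralytics training scripts.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5l", pretrained=True)
model.conf = 0.6   # confidence threshold used in this work (see Section 3.2)
model.iou = 0.5    # NMS IOU threshold

results = model("sample_aerial_frame.jpg", size=832)  # letterboxed to 832 pixels
detections = results.xyxy[0]  # tensor of [x1, y1, x2, y2, confidence, class]
print(detections)
```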
3.1 Preprocessing
This step includes dataset selection, after which each image is divided into a 5×5 sub-image grid and bounding boxes are placed. For greater accuracy, the training dataset should have circumstances similar to those of the environment where the model is to be deployed for testing. SAR activities typically involve situations where people are sitting, walking, or resting on a beach or in a park. The SAR dataset was chosen for training and testing because it represents rescue operation instances and includes a variety of wounded-person poses in addition to more typical ones such as standing, sitting, lying down, and jogging. This dataset contains high-definition images with a resolution of 1920×1080 taken along the drone flight path in daylight, at heights ranging from 5 m to 50 m and camera angles from 45° to 90°. Sample images from this dataset are displayed in Fig. 2.

Figure 2. Sample dataset images.
The network divides dataset images into regions and forecasts bounding boxes and region probabilities directly, without region proposals. Bounding box weights are determined from the anticipated probabilities. The network computes the anticipated class and bounding box coordinates for a cell if the box center falls within that cell. If no object is present, the computed confidence score is zero; otherwise, the score corresponds to the intersection over union (IOU) between the predicted box and the ground truth.
3.2 Detection
The data is initially sent into CSPDarknet53 (the backbone) for feature extraction before being fed into PANet (the neck) for fusion, after which detection is performed. The YOLO layer (the head) produces the detection results (score, class, size, and location). Non-max suppression ensures that the algorithm finds each object only once. Each box produces a probability Pc, which represents the likelihood that an object exists. Non-max suppression selects the box with the largest Pc value, discards boxes with Pc less than or equal to 0.6, and suppresses boxes with an IOU greater than or equal to 0.5.
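A minimal sketch of this non-max suppression step, using the thresholds stated above, is given below; the (x1, y1, x2, y2, Pc) box format is an assumption for illustration.

```python
# Minimal non-max suppression following the thresholds stated above:
# boxes with Pc <= 0.6 are discarded, the remaining boxes are taken in
# decreasing Pc order, and any box with IOU >= 0.5 against a kept box is
# suppressed. The box format (x1, y1, x2, y2, Pc) is an assumption.
from typing import List, Tuple

Det = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, Pc

def iou(a: Det, b: Det) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def non_max_suppression(dets: List[Det], pc_thresh: float = 0.6,
                        iou_thresh: float = 0.5) -> List[Det]:
    candidates = sorted((d for d in dets if d[4] > pc_thresh),
                        key=lambda d: d[4], reverse=True)
    kept: List[Det] = []
    for d in candidates:
        if all(iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept
```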
The full high-resolution image cannot be fed directly as an input because the targeted object is tiny compared with the size of the whole image. To achieve a robust YOLOv5 model, the input image must be resized before being run through the network. The default training size for the proposed network is 640×640, as also recommended in Ref. [30]. Different network resolutions were tried for testing while keeping the input images at 1920×1080. Increasing the network resolution can produce better detection results, as shown in Section 4; the resolution that produces the highest accuracy is 832×832.
For comparative purposes, all the models discussed above are tested for human detection in SAR imagery. The proposed method based on YOLOv5L is compared with the YOLOv3, YOLOv4, and faster R-CNN models in terms of execution speed and the number of false detections on the SAR dataset.
4 Results
In this section, the experimental results of the proposed method for human detection on the SAR dataset are presented. To evaluate YOLOv5, a set of performance measures are calculated. For comparison, well-known detectors were also employed on the same dataset. Preliminary results of this work are reported in Ref. [31].
4.1 Evaluation metrics
The research work in Ref. [32] provided performance measures based on the spatial intersection of the ground-truth and system-generated bounding boxes, averaged over all sampled frames. The evaluation measures employed here are the well-known recall, precision, and mean average precision (mAP). In this research only one class (human) is considered, so mAP is equivalent to average precision (AP). mAP is obtained by averaging the average precision over the classes, as indicated in (1), and is a standard metric for quantifying the accuracy of a detection algorithm.
$ \mathrm{mAP}=\frac{1}{Q}\sum_{q=1}^{Q}\mathrm{AveP}\left(q\right) $ (1)
where q represents a query, Q represents the total number of queries, and AveP(q) stands for the average precision for that query.
Locating a person as quickly as possible is critical to a successful SAR operation. In the detection problem, a true positive (TP) is a correctly detected target, a false positive (FP) is a region mistakenly identified as a target, and a false negative (FN) is a target that was missed. The IOU is used to determine whether a prediction is a TP or an FP; it is the ratio of the overlap between the ground truth and the predicted box to the area of their union. Precision evaluates the accuracy of the detection findings and is calculated as the ratio of TP detections to total detections, as indicated in (2). In contrast, (3) defines recall, which compares TP detections with the total number of possible detections.
$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $ (2)
$ \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}. $ (3)
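The definitions in (2) and (3), together with the F1 score introduced later in (6), translate directly into code; the TP/FP/FN counts below are placeholders, not results from this work.

```python
# Direct implementation of (2), (3), and the F1 score from (6);
# the TP/FP/FN counts are placeholder numbers for illustration only.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

tp, fp, fn = 90, 3, 7  # placeholder counts
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f} f1={f1_score(p, r):.3f}")
```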
4.2 Preprocessing
Eight 1920×1080 pixel videos were used to create the images for the SAR dataset. The training and test images were split in a 60:40 ratio: the 1189 images of the training set have 3921 people marked, compared with the 792 testing images, where 2611 people were marked. A total of 1981 single frames containing individuals were picked from 35 minutes of recordings. To train the model, the people in the chosen images were manually labelled with the LabelImg tool [33], an open-source, simple-to-use application for labelling object bounding boxes. Each annotation includes the bounding box position around the object of interest, its height and width, and the label (walking, standing, running, sitting, lying, and not defined). Labels are saved as extensible markup language (XML) files and in the YOLO format.
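Since the annotations are stored as XML files while YOLO training expects plain-text label files, a conversion along the lines of the sketch below is commonly used; the file paths and the single-class mapping are assumptions for illustration.

```python
# Sketch: converting a LabelImg Pascal-VOC XML annotation into YOLO txt format
# (class_id x_center y_center width height, all normalized to [0, 1]).
# File names and the single-class mapping are assumptions for illustration.
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path: str, txt_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = 0  # every pose label (standing, sitting, ...) maps to "person"
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        xc, yc = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
        w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```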
4.3 Multiscale training
The YOLO framework adjusts the image size according to the width and height parameters in the .cfg configuration file while maintaining the image aspect ratio; these parameters are referred to as the network resolution. Transforming the image resolution in the YOLO architecture may be expressed as stated in (4).
$ \mathrm{Img}_{\mathrm{train\_width}}=\mathrm{Net}_{\mathrm{width}} $ (4a)
$ \mathrm{Img}_{\mathrm{train\_height}}=\mathrm{Img}_{\mathrm{height}}\times \frac{\mathrm{Net}_{\mathrm{width}}}{\mathrm{Img}_{\mathrm{width}}}. $ (4b)
As an example, if the input image resolution is 1920×1080 and the network resolution is set to a width of $ \mathrm{Net\_width} $ = 832 and a height of $ \mathrm{Net\_height} $ = 832, YOLO rescales the input image width to $ \mathrm{Net\_width} $ and scales the height by the ratio of $ \mathrm{Net\_width} $ to $ \mathrm{Img\_width} $. Thus, 1920×1080 is changed to 832×468. For illustration, the comparison of the input and network image resolutions is shown in Fig. 3.
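A small sketch of this resolution mapping, following (4), is shown below; the function name is illustrative.

```python
# Sketch of the resolution mapping in (4): the width is set to the network
# width and the height is rescaled by the same Net_width / Img_width ratio.
def yolo_train_resolution(img_w: int, img_h: int, net_w: int) -> tuple:
    scale = net_w / img_w
    return net_w, round(img_h * scale)

print(yolo_train_resolution(1920, 1080, 832))  # (832, 468)
```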

Figure 3. Input and network image resolution: (a) input image resolution and (b) network image resolution.
Another approach for improving detection performance, primarily for small objects, is to use higher-resolution images to train the network. Since the model downsamples the image by a factor of 32, the width and height must be multiples of 32. During training, YOLO therefore draws images of size 320×320, 352×352, ···, 512×512, ···, 608×608, ···, 832×832 with a step of 32, as stated in (5).
$ \mathrm{Net}_{\mathrm{width}}=\mathrm{Net}_{\mathrm{width}}+k, \quad k=32n, \; n\in \mathbb{N}. $ (5)
Different network resolutions were used for testing while keeping the input images at a size of 1920×1080. The comparison shows that increasing the resolution during testing produces better detection results; Table 1 shows that the 832×832 network resolution yields the best accuracy.

Table 1. YOLOv5 detection performance (%) at different network resolutions.

Network resolution (test) | AP | AP50
320×320 | 38 | 78
640×640 | 60 | 95
832×832 | 64 | 96
4.4 Implementation platform and processing time
The model was imported into Kaggle to apply the pretrained weights to the SAR dataset. The trained model was then exported to Google Colaboratory for validation and testing on SAR images. Table 2 shows the hardware specification of the Google Colaboratory machine.

Table 2. Hardware specification of the Google Colaboratory machine.

Parameter | Specification
Model name | Intel(R) Xeon(R) CPU @ 2.30 GHz
CPU MHz | 2299.998
Cache size | 46080 KB
CPU cores | 2
RAM | 12 GB
GPU | Nvidia K80 / T4
GPU memory clock | 0.82 GHz / 1.59 GHz
Max execution time | 12 hours
Max idle time | 90 minutes
Preprocessing took 0.9 ms per image, while the proposed model's inference time is 24.5 ms per image, with a maximum execution time of 12 hours at a network size of 832×832. Faster R-CNN had a maximum execution time of 20 hours with an inference time of 1 s per image [32]. Since the objective of an SAR operation is to find the lost individual in the shortest possible time, YOLOv5L is clearly preferable to the faster R-CNN model in terms of the speed of predicting human presence.
4.5 Detection results and comparisons
The object detection results on the SAR dataset are given in Table 3. The best results were obtained with YOLOv5L: it was only slightly better than YOLOv4, while faster R-CNN performed significantly worse than YOLOv5L. For all tested detectors, AP50 achieved the highest values. The best AP50 was achieved by YOLOv5L with a value of 96.9%, compared with 96% for the YOLOv4 model; faster R-CNN performed worst, with the lowest AP50 (91%). When the detector precision threshold was raised from AP50 to AP75, all detectors performed worse, and the highest mean accuracy recorded was 74.3%, again with YOLOv5L.

Table 3. Comparative results on the SAR dataset.

Model | Class | Images | Labels | mAP@0.5 | mAP@0.75 | mAP
YOLOv5L | All | 792 | 2605 | 0.969 | 0.743 | 0.643
YOLOv4 | All | 792 | 2605 | 0.960 | 0.710 | 0.610
YOLOv3 | All | 792 | 2605 | 0.925 | 0.630 | 0.902
Faster R-CNN | All | 792 | 2605 | 0.910 | 0.510 | 0.500
The precision and recall ratios for all evaluated models are shown in Table 4. YOLOv5L has the best precision-to-recall ratio, with 97% precision and a recall better than 93%, indicating that it was the best performer and spotted the largest number of humans in the given images. Faster R-CNN had the highest recall but a much lower precision of just 67%, producing a larger number of FP detections than YOLOv5L.

Table 4. Precision-to-recall ratios for different models.

Model | Precision | Recall
YOLOv5L | 0.971 | 0.932
YOLOv4 | 0.960 | 0.910
YOLOv3 | 0.962 | 0.892
Faster R-CNN | 0.670 | 0.936
After calculating recall and precision for various IOU thresholds, the recall and precision plots of the classifier at different IOU thresholds are constructed in Fig. 4. The precision-to-recall curve is then used to compute the average precision shown in Fig. 5 (a). A detection is considered successful if the intersection of the detected bounding box with the associated ground-truth box, divided by their union, is 50% or greater.

Figure 4. Classifier at different IOU thresholds: (a) precision and (b) recall.

Figure 5. Average precision: (a) precision-to-recall curve and (b) F1 score.
The F1 score, defined in (6), is the harmonic mean of recall and precision and summarizes the accuracy. The F1 score ranges from 0 to 1, where 1 represents excellent performance, i.e., all predictions are correct. Fig. 5 (b) shows the F1 score obtained in our experiment: a value of 0.95, which indicates a good balance between precision and recall.
$ F1_{\mathrm{score}}=2\times \frac{\mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}. $ (6)
Mao [26] considered you only look once version 5 small (YOLOv5s) and you only look once version 5 medium (YOLOv5m), which differ in size, speed, and recorded mAP. In this research, YOLOv5L has been implemented. All YOLOv5 models (YOLOv5s, YOLOv5m, and YOLOv5L) differ in size, speed, and benchmark performance metrics such as precision, recall, and mAP. Furthermore, the number of layers and the depth and width multiples differ across the YOLOv5 models. These factors increase the training duration as the network grows from small to large: with greater depth and width, the model complexity and the number of floating-point operations (FLOPs) increase. A new architecture can be designed or customized by deleting or adding layers in the backbone of the model to change its performance. To compare the performance of these models on the SAR dataset, the benchmark metrics (precision, recall, and mAP) were recorded. The results are shown in Table 5, where it is clear that the performance of YOLOv5L is superior to that of YOLOv5s and YOLOv5m.
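For context, the YOLOv5 variants differ mainly in two scaling factors defined in their model configuration files; the multipliers below are quoted from the public Ultralytics YOLOv5 YAML files, and the helper function is an illustrative assumption (the real implementation also rounds channel counts to multiples of 8).

```python
# Depth and width multipliers that distinguish the YOLOv5 variants, as listed
# in the public Ultralytics model YAML files (quoted here for illustration).
# Larger multipliers mean more layers and more channels, hence more FLOPs and
# longer training, as discussed above.
YOLOV5_SCALES = {
    "yolov5s": {"depth_multiple": 0.33, "width_multiple": 0.50},
    "yolov5m": {"depth_multiple": 0.67, "width_multiple": 0.75},
    "yolov5l": {"depth_multiple": 1.00, "width_multiple": 1.00},
}

def scaled_channels(base_channels: int, model: str) -> int:
    """Approximate channel count after applying the width multiplier."""
    return round(base_channels * YOLOV5_SCALES[model]["width_multiple"])

print({m: scaled_channels(64, m) for m in YOLOV5_SCALES})  # e.g. {'yolov5s': 32, ...}
```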

Table 5. Performance comparison of YOLOv5 models on the SAR dataset.

Model | Precision | Recall | mAP
YOLOv5s | 0.940 | 0.917 | 0.933
YOLOv5m | 0.775 | 0.767 | 0.762
YOLOv5L | 0.971 | 0.932 | 0.969
To further compare YOLOv5L with YOLOv5s and YOLOv5m, the HERIDAL dataset was used; the results are displayed in Table 6. On the HERIDAL dataset, YOLOv5L achieved a precision of 0.900, a recall of 0.893, and an mAP of 0.864, which are superior to those of YOLOv5s and YOLOv5m.

Table 6. Performance comparison of YOLOv5 models on the HERIDAL dataset.

Model | Precision | Recall | mAP
YOLOv5s | 0.753 | 0.694 | 0.731
YOLOv5m | 0.797 | 0.812 | 0.810
YOLOv5L | 0.900 | 0.893 | 0.864
The HERIDAL dataset was also employed to compare the YOLOv3, YOLOv4, YOLOv5L, and faster R-CNN models; Table 7 shows the comparison results.

Table 7. Comparison of YOLOv3, YOLOv4, YOLOv5L, and faster R-CNN models on the HERIDAL dataset.

Model | Precision | Recall | mAP
YOLOv5L | 0.900 | 0.893 | 0.864
YOLOv4 | 0.860 | 0.830 | 0.783
YOLOv3 | 0.877 | 0.780 | 0.752
Faster R-CNN | 0.770 | 0.890 | 0.861
In SAR missions, detector accuracy is essential to avoid wasting resources on erroneous detections. Examples of detection results produced by the YOLOv5L model trained on the SAR dataset are displayed in Fig. 6, which covers the possible detection outcomes. A positive detection occurs when a person is located and the intersection over union between its bounding box and the corresponding ground truth is 50% or higher, whereas a negative detection is one in which a person has not been found and the intersection between the person's ground truth and the bounding box is below 50%. An FP detection occurs when a portion of the image that does not contain a human is interpreted as one. The proposed model detects all subjects in the images with the confidence levels shown in Fig. 6; the images in the first column are the labelled ones, while those in the second column are the predictions. According to the results on the test images, YOLOv5L was the most successful at detecting humans in SAR events. However, there are instances where the YOLOv5L model performs poorly, as shown in Fig. 7. Shadowed scenes (row 1) and dark portions of vegetation perceived as persons (row 2) are the most common causes of erroneous detection. In SAR efforts, it is almost expected that a person will fade into the surroundings.

Figure 6. Two samples of positive detections with high confidence.

Figure 7. False detections of the YOLOv5L model (shadows, dark areas).
5 Conclusions and future directions
Analyzing a large number of high-resolution images for small objects and fine details is a time-consuming operation, and numerous mistakes are likely to occur. In this research, an SAR system was presented that combines UAV technology with real-time computer vision and deep learning algorithms. The technique utilized a YOLOv5L architecture running on the Nvidia K80/T4 computing platform. The mAP of the proposed model was 96.9%, a good performance for an on-board human detection system compared with well-known detectors such as faster R-CNN, YOLOv4, and YOLOv3, and with other YOLOv5 models such as YOLOv5s and YOLOv5m. Other metrics such as precision and recall were also examined. YOLOv5L had the best precision-to-recall ratio, with 97% precision and a recall better than 93%, indicating that it was the most successful human detection method and spotted the largest number of persons in the images (relative to the ground truth). Precision is more critical than recall when fewer FP are required at the expense of more FN, i.e., when an FP is costly but an FN is not.
Future work may involve developing a model for identifying object movement (standing, sitting, walking, running, and lying down) and tracking individuals within SAR settings. The dataset may be enlarged by adding images that replicate additional weather conditions, such as snow, fog, and ice, which occur in genuine SAR operations. Blurred images could also be included to represent camera movement and aerial photography in action. To further enhance the performance and precision of deep learning models, a training database containing images of various object classes should be constructed in future research.
Disclosures
The authors declare no conflicts of interest.