Journal of Electronic Science and Technology, Volume 23, Issue 2, 100303 (2025)

Multi-UAV path planning for multiple emergency payloads delivery in natural disaster scenarios

Zarina Kutpanova1, Mustafa Kadhim2, Xu Zheng2, and Nurkhat Zhakiyev3
Author Affiliations
  • 1Department of Computer Engineering, Astana IT University, Astana, 010000, Kazakhstan
  • 2School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 610054, China
  • 3Department of Computational and Data Science, Astana IT University, Astana, 010000, Kazakhstan

    Unmanned aerial vehicles (UAVs) are widely used in situations with uncertain and risky areas lacking network coverage. In natural disasters, timely delivery of first aid supplies is crucial. Current UAVs face risks such as crashing into birds or unexpected structures. Airdrop systems with parachutes risk dispersing payloads away from target locations. The objective here is to use multiple UAVs to distribute payloads cooperatively to assigned locations. The civil defense department must balance coverage, accurate landing, and flight safety while considering battery power and capability. Deep Q-network (DQN) models are commonly used in multi-UAV path planning to effectively represent the surroundings and action spaces. Earlier strategies focused on advanced DQNs for UAV path planning in different configurations, but rarely addressed non-cooperative scenarios and disaster environments. This paper introduces a new DQN framework to tackle challenges in disaster environments. It considers unforeseen structures and birds that could cause UAV crashes and assumes urgent landing zones and winch-based airdrop systems for precise delivery and return. A new DQN model is developed, which incorporates the battery life, safe flying distance between UAVs, and remaining delivery points to encode surrounding hazards into the state space and Q-networks. Additionally, a unique reward system is created to improve UAV action sequences for better delivery coverage and safe landings. The experimental results demonstrate that the proposed framework achieves strong performance for multi-UAV first aid delivery in disaster environments.


    1 Introduction

    Multiple unmanned aerial vehicles (UAVs) offer a promising solution [1] to deliver first aid and necessary supplies to areas affected by natural disasters such as earthquakes, floods, landslides, forest fires, and pandemics [2,3]. These natural disasters cause critical damage to human life and infrastructure, significantly hindering rescue missions due to the difficulty of accessing the disaster zones [4]. This scenario is aggravated if the communication infrastructure is damaged. In such situations, multiple UAVs are deployed to provide rapid, flexible, and efficient delivery, navigating through challenging environments and thereby ensuring continuous aid and coordination, as shown in Fig. 1. The autonomy and coordination of multiple UAVs improve the effectiveness of disaster response operations.

    Figure 1. Illustration of UAVs’ payloads delivery in a natural disaster environment.

    Environmental factors, such as terrain obstacles, weather conditions, and city infrastructure, as well as UAV performance limitations like payload capacity, energy consumption, and flight range, have a significant influence on the multi-UAV path-planning problem in various mission scenarios [5,6]. To address these challenges, researchers have applied reinforcement learning (RL) techniques, which enable UAVs to interact with their environment, learn from their actions, and optimize rewards based on state transitions [7,8]. This interaction allows the model to understand environmental characteristics and develop strategies for UAV navigation. Reference [9] reports a Q-learning model for UAV navigation, focused on optimizing a proportional-integral-derivative (PID) controller to improve navigation in a two-dimensional indoor environment, and illustrates how Q-learning can adapt the PID parameters to enhance the maneuverability of UAVs. The authors later extended the model by incorporating a fixed sparse representation (FSR) and the honey badger algorithm (HBA) for function approximation, improving the convergence of the Q-learning model. More recently, optimization of RL methods for UAV missions has become a growing area of interest. Studies have increasingly used deep Q-networks (DQNs) within RL frameworks to handle complex state spaces, optimize UAV strategies, and enable decision-making based on past experiences. This has resulted in improved coordination and task execution among UAVs.

    DQN is an RL approach grounded in artificial intelligence, utilizing Q-learning, a mechanism by which the agent seeks to optimize cumulative rewards by estimating Q-function values for every potential action in each state [1,10]. DQN-based approaches have been instrumental in achieving UAV navigation alongside radio mapping in three-dimensional (3D) space. These approaches rely on the Dueling DQN algorithm, which is modeled using a Markov decision process (MDP) framework [11], as demonstrated in Refs. [12]–[15]; notably, the approach proposed in Ref. [14] achieves both 3D navigation and radio signal mapping through Dueling DQN applied to an MDP model. Likewise, a deep reinforcement learning (DRL) approach utilizing a partially observable Markov decision process (POMDP) has been proposed for UAV navigation in indoor environments [13,16]. To address the challenges of navigating urban areas with multiple UAVs, a benchmark DQN framework based on the decentralized partially observable Markov decision process (Dec-POMDP) has been developed, enabling efficient path planning in dense urban scenarios [17].

    The SaCHBA-PDN model [15] tackled the aforementioned challenges by employing a Bernoulli shift map, piece-wise optimal decreasing neighborhood, and horizontal crossing with strategy adaptation to enhance path planning. Another approach, the CL-DMSPSO model [1], introduced an RL-based solution for more efficient path planning by integrating the comprehensive learning particle swarm optimization (CLPSO) algorithm [10] with the dynamic multi-swarm particle swarm optimization (DMSPSO) algorithm. Additionally, the AMTP model [12] applied an actor-critic multi-agent RL framework for multi-UAV operations, achieving superior performance over recent methods in several areas. Addressing the computation offloading problem, Liu et al. [13] proposed a multi-objective MDP model, combining Double DQN (DDQN) with Dueling DQN to enhance optimization efficiency. While these methods represent significant advancements, they often fall short in adequately considering the complexity of dynamic environments in real-world applications.

    Although current DQN models for UAV planning have achieved outstanding outcomes, they still face limitations in dealing with disaster and competitive scenarios during the delivery mission cycle. This type of scenario is relatively widespread and crucial, as UAVs are increasingly utilized for demanding missions such as conducting investigations in urban areas, mountains, and forests affected by natural disasters. Emergency response in natural disaster scenarios requires accurate and rapid delivery actions, where the disaster areas would lack a proper UAV landing environment. Traditional airdrop systems (parachute-based) might thus fail to deliver the payload to the delivery locations accurately [18] due to weather conditions or technical issues such as release failures. A winch system would be the optimal solution for accurate delivery; however, this system has not been considered in recent research [19] because of its high energy consumption, which can cause UAVs to fail to return to the base. From another perspective, the hazard areas bring severe threats to the execution of payload delivery tasks and the survival of UAVs. Such a scenario increases the complexity of DQN models, which hinders reaching optimal path planning; UAVs might consequently follow a greedy strategy in which remote areas are not considered.

    The problem of multi-UAV cooperation in a natural disaster mission environment with natural influences on payload delivery, landing, and return is studied. The power capacity of UAVs, the complicated environment, and the potential threats from hazard areas need to be considered. This mission is executed once the natural disaster occurs; thereby, a group of UAVs is released to deliver emergency payloads in the disaster region and tries to land back in the starting area. The path planning for UAVs should be properly derived such that the coverage of delivery points is guaranteed while the UAVs respect their energy capacity, avoid crashes, and refrain from greedy strategies. Meanwhile, the proposed model can effectively cope with other challenges such as no-communication zones and no-fly zones (NFZs). The reason for focusing on the risk of crashing into birds or unexpected building components is the potential for significant operational disruptions and safety hazards. For instance, Ref. [20] highlights that collisions between small UAVs and birds or building components can cause substantial damage, impacting not only the UAV’s mission but also posing risks to human safety. This risk is particularly critical in disaster scenarios where UAVs play a crucial role in search and rescue operations, infrastructure assessments, and delivering essential supplies. Ensuring the safety and reliability of UAV operations in such environments is therefore paramount.

    According to the problem formulation, a novel DQN model for multi-UAV path planning for accurate payload delivery in a natural disaster scenario is proposed to address these issues. The problem is formulated by transforming a complicated environment into a grid world and various kinds of regions into different sets of grid cells marked by different labels. A DQN model that derives the action plans for UAVs in a distributed manner is proposed, aiming to simplify the problem and lower the computation load. A novel state space is designed, consisting of the physical space, the hazard space representing the crashing risk areas, and the return space representing the corresponding values of delivering aids to different points. Typical mapping strategies are followed to generate the Q-network. The reward function is improved by inserting penalties when UAVs repeatedly land at urgent bases due to energy issues, deliver payloads to only a few delivery points, use a greedy strategy, or pass through hazard areas. The model facilitates UAVs in bypassing hazard incidents and delivering payloads to both remote and nearby areas. Simulation results suggest that the proposed model can effectively handle difficulties within the aid delivery mission, such as inaccessible areas and crashing threats, while significantly boosting the return success rate (RSR).

    The contributions of this work are as follows:

    • A unique system model is developed for multi-UAV path planning in natural disaster environments to deliver first aid supplies to multiple inaccessible locations. The formulated model incorporates the payload airdrop utilizing a winch system and considers the risk posed by the hazard, both of which are essential factors in designing optimal solutions.

    • To consider this scenario, a DDQN-based model is designed to produce optimal plans, where novel state spaces and reward functions are studied to enhance delivery accuracy and avoid the greedy strategy that ignores remote areas.

    • A comprehensive set of numerical findings that demonstrate the superior performance of the proposed technique in natural disaster scenarios compared to baseline models is presented.

    The paper is organized as follows. Section 2 provides a comprehensive explanation of the problem formulation. Section 3 details the proposed model including the training process. Section 4 presents the simulation outcomes. Section 5 concludes the work.

    2 Problem definition

    This section explains the problem formulation of multi-UAV path planning in natural disaster scenarios. Three novel factors, previously underexplored in recent research, were considered: Winch-based payload systems, urgent landing areas, and hazard zones. These factors complement the already established influences on multi-UAV performance, such as environmental complexity, maximum payload capacity, energy consumption, and flight range, which significantly impact navigation efficiency, mission duration, and overall multi-UAV effectiveness.

    The winch-based payload model represents a payload sling system used where parachute-based airdrops might fail to deliver accurately. The urgent landing areas represent a real-world scenario of civil defense stations, giving UAVs the ability to land at nearby civil defense stations if required. Bird gatherings can be considered the attack model, which puts the UAV and its payloads at risk. To make these settings precise and provide a better understanding of the system model, the system environment, UAV model, delivery model, attack model, and optimization problem are defined, respectively.

    2.1 System environment

    To establish the basis of this study, an organized strategy is employed to describe the structures present in the environment and their respective properties.

    2.1.1 Physical environment

    The physical setting is modeled as a digital grid world $\mathsf{\mathcal{M}} $ of dimension $ M\times M\in {{\mathbb{N}}^{2}} $ (${M} $ is the length of the grid world’s side; ${\mathbb{N}} $ is the set of natural numbers). This digital grid world contains particular sets that characterize the obstacles and difficulties the UAV model must face.

    First, the allowed starting/landing area for UAVs is defined as $ \mathsf{\mathcal{L}} $ which consists of $ {L} $ positions, given by

    $ \begin{array}{*{20}{l}} \mathsf{\mathcal{L}}=\left\{ {{\left[ x_{i}^{l}{\mathrm{,}}\;y_{i}^{l} \right]}^{\text{T}}}{\mathrm{,}}\;i=1{\mathrm{,}}\;2{\mathrm{,}}\; \cdots {\mathrm{,}}\;L\,{\mathrm{:}}\; {{\left[ x_{i}^{l}{\mathrm{,}}\;y_{i}^{l} \right]}^{\text{T}}}\in \mathsf{\mathcal{M}} \right\}\; \end{array} $ (1)

    where in the grid world $ \mathcal{M} $, $ x $ and $ y $ stand for the abscissa and the ordinate, respectively; $ \text{T} $ denotes the matrix transpose. The abscissa refers to the horizontal coordinate of a point on a graph, while the ordinate refers to the vertical coordinate of the same point. Additionally, this set has been revised by identifying optimal and urgent landing zones through the definition of two categories of landings, in which $ L_{\mathrm{urgent}} $ denotes the urgent landing position that may be nearer than the optimal one and $ L_{\mathrm{optimal}} $ denotes the desired landing position to which the operator intends for the UAV to return. This update is intended to produce scenarios much closer to the real world. In the meantime, the tall buildings and NFZs, which are inaccessible to UAVs, are represented by the set $ \mathcal{Z} $ formed by Z positions, defined as

    $ \begin{array}{*{20}{l}} \mathcal{Z}=\left\{ {{\left[ x_{i}^{z}{\mathrm{,}}\; y_{i}^{z} \right]}^{\text{T}}}{\mathrm{,}}\; i=1{\mathrm{,}}\;2{\mathrm{,}}\; \cdots {\mathrm{,}}\;Z\,{\mathrm{:}}\; {{\left[ x_{i}^{z}{\mathrm{,}}\;y_{i}^{z} \right]}^{\text{T}}}\in \mathcal{M} \right\} \end{array} $ (2)

    The set $ \mathcal{B} $ is defined to consider short buildings and obstacle areas that disrupt wireless connections [17] and short buildings that UAVs are permitted to fly over. This set allows UAVs to operate even in the positions having no connections. It consists of B positions and can be formulated as

    $ \begin{array}{*{20}{l}} \mathcal{B}=\left\{ {{\left[ x_{i}^{b}{\mathrm{,}}\; y_{i}^{b} \right]}^{\text{T}}}{\mathrm{,}}\;i=1{\mathrm{,}}\;2{\mathrm{,}}\; \cdots {\mathrm{,}}\; B\,{\mathrm{:}}\;{ {\left[ x_{i}^{b}{\mathrm{,}}\;y_{i}^{b} \right]}^{\text{T}} }\in \mathcal{M} \right\} \end{array} $ (3)

    Equation (4) defines the set of hazard positions $ \mathcal{G} $, which contains G positions where UAVs might crash into birds or unexpected objects.

    $ \begin{array}{*{20}{l}} \mathcal{G} =\left\{ {{\left[ x_{i}^{g}{\mathrm{,}}\; y_{i}^{g} \right]}^{\text{T}}}{\mathrm{,}}\; i=1{\mathrm{,}}\;2{\mathrm{,}}\; \cdots{\mathrm{,}}\; G\,{\mathrm{:}}\; {{\left[ x_{i}^{g}{\mathrm{,}}\;y_{i}^{g} \right]}^{\text{T}}}\in \mathcal{M} \right\} \end{array} $ (4)

    The set $ \mathcal{G} $ could refer to any hazard risk, such as birds, unexpected objects, hostile UAVs, or other kinds of risk that would damage or delay the mission. A single UAV is able to detect the positions of these hazards by using sensors, such as the Song Meter SM4 acoustic recorder [21] and the RIEGL miniVUX-1-UAV [22]. Once a hazard is spotted, its location is instantly added to $ \mathcal{G} $, reported to the central server, and shared with all UAVs.
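    As a minimal illustration of how these position sets can be encoded, the sketch below labels the cells of a small grid world with $ \mathcal{L} $, $ \mathcal{Z} $, $ \mathcal{B} $, and $ \mathcal{G} $; the integer labels and function names are illustrative assumptions rather than the paper’s exact encoding.

```python
import numpy as np

# Cell labels are illustrative assumptions, not the paper's exact encoding.
FREE, LANDING, NFZ, OBSTACLE, HAZARD = 0, 1, 2, 3, 4

def build_grid_world(M, landing, nfz, buildings, hazards):
    """Encode the position sets L, Z, B, and G into an M x M labeled grid."""
    grid = np.full((M, M), FREE, dtype=np.int8)
    for label, cells in ((LANDING, landing), (NFZ, nfz),
                         (OBSTACLE, buildings), (HAZARD, hazards)):
        for x, y in cells:
            grid[y, x] = label
    return grid

def report_hazard(grid, cell):
    """Once a UAV's sensors spot a hazard, add its cell to G; in the full
    system this update is broadcast to all UAVs via the central server."""
    x, y = cell
    grid[y, x] = HAZARD

# Example: an 8x8 world with one landing cell, a 2x2 NFZ, and a bird flock.
world = build_grid_world(8, landing=[(0, 0)],
                         nfz=[(4, 4), (4, 5), (5, 4), (5, 5)],
                         buildings=[(2, 6)], hazards=[(6, 2)])
report_hazard(world, (7, 7))
```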

    2.1.2 UAVs

    The set $ \mathcal{I} $ consists of all the employed UAVs, and the total count of employed UAVs flying within the bounds of the grid world $ \mathcal{M} $ is denoted by $ I $. The state of the $ i $-th UAV is described by the following parameters:

    • The position $ {{\boldsymbol{p}}_{i}} \left(t \right)={{\left[{{x}_{i}} \left(t \right){\mathrm{,}}\;{{y}_{i}} \left(t \right){\mathrm{,}}\;{{z}_{i}} \left(t \right) \right]}^{\text{T}}}\in {{\mathbb{R}}^{3}} $ where the parameter $ {{z}_{i}} \left(t \right) $ refers to the altitude of the UAV, which ranges from 0 to $ h $ depending on the UAV ability.

    • The operational status $ {{\phi }_{i}} \left(t \right)\in \left\{ 0{\rm{,}}\;1{\rm{,}}\;2{\rm{,}}\;3 \right\} $, representing inactive, active, slowing down, and delivering, respectively; statuses 2 and 3 are introduced below with (8).

    • The battery energy level $ {{b}_{i}} \left(t \right)\in \mathbb{N} $.

    The precise description of each UAV’s action space (movement) $ \mathcal{A} $ is as follows:

    $ \begin{array}{*{20}{l}} \mathcal{A}=\left\{ \begin{matrix} \underbrace{\left[ \begin{matrix} 0 \\ 0 \\ 0 \\ \end{matrix} \right]}_{\text{Hover}}{\mathrm{,}}\;&\underbrace{\left[ \begin{matrix} c \\ 0 \\ 0 \\ \end{matrix} \right]}_{\begin{smallmatrix} \text{East} \\ \text{move} \end{smallmatrix}}{\mathrm{,}}\;&\underbrace{\left[ \begin{matrix} 0 \\ c \\ 0 \\ \end{matrix} \right]}_{\begin{smallmatrix} \text{North} \\ \text{move} \end{smallmatrix}}{\mathrm{,}}\; & \underbrace{\left[ \begin{matrix} -c \\ 0 \\ 0 \\ \end{matrix} \right]}_{\begin{smallmatrix} \text{West} \\ \text{move} \end{smallmatrix}}{\mathrm{,}}\;&\underbrace{\left[ \begin{matrix} 0 \\ -c \\ 0 \\ \end{matrix} \right]}_{\begin{smallmatrix} \text{South} \\ \text{move} \end{smallmatrix}}{\mathrm{,}}\; &\underbrace{\left[ \begin{matrix} 0 \\ 0 \\ -h \\ \end{matrix} \right]}_{\text{Land}} \\ \end{matrix} \right\}\; \end{array} $ (5)

    The set consists of vectors for hovering, moving east, moving north, moving west, moving south, and landing. Within the set $ \mathcal{A} $, each step is denoted by the unit distance $ c $. Every single step takes the same period of time, denoted as $ t=1 $. This movement scheme automates the UAVs’ choice of the right wing rotation [23]–[26]. In the landing mode, a UAV stands still and lands by changing its coordinates by $ [0{\mathrm{,}}\;0{\mathrm{,}}\;-h]^{\rm{T}} $, where $-h$ means the UAV changes its flying height from $h$ to 0. However, (6) is employed to prohibit UAVs from landing outside the designated area $ \mathsf{\mathcal{L}} $. The movement actions of UAVs, denoted as $ {{\boldsymbol{a}}_{i}} \left(t \right) $, are restricted to the set $ \tilde{\mathcal{A}} \left({{\boldsymbol{p}}_{i}} \left(t \right) \right) $. This limitation ensures that the UAVs only land within the designated landing area.

    $ \begin{array}{*{20}{l}} \tilde{\mathcal{A}} \left( {{\boldsymbol{p}}_{i}} \left( t \right) \right)=\left\{ \begin{array}{*{35}{l}} \mathcal{A}{\mathrm{,}}\; & {{\boldsymbol{p}}_{i}} \left( t \right)\in \mathcal{L} \\ \mathcal{A}\backslash {{\left[ 0{\mathrm{,}}\;0{\mathrm{,}}\;-h \right]}^{\text{T}}}{\mathrm{,}}\; & \text{otherwise} \\ \end{array} \right. \end{array} $ (6)

    After every action taken by the UAV, the position needs to be updated accordingly and given as

    $ \begin{array}{*{20}{l}} {{\boldsymbol{p}}_{i}} \left( t+1 \right)=\left\{ \begin{array}{*{35}{l}} {{\boldsymbol{p}}_{i}} \left( t \right)\text{,}\; & {{\phi }_{i}} \left( t \right)=0 \\ {{\boldsymbol{p}}_{i}} \left( t \right)+{{\boldsymbol{a}}_{i}} \left( t \right){\mathrm{,}}\; & \text{otherwise} \\ \end{array} \right. \end{array} $ (7)
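    To make the action model concrete, the sketch below implements the action set of (5), the landing restriction of (6), and the position update of (7); the unit step $c$ and height $h$ are set to illustrative values.

```python
import numpy as np

C, H = 1, 1  # unit step length c and flying height h (illustrative values)

# Action set A of Eq. (5): hover, east, north, west, south, land.
ACTIONS = {
    "hover": np.array([0, 0, 0]),
    "east":  np.array([C, 0, 0]),
    "north": np.array([0, C, 0]),
    "west":  np.array([-C, 0, 0]),
    "south": np.array([0, -C, 0]),
    "land":  np.array([0, 0, -H]),
}

def allowed_actions(position, landing_cells):
    """Eq. (6): the land action is only available inside the landing area L."""
    if (position[0], position[1]) in landing_cells:
        return dict(ACTIONS)
    return {name: a for name, a in ACTIONS.items() if name != "land"}

def next_position(position, action, status):
    """Eq. (7): an inactive UAV (status 0) keeps its position; otherwise it moves."""
    return position if status == 0 else position + action

# A UAV away from L cannot land but can still move or hover.
p = np.array([3, 2, H])
moves = allowed_actions(p, landing_cells={(0, 0)})
p = next_position(p, moves["east"], status=1)  # -> [4, 2, 1]
```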

    The hazard locations are accounted for by modifying the operational status update in (8), where the traditional status changes from 1 to 0 in the landing case. Status 3 in (8) represents that the UAV has reached a delivery area and needs to stand still and run the winch system to deliver the payloads. Status 2 in (8) represents that the UAV is passing a hazard area and needs to slow down while continuing to follow the same generated path. The slowing-down condition is represented as $ \mathcal{S} $, which is a UAV internal operation.

    $ \begin{array}{*{20}{l}} {{\phi }_{i}}(t+1)=\left\{ \begin{array}{*{35}{l}} 0{\mathrm{,}}\; & {{\boldsymbol{a}}_{i}}(t)={{\left[ 0{\mathrm{,}}\;0{\mathrm{,}}\;-h \right]}^{\text{T}}} \\ {} & \vee {{\boldsymbol{a}}_{i}}(t)\in \mathcal{G}\vee {{\phi }_{i}}(t)=0 \\ 2{\mathrm{,}}\; & \mathcal{S} \\ 3{\mathrm{,}}\; & {{\boldsymbol{a}}_{i}}(t) = \left[ 0{\mathrm{,}}\;0{\mathrm{,}}\;0 \right]^{\text{T}} \\ 1{\mathrm{,}}\; & \text{otherwise} \\ \end{array} \right. \end{array} $ (8)

    Each sensor usage and each movement or action taken by a UAV consumes a certain amount of battery. Therefore, the total battery capacity $ E $ of each UAV must be determined carefully and precisely to ensure a successful return to the optimal landing point. As stated in Ref. [17], the power consumption of hovering and moving can be regarded as equivalent. According to the model offered by Ref. [27], the power consumption of UAVs is much higher at low operating speeds than at high speeds. For this reason, the power consumption of slowing down (status 2) and delivering (status 3) should be higher than that of moving. In a standard scenario, the UAV consumes a single unit of electricity, i.e. $ e $, for each movement. However, in the slowing-down and delivering modes, the consumed electricity is adjusted to $ \rho e $, with $ \rho $ expected to be at least 2 in the slowing-down mode and at least 3 in the delivering mode.

    $ \begin{array}{*{20}{l}} {{b}_{i}} \left( t+1 \right) = \left\{ \begin{array}{*{35}{l}} {{{b}_{i}} \left( t \right) - e}{\mathrm{,}}\; & {{\phi }_{i}} \left( t \right)=1 \\ {{{b}_{i}} \left( t \right) - \rho e}{\mathrm{,}}\; & {{\phi}_{i}} \left( t \right) = 2 \text{ or } 3\\ {{{b}_{i}} \left( t \right)}{\mathrm{,}}\; & \text{otherwise} \\ \end{array} \right. \end{array} $ (9)

    where $ {{b}_{i}} \left(t \right) $ is the current battery capacity in the $ t $-th time slot.
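    The status and energy bookkeeping of (8) and (9) can be sketched as follows; the boolean inputs classifying a step (landing, hazard crossing, delivery) are assumptions standing in for the full environment checks.

```python
def next_status(status, action_is_land, in_hazard, at_delivery_point):
    """Sketch of Eq. (8): landing (or already being inactive) gives status 0;
    passing a hazard cell gives 2 (slow down); standing still at a delivery
    point to run the winch gives 3; everything else gives 1 (active)."""
    if action_is_land or status == 0:
        return 0
    if in_hazard:
        return 2
    if at_delivery_point:
        return 3
    return 1

def next_battery(b, status, e=1, rho_slow=2, rho_deliver=3):
    """Eq. (9): moving costs e, slowing down or delivering costs rho * e
    (rho = 2 and 3 here, matching the experimental settings of Section 4),
    and an inactive UAV consumes nothing."""
    if status == 1:
        return b - e
    if status == 2:
        return b - rho_slow * e
    if status == 3:
        return b - rho_deliver * e
    return b
```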

    In this case, the greedy strategy involves each UAV making a locally optimal decision based on its immediate energy constraints, prioritizing the closest feasible landing point instead of the globally optimal one. Due to high power consumption, UAVs may not have enough energy to return to their designated landing zones. Therefore, they choose the nearest landing site to ensure a safe touchdown, even if it compromises the overall mission efficiency. To address these issues, the concept of optimal and urgent landing areas is used to let the UAVs decide where to land based on their battery capacity. Also, statuses 1–3 play an important role in penalizing the UAV’s decision if it is still able to return to the optimal landing area or to take on another mission.

    2.2 Design objective

    To optimize the efficiency of UAV missions, it is crucial to carefully design the delivery points and mission cost, taking into account the specific requirements of real missions. Additionally, it is important to examine the extent of delivery point coverage. Therefore, the optimization aim can be divided into the following expectations:

    • The number of payloads delivered safely and accurately should be as large as possible.

    • The delivery coverage should be as high as possible, meaning that payloads should be delivered to as many delivery points as feasible.

    • The objective is to minimize the loss, thereby ensuring the safe return of UAVs to the designated landing spots.

    Then, the optimization issue can be evaluated by computing the aforementioned three expectations as follows:

    $ \underset{{{\boldsymbol{a}}_{i}}\left( t \right)}{{{\mathrm{max}} }}\;\delta \sum\limits_{t=0}^{T}{\sum\limits_{i=1}^{I}{{{C}_{i}} \left( t \right)+\zeta \sum\limits_{i=1}^{I}{{{R}_{i}}+\gamma \sum\limits_{i=1}^{I}{{{Q}_{i}}}}}}\; $ (10)

    where each item corresponds to one of the expectations and the weight of each expectation is determined by the three constants $ \delta $, $ \zeta $, and $ \gamma $. $ \sum\limits_{t=0}^{T}{\sum\limits_{i=1}^{I}{{{C}_{i}} \left(t \right)}} $ refers to the total amount of payloads delivered by $ I $ UAVs in $ T $ time slots. $ \sum\limits_{i=1}^{I}{{{R}_{i}}} $ denotes the number of delivery points to which payloads are delivered. $ \sum\limits_{i=1}^{I}{{{Q}_{i}}} $ indicates the number of UAVs that successfully return.
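    As a small worked example, the objective of (10) can be evaluated directly from mission statistics; the weights below are placeholders rather than the values used in the experiments.

```python
import numpy as np

def mission_objective(C, R, Q, delta=1.0, zeta=1.0, gamma=1.0):
    """Eq. (10): weighted sum of delivered payloads, covered delivery points,
    and successful returns. C is a T x I array of per-slot, per-UAV payload
    deliveries; R[i] counts points served by UAV i; Q[i] is 1 if UAV i
    returned successfully, else 0."""
    return delta * np.asarray(C).sum() + zeta * np.sum(R) + gamma * np.sum(Q)

# Example: 2 UAVs over 3 time slots, both returning safely.
score = mission_objective(C=[[1, 0], [2, 1], [0, 3]], R=[2, 2], Q=[1, 1])
```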

    3 Model

    In this section, the proposed model is detailed. First, Fig. 2 depicts the DQN approach, in which each UAV behaves as an independent agent inside a specific map. The model integrates the state space, Q-network, reward, and action functions: an agent makes a decision based on the current situation using the action function, and this decision is then assessed and rewarded. Simultaneously, the executed action results in a corresponding modification of the state, potentially impacting the follow-up action. To explain the mechanism and operational process of the suggested model, its components are presented in turn: Dual-view state conversion, state space, Q-network construction, reward function, and the DDQN training process.

    Figure 2. Illustration of RL for UAVs. In this illustration, π(s) indicates the action function π(·) with the current state s as input; action i-j, reward i-j, and state i-j refer to the movement, the received rewards, and the new state for UAV i in the j-th round, respectively.

    3.1 Dual-view state conversion

    Concerning diverse UAVs, it is essential for them to focus on the contextual information in their nearby areas, such as the surroundings, delivery locations, and nearby UAVs. Hence, it is imperative to create a state representation for each UAV in which that UAV is located at the center of the map. However, since various UAVs may be deployed in different areas of the region, it may not be realistic to have an equal-sized nearby area, which is a requirement for DQN. The standard strategy of expanding the map for every UAV [17] was followed. The centered and non-centered maps for a UAV are illustrated in Fig. 3.

    Figure 3. Scenes of input maps in two forms: (a) centered and (b) non-centered, where the location of the UAV is shown by the star symbol at the point where the dashed lines connect.
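    A minimal sketch of this map expansion is given below, assuming a reserved pad value marks out-of-map cells.

```python
import numpy as np

def center_map(state_map, uav_xy, pad_value=0):
    """Expand an M x M state map to (2M-1) x (2M-1) so that the UAV located at
    uav_xy occupies the central cell, following the standard strategy of
    Ref. [17]. Using pad_value for out-of-map cells is an assumption; any
    reserved label works."""
    m = state_map.shape[0]
    x, y = uav_xy
    padded = np.full((2 * m - 1, 2 * m - 1), pad_value, dtype=state_map.dtype)
    # Place the original map so that cell (x, y) lands at the center (m-1, m-1).
    padded[m - 1 - y : 2 * m - 1 - y, m - 1 - x : 2 * m - 1 - x] = state_map
    return padded

centered = center_map(np.arange(16).reshape(4, 4), uav_xy=(1, 2))
assert centered[3, 3] == 9  # the UAV's own cell (x=1, y=2) is now central
```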

    3.2 State space

    The state space in a standard DQN model represents the complete set of possible scenarios within the environment. The grid world serves as the foundation for creating the state space, with each grid being assigned a number that represents a specific feature of the state within that grid. The control station will globally manage the statuses of all UAVs, including planning the flight path of each UAV. This allows UAVs to collaborate with each other to successfully complete the task and achieve the desired objectives [28].

    In this study, the state space of the scenario is defined as (11), consisting of three dimensions:

    $ \begin{array}{*{20}{l}} \Omega =\Omega_{\text{phy}}\times \Omega_{\text{hazard}}\times \Omega_{\text{return}} \end{array} $ (11)

    where $ \Omega_{\text{phy}} $ represents the physical map that describes the $ \mathcal{L} $, $ \mathcal{B} $, and $ \mathcal{Z} $ areas; $ \Omega_{\text{hazard}} $ is the hazard map, including the bird gathering areas, i.e. $ \mathcal{G} $; $ \Omega_{\text{return}} $ represents the return rate map, where both urgent and optimal landing areas are included.

    The return rate is a state representation method developed for this work. Its objective is to offer an accurate assessment of the remaining operational battery life of the drone with regard to the value of the delivery locations (the number of delivered points, remote-area coverage, and the travel distance of the action). An illustration of the return rate estimation is shown in Fig. 4.

    Figure 4. Illustration of the return rate estimation when a UAV moves to a specific grid (like the green grid). The estimation considers the steps moving to the grid; the distances among the grid, each delivery point (DP, orange grid), the optimal landing area (blue grid), and the urgent landing area (gray grid); as well as the remaining delivery points. In the left figure, b(i) indicates the battery level of UAV i and Δk indicates the size of payloads to be delivered at this grid. In the right one, the annotated values are the estimated return rates when UAV i flies to each of these grids.

    3.3 Q-network construction

    Using the provided specification of the state space and the map processing, the Q-network determines the best actions for UAVs based on their prior states. Fig. 5 shows the construction of DQN.

    Figure 5. Architecture of DQN in the proposed model. The convolutional Q-network encodes various information present in the region. Here fcenter, flocal, and fglobal indicate the functions deriving central, local, and global features, respectively; π(a|s) indicates the probability of each action a given current state s; π(s) refers to the strategy function of action.

    The structure of the network starts by combining all state maps into a single hyper map, where each grid is labeled with a unique number representing a specific characteristic of the state. Afterwards, the environment map is combined with the physical map, hazard map, flight duration map, operational status map, and return map. These maps collectively represent the present position using the centering function. Second, the outcome of the initial phase is passed to the global and local mapping functions. The experiment includes two maps with different shapes, and the convolution layers in the global mapping function are adjusted accordingly: For one map, the sizes are $ 21 \times 21 \times 6 $ and $ 17 \times 17 \times 16 $, while for the other, the sizes are $ 19 \times 19 \times 6 $ and $ 15 \times 15 \times 16 $. In local mapping, the dimensions are consistently $ 17 \times 17 \times 6 $ and $ 13 \times 13 \times 16 $. The features gained from this second phase are then flattened into a one-dimensional vector with a length of either 4001 or 3233. Subsequently, the data is fed into the hidden layers, comprising three levels of equal size, each consisting of 256 units. Finally, it is directed to the output layer, which has a size of 6. The neural network determines the optimal action for the current and next states by calculating the maximum value and applying a normalized exponential function, respectively.
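    The sketch below reproduces this construction in PyTorch under stated assumptions: 5×5 valid convolutions match the reported shape changes (e.g. 21×21×6 to 17×17×16 and 17×17×6 to 13×13×16), while the branch fusion and the resulting flattened length are simplifications, since the paper reports only the final vector sizes of 4001 or 3233.

```python
import torch
import torch.nn as nn

class UAVQNetwork(nn.Module):
    """Sketch of the dual-view convolutional Q-network: a global and a local
    branch, three 256-unit hidden layers, and one Q-value per action in A."""

    def __init__(self, channels=6, global_size=21, local_size=17, n_actions=6):
        super().__init__()
        self.global_conv = nn.Sequential(nn.Conv2d(channels, 16, 5), nn.ReLU())
        self.local_conv = nn.Sequential(nn.Conv2d(channels, 16, 5), nn.ReLU())
        flat = 16 * (global_size - 4) ** 2 + 16 * (local_size - 4) ** 2
        self.head = nn.Sequential(
            nn.Linear(flat, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, global_map, local_map):
        g = self.global_conv(global_map).flatten(start_dim=1)
        l = self.local_conv(local_map).flatten(start_dim=1)
        return self.head(torch.cat([g, l], dim=1))

# A batch of 4 dual-view states with 6 feature channels per map.
q = UAVQNetwork()(torch.zeros(4, 6, 21, 21), torch.zeros(4, 6, 17, 17))  # (4, 6)
```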

    3.4 Reward function

    The reward function ${{r}_{i}}(t) $ is an abstraction that translates the optimization goal into an achievable goal, assisting the model in judging the performance of an action. It is given as

    $ \begin{array}{*{20}{l}} {{r}_{i}}(t)=\alpha w(i)+{{\beta }_{i}}(t)+{{\varphi }_{i}}(t) + {{\eta }_{i}}(t) + {{\xi }_{i}}(t) +\varepsilon \end{array} $ (12)

    where $ \alpha $ denotes a hyper-parameter whose value is the same for every UAV at a particular time; $ {{\beta }_{i}}(t) $ represents the penalty that prevents UAVs from taking unexpected movements, including entering the $ \mathcal{Z} $ area, crashing into each other, or landing outside the $ \mathcal{L} $ area; the punishment $ {{\varphi }_{i}}(t) $ is applied when a UAV passes over hazard areas/flocks of birds; $ \varepsilon $ refers to the constant punishment for UAVs’ movement, which forces UAVs to reduce the flying time and seek the most efficient path to finish missions; $ {{\eta }_{i}}(t) $ denotes the punishment when a UAV does not land in time; $ {{\xi }_{i}}(t) $ is the punishment for landing in the urgent space, where $ |{{\beta }_{i}}(t)| > {{\xi }_{i}}(t) $; $ w(i) $ is the summation of all grid return rates for a particular UAV, defined as

    $ \begin{array}{*{20}{l}} w(i) = {\text{e}}^{-b(i)} \times \dfrac{\Delta k}{X_1 + X_2} \end{array} $ (13)

    $ \begin{array}{*{20}{l}} X_1 = \displaystyle\sum\limits_{\text {urgent}} {\rm{dis}}_{\text {back }}\left({\rm{DP}}(k){\mathrm{,}}\; L_{\text{urgent}}\right) \end{array} $ (14)

    $ \begin{array}{*{20}{l}} X_2 = {\rm{dis}}_{\text {back }}\left({\rm{DP}}(k){\mathrm{,}}\; L_{\text{optimal}}\right) \end{array} $ (15)

    where ${\rm{DP}}(k) $ is the $k $-th delivery point, $ b(i) $ is the remaining battery power of UAV $ i $, $ {\rm{dis}}_{\text {back}}(\cdot) $ denotes the shortest distance for the UAV to return to the landing position from the current position through the $ k $-th delivery point, $ {\rm{dis}}_{\text {back}}\left({\rm{DP}}(k){\mathrm{,}}\; L_{\text{urgent}}\right) $ is computed over the distance between the $k $-th delivery point and $L_{\text{urgent}} $ while $ {\rm{dis}}_{\text {back}}\left({\rm{DP}}(k){\mathrm{,}}\; L_{\text{optimal}}\right) $ is computed over that between the $k $-th delivery point and $L_{\text{optimal}} $, $ \Delta k $ is the number of packages to be delivered to the delivery points, $X_1$ is the total movement cost for the current UAV to reach all urgent landing points after passing through the $ k $-th delivery point, and $X_2 $ is the cost to return to the original landing point. Equation (13) thus gives the return rate of the UAV’s delivery for the $ k $-th delivery point.
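    A direct sketch of (13)–(15) follows; the distance function is injected as a callable because the paper leaves the underlying path metric abstract.

```python
import math

def return_rate(b_i, delta_k, dp, urgent_sites, optimal_site, dist_back):
    """Eqs. (13)-(15): return rate of delivering to the k-th delivery point dp,
    given remaining battery b_i, payload count delta_k, and a dist_back(dp, site)
    callable giving the shortest return path length through dp."""
    x1 = sum(dist_back(dp, u) for u in urgent_sites)  # Eq. (14)
    x2 = dist_back(dp, optimal_site)                  # Eq. (15)
    return math.exp(-b_i) * delta_k / (x1 + x2)       # Eq. (13)

# Toy usage with Manhattan distance standing in for the planner's metric.
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
w = return_rate(b_i=0.4, delta_k=12, dp=(3, 5),
                urgent_sites=[(1, 1), (6, 2)], optimal_site=(0, 0),
                dist_back=manhattan)
```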

    Obviously, the developed reward function has considered various realistic environmental factors to assist Q-networks in making effective decisions.

    3.5 DDQN training process

    The DDQN algorithm was employed, and a multi-round training approach for the path planning model was implemented. This approach aims to optimize the expected cumulative rewards and enhance the performance of the training model. The loss function in DDQN is defined as

    $ \begin{array}{*{20}{l}} {{L}^{\rm{DDQN}}} \left( \theta \right)={{\mathbb{E}}_{s{\mathrm{,}}{\boldsymbol a}{\mathrm{,}}{s}'\sim \mathcal{D}}}\left[ {{\left( {{Q}_{\theta }} \left( s{\mathrm{,}}\;{\boldsymbol a} \right)-Y\left( s{\mathrm{,}}\;{\boldsymbol a}{\mathrm{,}}\;{s}' \right) \right)}^{2}} \right] \end{array} $ (16)

    where $\theta $ represents the parameter set of DDQN, $s $ refers to the current state, $ \boldsymbol{a} $ refers to the current action, $s' $ refers to the next state (which corresponds to $ s_t $, $ \boldsymbol{a}_t $, and $ s_{t+1} $ in the notation), $ {\mathcal{D}} $ represents the replay memory, ${\mathbb{E}} $ indicates the expectation, and the target value is given as

    $ \begin{array}{*{20}{l}} Y\left( s{\mathrm{,}}\;{\boldsymbol a}{\mathrm{,}}\;{s}' \right)=r\left( s{\mathrm{,}}\;{\boldsymbol a} \right)+\gamma {{Q}_{{\bar{\theta }}}} \left( {s}'{\mathrm{,}}\;\underset{{{\boldsymbol {a}}'}}{ {{\mathrm{arg}}\, {\mathrm{max}} }}\,{{Q}_{\theta }} \left( {s}'{\mathrm{,}}\;{\boldsymbol {a}}' \right) \right) \end{array} $ (17)

    Besides, the Q function is defined as

    $ \begin{array}{*{20}{l}} {{Q}^{\pi }} \left( s{\mathrm{,}}\; {\boldsymbol a} \right)={{\mathbb{E}}_{\pi }}\left[ {{G}_{t}}\left| {{s}_{t}}=s{\mathrm{,}}\; {{\boldsymbol{a}}_{t}}={\boldsymbol a} \right. \right]\; \end{array} $ (18)

    with $ {{G}_{t}} $ indicating the discounted cumulative return, defined as

    $ {{G}_{t}}=\sum\limits_{k=t}^{T}{{{\gamma }^{k-t}}r\left( {{s}_{k}}{\mathrm{,}}\;{{\boldsymbol{a}}_{k}} \right)}\; $ (19)

    Here, $ \gamma $ represents a constant that determines the value of long-term and short-term yields; $ r\left(s {\mathrm{,}}\; {\boldsymbol a} \right) $ denotes the expected reward when performing action $ {\boldsymbol a} $ given the state $ s $.
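    A PyTorch sketch of this update follows; the replay-batch layout is an assumption, and any Q-network mapping states to per-action values can be plugged in.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_net, target_net, batch, gamma=0.99):
    """DDQN loss of Eqs. (16) and (17): the online network selects the next
    action, while the target network (parameters theta-bar) evaluates it.
    `batch` holds (s, a, r, s_next, done) tensors sampled from replay memory D."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_theta(s, a)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)  # argmax_a' Q_theta(s', a')
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next               # target Y(s, a, s')
    return F.mse_loss(q_sa, y)                              # Eq. (16)
```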

    Regarding the selection of an action for each UAV in every stage, the strategies $ \pi (s) $ and $ \pi \left({{\boldsymbol{a}}_{i}}|s \right) $, the latter produced from a normalized exponential function, are utilized; they are provided in (20) and (21), respectively.

    $ \begin{array}{*{20}{l}} \pi \left( s \right)=\underset{{\boldsymbol a}\in \mathcal{A}}{ {{\mathrm{arg}}\, {\mathrm{max}} }}\,{{Q}_{\theta }} \left( s{\mathrm{,}}\;{\boldsymbol a} \right) \end{array} $ (20)

    $ \pi \left( {{\boldsymbol{a}}_{i}}|s \right)=\frac{{{\text{e}}^{{{Q}_{\theta }}\left( s{\mathrm{,}}\;{{\boldsymbol{a}}_{i}} \right)/\beta }}}{\displaystyle\sum_{\forall {{\boldsymbol{a}}_{j}}\in \mathcal{A}}{{\text{e}}^{{{Q}_{\theta }}\left( s{\mathrm{,}}\;{{\boldsymbol{a}}_{j}} \right)/\beta }}} $ (21)
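    Both strategies reduce to a few lines; the sketch below assumes a vector of Q-values for the current state.

```python
import torch

def greedy_action(q_values):
    """Eq. (20): pick the action with the maximal Q-value."""
    return int(torch.argmax(q_values))

def boltzmann_action(q_values, beta=1.0):
    """Eq. (21): sample from a normalized exponential (softmax) over Q-values;
    the temperature beta trades exploration off against exploitation."""
    probs = torch.softmax(q_values / beta, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```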

    The pseudo code which adopts the whole model into the emergency payload delivery is shown in Algorithm 1.

    Algorithm 1 Multi-UAV path planning for multiple emergency payloads delivery in natural disaster environments
      1: Input: Grid world $ \mathcal{M} $, delivery points $ K $, utilized UAVs $ \mathcal{I} $, a central server $ {\text {Base}} $, and total time period $ T $
      2: Output: Action series $ {\text {Action}} $ for all UAVs
      3: for each time slot $ t $ in $ T $ do
      4:   for each UAV $ U_i $ in $ \mathcal{I} $ do
      5:     $ U_i $ requests $ k $ consecutive actions from $ \text {Base} $
      6:     for $ j $ from $ 1 $ to $ k $ do
      7:       Run the proposed model
      8:       $ {\text{Action}}_i \leftarrow \underset{{\boldsymbol a}\in \mathcal{A}}{ {{\mathrm{arg}}\, {\mathrm{max}} }}\,{{Q}_{\theta }} \left(s {\mathrm{,}}\; {\boldsymbol a} \right) $
      9:     end for
      10:    $ \text {Base} $ returns $ {\text{Action}}_i $ to $ U_i $
      11:  end for
      12:  All UAVs execute $ k $ actions
      13:  $ t = t + k $
      14: end for
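    Read as code, Algorithm 1 amounts to the following skeleton, where `base.plan` and `uav.step` are hypothetical stand-ins for the trained model and the UAV actuation.

```python
def run_mission(base, uavs, T, k):
    """Runnable skeleton of Algorithm 1: in each round, every UAV requests k
    consecutive greedy actions from the central server, then all UAVs execute
    them in lockstep before the clock advances by k slots."""
    t = 0
    while t < T:
        plans = {uav: base.plan(uav, k) for uav in uavs}  # lines 4-11
        for uav, actions in plans.items():                # line 12
            for action in actions:
                uav.step(action)
        t += k                                            # line 13
```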

    4 Simulation

    This section focuses on offering solutions for the task of multi-UAV path planning in a natural disaster scenario, specifically for delivering payloads. To replicate real-life conditions, two maps, namely Manhattan32 and Urban50, are employed as the task environments for UAVs. Both maps depict two actual cities characterized by vacant land and numerous buildings with varying heights, shapes, and floor areas. These features pose a challenge for the proposed UAV model, as it must adjust to the intricacies of the real environment. Additionally, both maps are configured with at least one NFZ area.

    4.1 Experimental setup

    4.1.1 Comparison method

    Current studies do not adequately cover the scenario assumed in this study, because the scenario’s settings are difficult to adjust due to the complex procedure. Thus, Ref. [17] is selected as the baseline for comparison. The efficacy of the model is evaluated using the following metrics:

    • Return success rate (RSR): The number of drones that landed successfully as a percentage of the overall number of UAVs. RSR covers both optimal and urgent landings: A drone should return to the optimal landing area, but if circumstances work against it, it may land in an urgent landing area.

    • Delivery ratio: The ratio of delivered payloads to overall delivery points.

    • Delivery ratio and landed: The product of RSR and the delivery ratio, used to jointly measure delivery success and drone return.

    4.1.2 System setup

    1) Manhattan32

    The Manhattan32 scenario is based on a Manhattan-like city structure containing mostly regularly distributed city blocks with streets in between, as well as two NFZ districts and an open space in the upper left corner, divided into $ {{M}^{2}}=32\times 32 $ cells. In this map, each scenario parameter is defined as a value randomly selected within the corresponding range. They include the number of deployed UAVs $ I\in \left[1{\mathrm{,}}\;8 \right] $; the number of delivery points $ K\in \left[1 {\mathrm{,}}\; 10 \right] $; the payloads (medical kits) to be delivered for each delivery point $ {{D}_{k{\mathrm{,init}}}}\in \left[10 {\mathrm{,}}\; 25 \right] $; the maximum battery capacity $ {{b}_{0}}\in \left[100 {\mathrm{,}}\; 500 \right] $; the battery consumption for moving $ e=1 $, for slowing down $ \rho e $ with $ \rho =2 $, and for delivering $ \rho e $ with $ \rho =3 $.

    The delivery points and hazards’ positions are randomly distributed across the empty regions of the map, specifically the grids marked with 0.

    2) Urban50

    The Urban50 scenario is characterized by an urban layout in which the variations in the shape and size of buildings are more pronounced compared to the Manhattan32 scenario. The map has dimensions of $ {{M}^{2}}=50\times 50 $ and includes an NFZ area, a central building, and the surrounding $ \mathsf{\mathcal{L}} $ area. The parameters of Urban50 can be characterized in a manner similar to the settings for Manhattan32. They are the number of deployed UAVs $ I\in \left[1{\mathrm{,}}\; 8 \right] $; the number of delivery points $ K\in \left[1 {\mathrm{,}}\; 10 \right] $; the payloads (medical kits) to be delivered for each delivery point $ {{D}_{k{\mathrm{,init}}}}\in \left[5 {\mathrm{,}}\; 25 \right] $; the maximum battery capacity $ {{b}_{0}}\in \left[150 {\mathrm{,}}\; 550 \right] $; the battery consumption for moving $ e=1 $, for slowing down $ \rho e $ with $ \rho =2 $, and for delivering $ \rho e $ with $ \rho =3 $.

    4.2 Outcome analysis and comparison

    The experimental procedures for both Manhattan32 and Urban50 were replicated 20 times. Both models in the two maps employed Monte Carlo simulations to evaluate their performance using a comprehensive set of scenario parameters and overall average performance metrics. The results are displayed in Tables 1 and 2.

    Table 1. Simulated results for Manhattan32.

      Manhattan32                  Baseline    Our proposed method
      RSR (optimal)                91.52%      51.69%
      RSR (urgent landing)         /           47.69%
      RSR (total)                  91.52%      99.38%
      Delivery ratio               84.32%      88.40%
      Delivery ratio and landed    77.17%      87.86%
    Table 2. Simulated results for Urban50.

      Urban50                      Baseline    Our proposed method
      RSR (optimal)                89.52%      62.42%
      RSR (urgent landing)         /           34.21%
      RSR (total)                  89.52%      96.63%
      Delivery ratio               82.36%      87.01%
      Delivery ratio and landed    73.72%      84.07%

    For the Manhattan32 map as shown in Table 1, the proposed method demonstrates varied performance when compared to the baseline across several metrics. For RSR under optimal conditions, the baseline excels with a 91.52% success rate, significantly higher than the proposed method’s 51.69%. However, the proposed method introduces an urgent landing success rate of 47.69%, which has not been considered by the baseline. This inclusion results in a total RSR value of 99.38% for the proposed method, outperforming the baseline’s 91.52%. Additionally, the proposed method achieves a higher delivery ratio of 88.40% compared to the baseline’s 84.32%, indicating more efficient delivery performance. More importantly, when considering both delivery and successful return, the proposed method shows superior performance with an 87.86% success rate, compared to the baseline’s 77.17%. These results highlight the proposed method’s robustness and ability to handle diverse conditions effectively.

    For the Urban50 map as shown in Table 2, the baseline again leads in optimal landing success with an 89.52% RSR value, while the proposed method achieves 62.42%. Despite this, the proposed method introduces a 34.21% success rate for urgent landing, which the baseline has not addressed. This capability contributes to a higher total RSR value of 96.63% for the proposed method, compared to the baseline’s 89.52%. Furthermore, the proposed method improves the delivery ratio, achieving 87.01% versus the baseline’s 82.36%, reflecting better delivery efficiency. When combining delivery success with safe return, the proposed method outperforms the baseline with an 84.07% success rate compared to 73.72%. Overall, these findings suggest that while the baseline performs better in optimal conditions, the proposed method offers a more versatile and reliable solution, particularly in handling urgent situations and ensuring successful delivery and return in complex urban environments.

    Figs. 6 and 7 display the trajectories of UAVs when two and three UAVs are deployed on the two maps, which indicate that our approach successfully navigates through hazard zones in all circumstances.

    Figure 6. Example trajectories for the Manhattan32 map: Baseline with (a) 2 and (b) 3 agents; our proposed method with (c) 2 and (d) 3 agents.

    Figure 7. Example trajectories for the Urban50 map: Baseline with (a) 2 and (b) 3 agents; our proposed method with (c) 2 and (d) 3 agents.

    For example, as shown in Figs. 6(a) and (b), when a baseline UAV plans to return to the base, it may fall short of the base rather than diverting to an urgent landing area. Obviously, the proposed method can effectively finish the delivery tasks and automatically choose the best landing area. Simultaneously, the proposed method is capable of thoroughly taking into account the delivery points located throughout the entire map. As depicted in Figs. 7(c) and (d), the action trajectories of the three UAVs in the proposed method cover all the regions on the map where the delivery points are located. In contrast, the baseline method successfully delivered to all the points but failed to consider the return criteria. In Fig. 6(b), the baseline takes the strange action of delivering payloads to points that have already been handled by other UAVs (blue and black).

    To provide a clearer representation of the delivery points’ coverage, the proportion of delivery points successfully served with different numbers of UAVs is presented in Fig. 8. The results show that the proposed model generally has a higher delivery point coverage capability than the baseline.

    Figure 8. Delivery points coverage situations: (a) Manhattan32 and (b) Urban50.

    4.3 Influence of scenario parameters

    In this subsection, the operation of the model is analyzed across various parameters to confirm its stability. The outcomes are given as follows:

    • The number of UAVs deployed: Fig. 9 illustrates how the number of UAVs sent out affects the final delivery ratio, while keeping the other parameter settings constant. The delivery ratio of our proposed method rises rapidly in the initial phase, yielding a higher delivery ratio with fewer UAVs. For the Manhattan32 map as depicted in Fig. 9(a), performance reaches a stable state after 4 drones are deployed, while for the Urban50 map as illustrated in Fig. 9(b), it tends to stabilize after 5.

    Figure 9. Impact of the number of UAVs deployed: (a) Manhattan32 and (b) Urban50.

    • The number of delivery points: Fig. 10 illustrates the impact of the number of delivery points on the delivery ratio. The proposed method can achieve a delivery ratio of 100% when there are few delivery points, and the ratio decreases as the number of delivery points increases. However, as depicted in Figs. 10(a) and (b), the proposed method exhibits a slower and more consistent decrease in this metric, indicating a high level of stability in the model.

    Figure 10. Impact of the number of delivery points: (a) Manhattan32 and (b) Urban50.

    • Maximum battery capacity: Fig. 11 illustrates the influence of the maximum power capacity of the UAV battery. An enhanced battery capacity enables a UAV to operate for longer periods, enabling the exploration of a wider area. Consequently, as the maximum battery charge increases, the delivery ratio increases. Our proposed method achieves superior performance while requiring a lower maximum charge: As illustrated in Fig. 11(a), the delivery ratio of our proposed method starts to stabilize after reaching a maximum charge of 100.

    Figure 11. Influence of maximum battery capacity: (a) Manhattan32 and (b) Urban50.

    The above results all indicate the resilience of the proposed model’s performance.

    Notably, our proposed model has two key limitations that still need to be addressed for further improvement: The impact of weather conditions, particularly crosswinds, on energy consumption, and the handling of varying payload sizes and weights. Crosswinds can significantly affect the UAV’s energy efficiency depending on its travel direction. Strong crosswinds may require more power to maintain stability and course, leading to higher energy consumption and potentially reducing the UAV’s range. Currently, our proposed model accounts for these factors by assigning a fixed power consumption multiplier $ \rho $ for each movement. However, this can be upgraded by the central control. A fully automated scenario might incorporate nonlinear weighting functions that reflect the additional energy costs imposed by varying weather conditions. These functions would factor in the wind speed, direction, and turbulence, providing a more realistic simulation of energy consumption during flight. Similarly, varying payload sizes and weights represent another important challenge. The UAV’s energy consumption increases with heavier payloads, and failing to account for this can lead to inefficiency in path planning, as UAVs carrying heavier loads might require additional power to complete their routes. The proposed approach assumes a static payload weight, which limits its adaptability to real-world delivery scenarios. To solve this, a dynamic payload management system that adjusts the energy consumption model based on real-time payload measurements might be introduced. This enhancement would allow the system to optimize routes not only for the most efficient paths but also for varying loads, ensuring more accurate and energy-efficient delivery operations.

    5 Conclusions

    This paper focuses on the issue of multi-UAV path planning in a natural disaster scenario, specifically addressing the challenge of delivering first aid to areas that are difficult to reach. The hazard areas are presumed to present a potential danger from birds or unexpected objects. The primary objective of path planning is to achieve precise and optimal coverage of delivery points for maximum payload and to establish expedient landing stations. Hence, a novel DQN model is developed for cooperative path planning. This model expands the state space to include detailed information about the complex environment. Additionally, the reward function is enhanced to address the security concerns and ensure optimal coverage of the various delivery points for UAVs. In addition, the possibility of using a winch-system airdrop to ensure precise delivery of the payloads has been taken into account. Ultimately, the simulation results demonstrate that the proposed model has the capability to enhance the performance of the system. A forthcoming investigation will explore the utilization of sophisticated DQN models for unsupervised path planning of multiple UAVs, enabling UAVs to independently complete specific tasks and communicate with one another without relying on servers.

    Disclosures

    The authors declare no conflicts of interest.

    [3] R. Omirgaliyev, N. Zhakiyev, N. Aitbayeva, Y. Akhmetbekov, Application of machine learning methods for the analysis of heat energy consumption by zones with a change in outdoor temperature: case study for Nur-Sultan City, Intl. Journal of Sustainable Development and Planning 17 (4) (2022) 1247–1257.

    [4] K. Ellis, M. Johnson, N. Neogi, J. Homola, NASA research to expand UAS operations for disaster response, in: Proc. of the 34th Congress of the Intl. Council of the Aeronautical Sciences, Florence, Italy, 2024, pp. 1–11.

    [5] Q.-C. Meng, K. Chen, Q.-J. Qu, PPSwarm: multi-UAV path planning based on hybrid PSO in complex scenarios, Drones 8 (2024) 192.

    [6] W.D. Wu, J.H. Li, Y.L. Wu, X.G. Ren, Y.H. Tang, Multi-UAV adaptive path planning in complex environment based on behavior tree, in: Proc. of the 16th EAI Intl. Conf. on Collaborative Computing: Networking, Applications and Worksharing, Shanghai, China, 2021, pp. 494–505.

    [7] S.-M. Qiu, J.-K. Dai, D.-S. Zhao, Path planning of an unmanned aerial vehicle based on a multi-strategy improved pelican optimization algorithm, Biomimetics 9 (2024) 647.

    [8] Y.-Y. Sheng, H.-Y. Liu, J.-B. Li, Q. Han, UAV autonomous navigation based on deep reinforcement learning in highly dynamic and high-density environments, Drones 8 (2024) 516.

    [9] H.X. Pham, H.M. La, D. Feil-Seifer, L. Van Nguyen, Reinforcement learning for autonomous UAV navigation using function approximation, in: Proc. of IEEE Intl. Symposium on Safety, Security, and Rescue Robotics, Philadelphia, USA, 2018, pp. 1–6.

    [12] S. Poudel, S. Moh, Priority-aware task assignment and path planning for efficient and load-balanced multi-UAV operation, Veh. Commun. 42 (2023) 100633.

    [14] Y. Wu, W.L. Fan, W.W. Yang, X.L. Sun, X.R. Guan, Robust trajectory and communication design for multi-UAV enabled wireless networks in the presence of jammers, IEEE Access 8 (2020) 2893–2905.

    [15] G. Hu, J.-Y. Zhong, G. Wei, SaCHBA_PDN: modified honey badger algorithm with multi-strategy for UAV path planning, Expert Syst. Appl. 223 (2023) 119941.

    [16] O. Walker, F. Vanegas, F. Gonzalez, S. Koenig, A deep reinforcement learning framework for UAV navigation in indoor environments, in: Proc. of IEEE Aerospace Conf., Big Sky, USA, 2019, pp. 1–14.

    [19] M. Bekhti, N. Achir, K. Boussetta, M. Abdennebi, Drone package delivery: a heuristic approach for UAVs path planning and tracking, EAI Endorsed T. Internet of Things 3 (2017) e1.

    [22] C. Hütt, A. Bolten, H. Hüging, G. Bareth, UAV LiDAR metrics for monitoring crop height, biomass and nitrogen uptake: a case study on a winter wheat field trial, PFG-J. Photogramm. Rem. 91 (2023) 65–76.

    [26] H. Ucgun, U. Yuzgec, C. Bayilmis, A review on applications of rotary-wing unmanned aerial vehicle charging stations, Int. J. Adv. Robot. Syst. 18 (3) (2021), doi: 10.1177/17298814211015863.

    [27] Z.L. Liu, R. Sengupta, A. Kurzhanskiy, A power consumption model for multirotor small unmanned aircraft systems, in: Proc. of Intl. Conf. on Unmanned Aircraft Systems, Miami, USA, 2017, pp. 310–315.

    [28] C.-C. Bai, P. Yan, X.-Q. Yu, J.-F. Guo, Learning-based resilience guarantee for multi-UAV collaborative QoS management, Pattern Recogn. 122 (2021) 108166.

    Received: Sep. 12, 2024

    Accepted: Feb. 24, 2025

    Published Online: Jun. 16, 2025

    DOI: 10.1016/j.jnlest.2025.100303