EconPapers    

Reinforcement Learning with Reward Shaping for Last-Mile Delivery Dispatch Efficiency

Sichong Huang

European Journal of Business, Economics & Management, 2025, vol. 1, issue 4, 122-130

Abstract: As the final and most labor-intensive segment of the logistics chain, last-mile delivery grapples with inherent challenges: dynamic traffic conditions, fluctuating order volumes, and the conflicting demands of timeliness, cost control, and resource efficiency. Conventional dispatch approaches, such as heuristic algorithms and static optimization models, exhibit limited adaptability to real-time fluctuations, often resulting in suboptimal resource utilization and elevated operational costs. To address these gaps, this study proposes a reinforcement learning (RL) framework integrated with multi-dimensional reward shaping (RS) to enhance dynamic last-mile delivery dispatch efficiency. First, we formalize the dispatch problem as a Markov Decision Process (MDP) that explicitly incorporates real-time factors (e.g., traffic congestion, order urgency, and vehicle status) into the state space. Second, we design a domain-specific RS function that introduces intermediate rewards (e.g., on-time arrival bonuses, empty-running penalties) to mitigate the sparsity of traditional terminal rewards and accelerate RL agent convergence. Experiments were conducted on a real-world dataset from a logistics enterprise in Chengdu (June–August 2024), comparing the proposed RS-PPO framework against two baselines: the classic Savings Algorithm (SA) and standard PPO without reward shaping (PPO-noRS). Results demonstrate that RS-PPO improves the on-time delivery rate (OTR) by 18.2% (vs. SA) and 9.5% (vs. PPO-noRS), reduces the average delivery cost (ADC) by 12.7% (vs. SA) and 7.3% (vs. PPO-noRS), and shortens convergence time by 40.3% (vs. PPO-noRS). Additionally, RS-PPO boosts the vehicle utilization rate (VUR) by 29.8% (vs. SA) and 13.4% (vs. PPO-noRS). This framework provides a practical, data-driven solution for logistics enterprises seeking to balance service quality, cost efficiency, and sustainability, aligning with global last-mile optimization trends.
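The reward-shaping idea in the abstract — augmenting a sparse terminal reward with intermediate signals such as an on-time arrival bonus and an empty-running penalty — can be sketched as a simple reward function. This is a minimal illustrative sketch only: the field names, weights, and structure below are assumptions for exposition, not the paper's actual RS function.

```python
# Illustrative sketch of multi-dimensional reward shaping for dispatch.
# All weights and step-dict keys are hypothetical assumptions; the paper's
# actual RS function and coefficients are not reproduced here.

def shaped_reward(step, w_bonus=1.0, w_empty=0.5, w_terminal=10.0):
    """Shaped reward for one dispatch step.

    step: dict with (assumed) keys
      - 'delivered'    : bool, an order was completed this step
      - 'on_time'      : bool, it arrived within its time window
      - 'empty_km'     : float, kilometres driven without cargo this step
      - 'episode_done' : bool, end of the dispatch episode
      - 'all_on_time'  : bool, terminal success flag for the episode
    """
    r = 0.0
    if step['delivered'] and step['on_time']:
        r += w_bonus                    # intermediate on-time arrival bonus
    r -= w_empty * step['empty_km']     # intermediate empty-running penalty
    if step['episode_done'] and step['all_on_time']:
        r += w_terminal                 # sparse terminal reward kept as-is
    return r

# Example: an on-time delivery with no empty running earns the bonus,
# while a step of pure empty driving is penalized.
r1 = shaped_reward({'delivered': True, 'on_time': True, 'empty_km': 0.0,
                    'episode_done': False, 'all_on_time': False})
r2 = shaped_reward({'delivered': False, 'on_time': False, 'empty_km': 2.0,
                    'episode_done': False, 'all_on_time': False})
```

In an RL loop such a function would replace the environment's raw reward before the PPO update, which is what gives the agent a dense learning signal between episode ends.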

Keywords: last-mile delivery; reinforcement learning; multi-dimensional reward shaping; dynamic dispatch; Markov decision process
Date: 2025

Downloads: (external link)
https://pinnaclepubs.com/index.php/EJBEM/article/view/359/362 (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:dba:ejbema:v:1:y:2025:i:4:p:122-130


More articles in European Journal of Business, Economics & Management from Pinnacle Academic Press
Bibliographic data for series maintained by Joseph Clark ().

 
Page updated 2025-11-03
Handle: RePEc:dba:ejbema:v:1:y:2025:i:4:p:122-130