Abstract
This article formulates real-time gantry scheduling in a gantry work cell, where material transfer is driven by gantries, as a Markov decision process (MDP). Classical learning and planning methods for solving MDP optimization problems are discussed. An innovative method, called 'Q-ADP,' is proposed to integrate reinforcement learning (RL) with approximate dynamic programming (ADP). Q-ADP uses a model-free Q-learning algorithm to learn state values through interactions with the environment, while planning steps during the learning process use ADP to keep updating state values along several sample paths. A model of one-step transition probabilities, built from the machines' reliability model, serves the ADP algorithm. To demonstrate the effectiveness of the method, a numerical study compares its production performance with that of a standard Q-learning algorithm. The simulation results show that Q-ADP outperforms standard Q-learning for the same length of training. They also show that, because state values are updated repeatedly along sample paths, Q-ADP requires less data for the gantry policy to converge, which makes the method promising when real data are limited.
Note to Practitioners: The goal of this work is to find a near-optimal gantry assignment policy for real-time control of material-handling gantry/robot movements in gantry work cells. Assigning gantries properly based on the real-time state of the production system can prevent machine stoppages caused by material shortage and consequently improve production performance. Gantry scheduling is a sequential decision-making problem and can be represented as a Markov decision process (MDP). To solve the MDP, an algorithm integrating model-free Q-learning and model-based approximate dynamic programming (ADP) is proposed. By learning directly from interaction with the environment, the method avoids the bias introduced by model design. Meanwhile, a planning process during learning speeds up convergence of the policy, which is particularly beneficial when real data are insufficient.
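The interplay of learning and planning described in the abstract can be illustrated with a minimal tabular sketch. It is not the article's Q-ADP implementation: the three-state MDP, the transition table `P`, the rewards `R`, and all function names and parameter values are hypothetical, and the "ADP" step is simplified to an exact expected backup over a known one-step transition model. Each real interaction triggers one Q-learning update, followed by a few short sample paths through the model along which the visited state-action values are refreshed.

```python
import random

# Hypothetical toy MDP (not from the article): 3 buffer levels, 2 actions
# (0 = wait, 1 = load). P[s][a] lists (next_state, probability) pairs and
# plays the role of the one-step transition model; R[s][a] is the reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 0.9), (0, 0.1)]},
    1: {0: [(0, 0.8), (1, 0.2)], 1: [(2, 0.9), (1, 0.1)]},
    2: {0: [(1, 0.8), (2, 0.2)], 1: [(2, 1.0)]},
}
R = {0: {0: -1.0, 1: -0.5}, 1: {0: 1.0, 1: 0.5}, 2: {0: 1.0, 1: 0.8}}


def step(s, a, rng):
    """Sample (next_state, reward); here the model doubles as the environment."""
    states, probs = zip(*P[s][a])
    return rng.choices(states, weights=probs)[0], R[s][a]


def greedy(Q, s):
    """Action with the highest current value in state s."""
    return max(Q[s], key=Q[s].get)


def eps_greedy(Q, s, eps, rng):
    """Explore with probability eps, otherwise act greedily."""
    return rng.choice(list(Q[s])) if rng.random() < eps else greedy(Q, s)


def expected_backup(Q, s, a, gamma):
    """Model-based backup: expectation over the one-step transition model."""
    return R[s][a] + gamma * sum(p * max(Q[s2].values()) for s2, p in P[s][a])


def train(steps=2000, planning_paths=3, path_len=5,
          alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    s = 0
    for _ in range(steps):
        # Learning: model-free Q-learning update from one real interaction.
        a = eps_greedy(Q, s, eps, rng)
        s2, r = step(s, a, rng)
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
        # Planning: follow short sample paths through the model and refresh
        # each visited state-action value with an expected backup.
        for _ in range(planning_paths):
            ps = s2
            for _ in range(path_len):
                pa = eps_greedy(Q, ps, eps, rng)
                Q[ps][pa] = expected_backup(Q, ps, pa, gamma)
                ps = step(ps, pa, rng)[0]
        s = s2
    return Q


if __name__ == "__main__":
    Q = train()
    print({s: greedy(Q, s) for s in Q})  # greedy action per state
```

Setting `planning_paths=0` reduces the sketch to plain Q-learning, which gives a simple way to compare how many real interactions each variant needs on this toy problem.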
| Original language | English |
|---|---|
| Article number | 9069276 |
| Pages (from-to) | 85-93 |
| Number of pages | 9 |
| Journal | IEEE Transactions on Automation Science and Engineering |
| Volume | 18 |
| Issue number | 1 |
| DOIs | |
| State | Published - Jan 2021 |
Keywords
- Approximate dynamic programming (ADP)
- Markov decision process (MDP)
- Q-learning
- gantry scheduling
- planning and learning
Fingerprint
Dive into the research topics of 'A Method Integrating Q-Learning with Approximate Dynamic Programming for Gantry Work Cell Scheduling'. Together they form a unique fingerprint.