Skip to main navigation Skip to search Skip to main content

An ϵ-Greedy Multiarmed Bandit Approach to Markov Decision Processes †

  • Independent Researcher

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the (Formula presented.) -greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimation of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with (Formula presented.) states. The results show that REGA and RASA achieved similar performance. Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original” (Formula presented.) -greedy algorithm that commonly appears in the literature.

Original languageEnglish
Pages (from-to)99-112
Number of pages14
JournalStats
Volume6
Issue number1
DOIs
StatePublished - Mar 2023

Keywords

  • Markov decision process (MDP)
  • epsilon-greedy method
  • multiarmed bandits
  • optimization under uncertainties
  • sampling

Fingerprint

Dive into the research topics of 'An ϵ-Greedy Multiarmed Bandit Approach to Markov Decision Processes †'. Together they form a unique fingerprint.

Cite this