An online policy gradient algorithm for Markov decision processes with continuous states and actions

Yao Ma, Tingting Zhao, Kohei Hatano, Masashi Sugiyama

Research output: Contribution to journal › Letter › peer-review

1 Citation (Scopus)

Abstract

We consider the learning problem under an online Markov decision process (MDP), aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret (the difference from the best fixed policy). The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with a guarantee. We also illustrate the behavior of the proposed online policy gradient method through experiments.
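As a rough illustration of regret against the best fixed choice, the sketch below runs projected online gradient ascent on a sequence of changing concave reward functions. It is not the algorithm from the letter: it has no states or transition dynamics, and the quadratic reward, the interval [-1, 1], the sine-wave targets, and the step size 1/√t are all assumptions chosen for this toy example.

```python
import math


def clip(x, lo=-1.0, hi=1.0):
    """Project x back onto the feasible interval [lo, hi]."""
    return max(lo, min(hi, x))


def online_gradient_ascent(targets):
    """Return the parameter played at each step.

    Assumed toy reward at step t: r_t(theta) = -(theta - c_t)^2,
    where c_t is the t-th entry of `targets` (revealed only after playing).
    """
    theta = 0.0
    played = []
    for t, c in enumerate(targets, start=1):
        played.append(theta)           # commit to theta before c_t is revealed
        grad = -2.0 * (theta - c)      # gradient of the toy reward at theta
        theta = clip(theta + grad / math.sqrt(t))  # step 1/sqrt(t), then project
    return played


def regret(targets, played):
    """Total reward of the best fixed parameter minus the learner's total reward."""
    # For this quadratic reward the best fixed parameter is the mean target,
    # which lies in [-1, 1] whenever every target does.
    best_fixed = sum(targets) / len(targets)
    best_total = sum(-(best_fixed - c) ** 2 for c in targets)
    learner_total = sum(-(th - c) ** 2 for th, c in zip(played, targets))
    return best_total - learner_total


if __name__ == "__main__":
    targets = [math.sin(0.05 * t) for t in range(1, 1001)]  # slowly drifting rewards
    played = online_gradient_ascent(targets)
    print("regret after", len(targets), "steps:", regret(targets, played))
```

In this toy setting the standard online convex optimization analysis bounds the regret by a constant times √T. The regret can also be negative here, because the learner is allowed to move while the comparator is fixed.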

Original language: English
Pages (from-to): 563-593
Number of pages: 31
Journal: Neural Computation
Volume: 28
Issue number: 3
DOI
Publication status: Published - Mar 1 2016

All Science Journal Classification (ASJC) codes

  • Arts and Humanities (miscellaneous)
  • Cognitive Neuroscience
