An online policy gradient algorithm for Markov decision processes with continuous states and actions

Yao Ma, Tingting Zhao, Kohei Hatano, Masashi Sugiyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
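The flavor of the O(√T) regret bound can be illustrated with a minimal online projected-gradient sketch. This is not the paper's actual MDP setting: the one-dimensional parameter, the quadratic (hence concave) reward family f_t(θ) = -(θ - c_t)², and the 1/√t step-size schedule are all illustrative assumptions chosen so that standard online-gradient analysis applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Time-varying concave rewards f_t(theta) = -(theta - c_t)^2, standing in
# for an online MDP whose reward function changes at every step.
T = 2000
c = rng.uniform(-0.5, 0.5, size=T)

def f(theta, t):
    return -(theta - c[t]) ** 2

def grad_f(theta, t):
    return -2.0 * (theta - c[t])

theta = 0.0
total_reward = 0.0
for t in range(T):
    # Collect reward under the current policy parameter, then update.
    total_reward += f(theta, t)
    eta = 1.0 / np.sqrt(t + 1)  # O(1/sqrt(t)) step size -> O(sqrt(T)) regret
    # Gradient ascent step, projected back onto the feasible set [-1, 1].
    theta = np.clip(theta + eta * grad_f(theta, t), -1.0, 1.0)

# Best fixed parameter in hindsight: the mean of c minimizes the total
# squared distance, so it maximizes the cumulative quadratic reward.
theta_star = c.mean()
best_fixed = sum(f(theta_star, t) for t in range(T))
regret = best_fixed - total_reward
print(f"regret = {regret:.3f}, sqrt(T) = {np.sqrt(T):.1f}")
```

The cumulative regret against the best fixed parameter stays on the order of √T, which is the behavior the paper proves for its online policy gradient method under concavity.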

Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Proceedings
Publisher: Springer Verlag
Pages: 354-369
Number of pages: 16
Edition: PART 2
ISBN (Print): 9783662448502
DOI: 10.1007/978-3-662-44851-9_23
Publication status: Published - Jan 1 2014
Event: European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2014 - Nancy, France
Duration: Sep 15 2014 - Sep 19 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 8725 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2014
Country: France
City: Nancy
Period: 9/15/14 - 9/19/14

Fingerprint

Gradient Algorithm
Markov Decision Process
Gradient Method
Regret
Concavity
Decision Making
Reward
Policy
Parameter Space
Minimise
Experiments
Learning

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Ma, Y., Zhao, T., Hatano, K., & Sugiyama, M. (2014). An online policy gradient algorithm for Markov decision processes with continuous states and actions. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Proceedings (PART 2 ed., pp. 354-369). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8725 LNAI, No. PART 2). Springer Verlag. https://doi.org/10.1007/978-3-662-44851-9_23

@inproceedings{1b42131492664bd5b3c40edd96a0dc75,
title = "An online policy gradient algorithm for Markov decision processes with continuous states and actions",
abstract = "We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.",
author = "Yao Ma and Tingting Zhao and Kohei Hatano and Masashi Sugiyama",
year = "2014",
month = "1",
day = "1",
doi = "10.1007/978-3-662-44851-9_23",
language = "English",
isbn = "9783662448502",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
number = "PART 2",
pages = "354--369",
booktitle = "Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Proceedings",
address = "Germany",
edition = "PART 2",

}

TY - GEN

T1 - An online policy gradient algorithm for Markov decision processes with continuous states and actions

AU - Ma, Yao

AU - Zhao, Tingting

AU - Hatano, Kohei

AU - Sugiyama, Masashi

PY - 2014/1/1

Y1 - 2014/1/1

N2 - We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.

AB - We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.

UR - http://www.scopus.com/inward/record.url?scp=84907045929&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84907045929&partnerID=8YFLogxK

U2 - 10.1007/978-3-662-44851-9_23

DO - 10.1007/978-3-662-44851-9_23

M3 - Conference contribution

SN - 9783662448502

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 354

EP - 369

BT - Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Proceedings

PB - Springer Verlag

ER -