Mixing-time regularized policy gradient

Tetsuro Morimura, Takayuki Osogami, Tomoyuki Shirai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Policy gradient reinforcement learning (PGRL) has been receiving substantial attention as a means for seeking stochastic policies that maximize cumulative reward. However, the learning speed of PGRL is known to decrease substantially when PGRL explores policies that give rise to Markov chains with long mixing times. We study a new approach that regularizes how PGRL explores policies by using the hitting time of the Markov chains. The hitting time gives an upper bound on the mixing time, and the proposed approach improves learning efficiency by keeping the mixing time of the Markov chains short. In particular, we propose a temporal-difference learning method for estimating the gradient of the hitting time. Numerical experiments show that the proposed method outperforms conventional PGRL methods.
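
The abstract's core idea is to keep the Markov chain induced by the policy fast-mixing by penalizing a TD-learned hitting-time estimate while following the usual policy gradient. The snippet below is a minimal NumPy sketch of that idea under several assumptions that are mine, not the paper's: a tabular softmax policy on a small random MDP, a fixed reference state for the hitting time, a penalty weight kappa, and a simple actor-critic combination of the reward TD error and the hitting-time TD error. It illustrates the regularization trade-off only; it does not reproduce the paper's exact gradient estimator for the hitting time.

```python
# Illustrative sketch (not the paper's exact algorithm): an average-reward
# actor-critic whose policy update is penalized by the TD error of a learned
# hitting-time estimate toward a fixed reference state.  The MDP, the
# reference state, and the penalty weight `kappa` are assumptions for this demo.
import numpy as np

rng = np.random.default_rng(0)

# A small random ergodic MDP (tabular).
n_states, n_actions = 6, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # mean rewards
ref_state = 0                                                     # hitting target (assumption)

def softmax_policy(theta, s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros((n_states, n_actions))  # policy parameters
v = np.zeros(n_states)                   # reward critic (differential values)
h = np.zeros(n_states)                   # hitting-time critic, h[ref_state] stays 0
rho = 0.0                                # average-reward estimate

alpha, beta, eta, kappa = 0.05, 0.1, 0.01, 0.1  # step sizes and penalty weight

s = rng.integers(n_states)
for t in range(100_000):
    pi = softmax_policy(theta, s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a] + 0.1 * rng.normal()

    # TD error for the average-reward critic.
    delta_r = r - rho + v[s_next] - v[s]
    # TD error for the hitting-time critic (each step toward ref_state costs 1).
    delta_h = 0.0 if s == ref_state else 1.0 + h[s_next] - h[s]

    # Critic and average-reward updates.
    v[s] += beta * delta_r
    rho += eta * delta_r
    if s != ref_state:
        h[s] += beta * delta_h

    # Regularized actor update: follow the reward TD error, but discourage
    # directions that increase the estimated hitting time (a proxy for
    # keeping the mixing time short).
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * (delta_r - kappa * delta_h) * grad_log_pi

    s = s_next

print("average-reward estimate:", round(rho, 3))
print("estimated hitting times to state 0:", np.round(h, 2))
```

Raising kappa trades some reward for shorter estimated hitting times, which is the regularization trade-off the abstract describes; setting kappa to zero recovers a plain average-reward actor-critic.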

Original language: English
Title of host publication: Proceedings of the National Conference on Artificial Intelligence
Publisher: AI Access Foundation
Pages: 1997-2003
Number of pages: 7
ISBN (Electronic): 9781577356790
Publication status: Published - Jan 1 2014
Event: 28th AAAI Conference on Artificial Intelligence, AAAI 2014, 26th Innovative Applications of Artificial Intelligence Conference, IAAI 2014 and the 5th Symposium on Educational Advances in Artificial Intelligence, EAAI 2014 - Quebec City, Canada
Duration: Jul 27 2014 - Jul 31 2014

Publication series

Name: Proceedings of the National Conference on Artificial Intelligence
Volume: 3

Other

Other: 28th AAAI Conference on Artificial Intelligence, AAAI 2014, 26th Innovative Applications of Artificial Intelligence Conference, IAAI 2014 and the 5th Symposium on Educational Advances in Artificial Intelligence, EAAI 2014
Country: Canada
City: Quebec City
Period: 7/27/14 - 7/31/14

Fingerprint

Reinforcement learning
Markov processes
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

Morimura, T., Osogami, T., & Shirai, T. (2014). Mixing-time regularized policy gradient. In Proceedings of the National Conference on Artificial Intelligence (pp. 1997-2003). (Proceedings of the National Conference on Artificial Intelligence; Vol. 3). AI Access Foundation.

@inproceedings{971d9f20312a4c668bb5f9276ca5f48d,
title = "Mixing-time regularized policy gradient",
abstract = "Policy gradient reinforcement learning (PGRL) has been receiving substantial attention as a means for seeking stochastic policies that maximize cumulative reward. However, the learning speed of PGRL is known to decrease substantially when PGRL explores policies that give rise to Markov chains with long mixing times. We study a new approach that regularizes how PGRL explores policies by using the hitting time of the Markov chains. The hitting time gives an upper bound on the mixing time, and the proposed approach improves learning efficiency by keeping the mixing time of the Markov chains short. In particular, we propose a temporal-difference learning method for estimating the gradient of the hitting time. Numerical experiments show that the proposed method outperforms conventional PGRL methods.",
author = "Tetsuro Morimura and Takayuki Osogami and Tomoyuki Shirai",
year = "2014",
month = "1",
day = "1",
language = "English",
series = "Proceedings of the National Conference on Artificial Intelligence",
publisher = "AI Access Foundation",
pages = "1997--2003",
booktitle = "Proceedings of the National Conference on Artificial Intelligence",
address = "United States",

}
