TY - JOUR
T1 - TD algorithm for the variance of return and mean-variance reinforcement learning
AU - Sato, Makoto
AU - Kimura, Hajime
AU - Kobayashi, Shigenobu
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2001
Y1 - 2001
N2 - Estimating probability distributions of returns enables various sophisticated decision-making schemes for control problems in Markov environments, including risk-sensitive control and efficient exploration of environments. Many reinforcement learning algorithms, however, have relied only on the expected return. This paper provides a decision-making scheme that uses the mean and variance of return distributions. It presents a TD algorithm for estimating the variance of return in Markov decision process (MDP) environments and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, which is a typical criterion in risk-avoiding control. Empirical results demonstrate the behavior of the algorithms and the validity of the criterion for risk-avoiding sequential decision tasks.
AB - Estimating probability distributions of returns enables various sophisticated decision-making schemes for control problems in Markov environments, including risk-sensitive control and efficient exploration of environments. Many reinforcement learning algorithms, however, have relied only on the expected return. This paper provides a decision-making scheme that uses the mean and variance of return distributions. It presents a TD algorithm for estimating the variance of return in Markov decision process (MDP) environments and a gradient-based reinforcement learning algorithm for the variance-penalized criterion, which is a typical criterion in risk-avoiding control. Empirical results demonstrate the behavior of the algorithms and the validity of the criterion for risk-avoiding sequential decision tasks.
UR - http://www.scopus.com/inward/record.url?scp=4143110354&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=4143110354&partnerID=8YFLogxK
U2 - 10.1527/tjsai.16.353
DO - 10.1527/tjsai.16.353
M3 - Article
AN - SCOPUS:4143110354
SN - 1346-0714
VL - 16
SP - 353
EP - 362
JO - Transactions of the Japanese Society for Artificial Intelligence
JF - Transactions of the Japanese Society for Artificial Intelligence
IS - 3
ER -