TY - JOUR

T1 - Policy learning using modified learning vector quantization for reinforcement learning problems

AU - Afif Mohd Faudzi, Ahmad

AU - Murata, Junichi

PY - 2015/7/1

Y1 - 2015/7/1

N2 - Reinforcement learning (RL) enables an agent to find an optimal solution to a problem by interacting with the environment. In previous research, Q-learning, one of the popular learning methods in RL, was used to generate a policy, from which an abstract policy was extracted by the LVQ algorithm. In this paper, the aim is to train the agent to learn an optimal policy from scratch as well as to generate the abstract policy in a single operation by the LVQ algorithm. When the LVQ algorithm is applied in an RL framework, erroneous teaching signals sometimes cause the learning to end in failure or with a non-optimal solution. Here, a new LVQ algorithm is proposed to overcome this problem. The new LVQ algorithm introduces, first, a regular reward that is obtained by the agent autonomously based on its behavior and, second, a function that converts a regular reward to a new reward so that the learning system does not suffer from the undesirable effect of a small reward. Through these modifications, the agent is expected to find the optimal solution more efficiently.

AB - Reinforcement learning (RL) enables an agent to find an optimal solution to a problem by interacting with the environment. In previous research, Q-learning, one of the popular learning methods in RL, was used to generate a policy, from which an abstract policy was extracted by the LVQ algorithm. In this paper, the aim is to train the agent to learn an optimal policy from scratch as well as to generate the abstract policy in a single operation by the LVQ algorithm. When the LVQ algorithm is applied in an RL framework, erroneous teaching signals sometimes cause the learning to end in failure or with a non-optimal solution. Here, a new LVQ algorithm is proposed to overcome this problem. The new LVQ algorithm introduces, first, a regular reward that is obtained by the agent autonomously based on its behavior and, second, a function that converts a regular reward to a new reward so that the learning system does not suffer from the undesirable effect of a small reward. Through these modifications, the agent is expected to find the optimal solution more efficiently.

UR - http://www.scopus.com/inward/record.url?scp=84971265228&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84971265228&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84971265228

VL - 20

SP - 1

EP - 6

JO - Research Reports on Information Science and Electrical Engineering of Kyushu University

JF - Research Reports on Information Science and Electrical Engineering of Kyushu University

SN - 1342-3819

IS - 2

ER -