Laboratory of Computer and Information Science / Neural Networks Research Centre CIS Lab Helsinki University of Technology

On the stability of reinforcement learning under partial observability and generalizing representations

Paul Wagner. Master's thesis, Aalto University School of Science and Technology, Department of Information and Computer Science, Espoo, Finland, May 2010.

Download

Abstract

Computational reinforcement learning is a subfield of artificial intelligence and machine learning and deals with algorithms for solving 'reinforcement learning' problems. The term 'reinforcement learning' refers to the learning process behind operant conditioning, in which behaviors of the learner are adapted based on the preferabilities of their consequences. Preferred or high-utility consequences act as rewards that reinforce behaviors or behavior variations that lead to them, thus leading to the maximization of preferred states of affairs or utility.

Computational reinforcement learning algorithms perform optimization of a cached control or decision-making policy for controlling an initially unknown dynamical system so as to maximize long-term reward (or minimize cost). The reward is an abstract evaluation signal that conveys information about success in a given task. Such algorithms can be viewed, for example, as performing 'model-free' planning so as to maximize utility, or as an approach for the problem of stochastic adaptive optimal control. Most algorithms are based on the (PO)MDP formalism and dynamic programming, and use machine learning techniques for inducing generalization.

The learning setting is an instance of active learning: the learner has to obtain the needed information about the target system by experimenting with it based on its current knowledge. As a consequence, there is a learning timescale feedback loop. This feedback loop in combination with either approximate representation of knowledge or partial observability of the target system can completely destabilize some reinforcement learning algorithms, Q-learning being the most vulnerable to this combination, and trap many popular algorithms into sustained oscillation.

We will survey the currently most popular methodologies and algorithms from the viewpoint of the actor-critic framework. More precisely, we look at the policy gradient and greedy value function approaches, from which the greedy value function approach is known to be susceptible to the oscillation problem. Some of the considered algorithms are the natural actor-critic algorithm, fitted Q iteration, SARSA and Q-learning. We will look in detail at the reasons behind the oscillation problem and observe that the heart of the problem is a certain form of non-Markovity due to incomplete information. The destabilizing mechanism is also illustrated experimentally with minimal artificial examples.

Keywords: reinforcement learning, stochastic optimal control, adaptive control, active learning, policy gradient, natural gradient, approximate dynamic programming, simulation-based optimization, partial observability, Markov property, policy oscillation, policy chattering

You are at: CIS → HUT - CIS - Paul Wagner - M.Sc. thesis

Page maintained by pwagner at cis.hut.fi, last updated Monday, 13-Dec-2010 17:04:52 EET