CDS Lecture Series


Sean Meyn
Department of Electrical and Computer Engineering
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign

Q-learning and Pontryagin's Minimum Principle

Q-learning is a technique for computing an optimal policy for a controlled Markov chain from observations of the system operating under a non-optimal policy. It has proven effective for models with finite state and action spaces. In this talk we will see how the construction of the algorithm closely parallels concepts from classical nonlinear control theory, in particular Jacobson and Mayne's differential dynamic programming, introduced in the 1960s. We will see how Q-learning can be extended to deterministic and Markovian systems in continuous time, with general state and action spaces. The main ideas are summarized as follows.

(i) Watkins's "Q-function" is an extension of the Hamiltonian that appears in the Minimum Principle. Based on this observation we obtain extensions of Watkins's algorithm that approximate the Hamiltonian within a prescribed finite-dimensional function class.

(ii) A transformation of the optimality equations is performed based on the adjoint of a resolvent operator. This is used to construct a consistent algorithm, based on stochastic approximation, that requires only causal filtering of the time-series data.

(iii) Examples are presented to illustrate the application of these techniques, including distributed control of multi-agent systems.

Reference: P. Mehta and S. Meyn. Q-learning and Pontryagin's Minimum Principle. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18, 2009.
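To fix ideas before the talk, here is a minimal sketch of Watkins's tabular Q-learning on a toy chain MDP. This is an illustrative example only, not code from the paper: the chain, reward, and parameters are all invented for the demonstration. It shows the off-policy character emphasized in the abstract, since the behavior policy is uniformly random yet the learned Q-values recover the optimal (always move right) policy.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a toy chain MDP.

    States 0..n_states-1; action 0 moves left (reflecting at 0),
    action 1 moves right. Reaching the rightmost state yields
    reward 1 and ends the episode. The behavior policy is
    uniformly random (non-optimal), yet the Q-update still drives
    the estimates toward the optimal action values.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.randrange(n_actions)           # random behavior policy
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Terminal state has value 0; otherwise bootstrap from max Q.
            target = r + gamma * (0.0 if s2 == n_states - 1 else max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])  # TD(0)-style update
            s = s2
    return Q

Q = q_learning()
greedy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(4)]
print(greedy)  # greedy policy on non-terminal states
```

With these parameters the greedy policy extracted from Q moves right in every non-terminal state, matching the optimal policy despite the random exploration.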
