CS394R: Reinforcement Learning: Theory and Practice -- Fall 2016: Resources Page

Resources for Reinforcement Learning: Theory and Practice


Week 0: Class Overview, Introduction

  • Slides from week 0: Peter's; Scott's.

Week 1: Multi-armed Bandits and MDPs

  • Slides from Tuesday: pdf.
  • Gradient Bandit Slides from Thursday: pdf.
  • Ch.3 Slides from Thursday: pdf.
  • Slides from Shivaram Kalyanakrishnan: pdf.
  • Readings from a past version of the course:
  • Sections 1, 2, 4, and 5 and the proof of Theorem 1 in Section 3. The proof of Theorem 3 and the appendices are optional.
    UCB: Finite-time Analysis of the Multiarmed Bandit Problem
    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer
    2002
  • Sections 1, 2, 3.1, 4, and 5. The details of the proof (Sections 3.2-3.4) are optional.
    Thompson Sampling: an asymptotically optimal finite-time analysis
    Emilie Kaufmann, Nathaniel Korda, and Remi Munos
    2012
  • Csaba Szepesvari's banditalgs.com.
  • Vermorel and Mohri: Multi-Armed Bandit Algorithms and Empirical Evaluation.
  • Shivaram Kalyanakrishnan and Peter Stone: Efficient Selection of Multiple Bandit Arms: Theory and Practice. In ICML 2010. Here are some related slides.
  • An RL reading list from Shivaram Kalyanakrishnan.
  • An Empirical Evaluation of Thompson Sampling
    Olivier Chapelle and Lihong Li
    NIPS 2011
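  • To make the action-selection rule from the UCB reading above concrete, here is a minimal UCB1 sketch (a hedged illustration only; the Bernoulli arms and horizon below are hypothetical and not taken from any of the linked papers):

      import numpy as np

      def ucb1(true_means, horizon=1000, c=2.0, seed=0):
          """Minimal UCB1 on a Bernoulli bandit: pull each arm once, then pick
          the arm maximizing empirical mean + sqrt(c * ln(t) / n)."""
          rng = np.random.default_rng(seed)
          k = len(true_means)
          counts = np.zeros(k)               # pulls per arm
          values = np.zeros(k)               # empirical mean reward per arm
          total_reward = 0.0
          for t in range(1, horizon + 1):
              if t <= k:                     # initialization: pull each arm once
                  arm = t - 1
              else:
                  bonus = np.sqrt(c * np.log(t) / counts)
                  arm = int(np.argmax(values + bonus))
              reward = float(rng.random() < true_means[arm])       # Bernoulli reward
              counts[arm] += 1
              values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
              total_reward += reward
          return total_reward, counts

      # Example with three (hypothetical) arms:
      # total, pulls = ucb1([0.2, 0.5, 0.7])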

Week 2: Dynamic Programming and Monte Carlo Methods

  • Slides from Tuesday: pdf
  • Slides from Thursday: pdf
  • "On the Complexity of Solving MDPs" (Littman, Dean, and Kaelbling, 1995).
  • Pashenkova, Rish, and Dechter: Value Iteration and Policy Iteration Algorithms for Markov Decision Problems.
  • Some slides on robot localization that include information on importance sampling.
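  • As a companion to the DP readings above, a minimal value-iteration sketch for a tabular MDP (the array layout and discount factor are illustrative assumptions, not taken from the papers listed):

      import numpy as np

      def value_iteration(P, R, gamma=0.95, tol=1e-8):
          """P[a, s, s'] holds transition probabilities and R[a, s] expected rewards.
          Repeatedly applies V(s) <- max_a [ R(a, s) + gamma * sum_s' P(a, s, s') V(s') ]."""
          n_actions, n_states, _ = P.shape
          V = np.zeros(n_states)
          while True:
              Q = R + gamma * (P @ V)                  # shape (n_actions, n_states)
              V_new = Q.max(axis=0)
              if np.max(np.abs(V_new - V)) < tol:
                  return V_new, Q.argmax(axis=0)       # value function and greedy policy
              V = V_new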

Week 3: TD Learning and n-step Bootstrapping

  • Slides from Thursday: pdf.
  • Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering, A Theoretical and Empirical Analysis of Expected Sarsa. In ADPRL 2009.
  • A Q-learning video
  • A blog post explaining why double learning helps deal with maximization bias.
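  • To complement the double-learning post above, a minimal tabular double Q-learning update (a sketch under assumed tabular arrays; the step size and discount are arbitrary choices):

      import numpy as np

      def double_q_update(Q1, Q2, s, a, r, s_next, done,
                          alpha=0.1, gamma=0.99, rng=None):
          """One double Q-learning step with tables Q1, Q2 of shape [n_states, n_actions].
          Randomly choose which table to update; select the greedy next action with that
          table but evaluate it with the other, which reduces maximization bias."""
          rng = rng if rng is not None else np.random.default_rng()
          if rng.random() < 0.5:
              Q_sel, Q_eval = Q1, Q2
          else:
              Q_sel, Q_eval = Q2, Q1
          best_next = int(np.argmax(Q_sel[s_next]))
          target = r if done else r + gamma * Q_eval[s_next, best_next]
          Q_sel[s, a] += alpha * (target - Q_sel[s, a])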

Week 4: Planning

  • Slides from week 4: The planning ones; The ones on TEXPLORE; Slides on MCTS; Slides on POMDPs.
  • Slides by Alan Fern on Monte Carlo Tree Search and UCT
  • On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search by Khandelwal, Liebman, Stone, and Niekum.
  • A Survey of Monte Carlo Tree Search Methods by Browne et al.
    (IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, March 2012)
  • The Dependence of Effective Planning Horizon on Model Accuracy
    by Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis.
    In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2015.
  • A survey on Bayesian RL by Ghavamzadeh et al.
  • A paper that uses MCTS for robot learning: TEXPLORE: Real-Time Sample-Efficient Reinforcement Learning for Robots, by Hester and Stone.
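  • To make the UCT selection rule in the MCTS readings above concrete, a minimal node-selection and backup sketch (the tree representation is an assumption for illustration; it is not the implementation from any linked paper):

      import math

      class Node:
          def __init__(self, parent=None):
              self.parent = parent
              self.children = {}      # action -> Node
              self.visits = 0
              self.value = 0.0        # running mean of simulated returns

      def uct_select(node, c=1.4):
          """Return the (action, child) maximizing value + c * sqrt(ln(N_parent) / N_child)."""
          def score(child):
              if child.visits == 0:
                  return float("inf")          # try unvisited children first
              return child.value + c * math.sqrt(math.log(node.visits) / child.visits)
          return max(node.children.items(), key=lambda kv: score(kv[1]))

      def backup(node, ret):
          """Propagate a simulated return up the tree, updating visit counts and means."""
          while node is not None:
              node.visits += 1
              node.value += (ret - node.value) / node.visits
              node = node.parent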

Week 5: On-policy Prediction with Approximation

  • Slides from week 5: pdf.
  • Evolutionary Function Approximation by Shimon Whiteson.
  • Dopamine: Generalization and Bonuses (2002) by Kakade and Dayan.
  • Keepaway Soccer: From Machine Learning Testbed to Benchmark - a paper that compares CMAC, RBF, and NN function approximators on the same task.
  • Boyan, J. A., and A. W. Moore, Generalization in Reinforcement Learning: Safely Approximating the Value Function. In Tesauro, G., D. S. Touretzky, and T. K. Leen (eds.), Advances in Neural Information Processing Systems 7 (NIPS). MIT Press, 1995. Another example of function approximation divergence and a proposed solution.
  • Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces (1998) Juan Carlos Santamaria, Richard S. Sutton, Ashwin Ram. Comparisons of several types of function approximators (including instance-based like Kanerva).
  • Binary action search for learning continuous-action control policies (2009). Pazis and Lagoudakis. (slides)
  • Least-Squares Temporal Difference Learning by Justin Boyan.
  • A Convergent Form of Approximate Policy Iteration (2002) T. J. Perkins and D. Precup. A convergence guarantee with function approximation.
  • Moore and Atkeson: The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State Spaces.
  • Sherstov and Stone: Function Approximation via Tile Coding: Automating Parameter Choice.
  • Chapman and Kaelbling: Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons.
  • Sašo Džeroski, Luc De Raedt and Kurt Driessens: Relational Reinforcement Learning.
  • Sprague and Ballard: Multiple-Goal Reinforcement Learning with Modular Sarsa(0).
  • A post on Deep Q-learning, and another.
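  • To connect the function-approximation readings above to the book's notation, a minimal semi-gradient TD(0) update with linear features (the feature vectors and step size are hypothetical placeholders):

      import numpy as np

      def semi_gradient_td0(w, phi_s, r, phi_s_next, done, alpha=0.01, gamma=0.99):
          """One semi-gradient TD(0) step for linear prediction, v_hat(s) = w . phi(s).
          The update is w <- w + alpha * delta * phi(s), since grad_w v_hat(s) = phi(s)."""
          v = w @ phi_s
          v_next = 0.0 if done else w @ phi_s_next
          delta = r + gamma * v_next - v      # TD error
          w = w + alpha * delta * phi_s
          return w, delta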

Week 6: On-policy Control with Approximation and Off-policy Methods with Approximation

  • Slides from week 6, Thursday (Ch 11): pdf.
  • Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces (1998) Juan Carlos Santamaria, Richard S. Sutton, Ashwin Ram. Comparisons of several types of function approximators (including instance-based like Kanerva).
  • Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results by Mahadevan.
  • Discounted Reinforcement Learning is Not an Optimization Problem by Naik, Shariff, Yasui, and Sutton.
  • Toward Off-Policy Learning Control with Function Approximation
    Maei et al. ICML 2010 - solves Baird's counterexample - Greedy-GQ for linear function approximation control
  • Residual Algorithms: Reinforcement Learning with Function Approximation (1995) Leemon Baird. More on the Baird counterexample as well as an alternative to doing gradient descent on the MSE.

Week 7: Eligibility Traces

  • Rich Sutton's slides: pdf
  • The keepaway slides: pdf
  • The corrected version of the eligibility trace example shown in class: pdf
  • The forward and backward views of TD(λ) are equivalent.
  • Dayan: The Convergence of TD(λ) for General λ.
  • A paper that addresses the relationship between first-visit and every-visit MC (Singh and Sutton, 1996). For some theoretical relationships, see the section starting at Section 3.3 (and the referenced appendices). The equivalence of MC and first-visit TD(1) is proven starting in Section 2.4.
  • The paper that introduced Dutch traces and off-policy true on-line TD
  • An empirical analysis of true on-line TD: True Online Temporal-Difference Learning by van Seijen et al. (includes comparison to replacing traces)
  • Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
    Maei et al. NIPS 2009 - GTD for nonlinear function approximation policy evaluation
  • Train faster, generalize better: Stability of stochastic gradient descent by Moritz Hardt, Benjamin Recht, and Yoram Singer
  • Keepaway PASS slides, GETOPEN slides and the keepaway main page
  • An extensive empirical study of many different linear TD algorithms by Adam White and Martha White (AAMAS 2016).
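  • As a small complement to the eligibility-trace readings above, a backward-view TD(λ) step with accumulating traces and linear features (a hedged sketch; this is the classic accumulating-trace update, not the true-online or Dutch-trace variants discussed in the papers):

      import numpy as np

      def td_lambda_step(w, z, phi_s, r, phi_s_next, done,
                         alpha=0.01, gamma=0.99, lam=0.9):
          """One TD(lambda) step: z <- gamma*lambda*z + phi(s); w <- w + alpha*delta*z."""
          v = w @ phi_s
          v_next = 0.0 if done else w @ phi_s_next
          delta = r + gamma * v_next - v
          z = gamma * lam * z + phi_s          # accumulating eligibility trace
          w = w + alpha * delta * z
          if done:
              z = np.zeros_like(z)             # reset the trace at episode boundaries
          return w, z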

Week 8: Policy Gradient Methods

  • Slides from class: pdf.
  • Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion.
    Nate Kohl and Peter Stone
    In Proceedings of the IEEE International Conference on Robotics and Automation, May 2004.
  • This paper compares the policy gradient RL method with other algorithms on the walk-learning task: Machine Learning for Fast Quadrupedal Locomotion. Kohl and Stone. AAAI 2004.
  • Overview of Policy Gradient Methods by Jan Peters: http://www.scholarpedia.org/article/Policy_gradient_methods
  • From Jan Peters' group: Policy Search for Motor Primitives in Robotics
  • Szita and Lörincz: Learning Tetris Using the Noisy Cross-Entropy Method.
  • Autonomous helicopter flight via reinforcement learning.
    Andrew Ng, H. Jin Kim, Michael Jordan and Shankar Sastry.
    In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 17, 2004.
  • PEGASUS: A policy search method for large MDPs and POMDPs.
    Andrew Ng and Michael Jordan
    Some of the helicopter videos learned with PEGASUS.
  • Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods.
    J. Bagnell and J. Schneider
    Proceedings of the International Conference on Robotics and Automation 2001, IEEE, May, 2001.
  • Autonomous reinforcement learning on raw visual input data in a real world application.
    Sascha Lange, Martin Riedmiller, Arne Voigtlander.
    IJCNN 2012.
  • A couple of articles on the details of actor-critic in practice by Tsitsiklis and by Williams.
  • Natural Actor-Critic.
    Jan Peters and Stefan Schaal
    Neurocomputing 2008. Earlier version in ECML 2005.
  • PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
    Marc Peter Deisenroth and Carl Edward Rasmussen
    ICML 2011
  • The original policy gradient RL paper.
  • Guided Policy Search
    Sergey Levine and Vladlen Koltun.
    ICML 2013.
    associated videos
  • Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics
    Sergey Levine, Pieter Abbeel. NIPS 2014.
    video
  • Trust Region Policy Optimization
    John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. ICML 2015.
    video
  • A post by Karpathy on deep RL including with policy gradients (repeated from week 5)
  • Characterizing Reinforcement Learning Methods through Parameterized Learning Problems
    Shivaram Kalyanakrishnan and Peter Stone.
    Machine Learning (MLJ), 84(1-2):205-247, July 2011.
  • Comparing Evolutionary and Temporal Difference Methods for Reinforcement Learning.
    Matthew Taylor, Shimon Whiteson, and Peter Stone.
    In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1321-28, July 2006.
  • Evolutionary Function Approximation for RL. Whiteson and Stone, MLJ 2006.
  • Spinning Up in Deep Reinforcement Learning: resources from OpenAI on getting up to speed with deep reinforcement learning.
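  • To ground the policy-gradient readings above, a minimal REINFORCE update for a tabular softmax policy (a sketch under assumed tabular preferences; it omits baselines and the gamma^t factor that some formulations include):

      import numpy as np

      def softmax(x):
          x = x - x.max()
          e = np.exp(x)
          return e / e.sum()

      def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
          """theta[s, a] are action preferences; episode is a list of (state, action, reward).
          Applies theta[s] <- theta[s] + alpha * G_t * grad log pi(a|s) at each step."""
          G = 0.0
          for s, a, r in reversed(episode):    # accumulate returns backwards
              G = r + gamma * G
              pi = softmax(theta[s])
              grad_log = -pi                   # d log pi(a|s) / d theta[s, b] = 1{b=a} - pi(b)
              grad_log[a] += 1.0
              theta[s] += alpha * G * grad_log
          return theta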

Week 9: Case Studies and Applications

  • Guided policy search for visual manipulation: pdf video
  • OpenAI hide and seek: blog post
  • OpenAI Rubik's cube: blog post
  • The slides I showed on understanding what Deep RL nodes have learned (in particular LSTM units in a partially observable environment).
  • The slides I showed on AlphaGo
  • The GGP ones.
  • The NEAT+Q ones.
  • Some minimax slides: ppt.
  • Motif backgammon (online player)
  • Tesauro, G., Temporal Difference Learning and TD-Gammon. Communications of the ACM, 1995
  • Practical Issues in Temporal Difference Learning: an earlier paper by Tesauro (with a few more details)
  • Pollack, J.B., & Blair, A.D. Co-evolution in the successful learning of backgammon strategy. Machine Learning, 1998
  • Tesauro, G. Comments on Co-Evolution in the Successful Learning of Backgammon Strategy. Machine Learning, 1998.
  • Modular Neural Networks for Learning Context-Dependent Game Strategies, Justin Boyan, 1992: a partial replication of TD-gammon.
  • A fairly complete overview of one of the first applications of UCT to Go: "Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go". Gelly and Silver. AIJ 2011.
  • Some papers from Simon Lucas' group on comparing TD learning and co-evolution in various games: Othello; Go; Treasure hunt.
  • S. Gelly and D. Silver. Achieving Master-Level Play in 9x9 Computer Go. In Proceedings of the 23rd Conference on Artificial Intelligence, Nectar Track (AAAI-08), 2008. Also available from here.
  • Simulation-Based Approach to General Game Playing
    Hilmar Finnsson and Yngvi Bjornsson
    AAAI 2008.
  • Some papers from the UT Learning Agents Research Group on General Game Playing
  • Deep Reinforcement Learning with Double Q-learning.
    Hado van Hasselt, Arthur Guez, David Silver
  • Scaling Reinforcement Learning toward RoboCup Soccer.
    Peter Stone and Richard S. Sutton.
    Proceedings of the Eighteenth International Conference on Machine Learning, pp. 537-544, Morgan Kaufmann, San Francisco, CA, 2001.
  • The UT Austin Villa RoboCup team home page.
  • Greg Kuhlmann's follow-up on progress in 3v2 keepaway
  • Kalyanakrishnan et al.: Model-based Reinforcement Learning in a Complex Domain.
  • Making a Robot Learn to Play Soccer Using Reward and Punishment.
    Heiko Müller, Martin Lauer, Roland Hafner, Sascha Lange, Artur Merke and Martin Riedmiller.
    30th Annual German Conference on AI, KI 2007.
  • Reinforcement Learning for Sensing Strategies.
    C. Kwok and D. Fox.
    Proceedings of IROS, 2004.
  • Learning to trade via direct reinforcement
    John Moody and Matthew Saffell
    IEEE Transactions on Neural Networks, 2001.
  • Reinforcement learning for optimized trade execution
    Yuriy Nevmyvaka, Yi Feng, and Michael Kearns
    ICML 2006
  • Adaptive Treatment of Epilepsy via Batch-mode Reinforcement Learning
    Arthur Guez, Robert D. Vincent, Massimo Avoli, Joelle Pineau.
    IAAI 2008
  • PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs.
    Thomas G. Dietterich, Majid Taleghan, and Mark Crowley
    AAAI 2013.
  • Design, Analysis, and Learning Control of a Fully Actuated Micro Wind Turbine.
    J. Zico Kolter, Zachary Jackowski, Russ Tedrake
    American Controls Conference 2012.

Week 10: Abstraction: Options and Hierarchy

  • Tuesday slides, most courtesy of George Konidaris: pdf.
  • Konidaris and Barto's paper on skill chaining: pdf.
  • Bacon, Harb, and Precup's paper on the Option-Critic architecture: pdf.
  • Vezhnevets et al.'s paper on feudal networks: pdf.
  • Hausman et al.'s paper on learning skill embeddings: pdf.
  • Abel et al.'s paper on value-preserving abstractions: pdf.
  • Tom Dietterich's classic paper on MAXQ: pdf.
  • The journal version of the MAXQ paper
  • A follow-up paper on eliminating irrelevant variables within a subtask: State Abstraction in MAXQ Hierarchical Reinforcement Learning
  • Automatic Discovery and Transfer of MAXQ Hierarchies (from Dietterich's group - 2008)
  • Hierarchical Model-Based Reinforcement Learning: Rmax + MAXQ.
    Proceedings of the 25th International Conference on Machine Learning, 2008.
    Nicholas K. Jong and Peter Stone
  • Automatic Discovery of Subgoals in RL using Diverse Density by McGovern and Barto.
  • Improved Automatic Discovery of Subgoals for Options in Hierarchical Reinforcement Learning by Kretchmar et al.
  • Nick Jong and Todd Hester's paper on the utility of temporal abstraction. The slides.
  • Lihong Li, Thomas J. Walsh, and Michael L. Littman: Towards a Unified Theory of State Abstraction for MDPs. Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.
  • Tom Dietterich's tutorial on abstraction.
  • Nick Jong's paper on state abstraction discovery. The slides.
  • Nick Jong's Thesis code repository and annotated slides
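  • To make the SMDP view in the options readings above concrete, a minimal SMDP Q-learning update applied when an option terminates (the option and reward bookkeeping here are simplifying assumptions):

      import numpy as np

      def smdp_q_update(Q, s, option, disc_reward, k, s_next, done,
                        alpha=0.1, gamma=0.99):
          """Update after an option runs for k steps from state s to s_next:
          Q(s, o) <- Q(s, o) + alpha * [R + gamma^k * max_o' Q(s', o') - Q(s, o)],
          where disc_reward (R) is the reward discounted within the option's execution."""
          bootstrap = 0.0 if done else (gamma ** k) * np.max(Q[s_next])
          target = disc_reward + bootstrap
          Q[s, option] += alpha * (target - Q[s, option])
          return Q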

Week 11: Learning from Human Input

  • Tuesday slides on imitation learning and IRL
  • TAMER slides
  • Deep TAMER slides
  • Behavioral cloning from observation (BCO) slides
  • Some slides on inverse RL from Pieter Abbeel.
  • Maximum Entropy IRL: pdf.
  • Bayesian IRL: pdf.
  • Generative Adversarial Imitation Learning: pdf.
  • Behavioral Cloning from Observations: pdf.
  • Dataset Aggregation (DAgger): pdf.
  • IRL with rankings: pdf.
  • Niekum et al. Learning to assemble IKEA from demonstrations: pdf.
  • Knox et al. Understanding Human Teaching Modalities in Reinforcement Learning Environments: A Preliminary Report: pdf.
  • Kaochar et al. Towards Understanding How Humans Teach Robots: pdf.
  • Toris et al. A Practical Comparison of Three Robot Learning from Demonstration Algorithms: pdf.
  • sequential TAMER+RL and other follow-up papers by Brad Knox.
  • The deep TAMER paper.
  • The BCO paper.
  • Towards Resolving Unidentifiability in Inverse Reinforcement Learning.
    Kareem Amin and Satinder Singh
  • Nonlinear Inverse Reinforcement Learning with Gaussian Processes
    Sergey Levine, Zoran Popovic, Vladlen Koltun.
  • Inverse Reinforcement Learning in Partially Observable Environments
    Jaedeug Choi and Kee-Eung Kim
  • Some papers on IRL and learning by demonstration
  • Deep Apprenticeship Learning for Playing Video Games
  • Maximum Entropy Deep Inverse Reinforcement Learning
  • Generative Adversarial Imitation Learning
  • Generative Adversarial Imitation from Observation
  • Creating Advice-Taking Reinforcement Learners.
    Richard Maclin and Jude Shavlik.
    Machine Learning, 22, pp. 251-281, 1996.
  • Gregory Kuhlmann, Peter Stone, Raymond Mooney, and Jude Shavlik: Guiding a Reinforcement Learner with Natural Language Advice: Initial Results in RoboCup Soccer.
  • Sonia Chernova and Manuela Veloso: Confidence-Based Policy Learning from Demonstration Using Gaussian Mixture Models.
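  • Related to the imitation-learning items above, a minimal behavioral-cloning sketch that fits a discrete-action policy to demonstration data by supervised learning (scikit-learn is an assumed dependency; the demonstration arrays are placeholders, and this is not the BCO algorithm itself):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def behavioral_cloning(demo_states, demo_actions):
          """Treat imitation as classification: predict the demonstrator's action from the state."""
          policy = LogisticRegression(max_iter=1000)
          policy.fit(demo_states, demo_actions)    # states: [N, d] features, actions: [N] labels
          return policy

      # Hypothetical usage:
      # policy = behavioral_cloning(states, actions)
      # a = policy.predict(state.reshape(1, -1))[0]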

Week 12: Multiagent RL and Safe RL

  • Slides from week 12 from Michael Bowling: pdf
  • The ones on the grid game: pdf
  • Slides on safe RL and IRL: pdf
  • Journal version of WoLF
  • Doran Chakraborty and Peter Stone
    Convergence, Targeted Optimality and Safety in Multiagent Learning (CMLeS)
    ICML 2010.
    journal version
    Some slides (ppt)
  • A CMLeS-like algorithm that can be applied
  • Some slides on threats (pdf) and the relevant paper
  • Busoniu, L., Babuska, R., and De Schutter, B.
    A comprehensive survey of multiagent reinforcement learning
    IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156-172, 2008.
  • Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents
    by Ming Tan
  • Michael Bowling
    Convergence and No-Regret in Multiagent Learning
    NIPS 2004
  • Kok, J.R. and Vlassis, N., Collaborative multiagent reinforcement learning by payoff propagation, The Journal of Machine Learning Research, 7:1789-1828, 2006.
  • A brief survey on multiagent learning by Doran Chakraborty.
  • gametheory.net
  • Some useful slides (Part C) from Michael Bowling on game theory, stochastic games, and correlated equilibria; and (Part D) from Michael Littman with more on stochastic games.
  • Scaling up to bigger games with empirical game theory
  • Rob Powers and Yoav Shoham
    New Criteria and a New Algorithm for Learning in Multi-Agent Systems
    NIPS 2004.
    journal version
  • A suite of game generators called GAMUT from Stanford.
  • RoShamBo (rock-paper-scissors) contest
  • U. of Alberta page on automated poker.
  • A paper introducing ad hoc teamwork
  • An article addressing ad hoc teamwork, applied in both predator/prey and RoboCup soccer.
  • Ad hoc teamwork as flocking
  • High confidence policy improvement (Thomas et al.): pdf
  • Safe reinforcement learning via shielding (Alshiekh et al.): pdf
  • Bootstrapping with models: confidence intervals for off-policy evaluation (Hanna et al.): pdf

Week 13: Exploration and Intrinsic Motivation

  • Tuesday slides on exploration and IM
  • R-Max - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
    Ronen Brafman and Moshe Tennenholtz
    The Journal of Machine Learning Research (JMLR) 2002
  • Efficient Structure Learning in Factored-state MDPs
    Alexander L. Strehl, Carlos Diuk, and Michael L. Littman
    AAAI 2007
  • The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning. Carlos Diuk, Lihong Li, and Bethany R. Leffler
    ICML 2009
  • Slides and video for the k-meteorologists paper
  • An Analysis of Model-Based Interval Estimation for Markov Decision Processes
    Alexander L. Strehl and Michael L. Littman
    MLJ 2008.
  • A shorter paper on MBIE
  • Model-Based Exploration in Continuous State Spaces
    Nicholas K. Jong and Peter Stone
    The Seventh Symposium on Abstraction, Reformulation, and Approximation, July 2007.
  • TEXPLORE: Real-Time Sample-Efficient Reinforcement Learning for Robots.
    Todd Hester and Peter Stone
    Machine Learning 2012
  • Near-Optimal Reinforcement Learning in Polynomial Time
    Michael Kearns and Satinder Singh
  • Strehl et al.: PAC Model-Free Reinforcement Learning.
  • Safe Exploration in Markov Decision Processes
    Moldovan and Abbeel, ICML 2012
    (safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state)
  • Intrinsically motivated reinforcement learning (Singh et al.)
  • Go-Explore (Ecoffet et al.)
  • Evolved intrinsic rewards for efficient exploration (Niekum et al.)
  • Curiosity-based exploration for multi-step tasks (Colas et al.)
  • Intrinsically motivated model learning for developing curious robots (Hester et al.)
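  • To tie together the count-based exploration readings above, a tabular Q-learning step with an MBIE-EB-style bonus (a hedged sketch; beta and the bonus form are illustrative choices, not the exact algorithms in the linked papers):

      import numpy as np

      def q_update_with_bonus(Q, N, s, a, r, s_next, done,
                              alpha=0.1, gamma=0.99, beta=0.05):
          """Add an exploration bonus beta / sqrt(N(s, a)) to the Q-learning target,
          where N counts visits to each state-action pair."""
          N[s, a] += 1
          bonus = beta / np.sqrt(N[s, a])
          target = r + bonus + (0.0 if done else gamma * np.max(Q[s_next]))
          Q[s, a] += alpha * (target - Q[s, a])
          return Q, N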

Week 14: Modern Landscape

  • Tuesday slides on distributional RL, metalearning, and multimodal learning
  • Distributional reinforcement learning: pdf
  • Proximal Policy Optimization: pdf
  • OpenAI's Spinning Up in Deep RL Tutorial: website
  • A follow-up on SAC: Latent Space Policies for Hierarchical Reinforcement Learning
  • Action-Conditional Video Prediction Using Deep Networks in Atari Games.
    Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh.
    Neural Information Processing Systems, 2015.
    Appendix
    Videos
  • Reinforcement learning with unsupervised auxiliary tasks from DeepMind; includes some action-conditional learning.
  • An Introduction to Inter-task Transfer for Reinforcement Learning.
    Matthew E. Taylor and Peter Stone.
    AI Magazine, 32(1):15-34, 2011.
  • Some transfer learning slides; the ones on instance-based transfer; the ones on curriculum learning
  • Improving Action Selection in MDP's via Knowledge Transfer.
    Alexander A. Sherstov and Peter Stone.
    In Proceedings of the Twentieth National Conference on Artificial Intelligence, July 2005.
  • General Game Learning using Knowledge Transfer.
    Bikramjit Banerjee and Peter Stone.
    In The 20th International Joint Conference on Artificial Intelligence, 2007
  • Some other papers on Transfer learning
  • This work addresses the risk of negative transfer and task dissimilarity:
    A2T: Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from Multiple Sources
  • This work improves on fine-tuning by adding new columns to a deep network while never removing previously learned weights, which avoids catastrophic forgetting:
    Progressive Neural Networks
  • This work explicitly models the differences between two domains to adjust a network trained on one domain and applied to a different one.
    Beyond sharing weights for deep domain adaptation
  • This work trains a network on several tasks simultaneously and also incorporates expert demonstrations to create general representations that can then be transferred:
    Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning


    Page maintained by Peter Stone
    Questions? Send me mail