CS394R: Reinforcement Learning: Theory and Practice -- Fall 2016: Resources Page
Resources for
Reinforcement Learning: Theory and Practice
Week 0: Class Overview, Introduction
Slides from week 0: pdf.
Week 1: Introduction and Evaluative Feedback
Slides from Tuesday: pdf.
Slides from Thursday: pdf.
Slides from Shivaram Kalyanakrishnan: pdf.
Sections 1, 2, 4, and 5 and the proof of Theorem 1 in Section 3. The proof of Theorem 3 and the appendices are optional.
UCB:
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer
2002
Sections 1, 2, 3.1, 4, and 5. The details of the proof (Sections 3.2-3.4) are optional.
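For readers who want to see the rule in code, here is a minimal sketch of UCB1 action selection as analyzed in that paper; the bandit interface and helper names are illustrative assumptions, not from the paper.

import math

def ucb1(num_arms, total_steps, sample_reward):
    # Run UCB1; sample_reward(arm) should return a reward in [0, 1].
    counts = [0] * num_arms      # number of times each arm has been pulled
    means = [0.0] * num_arms     # empirical mean reward of each arm
    for t in range(1, total_steps + 1):
        if t <= num_arms:
            arm = t - 1          # pull each arm once to initialize
        else:
            # UCB1 index: empirical mean plus the sqrt(2 ln t / n_i) exploration bonus
            arm = max(range(num_arms),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = sample_reward(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
    return means, counts

# Example: three Bernoulli arms with success probabilities 0.2, 0.5, 0.8.
# import random
# means, counts = ucb1(3, 1000, lambda a: float(random.random() < [0.2, 0.5, 0.8][a]))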
Thompson Sampling: an asymptotically optimal finite-time analysis
Emilie Kaufmann, Nathaniel Korda, and Remi Munos
2012
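In the same spirit, a minimal sketch of Thompson sampling for Bernoulli-reward arms with Beta(1,1) priors (the setting these analyses consider); the interface is an illustrative assumption.

import random

def thompson_bernoulli(num_arms, total_steps, sample_reward):
    # Thompson sampling with Beta(1,1) priors; sample_reward(arm) should return 0 or 1.
    alpha = [1] * num_arms   # 1 + observed successes for each arm
    beta = [1] * num_arms    # 1 + observed failures for each arm
    for _ in range(total_steps):
        # Draw a mean estimate for each arm from its posterior and play the best draw.
        draws = [random.betavariate(alpha[i], beta[i]) for i in range(num_arms)]
        arm = draws.index(max(draws))
        if sample_reward(arm):
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return alpha, beta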
Csaba Szepesvari's banditalgs.com.
Vermorel and Mohri: Multi-Armed Bandit Algorithms and Empirical Evaluation.
Shivaram Kalyanakrishnan and Peter Stone: Efficient Selection of Multiple Bandit Arms: Theory and Practice. In ICML 2010. Here are some related slides.
An RL reading list from Shivaram Kalyanakrishnan.
Rich Sutton's slides for Chapter 2 (1st edition): html.
Rich Sutton's slides for Chapter 2 (2nd edition): pdf.
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li
NIPS 2011
Week 2: MDPs and Dynamic Programming
Slides from week 2: pdf.
Rich Sutton's slides for Chapter 3 (1st edition): pdf.
Rich Sutton's slides for Chapter 4 (1st edition): html.
Email discussion on the Gambler's problem.
The paper "On the Complexity of solving MDPs" (Littman, Dean, and Kaelbling, 1995).
Pashenkova, Rish, and Dechter: Value Iteration and Policy Iteration Algorithms for Markov Decision Problems.
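To make the dynamic-programming readings concrete, here is a minimal value iteration sketch for a finite MDP; the table layout for P and R is an illustrative assumption.

def value_iteration(num_states, num_actions, P, R, gamma=0.95, tol=1e-8):
    # P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(num_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = [max(range(num_actions),
                  key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in range(num_states)]
    return V, policy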
Week 3: Monte Carlo Methods and Temporal Difference Learning
Slides from week 3: pdf.
Some slides on robot localization that include information on importance sampling.
Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering,
A Theoretical and Empirical Analysis of Expected Sarsa.
In ADPRL 2009.
A paper that addresses the relationship between first-visit and every-visit MC (Singh and Sutton, 1996). For some theoretical relationships, see the material starting at Section 3.3 (and the referenced appendices). The equivalence of MC and first-visit TD(1) is proven starting in Section 2.4.
Rich Sutton's slides for Chapter 5: html.
Rich Sutton's old slides for Chapter 6: html.
Rich Sutton's updated slides for Chapter 6: pdf.
A Q-learning video.
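To complement these readings, a minimal tabular Q-learning sketch (replacing the max in the target with an expectation under the current policy gives Expected Sarsa); the environment interface below is an illustrative assumption in the style of common RL toolkits.

import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning with epsilon-greedy exploration.
    # Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    # and env.actions is a list of discrete actions.
    Q = defaultdict(float)                      # Q[(state, action)]
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Q-learning target: reward plus discounted max over next actions (zero if terminal).
            target = r if done else r + gamma * max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q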
Week 4: Multi-Step Bootstrapping and Planning
Slides from week 4: pdf.
The planning slides.
Slides by Alan Fern on Monte Carlo Tree Search and UCT.
On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search
by Khandelwal et al.
A Survey of Monte Carlo Tree Search Methods
by Browne et al.
(IEEE Transactions on Computational Intelligence and AI in Games, Vol. 4, No. 1, March 2012)
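At the heart of UCT, which these readings survey and extend, is UCB1 applied at each tree node during the selection phase; here is a minimal sketch of that step, with the node fields and exploration constant as illustrative assumptions.

import math

def uct_select_child(node, c=1.4):
    # Pick the child maximizing average simulation return plus a UCB-style exploration bonus.
    # Assumes node.children is a non-empty list whose members each have an integer
    # visits count (> 0) and a float total_value (sum of simulation returns).
    log_parent = math.log(sum(child.visits for child in node.children))
    def score(child):
        return child.total_value / child.visits + c * math.sqrt(log_parent / child.visits)
    return max(node.children, key=score)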
The Dependence of Effective Planning Horizon on Model Accuracy
by Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis.
In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2015.
Rich Sutton's Chapter 8 slides.
Rich Sutton's slides for Chapter 9 of the 1st edition (planning and learning): html.
A new survey on Bayesian RL by Ghavamzadeh et al.
Week 5: Approximate On-policy Prediction and Control
Slides from week 5: pdf.
Rich Sutton's slides for Chapter 8 of the 1st edition (generalization): html.
Rich Sutton's slides for Chapter 9: pdf.
Evolutionary Function Approximation
by Shimon Whiteson.
Dopamine: generalization and Bonuses
(2002) Kakade and Dayan.
Keepaway Soccer: From Machine Learning Testbed to Benchmark
- a paper that compares CMAC, RBF, and NN function approximators on the same task.
Residual Algorithms: Reinforcement Learning with Function Approximation
(1995) Leemon Baird. More on the Baird counterexample as well as an alternative to doing gradient descent on the MSE.
Boyan, J. A., and A. W. Moore,
Generalization in Reinforcement Learning: Safely Approximating the Value Function.
In Tesauro, G., D. S. Touretzky, and T. K. Leen (eds.), Advances in Neural Information Processing Systems 7 (NIPS). MIT Press, 1995. Another example of function approximation divergence and a proposed solution.
Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces
(1998) Juan Carlos Santamaria, Richard S. Sutton, Ashwin Ram. Comparisons of several types of function approximators (including instance-based like Kanerva).
Binary action search for learning continuous-action control policies
(2009). Pazis and Lagoudakis. (slides)
Least-Squares Temporal Difference Learning
Justin Boyan.
A Convergent Form of Approximate Policy Iteration
(2002) T. J. Perkins and D. Precup. A convergence guarantee with function approximation.
Moore and Atkeson: The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State Spaces.
Sherstov and Stone: Function Approximation via Tile Coding: Automating Parameter Choice.
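A minimal sketch of tile coding for a two-dimensional continuous input, returning the active feature indices that a linear approximator would sum; the offsets and sizes here are simplified assumptions, not the parameter choices the paper automates.

def active_tiles(x, y, num_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    # Return one active tile index per tiling for a point (x, y) in [low, high]^2.
    # Each tiling is shifted by a fraction of a tile width in both dimensions.
    width = (high - low) / tiles_per_dim
    tiles_per_row = tiles_per_dim + 1            # one extra tile covers the offset overhang
    indices = []
    for t in range(num_tilings):
        offset = t * width / num_tilings
        col = min(int((x - low + offset) / width), tiles_per_dim)
        row = min(int((y - low + offset) / width), tiles_per_dim)
        indices.append(t * tiles_per_row * tiles_per_row + row * tiles_per_row + col)
    return indices

def value(weights, tiles):
    # Linear value estimate: sum the weights of the active tiles
    # (weights has num_tilings * (tiles_per_dim + 1)**2 entries).
    return sum(weights[i] for i in tiles)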
Chapman and Kaelbling: Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons.
Sašo Džeroski, Luc De Raedt and Kurt Driessens: Relational Reinforcement Learning.
Sprague and Ballard: Multiple-Goal Reinforcement Learning with Modular Sarsa(0).
A post on Deep Q learning, and another.
Week 6: Approximate Off-policy Methods and Eligibility Traces
Slides from week 6: pdf.
Slides from Thursday: pdf.
Neural network slides (from Tom Mitchell's book)
Rich Sutton's slides for Chapter 7 of the first edition: html.
Rich Sutton's updated slides: pdf.
Dayan: The Convergence of TD(λ) for General λ.
The paper that introduced Dutch traces and off-policy true online TD.
An empirical analysis of true online TD: True Online Temporal-Difference Learning by van Seijen et al. (includes a comparison to replacing traces).
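For concreteness, a minimal sketch of on-policy linear TD(λ) with accumulating eligibility traces, the classical baseline these readings analyze and extend; the feature and environment interfaces are illustrative assumptions, and replacing or Dutch traces would change only the trace-update line.

def linear_td_lambda(env, features, num_features, num_episodes,
                     alpha=0.01, gamma=0.99, lam=0.9):
    # Policy evaluation with linear TD(lambda). Assumes env.reset() -> state,
    # env.step() -> (next_state, reward, done) under the policy being evaluated,
    # and features(state) -> list of num_features floats.
    w = [0.0] * num_features
    for _ in range(num_episodes):
        s = env.reset()
        z = [0.0] * num_features                              # eligibility trace vector
        done = False
        while not done:
            s2, r, done = env.step()
            phi = features(s)
            v = sum(wi * xi for wi, xi in zip(w, phi))
            v2 = 0.0 if done else sum(wi * xi for wi, xi in zip(w, features(s2)))
            delta = r + gamma * v2 - v                        # TD error
            z = [gamma * lam * zi + xi for zi, xi in zip(z, phi)]   # accumulating traces
            w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
            s = s2
    return w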
Toward Off-Policy Learning Control with Function Approximation
Maei et al. ICML 2010 - solves Baird's counterexample - Greedy-GQ for linear function approximation control.
Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Maei et al. NIPS 2009 - GTD for nonlinear function approximation policy evaluation
Train faster, generalize better: Stability of stochastic gradient descent
by Moritz Hardt, Benjamin Recht, and Yoram Singer
Keepaway PASS and GETOPEN, and the keepaway main page.
An extensive empirical study of many different linear TD algorithms by Adam White and Martha White (AAMAS 2016).
Week 7: Applications and Case Studies
Neural network slides (from Tom Mitchell's book)
The slides I showed on understanding what deep RL networks have learned (in particular LSTM units in a partially observable environment).
The slides I showed on AlphaGo.
Some minimax slides: ppt.
Slides by Sylvain Gelly on UCT.
Motif backgammon (online player)
GNU backgammon
Tesauro, G., Temporal Difference Learning and TD-Gammon. Communications of the ACM, 1995.
Practical Issues in Temporal Difference Learning: an earlier paper by Tesauro (with a few more details).
Pollack, J.B., & Blair, A.D. Co-evolution in the successful learning of backgammon strategy. Machine Learning, 1998.
Tesauro, G. Comments on Co-Evolution in the Successful Learning of Backgammon Strategy. Machine Learning, 1998.
Modular Neural Networks for Learning Context-Dependent Game Strategies, Justin Boyan, 1992: a partial replication of TD-Gammon.
A fairly complete overview of one of the first applications of UCT to Go: "Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go". Gelly and Silver. AIJ 2011.
Some papers from Simon Lucas' group on comparing TD learning and co-evolution in various games: Othello; Go; Simple grid-world; Treasure hunt.
S. Gelly and D. Silver. Achieving Master-Level Play in 9x9 Computer Go. In Proceedings of the 23rd Conference on Artificial Intelligence, Nectar Track (AAAI-08), 2008. Also available from here.
Simulation-Based Approach to General Game Playing
Hilmar Finnsson and Yngvi Bjornsson
AAAI 2008.
Some papers from the UT Learning Agents Research Group on General Game Playing.
Deep Reinforcement Learning with Double Q-learning.
Hado van Hasselt, Arthur Guez, David Silver
Week 8: Efficient Model-Based Exploration
Slides from week 8: pdf.
I also showed slides on Fitted R-max from Nick Jong's thesis: annotated pdf, and some R-max slides.
Code for Fitted RMax.
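To make the model-based exploration idea concrete, here is a minimal sketch of the planning step in a tabular R-max-style algorithm, in which state-action pairs visited fewer than m times are treated optimistically; the data layout, threshold, and iteration count are illustrative assumptions (Fitted R-max extends this kind of optimism to continuous state spaces).

def rmax_plan(num_states, num_actions, counts, model, r_max, m=10, gamma=0.95, iters=200):
    # counts[s][a] = visit count; model[s][a] = (empirical_reward, {next_state: probability}).
    # Unknown pairs (counts < m) are assumed to earn r_max forever: value r_max / (1 - gamma).
    opt = r_max / (1.0 - gamma)

    def q_value(s, a, V):
        if counts[s][a] < m:
            return opt                             # optimistic value for unknown pairs
        r, trans = model[s][a]
        return r + gamma * sum(p * V[s2] for s2, p in trans.items())

    V = [opt] * num_states
    for _ in range(iters):                         # value iteration on the optimistic model
        V = [max(q_value(s, a, V) for a in range(num_actions)) for s in range(num_states)]
    # Act greedily with respect to the optimistic model.
    return [max(range(num_actions), key=lambda a: q_value(s, a, V)) for s in range(num_states)]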
Near-Optimal Reinforcement Learning in Polynomial Time
Michael Kearns and Satinder Singh
Strehl et al.: PAC Model-Free Reinforcement Learning.
Efficient Structure Learning in Factored-state MDPs
Alexander L. Strehl, Carlos Diuk, and Michael L. Littman
AAAI'2007
A shorter paper on MBIE.
The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning
Carlos Diuk, Lihong Li, and Bethany R. Leffler
ICML 2009
Slides and video for the k-meteorologists paper
Safe Exploration in Markov Decision Processes
Moldovan and Abbeel, ICML 2012
(safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state)
Week 9: Abstraction: Options and Hierarchy
Slides from week 9: pdf.
Ruohan Zhang's 2013 slides on forms of hierarchy.
Sasha Sherstov's 2004 slides on option discovery.
Automatic Discovery of Subgoals in RL using Diverse Density
by McGovern and Barto.
A page devoted to option discovery.
Improved Automatic Discovery of Subgoals for Options in Hierarchical Reinforcement Learning
by Kretchmar et al.
Nick Jong and Todd Hester's paper on the utility of temporal abstraction.
The slides.
The journal version of the MaxQ paper.
A follow-up paper on eliminating irrelevant variables within a subtask:
State Abstraction in MAXQ Hierarchical Reinforcement Learning
Automatic Discovery and Transfer of MAXQ Hierarchies
(from Dietterich's group - 2008)
Lihong Li, Thomas J. Walsh, and Michael L. Littman, Towards a Unified Theory of State Abstraction for MDPs, Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.
Tom Dietterich's tutorial on abstraction.
Nick Jong's paper on state abstraction discovery.
The slides.
Nick Jong's Thesis code repository and annotated slides.
Week 10: Multiagent RL
Slides from week 10: pdf.
The ones on threats (pdf), and the relevant paper.
The ones on CMLeS (ppt).
Journal version of WoLF.
A CMLeS-like algorithm that can be applied
Busoniu, L. and Babuska, R. and De Schutter, B.
A comprehensive survey of multiagent reinforcement learning
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 28(2), 156-172, 2008.
Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents
by Ming Tan
Michael Bowling
Convergence and No-Regret in Multiagent Learning
NIPS 2004
Kok, J.R. and Vlassis, N., Collaborative multiagent reinforcement learning by payoff propagation, The Journal of Machine Learning Research, 7, 1828, 2006.
A brief survey on Multiagent Learning by Doran Chakraborty.
gametheory.net
Some useful slides: (Part C) from Michael Bowling on game theory, stochastic games, and correlated equilibria; and (Part D) from Michael Littman with more on stochastic games.
Scaling up to bigger games with empirical game theory.
Rob Powers and Yoav Shoham
New Criteria and a New Algorithm for Learning in Multi-Agent Systems
NIPS 2004. Journal version.
A suite of game generators called GAMUT from Stanford.
RoShamBo (rock-paper-scissors) contest
U. of Alberta page on automated poker.
A paper introducing ad hoc teamwork.
An article addressing ad hoc teamwork, applied in both predator/prey and RoboCup soccer.
Ad hoc teamwork as flocking.
Week 11: Policy Gradient Methods
This paper compares the policy gradient RL method with other algorithms on the walk-learning task: Machine Learning for Fast Quadrupedal Locomotion. Kohl and Stone. AAAI 2004.
From Jan Peters' group: Policy Search for Motor Primitives in Robotics.
Szita and Lörincz: Learning Tetris Using the Noisy Cross-Entropy Method.
Autonomous helicopter flight via reinforcement learning.
Andrew Ng, H. Jin Kim, Michael Jordan and Shankar Sastry.
In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 17, 2004.
PEGASUS: A policy search method for large MDPs and POMDPs.
Andrew Ng and Michael Jordan
Some of the helicopter videos learned with PEGASUS.
Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods.
J. Bagnell and J. Schneider
Proceedings of the International Conference on Robotics and Automation 2001, IEEE, May, 2001.
A couple of articles on the details of actor-critic in practice, by Tsitsiklis and by Williams.
Natural Actor Critic.
Jan Peters and Stefan Schaal
Neurocomputing 2008. Earlier version in ECML 2005.
PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
Marc Peter Deisenroth and Carl Edward Rasmussen
ICML 2011
The original policy gradient RL paper.
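As one concrete instance of a policy gradient method, here is a minimal REINFORCE-style Monte Carlo policy gradient sketch with a softmax policy over discrete actions and linear action preferences; the environment and feature interfaces are illustrative assumptions and are not taken from any of the papers above.

import math
import random

def reinforce(env, features, num_features, num_actions, num_episodes,
              alpha=0.01, gamma=0.99):
    # Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    # and features(state, action) -> list of num_features floats.
    theta = [0.0] * num_features

    def policy(s):
        prefs = [sum(t * x for t, x in zip(theta, features(s, a))) for a in range(num_actions)]
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]               # softmax action probabilities

    for _ in range(num_episodes):
        # Generate one episode under the current policy.
        s, done, episode = env.reset(), False, []
        while not done:
            pi = policy(s)
            a = random.choices(range(num_actions), weights=pi)[0]
            s2, r, done = env.step(a)
            episode.append((s, a, r))
            s = s2
        # Walk backwards, accumulate returns, and take a gradient step per visited pair.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            pi = policy(s)
            phis = [features(s, b) for b in range(num_actions)]
            # grad log pi(a|s) for a softmax-linear policy: phi(s,a) - sum_b pi(b|s) phi(s,b)
            grad = [phis[a][i] - sum(pi[b] * phis[b][i] for b in range(num_actions))
                    for i in range(num_features)]
            # (the gamma**t factor from the strict episodic derivation is omitted for simplicity)
            theta = [t + alpha * G * g for t, g in zip(theta, grad)]
    return theta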
Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics
Sergey Levine, Pieter Abbeel. NIPS 2014.
video
Trust Region Policy Optimization
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. ICML 2015.
video
A post by Karpathy on deep RL, including policy gradients (repeated from week 5).
Characterizing Reinforcement Learning Methods through Parameterized Learning Problems
Shivaram Kalyanakrishnan and Peter Stone.
Machine Learning (MLJ), 84(1--2):205-47, July 2011.
Week 12: Inverse RL and Transfer Learning
Some transfer learning slides; the ones on instance-based transfer; the ones on curriculum learning.
Slides on inverse RL from Pieter Abbeel.
Towards Resolving Unidentifiability in Inverse Reinforcement Learning.
Kareem Amin and Satinder Singh
Nonlinear Inverse Reinforcement Learning with Gaussian Processes
Sergey Levine, Zoran Popovic, Vladlen Koltun.
Inverse Reinforcement Learning in Partially Observable Environments
Jaedeug Choi and Kee-Eung Kim
Improving Action Selection in MDP's via Knowledge Transfer.
Alexander A. Sherstov and Peter Stone.
In Proceedings of the Twentieth National Conference on Artificial Intelligence, July 2005.
Associated slides.
General Game Learning using Knowledge Transfer.
Bikramjit Banerjee and Peter Stone.
In The 20th International Joint Conference on Artificial Intelligence, 2007
Associated slides.
Recent papers on IRL and learning by demonstration
Deep Apprenticeship Learning for Playing Video Games
Maximum Entropy Deep Inverse Reinforcement Learning
Generative Adversarial Imitation Learning
Recent papers on Transfer learning
This work addresses the risk of negative transfer and task dissimilarity
A2T: Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from multiple sources
This work improves on fine-tuning by adding new columns to a deep net while keeping (never removing) the previously learned weights, which avoids catastrophic forgetting.
Progressive Neural Networks
This work explicitly models the differences between two domains to adjust a network trained on one domain and applied to a different one.
Beyond sharing weights for deep domain adaptation
This work trains a network on several tasks simultaneously and also incorporates expert demonstrations to create general representations that can then be transferred.
Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning
Week 13: Deep RL
Reinforcement learning with unsupervised auxiliary tasks
from DeepMind includes some action-conditional learning.
An explanation of LSTMs.
The Recurrent Temporal Restricted Boltzmann Machine.
Page maintained by Peter Stone.
Questions? Send me mail.