CS394R/ECE381V: Reinforcement Learning: Theory and Practice -- Spring 2024: Resources Page

Resources for Reinforcement Learning: Theory and Practice


Week 1: Class Overview, Intro, and Multi-armed Bandits

  • Slides from Tuesday: Peter's; Amy's.
  • Slides from Thursday: Peter's; Amy's.
  • Slides from Shivaram Kalyanakrishnan: pdf.
  • Readings from a past version of the course:
  • Sections 1, 2, 4, and 5 and the proof of Theorem 1 in Section 3. The proof of Theorem 3 and the appendices are optional.
    UCB: Finite-time Analysis of the Multiarmed Bandit Problem (a minimal UCB1 sketch appears at the end of this list)
    Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer
    2002
  • Sections 1, 2, 3.1, 4, and 5. The details of the proof (Sections 3.2-3.4) are optional.
    Thompson Sampling: an asymptotically optimal finite-time analysis
    Emilie Kaufmann, Nathaniel Korda, and Remi Munos
    2012
  • Csaba Szepesvari's banditalgs.com.
  • Vermorel and Mohri: Multi-Armed Bandit Algorithms and Empirical Evaluation.
  • Shivaram Kalyanakrishnan and Peter Stone: Efficient Selection of Multiple Bandit Arms: Theory and Practice. In ICML 2010. Here are some related slides.
  • An RL reading list from Shivaram Kalyanakrishnan.
  • An Empirical Evaluation of Thompson Sampling
    Olivier Chapelle and Lihong Li
    NeurIPS 2011
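  • For reference, the UCB1 rule from the Auer et al. paper above fits in a few lines. A minimal Python sketch on a Bernoulli bandit (the arm means, horizon, and function name are invented for illustration):

    import math
    import random

    def ucb1(means, horizon=10000, seed=0):
        """Run UCB1 on a Bernoulli bandit with the given arm means."""
        rng = random.Random(seed)
        k = len(means)
        counts = [0] * k        # pulls per arm
        totals = [0.0] * k      # summed rewards per arm
        total_reward = 0.0
        for t in range(1, horizon + 1):
            if t <= k:
                arm = t - 1     # pull each arm once to initialize
            else:
                # empirical mean plus the sqrt(2 ln t / n) exploration bonus
                arm = max(range(k), key=lambda a: totals[a] / counts[a]
                          + math.sqrt(2 * math.log(t) / counts[a]))
            r = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            totals[arm] += r
            total_reward += r
        return total_reward / horizon

    print(ucb1([0.3, 0.5, 0.7]))  # average reward approaches 0.7, the best arm's mean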

Week 2: MDPs and Dynamic Programming

  • Slides from Tuesday: Chapter 3.
  • Slides from Thursday: Chapter 4 overview; Examples.
  • The paper "On the Complexity of Solving MDPs" (Littman, Dean, and Kaelbling, 1995).
  • Pashenkova, Rish, and Dechter: Value Iteration and Policy Iteration Algorithms for Markov Decision Problems. (A value iteration sketch appears at the end of this list.)
  • Rich Sutton's slides for Chapter 4 (1st edition): html.
  • Email discussion on the Gambler's problem.
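  • The value iteration sketch referenced above; the two-state MDP is invented for illustration:

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        """P[s][a] = list of (prob, next_state); R[s][a] = expected reward.
        Sweeps Bellman optimality backups until the values stop changing."""
        V = [0.0] * len(P)
        while True:
            delta = 0.0
            for s in range(len(P)):
                v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                        for a in range(len(P[s])))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                return V

    # Two states; action 0 stays put, action 1 moves to the other state.
    P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)], [(1.0, 0)]]]
    R = [[0.0, 1.0], [2.0, 0.0]]
    print(value_iteration(P, R))  # roughly [19, 20]: staying in state 1 earns 2/(1-0.9)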

Week 3: Monte Carlo Methods and Temporal Difference Learning

  • Slides from Tuesday: Chapter 5
  • Slides from Thursday: Chapter 6 overview; Examples.
  • Some slides on robot localization that include information on importance sampling.
  • Rich Sutton's slides for Chapter 5: html.
  • Rich Sutton's old slides for Chapter 6: html.
  • A blog post explaining why double learning helps deal with maximization bias.
  • A Q-learning video. (A tabular Q-learning sketch appears at the end of this list.)
  • Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering, A Theoretical and Empirical Analysis of Expected Sarsa. In ADPRL 2009.
  • A paper with surprising results on the strict variance bounds of popular importance sampling techniques. (Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. ICML 2020.)
  • The article, referred to in Sec. 5.1 of the book, that proves that every-visit MC converges to the correct value function in the limit (infinite experience).
  • A paper that shows that using an empirical estimate of the behavior policy works better than using the true behavior policy in the importance sampling ratio.
  • A paper on Reducing Sampling Error in Batch Temporal Difference Learning by Pavse et al.
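  • The tabular Q-learning sketch referenced above, on a toy chain environment (the chain and all constants are invented, not from the book):

    import random

    def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.95,
                   epsilon=0.1, seed=0):
        """Tabular Q-learning on a chain: action 1 moves right, action 0
        moves left; reaching the right end yields reward 1 and ends the
        episode. Optimistic initial values keep exploration from stalling."""
        rng = random.Random(seed)
        goal = n_states - 1
        Q = [[1.0, 1.0] for _ in range(n_states)]   # optimistic init
        for _ in range(episodes):
            s = 0
            while s != goal:
                if rng.random() < epsilon:
                    a = rng.randrange(2)            # explore
                else:
                    a = 0 if Q[s][0] >= Q[s][1] else 1
                s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
                r = 1.0 if s2 == goal else 0.0
                bootstrap = 0.0 if s2 == goal else max(Q[s2])
                Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])
                s = s2
        return Q

    print(q_learning())   # Q[s][1] approaches gamma ** (3 - s) for s = 0..3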

Week 4: n-step Bootstrapping and Planning

  • Slides from Tuesday: Chapter 7
  • Slides from Thursday: intro and summary; Examples; The planning ones; Slides on MCTS.
  • Multi-Step Reinforcement Learning: A Unifying Algorithm by De Asis et al. (The basic n-step return is sketched in code at the end of this list.)
  • Slides by Alan Fern on Monte Carlo Tree Search and UCT
  • On the Analysis of Complex Backup Strategies in Monte Carlo Tree Search by Khandelwal, Liebman, Stone, and Niekum. ICML 2016.
  • A Survey of Monte Carlo Tree Search Methods by Browne et al. (IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 4, NO. 1, MARCH 2012)
  • The Dependence of Effective Planning Horizon on Model Accuracy by Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2015.
  • Rich Sutton's slides for Chapter 9 of the 1st edition (planning and learning): html.
  • A paper that uses MCTS for robot learning: TEXPLORE: Real-Time Sample-Efficient Reinforcement Learning for Robots, by Hester and Stone.
  • TD Gamma Paper mentioned in class: TD Gamma: Re-evaluating Complex Backups in Temporal Difference Learning, by Konidaris, Niekum, and Thomas. NeurIPS 2011.
  • TD Omega Paper mentioned in class: Policy Evaluation Using the Omega-Return, by Thomas, Niekum, Theocharous, and Konidaris. NeurIPS 2015.
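  • The n-step return that Chapter 7 builds on is a truncated reward sum plus a bootstrap from the value estimate; a small helper to make that concrete (names and numbers are mine):

    def n_step_return(rewards, values, t, n, gamma=0.9):
        """G_{t:t+n} = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n}
                       + gamma^n * V(s_{t+n}), truncated at the episode end."""
        T = len(rewards)               # episode length
        G = 0.0
        for k in range(min(n, T - t)):
            G += (gamma ** k) * rewards[t + k]
        if t + n < T:                  # bootstrap only if we didn't hit the end
            G += (gamma ** n) * values[t + n]
        return G

    # Rewards r_1..r_3 of a short episode and value estimates for s_0..s_3:
    print(n_step_return(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.8, 0.0],
                        t=0, n=2))    # 0 + 0.9*0 + 0.81*0.8 = 0.648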

Week 5: On-policy Prediction with Approximation

  • Main slide deck from Tuesday/Thursday (created by Scott Niekum): Chapter 9; Examples; The intro slides. (A semi-gradient TD(0) sketch appears at the end of this list.)
  • Rich Sutton's slides for Chapter 8 of the 1st edition (generalization and function approximation): html.
  • Keepaway Soccer: From Machine Learning Testbed to Benchmark - a paper that compares CMAC, RBF, and NN function approximators on the same task. The keepaway slides: pdf
  • Some slides showing the proof of the TD fixed point with the factor of 1/(1 - gamma)
  • A good overview of tile coding.
  • Evolutionary Function Approximation by Shimon Whiteson.
  • Sherstov and Stone: Function Approximation via Tile Coding: Automating Parameter Choice.
  • Sridhar Mahadevan's proto-value functions
  • Boyan, J. A., and A. W. Moore, Generalization in Reinforcement Learning: Safely Approximating the Value Function.
  • Andrew Smith's Applications of the Self-Organising Map to Reinforcement Learning
  • Least-Squares Temporal Difference Learning Justin Boyan.
  • Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces (1998) Juan Carlos Santamaria, Richard S. Sutton, Ashwin Ram. Comparisons of several types of function approximators (including instance-based like Kanerva).
  • Binary action search for learning continuous-action control policies (2009). Pazis and Lagoudakis.
  • A Convergent Form of Approximate Policy Iteration (2002) T. J. Perkins and D. Precup. A convergence guarantee with function approximation.
  • Chapman and Kaelbling: Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons.
  • Dopamine: generalization and Bonuses (2002) Kakade and Dayan.
  • Model-Free Least-Squares Policy Iteration Michail G. Lagoudakis and Ronald Parr Proceedings of NIPS*2001: Neural Information Processing Systems: Natural and Synthetic Vancouver, BC, December 2001, pp. 1547-1554.
  • Technical update: Least-squares temporal difference learning Justin A. Boyan
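  • The semi-gradient TD(0) sketch referenced above, using state aggregation on the familiar 5-state random walk (the aggregation scheme is chosen arbitrarily for illustration):

    import random

    def semi_gradient_td0(episodes=20000, alpha=0.01, gamma=1.0, seed=0):
        """Semi-gradient TD(0) on a 5-state random walk (true values 1/6..5/6),
        with state aggregation: states {0,1}, {2,3}, {4} share one weight each,
        so v_hat(s) = w[s // 2] and the gradient is a one-hot group indicator."""
        rng = random.Random(seed)
        n = 5
        w = [0.0, 0.0, 0.0]
        for _ in range(episodes):
            s = 2                                  # start in the middle
            while True:
                s2 = s + (1 if rng.random() < 0.5 else -1)
                r = 1.0 if s2 == n else 0.0        # +1 only off the right end
                v_next = 0.0 if s2 in (-1, n) else w[s2 // 2]
                delta = r + gamma * v_next - w[s // 2]
                w[s // 2] += alpha * delta         # semi-gradient update
                if s2 in (-1, n):
                    break
                s = s2
        return w

    print(semi_gradient_td0())  # three weights near the visit-weighted averages of 1/6..5/6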

Week 6: On-policy Control with Approximation and Off-policy Methods with Approximation

  • Slides from Tuesday: Chapter 10.
  • Slides from Thursday: Chapter 11
  • Sprague and Ballard: Multiple-Goal Reinforcement Learning with Modular Sarsa(0).
  • Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results by Mahadevan.
  • Discounted Reinforcement Learning is Not an Optimization Problem by Naik, Shariff, Yasui, and Sutton.
  • Moore and Atkeson: The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State Spaces.
  • Residual Algorithms: Reinforcement Learning with Function Approximation (1995) Leemon Baird. More on the Baird counterexample as well as an alternative to doing gradient descent on the MSE. (A tiny two-state divergence sketch appears at the end of this list.)
  • Toward Off-Policy Learning Control with Function Approximation. Maei et al. ICML 2010. Solves Baird's counterexample; introduces Greedy-GQ for control with linear function approximation.
  • Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces (1998) Juan Carlos Santamaria, Richard S. Sutton, Ashwin Ram. Comparisons of several types of function approximators (including memory-based).
  • Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. A paper about reward shaping for average reward RL.
  • A paper that uses average reward RL.
  • Dr Jekyll and Mr Hyde: The Strange Case of Off-Policy Policy Updates. By Laroche and Tachet des Combes on convergence guarantees.
  • Near Optimal Provable Uniform Convergence in Off-Policy Evaluation for Reinforcement Learning By Yin et al.
  • Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality By Xu et al.
  • On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs By Wan and Sutton.
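  • The divergence sketch referenced above, in the spirit of the book's two-state "w, 2w" example from Chapter 11: repeatedly updating one transition off-policy makes the shared weight grow without bound (step size and horizon are arbitrary):

    def off_policy_td_divergence(alpha=0.1, gamma=0.99, steps=100):
        """Two states with features x=1 and x=2 sharing one weight w, all
        rewards zero. Updating only the 1 -> 2 transition gives TD error
        delta = gamma*(2w) - w = (2*gamma - 1)*w > 0 whenever gamma > 0.5,
        so each update pushes w further from the true value of 0."""
        w = 1.0
        for _ in range(steps):
            delta = gamma * 2 * w - w      # TD error on the 1 -> 2 transition
            w += alpha * delta * 1.0       # gradient is the feature x = 1
        return w

    print(off_policy_td_divergence())      # grows without bound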

Week 7: Eligibility Traces

  • Slides from Tuesday: intro and summary; Examples
  • Rich Sutton's slides: pdf
  • The keepaway slides: pdf
  • Keepaway PASS slides, GETOPEN slides and the keepaway main page
  • The forward and backward views of TD(lambda) are equivalent. (A backward-view TD(lambda) sketch appears at the end of this list.)
  • Dayan: The Convergence of TD(λ) for General λ.
  • The paper that introduced Dutch traces and off-policy true on-line TD
  • An empirical analysis of true on-line TD: True Online Temporal-Difference Learning by van Seijen et al. (includes comparison to replacing traces)
  • Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
  • An extensive empirical study of many different linear TD algorithms by Adam White and Martha White (AAMAS 2016).
  • A paper that addresses relationship between first-visit and every-visit MC (Singh and Sutton, 1996). For some theoretical relationships see section starting at section 3.3 (and referenced appendices). The equivalence of MC and first visit TD(1) is proven starting in Section 2.4.
  • A paper that discusses pseudo-termination
  • Reinforcement Learning and Decision Making (RLDM) paper on intelligent decision making by Richard Sutton mentioned in class.
  • Maei et al. NeurIPS 2009 - GTD for nonlinear function approximation policy evaluation
  • TD Gamma Paper mentioned in class: TD Gamma: Re-evaluating Complex Backups in Temporal Difference Learning, by Konidaris, Niekum, and Thomas. NeurIPS 2011.
  • TD Omega Paper mentioned in class: Policy Evaluation Using the Omega-Return, by Thomas, Niekum, Theocharous, and Konidaris. NeurIPS 2015.
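  • The backward-view TD(lambda) sketch referenced above: the eligibility trace turns each TD error into an update of all recently visited states (tabular, on the same toy random walk; constants are arbitrary):

    import random

    def td_lambda(episodes=3000, alpha=0.05, gamma=1.0, lam=0.8, seed=0):
        """Tabular TD(lambda), backward view, on the 5-state random walk."""
        rng = random.Random(seed)
        n = 5
        V = [0.0] * n
        for _ in range(episodes):
            e = [0.0] * n                    # eligibility traces
            s = 2
            while True:
                s2 = s + (1 if rng.random() < 0.5 else -1)
                r = 1.0 if s2 == n else 0.0
                v_next = 0.0 if s2 in (-1, n) else V[s2]
                delta = r + gamma * v_next - V[s]
                e[s] += 1.0                  # accumulating trace
                for i in range(n):           # credit recently visited states
                    V[i] += alpha * delta * e[i]
                    e[i] *= gamma * lam
                if s2 in (-1, n):
                    break
                s = s2
        return V

    print(td_lambda())  # roughly [1/6, 2/6, 3/6, 4/6, 5/6]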

Week 8: Policy Gradient Methods

  • Slides from Tuesday: Chapter 13
  • Slides from Thursday: overview; policy search; AIBO walking; comparing VF and PS methods
  • Train faster, generalize better: Stability of stochastic gradient descent by Moritz Hardt, Benjamin Recht, and Yoram Singer
  • A video from Riedmiller's group
  • Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. Nate Kohl and Peter Stone In Proceedings of the IEEE International Conference on Robotics and Automation, May 2004.
  • This paper compares the policy gradient RL method with other algorithms on the walk learning: Machine Learning for Fast Quadrupedal Locomotion. Kohl and Stone. AAAI 2004.
  • Overview of Policy Gradient Methods by Jan Peters: http://www.scholarpedia.org/article/Policy_gradient_methods
  • A couple of articles on the details of actor-critic in practice by Tsitsiklis and by Williams.
  • Natural Actor Critic. Jan Peters and Stefan Schaal Neurocomputing 2008. Earlier version in ECML 2005.
  • The original policy gradient RL paper. (A minimal REINFORCE sketch appears at the end of this list.)
  • A post by Karpathy on deep RL including with policy gradients
  • Characterizing Reinforcement Learning Methods through Parameterized Learning Problems Shivaram Kalyanakrishnan and Peter Stone. Machine Learning (MLJ), 84(1--2):205-47, July 2011.
  • Comparing Evolutionary and Temporal Difference Methods for Reinforcement Learning. Matthew Taylor, Shimon Whiteson, and Peter Stone. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1321-28, July 2006.
  • Evolutionary Function Approximation for RL. Whiteson and Stone, MLJ 2006.
  • Spinning up in Deep Reinforcement Learning Resources from OpenAI on getting up to speed using Deep Reinforcement Learning
  • Google blog post on QT-Opt.
  • QT-Opt Paper. Kalashnikov et al. (CoRL 2018).
  • End-to-End Training of Deep Visuomotor Policies. Levine et al. 2015
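  • The REINFORCE sketch referenced above, on a two-armed bandit with a softmax policy and a running-average baseline (the problem and step size are invented):

    import math
    import random

    def reinforce_bandit(steps=5000, alpha=0.1, seed=0):
        """REINFORCE with a softmax policy over two arms (Bernoulli rewards
        with means 0.2 and 0.8); the better arm's preference should rise."""
        rng = random.Random(seed)
        means = [0.2, 0.8]
        h = [0.0, 0.0]                       # action preferences
        baseline = 0.0
        for t in range(1, steps + 1):
            z = [math.exp(x) for x in h]
            pi = [x / sum(z) for x in z]     # softmax policy
            a = 0 if rng.random() < pi[0] else 1
            r = 1.0 if rng.random() < means[a] else 0.0
            baseline += (r - baseline) / t   # running average as baseline
            for b in range(2):               # grad log pi = 1[a=b] - pi[b]
                h[b] += alpha * (r - baseline) * ((1.0 if b == a else 0.0) - pi[b])
        z = [math.exp(x) for x in h]
        return [x / sum(z) for x in z]

    print(reinforce_bandit())  # probability mass shifts toward arm 1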

Week 9: Applications and Case Studies

  • Slides from Tuesday: intro and summary;
  • The slides I showed on AlphaGo
  • Main page for GT Sophy
  • In December 2022, I gave an invited talk in the NeurIPS workshop on Reinforcement Learning for Real Life (RL4RealLife) on "Outracing Champion Gran Turismo Drivers with Deep Reinforcement Learning", based on our article in Nature (30-minute video).
  • Autonomous navigation of stratospheric balloons using reinforcement learning.
  • Accelerating fusion science through learned plasma control.
  • Guided policy search for visual manipulation: pdf video
  • OpenAI hide and seek: blog post
  • OpenAI Rubik's cube: blog post
  • Reward-Respecting Subtasks for Model-Based Reinforcement Learning
  • from Jan Peters' group: Policy Search for Motor Primitives in Robotics
  • Szita and Lörincz: Learning Tetris Using the Noisy Cross-Entropy Method.
  • Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods. J. Bagnell and J. Schneider Proceedings of the International Conference on Robotics and Automation 2001, IEEE, May, 2001.
  • Autonomous helicopter flight via reinforcement learning. Andrew Ng, H. Jin Kim, Michael Jordan and Shankar Sastry. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NeurIPS) 17, 2004.
  • PEGASUS: A policy search method for large MDPs and POMDPs. Andrew Ng and Michael Jordan. Some of the helicopter videos learned with PEGASUS.
  • Autonomous reinforcement learning on raw visual input data in a real world application. Sascha Lange, Martin Riedmiller, Arne Voigtlander. IJCNN 2012.
  • Tesauro, G., Temporal Difference Learning and TD-Gammon. Communications of the ACM, 1995
  • Practical Issues in Temporal Difference Learning: an earlier paper by Tesauro (with a few more details)
  • Pollack, J.B., & Blair, A.D. Co-evolution in the successful learning of backgammon strategy. Machine Learning, 1998
  • Tesauro, G. Comments on Co-Evolution in the Successful Learning of Backgammon Strategy. Machine Learning, 1998.
  • Modular Neural Networks for Learning Context-Dependent Game Strategies, Justin Boyan, 1992: a partial replication of TD-Gammon.
  • A fairly complete overview of one of the first applications of UCT to Go: "Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go". Gelly and Silver. AIJ 2011.
  • Some papers from Simon Lucas' group on comparing TD learning and co-evolution in various games: Othello; Go;
  • S. Gelly and D. Silver. Achieving Master-Level Play in 9x9 Computer Go. In Proceedings of the 23rd Conference on Artificial Intelligence, Nectar Track (AAAI-08), 2008.
  • Simulation-Based Approach to General Game Playing Hilmar Finnsson and Yngvi Bjornsson AAAI 2008.
  • Some papers from the UT Learning Agents Research Group on General Game Playing
  • Deep Reinforcement Learning with Double Q-learning. Hado van Hasselt, Arthur Guez, David Silver
  • Scaling Reinforcement Learning toward RoboCup Soccer. Peter Stone and Richard S. Sutton. Proceedings of the Eighteenth International Conference on Machine Learning, pp. 537-544, Morgan Kaufmann, San Francisco, CA, 2001.
  • The UT Austin Villa RoboCup team home page.
  • Greg Kuhlmann's follow-up on progress in 3v2 keepaway
  • Kalyanakrishnan et al.: Model-based Reinforcement Learning in a Complex Domain.
  • Making a Robot Learn to Play Soccer Using Reward and Punishment. Heiko Müller, Martin Lauer, Roland Hafner, Sascha Lange, Artur Merke and Martin Riedmiller. 30th Annual German Conference on AI, KI 2007.
  • Reinforcement Learning for Sensing Strategies. C. Kwok and D. Fox. Proceedings of IROS, 2004.
  • Learning to trade via direct reinforcement John Moody and Matthew Saffell IEEE Transactions on Neural Networks, 2001.
  • Reinforcement learning for optimized trade execution Yuriy Nevmyvaka, Yi Feng, and Michael Kearns ICML 2006
  • Adaptive Treatment of Epilepsy via Batch-mode Reinforcement Learning Arthur Guez, Robert D. Vincent, Massimo Avoli, Joelle Pineau. IAAI 2008
  • PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs. Thomas G. Dietterich, Majid Taleghan, and Mark Crowley AAAI 2013.
  • Design, Analysis, and Learning Control of a Fully Actuated Micro Wind Turbine. J. Zico Kolter, Zachary Jackowski, Russ Tedrake American Controls Conference 2012.
  • Sutton's "The Bitter Lesson"

Week 10: Modern Landscape

  • Slides from Tuesday on Rainbow DQN and distributional RL
  • Slides from Thursday on TRPO, DDPG, TD3, and maximum entropy RL
  • Human-level control through deep reinforcement learning pdf (Mnih et al. Nature 2015)
  • Rainbow: Combining Improvements in Deep Reinforcement Learning pdf (Hessel et al. AAAI 2018)
  • Distributional Perspective on Reinforcement Learning pdf (Bellemare et al. ICML 2017)
  • Comparative Analysis of Distributional Reinforcement Learning pdf (Lyle et al. AAAI 2019)
  • OpenAI blog post on TRPO.
  • Trust Region Policy Optimization pdf (Schulman et al. ICML 2015)
  • OpenAI blog post on PPO.
  • Proximal Policy Optimization pdf (Schulman et al. 2017)
  • OpenAI blog post on DDPG.
  • Deterministic Policy Gradient Algorithms (DPG) pdf (Silver et al. ICML 2014)
  • Continuous Control with Deep Reinforcement Learning (DDPG) pdf (Lillicrap et al. ICLR 2016)
  • OpenAI blog post on TD3.
  • Twin Delayed Deep Deterministic Policy Gradients (TD3) pdf (Fujimoto et al. ICML 2018). (The DQN, Double DQN, and TD3 bootstrap targets are compared in a sketch at the end of this list.)
  • OpenAI blog post on SAC.
  • Soft Actor Critic pdf (Haarnoja et al. ICML 2018)
  • Soft Actor Critic Applications pdf (Haarnoja et al. 2018)
  • Learning to Walk via Deep RL pdf (Haarnoja et al. RSS 2019)
  • MAML pdf (Finn et al. ICML 2017)
  • Meta-Learning and Universality pdf (Finn and Levine ICLR 2018).
  • Local Nonparametric Meta-Learning pdf
  • One Shot Learning of Multi-Step Tasks from Observation via Activity Localization in Auxiliary Video pdf (Goo and Niekum ICRA 2019)
  • Natural Language for Reward Shaping pdf (Goyal et al. IJCAI 2019)
  • Teacher Gaze Patterns pdf (Saran et al. CoRL 2019)
  • A perspective paper by Bender and Koller on Meaning and Form related to ML models as symbol processors (ACL 2020).
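  • Several of the deep RL methods above differ mainly in how the bootstrap target is formed. A sketch comparing the DQN, Double DQN, and TD3-style clipped targets on plain Python numbers (the Q-values below are made up; this illustrates the formulas, not any paper's implementation):

    def dqn_target(r, gamma, q_next):
        """DQN: bootstrap from the max of the target network's values."""
        return r + gamma * max(q_next)

    def double_dqn_target(r, gamma, q_next_online, q_next_target):
        """Double DQN: online net picks the action, target net evaluates it."""
        a = max(range(len(q_next_online)), key=q_next_online.__getitem__)
        return r + gamma * q_next_target[a]

    def td3_target(r, gamma, q1_next, q2_next):
        """TD3: take the minimum of two critics to curb overestimation."""
        return r + gamma * min(q1_next, q2_next)

    q_online, q_target = [1.0, 2.5, 2.0], [1.2, 1.8, 2.2]
    print(dqn_target(0.0, 0.99, q_target))                    # 0.99 * 2.2
    print(double_dqn_target(0.0, 0.99, q_online, q_target))   # 0.99 * 1.8
    print(td3_target(0.0, 0.99, 2.0, 1.5))                    # 0.99 * 1.5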

Week 11: Abstraction: Options and Hierarchy

  • Slides from Tuesday: State Abstractions
  • Slides from Thursday: Overview (see below for the ones on the utility of state/temporal abstractions and on learning policy irrelevance; an SMDP Q-learning sketch appears at the end of this list).
  • Some slides on successor features. And some on transfer learning.
  • Tom Dietterich's classic paper on MAXQ: pdf.
  • The Option-Critic Architecture by Bacon, Harb, and Precup.
  • The Journal version of the MaxQ paper
  • Konidaris and Barto's paper on skill chaining: pdf.
  • Vezhnevets et al.'s paper on feudal networks: pdf.
  • Hausman et al.'s paper on learning skill embeddings: pdf.
  • Abel et al.'s paper on value-preserving abstractions: pdf.
  • Improved Automatic Discovery of Subgoals for Options in Hierarchical Reinforcement Learning by Kretchmar et al.
  • Nick Jong and Todd Hester's paper on the utility of temporal abstraction. The slides.
  • Lihong Li, Thomas J. Walsh, and Michael L. Littman, Towards a Unified Theory of State Abstraction for MDPs, Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.
  • Tom Dietterich's tutorial on abstraction.
  • Nick Jong's paper on state abstraction discovery. The slides.
  • Nick Jong's annotated slides
  • A follow-up paper on eliminating irrelevant variables within a subtask: State Abstraction in MAXQ Hierarchical Reinforcement Learning
  • Automatic Discovery and Transfer of MAXQ Hierarchies (from Dietterich's group - 2008)
  • Hierarchical Model-Based Reinforcement Learning: Rmax + MAXQ. Proceedings of the 25th International Conference on Machine Learning, 2008. Nicholas K. Jong and Peter Stone
  • Ruohan Zhang's 2013 slides on forms of hierarchy.
  • Sasha Sherstov's 2004 slides on option discovery.
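  • The SMDP view that underlies options treats a temporally extended behavior as a single decision with a discounted duration. A hedged sketch of the corresponding Q-learning backup (the function and toy numbers are mine, not from any paper above):

    def smdp_q_update(Q, s, option, R_disc, k, s2, options,
                      alpha=0.1, gamma=0.95):
        """SMDP Q-learning backup after an option ran for k steps:
        R_disc is the discounted return accumulated during the option."""
        target = R_disc + (gamma ** k) * max(Q[(s2, o)] for o in options)
        Q[(s, option)] += alpha * (target - Q[(s, option)])

    # Toy usage: option "go" ran 3 steps from state 0 to state 1.
    options = ["stay", "go"]
    Q = {(s, o): 0.0 for s in (0, 1) for o in options}
    smdp_q_update(Q, s=0, option="go", R_disc=0.5, k=3, s2=1, options=options)
    print(Q[(0, "go")])  # 0.05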

Week 12: Exploration and Intrinsic Motivation

  • Slides from Tuesday: Exploration and Intrinsic Motivation I
  • Slides from Thursday: Exploration and Intrinsic Motivation II
  • VIME: Variational Information Maximizing Exploration Houthooft et al.
  • R-Max - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning Ronen Brafman and Moshe Tennenholtz The Journal of Machine Learning Research (JMLR) 2002
  • Efficient Structure Learning in Factored-state MDPs Alexander L. Strehl, Carlos Diuk, and Michael L. Littman AAAI'2007
  • The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning Carlos Diuk, Lihong Li, and Bethany R. Leffler ICML 2009
  • Slides and video for the k-meteorologists paper
  • An Analysis of Model-Based Interval Estimation for Markov Decision Processes Alexander L. Strehl and Michael L. Littman MLJ 2008. (A count-based exploration bonus sketch appears at the end of this list.)
  • Model-Based Exploration in Continuous State Spaces Nicholas K. Jong and Peter Stone The Seventh Symposium on Abstraction, Reformulation, and Approximation, July 2007.
  • TEXPLORE: Real-Time Sample-Efficient Reinforcement Learning for Robots. Todd Hester and Peter Stone Machine Learning 2012
  • Near-Optimal Reinforcement Learning in Polynomial Time Michael Kearns and Satinder Singh
  • Strehl et al.: PAC Model-Free Reinforcement Learning.
  • Safe Exploration in Markov Decision Processes Moldovan and Abbeel, ICML 2012 (safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state)
  • Intrinsically motivated reinforcement learning (Singh et al.)
  • Go-Explore (Ecoffet et al.)
  • Evolved intrinsic rewards for efficient exploration (Niekum et al.)
  • Curiosity-based exploration for multi-step tasks (Colas et al.)
  • Intrinsically motivated model learning for developing curious robots (Hester et al.)
  • Interactions Between Learning and Evolution (Ackley and Littman)
  • Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning (Boehmer et al.)
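  • Many of the provably efficient algorithms above (R-Max, MBIE) come down to acting greedily with respect to an optimism bonus. A count-based sketch in the spirit of MBIE-EB's bonus term (the coefficient and toy numbers are arbitrary):

    import math
    from collections import defaultdict

    def optimistic_value(q, count, beta=1.0):
        """MBIE-EB-style optimism: add a bonus that shrinks as 1/sqrt(n),
        so rarely tried actions look attractive until they are explored."""
        return q + beta / math.sqrt(max(count, 1))

    counts = defaultdict(int)
    Q = defaultdict(float)
    state = 0
    actions = [0, 1, 2]
    counts[(state, 0)] = 100        # action 0 is well explored...
    Q[(state, 0)] = 0.6             # ...and decent
    a = max(actions, key=lambda act: optimistic_value(Q[(state, act)],
                                                      counts[(state, act)]))
    print(a)  # picks an untried action (bonus 1.0 beats 0.6 + 0.1)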

Week 13: Learning from Human Input

  • Slides from Tuesday: Overview
  • TAMER slides
  • Deep TAMER slides
  • EMPATHIC slides
  • Imitation from Observation (IfO) slides
  • Tuesday slides on imitation learning and IRL
  • Slides from Thursday: Brad Knox's slides
  • Some slides on inverse RL from Pieter Abbeel.
  • APPL: Adaptive Planner Parameter Learning; and some slides about it
  • Maximum Entropy IRL: pdf.
  • Generative Adversarial Imitation Learning: pdf.
  • Behavioral Cloning from Observations: pdf. (A bare-bones behavioral cloning sketch appears at the end of this list.)
  • Dataset Aggregation (DAgger): pdf.
  • IRL with rankings: pdf.
  • Niekum et al. Learning to assemble IKEA from demonstrations: pdf.
  • Knox et al. Understanding Human Teaching Modalities in Reinforcement Learning Environments: A Preliminary Report: pdf.
  • Kaochar et al. Towards Understanding How Humans Teach Robots: pdf.
  • Sequential TAMER+RL and other follow-up papers by Brad Knox.
  • The deep TAMER paper.
  • The EMPATHIC paper.
  • The BCO paper.
  • The GAIfO paper.
  • Towards Resolving Unidentifiability in Inverse Reinforcement Learning. Kareem Amin and Satinder Singh
  • Nonlinear Inverse Reinforcement Learning with Gaussian Processes Sergey Levine, Zoran Popovic, Vladlen Koltun.
  • Inverse Reinforcement Learning in Partially Observable Environments Jaedeug Choi and Kee-Eung Kim
  • Some papers on IRL and learning by demonstration:
  • Deep Apprenticeship Learning for Playing Video Games
  • Maximum Entropy Deep Inverse Reinforcement Learning
  • Generative Adversarial Imitation Learning
  • Generative Adversarial Imitation from Observation
  • Creating Advice-Taking Reinforcement Learners. Richard Maclin and Jude Shavlik. Machine Learning, 22, pp. 251-281, 1996.
  • Gregory Kuhlmann, Peter Stone, Raymond Mooney, and Jude Shavlik: Guiding a Reinforcement Learner with Natural Language Advice: Initial Results in RoboCup Soccer.
  • Sonia Chernova and Manuela Veloso: Confidence-Based Policy Learning from Demonstration Using Gaussian Mixture Models.
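  • Behavioral cloning, the baseline that several of the papers above build on or compare against, is just supervised learning on demonstrated state-action pairs. A deliberately tiny sketch with a nearest-neighbor "policy" (everything here is invented for illustration):

    def clone_policy(demos):
        """Behavioral cloning with a 1-nearest-neighbor 'model':
        act as the demonstrator did in the most similar observed state."""
        def policy(state):
            _, action = min(demos, key=lambda sa: abs(sa[0] - state))
            return action
        return policy

    # Demonstrations: (1-d state, action) pairs from some expert.
    demos = [(0.0, "left"), (0.4, "left"), (0.6, "right"), (1.0, "right")]
    pi = clone_policy(demos)
    print(pi(0.1), pi(0.9))  # left right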

Week 14: Multiagent RL and Reproducibility

  • Slides from Tuesday's Lecture:
  • Amy's slides.
  • Joelle's slides on reproducible, reusable, and robust reinforcement learning.
  • Slides from Thursday:
  • overview slides.
  • Stochastic games slides from Michael Bowling: pdf
  • The ones on grid games 3, also by Bowling: pdf
  • The ones on CTDE and QMIX (pdf)
  • Michael Bowling and Manuela Veloso
    Rational and Convergent Learning in Stochastic Games
    IJCAI 2001.
  • Journal version of WoLF
  • D.S. Brown and S. Niekum.
    Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
    AAAI Conference on Artificial Intelligence, February 2018.
  • Doran Chakraborty and Peter Stone: Convergence, Targeted Optimality and Safety in Multiagent Learning (CMLeS). ICML 2010. journal version
  • A CMLeS-like algorithm that can be applied
  • A paper on threats
  • Busoniu, L., Babuska, R., and De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 156-172, 2008.
  • Michael Littman
    Markov Games as a Framework for Multi-Agent Reinforcement Learning
    ICML, 1994.
  • Michael Bowling Convergence and No-Regret in Multiagent Learning NeurIPS 2004
  • Kok, J.R. and Vlassis, N., Collaborative multiagent reinforcement learning by payoff propagation, The Journal of Machine Learning Research, 7, 1789-1828, 2006.
  • A brief survey on Multiagent Learning by Doran Chakraborty.
  • gametheory.net
  • Some useful slides (part C) from Michael Bowling on game theory, stochastic games, correlated equilibria; and (part D) from Michael Littman with more on stochastic games. (A fictitious play sketch appears at the end of this list.)
  • Scaling up to bigger games with empirical game theory
  • Rob Powers and Yoav Shoham: New Criteria and a New Algorithm for Learning in Multi-Agent Systems. NeurIPS 2004. journal version
  • A suite of game generators called GAMUT from Stanford.
  • RoShamBo (rock-paper-scissors) contest
  • U. of Alberta page on automated poker.
  • A paper introducing ad hoc teamwork
  • An article addressing ad hoc teamwork, applied in both predator/prey and RoboCup soccer.
  • Ad hoc teamwork as flocking
  • Agarwal et al., NeurIPS 2021
    Deep Reinforcement Learning at the Edge of the Statistical Precipice
  • High confidence policy improvement (Thomas et al.): pdf
  • Safe reinforcement learning via shielding (Alshiekh et al.): pdf
  • Bootstrapping with models: confidence intervals for off-policy evaluation (Hanna et al.): pdf
  • QMIX Paper by Rashid et al. 2018. ICML 2018.
  • Do ImageNet Classifiers Generalize to ImageNet? pdf (Recht et al. 2019)
  • Perspective paper on Over/Underclaiming in NLP from Sam Bowman
  • RL Benchmarks:
  • MuJoCo Physics benchmark: Website; Github repo.
  • Atari 2600 Arcade benchmark: Arcade Learning Environment pdf (Bellemare et al. 2013)
  • B-Pref: Benchmarking Preference-Based Reinforcement Learning pdf (Lee et al. 2021)
  • Behaviour Suite for Reinforcement Learning: B-suite pdf (Osband et al. ICLR 2020)
  • Procgen benchmark from OpenAI: website.
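  • For the game theory material above, fictitious play on rock-paper-scissors is a compact illustration of learning a mixed equilibrium in a zero-sum game (a sketch; not from any of the papers listed):

    def fictitious_play(rounds=20000):
        """Both players best-respond to the opponent's empirical mixture;
        in rock-paper-scissors the mixtures approach (1/3, 1/3, 1/3)."""
        # Skew-symmetric payoff for rock=0, paper=1, scissors=2; by symmetry
        # both players can score actions with the same matrix.
        payoff = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
        counts = [[1, 1, 1], [1, 1, 1]]      # action counts per player
        for _ in range(rounds):
            for p in (0, 1):
                opp = counts[1 - p]
                ev = [sum(payoff[a][b] * opp[b] for b in range(3))
                      for a in range(3)]     # payoff vs. empirical mixture
                counts[p][max(range(3), key=ev.__getitem__)] += 1
        total = sum(counts[0])
        return [c / total for c in counts[0]]

    print(fictitious_play())  # roughly [0.33, 0.33, 0.33]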


    Page maintained by Peter Stone
    Questions? Send me mail