Peter Stone's Selected Publications



D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning

D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning.
Caroline Wang, Garrett Warnell, and Peter Stone.
In Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2023.

Download

[PDF] (1.6MB)  [slides.pdf] (2.4MB)  [poster.pdf] (1.4MB)

Abstract

While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.
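
To make the mechanism concrete: the standard way to add demonstration guidance while "retaining the ability to find the optimal policy", as the abstract claims, is potential-based reward shaping (Ng et al., 1999), which replaces the task reward r with r + gamma * phi(s') - phi(s) and provably leaves the set of optimal policies unchanged. The Python sketch below illustrates only that shaping form with a demonstration state used as a goal; it is a minimal sketch under stated assumptions, not the paper's implementation. The coordinate-valued states, the distance-based potential, and all function names are illustrative, and whether this matches D-Shape's exact formulation is an assumption here.

import numpy as np

def potential(state, goal):
    """Illustrative potential: negative Euclidean distance from the current
    state to a goal drawn from the demonstration (assumes states and goals
    are coordinate tuples, e.g., gridworld positions)."""
    return -np.linalg.norm(np.asarray(state, dtype=float) - np.asarray(goal, dtype=float))

def shaped_reward(task_reward, state, next_state, goal, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * phi(s', g) - phi(s, g).
    Shaping of this form does not change which policies are optimal for
    the underlying task reward, so a (possibly suboptimal) demonstration
    can guide exploration without overriding return maximization."""
    return task_reward + gamma * potential(next_state, goal) - potential(state, goal)

# Hypothetical usage: take a demonstrator-visited state as the goal and
# shape the sparse task reward on a single transition.
demo_goal = (3, 4)
r = shaped_reward(task_reward=0.0, state=(0, 0), next_state=(1, 0), goal=demo_goal)
print(r)  # positive here, since the step moved closer to the demo goal

In the full method the policy would also receive the goal as input (goal-conditioned RL), with goals relabeled from the demonstration during training; those pieces are beyond this sketch.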

BibTeX Entry

@InProceedings{aamas23-wang,
  author = {Caroline Wang and Garrett Warnell and Peter Stone},
  title = {D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning},
  booktitle = {Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
  location = {London, UK},
  month = {May},
  year = {2023},
  abstract = {
    While combining imitation learning (IL) and reinforcement learning (RL) 
    is a promising way to address poor sample efficiency in autonomous behavior
    acquisition, methods that do so typically assume that the requisite
    behavior demonstrations are provided by an expert that behaves optimally
    with respect to a task reward. If, however, suboptimal demonstrations are
    provided, a fundamental challenge appears in that the demonstration-matching
    objective of IL conflicts with the return-maximization objective
    of RL. This paper introduces D-Shape, a new method for combining IL and RL
    that uses ideas from reward shaping and goal-conditioned RL to resolve the
    above conflict. D-Shape allows learning from suboptimal demonstrations
    while retaining the ability to find the optimal policy with respect to the
    task reward. We experimentally validate D-Shape in sparse-reward gridworld
    domains, showing that it both improves over RL in terms of sample
    efficiency and converges consistently to the optimal policy in the presence
    of suboptimal demonstrations.
  },
}

Generated by bib2html.pl (written by Patrick Riley) on Tue Nov 19, 2024 10:24:41