[The Interactive Shaping Problem]
Within a sequential decision-making task, an agent receives a sequence of state descriptions (s₁, s₂, ... where sᵢ ∈ S) and action opportunities (choosing aᵢ ∈ A at each sᵢ). From a human trainer who observes the agent and understands a predefined performance metric, the agent also receives occasional positive and negative scalar reinforcement signals (h₁, h₂, ...) that are positively correlated with the trainer's assessment of recent state-action pairs. How can an agent learn the best possible task policy (π : S → A), as measured by the performance metric, given the information contained in the input?
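To make the interaction protocol concrete, the following is a minimal sketch of the shaping loop in Python. The callables env_step, choose_action, get_human_feedback, and update_agent are hypothetical placeholders, not part of any stated framework, and for simplicity the sketch credits each feedback signal to only the immediately preceding state-action pair, whereas the problem statement allows feedback to reflect recent pairs more generally.

```python
from typing import Callable, Hashable, Optional

# Hypothetical type aliases for states and actions (S and A in the text).
State = Hashable
Action = Hashable

def shaping_interaction_loop(
    env_step: Callable[[Action], State],        # advances the task; returns the next state
    initial_state: State,
    choose_action: Callable[[State], Action],   # the agent's current policy pi : S -> A
    get_human_feedback: Callable[[], Optional[float]],  # scalar h, or None when the trainer is silent
    update_agent: Callable[[State, Action, float], None],
    num_steps: int = 1000,
) -> None:
    """Run the interactive shaping loop: the agent acts, and the trainer's
    occasional scalar signals h evaluate recent state-action pairs."""
    s = initial_state
    for _ in range(num_steps):
        a = choose_action(s)            # choose a_i at s_i
        s_next = env_step(a)
        h = get_human_feedback()        # occasional signal; None when absent
        if h is not None:
            # Simplification: assign the signal to the most recent pair only.
            update_agent(s, a, h)
        s = s_next
```

A complete shaping algorithm would also need to handle the delay between an action and the trainer's response; the sketch sidesteps this credit-assignment issue entirely.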
Though our broad goal is to create agents that can be shaped to perform any task, we restrict the Shaping Problem to tasks with predefined performance metrics so that the quality of shaping algorithms can be evaluated.
The TAMER Framework is our approach to the Shaping Problem.