Peter Stone's Selected Publications



Models of human preference for learning reward functions

Models of human preference for learning reward functions.
W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, and Alessandro Allievi.
Transactions on Machine Learning Research (TMLR), 2023.

Download

[PDF] (6.7MB)  [slides.pdf] (13.4MB)

Abstract

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
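
To make the contrast between the two preference models concrete, the sketch below (not the paper's released code) compares a logistic preference model driven by partial return with one driven by regret. The segment rewards, the optimal-value estimates, and the simplified regret formula are illustrative assumptions, not the paper's exact definitions.

    import numpy as np

    def partial_return(rewards):
        """Partial return of a segment: the sum of its rewards."""
        return np.sum(rewards)

    def regret(rewards, values):
        """Illustrative regret of a segment (simplified stand-in).

        `values` are assumed optimal state values V*(s_0), ..., V*(s_T) for the
        segment's states; regret here measures how far the segment falls short
        of optimal decision-making from its starting state.
        """
        return values[0] - (np.sum(rewards) + values[-1])

    def preference_prob(stat1, stat2):
        """Bradley-Terry / logistic preference: P(segment 1 preferred over segment 2)."""
        return 1.0 / (1.0 + np.exp(-(stat1 - stat2)))

    # Two hypothetical trajectory segments: per-step rewards and optimal values
    # for the visited states (one more value than rewards, for the final state).
    seg1_rewards, seg1_values = np.array([1.0, 1.0, 0.0]), np.array([5.0, 3.0, 1.0, 1.0])
    seg2_rewards, seg2_values = np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0, 0.0])

    # Partial-return model: preference driven only by summed reward.
    p_partial = preference_prob(partial_return(seg1_rewards),
                                partial_return(seg2_rewards))

    # Regret model: preference driven by closeness to optimal behavior
    # (lower regret is better, hence the negation).
    p_regret = preference_prob(-regret(seg1_rewards, seg1_values),
                               -regret(seg2_rewards, seg2_values))

    print(f"P(seg1 > seg2) under partial return: {p_partial:.3f}")
    print(f"P(seg1 > seg2) under regret:         {p_regret:.3f}")

In this toy example the partial-return model prefers segment 1 (it collects more reward), while the regret model prefers segment 2 (its behavior is optimal for the low-value region it starts in), illustrating how the two models can disagree about the same pair of segments.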

BibTeX Entry

@Article{brad_knox_TMLR2023,
  author   = {W. Bradley Knox and Stephane Hatgis-Kessell and Serena Booth and Scott Niekum and Peter Stone and Alessandro Allievi},
  title    = {Models of human preference for learning reward functions},
  journal  = {Transactions on Machine Learning Research (TMLR)},
  year     = {2023},
  abstract = {The utility of reinforcement learning is limited by the alignment of reward
functions with the interests of human stakeholders. One promising method for
alignment is to learn the reward function from human-generated preferences
between pairs of trajectory segments, a type of reinforcement learning from human
feedback (RLHF). These human preferences are typically assumed to be informed
solely by partial return, the sum of rewards along each segment. We find this
assumption to be flawed and propose modeling human preferences instead as
informed by each segment's regret, a measure of a segment's deviation from
optimal decision-making. Given infinitely many preferences generated according to
regret, we prove that we can identify a reward function equivalent to the reward
function that generated those preferences, and we prove that the previous partial
return model lacks this identifiability property in multiple contexts. We
empirically show that our proposed regret preference model outperforms the
partial return preference model with finite training data in otherwise the same
setting. Additionally, we find that our proposed regret preference model better
predicts real human preferences and also learns reward functions from these
preferences that lead to policies that are better human-aligned. Overall, this
work establishes that the choice of preference model is impactful, and our
proposed regret preference model provides an improvement upon a core assumption
of recent research. We have open sourced our experimental code, the human
preferences dataset we gathered, and our training and preference elicitation
interfaces for gathering such a dataset.
  },
}
