Models of human preference for learning reward functions.
W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum, Peter Stone, and Alessandro Allievi.
Transactions on Machine Learning Research (TMLR), 2023.
The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
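To make the contrast between the two preference models concrete, below is a minimal sketch (not the authors' released code) of how synthetic preference probabilities could be computed under each model. It assumes a standard Bradley-Terry/Boltzmann form over a scalar statistic per segment, takes partial return as the (discounted) sum of a segment's rewards, and approximates a segment's regret as the optimal value at its start state minus its discounted return plus the discounted optimal value at its end state; the toy rewards, the assumed optimal state values, and this simplified regret formula are illustrative assumptions, and the paper's exact formulation may differ.

import math

def partial_return(rewards, gamma=1.0):
    """Partial return: the (discounted) sum of rewards along a segment."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def segment_regret(rewards, v_start, v_end, gamma=1.0):
    """Approximate segment regret: how far the segment falls short of acting
    optimally from its start state. v_start and v_end are assumed optimal
    state values V* at the segment's first and last states (illustrative)."""
    horizon = len(rewards)
    return v_start - (partial_return(rewards, gamma) + (gamma ** horizon) * v_end)

def preference_prob(stat_1, stat_2):
    """Bradley-Terry / Boltzmann probability that segment 1 is preferred,
    given a scalar statistic (higher = better) for each segment."""
    return 1.0 / (1.0 + math.exp(-(stat_1 - stat_2)))

# Toy example: two 3-step segments with hypothetical rewards and V* values.
seg1_rewards, seg2_rewards = [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]
v1_start, v1_end = 2.0, 0.0   # assumed optimal values at segment 1's endpoints
v2_start, v2_end = 5.0, 0.0   # assumed optimal values at segment 2's endpoints

# Partial return model: preference depends only on summed reward.
p_partial = preference_prob(partial_return(seg1_rewards),
                            partial_return(seg2_rewards))

# Regret model: preference depends on (negated) deviation from optimal behavior.
p_regret = preference_prob(-segment_regret(seg1_rewards, v1_start, v1_end),
                           -segment_regret(seg2_rewards, v2_start, v2_end))

print(f"P(seg1 preferred | partial return): {p_partial:.3f}")
print(f"P(seg1 preferred | regret):         {p_regret:.3f}")

With these hypothetical numbers, the partial return model favors segment 2 (it collects more reward), while the regret model favors segment 1 (segment 2 forgoes even more of the value available from its start state), illustrating how the two models can disagree on the same pair of segments.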
@Article{brad_knox_TMLR2023,
  author   = {W. Bradley Knox and Stephane Hatgis-Kessell and Serena Booth and Scott Niekum and Peter Stone and Alessandro Allievi},
  title    = {Models of human preference for learning reward functions},
  journal  = {Transactions on Machine Learning Research (TMLR)},
  year     = {2023},
  abstract = {The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment's regret, a measure of a segment's deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.},
}