Peter Stone's Selected Publications



Importance Sampling in Reinforcement Learning with an Estimated Behavior Policy

Importance Sampling in Reinforcement Learning with an Estimated Behavior Policy.
Josiah P. Hanna, Scott Niekum, and Peter Stone.
Machine Learning (MLJ), 110:1267–1317, May 2021.

Download

[PDF] 3.7MB

Abstract

In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy action probabilities are replaced by their maximum likelihood estimates under the observed data. We show that this general technique reduces variance due to sampling error in Monte Carlo-style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the RL literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.
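
The sketch below is a minimal toy illustration (not the paper's estimators or code) of the core idea described in the abstract: weighting returns by the ratio of target to behavior action probabilities, and then replacing the true behavior probabilities with their maximum likelihood estimates (empirical action frequencies) computed from the observed data. All names, the bandit setup, and the policies are illustrative assumptions.

    # Toy sketch: ordinary importance sampling vs. importance sampling with an
    # estimated (MLE) behavior policy, on a 3-armed bandit. Illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)

    n_actions = 3
    behavior_probs = np.array([0.5, 0.3, 0.2])   # pi_b: data-producing behavior policy
    target_probs   = np.array([0.1, 0.2, 0.7])   # pi_e: target policy to evaluate
    true_rewards   = np.array([1.0, 2.0, 3.0])   # deterministic reward per action

    # Collect a batch of data by acting with the behavior policy.
    n = 200
    actions = rng.choice(n_actions, size=n, p=behavior_probs)
    rewards = true_rewards[actions]

    # Ordinary importance sampling: likelihood ratio uses the true behavior
    # policy action probabilities.
    ois = np.mean(target_probs[actions] / behavior_probs[actions] * rewards)

    # Estimated-behavior-policy importance sampling: replace pi_b(a) with its
    # maximum likelihood estimate under the observed data, i.e. the empirical
    # frequency of each action in the batch.
    counts = np.bincount(actions, minlength=n_actions)
    mle_behavior = counts / n
    eis = np.mean(target_probs[actions] / mle_behavior[actions] * rewards)

    true_value = float(target_probs @ true_rewards)
    print(f"true value under target policy: {true_value:.3f}")
    print(f"ordinary IS estimate:           {ois:.3f}")
    print(f"estimated-behavior IS estimate: {eis:.3f}")

Running this repeatedly with different seeds shows the effect the abstract refers to: both estimators are centered on the true value, but the estimate that uses the empirical action frequencies corrects for sampling error in the batch and tends to vary less across batches than the ordinary estimator.
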

BibTeX Entry

@article{MLJ21-Hanna,
  author   = {Josiah P. Hanna and Scott Niekum and Peter Stone},
  title    = {Importance Sampling in Reinforcement Learning with an Estimated Behavior Policy},
  journal  = {Machine Learning (MLJ)},
  year     = {2021},
  volume   = {110},
  number   = {6},
  month    = {May},
  pages    = {1267--1317},
  abstract = {In reinforcement learning, importance sampling is a widely
              used method for evaluating an expectation under the
              distribution of data of one policy when the data has in
              fact been generated by a different policy. Importance
              sampling requires computing the likelihood ratio between
              the action probabilities of a target policy and those of
              the data-producing behavior policy. In this article, we
              study importance sampling where the behavior policy action
              probabilities are replaced by their maximum likelihood
              estimates under the observed data. We show that this
              general technique reduces variance due to sampling error
              in Monte Carlo-style estimators. We introduce two novel
              estimators that use this technique to estimate expected
              values that arise in the RL literature. We find that these
              general estimators reduce the variance of Monte Carlo
              sampling methods, leading to faster learning for policy
              gradient algorithms and more accurate off-policy policy
              evaluation. We also provide theoretical analysis showing
              that our new estimators are consistent and have
              asymptotically lower variance than Monte Carlo estimators.},
}

Generated by bib2html.pl (written by Patrick Riley) on Fri Jul 12, 2024 23:00:31