Publications
Bold titles indicate rigorously peer-reviewed conference papers.
Red titles indicate journal articles.
RIDM: Reinforced Inverse Dynamics Modeling for Learning From a Single Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Imitation, Intent, and Interaction (I3) Workshop at ICML 2019. June 2019.
Abstract
BibTeX
Download:
[pdf] (472.9 KB)
Imitation learning has long been an approach to alleviate the tractability
issues that arise in reinforcement learning. However, most literature makes
several assumptions such as access to the expert's actions, availability of
many expert demonstrations, and injection of task-specific domain knowledge
into the learning process. We propose reinforced inverse dynamics modeling
(RIDM), a method of combining reinforcement learning and imitation from
observation (IfO) to perform imitation using a single expert demonstration,
with no access to the expert's actions, and with little task-specific domain
knowledge. Given only a single set of the expert's raw states, such as joint
angles in a robot control task, at each time-step, we learn an inverse
dynamics model to produce the necessary low-level actions, such as torques,
to transition from one state to the next such that the reward from the
environment is maximized. We demonstrate that, when the same constraints
are applied to all methods, RIDM outperforms other techniques on six domains
of the MuJoCo simulator and on two robot soccer tasks, for two experts from
the RoboCup 3D simulation league, in the SimSpark simulator.
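As an illustration of the control loop the abstract describes, here is a
minimal, hypothetical sketch: an inverse dynamics model maps the current
state and the next expert state to a low-level action, and its parameters
are tuned only from the environment's reward. A simple random-search
optimizer stands in for the paper's actual RL procedure, and `env` is
assumed to be a Gym-style environment with `expert_states` a single expert
state trajectory; none of these names come from the paper.

import numpy as np

def inverse_dynamics(phi, s, s_target):
    # Hypothetical linear model: action from the concatenated (state, target) features.
    x = np.concatenate([s, s_target])
    return np.tanh(phi @ x)   # squash to a bounded action range

def ridm_rollout(env, expert_states, phi):
    """Roll out one episode: at each step, the inverse dynamics model f_phi
    maps (current state, next expert state) to a low-level action."""
    s = env.reset()
    total_reward = 0.0
    for s_next_target in expert_states[1:]:
        a = inverse_dynamics(phi, s, s_next_target)   # f_phi(s_t, s^e_{t+1})
        s, r, done, _ = env.step(a)
        total_reward += r
        if done:
            break
    return total_reward

def train_ridm(env, expert_states, action_dim, n_iters=200, sigma=0.1, seed=0):
    """Optimize the inverse dynamics parameters phi to maximize environment
    return (random search here is a stand-in for the paper's RL optimizer)."""
    rng = np.random.default_rng(seed)
    feat_dim = 2 * len(expert_states[0])
    phi = rng.normal(scale=0.01, size=(action_dim, feat_dim))
    best_return = ridm_rollout(env, expert_states, phi)
    for _ in range(n_iters):
        candidate = phi + sigma * rng.normal(size=phi.shape)
        ret = ridm_rollout(env, expert_states, candidate)
        if ret > best_return:
            phi, best_return = candidate, ret
    return phi, best_return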
Importance Sampling Policy Evaluation with an Estimated Behavior Policy.
Josiah P. Hanna, Scott Niekum, Peter Stone.
Proceedings of the 36th International Conference on Machine Learning (ICML). June 2019.
Abstract
BibTeX
Download:
[pdf] (2.0 MB)
[slides (pdf)] (4.2 MB)
We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
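To make the estimator concrete, below is a minimal sketch of importance
sampling with a count-based estimated behavior policy. It assumes discrete,
hashable states and actions, trajectories stored as (state, action, reward)
tuples, and a caller-supplied pi_e giving the evaluation policy's action
probabilities; these names and structures are illustrative, not the paper's
code.

import numpy as np
from collections import defaultdict

def regression_is_estimate(trajectories, pi_e, gamma=1.0):
    """Ordinary importance sampling, but with the behavior policy estimated
    (by empirical action frequencies) from the same data it re-weights."""
    # Count-based maximum-likelihood estimate of the behavior policy.
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, a, _ in traj:
            counts[s][a] += 1

    def pi_b_hat(a, s):
        return counts[s][a] / sum(counts[s].values())

    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b_hat(a, s)   # estimated behavior policy in the denominator
            ret += gamma ** t * r
        estimates.append(weight * ret)
    return np.mean(estimates)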
Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator.
Josiah P. Hanna, Peter Stone.
Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2019.
This paper contains material that was previously presented at the 2018 NeurIPS Deep Reinforcement Learning Workshop.
Abstract
BibTeX
Download:
[pdf] (1.6 MB)
[slides (pdf)] (3.3 MB)
This paper studies a class of reinforcement learning algorithms known as policy gradient methods. Policy gradient methods optimize the performance of a policy by estimating the gradient of the expected return with respect to the policy parameters. One of the core challenges of applying policy gradient methods is obtaining an accurate estimate of this gradient. Most policy gradient methods rely on Monte Carlo sampling to estimate this gradient. When only a limited number of environment steps can be collected, Monte Carlo policy gradient estimates may suffer from sampling error -- samples receive more or less weight than they would in expectation. In this paper, we introduce the Sampling Error Corrected policy gradient estimator that corrects the inaccurate Monte Carlo weights. Our approach treats the observed data as if it were generated by a different policy than the policy that actually generated the data. It then uses importance sampling between the two -- in the process correcting the inaccurate Monte Carlo weights. Under a limiting set of assumptions, we show that this gradient estimator has lower variance than the Monte Carlo gradient estimator. We show experimentally that our approach improves the learning speed of two policy gradient methods compared to standard Monte Carlo sampling, even when the theoretical assumptions fail to hold.
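A rough sketch of the correction idea, under simplifying assumptions
(discrete states and actions, per-sample returns already computed, and
user-supplied pi_theta and grad_log_pi callables that are not part of the
paper's code): the empirical policy is estimated from the batch, and each
sample's Monte Carlo weight is replaced by the ratio of the true policy
probability to that empirical probability.

from collections import defaultdict

def sec_policy_gradient(batch, grad_log_pi, pi_theta):
    """Sampling-error-corrected REINFORCE-style gradient (sketch).
    Each sample in `batch` is (s, a, G) with G the return following (s, a)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, _ in batch:
        counts[s][a] += 1

    grad = 0.0
    for s, a, G in batch:
        pi_hat = counts[s][a] / sum(counts[s].values())   # empirical (MLE) policy
        w = pi_theta(a, s) / pi_hat                       # corrects the Monte Carlo weight
        grad = grad + w * grad_log_pi(a, s) * G
    return grad / len(batch)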
Selecting Compliant Agents for Opt-in Microtolling.
Josiah P. Hanna, Guni Sharon, Stephen D. Boyles, Peter Stone.
Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI). January 2019.
Abstract
BibTeX
Download:
[pdf] (2.3 MB)
[poster (pdf)] (1.0 MB)
Towards a Data Efficient Off-policy Policy Gradient.
Josiah P. Hanna, Peter Stone.
AAAI Spring Symposium on Data Efficient Reinforcement Learning. March 2018.
Abstract
BibTeX
Download:
[pdf] (353.6 KB)
The ability to learn from off-policy data -- data generated from past interaction with the environment -- is essential to data-efficient reinforcement learning. Recent work has shown that the use of off-policy data not only allows the re-use of data but can even improve performance in comparison to on-policy reinforcement learning. In this work, we investigate whether a recently proposed method for learning a better data generation policy, commonly called a behavior policy, can also increase the data efficiency of policy gradient reinforcement learning. Empirical results demonstrate that, with an appropriately selected behavior policy, we can estimate the policy gradient more accurately. The results also motivate further work on developing methods for adapting the behavior policy as the policy we are learning changes.
DyETC: Dynamic Electronic Toll Collection for Traffic Congestion Alleviation.
Haipeng Chen, Bo An, Guni Sharon,
Josiah P. Hanna, Peter Stone, Chunyan Miao, Yeng Chai Soh.
Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI). February 2018.
Abstract
BibTeX
Download:
[pdf] (2.6 MB)
To alleviate traffic congestion in urban areas, electronic toll collection
(ETC) systems are deployed all over the world. Despite the merits, tolls
are usually pre-determined and fixed from day to day, which fail to
consider traffic dynamics and thus have limited regulation effect when
traffic conditions are abnormal. In this paper, we propose a novel
dynamic ETC (DyETC) scheme which adjusts tolls to traffic conditions in
real time. The DyETC problem is formulated as a Markov decision process
(MDP), the solution of which is very challenging due to its 1)
multi-dimensional state space, 2) multi-dimensional, continuous and
bounded action space, and 3) time-dependent state and action values. Due
to the complexity of the formulated MDP, existing methods cannot be
applied to our problem. Therefore, we develop a novel algorithm, PG-β,
which makes three improvements to the traditional policy gradient method
by proposing 1) time-dependent value and policy functions, 2) a
Beta-distribution policy function, and 3) state abstraction. Experimental
results show that, compared with existing ETC schemes, DyETC increases
traffic volume by around 8% and reduces travel time by around 14.6%
during rush hour. Considering the total traffic volume in a traffic
network, this contributes to a substantial increase in social welfare.
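One of the three ingredients, the Beta-distribution policy for a bounded,
continuous action space, can be sketched as below. The softplus-linear
parameterization, the class structure, and the parameter names are
illustrative assumptions, not the authors' implementation.

import numpy as np

class BetaPolicy:
    """Beta-distribution policy for a bounded, continuous 1-D action (sketch).
    Actions are sampled from Beta(alpha, beta) on [0, 1] and rescaled to
    [low, high], so every sampled action respects the bounds by construction."""

    def __init__(self, theta, low, high):
        # theta has shape (2, state_dim): one row each for alpha and beta.
        self.theta, self.low, self.high = theta, low, high

    def _params(self, state):
        z = self.theta @ state
        alpha, beta = np.log1p(np.exp(z)) + 1.0   # softplus + 1 keeps the density unimodal
        return alpha, beta

    def sample(self, state, rng):
        alpha, beta = self._params(state)
        u = rng.beta(alpha, beta)                 # draw in [0, 1]
        return self.low + u * (self.high - self.low)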
Network-wide Adaptive Tolling for Connected and Automated Vehicles.
Guni Sharon, Michael W. Levin,
Josiah P. Hanna, Tarun Rambha, Stephen D. Boyles, Peter Stone.
Transportation Research Part C. September 2017.
This article contains material that was previously published in an
AAMAS 2017 paper.
Abstract
BibTeX
Download:
[pdf] (3.0 MB)
This article proposes Delta-tolling, a simple adaptive pricing scheme which only requires travel
time observations and two tuning parameters. These tolls are applied throughout a road
network, and can be updated as frequently as travel time observations are made.
Notably, Delta-tolling does not require any details of the traffic flow or travel demand models
other than travel time observations, rendering it easy to apply in real-time. The flexibility
of this tolling scheme is demonstrated in three specific traffic modeling contexts with varying
traffic flow and user behavior assumptions: a day-to-day pricing model using static
network equilibrium with link delay functions; a within-day adaptive pricing model using
the cell transmission model and dynamic routing of vehicles; and a microsimulation of
reservation-based intersection control for connected and autonomous vehicles with myopic
routing. In all cases, Delta-tolling produces significant benefits over the no-toll case, measured
in terms of average travel time and social welfare, while only requiring two
parameters to be tuned. Some optimality results are also given for the special case of the
static network equilibrium model with BPR-style delay functions.
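A minimal sketch of a Delta-tolling-style update consistent with the
description above: each link's toll tracks beta times its excess delay over
the free-flow travel time, smoothed at rate R. The exact update rule and
the parameter names here are assumptions for illustration; see the paper
for the precise scheme.

def delta_tolling_update(tolls, travel_times, free_flow_times, beta, R):
    """One adaptive toll update per link (sketch).
    beta and R are the scheme's two tuning parameters."""
    new_tolls = {}
    for link, t_obs in travel_times.items():
        delta = max(t_obs - free_flow_times[link], 0.0)   # excess delay on the link
        new_tolls[link] = R * (beta * delta) + (1.0 - R) * tolls.get(link, 0.0)
    return new_tolls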
Data-efficient Policy Evaluation through Behavior Policy Search.
Josiah P. Hanna, Philip Thomas, Peter Stone, Scott Niekum.
Proceedings of the 34th International Conference on Machine Learning (ICML). August 2017.
Abstract
BibTeX
Download:
[pdf] (1.2 MB)
[slides (pdf)] (1.1 MB)
[poster (pdf)] (445.9 KB)
We consider the task of evaluating a policy for a Markov decision process
(MDP). The standard unbiased technique for evaluating a policy is to deploy
the policy and observe its performance. We show that the data collected from
deploying a different policy, commonly called the behavior policy, can be
used to produce unbiased estimates with lower mean squared error than this
standard technique. We derive an analytic expression for the optimal behavior
policy---the behavior policy that minimizes the mean squared error of the
resulting estimates. Because this expression depends on terms that are
unknown in practice, we propose a novel policy evaluation sub-problem,
behavior policy search: searching for a behavior policy that reduces mean
squared error. We present a behavior policy search algorithm and empirically
demonstrate its effectiveness in lowering the mean squared error of policy
performance estimates.
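As a hedged illustration of the behavior policy search sub-problem, the
following sketch computes, from one trajectory sampled by the current
behavior policy, a score-function estimate of the gradient of the
importance sampling estimator's mean squared error with respect to the
behavior policy's parameters. It assumes undiscounted returns and
user-supplied policy callables, and is not the paper's implementation.

def bps_gradient(traj, pi_e, pi_b, grad_log_pi_b):
    """One-trajectory gradient estimate for behavior policy search (sketch).
    `traj` is a list of (s, a, r) tuples sampled from the behavior policy;
    descending this gradient adapts the behavior policy toward
    lower-variance off-policy evaluation."""
    weight, ret, score = 1.0, 0.0, 0.0
    for s, a, r in traj:
        weight *= pi_e(a, s) / pi_b(a, s)
        ret += r
        score = score + grad_log_pi_b(a, s)      # sum of score-function terms
    g = weight * ret                             # the IS return for this trajectory
    return -(g ** 2) * score                     # estimate of d(MSE)/d(theta)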
Fast and Precise Black and White Ball Detection for RoboCup Soccer.
Jacob Menashe, Josh Kelle, Katie Genter,
Josiah P. Hanna, Elad Liebman, Sanmit Narvekar, Ruohan Zhang, Peter Stone.
RoboCup-2017: Robot Soccer World Cup XXI. July 2017.
Abstract
BibTeX
Download:
[pdf] (260.3 KB)
In 2016, UT Austin Villa claimed the Standard Platform League's second place
position at the RoboCup International Robot Soccer Competition in Leipzig,
Germany as well as first place at both the RoboCup US Open in Brunswick, USA
and the World RoboCup Conference in Beijing, China. This paper describes some
of the key contributions that led to the team's victories with a primary
focus on our techniques for identifying and tracking black and white soccer
balls. UT Austin Villa's ball detection system was overhauled in order to
transition from the league's bright orange ball, used every year of the
competition prior to 2016, to the truncated icosahedral pattern commonly
associated with soccer balls.
We evaluated and applied a series of heuristic region-of-interest
identification techniques and supervised machine learning methods to produce a
ball detector capable of reliably detecting the ball's position with no prior
knowledge of the ball's position. In 2016, UT Austin Villa suffered only a
single loss which occurred after regulation time during a penalty kick
shootout. We attribute much of UT Austin Villa's success in 2016 to our robots'
effectiveness at quickly and consistently localizing the ball.
In this work we discuss the specifics of UT Austin Villa's ball detector
implementation which are applicable to the specific problem of ball detection
in RoboCup, as well as to the more general problem of fast and precise object
detection in computationally constrained domains. Furthermore, we provide
empirical analyses of our approach to support the conclusion that modern deep
learning techniques can enhance visual recognition tasks even in the face of
these computational constraints.
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation.
Josiah P. Hanna, Peter Stone, Scott Niekum.
Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2017.
Abstract
BibTeX
Download:
[pdf] (681.2 KB)
[slides (pdf)] (1.4 MB)
[poster (pdf)] (488.0 KB)
For an autonomous agent, executing a poor policy may be costly or even
dangerous. For such agents, it is desirable to determine confidence
interval lower bounds on the performance of any given policy without
executing said policy. Current methods for exact high confidence off-policy
evaluation that use importance sampling require a substantial amount of
data to achieve a tight lower bound. Existing model-based methods only
address the problem in discrete state spaces. Since exact bounds are
intractable for many domains we trade off strict guarantees of safety for
more data-efficient approximate bounds. In this context, we propose two
bootstrapping off-policy evaluation methods which use learned MDP
transition models in order to estimate lower confidence bounds on policy
performance with limited data in both continuous and discrete state spaces.
Since direct use of a model may introduce bias, we derive a theoretical
upper bound on model bias for when the model transition function is
estimated with i.i.d. trajectories. This bound broadens our understanding
of the conditions under which model-based methods have high bias. Finally,
we empirically evaluate our proposed methods and analyze the settings in
which different bootstrapping off-policy confidence interval methods
succeed and fail.
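A percentile-bootstrap sketch of the model-based lower bound idea follows;
`fit_model` and `evaluate_in_model` are hypothetical caller-supplied
routines, and the specific percentile rule is an assumption rather than the
paper's exact procedure.

import numpy as np

def bootstrap_lower_bound(trajectories, fit_model, evaluate_in_model,
                          n_bootstrap=200, confidence=0.95, seed=0):
    """Approximate lower confidence bound on policy performance (sketch).
    Each round resamples trajectories with replacement, fits a transition
    model to the resample, and evaluates the target policy in that model;
    the bound is a percentile of the resulting estimates."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    estimates = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        resample = [trajectories[i] for i in idx]
        model = fit_model(resample)
        estimates.append(evaluate_in_model(model))
    return np.percentile(estimates, 100 * (1 - confidence))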
Grounded Action Transformation for Robot Learning in Simulation.
Josiah P. Hanna, Peter Stone.
Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI). February 2017.
Abstract
BibTeX
Download:
[pdf] (1.4 MB)
[slides (pdf)] (1.4 MB)
[poster (pdf)] (397.8 KB)
Robot learning in simulation is a promising alternative to the prohibitive sample cost of learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied to the physical robot. Grounded simulation learning (GSL) promises to address this issue by altering the simulator to better match the real world. This paper proposes a new algorithm for GSL -- Grounded Action Transformation -- and applies it to learning of humanoid bipedal locomotion. Our approach results in a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. We further evaluate our methodology in controlled experiments using a second, higher-fidelity simulator in place of the real world. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for learning robot control policies.
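The action-transformation step can be summarized in a few lines; the
function below paraphrases the idea as described in the abstract, with
hypothetical `forward_model_real` and `inverse_model_sim` callables
standing in for the learned models.

def grounded_action(s, a, forward_model_real, inverse_model_sim):
    """Grounded action transformation (sketch). The forward model predicts
    the next state the real robot would reach from (s, a); the simulator's
    inverse dynamics model then returns the action that produces that next
    state in simulation, so simulated transitions better match reality."""
    s_next_real = forward_model_real(s, a)        # where the real robot would go
    return inverse_model_sim(s, s_next_real)      # action that gets the simulator there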
UT Austin Villa: RoboCup 2015 3D Simulation League Competition and Technical Challenges Champions.
Patrick MacAlpine,
Josiah P. Hanna, Jason Liang, Peter Stone.
RoboCup-2015: Robot Soccer World Cup XIX. July 2016.
Accompanying videos at
http://www.cs.utexas.edu/~AustinVilla/sim/3dsimulation/#2015
Abstract
BibTeX
Download:
[pdf] (809.2 KB)
The UT Austin Villa team, from the University of Texas at Austin, won the
2015 RoboCup 3D Simulation League, winning all 19 games that the team played.
During the course of the competition the team scored 87 goals and conceded only
1. Additionally, the team won the RoboCup 3D Simulation League technical
challenge by winning each of a series of three league challenges: drop-in
player, kick accuracy, and free challenge. This paper describes the changes and
improvements made to the team between 2014 and 2015 that allowed it to win both
the main competition and each of the league technical challenges.
Minimum Cost Matching for Autonomous Carsharing.
Josiah P. Hanna, Michael Albert, Donna Chen, Peter Stone.
Proceedings of the 9th IFAC Symposium on Intelligent Autonomous Vehicles (IAV 2016). June 2016.
Abstract
BibTeX
Download:
[pdf] (120.3 KB)
[slides (pdf)] (4.9 MB)
Carsharing programs provide an alternative to private vehicle ownership. Combining carsharing programs with autonomous vehicles would improve user access to vehicles, thereby removing one of the main challenges to wide-scale adoption of these programs. While the ability to easily move cars to meet demand would be significant for carsharing programs, if implemented incorrectly, it could lead to worse system performance. In this paper, we seek to improve the performance of a fleet of shared autonomous vehicles through improved matching of vehicles to passengers requesting rides. We consider carsharing with autonomous vehicles as an assignment problem and examine four different methods for matching cars to users in a dynamic setting.
We show how applying a recent algorithm (Scalable Collision-avoiding Role Assignment with Minimal-makespan or SCRAM) for minimizing the maximal edge in a perfect matching can result in a more efficient, reliable, and fair carsharing system.
Our results highlight some of the problems with greedy or decentralized approaches.
Introducing a centralized system creates the possibility for users to strategically mis-report their locations and thereby improve their expected wait time, so we provide a proof demonstrating that cancellation fees can be applied to eliminate the incentive to mis-report location.
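To contrast the objective SCRAM targets with the usual minimum-sum
assignment, here is a sketch using SciPy's assignment solver. It assumes a
square `dist` matrix of vehicle-to-request distances; the binary-search
bottleneck routine is an illustrative stand-in, not the SCRAM algorithm
itself.

import numpy as np
from scipy.optimize import linear_sum_assignment

def min_sum_matching(dist):
    """Minimum total-distance assignment of vehicles (rows) to requests (cols)."""
    rows, cols = linear_sum_assignment(dist)
    return list(zip(rows, cols))

def min_max_matching(dist):
    """Minimize the longest single vehicle-to-request distance (the flavor of
    objective SCRAM targets): binary-search a distance threshold, keep only
    edges under it, and test for a perfect matching with the solver."""
    thresholds = np.unique(dist)
    lo, hi = 0, len(thresholds) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        masked = np.where(dist <= thresholds[mid], dist, np.inf)
        try:
            rows, cols = linear_sum_assignment(masked)
            if np.isfinite(masked[rows, cols]).all():
                best, hi = list(zip(rows, cols)), mid - 1
            else:
                lo = mid + 1
        except ValueError:            # infeasible: no perfect matching under this threshold
            lo = mid + 1
    return best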
Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle & Charging Infrastructure Decisions.
T Donna Chen, Kara M Kockelman,
Josiah P. Hanna.
Transportation Research Part A: Policy and Practice. January 2016.
Official version from
Publisher's Webpage
Abstract
BibTeX
Download:
[pdf] (750.7 KB)
There are natural synergies between shared autonomous vehicle (AV) fleets and
electric vehicle (EV) technology, since fleets of AVs resolve the practical
limitations of today's non-autonomous EVs, including traveler range anxiety,
access to charging infrastructure, and charging time management. Fleet-managed
AVs relieve such concerns, managing range and charging activities based on
real-time trip demand and established charging-station locations, as
demonstrated in this paper. This work explores the management of a fleet of
shared autonomous electric vehicles (SAEVs) in a regional, discrete-time,
agent-based model. The simulation examines the operation of SAEVs under
various vehicle range and charging infrastructure scenarios in a gridded city
modeled roughly after the densities of Austin, Texas.
Results based on 2009 NHTS trip distance and time-of-day distributions
indicate that fleet size is sensitive to battery recharge time and vehicle
range, with each 80-mile range SAEV replacing 3.7 privately owned vehicles
and each 200-mile range SAEV replacing 5.5 privately owned vehicles, under
Level II (240-volt AC) charging. With Level III 480-volt DC fast-charging
infrastructure in place, these ratios rise to 5.4 vehicles for the 80-mile
range SAEV and 6.8 vehicles for the 200-mile range SAEV. SAEVs can serve
96–98% of trip requests with average wait times between 7 and 10
minutes per trip. However, due to the need to travel while "empty" for
charging and passenger pick-up, SAEV fleets are predicted to generate an
additional 7.1–14.0% of travel miles. Financial analysis suggests that
the combined cost of charging infrastructure, vehicle capital and maintenance,
electricity, insurance, and registration for a fleet of SAEVs ranges from
$0.42 to $0.49 per occupied mile traveled, which implies SAEV service can be
offered at the equivalent per-mile cost of private vehicle ownership for
low-mileage households, and thus be competitive with current manually-driven
carsharing services and significantly cheaper than on-demand driver-operated
transportation services. When Austin-specific trip patterns (with more
concentrated trip origins and destinations) are introduced in a final case
study, the simulation predicts a decrease in fleet empty vehicle-miles
(down to 3–4% of all SAEV travel) and average wait times (ranging
from 2 to 4 minutes per trip), with each SAEV replacing 5–9 privately
owned vehicles.
Approximation of Lorenz-optimal Solutions in Multiobjective Markov Decision Processes.
Patrice Perny, Paul Weng, Judy Goldsmith,
Josiah P. Hanna.
Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI). July 2013.
Abstract
BibTeX
Download:
[pdf] (428.9 KB)
This paper is devoted to fair optimization in Multiobjective Markov Decision
Processes (MOMDPs). A MOMDP is an extension of the MDP model for planning
under uncertainty while trying to optimize several reward functions
simultaneously. This applies to multiagent problems when rewards define
individual utility functions, or in multicriteria problems when rewards
refer to different features. In this setting, we study the determination of
policies leading to Lorenz-nondominated tradeoffs. Lorenz dominance is a
refinement of Pareto dominance that was introduced in Social Choice for the
measurement of inequalities. In this paper, we introduce methods to
efficiently approximate the sets of Lorenz-nondominated solutions of
infinite-horizon, discounted MOMDPs. The approximations are polynomial-sized
subsets of those solutions.
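For readers unfamiliar with the ordering, a small sketch of the Lorenz
dominance test on reward vectors (the function names are illustrative):

import numpy as np

def lorenz_vector(values):
    """Cumulative sums of the values sorted in increasing order."""
    return np.cumsum(np.sort(np.asarray(values, dtype=float)))

def lorenz_dominates(x, y):
    """True if reward vector x Lorenz-dominates y: x's Lorenz vector is
    component-wise at least y's and strictly greater somewhere. This is the
    fairness-oriented refinement of Pareto dominance used to rank tradeoffs."""
    lx, ly = lorenz_vector(x), lorenz_vector(y)
    return bool(np.all(lx >= ly) and np.any(lx > ly))

# Example: (3, 3) Lorenz-dominates (1, 5) even though neither Pareto-dominates
# the other, since the equal split distributes the same total more fairly.
assert lorenz_dominates([3, 3], [1, 5])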
The Academic Advising Planning Domain.
Joshua T. Guerin,
Josiah P. Hanna, Libby Ferland, Nicholas Mattei, Judy Goldsmith.
Proceedings of the 3rd Workshop on the International Planning Competition at ICAPS. July 2012.
Abstract
BibTeX
Download:
[pdf] (202.1 KB)
The International Probabilistic Planning Competition is a
leading showcase for fast stochastic planners. The current domains
used in the competition have raised challenges that the
leading deterministic-planner-based MDP solvers have been
able to meet. We argue that in order to continue to raise challenges
and match real world applications, domains must be
generated that exhibit true stochasticity, multi-valued domain
variables, and concurrent actions. In this paper we propose
the academic advising domain as a planning competition domain
that exhibits these characteristics. We believe that this
domain can build upon the success of previous contests in
pushing the limits of MDP planning research.