Learning Agent Tutorial
This tutorial describes the action selection code of the
keepaway
benchmark players and how it can be modified to include your own
learning algorithms. It assumes familiarity with the material
covered in the Basic Tutorial.
1) Keeper actions
The complete behavior of the keepers is described in detail in our papers. Each keeper follows a fixed
policy until it gains possession of the ball. At that point, the player
has the following choices:
- Hold the ball
- Pass to player k
Therefore, there are K actions available, where K is the number of
keepers. The hold action takes one cycle. The pass actions may take
multiple cycles of kicks to complete, but persist once initiated.
2) SMDP methods
The keepaway domain is modeled by the players
as a Semi-Markov Decision Process (SMDP). At each SMDP step, the
agent receives information about the state of the environment,
takes an option, or extended action, and then receives a
reward. The keepaway players use an implementation of the
SMDPAgent interface to handle SMDP events. The following three
methods must be implemented (a rough sketch of the interface
follows the list):
- startEpisode - This method is called at the first action opportunity in the episode. It takes in the state and returns an action.
- step - This method is called at each action opportunity after the
first one. It takes in the state and a reward (for the previous SMDP action) and returns an action.
- endEpisode - This method is called at the end of the episode. It takes in the final reward. No return value is expected.
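For reference, a stripped-down sketch of what the SMDPAgent interface declaration might look like is shown below. This is only a sketch to illustrate the shape of the three callbacks; consult SMDPAgent.h in the framework for the authoritative declaration, and note that the getNumActions accessor shown here is an assumption (only getNumFeatures is discussed later in this tutorial).

class SMDPAgent
{
  int m_numFeatures;   // length of the state feature vector
  int m_numActions;    // K: hold plus the K-1 passes

 protected:
  int getNumFeatures() { return m_numFeatures; }
  int getNumActions()  { return m_numActions;  }  // accessor name assumed

 public:
  SMDPAgent( int numFeatures, int numActions )
    : m_numFeatures( numFeatures ), m_numActions( numActions ) {}
  virtual ~SMDPAgent() {}

  // Called at the first action opportunity of the episode; returns an action.
  virtual int  startEpisode( double state[] ) = 0;
  // Called at each later action opportunity; reward is for the previous
  // SMDP action; returns the next action.
  virtual int  step( double reward, double state[] ) = 0;
  // Called when the episode ends; takes the final reward, returns nothing.
  virtual void endEpisode( double reward ) = 0;
};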
Common Pitfall: The endEpisode method is called for each
agent regardless of whether or not that agent ever touched the ball.
In contrast, startEpisode and step are called only when the ball is
touched, so you can't assume that they are ever called during an
episode. Therefore, if you want to perform some operation in
endEpisode only when that agent has touched the ball at least once,
then you will have to keep track of that yourself.
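One simple way to handle this, sketched below, is to keep a boolean member that records whether the agent was ever given an action opportunity in the current episode. The member and helper names here are illustrative and not part of the framework.

int LearningAgent::startEpisode( double state[] )
{
  m_actedThisEpisode = true;        // illustrative member, initialized to false
  return chooseAction( state );     // hypothetical action-selection helper
}

int LearningAgent::step( double reward, double state[] )
{
  m_actedThisEpisode = true;
  // ... learning update for the previous action would go here ...
  return chooseAction( state );
}

void LearningAgent::endEpisode( double reward )
{
  if ( m_actedThisEpisode ) {
    // safe to credit the last action taken in this episode
  }
  m_actedThisEpisode = false;       // reset for the next episode
}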
The state is presented as a vector of floating point feature values.
The features used in this domain are described in our publications.
The size of the feature vector can be found by calling the
getNumFeatures method of the class implementing SMDPAgent.
The reward is a single floating point value. Each cycle that the episode
lasts earns a reward of +1, so the total reward for an episode equals its
duration in cycles; maximizing reward means keeping the ball away from the
takers for as long as possible.
The action expected to be returned by these methods is represented by an
integer from 0 to K-1, where K is the number of keepers. Action 0
corresponds to the hold action. If player k_0 is the player with the ball,
and we sort the remaining keepers k_1, ..., k_(K-1) by increasing distance
to k_0, then action i (for i >= 1) corresponds to a pass to k_i. For
example, with 3 keepers, action 0 holds the ball, action 1 passes to the
closer teammate, and action 2 passes to the farther one.
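As a purely illustrative example of this encoding, the sketch below shows a baseline agent that returns a uniformly random valid action index; the class name and the use of rand() are assumptions, not part of the framework.

#include <cstdlib>       // rand()
#include "SMDPAgent.h"

// Illustrative baseline: pick a random legal action at each opportunity.
class RandomAgent : public SMDPAgent
{
 public:
  RandomAgent( int numFeatures, int numActions )
    : SMDPAgent( numFeatures, numActions ) {}

  // Any value in [0, K-1] is legal: 0 holds, i >= 1 passes to keeper k_i.
  int  startEpisode( double state[] )        { return rand() % getNumActions(); }
  int  step( double reward, double state[] ) { return rand() % getNumActions(); }
  void endEpisode( double reward )           { /* nothing to learn */ }
};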
3) An Example
This section will walk you through an example of creating your own learner for the keepaway framework.
LearningAgent class
The first thing that you will need to do to create your own learner is to
define a new class that implements the SMDPAgent interface. I have provided
an example implementation here:
LearningAgent.h
LearningAgent.cc
This class is a skeleton for what a reinforcement learning
implementation might look like. It should be modified to incorporate
your own learning algorithm. The framework is meant to allow for many
different
learning methods such as temporal-difference learning and policy search. The only requirement is
that the three methods described in the previous section are implemented.
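To give a feel for its shape, here is a rough sketch of what the LearningAgent header might contain; the linked files above are the authoritative versions, and the member variable and parameter types shown here are assumptions chosen to match the constructor call in main.cc below.

// LearningAgent.h (sketch only -- see the linked file for the real version)
#ifndef LEARNING_AGENT_H
#define LEARNING_AGENT_H

#include "SMDPAgent.h"

class LearningAgent : public SMDPAgent
{
  bool m_learning;   // assumed flag: update weights only when learning is on

 public:
  // Arguments mirror the call made in main.cc below; types are assumptions.
  LearningAgent( int numFeatures, int numActions, bool learning,
                 char *loadWeightsFile, char *saveWeightsFile );

  // The three SMDP callbacks described in Section 2
  int  startEpisode( double state[] );
  int  step( double reward, double state[] );
  void endEpisode( double reward );
};

#endif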
Updating the Makefile
Now you must add the new source file to the Makefile by appending the
name of the .cc file to the SRCS_PLAYER variable:
SRCS_PLAYER = ${SRCS} \
              BasicPlayer.cc \
              KeepawayPlayer.cc \
              HandCodedAgent.cc \
              LearningAgent.cc \
              main.cc
Now you can compile using make. Before you run make the first
time after adding the new file, you should run make depend.
Modifying main.cc
The next step is to link the new agent code to the rest of the code. You
will need to modify main.cc. There is already a hook in place for
you to enter your code: if the keeper policy is set to be learned,
the new learning agent class will be chosen. Here is what this section
of main.cc should look like when you're done:
if ( strlen( strPolicy ) > 0 && strPolicy[0] == 'l' ) {
  // (l)earned
  sa = new LearningAgent( numFeatures, numActions,
                          bLearn, loadWeightsFile, saveWeightsFile );
}
else {
  // (ha)nd (ho)ld (r)andom
  sa = new HandCodedAgent( numFeatures, numActions,
                           strPolicy );
}
Also, you will need to include the new header at the top of main.cc:
#include "SenseHandler.h"
#include "ActHandler.h"
#include "KeepawayPlayer.h"
#include "HandCodedAgent.h"
#include "LearningAgent.h"
Now you are ready to run make again. Hopefully, there are no errors.
Modifying the startup script
Finally, we need to change a couple of options in the keepaway.sh
script. First, we need to change:
keeper_learn=1
This option turns on the learning flag to tell the learner that this is
a learning trial. Next, change:
keeper_policy="learned"
This option selects the new LearningAgent class instead of the
HandCodedAgent class.
That's it! Now run ./keepaway.sh to start the new learning players.
Please email questions or comments to the mailing list.