Learning Agent Tutorial
This tutorial describes the action selection code of the
keepaway
benchmark players and how it can be modified to include your own
learning algorithms. It assumes familiarity with the material
covered in the Basic Tutorial.
1) Keeper actions
The complete behavior of the keepers is described in detail in our papers. Each keeper follows a fixed
policy until it gains possession of the ball. At that point, the player
has the following choices:
- Hold the ball
- Pass to player k
Therefore, there are K actions available, where K is the number of
keepers. The hold action takes one cycle. The pass actions may take
multiple cycles of kicks to complete, but persist once initiated.
2) SMDP methods
The keepaway domain is modeled by the players
as a Semi-Markov Decision Process (SMDP). At each SMDP step, the
agent receives information about the state of the environment,
takes an option, or extended action, and then receives a
reward. The keepaway players use an implementation of the
SMDPAgent interface to handle SMDP events. The following three
methods must be implemented (a rough sketch of the interface
follows the list):
- startEpisode - This method is called at the first action opportunity in the episode. It takes in the state and returns an action.
- step - This method is called at each action opportunity after the
first one. It takes in the state and a reward (for the previous SMDP action) and returns an action.
- endEpisode - This method is called at the end of the episode. It takes in the final reward. No return value is expected.
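For reference, a stripped-down sketch of what the SMDPAgent interface declaration might look like is shown below. This is only a sketch to illustrate the shape of the three callbacks; consult SMDPAgent.h in the framework for the authoritative declaration, and note that the getNumActions accessor shown here is an assumption (only getNumFeatures is discussed later in this tutorial).

class SMDPAgent
{
  int m_numFeatures;   // length of the state feature vector
  int m_numActions;    // K: hold plus the K-1 passes

 protected:
  int getNumFeatures() { return m_numFeatures; }
  int getNumActions()  { return m_numActions;  }  // accessor name assumed

 public:
  SMDPAgent( int numFeatures, int numActions )
    : m_numFeatures( numFeatures ), m_numActions( numActions ) {}
  virtual ~SMDPAgent() {}

  // Called at the first action opportunity of the episode; returns an action.
  virtual int  startEpisode( double state[] ) = 0;
  // Called at each later action opportunity; reward is for the previous
  // SMDP action; returns the next action.
  virtual int  step( double reward, double state[] ) = 0;
  // Called when the episode ends; takes the final reward, returns nothing.
  virtual void endEpisode( double reward ) = 0;
};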
Common Pitfall: The endEpisode method is called for each
agent regardless of whether or not that agent ever touched the ball.
In contrast, startEpisode and step are called only when the ball is
touched, so you can't assume that they are ever called during an
episode. Therefore, if you want to perform some operation in
endEpisode only when that agent has touched the ball at least once,
then you will have to keep track of that yourself.
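One simple way to handle this, sketched below, is to keep a boolean member that records whether the agent was ever given an action opportunity in the current episode. The member and helper names here are illustrative and not part of the framework.

int LearningAgent::startEpisode( double state[] )
{
  m_actedThisEpisode = true;        // illustrative member, initialized to false
  return chooseAction( state );     // hypothetical action-selection helper
}

int LearningAgent::step( double reward, double state[] )
{
  m_actedThisEpisode = true;
  // ... learning update for the previous action would go here ...
  return chooseAction( state );
}

void LearningAgent::endEpisode( double reward )
{
  if ( m_actedThisEpisode ) {
    // safe to credit the last action taken in this episode
  }
  m_actedThisEpisode = false;       // reset for the next episode
}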
The state is presented as a vector of floating point feature values.
The features used in this domain are described in our publications.
The size of the feature vector can be found by calling the
getNumFeatures method of the class implementing SMDPAgent.
The reward is a single floating point value. Each cycle that the episode
lasts earns a reward of +1, so the total reward for an episode equals its
duration in cycles; maximizing reward means keeping the ball away from the
takers for as long as possible.
The action expected to be returned by these methods is represented by an
integer from 0 to K-1, where K is the number of keepers. Action 0
corresponds to the hold action. If player k_0 is the player with the ball,
and we sort the remaining keepers k_1, ..., k_(K-1) by increasing distance
to k_0, then action i (for i >= 1) corresponds to a pass to k_i. For
example, with 3 keepers, action 0 holds the ball, action 1 passes to the
closer teammate, and action 2 passes to the farther one.
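As a purely illustrative example of this encoding, the sketch below shows a baseline agent that returns a uniformly random valid action index; the class name and the use of rand() are assumptions, not part of the framework.

#include <cstdlib>       // rand()
#include "SMDPAgent.h"

// Illustrative baseline: pick a random legal action at each opportunity.
class RandomAgent : public SMDPAgent
{
 public:
  RandomAgent( int numFeatures, int numActions )
    : SMDPAgent( numFeatures, numActions ) {}

  // Any value in [0, K-1] is legal: 0 holds, i >= 1 passes to keeper k_i.
  int  startEpisode( double state[] )        { return rand() % getNumActions(); }
  int  step( double reward, double state[] ) { return rand() % getNumActions(); }
  void endEpisode( double reward )           { /* nothing to learn */ }
};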
3) An Example
This section will walk you through an example of creating your own learner for the keepaway framework.
LearningAgent class
The first thing that you will need to do to create your own learner is to
define a new class that implements the SMDPAgent interface. I have provided
an example implementation here:
LearningAgent.h
LearningAgent.cc
This class is a skeleton for what a reinforcement learning
implementation might look like. It should be modified to incorporate
your own learning algorithm. The framework is meant to allow for many
different
learning methods such as temporal-difference learning and policy search. The only requirement is
that the three methods described in the previous section are implemented.
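To give a feel for its shape, here is a rough sketch of what the LearningAgent header might contain; the linked files above are the authoritative versions, and the member variable and parameter types shown here are assumptions chosen to match the constructor call in main.cc below.

// LearningAgent.h (sketch only -- see the linked file for the real version)
#ifndef LEARNING_AGENT_H
#define LEARNING_AGENT_H

#include "SMDPAgent.h"

class LearningAgent : public SMDPAgent
{
  bool m_learning;   // assumed flag: update weights only when learning is on

 public:
  // Arguments mirror the call made in main.cc below; types are assumptions.
  LearningAgent( int numFeatures, int numActions, bool learning,
                 char *loadWeightsFile, char *saveWeightsFile );

  // The three SMDP callbacks described in Section 2
  int  startEpisode( double state[] );
  int  step( double reward, double state[] );
  void endEpisode( double reward );
};

#endif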
Updating the Makefile
Now you must add the new source file to the Makefile by appending the
name of the .cc file to the SRCS_PLAYER variable:
SRCS_PLAYER = ${SRCS} \
              BasicPlayer.cc \
              KeepawayPlayer.cc \
              HandCodedAgent.cc \
              LearningAgent.cc \
              main.cc
Now you can compile using make. Before you run make the first
time after adding the new file, you should run make depend.
Modifying main.cc
The next step is to link the new agent code to the rest of the code. You
will need to modify main.cc. There is already a hook in place for
you to enter your code: if the keeper policy is set to be learned,
the new learning agent class will be chosen. Here is what this section
of main.cc should look like when you're done:
if ( strlen( strPolicy ) > 0 && strPolicy[0] == 'l' ) {
  // (l)earned
  sa = new LearningAgent( numFeatures, numActions,
                          bLearn, loadWeightsFile, saveWeightsFile );
}
else {
  // (ha)nd (ho)ld (r)andom
  sa = new HandCodedAgent( numFeatures, numActions,
                           strPolicy );
}
Also, you will need to include the new header at the top of main.cc:
#include "SenseHandler.h"
#include "ActHandler.h"
#include "KeepawayPlayer.h"
#include "HandCodedAgent.h"
#include "LearningAgent.h"
Now you are ready to run make again. Hopefully, there are no errors.
Modifying the startup script
Finally, we need to change a couple of options in the keepaway.sh
script. First, we need to change:
keeper_learn=1
This option turns on the learning flag to tell the learner that this is
a learning trial. Next, change:
keeper_policy="learned"
This option selects the new LearningAgent class instead of the
HandCodedAgent class.
That's it! Now run ./keepaway.sh to start the new learning players.
Please email questions or comments to the mailing list.