Supplemental site for


W. Bradley Knox and Peter Stone. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. In Proceedings of The Fifth International Conference on Knowledge Capture. September 2009.

K-CAP09

Download: [pdf] (638kB)

 

Videos of TAMER agents




Instructions given or read to human trainers for experiments


Tetris Trainer Instructions


For non-CS subjects (read aloud):



Before 1st practice period:


In this experiment you will train a learning agent to play Tetris through positive and negative reinforcement. Practicing should take about 10 minutes. The real experiment will take about 15 minutes. For the time that any one Tetris piece is falling, you can give feedback about the previous Tetris piece's placement. So be careful not to give reinforcement for a move until it is fully completed. As the display will say, pressing the key 'p' gives positive reinforcement and pressing 'n' gives negative reinforcement. You can press either button multiple times to add strength to your feedback. You can also choose to not give feedback for any move. Ask me if you want to change the speed at which the game is played. You will have one practice period for getting used to operating the training system. Then there will be additional practice time after I read you some strategic suggestions.


Before 2nd practice period:


As a trainer, your task is to let the Tetris player know which moves you approve and disapprove. Although you can use any strategy you like, we have two suggestions.


1) If the player has a pattern of behavior that you don't like, negatively reinforce anything related to that pattern.


2) If the player isn't doing something that you want it to, one strategy is to give negative reinforcement more often and positive reinforcement less often until you see the desired action. When you do see the desired action, we suggest heavily rewarding that action.



- Now you can practice as long as you want. If you want to practice with a fresh Tetris player, let me know and I'll restart it. Also, let me know when you are ready to start the real training session that will be included in my experimental data.




For AI students/post-docs (sent via email):



In this experiment you will train a learning agent to play Tetris to your preference. Starting it up and practicing should take as little as 5 minutes (more if you want to practice a lot). The real experiment will take about 15 minutes by my estimate.


For the integrity of the data, please read this carefully.


While sitting at a UTCS machine (don't use ssh -X unless you don't notice any lag):


cd /projects/xxxx/rl-library/projects/experiments/guiExperiment/



./runNetDynamicEnvStandardAgent.bash



In another, small window on the same machine:


cd /u/xxxx/projects/shaping/agents/tetrisagent/


./run.bash -t yournamepractice


(Of course, replace "yourname" in "yournamepractice" with your actual name.)


To train a Tetris player:


- To start, in the RLVizApp window:


1. choose Tetris from the "Choose Environment" drop-down box,


2. then click "Load Experiment",


3. then move the "Simulation Speed" selector to around 150 (this can later be adjusted to your preference),


4. and then click "Start".


- As the agent plays, keep the second *terminal* window on top so that the agent can receive your keyboard-based feedback.


- ***** This part is easy to get wrong. The window on top should be the terminal window with a box drawn around it. The text of that window starts with "Reward keys". If you do not see the number after "Human reinforcement for previous tetromino:" change when you give reward, the agent is not receiving your feedback. *****


- For the time that one Tetris piece is falling, you can give feedback about the previous piece placement. **Be careful that you don't give feedback on a move until it is fully completed.**


- You can also choose to not give feedback for any move.


- Pressing 'p' gives positive reinforcement and pressing 'n' gives negative reinforcement. You can press either button multiple times to add strength to your feedback.



Play only long enough to get comfortable with the interface. Then restart the game for a fresh practice after reading the following strategy:



As a trainer, your task is to let the Tetris player know which moves you approve and disapprove. Although you can use any strategy you like, we have two suggestions:


1) If the player has a pattern of behavior that you don't like, negatively reinforce anything related to that pattern.


2) On the flipside, if the player isn't doing something that you want it to do, one strategy is to give negative reinforcement more often and positive reinforcement less often until you see the desired action. When you do, we suggest heavily rewarding that action.


At this point, you can practice as long as you want, restarting if you wish.


When you decide that you are ready to play for real (and for the prize!), restart the program, replacing "yournamepractice" in the terminal command with your name followed by the number 1 so that I know it's your first real training attempt. Your first attempt is the only one that I'll be using for sure, so do your best on this one. **An experimental run lasts 10 games.** Only at the end of the tenth game is any of your data saved to file. The current game number is printed above the Tetris board in the "E" part of "E/S/T". After the 10th game finishes, you can keep playing as long as you like, but no data will be recorded.


Just to be as clear as possible, I would type:


./run.bash -t me1



Whoever trains the best Tetris player on their first real run (no cheating or you'll be making me a fraud) wins their choice of some fine Russian vodka or some fine dining on me. The competition involves anyone from whom I collect data, including some non-CS people (who probably can't put up a fight). You can practice as much as you want beforehand, but once you do the yourname1 trial, please don't overwrite those data files. Runs are evaluated on the sum of reward received over the 10 games. Also, please *don't share results with each other* for the sake of the data's integrity. I'll announce the results after the data is collected.


If you need to take a break, you can click "Stop" in the RLVizApp window. When you want to resume, click "Start" and then put the correct window on top (see the five-star message above).


If you get bored because your agent is too good and never loses, you can speed it up and stop giving reinforcement.



A little warning: since I didn't master Python's Curses, the second terminal window will become pretty useless after the program ends. It's annoying. Sorry.


Thanks to everyone who helps.




Mountain Car Trainer Instructions



For AI students/post-docs (sent via email, and an author was also present for questions):


As a subject for this study, you will train a computer agent to complete the "Mountain Car" task. Specifically, the agent is a car that can choose to accelerate left, right, or not at all. The task is for the car to get to the goal (a marker on the top of the right hill) in the least time possible.


Your job will be to train the car agent to perform the task. As the trainer, you will give the agent positive and negative reinforcement as it explores different strategies. Your reinforcement will shape its behavior toward a strategy that efficiently makes it to the goal.



-----------------------------------------------------------

-----------------------------------------------------------

The MDP environment and agent are started separately. Here's how to start the environment:


Log in to a department computer and enter a GUI environment. (Don't use ssh, since it will cause a lag that might disrupt the results.)


In a bash Terminal window, run:


export RLLIBRARY=/projects/xxxx/rl-library

cd /projects/xxxx/rl-library/projects/experiments/guiExperiment

./runNetDynamicEnvStandardAgent.bash


It will then wait until it detects an agent.


----------------------------------------------------------

Before acting as a trainer, you will first learn, for yourself, a good strategy by controlling the car.


To start a controllable agent, open another terminal window.


Give the following commands:

export RLLIBRARY=/projects/xxxx/rl-library

cd /u/xxxx/projects/shaping/agents/mcshapeagent

./runControl.bash -t yourname


This should connect an agent to the environment and bring up a GUI visualizer. Click the "Load Experiment" button in the RLVizApp window. Then set the "Simulation Speed" bar to 151 (any other speed will make the task harder or easier; sticking with 151 helps ensure the integrity of the results). Click "Start".


The red rectangle represents the car, and the baby blue vertical bar on the car indicates the car's current acceleration. The goal is the green marker on the right hill.


For the agent to receive keyboard input, the most recently used terminal window has to be on top. Keys 'S', 'D', and 'F' make the car accelerate left, not at all, and right, respectively. You only have to push a key once for the car to keep choosing that action until another key is pressed or until the end of the episode.


Control the car until you think you have a clear idea of what a near-optimal policy (i.e., the fastest way to the goal) would look like. You will want to pay attention to when precisely the car should change its direction. (One of the biggest dangers to my results is that trainers might decide the agent is "good enough" when it could be much better. In this case, merely getting the car to the goal is not good enough.)


Note the red numbers at the top of the display. The first number is the game number. The second measures how long the current game has been going on (this is what you're trying to minimize).


(If you don't press anything at the beginning of the episode, an unrelated algorithm will take over until you exert control.)


------

[Start instruction loop (to be repeated 3 times)]


Now you will train the car agent, giving it positive and negative reinforcement to shape its strategy into a good one.


In the original window, press Ctrl-c to close the RL Visualizer application. You may have to do it twice. Then issue the command ./runNetDynamicEnvStandardAgent.bash again.


Close the newer window, which probably looks messed up now. Open a new window and run:

export RLLIBRARY=/projects/xxxx/rl-library

cd /u/xxxx/projects/shaping/agents/mcshapeagent

./runExperiment.bash -t yourname


You are about to start the environment again. Please read the rest of the instructions before you do, and then start training as quickly as possible after clicking "Start" to avoid wasted time steps.


The input for this phase is 'p' for positive reinforcement and 'n' for negative reinforcement. In other words, push 'p' immediately after behavior you approve of and push 'n' after behavior you disapprove of. Use single button pushes for reinforcement (don't hold it down).


** Strategic notes **

1) The agent learns best when reinforcement is both consistent and given very shortly after the action/event being reinforced.

2) Up to a certain level, more frequent feedback generally makes for better learning.

3) Be careful not to give feedback for something that hasn't happened yet. In other words, don't give reinforcement for an action that you anticipate but has not occurred.


Train the agent for 20 episodes (i.e., it reaches the goal 20 times) or until you are confident that you cannot train the agent to improve its policy any further.


Once you are done, speed up the agent (set the speed bar to the far left) and watch it for 50 episodes to make sure it doesn't get stuck somewhere. If it does, quickly slow it down and give it negative reinforcement and watch it for another 50. Repeat until it does not get stuck (this won't take long).


You will start the environment the same way (same series of clicks within the RLVizApp window: "Load Experiment", speed bar to 151, and "Start"). Again, with the newer window open, you can give the agent keyboard input.


[End instruction loop] You will train an agent 3 times this way. Expect to get better as a trainer as you progress (don't get frustrated!).





Converting a working, basic policy iteration algorithm to TAMER


We encourage readers to test out TAMER for themselves. Linear, gradient-descent Sarsa(λ), found in Sutton and Barto's Reinforcement Learning: An Introduction, is a common algorithm that can easily be converted to a TAMER algorithm, given that (1) time steps occur only every second or less often (if time steps are more frequent, see the last "guiding note" below) and that (2) the agent's state and action can be detected in a meaningful way (usually visually) in real time by a human. Starting with an agent that already learns effectively in the target domain (and thus has a well-tuned function approximator):


  1. Set the discount factor (i.e., γ) to 0, removing bootstrapping from the update.

  2. Make action selection fully greedy (e.g., ε := 0 if using ε-greedy). Non-greedy action selection is also an option, but our TAMER agents have all effectively used greedy selection, making it a good starting point.

  3. Replace the environmental reward r with the scalar human feedback signal h. You will probably want feedback to be gathered by keyboard (see the sketch after this list).

  4. If h = 0, do not update the model.

  5. Remove eligibility traces (i.e., λ := 0).

  6. Consider lowering the step-size parameter α to 0.01 or less (as our agents have been parameterized).
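
For modification 3, here is a minimal sketch (not the code used in our experiments) of gathering the scalar human feedback signal h from the keyboard with Python's curses module. Each press of 'p' adds +1 and each press of 'n' adds -1, so repeated presses strengthen the signal, and all presses buffered since the last time step are summed:

    import curses

    def open_feedback_screen():
        # Read key presses immediately, without echoing, and make getch()
        # non-blocking so the agent can poll for feedback each time step.
        stdscr = curses.initscr()
        curses.cbreak()
        curses.noecho()
        stdscr.nodelay(True)
        return stdscr

    def get_human_feedback(stdscr):
        # Sum every 'p' (+1) and 'n' (-1) press since the last call into
        # one scalar feedback value h for the previous action.
        h = 0.0
        while True:
            key = stdscr.getch()   # returns -1 when no key is waiting
            if key == -1:
                return h
            if key == ord('p'):
                h += 1.0
            elif key == ord('n'):
                h -= 1.0

(Call curses.endwin() when training ends to restore the terminal.)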


After making these changes, your agent will be a specific implementation of Algorithm 1 (which is shown in the paper for which this website is a supplement). Other, similar algorithms might also be converted with only small changes to this list of modifications.
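
For concreteness, here is a minimal sketch of the learning update that results from these modifications, written for a linear model over state-action features. It is an illustration under the assumptions above rather than the exact code from our experiments; the names "features" and "actions" and the particular step-size value are placeholders:

    import numpy as np

    ALPHA = 0.005  # small step size, per modification 6

    def predict(weights, phi):
        # Predicted human reinforcement for one state-action feature vector.
        return float(np.dot(weights, phi))

    def select_action(weights, state, actions, features):
        # Modification 2: fully greedy selection over the learned model of
        # human reinforcement.
        return max(actions, key=lambda a: predict(weights, features(state, a)))

    def tamer_update(weights, phi_prev, h):
        # Modifications 1, 3, and 5: no bootstrapping (gamma = 0) and no
        # eligibility traces (lambda = 0); the target is simply the human
        # feedback h received for the previous state-action pair.
        if h == 0.0:               # modification 4: no feedback, no update
            return weights
        error = h - predict(weights, phi_prev)
        return weights + ALPHA * error * phi_prev

On each time step, the agent would collect h for the previous step (for example, with get_human_feedback above), call tamer_update with that step's feature vector, and then choose its next action greedily with select_action.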


A few guiding notes:


  1. We expect that function approximators that generalize broadly to unseen states will perform best within TAMER.

  2. We have found that normalizing feature values to be in roughly similar ranges improves performance (see the brief illustration after these notes). This is likely because of the small step size and small number of samples collected by TAMER for learning.

  3. If the agent can communicate its action or its intent to the human, the quality of the human feedback will be better than if the human can merely observe state changes.

  4. If the task domain has frequent time steps, follow our credit assignment algorithm in the K-CAP09 paper.
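
As a brief illustration of note 2, one simple approach (assuming known or estimated bounds for each feature; the bounds below are hypothetical) is to rescale every feature into roughly the same range, such as [0, 1]:

    import numpy as np

    # Hypothetical per-feature lower and upper bounds.
    FEATURE_MIN = np.array([0.0, 0.0, 0.0])
    FEATURE_MAX = np.array([20.0, 10.0, 200.0])

    def normalize(phi):
        # Map each raw feature into roughly [0, 1] so that no single feature
        # dominates the update at TAMER's small step sizes.
        return (phi - FEATURE_MIN) / (FEATURE_MAX - FEATURE_MIN)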


If you try out this conversion to TAMER, please let us know about your experience, whether it is successful or not, and whether you suggest changes to these directions.