A paper that addresses the relationship between first-visit and every-visit MC (Singh and
Sutton, 1996). For the theoretical relationships, see Section 3.3 onward (and the referenced appendices).
The equivalence of MC and first-visit TD(1) is proven starting in Section 2.4.
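The distinction the paper studies can be sketched in a few lines. This is an illustrative implementation, assuming episodes are given as lists of (state, reward) pairs where the reward follows the visit to that state; the function and variable names are mine, not from the paper:

```python
from collections import defaultdict

def mc_value_estimates(episodes, gamma=1.0, first_visit=True):
    """Monte Carlo state-value prediction.

    With first_visit=True, only the first occurrence of a state in an
    episode contributes a return; with first_visit=False (every-visit),
    every occurrence does.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return G_t following each time step, backward.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

On the episode A→A→terminal with reward 1 at each step and gamma = 1, first-visit gives V(A) = 2 while every-visit averages the two returns, (2 + 1)/2 = 1.5, which is the kind of bias/variance difference the paper analyzes.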
Autonomous helicopter flight via reinforcement learning.
Andrew Ng, H. Jin Kim, Michael Jordan and Shankar Sastry.
In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 17, 2004.
Some useful slides
(Part C) from Michael Bowling on game theory, stochastic games, and
correlated equilibria; and (Part D) from Michael Littman with more on
stochastic games.
Safe Exploration in Markov Decision Processes
Moldovan and Abbeel, ICML 2012 (safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state)
Progressive Neural Networks.
This work improves on fine-tuning by adding new columns to a deep network while never overwriting the previously
learned weights, thereby avoiding catastrophic forgetting.
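The column mechanism can be sketched minimally as follows. This is an illustrative NumPy version assuming single-hidden-layer columns; the class names, shapes, and initialization are mine, and the actual architecture is deeper and uses adapter layers before the lateral connections:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class ProgressiveColumn:
    """One column: a hidden layer plus lateral connections reading the
    hidden activations of all previously added (frozen) columns."""
    def __init__(self, in_dim, hidden, n_prev, rng):
        self.W = rng.standard_normal((hidden, in_dim)) * 0.1
        # One lateral weight matrix per previous column.
        self.U = [rng.standard_normal((hidden, hidden)) * 0.1
                  for _ in range(n_prev)]

    def forward(self, x, prev_hiddens):
        h = self.W @ x
        for U, hp in zip(self.U, prev_hiddens):
            h = h + U @ hp  # lateral input from frozen columns
        return relu(h)

class ProgressiveNet:
    def __init__(self, in_dim, hidden, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.in_dim, self.hidden = in_dim, hidden
        self.columns = []

    def add_column(self):
        # Each new task gets a fresh column; earlier columns are never
        # modified during later training, only read laterally.
        col = ProgressiveColumn(self.in_dim, self.hidden,
                                len(self.columns), self.rng)
        self.columns.append(col)
        return col

    def forward(self, x):
        hiddens = []
        for col in self.columns:
            hiddens.append(col.forward(x, hiddens))
        return hiddens[-1]
```

Training would update only the newest column's `W` and `U` matrices, which is how previously learned weights survive intact.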