A paper with surprising results on the strict variance bounds of popular importance sampling techniques. (Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. ICML 2020.)
A paper that addresses the relationship between first-visit and every-visit MC (Singh and Sutton, 1996). For theoretical relationships, see Section 3.3 onward (and the referenced appendices). The equivalence of first-visit MC and TD(1) is proven starting in Section 2.4.
Autonomous helicopter flight via reinforcement learning. Andrew Ng, H. Jin Kim, Michael Jordan and Shankar Sastry. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NeurIPS) 17, 2004.
Safe Exploration in Markov Decision Processes. Moldovan and Abbeel, ICML 2012. (Safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state.)
Some useful slides (Part C) from Michael Bowling on game theory, stochastic games, and correlated equilibria, and (Part D) from Michael Littman with more on stochastic games.