A paper with surprising results on the strict variance bounds of popular importance sampling techniques. (Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. ICML 2020.)
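For reference (this is the standard setup, not a result from the paper), the ordinary per-episode importance sampling estimator whose variance is at issue is, in the book's notation with target policy \pi and behavior policy b,

    \hat{V}_{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^{n} \rho^{(i)} G^{(i)}, \qquad \rho^{(i)} = \prod_{t=0}^{T_i - 1} \frac{\pi(A_t^{(i)} \mid S_t^{(i)})}{b(A_t^{(i)} \mid S_t^{(i)})},

where G^{(i)} is the return of episode i; the weighted variant normalizes by \sum_i \rho^{(i)} rather than n.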
The article referred to in Section 5.1 of the book, which proves that every-visit MC converges to the correct value function in the limit of infinite experience.
A paper that shows that using an empirical estimate of the behavior policy works better than using the true behavior policy in the importance sampling ratio.
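A minimal Python sketch (illustrative, not the paper's code) of what this means in practice: fit an empirical behavior policy from the logged episodes and plug it into the ordinary importance sampling estimator in place of the true behavior policy. The data layout (episodes as lists of (state, action, reward) triples) and all names here are assumptions for the example.

from collections import defaultdict

def estimate_behavior_policy(episodes):
    # Empirical behavior policy: observed action frequencies in each state.
    counts = defaultdict(lambda: defaultdict(int))
    for episode in episodes:
        for s, a, _ in episode:
            counts[s][a] += 1
    return {s: {a: c / sum(acts.values()) for a, c in acts.items()}
            for s, acts in counts.items()}

def ordinary_is_estimate(episodes, target_pi, behavior, gamma=1.0):
    # Average of per-episode importance-weighted returns.
    estimates = []
    for episode in episodes:
        rho, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in episode:
            rho *= target_pi[s][a] / behavior[s][a]
            ret += discount * r
            discount *= gamma
        estimates.append(rho * ret)
    return sum(estimates) / len(estimates)

# Comparing the two estimators on the same logged data:
# v_true = ordinary_is_estimate(episodes, target_pi, b_true)
# v_emp  = ordinary_is_estimate(episodes, target_pi, estimate_behavior_policy(episodes))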
Model-Free Least-Squares Policy Iteration. Michail G. Lagoudakis and Ronald Parr. In Proceedings of NIPS 2001: Neural Information Processing Systems: Natural and Synthetic, Vancouver, BC, December 2001, pp. 1547-1554.
A paper that addresses the relationship between first-visit and every-visit MC (Singh and Sutton, 1996). For some theoretical relationships, see the section starting at Section 3.3 (and the referenced appendices). The equivalence of MC and first-visit TD(1) is proven starting in Section 2.4.
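To make the distinction concrete, here is a small illustrative Python sketch (not from the paper) of batch Monte Carlo prediction with a first_visit switch; episodes are assumed to be lists of (state, reward) pairs, where the reward is the one received on leaving that state.

from collections import defaultdict

def mc_value_estimates(episodes, gamma=1.0, first_visit=True):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Return following each time step, computed by a backward pass.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if first_visit and s in seen:
                continue  # first-visit MC averages only the first occurrence per episode
            seen.add(s)
            returns_sum[s] += returns[t]
            returns_count[s] += 1
    # Every-visit MC (first_visit=False) averages over all occurrences instead.
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}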
Autonomous helicopter flight via reinforcement learning. Andrew Ng, H. Jin Kim, Michael Jordan and Shankar Sastry. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NeurIPS) 16, 2004.
Safe Exploration in Markov Decision Processes. Moldovan and Abbeel, ICML 2012. (Safe exploration in non-ergodic domains by favoring policies that maintain the ability to return to the start state.)
Some useful slides: Part C, from Michael Bowling, on game theory, stochastic games, and correlated equilibria; and Part D, from Michael Littman, with more on stochastic games.