Future AGIs will need to solve large reinforcement-learning problems
involving complex reward functions with multiple reward sources. One
way to make progress on such problems is to decompose them into
smaller regions that can be solved efficiently. We introduce a novel
modular version of Least Squares Policy Iteration (LSPI), called
M-LSPI, which (1) decomposes a Markov decision process (MDP) into a set
of mutually exclusive regions, and (2) iteratively solves each region by
a single matrix inversion and then combines the regional solutions by
value iteration. The resulting algorithm leverages this regional decomposition to
efficiently solve the MDP. On both structured and unstructured MDPs,
M-LSPI converges to the value function of the optimal policy
substantially faster than traditional algorithms as the number of
states grows, especially as the discount factor approaches one.
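To make the regional idea concrete, below is a minimal sketch of region-wise policy evaluation in the spirit of the description above, not the paper's actual algorithm. It assumes a small tabular MDP with a known per-policy transition matrix `P`, reward vector `r`, and a user-supplied partition `regions` of the state space; LSPI's sample-based, feature-based least-squares machinery and the policy-improvement loop are omitted, and the function name and arguments are illustrative.

```python
import numpy as np

def regional_policy_evaluation(P, r, regions, gamma=0.99, tol=1e-8, max_sweeps=10_000):
    """Evaluate a fixed policy by splitting the states into mutually
    exclusive regions: each region is solved exactly with one matrix
    inversion (holding the other regions' values fixed), and the regional
    solutions are combined by value-iteration-style sweeps."""
    n = len(r)
    V = np.zeros(n)
    for _ in range(max_sweeps):
        V_old = V.copy()
        for region in regions:                       # region: list of state indices
            idx = np.asarray(region)
            out = np.setdiff1d(np.arange(n), idx)    # states outside the region
            # Bellman equation restricted to the region:
            #   V_R = r_R + gamma * P_RR @ V_R + gamma * P_RO @ V_O
            A = np.eye(len(idx)) - gamma * P[np.ix_(idx, idx)]
            b = r[idx] + gamma * P[np.ix_(idx, out)] @ V[out]
            V[idx] = np.linalg.solve(A, b)           # one linear solve per region
        if np.max(np.abs(V - V_old)) < tol:
            break
    return V

# Toy usage: a 4-state chain under a fixed policy, split into two regions.
P = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.1, 0.8, 0.1, 0.0],
              [0.0, 0.1, 0.8, 0.1],
              [0.0, 0.0, 0.1, 0.9]])
r = np.array([0.0, 0.0, 0.0, 1.0])
print(regional_policy_evaluation(P, r, regions=[[0, 1], [2, 3]], gamma=0.95))
```

Because each sweep reuses the most recently updated regional values, this behaves like a block Gauss-Seidel iteration on the policy's Bellman equation and converges for any discount factor below one.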