A Least Squares Temporal Difference Actor-Critic Algorithm With Applications to Warehouse Management
This paper develops a new approximate dynamic programming algorithm for Markov decision problems and applies it to a vehicle dispatching problem arising in warehouse management. The algorithm is of the actor-critic type and uses least squares temporal difference (LSTD) learning. It operates on a single sample path of the system and optimizes the policy within a pre-specified class parameterized by a parsimonious set of parameters. The method applies to a partially observable Markov decision process setting in which state measurements are potentially corrupted and the cost is observed only through these imperfect state observations. The authors show that, under reasonable assumptions, the algorithm converges to a locally optimal set of parameters.
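To make the actor-critic structure described above concrete, the following is a minimal illustrative sketch, not the paper's algorithm: a critic that accumulates LSTD statistics for a linear value approximation, paired with an actor that takes policy-gradient steps on a softmax policy using the TD error as a cost signal. The toy two-state MDP, the one-hot features, the step size, and all variable names are assumptions introduced purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (illustrative only; not the warehouse model)
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a] -> next-state probs
cost = np.array([[1.0, 0.5], [2.0, 0.1]])   # per-step cost c(s, a)
gamma = 0.95                                 # discount factor

def features(s):
    # One-hot state features for the critic's linear value approximation
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

def policy_probs(theta, s):
    # Softmax (Boltzmann) policy over actions, parameterized by theta
    logits = theta[s]
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.zeros((n_states, n_actions))  # actor (policy) parameters
A = 1e-3 * np.eye(n_states)              # regularized LSTD statistics
b = np.zeros(n_states)
alpha = 0.05                             # actor step size

s = 0
for t in range(5000):
    probs = policy_probs(theta, s)
    a = rng.choice(n_actions, p=probs)
    s_next = rng.choice(n_states, p=P[s, a])
    c = cost[s, a]

    # Critic: accumulate LSTD statistics so that A w = b gives value weights w
    phi, phi_next = features(s), features(s_next)
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * c
    w = np.linalg.solve(A, b)

    # Actor: policy-gradient step driven by the TD error (a cost, so descend)
    delta = c + gamma * w @ phi_next - w @ phi
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] -= alpha * delta * grad_log

    s = s_next
```

Note the two-timescale flavor typical of actor-critic methods: the critic's LSTD solution is recomputed from all accumulated data, while the actor takes small incremental steps, so the critic effectively tracks the slowly changing policy.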