Online fitted policy iteration based on extreme learning machines
Access rights: closedAccess
Authors: Escandell-Montero, Pablo; Lorente, Delia; Martínez-Martínez, José M.; Soria-Olivas, Emilio; Vila-Francés, Joan; Martín-Guerrero, José D.
Citation: Escandell-Montero, P., Lorente, D., Martínez-Martínez, J. M., Soria-Olivas, E., Vila-Francés, J., & Martín-Guerrero, J. D. (2016). Online fitted policy iteration based on extreme learning machines. Knowledge-Based Systems, 100, 200-211.
Abstract: Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains limited, notably by the large amount of data the agent requires to learn useful policies and by poor scalability to high-dimensional problems when local approximators are used. This paper presents a novel RL algorithm, online fitted policy iteration (OFPI), that makes progress on both fronts. OFPI is based on a semi-batch scheme that increases convergence speed by reusing data and enables the use of global approximators by reformulating value function approximation as a standard supervised learning problem. The proposed method was empirically evaluated on three benchmark problems. In the experiments, OFPI approximated the value functions with a neural network trained by the extreme learning machine algorithm. Results demonstrate that OFPI is stable with a global function approximator and that it outperforms two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network.
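The abstract's key idea is that once value function approximation is cast as a supervised regression problem, a global approximator such as an extreme learning machine (ELM) can be fit in closed form: the hidden layer gets random, fixed weights, and only the output weights are solved by least squares. The sketch below illustrates that ELM training step on a toy regression target; the function names (`elm_train`, `elm_predict`) and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, n_hidden=100):
    """Fit an ELM regressor: random fixed hidden layer, output weights
    solved in closed form by least squares (hypothetical helper
    illustrating the supervised formulation described in the abstract)."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))  # random input weights (never trained)
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)     # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy supervised problem standing in for the value-function regression targets
X = rng.uniform(-1, 1, size=(200, 2))
T = np.sin(X[:, 0]) + X[:, 1] ** 2
W, b, beta = elm_train(X, T)
pred = elm_predict(X, W, b, beta)
mse = float(np.mean((pred - T) ** 2))
```

Because only `beta` is learned, training reduces to a single linear solve, which is what makes repeated refitting on reused batches of experience cheap in a semi-batch scheme like OFPI's.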