Estimating Q(s,s') with Deep Deterministic Dynamics Gradients
TL;DR
A new Q function for off-policy reinforcement learning that doesn't rely on actions.
Abstract
In this paper, we introduce a novel form of value function, Q(s,s′), that expresses the utility of transitioning from a state s to a neighboring state s′ and then acting optimally thereafter. In order to derive an optimal policy, we develop a forward dynamics model that learns to make next-state predictions that maximize this value. This formulation decouples actions from values while still learning off-policy. We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies.
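Below is a minimal, hypothetical sketch of the core idea described in the abstract, not the authors' implementation. It treats the critic as a function of state pairs Q(s, s') and trains a forward model to propose next states that maximize Q, analogous to the actor update in DDPG but in state space. Network sizes, learning rates, and the placeholder data are assumptions, and the inverse-dynamics step that recovers an executable action is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a Q(s, s') critic and a forward "dynamics" model.
# The model plays the role of a deterministic policy over next states,
# trained to propose s' that maximizes Q (the "dynamics gradient" step).

state_dim = 4
gamma = 0.99

critic = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
model = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))

critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
model_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def q(s, s_next):
    # Score a transition (s, s') with the critic.
    return critic(torch.cat([s, s_next], dim=-1))

def update(s, s_next, r, done):
    # Critic: one-step TD target, bootstrapping with the model's
    # proposed next state from s'.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q(s_next, model(s_next))
    critic_loss = (q(s, s_next) - target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Model: ascend the critic by proposing high-value next states.
    model_loss = -q(s, model(s)).mean()
    model_opt.zero_grad(); model_loss.backward(); model_opt.step()

# Example usage with random placeholder transitions (batch of 32).
s = torch.randn(32, state_dim)
s_next = torch.randn(32, state_dim)
r = torch.randn(32, 1)
done = torch.zeros(32, 1)
update(s, s_next, r, done)
```

In this sketch the update mirrors DDPG's actor-critic split: the critic regresses toward a TD target, while the model receives gradients through the critic, so values never depend on actions directly.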
Venue
In Proceedings of the 37th International Conference on Machine Learning (ICML 2020).
BibTeX
@article{edwards2020estimating,
title={Estimating Q(s,s') with Deep Deterministic Dynamics Gradients},
author={Ashley D. Edwards and Himanshu Sahni and Rosanne Liu and Jane Hung and Ankit Jain and Rui Wang and Adrien Ecoffet and Thomas Miconi and Charles Isbell and Jason Yosinski},
year={2020},
eprint={2002.09505},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Date
February 2020