Section outline

    1. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, Free online version
    2. David Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, Free online version
    3. CJCH Watkins, P Dayan, Q-learning, Machine Learning, 1992, PDF
    4. Mnih et al, Human-level control through deep reinforcement learning, Nature, 2015, PDF
    5. van Hasselt et al, Deep Reinforcement Learning with Double Q-learning, AAAI, 2016, PDF
    6. Wang et al, Dueling Network Architectures for Deep Reinforcement Learning, ICML, 2016, PDF
    7. Schaul et al, Prioritized Experience Replay, ICLR, 2016, PDF
    8. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 1992, PDF
    9. Sutton et al, Policy gradient methods for reinforcement learning with function approximation, NIPS, 2000, PDF
    10. Peters & Schaal, Reinforcement learning of motor skills with policy gradients, Neural Networks, 2008, PDF
    11. Mnih et al, Asynchronous methods for deep reinforcement learning, ICML, 2016, PDF
    12. Lillicrap et al, Continuous control with deep reinforcement learning, ICLR, 2016, PDF
    13. Gu et al, Q-Prop: sample-efficient policy gradient with an off-policy critic, ICLR, 2017, PDF
    14. Schulman et al, Trust Region Policy Optimization, ICML, 2015, PDF
    15. Duan et al, Benchmarking Deep Reinforcement Learning for Continuous Control, ICML, 2016, PDF
    16. Kocsis and Szepesvari, Bandit based Monte-Carlo planning, ECML, 2006, PDF
    17. Gelly and Silver, Combining Online and Offline Knowledge in UCT, ICML, 2007, PDF
    18. Silver et al, Mastering the game of Go with deep neural networks and tree search, Nature, 2016, Online
    19. Silver et al, Mastering the game of Go without human knowledge, Nature, 2017, Online
    20. Auer et al, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 2002, PDF
    21. Dudik et al, Efficient Optimal Learning for Contextual Bandits, UAI, 2011, PDF
    22. Agarwal et al, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, ICML, 2014, PDF
    23. Lihong Li, Generalized Thompson Sampling for Contextual Bandits, PDF
    24. Russo et al, A tutorial on Thompson Sampling, PDF
    25. Stadie et al, Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models, 2016, PDF
    26. Pomerleau, ALVINN: An Autonomous Land Vehicle in a Neural Network, NIPS 1989, PDF
    27. Bojarski et al, End to End Learning for Self-Driving Cars, 2016, PDF
    28. Ross et al, A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, AISTATS 2011, PDF
    29. Rusu et al, Policy distillation, ICLR 2016, PDF
    30. Levine and Koltun, Guided policy search, ICML 2013, PDF
    31. Reddy et al, SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards, ICLR 2020, PDF
    32. Ho and Ermon, Generative Adversarial Imitation Learning, NIPS 2016, PDF
    33. Ng and Russell, Algorithms for Inverse Reinforcement Learning, ICML 2000, PDF
    34. Wulfmeier et al, Maximum Entropy Deep Inverse Reinforcement Learning, 2015, PDF
    35. Finn et al, Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, ICML 2016, PDF
    36. Hausman et al, Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets, NIPS 2017, PDF