16/11/2018

Speaker(s): Prof. Craig Boutilier

Reinforcement Learning in Recommender Systems: Some Foundational and Practical Questions

I'll provide an overview of recent work on several foundational questions in reinforcement learning (Q-learning specifically) and Markov decision processes, motivated in part by practical issues that arise when applying RL to online, user-facing applications such as recommender systems.

I'll first describe the notion of delusion in MDPs and RL, a problem that arises when using value- or Q-function approximation. Since any approximation limits the realizable policy class, methods like Q-learning and value iteration run the risk of fitting value estimates to inconsistent (i.e., unrealizable) labels, leading to a variety of pathological phenomena. We develop a policy-class consistent backup that fully resolves this problem and is guaranteed to find the optimal policy in the realizable policy class (and to do so in polynomial time under some assumptions).

I'll next outline an approach to slate decomposition to manage Q-learning in settings where a set (or slate) of items is presented to a user, as is often the case in recommender systems. Under certain assumptions about user choice behavior, the Q-value of a slate can be written as a specific function of item-level Q-values. As a consequence, we avoid the need to explore and generalize over the combinatorial space of slates. Experiments in both small simulated domains and a large-scale live recommender system demonstrate the efficacy and robustness of the approach.

Time permitting, I'll conclude with a few remarks about the role that reinforcement learning has to play in user-centric recommender systems and some of the interesting research challenges that face RL in this setting.
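To make the slate decomposition idea concrete, here is a minimal sketch of one form such a decomposition could take. It is an illustration under stated assumptions, not necessarily the formulation presented in the talk: it assumes the user picks at most one item from the slate according to a conditional-logit (softmax) choice model with a "no click" option, and that a no-click contributes zero value. The function names, the `scores` input, and the `no_click_score` parameter are all hypothetical.

```python
import math


def choice_probabilities(scores, no_click_score=0.0):
    """Probability that the user selects each slate item.

    Assumption: a conditional-logit choice model over the slate items
    plus a "no click" alternative with score `no_click_score`.
    """
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max([*scores, no_click_score])
    weights = [math.exp(s - m) for s in scores]
    null_weight = math.exp(no_click_score - m)
    total = sum(weights) + null_weight
    return [w / total for w in weights]


def slate_q_value(scores, item_q_values, no_click_score=0.0):
    """Slate-level Q-value as a choice-weighted sum of item-level Q-values.

    Assumption: Q(s, slate) = sum_i P(i | s, slate) * Q(s, i), where a
    no-click outcome is taken to contribute zero value in this sketch.
    """
    probs = choice_probabilities(scores, no_click_score)
    return sum(p * q for p, q in zip(probs, item_q_values))


# Two equally attractive items and an equally attractive no-click option:
# each outcome has probability 1/3, so the slate value is (3 + 6) / 3.
example_value = slate_q_value([0.0, 0.0], [3.0, 6.0])
```

Under these assumptions, only item-level Q-values need to be learned; the value of any candidate slate is then computed from them, rather than learned separately for each of the combinatorially many slates.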

cedric.herpson (at) lip6.fr