PhD student
Team : MLIA
Arrival date : 10/01/2021
    Sorbonne Université - LIP6
    Boîte courrier 169
    Couloir 25-26, Étage 5, Bureau 510
    4 place Jussieu
    75252 PARIS CEDEX 05

Tel: +33 1 44 27 47 23, Nicolas.Castanet (at)

Supervision : Sylvain LAMPRIER

Co-supervision : Olivier SIGAUD

Automatic Curriculum Methods for Reinforcement Learning in the Sparse Reward Setting

A major difficulty in reinforcement learning is the exploration/exploitation trade-off, which must be managed to navigate a very large search space efficiently and find policies suited to a given task. Among the proposals for efficiently training agents (or groups of agents) to solve complex tasks, curriculum learning for reinforcement learning decomposes the problem into simpler sub-problems and defines a learning path adapted to the agent's abilities. The hope is that skills acquired on simpler tasks will accelerate learning on the final task.

Very often, curriculum learning requires expert knowledge of the target problem to define the succession of subtasks to consider. In this context, techniques such as reward shaping can transfer knowledge learned on one problem to another without biasing the optimal solution of the target problem. Some works propose to learn the ordering of subtasks, but the decomposition itself is usually done manually. To overcome this, other approaches gradually increase the agent's capacities rather than altering the environment; however, this remains limited to specific environments, with agent architectures specified from expert knowledge.

The idea of this thesis is to study automatic curriculum methods, which allow progressive learning despite the constraints of the environment by learning intrinsic reward functions that guide the agent's evolution towards the target goals. In this framework, a first family of effective approaches pre-learns to explore the world by treating states reached during sampled trajectories as goals to be reached, from which experience can be drawn; this directly addresses the problem of sparse environment rewards.
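The idea of reusing reached states as goals is the core of Hindsight Experience Replay (HER), one of the methods studied below. A minimal sketch of the relabelling step, on a toy 1-D chain (the `Transition` structure and the "future" sampling strategy are simplified assumptions, not the full algorithm):

```python
import random
from collections import namedtuple

# One transition in a goal-conditioned MDP: the goal is part of the input.
Transition = namedtuple("Transition", "state action next_state goal reward")

def her_relabel(episode, k=4):
    """HER-style relabelling (sketch). For each transition, sample k states
    actually reached later in the episode and pretend they were the goal
    all along, so even a failed episode yields rewarded transitions."""
    relabelled = []
    for i, t in enumerate(episode):
        future = episode[i:]
        for _ in range(k):
            # "Future" strategy: pick an achieved state as a virtual goal.
            virtual_goal = random.choice(future).next_state
            reward = 1.0 if t.next_state == virtual_goal else 0.0
            relabelled.append(t._replace(goal=virtual_goal, reward=reward))
    return relabelled

# Toy episode that never reaches the "real" goal state 10.
episode = [Transition(s, +1, s + 1, goal=10, reward=0.0) for s in range(5)]
extra = her_relabel(episode)
print(len(extra))                        # 20: 5 transitions * k=4 relabels
print(any(t.reward > 0 for t in extra))  # True: some virtual goals succeed
```

The point is that the original episode carries no reward signal at all, while the relabelled transitions do, which is what makes learning possible under sparse rewards.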
Another attractive idea is to rely on adversarial architectures for pre-training in the environment, where two similar agents with opposing objectives confront each other: a teacher agent tries to propose to a student agent problems that the teacher can solve itself but on which the student still struggles. The two agents progress together, so that the complexity of the proposed tasks increases as learning proceeds. The idea, inspired by human pedagogy, is to track the student's "zone of proximal development". In the same vein, the GoalGAN approach learns a generator of suitable goals through an adversarial discriminator that classifies goals according to their usefulness for the learning agent. Other approaches encourage curiosity by learning intrinsic reward functions that promote exploration.

Description

The objective of the thesis will be to compare several of these approaches in different application environments, in order to highlight their advantages and disadvantages. On the methods side, based on a review of the field, we will study in particular HER, ICM-A3C, asymmetric self-play and possibly its hierarchical version, GoalGAN, and approaches based on learning progress. On the environments side, we will begin with in-depth studies on very simple low-dimensional environments (e.g. MountainCar or CartPole). Depending on the results obtained, we will extend the approaches to more complex environments, chosen for the properties we want to highlight: robotic control problems such as Mujoco Fetch, discrete-state problems where the agent interacts with objects, such as MazeBase, visual environments such as VizDoom, or collaborative multi-agent navigation, for example the MADDPG environments.
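The goal-selection criterion underlying GoalGAN, mentioned above, can be sketched independently of the GAN itself: a goal is a useful training target when the agent's empirical success rate on it is neither near 0 (out of reach) nor near 1 (already mastered). In the full method these labels supervise an adversarial generator/discriminator pair; here only the labelling step is shown, with made-up goal names and success rates for illustration:

```python
def label_goid(success_rates, p_min=0.1, p_max=0.9):
    """Keep Goals Of Intermediate Difficulty (GOID): goals whose empirical
    success rate lies in the band [p_min, p_max]."""
    return [g for g, p in success_rates.items() if p_min <= p <= p_max]

# Hypothetical per-goal success rates estimated from recent rollouts.
success_rates = {
    "reach_(1,1)": 0.95,  # mastered -> uninformative
    "reach_(3,4)": 0.40,  # intermediate -> train on this
    "reach_(9,9)": 0.02,  # out of reach -> wasted effort
}
print(label_goid(success_rates))  # ['reach_(3,4)']
```

This band of intermediate difficulty is a simple operationalisation of the "zone of proximal development" discussed above.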
In a second phase, we will focus on model-based approaches, which have strong potential in terms of sample efficiency and which have not yet been explored in sparse-reward settings for the purposes of exploration and automatic curriculum. One avenue we wish to pursue is to learn simplified dynamics, adapted to the agent's current skills, which would gradually approach the real dynamics observed in the target environment. Other possibilities include hierarchical RL and the learning of controllers that define adapted sub-goals.