Project overview

Motivation

Self-driving cars already perform well in many situations. They know the
rules of the road, have a basic ability to recognize and understand the other actors on the road, and can drive in a way broadly similar to human drivers. However, they are not yet ready for the safety driver to be removed in all but the most restricted domains. Moreover, any deployment to a new city or country requires tremendous engineering effort, which somewhat defeats the purpose of having an autonomous car.

Our three main motivations/goals are as follows:

  • End-to-end system: An end-to-end system is easier to redeploy and fine-tune.
  • Exceeding expert performance: Imitation learning from data can only take us so far. If we know the underlying reward function of the world, we can beat the expert and achieve better-than-human performance.
  • Verifiable performance: Any algorithm we develop can be rigorously tested in simulation to ensure that it works in rare circumstances.

Problem Statement

Our goal is to train an autonomous driving agent in the CARLA [1] simulator using reinforcement learning. We want to develop an off-policy algorithm that can be run on Argo's self-driving logs, and we aim for sample efficiency in our algorithm.

As inputs to our network, we want to use only RGB images and waypoints, and as outputs we want to learn low-level control of the vehicle. We build upon the work of Tanmay Agarwal and Hitesh Arora, showcased on their website.
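
As a minimal sketch of this interface (the names, shapes, and dataclasses below are illustrative assumptions, not our exact implementation), the observations and actions could be structured as follows:

    # Hypothetical sketch of the agent's interface; names and shapes are illustrative.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Observation:
        rgb: np.ndarray        # (H, W, 3) front-camera image rendered by CARLA
        waypoints: np.ndarray  # (K, 2) next K route waypoints in the ego frame

    @dataclass
    class Action:
        steer: float     # in [-1, 1]
        throttle: float  # in [0, 1]
        brake: float     # in [0, 1]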

Approach

We propose using a low-dimensional, engineered state space and reward function to train an RL “expert” that can drive well using privileged information. Note that we could also use imitation learning to train the expert and inverse reinforcement learning to obtain the reward formulation. Next, we transfer the knowledge of this expert agent to an RGB image-based policy, as described in Learning by Cheating [2] by Chen et al., to obtain a feature extractor. Finally, we freeze the convolutional layers, reset the fully connected layers, and train on the original reward function using a modified n-step soft actor-critic algorithm.
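
As a rough sketch of this final stage, assuming a PyTorch implementation (the layer sizes, discount factor, entropy coefficient, and function names below are illustrative assumptions, not our exact code), freezing the distilled convolutional feature extractor, re-initializing the fully connected head, and forming an n-step soft actor-critic target could look like this:

    import torch
    import torch.nn as nn

    # Illustrative policy network: frozen conv feature extractor + re-initialized FC head.
    class Actor(nn.Module):
        def __init__(self, conv: nn.Module, feat_dim: int, act_dim: int):
            super().__init__()
            self.conv = conv
            for p in self.conv.parameters():       # freeze the distilled convolutional layers
                p.requires_grad = False
            self.head = nn.Sequential(             # freshly initialized fully connected layers
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, 2 * act_dim),       # mean and log-std of a Gaussian policy
            )

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():                  # no gradients flow into the frozen extractor
                feat = self.conv(obs)
            return self.head(feat)

    # Illustrative n-step soft actor-critic target for the critic update.
    # rewards, dones: (B, n) float tensors; q_next, logp_next: (B,) at the state n steps ahead.
    def n_step_sac_target(rewards, dones, q_next, logp_next, gamma=0.99, alpha=0.2):
        B, n = rewards.shape
        # alive[:, t] = 1 while no terminal transition occurred before step t
        alive = torch.cat([torch.ones_like(rewards[:, :1]),
                           1.0 - dones[:, :-1]], dim=1).cumprod(dim=1)
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        n_step_return = (alive * discounts * rewards).sum(dim=1)
        alive_after = alive[:, -1] * (1.0 - dones[:, -1])  # 0 if the episode ended within n steps
        soft_value = q_next - alpha * logp_next            # soft value bootstrap with entropy bonus
        return n_step_return + (gamma ** n) * alive_after * soft_value

In this sketch, the conv weights passed into the actor are the ones obtained from the Learning by Cheating distillation step, and only the fully connected head is updated during the final RL training.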

Technical details

For details on our approach, please read the Spring 2020 and Fall 2020 pages. Please visit the Video Demonstration page for our qualitative results. You can also watch our final presentation.