
Learning To Drive

Objective And Introduction

The goal of this project is to learn environment-agnostic policies that drive a vehicle using only monocular images as observations. To that end, the project explores Proximal Policy Optimization (PPO), a widely used deep reinforcement learning algorithm, for learning generalizable deep neural network policies that take encodings from a Variational Auto-Encoder (VAE) as observations.





Fig. 1. (a) The CARLA simulator, used to simulate the car and the urban environment. (b) Top-down view of the same environment with semantic segmentation.


Deep Reinforcement Learning

  • A paradigm for learning policies parameterized by neural networks. The policy maximizes reward by making actions that lead to higher accumulated reward more probable.

  • Given the state, the agent takes an action in the environment. Based on the action, the environment transitions the agent to the next state with some transition probability.

  • Based on the state the agent ends up in, it receives a reward.


Fig.2. The process flow-diagram for an RL agent.

  • The trajectory τ is the sequence of states and actions that the agent experiences during an episode. It is shown by equation (1).

  • P(τ) is the probability that a given trajectory will occur. It is given by equation (2).

  • The expected total reward over trajectories is the objective that needs to be maximized.

  • Since the objective needs to be maximized, the weights of the policy network are shifted in the direction of the gradient of the objective w.r.t. the weights. Equation (4) shows this gradient.

  • Equation (4) can be simplified to equation (5). This update is intuitive: it shifts the weights in the direction that makes actions leading to higher rewards more probable.

  • A more intuitive update multiplies the probability of each action by the expected sum of rewards accumulated after that particular action. This is shown in equation (6). This expected value of accumulated rewards, given the state and action, is called the Q function Q(s, a), as shown in equation (7).

  • But this gradient estimate has high variance, because some actions might result in a high Q value and some in a low Q value.

  • It can be shown that subtracting a baseline function that does not depend on the action reduces the noise while keeping the gradient estimates unbiased.

  • Hence, we can subtract the state's value function V(s) from the Q function and thus reduce the variance while keeping the estimates unbiased. This difference, Q(s, a) - V(s), is called the advantage function A(s, a). This is shown in equations (8) and (9).
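The relationship between rewards, the Q function and the advantage function can be made concrete with a small sketch. It estimates Q(s, a) from a single episode as the sum of rewards collected from that step onward, then subtracts a value baseline. The discount factor `gamma` and the baseline numbers are assumptions chosen for illustration; they are not taken from the project.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Sum of rewards collected from each step onward (one-sample Q estimate).

    gamma is an assumed discount factor; 1.0 means no discounting.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already computed for the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards, values, gamma=1.0):
    """Estimate A(s_t, a_t) = Q(s_t, a_t) - V(s_t) for every step of an episode."""
    q_estimates = rewards_to_go(rewards, gamma)
    return [q - v for q, v in zip(q_estimates, values)]


# Toy episode: +1 for three steps, then a -100 collision penalty.
episode_rewards = [1.0, 1.0, 1.0, -100.0]
# Hypothetical value-function predictions V(s_t), made up for this example.
baseline_values = [-90.0, -95.0, -98.0, -100.0]

print(rewards_to_go(episode_rewards))                # [-97.0, -98.0, -99.0, -100.0]
print(advantages(episode_rewards, baseline_values))  # [-7.0, -3.0, -1.0, 0.0]
```

The raw returns are all large and negative, while the advantages are small and centered near zero, which is the variance reduction the baseline is meant to provide.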





Proximal Policy Optimization

  • It might seem intuitive to shift the weights directly along the gradient in equation (9). This is the vanilla policy gradient update.

  • But doing that leads to unstable policies, because the gradient is estimated from a batch of sampled trajectories and every update changes the policy that collected them. Early in training these trajectories can differ a lot from one update to the next, so a large step based on one batch can move the policy too far.

  • Proximal Policy Optimization overcomes this issue by clipping the objective, which constrains each update so that the new policy does not move too far from the old policy.
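A minimal sketch of the clipped surrogate objective, assuming the standard PPO formulation: the probability ratio between the new and old policy is clipped to the interval [1 - ε, 1 + ε], and the smaller of the clipped and unclipped terms is kept. The clip range of 0.2 is a commonly used default and an assumption here, since the value used in this project is not stated.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of (state, action) samples.

    clip_eps is an assumed clip range, not a value taken from the project.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# A good action (positive advantage) whose probability grew by 50%:
# the objective is capped as if the ratio were only 1 + clip_eps.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
print(round(capped, 6))  # 1.2
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective stops improving, so the gradient no longer pushes the policy further from the one that collected the data.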




Variational Auto-Encoder

  • Images are very rich representations. Training CNNs directly with RL algorithms is challenging because there are a lot of parameters to optimize and the gradient estimates are noisy.

  • A Variational Auto-Encoder, shown in Fig. 3, reduces the dimensions of the image. This encoding is the observation for the policy network.


Fig. 3. The Variational Auto-Encoder architecture. The encoder predicts the mean and variance of the image encodings. This latent-space representation is used as the observation for the policy network.





  • We don't want the network to just learn a one-to-one mapping between the images and the encoding space. So the variational auto-encoder is trained with a loss that is a linear combination of the KL divergence between the encoding distribution and a unit normal (equation 11) and an L1 loss between the reconstructed and the real image (equation 12).
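The two loss terms can be sketched as follows, assuming a diagonal Gaussian encoder for which the KL divergence to a unit normal has a closed form. The encoder is assumed to output the log-variance rather than the variance, images are treated as flat lists of pixel values, and the weight `kl_weight` is a hypothetical name for the coefficient of the linear combination; none of these details are stated in the project.

```python
import math


def kl_to_unit_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mean, log_var)
    )


def l1_reconstruction(image, reconstruction):
    """Mean absolute difference between the real and the reconstructed image."""
    return sum(abs(x - y) for x, y in zip(image, reconstruction)) / len(image)


def vae_loss(image, reconstruction, mean, log_var, kl_weight=1.0):
    """Linear combination of the L1 reconstruction loss and the KL term.

    kl_weight is an assumed coefficient; the project does not state its value.
    """
    return l1_reconstruction(image, reconstruction) + kl_weight * kl_to_unit_normal(
        mean, log_var
    )


# An encoding that already matches the unit normal contributes no KL penalty.
print(kl_to_unit_normal([0.0, 0.0], [0.0, 0.0]) == 0.0)  # True
```

The KL term pulls every encoding towards the same unit normal, which is what discourages the network from memorizing a separate code per image.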

Proposed Architecture

  • The gradient estimates in policy gradient algorithms are noisy, so it is hard to train big networks with a lot of parameters.

  • To mitigate this issue, we use a variational auto-encoder that encodes a 256x256 semantically segmented image of the front view into a 512-dimensional encoding.

  • This encoding serves as the observation for the policy network.

  • The policy network is a fully connected network with 2 hidden layers of 64 neurons each. Its outputs are the car's steer angle and throttle value.
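To give a sense of how small the policy network is, the following sketch builds a 512 → 64 → 64 → 2 fully connected network in plain Python and counts its parameters. The layer sizes come from the description above. The tanh activations, the weight initialization and the squashing of both outputs to [-1, 1] are assumptions made for illustration, and the sketch returns a deterministic action rather than sampling from a distribution as PPO would during training.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one dense layer as (weights, biases) with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation):
    """Apply one dense layer followed by an element-wise activation."""
    weights, biases = layer
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def make_policy(obs_dim=512, hidden=64, act_dim=2, seed=0):
    """Build the three layers of the 512 -> 64 -> 64 -> 2 network."""
    rng = random.Random(seed)
    return [
        make_layer(obs_dim, hidden, rng),
        make_layer(hidden, hidden, rng),
        make_layer(hidden, act_dim, rng),
    ]


def policy_forward(policy, encoding):
    """Map a VAE encoding to (steer, throttle); tanh output squashing is an assumption."""
    h = dense(encoding, policy[0], math.tanh)
    h = dense(h, policy[1], math.tanh)
    steer, throttle = dense(h, policy[2], math.tanh)
    return steer, throttle


def count_parameters(policy):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in policy)


policy = make_policy()
print(count_parameters(policy))  # 37122
```

With roughly 37 thousand parameters, the policy is far smaller than a convolutional network operating on raw 256x256 images would be, which is the point of feeding it the VAE encoding.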

Fig. 4. The VAE encodes the images into a 512-dimensional encoding, which is used as the observation for the policy network that outputs steer and throttle values. The top-right image is the reconstructed image from the auto-encoder.

  • Algorithm 1 shows the procedure for training a VAE alongside the policy network using PPO.

  • It is very similar to the PPO algorithm, but pushes the latest image frame into a VAE buffer every time a step is taken.

  • This buffer is randomly sampled to train the VAE for a certain number of steps in the same callback that runs the policy gradient update on the policy network.


Algorithm 1. This is the algorithm for training a Neural-Network agent with a Variational Auto-Encoder, using PPO.
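The buffer mechanics described above can be sketched as follows. The class and function names, the buffer capacity, the batch size, the number of VAE steps and the order of the two updates inside the callback are all hypothetical; `train_vae_step` and `update_policy` stand in for the real training routines, which are not reproduced here.

```python
import random
from collections import deque


class FrameBuffer:
    """Fixed-size buffer of image frames; the oldest frames are dropped first."""

    def __init__(self, capacity=10000, seed=None):
        self.frames = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, frame):
        """Store the latest frame; called once per environment step."""
        self.frames.append(frame)

    def sample(self, batch_size):
        """Draw a random batch without replacement (smaller if the buffer is short)."""
        k = min(batch_size, len(self.frames))
        return self.rng.sample(list(self.frames), k)

    def __len__(self):
        return len(self.frames)


def on_rollout_end(buffer, train_vae_step, update_policy, vae_steps=10, batch_size=32):
    """Callback run after each rollout: train the VAE, then update the policy.

    train_vae_step and update_policy are placeholders supplied by the caller.
    """
    for _ in range(vae_steps):
        train_vae_step(buffer.sample(batch_size))
    update_policy()


# Usage with stand-in training functions that only record what they were given.
buffer = FrameBuffer(capacity=3, seed=0)
for frame_id in range(5):
    buffer.push(frame_id)  # frames 0 and 1 are evicted once capacity is reached

batches = []
on_rollout_end(buffer, batches.append, lambda: None, vae_steps=2, batch_size=2)
print(len(buffer), len(batches))  # 3 2
```

Sampling at random from the buffer, rather than training on the most recent frames only, keeps the VAE from over-fitting to whatever part of the environment the agent visited last.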

Results - Simulation

Video 1. This video shows the trained agent driving through an urban environment. Top-left is the input image to the VAE and top-right is the reconstruction from the VAE. The observations for the agent are the encodings from the VAE.

  • This video shows the car driving along the lane after the training is complete.

  • The agent receives a +1 reward for every time step and a -100 reward for colliding.

  • Each collision causes the agent to reset and restart the episode.

  • The overall reward is a linear combination of collision penalty, staying-alive reward, steer penalty and lane-invasion penalty.

  • The ultimate goal of this project is to learn end-to-end policies for vision-based navigation and deploy them in the real world, in agricultural as well as urban environments.

  • To that end, the following video shows the algorithm learning in a Gazebo simulation of a real robot (Clearpath Jackal). The robot gets only the front-camera feed as its observations and learns to avoid the barricades.

  • The robot is reset with a -100 penalty whenever it strays from a nominal path, and it receives a +1 reward for every time step it stays alive.
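The reward for the simulated car, described earlier in this list as a linear combination of a staying-alive reward, a collision penalty, a steer penalty and a lane-invasion penalty, might look like the sketch below. The +1 and -100 terms come from the text. The steer weight, the lane-invasion penalty, the use of the absolute steer value and the choice to still grant the alive reward on a collision step are assumptions made for illustration.

```python
def step_reward(
    collided,
    steer,
    lane_invaded,
    alive_reward=1.0,         # +1 per time step (from the text)
    collision_penalty=-100.0, # -100 on collision (from the text)
    steer_weight=0.1,         # hypothetical weight on |steer|
    lane_penalty=-1.0,        # hypothetical lane-invasion penalty
):
    """Per-step reward as a linear combination of the four terms."""
    reward = alive_reward
    if collided:
        reward += collision_penalty
    # Penalize large steering commands to discourage swerving.
    reward -= steer_weight * abs(steer)
    if lane_invaded:
        reward += lane_penalty
    return reward


print(step_reward(collided=False, steer=0.0, lane_invaded=False))  # 1.0
print(step_reward(collided=True, steer=0.0, lane_invaded=False))   # -99.0
```

Because the collision penalty is two orders of magnitude larger than the per-step reward, the agent is pushed to prioritize survival, with the smaller terms shaping how smoothly it drives.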

Video 2. This video shows the agent navigating through a ROS-based environment. The policy learns to control a Jackal robot whose only observations are monocular images. The robot still needs a reset whenever it is about to crash, which is not possible in the real world.

  • But learning directly on the real robot can be dangerous, because the policy is randomly initialized at first. So, going a step further, I implemented an optimal controller that resets the robot by bringing it to the closest waypoint whenever it strays from the trajectory.

  • This is seen in the video below. The controller has kicked in when the robot starts moving smoothly towards the next goal. As soon as the robot reaches the goal, the policy starts acting again, which is visible as the robot starts taking noisy actions.

  • The next step is to implement the algorithm on hardware.
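The hand-over between the learned policy and the reset controller can be sketched as a small state machine. This is only an illustration of the switching logic: the class name, the 2D positions, the deviation threshold and the goal tolerance are hypothetical, deviation from the path is approximated by the distance to the nearest waypoint, and the optimal controller itself is not shown.

```python
import math


def closest_waypoint(position, waypoints):
    """Return the waypoint nearest to the robot's current (x, y) position."""
    return min(waypoints, key=lambda wp: math.dist(position, wp))


class ResetSupervisor:
    """Decides, at every step, whether the policy or the reset controller acts."""

    def __init__(self, waypoints, max_deviation=1.0, goal_tolerance=0.2):
        self.waypoints = waypoints
        self.max_deviation = max_deviation    # assumed threshold for "strayed"
        self.goal_tolerance = goal_tolerance  # assumed threshold for "reached"
        self.resetting = False
        self.target = None

    def update(self, position):
        """Return "controller" while resetting, otherwise "policy"."""
        if not self.resetting:
            nearest = closest_waypoint(position, self.waypoints)
            if math.dist(position, nearest) > self.max_deviation:
                # The robot strayed: hand control to the reset controller.
                self.resetting = True
                self.target = nearest
        elif math.dist(position, self.target) <= self.goal_tolerance:
            # The waypoint was reached: hand control back to the policy.
            self.resetting = False
            self.target = None
        return "controller" if self.resetting else "policy"


supervisor = ResetSupervisor(waypoints=[(0.0, 0.0), (5.0, 0.0)])
for pos in [(0.0, 0.5), (0.0, 2.0), (0.0, 1.0), (0.0, 0.1)]:
    print(pos, supervisor.update(pos))
```

Keeping the reset target fixed until it is reached avoids the controller and the policy fighting over control near the deviation threshold.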

Video 3. This video shows the trained agent controlling a Jackal robot with image observations. An optimal controller kicks in whenever the robot strays from the nominal path, eliminating the need for a manual or full reset. But this requires the controller to have full state observations, i.e. the location and velocity of the robot. The next step is to implement this on hardware.

Hardware Implementation - Indoors

  • This video shows the robot navigating indoor environments.

  • The robot gets a +1 reward for every time step and a -100 reward every time it is about to crash, at which point the human takes over control and resets the robot.

  • The next step is to remove the human expert by using an optimal controller that has access to the full state for resetting the robot.



Video 4. (a) This video shows the robot navigating through a hallway, making a turn. (b) This video shows the corresponding input images and the reconstruction of the VAE.

  • Videos 5 (a) and (b) show the robot navigating through tight corners and avoiding obstacles in different environments.



Video 5. (a) This video shows the robot navigating tight spaces. (b) This video shows the robot actively avoiding obstacles.

Hardware Implementation - Outdoors

  • As the first experiment, we trained the robot on a 1300-foot-long jogging track, shown in Fig. 5.

  • Since lighting conditions vary outdoors, we train a semantic segmentation network to segment the input images into traversable vs non-traversable regions. This is shown in Fig. 6.

  • We compare the performance of the robot with RGB input images and semantically segmented images in Table 1.


Fig. 5. The outdoor jogging track where the robot was trained and deployed.





Fig. 6. Semantic segmentation is used to make the input images more consistent across scenes. (a) shows the RGB input image, (b) shows the output of the semantic segmentation network and (c) shows the semantic segmentation overlaid on the input image as a sanity check.

Video 6. This video shows the robot navigating on a jogging track.


Table 1 Results for training on a jogging track.

Hardware Implementation - Vineyards

  • The robot was trained and deployed in a vineyard in Erie, NY.

  • Fig. 7 shows the agent's reward curves for semantically segmented vs RGB input images. As expected, the semantically segmented images do better than the RGB images.

  • Table 2 shows the results for both the experiments.

  • Video 7 shows the robot navigating in the field.

  • Video 8 shows the input image, the semantic segmentation and the VAE reconstruction.


Fig. 7 Reward curves for the agent.


Table 2. Results for training in a vineyard.

Video 7. Robot navigating in a vineyard.

Video 8. Left: the input images. Bottom right: semantic segmentation. Top right: the VAE's reconstruction.

Hardware Implementation - Hops Plantation

  • The robot was trained and deployed in a Hops plantation in Erie, NY.

  • Fig. 8 shows the agent's reward curves for RGB input images. 

  • Table 3 shows the results for both the experiments.

  • Video 9 shows the robot navigating in the field.

  • Video 10 shows the input image and the VAE reconstruction.

  • The robot was automatically reset every time it was about to crash, using an MPC running on-board.


Fig. 8 Reward curve for the agent.


Table 3. Results for training in a hops plantation.

Video 9. Robot navigating in a Hops plantation.

Video 10. Left: the input images. Right: the VAE's reconstruction.


  • In this project we demonstrated the use of deep RL for vision-based navigation. We applied the same algorithm from simulated urban environments all the way to agricultural fields.

  • We tested the algorithm in different outdoor settings to demonstrate generalizability.

  • We demonstrated the use of semantically segmented images for more stable learning, as the inputs become more consistent.

  • We showed that RL can be implemented to train agents from scratch on real robots in novel environments, without any human supervision.
