Deep RL notes (Very Rough)


(Very rough notes taken from Sergey Levine’s course video lectures)

Lecture 2 - Supervised Learning of Behavior

  1. Sequential Decision Problems
  2. Imitation learning
    • Does it work?
    • How to improve?
  3. Case studies and what’s missing?

From RL to Supervised Learning

  1. $s_t$ is the state at time $t$. It is by nature latent.
  2. $a_t$ is the action taken by the teacher.
  3. $o_t$ is the observation.

$\pi_\theta(a_t \mid o_t)$ is the policy.

Imitation Learning

Take the teacher’s (observation, action) pairs as the dataset. Learn the policy using the usual supervised methods.
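A minimal sketch of this idea in plain Python, assuming a hypothetical 1-D task where the teacher's action is a linear function of the observation. The names `expert_action` and `fit_linear_policy`, the toy data, and the linear policy class are all illustrative assumptions, not something from the lecture.

```python
import random

def expert_action(obs):
    # Hypothetical teacher: steer back toward zero, proportional to the offset.
    return -0.5 * obs

def fit_linear_policy(dataset):
    # Least-squares fit of a = w * o (no bias term) on (observation, action) pairs.
    num = sum(o * a for o, a in dataset)
    den = sum(o * o for o, _ in dataset)
    return num / den

rng = random.Random(0)
observations = [rng.uniform(-1.0, 1.0) for _ in range(100)]
dataset = [(o, expert_action(o)) for o in observations]  # teacher-labelled pairs

w = fit_linear_policy(dataset)

def policy(obs):
    # Learned policy: plain supervised regression from observation to action.
    return w * obs
```

Here the fit recovers the teacher's gain because the toy teacher is exactly linear. With a real teacher and a neural network policy, the same structure applies, only with a different model and loss.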

In theory it doesn’t really work. The state trajectory depends on the actions taken, so small errors by the learned policy push it off the training trajectory into states the data never covered, where it picks the wrong actions and the errors compound. Here the trajectory is the sequence of states (NOT observations) that results from the actions.

The best example of this working well in practice is NVIDIA’s self-driving car training. They used two extra cameras, mounted looking 45 degrees left and right, and used those images as an extra source of supervision. This keeps the driven trajectory from deviating much from the training trajectory.

DAgger : Dataset Aggregation

Collect expert actions for the observations that occur when the learned policy is run.

  1. Train the policy network on data from humans.
  2. Run the policy and collect the observations that occur.
  3. Get human labels (actions) for this new set of observations.
  4. Create a new dataset that is the union of the old data and the newly labelled data, then repeat from step 1.

Provably bounded regret.
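The loop above can be sketched as follows. This is a toy version under stated assumptions: the human labeller is replaced by a hypothetical `expert_action` function, the policy is a single linear gain, and the dynamics (`s' = s + a + noise`) are made up for illustration.

```python
import random

def expert_action(obs):
    # Hypothetical expert labeller (stands in for the human).
    return -0.5 * obs

def fit_policy(dataset):
    # Least-squares fit of a = w * o on the aggregated (o, a) pairs.
    num = sum(o * a for o, a in dataset)
    den = sum(o * o for o, _ in dataset)
    return num / den if den > 0 else 0.0

def rollout(w, start, steps, rng):
    # Run the learned policy and record the observations it actually visits.
    obs, visited = start, []
    for _ in range(steps):
        visited.append(obs)
        obs = obs + w * obs + rng.gauss(0.0, 0.05)  # toy dynamics: s' = s + a + noise
    return visited

def dagger(iterations=5, seed=0):
    rng = random.Random(seed)
    # Initial human demonstrations.
    dataset = [(o, expert_action(o)) for o in (rng.uniform(-1, 1) for _ in range(20))]
    w = fit_policy(dataset)                                   # train on human data
    for _ in range(iterations):
        new_obs = rollout(w, rng.uniform(-1, 1), 20, rng)     # collect observations
        new_data = [(o, expert_action(o)) for o in new_obs]   # get expert labels
        dataset = dataset + new_data                          # aggregate datasets
        w = fit_policy(dataset)                               # retrain
    return w, len(dataset)

w, n = dagger()
```

The point of the sketch is the data flow: the observations come from the learned policy's own rollouts, while the labels always come from the expert.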


  1. Non-Markovian behavior. Actions depend on more than the current observation. Pass the whole observation history through the network (use RNNs/LSTMs to propagate information).
    • States are not always observable.
  2. Multimodal behavior.
    • Example : self-driving car has to avoid a tree in the middle of the road. Can go right or left, but NOT average of the two.
    • Discrete action spaces aren’t affected much by this. For continuous or high-dimensional discrete action spaces, a single prediction ends up averaging the actions taken, so a unimodal distribution will model the mean, which is bad.
    • Solutions :
      • Mixture of Gaussians : Fit multi-modal distributions. The number of modes is not always obvious.
      • Implicit density model : The output is unimodal BUT noise is fed in along with the input, and the network has to learn to map that noise to different modes. MUCH harder to train. (VAEs, GANs, Stein Variational Gradient Descent.)
      • Autoregressive discretization : Discretizing continuous actions works, but not in a large number of dimensions. The solution is to predict the action one dimension at a time. Sample a discretized action in the first dimension, then use it to predict the distribution over actions in the second dimension, and so on. For training, instead of sampling we use the true action. In summary, each dimension has its own network that predicts the distribution over that dimension’s action from the actions in all the dimensions before it.
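A rough sketch of autoregressive discretization for a 2-D action, with a big simplification: count tables stand in for the per-dimension networks. The bin count, the helper names, and the two-mode demonstration data (loosely "go left" versus "go right") are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

N_BINS = 4  # bins per action dimension over [-1, 1]

def to_bin(x):
    # Map a continuous value in [-1, 1] to a bin index in 0..N_BINS-1.
    return min(int((x + 1.0) / 2.0 * N_BINS), N_BINS - 1)

def fit(demos):
    # "Training": the table for dimension 2 is conditioned on the TRUE bin
    # of dimension 1, mirroring the use of true actions during training.
    first = Counter()
    second = defaultdict(Counter)
    for a1, a2 in demos:
        b1, b2 = to_bin(a1), to_bin(a2)
        first[b1] += 1
        second[b1][b2] += 1
    return first, second

def sample_from(counter, rng):
    # Draw a bin with probability proportional to its count.
    bins = list(counter.keys())
    weights = [counter[b] for b in bins]
    return rng.choices(bins, weights=weights)[0]

def sample_action(first, second, rng):
    b1 = sample_from(first, rng)       # sample dimension 1
    b2 = sample_from(second[b1], rng)  # then dimension 2 given the sampled b1
    return b1, b2

# Two modes in the demonstrations; their average (0, 0) never occurs.
demos = [(-0.8, -0.8)] * 50 + [(0.8, 0.8)] * 50
first, second = fit(demos)
rng = random.Random(0)
samples = {sample_action(first, second, rng) for _ in range(200)}
```

Because the second dimension is conditioned on the first, the sampler only ever produces the two demonstrated modes and never a mix of them, which is what independent per-dimension sampling could do.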

Why is imitation not enough?

  1. Large amounts of data must be labelled by humans, which is usually hard and sometimes downright impossible.
  2. Some kinds of actions are hard to provide supervision for.
  3. Humans also learn by themselves, not just by imitation.

Going beyond this

Cost/reward functions are required to define a learning task: minimizing cost (or maximizing reward) over the sequence of the agent’s actions.

  • Problems : Rewards are not easy to define. For example, if we need a robotic arm to grab an object, the only true reward is 1 when it reaches the object and 0 when it hasn’t. This is a sparse reward and doesn’t help.
  • A reward can be defined for imitation learning: it’s the log-probability of matching the expert’s action.
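One way to write the imitation reward down, assuming $\pi^\star$ denotes the expert's policy (this notation is an assumption, it is not defined in these notes):

$$ r(s, a) = \log p\big(a = \pi^\star(s) \mid s\big) $$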

Extras :

Stein Variational Gradient Descent (SVGD)

Bayesian inference needs fast variational approximations, and mean-field is not enough. So more complex distributions that are still quickly differentiable are needed. One idea is to invertibly transform simple distributions into complicated ones, and hopefully do this in such a way that the determinant of the Jacobian can be computed.

SVGD uses a simple transformation where $f = F'$, so learning each transformation amounts to a single gradient step. This simplifies the gradient step for the KL-divergence, and the transforms $F_k$ are learned in sequence of increasing $k$. The interesting thing is that the gradient is estimated from samples, and the transformation $F_k$ itself is applied by taking gradient steps on those samples.
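A small sketch of the commonly stated SVGD particle update, under assumptions that are not in these notes: a 1-D Gaussian target $N(2, 1)$, an RBF kernel with fixed bandwidth `h`, and a fixed step size. Each particle moves along a kernel-weighted average of the target's score plus a repulsive kernel-gradient term.

```python
import math
import random

def grad_log_p(x, mean=2.0):
    # Score of the assumed target, a unit-variance Gaussian N(mean, 1).
    return -(x - mean)

def rbf(x, y, h):
    # RBF kernel with bandwidth h.
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))

def svgd_step(particles, step=0.1, h=1.0):
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            # First term pulls particles toward high density; the second term
            # (derivative of k with respect to xj) pushes particles apart.
            phi += k * grad_log_p(xj) + (-(xj - xi) / (h * h)) * k
        updated.append(xi + step * phi / n)
    return updated

rng = random.Random(0)
particles = [rng.uniform(-1.0, 1.0) for _ in range(30)]
for _ in range(500):
    particles = svgd_step(particles)

mean = sum(particles) / len(particles)
```

The particles start clustered around zero and end up spread around the target mean. The repulsive term is what keeps them from all collapsing onto the mode.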

VAEs (Variational Autoencoders)

Take a standard autoencoder. There’s the usual generation loss (or reconstruction loss). In VAEs, the encoder takes an image as input BUT outputs the mean and covariance of a multivariate Gaussian. The decoder, on the other hand, takes a vector sampled from that Gaussian and tries to regenerate the original image. A new loss is added: the KL-divergence between the encoder’s latent distribution and the unit Gaussian. One possible intuition is that we attempt to divide the latent space into Gaussian balls, each capturing the essence of an image.
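A minimal sketch of the two loss terms, assuming a diagonal covariance parameterised by its log-variance and a squared-error reconstruction loss. Both are common choices but are assumptions here, and the function names are illustrative.

```python
import math
import random

def kl_to_unit_gaussian(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    # Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, 1).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def reconstruction_loss(x, x_hat):
    # Squared error as a simple stand-in for the generation loss.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def vae_loss(x, x_hat, mu, log_var):
    # Total loss: reconstruction term plus the KL regulariser.
    return reconstruction_loss(x, x_hat) + kl_to_unit_gaussian(mu, log_var)
```

The KL term is zero exactly when the encoder outputs the unit Gaussian (`mu = 0`, `log_var = 0`), and grows as the predicted mean moves away from the origin or the variance moves away from 1.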