To run all experiments and plot the figures for the report, run:
bash run_1.sh
bash run_2.sh
I chose the state-dependent value function as the baseline: because it does not depend on the action, subtracting it keeps the gradient estimator unbiased while reducing its variance, and it is convenient to use since SAC already trains a value network.
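As a rough, self-contained sketch of how a state-dependent baseline enters a REINFORCE-style policy loss (a toy one-dimensional Gaussian policy in TensorFlow 1.x, not the assignment's exact loss; `actions`, `q_val`, and `v_val` are hypothetical placeholders):

```python
import tensorflow as tf

# Hypothetical inputs: sampled actions, Q(s,a) estimates, and baseline V(s).
actions = tf.placeholder(tf.float32, [None])
q_val = tf.placeholder(tf.float32, [None])
v_val = tf.placeholder(tf.float32, [None])

# Toy Gaussian policy over a scalar action.
mean = tf.Variable(0.0)
log_std = tf.Variable(0.0)
policy = tf.distributions.Normal(loc=mean, scale=tf.exp(log_std))
log_pi = policy.log_prob(actions)

# Subtracting the state-dependent baseline V(s) leaves the expected gradient
# unchanged (E_{a~pi}[grad log pi(a|s)] = 0) but reduces its variance.
advantage = tf.stop_gradient(q_val - v_val)

# REINFORCE surrogate: its gradient w.r.t. the policy parameters estimates
# E[ grad log pi(a|s) * (Q(s,a) - V(s)) ]; negated because we minimize.
policy_loss = -tf.reduce_mean(log_pi * advantage)
```

In the maximum-entropy setting the learning signal additionally contains a log-probability term, but the role of the baseline is the same.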
(Copied from http://stillbreeze.github.io/REINFORCE-vs-Reparameterization-trick/)
| Properties | REINFORCE | Reparameterization |
|---|---|---|
| Differentiability requirements | Can work with a non-differentiable model | Needs a differentiable model |
| Gradient variance | High variance; needs variance reduction techniques | Low variance due to implicit modeling of dependencies |
| Type of distribution | Works for both discrete and continuous distributions | In the current form, only valid for continuous distributions |
| Family of distribution | Works for a large class of distributions of x | It should be possible to reparameterize x |
To summarize: reparameterization needs a differentiable model and is only valid for continuous distributions, whereas REINFORCE has neither restriction but pays for it with higher gradient variance.
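As a toy illustration of the variance difference (not part of the assignment code), here is how the two estimators differ for the scalar objective E[x^2] with x ~ N(mu, 1), written for TensorFlow 1.x:

```python
import tensorflow as tf

mu = tf.Variable(0.5)
sigma = 1.0
n_samples = 1000

# Score-function (REINFORCE) estimator of d/dmu E[x^2] with x ~ N(mu, sigma):
# the sample is treated as a constant (stop_gradient), so the gradient flows
# only through log p(x; mu), giving  E[ x^2 * d/dmu log p(x; mu) ].
dist = tf.distributions.Normal(loc=mu, scale=sigma)
x = tf.stop_gradient(dist.sample(n_samples))
reinforce_grad = tf.gradients(tf.reduce_mean(x ** 2 * dist.log_prob(x)), mu)[0]

# Reparameterization estimator: write x = mu + sigma * eps with eps ~ N(0, 1),
# so the gradient flows directly through the differentiable sample.
eps = tf.random_normal([n_samples])
reparam_grad = tf.gradients(tf.reduce_mean((mu + sigma * eps) ** 2), mu)[0]
```

Both estimates target the true gradient 2*mu = 1.0, but the REINFORCE estimate typically needs many more samples (or a baseline, as above) to reach the same accuracy.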
4. We can minimize the policy loss in Equation 9 using off-policy data. Why is this not the case for standard actor-critic methods based on policy gradients, which require on-policy data?
Standard actor-critic methods based on policy gradients estimate the gradient as an expectation over the state-action distribution induced by the current policy, so they need fresh on-policy samples for stability (or importance-weight corrections when reusing old data). The soft actor-critic policy loss in Equation 9 only requires states sampled from the replay buffer, with actions re-sampled from the current policy at those states, so both the value estimators and the policy can be trained entirely on off-policy data.
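Schematically (using notation loosely following the SAC paper, not the handout's exact equations), the key difference is which distribution the states are drawn from:

```latex
% On-policy policy gradient: both states and actions must come from the
% current policy's own trajectory distribution.
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{s \sim \rho^{\pi_\theta},\; a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]

% SAC policy loss: states come from the replay buffer D; only the actions
% are re-sampled from the current policy, so old data remains usable.
J_\pi(\theta)
  = \mathbb{E}_{s \sim \mathcal{D}}
    \left[ \mathrm{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s)
      \,\middle\|\, \frac{\exp\left(Q(s, \cdot)\right)}{Z(s)} \right) \right]
```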
The reparameterized gradient estimator has lower variance and therefore trains more stably.
Using two Q-functions and taking the minimum of their estimates (double Q-learning) mitigates the positive overestimation bias in the policy improvement step, and thus results in better performance.
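As a rough sketch of clipped double Q-learning (as used in TD3/SAC), with hypothetical tensor names and not the assignment's exact target equations:

```python
import tensorflow as tf

# Hypothetical inputs: rewards, termination flags, and the two critics'
# estimates at the next (or re-sampled) action.
rewards = tf.placeholder(tf.float32, [None])
done = tf.placeholder(tf.float32, [None])
q1_pi = tf.placeholder(tf.float32, [None])
q2_pi = tf.placeholder(tf.float32, [None])
gamma = 0.99

# Taking the minimum of two independently trained Q estimates counters the
# overestimation that a single maximizing critic tends to accumulate.
min_q = tf.minimum(q1_pi, q2_pi)
target = tf.stop_gradient(rewards + gamma * (1.0 - done) * min_q)
```

In SAC the minimum of the two soft Q-functions is used both for the value-function target and for improving the policy.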
Original code from Tuomas Haarnoja, Soroush Nasiriany, and Aurick Zhou for CS294-112 Fall 2018
Dependencies:
- Python 3.4.5
- Numpy version 1.15.2
- TensorFlow version 1.10.0
- tensorflow-probability version 0.4.0
- OpenAI Gym version 0.10.8
- MuJoCo version 1.50 and mujoco-py 1.50.1.59
- seaborn version 0.9.0
You will implement `sac.py` and `nn.py`.
See the HW5 PDF for further instructions.