Updated basic APG algorithm #476
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Looks great, just one nit.
@@ -131,7 +131,7 @@ def forward_log_det_jacobian(self, x):
 class NormalTanhDistribution(ParametricDistribution):
   """Normal distribution followed by tanh."""

-  def __init__(self, event_size, min_std=0.001):
+  def __init__(self, event_size, min_std=0.001, var_scale=1):
nit: add var_scale to Args section of docstring
Oh - it looks like you need to update brax/training/agents/apg/train_test.py - changes should hopefully be minimal - please review the failing test. Also, please do sign the CLA. Thank you!
Hi @erikfrey, I have updated the tests and they pass on my local setup. I've also fixed the nit. I've signed the CLA, and the Checks tab is saying that my signing went through. Please let me know if there's anything missing.
Almost there!
@@ -22,6 +22,9 @@
 from brax.training.agents.apg import networks as apg_networks
 from brax.training.agents.apg import train as apg
 import jax
+from jax import config
+config.update("jax_enable_x64", True)
these config changes are process-wide, so they break other tests that expect the default float width and precision. we run all the tests in a single process.
the envs in this test are simple enough that hopefully they don't need these config changes. can you try removing them, or otherwise tweak the tests so that jax_enable_x64 and the precision change are not needed?
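A minimal sketch of the process-wide behavior (illustrative only, not taken from the test file; the dtypes shown assume default JAX settings):

```python
import jax
import jax.numpy as jnp

# jax_enable_x64 is a global flag: once any test module flips it, every
# array created afterwards in the same process defaults to 64-bit.
print(jnp.zeros(1).dtype)                  # float32 under default settings
jax.config.update("jax_enable_x64", True)
print(jnp.zeros(1).dtype)                  # float64 for all later code in this process
```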
I removed the double precision toggle and the tests still run fine on my local setup. Let's see if this works :)
Amazing, thank you!
The goal of this proposed update is to provide a simple APG algorithm that can solve non-trivial tasks, as a first step for researchers and practitioners to explore Brax's differentiable simulation. It has been tested on MJX. Notes:
1: Algorithm Update
This fork contains an APG implementation that is about as simple as the current one, but reflects a common thread among several recent results that have used differentiable simulators to achieve locomotion: 1, 2, 3.
Brax's current APG algorithm is roughly equivalent to the following pseudocode:
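Roughly (an illustrative sketch; `env`, `policy_apply`, `policy_params`, `rng`, and the hyperparameters are placeholders, not the exact Brax code):

```python
import jax
import optax

# Placeholders: `env` is a differentiable Brax-style environment,
# `policy_apply(params, obs) -> action` is the policy network,
# `policy_params` and `rng` are initialized elsewhere.
optimizer = optax.adam(1e-3)
opt_state = optimizer.init(policy_params)

def episode_loss(params, rng):
  state = env.reset(rng)

  def step(carry, _):
    state = carry
    action = policy_apply(params, state.obs)
    state = env.step(state, action)        # differentiable simulator step
    return state, state.reward

  _, rewards = jax.lax.scan(step, state, None, length=episode_length)
  return -rewards.mean()

for epoch in range(n_epochs):
  rng, key = jax.random.split(rng)
  # one update per full-episode unroll: backprop through the whole episode
  grads = jax.grad(episode_loss)(policy_params, key)
  updates, opt_state = optimizer.update(grads, opt_state)
  policy_params = optax.apply_updates(policy_params, updates)
```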
In contrast, the cited results apply policy-gradient updates much more frequently, using the observation that policy gradients obtained by differentiating through the simulator have low variance. Hence, unrolling for an entire episode before updating has limited benefit. One sign that additional samples past a certain point do not help is that convergence does not improve with massive parallelization [2]. The proposed APG algorithm essentially performs online stochastic gradient descent on the policy, unrolling it for a short window, applying a gradient update, and then continuing where it left off:
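Something like the following sketch (same placeholder names as above; resets on episode termination are omitted for brevity):

```python
def window_loss(params, state):
  def step(carry, _):
    state = carry
    action = policy_apply(params, state.obs)
    state = env.step(state, action)        # still a differentiable step
    return state, state.reward

  final_state, rewards = jax.lax.scan(step, state, None, length=horizon)
  return -rewards.mean(), final_state

state = env.reset(rng)
for epoch in range(n_epochs):              # n_epochs can be much larger here
  # short-window unroll followed by an immediate gradient step
  (_, next_state), grads = jax.value_and_grad(window_loss, has_aux=True)(
      policy_params, state)
  updates, opt_state = optimizer.update(grads, opt_state)
  policy_params = optax.apply_updates(policy_params, updates)
  # continue the rollout where it left off; no gradient flows across windows
  state = jax.lax.stop_gradient(next_state)
```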
Note that n_epochs can be much larger in this case. This modification allows the algorithm to learn quadruped locomotion relatively quickly, albeit with a particular training pipeline and reward design. (Notebook)
Additional notes:
2: Supporting Updates
Configurable initial policy variance: When hot-starting a policy, it is beneficial to explore close to the region of state space the policy already induces, which can be done by initializing the policy network weights to small values. Currently the softplus on the scale output prevents this (small weights still yield a standard deviation around softplus(0) ≈ 0.69), so this fork adds a scaling parameter; see the sketch after the next item.
Layer norm: I have found that using layer normalization in the policy network greatly improves the training stability of APG methods; it also appears in other implementations.
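As a rough illustration of the variance-scaling point above (a hypothetical standalone version of the scale computation, not the exact code in this PR):

```python
import jax
import jax.numpy as jnp

# Hypothetical standalone version of the action-std computation: with zero
# scale logits, softplus(0) = ln 2 ≈ 0.69 dominates, so small network weights
# alone cannot produce a small initial action std; var_scale can.
def action_std(raw_scale, min_std=0.001, var_scale=1.0):
  return (jax.nn.softplus(raw_scale) + min_std) * var_scale

print(action_std(jnp.zeros(3)))                  # ~0.694 each
print(action_std(jnp.zeros(3), var_scale=0.1))   # ~0.069 each
```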