Differences between the softlearning implementation and formula 18 in the paper for the alpha loss #177

Open
Maggern3 opened this issue Apr 28, 2021 · 0 comments

Opening a new issue because the old one was closed without really explaining the differences in the softlearning implementation, and the other issue has not been reopened on request (yet, after a week).


```python
self.expected_entropy = -torch.prod(torch.tensor(action_space.shape).to(self.device)).item()  # unsure if this is right for a MultiDiscrete env
print('target entropy', self.expected_entropy)  # prints: target entropy -4
self.log_alpha = torch.tensor(0.0, requires_grad=True, device=self.device)
self.alpha = self.log_alpha.exp()  # (previously a fixed 0.2)
self.alpha_optimizer = optim.Adam([self.log_alpha], lr=0.003)
```


```python
# my impl based on formula 18 from the paper -- crashes
#alpha_loss = (-self.alpha * (log_prob - self.expected_entropy).detach()).mean()

# rail-berkeley/softlearning -- crashes
#alpha_loss2 = -1.0 * (self.alpha * (log_prob + self.expected_entropy).detach()).mean()

# cyoon1729/Policy-Gradient-Methods -- alpha drops below 0.0 within 80 episodes
#alpha_loss3 = (self.log_alpha * (-log_prob - self.expected_entropy).detach()).mean()

# vitchyr/rlkit -- alpha drops below 0.0 within 52 episodes
# p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch -- alpha drops below 0.0 within 50 episodes
alpha_loss4 = -(self.log_alpha * (next_log_prob_selected_actions + self.expected_entropy).detach()).mean()

self.alpha = self.log_alpha.exp()
self.alpha_optimizer.zero_grad()
alpha_loss4.backward()
self.alpha_optimizer.step()
```
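
To make the comparison concrete, here is a standalone toy version of the four variants above, with made-up log-probs and a fresh log_alpha per variant (the numbers are not from the actual environment, just for illustration):

```python
import torch

target_entropy = -4.0                        # the value my setup code prints
log_prob = torch.tensor([-1.2, -0.7, -2.3])  # made-up per-sample log-probs

variants = {
    "my formula-18 reading": lambda a, la: (-a * (log_prob - target_entropy).detach()).mean(),
    "softlearning":          lambda a, la: -1.0 * (a * (log_prob + target_entropy).detach()).mean(),
    "cyoon1729":             lambda a, la: (la * (-log_prob - target_entropy).detach()).mean(),
    "rlkit / p-christ":      lambda a, la: -(la * (log_prob + target_entropy).detach()).mean(),
}

for name, loss_fn in variants.items():
    la = torch.tensor(0.0, requires_grad=True)  # fresh log_alpha, so each backward() has its own graph
    loss = loss_fn(la.exp(), la)
    loss.backward()
    print(f"{name:>22}: loss={loss.item():+.3f}  d(loss)/d(log_alpha)={la.grad.item():+.3f}")
```

On these dummy numbers the softlearning, cyoon1729, and rlkit variants all push log_alpha in the same direction, while my formula-18 reading pushes it the other way, which is basically the sign question I'm asking about below.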

If you could explain the mathematical difference between the paper's formula (my implementation above) and the softlearning version, I'd appreciate it. Why do you multiply by -1? And why add log_prob and self.expected_entropy, when in the paper the entropy term is subtracted?
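
For reference, here is equation 18 from the paper as I read it (transcribed by hand, so please correct me if I'm misreading it):

```
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
```

where \bar{\mathcal{H}} is the target entropy (my self.expected_entropy).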

In addition, what would be a good value for expected_entropy (the target entropy) for a MultiDiscrete([3, 3, 2, 3]) action space, i.e. the Obstacle Tower environment?
Depending on how I calculate it I get -4, -11, or -54, but I'm not sure which would be a good value. Judging from the other post you linked, -4 or -11 should work, but right now neither is working. That could be due to the alpha loss function, though.
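
For concreteness, here is where I believe each of those three candidates comes from (the -4 is the shape-based product from my setup code; I get -11 and -54 from the sum and the product of the per-branch action counts, respectively):

```python
import numpy as np
from gym.spaces import MultiDiscrete

space = MultiDiscrete([3, 3, 2, 3])   # Obstacle Tower's action space
nvec = space.nvec                     # array([3, 3, 2, 3])

print(-int(np.prod(space.shape)))     # -4   number of sub-actions (what my setup code computes)
print(-int(np.sum(nvec)))             # -11  sum of choices across sub-actions
print(-int(np.prod(nvec)))            # -54  total number of joint actions
```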

If you need to see the full source code, it's here.

Also, one question about the intuition behind alpha. If it should be different for each state (the entropy), shouldn't we give it its own neural net? How can a single tensor encapsulate different entropy values for different states? Is that achieved through how alpha is used together with the other losses? It doesn't quite make sense to me. (See the sketch below for what I mean by "its own neural net".)
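
To illustrate, this is purely hypothetical and not something I've seen in softlearning or the other repos:

```python
import torch
import torch.nn as nn

# Hypothetical state-conditioned log-alpha, i.e. one temperature per state...
class LogAlphaNet(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)  # shape (batch, 1): a log-alpha per state

# ...versus the single shared scalar that all the implementations above actually use:
log_alpha = torch.tensor(0.0, requires_grad=True)
```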

Much appreciated, thanks
