I'm opening a new issue because the old one was closed without really explaining the differences from the softlearning implementation, and the other issue has not been reopened on request yet (it's been a week).
self.expected_entropy = -torch.prod(torch.tensor(action_space.shape).to(self.device)).item()  # unsure if this is right for a MultiDiscrete env
print('target entropy', self.expected_entropy)  # prints: target entropy -4
self.log_alpha = torch.tensor(0.0, requires_grad=True, device=self.device)
self.alpha = self.log_alpha.exp()
self.alpha_optimizer = optim.Adam([self.log_alpha], lr=0.003)
# Candidate alpha losses I tried; only alpha_loss4 below is active.
# My implementation, based on formula 18 from the paper (crashes):
# alpha_loss = (-self.alpha * (log_prob - self.expected_entropy).detach()).mean()
# rail-berkeley/softlearning (crashes):
# alpha_loss2 = -1.0 * (self.alpha * (log_prob + self.expected_entropy).detach()).mean()
# cyoon1729/Policy-Gradient-Methods (alpha goes below 0.0 within 80 episodes):
# alpha_loss3 = (self.log_alpha * (-log_prob - self.expected_entropy).detach()).mean()
# vitchyr/rlkit (alpha goes below 0.0 within 52 episodes) and
# p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch (alpha goes below 0.0 within 50 episodes):
alpha_loss4 = -(self.log_alpha * (next_log_prob_selected_actions + self.expected_entropy).detach()).mean()
self.alpha_optimizer.zero_grad()
alpha_loss4.backward()
self.alpha_optimizer.step()
self.alpha = self.log_alpha.exp()  # refresh alpha after the optimizer step so it uses the updated log_alpha
I'd appreciate it if you could explain the difference in the math between the paper (my implementation above) and softlearning. Why do you multiply by -1? And why do you add log_prob and self.expected_entropy, when in the paper it's subtracted?
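For reference, this is formula 18 as I'm transcribing it from the paper (my transcription, so I may be misreading the signs), plus the trivially factored form:

J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
          = -\,\mathbb{E}_{a_t \sim \pi_t}\left[ \alpha \left( \log \pi_t(a_t \mid s_t) + \bar{\mathcal{H}} \right) \right]

If that factoring is right, then the softlearning form (multiply by -1, add the target entropy) looks algebraically the same, and my "subtract" version flips the sign of \bar{\mathcal{H}}, so maybe I'm simply misreading the equation.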
In addition, what would be a good value for expected_entropy, i.e. the target entropy, for a MultiDiscrete([3 3 2 3]) action space (the Obstacle Tower environment)?
Depending on how I calculate it I get -4, -11, or -54, but I'm not sure which would be a good value. Judging from the other post you linked, -4 or -11 should work, but right now neither is working; that could be due to the alpha loss function, though.
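For reference, here is how I arrived at those three candidates (just a sketch; I'm assuming gym's MultiDiscrete exposes nvec, and the names are only for illustration):

import numpy as np
from gym.spaces import MultiDiscrete

action_space = MultiDiscrete([3, 3, 2, 3])

# -4: negative number of sub-actions, i.e. the usual continuous-SAC heuristic -dim(A)
candidate_a = -float(np.prod(action_space.shape))   # -(4) = -4
# -11: negative sum of the branch sizes
candidate_b = -float(np.sum(action_space.nvec))     # -(3+3+2+3) = -11
# -54: negative product of the branch sizes (size of the joint action set)
candidate_c = -float(np.prod(action_space.nvec))    # -(3*3*2*3) = -54

# For comparison, the maximum achievable policy entropy on this space
# (uniform over all 54 joint actions) is sum(log(n_i)) = log(54), about 4.0 nats.
max_entropy = float(np.sum(np.log(action_space.nvec)))
print(candidate_a, candidate_b, candidate_c, max_entropy)

Since a discrete policy's entropy is non-negative and bounded by roughly 4.0 nats here, I'm not sure the more negative targets even make sense for this space, which may be part of my confusion.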
If you need to see the full source code, it's here.

Also, one question about the intuition behind alpha. If the entropy should be different for each state, shouldn't alpha get its own neural net? How can a single tensor capture different entropy values for different states? Is that achieved through how alpha is used together with the other losses? It doesn't quite make sense to me yet.
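To make that last question concrete, this is the kind of thing I had in mind (purely a hypothetical sketch with made-up names and sizes, not something from the paper or this repo): a small network that outputs a log_alpha per state instead of a single scalar.

import torch
import torch.nn as nn

state_dim, batch_size = 8, 32                    # made-up sizes, just for illustration
log_alpha_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
alpha_optimizer = torch.optim.Adam(log_alpha_net.parameters(), lr=3e-4)

states = torch.randn(batch_size, state_dim)      # stand-in for a batch of observations
log_prob = torch.randn(batch_size, 1)            # stand-in for log pi(a|s) of the sampled actions
target_entropy = -4.0

log_alpha = log_alpha_net(states)                # one log_alpha per state, shape [batch_size, 1]
alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()

alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()

alpha = log_alpha_net(states).exp().detach()     # per-state alpha to weight the entropy term

Is the single-scalar version effectively just doing this averaged over all states?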
Much appreciated, thanks