Hey, I have a simple question. Why, in your implementation of policy gradient algorithms like A2C and PPO, do you update the actor multiple times in each update call?
Because A2C and PPO use stochastic policies, the log probabilities of the actions are no longer the same after you update the actor once. So theoretically you cannot update the actor more than once in each update function (the count is determined by `actor_update_times` in the init call), but in practice updating a few times works and learns better, so I kept that design. DQN is not a stochastic policy algorithm, so it is not constrained by that.
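A minimal PyTorch sketch of the loop the reply describes, assuming a discrete-action actor. The names `actor`, `actor_update_times`, `states`, `actions`, and `advantages` are illustrative placeholders, not taken from the repository:

```python
import torch
import torch.nn as nn

# Illustrative actor and data; not the repository's actual classes.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
actor_update_times = 4  # assumed to mirror the init-call parameter

# One batch collected by the *current* policy.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
advantages = torch.randn(32)

for _ in range(actor_update_times):
    # Recompute log-probs with the current actor parameters. After the
    # first gradient step these differ from the log-probs the policy
    # assigned when the batch was collected, so later iterations are no
    # longer exact on-policy gradient steps -- the theoretical caveat
    # in the reply.
    dist = torch.distributions.Categorical(logits=actor(states))
    log_probs = dist.log_prob(actions)

    loss = -(log_probs * advantages).mean()  # vanilla policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same loop applied to a DQN-style value network would not have this caveat, since its TD targets do not depend on action log-probabilities under the data-collecting policy.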