Hey, I have a simple question. Why, in your implementation of policy gradient algorithms like A2C and PPO, do you update the actor multiple times in each update call?
Because A2C and PPO use stochastic policies, the log probabilities of the actions are no longer the same after you update the actor once. So theoretically you cannot update the actor more than once in each update function (the count is determined by `actor_update_times` in the init call), but in practice updating a few times works and learns better, so I kept that design. DQN is not a stochastic policy algorithm, so it is not constrained by that.
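A minimal PyTorch sketch of the loop the reply describes, assuming a discrete-action actor. The names `actor`, `actor_update_times`, `states`, `actions`, and `advantages` are illustrative placeholders, not taken from the repository:

```python
import torch
import torch.nn as nn

# Illustrative actor and data; not the repository's actual classes.
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)
actor_update_times = 4  # assumed to mirror the init-call parameter

# One batch collected by the *current* policy.
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
advantages = torch.randn(32)

for _ in range(actor_update_times):
    # Recompute log-probs with the current actor parameters. After the
    # first gradient step these differ from the log-probs the policy
    # assigned when the batch was collected, so later iterations are no
    # longer exact on-policy gradient steps -- the theoretical caveat
    # in the reply.
    dist = torch.distributions.Categorical(logits=actor(states))
    log_probs = dist.log_prob(actions)

    loss = -(log_probs * advantages).mean()  # vanilla policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same loop applied to a DQN-style value network would not have this caveat, since its TD targets do not depend on action log-probabilities under the data-collecting policy.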