YuffieHuang/Monte-Carlo-Policy-Gradient-REINFORCE

 
 


Monte-Carlo Policy Gradient (REINFORCE)

Solve CartPole-v0 with Monte-Carlo Policy Gradient (REINFORCE)!

How does it work

Take a look at the boxed pseudocode below.

[Image: boxed REINFORCE pseudocode]

$G_t$: return (cumulative discounted reward) following time $t$

$\pi(A_t \mid S_t, \theta)$: probability of taking action $A_t$ in state $S_t$ under policy parameters $\theta$
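To make the action-probability term concrete, here is a minimal sketch of a softmax policy over linear action preferences. The parameterization (one weight vector per action) and the names `softmax_policy` and `sample_action` are assumptions for illustration only; the policy in this repository may be defined differently, for example as a neural network.

```python
import math
import random


def softmax_policy(theta, state):
    """Return the action probabilities for `state`.

    theta: list of per-action weight vectors (hypothetical parameterization).
    state: list of floats, e.g. the four CartPole observations.
    """
    # Linear preference for each action: dot(weights, state).
    prefs = [sum(w * s for w, s in zip(weights, state)) for weights in theta]
    # Subtract the maximum preference for numerical stability.
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With all weights at zero, both CartPole actions (push left, push right) get probability 0.5, so the untrained policy acts uniformly at random.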

My interpretation of this method is that the actions selected more frequently are the more beneficial choices, so we try to repeat these actions when similar states are visited.

Result

Note

The boxed pseudocode is from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

In my code, the loop of updating weights iterates from $t = T-1$ to $t = 0$, and the cumulative discounted reward is computed by $G_t = R_{t+1} + \gamma G_{t+1}$.
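One common way to compute the cumulative discounted reward for every step of an episode is a single backward pass that applies $G_t = R_{t+1} + \gamma G_{t+1}$, starting from zero after the final step. The sketch below assumes `rewards[t]` stores $R_{t+1}$ (the reward received after the action at step $t$); the function name and the default discount `gamma=0.99` are illustrative, not taken from this repository.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] for one episode.

    rewards[t] is assumed to hold R_{t+1}, the reward that follows
    the action taken at step t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0  # the return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps of reward 1 with gamma = 0.5.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Iterating backward makes the computation linear in the episode length, whereas recomputing each sum from scratch would be quadratic.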

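Besides, the term $\gamma^t$ is omitted in the update $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla \ln \pi(A_t \mid S_t, \theta)$.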

Since the optimizer minimizes the loss, we need to multiply the product of $G_t$ and $\ln \pi(A_t \mid S_t, \theta)$ by $-1$.
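The sign flip can be illustrated with a small, framework-free sketch. It assumes the per-step log-probabilities of the chosen actions and the per-step returns have already been collected for one episode; `reinforce_loss` is a hypothetical helper name. In an automatic-differentiation library the log-probabilities would be tensors, so that minimizing this scalar performs gradient ascent on the return-weighted log-probabilities.

```python
import math


def reinforce_loss(log_probs, returns):
    """Negative sum of G_t * ln pi(A_t | S_t, theta) over one episode.

    log_probs[t]: log-probability of the action actually taken at step t.
    returns[t]:   return G_t following step t.
    The gamma**t factor is left out, matching the simplification
    described in the note above.
    """
    if len(log_probs) != len(returns):
        raise ValueError("log_probs and returns must have the same length")
    # The minus sign turns "maximize G_t * ln pi" into a loss to minimize.
    return -sum(lp * g for lp, g in zip(log_probs, returns))


# Two steps, each action chosen with probability 0.5, returns 2 and 1.
loss = reinforce_loss([math.log(0.5), math.log(0.5)], [2.0, 1.0])
print(loss)  # equals 3 * ln(2), roughly 2.079
```

Raising the probability of an action with a positive return moves its log-probability toward zero, which lowers this loss.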
