Monte-Carlo Policy Gradient (REINFORCE)

Solve CartPole-v0 with Monte-Carlo Policy Gradient (REINFORCE)!

How does it work

Take a look at the boxed pseudocode below.
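For reference, the core update step of REINFORCE as stated in Sutton and Barto's book can be written as follows. This is a transcription of the textbook rule, not of this repository's code; $\alpha$ denotes the step size and $\gamma$ the discount factor.

$$
\theta \leftarrow \theta + \alpha \, \gamma^{t} \, G_t \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta), \qquad t = 0, 1, \dots, T-1
$$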

$G_t$: return (cumulative discounted reward) following time $t$

$\pi(A_t \mid S_t, \theta)$: probability of taking action $A_t$ in state $S_t$

My interpretation of this method is that actions followed by high returns have their probabilities increased, so they end up being selected more frequently; the policy thus tends to repeat these beneficial actions when similar states are visited.
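To make this intuition concrete, here is a minimal sketch of a single REINFORCE update. It assumes a state-independent softmax policy over two actions, which is a simplification chosen for illustration and not the network used in `REINFORCE.py`; the names `softmax` and `reinforce_step` are hypothetical.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(theta, action, G, alpha=0.1):
    """One gradient-ascent step: theta += alpha * G * grad ln pi(action).

    theta holds one preference per action (an assumed toy parameterisation).
    """
    probs = softmax(theta)
    # For a softmax policy, d ln pi(a) / d theta_k = 1[k == a] - pi(k).
    grad_log_pi = [(1.0 if k == action else 0.0) - probs[k]
                   for k in range(len(theta))]
    return [t + alpha * G * g for t, g in zip(theta, grad_log_pi)]

theta = [0.0, 0.0]
before = softmax(theta)[1]                         # probability of action 1 before the update
new_theta = reinforce_step(theta, action=1, G=5.0)
after = softmax(new_theta)[1]                      # probability of action 1 after the update
```

With a positive return the probability of the sampled action goes up; with a negative return the same step would push it down.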

Result

Note

The boxed pseudocode is from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

In my code, the loop of updating weights iterates from to , and the cumulative discounted reward is computed by .
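One common way to compute the cumulative discounted reward for every time step is a single backward pass over the episode's rewards. The sketch below assumes that approach; the function name `discounted_returns`, the default `gamma`, and the convention that `rewards[t]` holds the reward received after the action at step `t` are illustrative assumptions, not taken from `REINFORCE.py`.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ..., G_{T-1}] for one finished episode.

    Uses the recursion G_t = rewards[t] + gamma * G_{t+1}, with G_T = 0.
    """
    returns = [0.0] * len(rewards)
    G = 0.0
    # Walk backward so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

Walking backward keeps the computation linear in the episode length, whereas summing the discounted rewards from scratch at every step would be quadratic.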

Besides, the term is omitted in .

Since the optimizer minimizes the loss while REINFORCE performs gradient ascent, we need to multiply the product of $G_t$ and $\ln \pi(A_t \mid S_t, \theta)$ by $-1$.
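The sign flip can be sketched in plain Python as below. The helper name `reinforce_loss` is hypothetical, and the log-probabilities are passed in as plain floats for illustration; in an autodiff framework they would be differentiable values produced by the policy network.

```python
import math

def reinforce_loss(log_probs, returns):
    """Surrogate loss for one episode: -(sum over t of G_t * ln pi(A_t | S_t)).

    Minimizing this quantity is equivalent to ascending the
    REINFORCE objective, which is why the minus sign is needed.
    """
    return -sum(G * lp for G, lp in zip(returns, log_probs))

# Two steps with action probabilities 0.5 and 0.25 and returns 2.0 and 1.0.
loss = reinforce_loss([math.log(0.5), math.log(0.25)], [2.0, 1.0])
```

Because the log-probabilities are negative and the returns here are positive, the loss is positive, and lowering it means raising the probabilities of the actions that were taken.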
