Failing to fully replicate Pong with A3C-LSTM #15
@revilokeb When I recorded the graph, there were a few differences compared with the current master.
The commit when I recorded the graph was this point: 9f97b2b. However, I think both 1) and 2) are unrelated to the score. The video I recorded after running A3C-LSTM for 24 hours was like this. My machine is currently occupied with another task, so I'll try again once it becomes available. Thank you for the trial.
@miyosuda That video looks like the agent memorized the environment. I think the paper's authors use random starts to create a fairer evaluation: they sample a number from 0 to 20 and perform that many no-op actions at the beginning of each episode, before the agent kicks in.
@danijar The reset function in game_state.py handles the no-ops (the maximum number is defined in constants.py) at each reset. A rough sketch of that logic is below.
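(For anyone reading along: a minimal sketch of that no-op reset logic. The constant name and value here are illustrative; the real maximum lives in constants.py.)

```python
import numpy as np

NO_OP_MAX = 30  # illustrative; the actual maximum is defined in constants.py

def reset_with_no_ops(ale):
    """Reset the emulator, then hold still for a random number of frames
    so each episode starts from a slightly different state."""
    ale.reset_game()
    for _ in range(np.random.randint(0, NO_OP_MAX + 1)):
        ale.act(0)  # action 0 is the no-op in ALE
```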
I see, so the agent just learned a good behavior that results in very repetitive episodes.
I've seen that you always pick a random action without an exploration factor. How can you reach such a high score without argmax? Are you still picking a random action when the graph shows a score of 20? It's terribly weird and interesting! I think that with a minimal action set (maybe 3 actions), the chance of getting the right one with a random choice is relatively high when considering a sequence (so that an error can be corrected), but I still don't understand how that result can be achieved.
Hey @giuseppebonaccorso! It's not fully random, but rather based on a weighted probability where the action with the highest value also has the highest probability of being selected (think softmax, almost) :)
@babaktr You're right! I was looking at random.choice but without considering the probabilities. :( It becomes almost deterministic if the entropy is low enough.
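(For reference, the sampling step discussed above looks roughly like this; a sketch, not the exact code from the repo.)

```python
import numpy as np

def choose_action(pi_values):
    """Sample an action index weighted by the policy's softmax output.
    When the entropy is low (one probability near 1.0), this behaves
    almost like argmax."""
    return np.random.choice(len(pi_values), p=pi_values)
```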
Hey all, I have been running the latest version since Miyoshi ported it to TF 1.0. I removed gradient clipping to test on Asteroids, where I was worried it wasn't converging, and I also tried Pong. This is the performance I get on Pong:
Edit: as for Asteroids... I think the reason for non-convergence is that the ship randomly (?) disappears from the screen for an indefinite (?) number of frames.
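(For context, the clipping that was removed is the usual global-norm pattern in TF 1.x. A sketch with a toy loss; the threshold here is illustrative, the repo defines its own value in constants.py.)

```python
import tensorflow as tf

GRAD_NORM_CLIP = 40.0  # illustrative threshold

w = tf.Variable(tf.random_normal([4]))
loss = tf.reduce_sum(tf.square(w))

grads = tf.gradients(loss, [w])
clipped_grads, _ = tf.clip_by_global_norm(grads, GRAD_NORM_CLIP)
# "removing gradient clipping" just means handing `grads` instead of
# `clipped_grads` to the optimizer's apply_gradients()
```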
I can also confirm good results with TF 1.0 on a simple MacBook Pro 13".
Hello, why am I still stuck at a score of -21 after 8M steps? I am confused. Is it OK to directly run a3c.py? I am using TensorFlow 1.2.
@miyosuda I have been trying to replicate your very nice Pong A3C-LSTM chart (https://github.com/miyosuda/async_deep_reinforce/blob/master/docs/graph_24h_lstm.png). So far unfortunately I have not really succeeded.
I have been using the parameter settings in constants.py (with USE_LSTM=True and USE_GPU=True). I have also set frame_skip=4 in ale.cfg (master branch, TF r0.10, [email protected], 8 threads, Nvidia Titan X). The changed settings are summarized just below.
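(Summarized; paraphrasing the config, not a verbatim excerpt of the files.)

```python
# constants.py -- the two flags changed from their defaults
USE_LSTM = True   # A3C-LSTM instead of A3C-FF
USE_GPU = True    # run the shared network on the GPU

# plus, in ale.cfg (not Python):
#   frame_skip = 4
```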
My question: should the above parameter settings be sufficient to reproduce your A3C-LSTM chart?
When doing the above, my charts look as follows (I have done multiple runs, also simulating for more than 60M steps):

In my case learning seems to be much slower, saturating at around -5 to 0 (this is easier to see in my simulations running for more than 60M steps).
Comparing with Figure 3 of http://arxiv.org/abs/1602.01783 (Pong, A3C, 8 threads), it seems your result is faster in terms of the number of steps to reach score 20: you require ~20-25M steps, whereas DeepMind on average seemed to need ~50-60M (but theirs is an average value, and from the paper I don't know whether it was A3C-FF or A3C-LSTM).
My second question: have you run A3C-FF / A3C-LSTM multiple times, and were the results similar? And do you have an explanation for why A3C-FF does not reach score 20? (I have also run A3C-FF, and it looks similar to yours...)
Many many thanks for the code!!