Failing to fully replicate Pong with A3C-LSTM #15
@revilokeb When I recorded the graph, there were a few differences compared with the current master.
The commit when I recorded the graph was this point: 9f97b2b. However, I think both 1) and 2) are unrelated to the score. The video I recorded after running A3C-LSTM for 24 hours was like this. My machine is currently occupied with another task, so I'll try again once it becomes available. Thank you for the trial.
@miyosuda That video looks like the agent memorized the environment. I think the paper's authors use random starts to create a fairer evaluation: they sample a number from 0 to 20 and perform that many no-op actions at the beginning of each episode, before the agent kicks in.
@danijar The reset function in game_state.py handles the no-ops (the maximum number is defined in constants.py) at each reset. A rough sketch of that logic is below.
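(For anyone reading along: a minimal sketch of that no-op reset logic. The constant name and value here are illustrative; the real maximum lives in constants.py.)

```python
import numpy as np

NO_OP_MAX = 30  # illustrative; the actual maximum is defined in constants.py

def reset_with_no_ops(ale):
    """Reset the emulator, then hold still for a random number of frames
    so each episode starts from a slightly different state."""
    ale.reset_game()
    for _ in range(np.random.randint(0, NO_OP_MAX + 1)):
        ale.act(0)  # action 0 is the no-op in ALE
```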
I see, so the agent just learned a good behavior that results in very repetitive episodes.
I've seen that you always pick a random action without an exploration factor. How can you reach such a high score without argmax? Are you still picking a random action when the graph shows a score of 20? It's terribly weird and interesting! I think that with a minimal action set (maybe 3 actions), the chance of getting the right one with a random choice is relatively high when considering a sequence (so that an error can be corrected), but I still don't understand how that result can be achieved.
Hey @giuseppebonaccorso! It's not fully random, but rather based on a weighted probability where the action with the highest value also has the highest probability of being selected (think softmax, almost) :)
@babaktr You're right! I was looking at random.choice but without considering the probabilities. :( It becomes almost deterministic if the entropy is low enough.
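(For reference, the sampling step discussed above looks roughly like this; a sketch, not the exact code from the repo.)

```python
import numpy as np

def choose_action(pi_values):
    """Sample an action index weighted by the policy's softmax output.
    When the entropy is low (one probability near 1.0), this behaves
    almost like argmax."""
    return np.random.choice(len(pi_values), p=pi_values)
```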
Hey all, I have been running the latest version since Miyoshi ported it to TF 1.0. I removed gradient clipping to test on Asteroids, where I was worried it wasn't converging, and I also tried Pong. This is the performance I get on Pong:
Edit: as for Asteroids... I think the reason for non-convergence is that the ship randomly (?) disappears from the screen for an indefinite (?) number of frames.
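(For context, the clipping that was removed is the usual global-norm pattern in TF 1.x. A sketch with a toy loss; the threshold here is illustrative, the repo defines its own value in constants.py.)

```python
import tensorflow as tf

GRAD_NORM_CLIP = 40.0  # illustrative threshold

w = tf.Variable(tf.random_normal([4]))
loss = tf.reduce_sum(tf.square(w))

grads = tf.gradients(loss, [w])
clipped_grads, _ = tf.clip_by_global_norm(grads, GRAD_NORM_CLIP)
# "removing gradient clipping" just means handing `grads` instead of
# `clipped_grads` to the optimizer's apply_gradients()
```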
I can also confirm good results with TF 1.0 on a simple MacBook Pro 13".
Hello, why am I still stuck at a score of -21 after 8M steps? I am confused. Is it OK to directly run a3c.py? I am using TensorFlow 1.2.
@miyosuda I have been trying to replicate your very nice Pong A3C-LSTM chart (https://github.com/miyosuda/async_deep_reinforce/blob/master/docs/graph_24h_lstm.png). So far unfortunately I have not really succeeded.
I have been using the parameter settings in constants.py (with USE_LSTM=True and USE_GPU=True). I have also set frame_skip=4 in ale.cfg (master branch, TF r0.10, [email protected], 8 threads, Nvidia Titan X). The changed settings are summarized just below.
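(Summarized; paraphrasing the config, not a verbatim excerpt of the files.)

```python
# constants.py -- the two flags changed from their defaults
USE_LSTM = True   # A3C-LSTM instead of A3C-FF
USE_GPU = True    # run the shared network on the GPU

# plus, in ale.cfg (not Python):
#   frame_skip = 4
```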
My question: should the above parameter settings be sufficient to reproduce your A3C-LSTM chart?
When doing the above, my charts look as follows (I have done multiple runs, also simulating for more than 60M steps):

In my case learning seems to be much slower, saturating at around -5 to 0 (this is easier to see in my simulations running for more than 60M steps).
Comparing with Figure 3 of http://arxiv.org/abs/1602.01783 (Pong, A3C, 8 threads), it seems your result is faster in terms of the number of steps to reach score 20: you require ~20-25M steps, whereas DeepMind on average seemed to need ~50-60M (but theirs is an average value, and from the paper I don't know whether it was A3C-FF or A3C-LSTM).
My second question: have you run A3C-FF / A3C-LSTM multiple times, and were the results similar? And do you have an explanation for why A3C-FF does not reach score 20? (I have also run A3C-FF, and it looks similar to yours...)
Many many thanks for the code!!