I really enjoyed this tutorial, overall. It shows off the power of the `Agent` class. I found it a little hard to read in some places, though, and wanted to flag a few issues:
It's a non-standard bandit task, because there's a third action, HINT. It would be nice to point this out in the intro - or to cite a version of the task that includes the hint action, since the only reference, the Wikipedia link, doesn't mention it.
Going from tutorial 1 to tutorial 2, there's an unexplained shift in the definition of A. In tutorial 1, it was a two-dimensional array p(o|s), but now it's a four-dimensional array. I realized a bit late in the process that this was all explained in the pymdp fundamentals part of the doc - a backlink would be nice.
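For what it's worth, here is roughly the picture I eventually pieced together, as a minimal numpy sketch (the modality and state-factor sizes are made up, not the tutorial's exact ones):

```python
import numpy as np

# Tutorial 1: one observation modality, one hidden-state factor,
# so A is a single 2-D array with A[o, s] = p(o | s).
num_obs, num_states = 3, 3
A_t1 = np.full((num_obs, num_states), 1.0 / num_obs)

# Tutorial 2: several observation modalities and several hidden-state factors.
# A becomes an object array with one sub-array per modality; each sub-array
# has shape (num_obs[m], num_states[0], num_states[1], ...).
num_obs_per_modality = [3, 3, 4]   # e.g. hint, reward, choice-sensation (made up)
num_states_per_factor = [2, 4]     # e.g. context, choice state (made up)
A_t2 = np.empty(len(num_obs_per_modality), dtype=object)
for m, n_obs in enumerate(num_obs_per_modality):
    A_t2[m] = np.zeros((n_obs, *num_states_per_factor))

# Indexing now takes a modality plus one index per hidden-state factor:
# A_t2[m][o, s_context, s_choice] = p(o_m | s_context, s_choice),
# which is why a single lookup into "A" now involves four indices.
print(A_t1.shape)     # (3, 3)
print(A_t2[1].shape)  # (3, 2, 4)
```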
Generally I find using raw arrays to define states difficult to read. I presume this is an artifact of the package's history (Matlab cell arrays). It would be helpful to remind people of what each axis means as we go along. OTOH, I did find the diagrams very helpful.
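Concretely, even just naming the indices once and then indexing with the names would help readability a lot. A rough sketch, with labels and sizes that are purely illustrative:

```python
import numpy as np

# Illustrative labels and sizes, not the tutorial's exact ones.
HINT_MODALITY, REWARD_MODALITY = 0, 1
LEFT_BETTER, RIGHT_BETTER = 0, 1          # levels of the context factor
NULL_OBS, LOSS_OBS, REWARD_OBS = 0, 1, 2  # levels of the reward modality
PLAY_LEFT = 2                             # one level of the choice factor

A = np.empty(2, dtype=object)
A[HINT_MODALITY] = np.zeros((3, 2, 4))
A[REWARD_MODALITY] = np.zeros((3, 2, 4))

# Reads as: "if the left arm is better and the agent plays left, it sees
# Reward with probability 0.7 and Loss with probability 0.3."
A[REWARD_MODALITY][REWARD_OBS, LEFT_BETTER, PLAY_LEFT] = 0.7
A[REWARD_MODALITY][LOSS_OBS, LEFT_BETTER, PLAY_LEFT] = 0.3
```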
I found the last part, about manipulating the reward, very confusing. I had to go back to the definitions to figure out that the last manipulation turns the loss into a reward (+5) - if I just look at the output, it looks like the agent always picks the arm that results in a loss.
One thing that's not explained here is how the agent's inference mechanism differs from those of other, better-known agents from the RL literature. Having read the first few chapters of Richard Sutton's book eons ago, I wondered whether the inference was equivalent to a finite horizon dynamic programming solution, or similar in spirit to an agent that uses a UCB heuristic or Thompson sampling. If you could have a few references in the tutorial about this, it would be great.
Hi @patrickmineault, thanks a lot for these helpful suggestions. I've addressed points 1-4.
That bit you discovered in Point 4 with the reward manipulation was a mistake on my part: I didn't mean to have the user flip the 'valence' of reward and punishment like that, although the outcome you observed makes perfect sense. In that case, the 'meaning' of the reward and loss observations is simply switched in terms of the agent's prior over observations: the agent now "expects itself" to see Loss more than Reward, meaning the observation formerly referred to as "Loss" is technically the reward now. I've now changed that section of the demo to make the agent more risk-averse, but still actively "Loss-seeking" as it was before.
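To spell out what that switch amounts to: the C array holds the agent's prior preferences over observations (one vector per modality), so flipping the entries for the reward modality flips which outcome the agent treats as rewarding. A rough sketch, with made-up values and the outcome labels ['Null', 'Loss', 'Reward'] standing in for the demo's actual ones:

```python
import numpy as np

NULL, LOSS, REWARD = 0, 1, 2  # outcome levels of the reward modality (illustrative)

# Prior preferences over reward-modality observations (log-preferences,
# up to a constant). Positive = "I expect/prefer to see this outcome".
C_reward = np.zeros(3)
C_reward[REWARD] = 5.0   # prefer Reward
C_reward[LOSS] = -5.0    # avoid Loss

# The flipped version: Loss now carries the high preference, so the
# observation labelled "Loss" functions as the reward, and the agent
# looks as if it always chooses the losing arm.
C_flipped = np.zeros(3)
C_flipped[LOSS] = 5.0
C_flipped[REWARD] = -5.0
```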
I'm going to leave this issue open for now, because I haven't added text to address your last point, which is also a great suggestion and should be incorporated: "I wondered whether the inference was equivalent to a finite horizon dynamic programming solution, or similar in spirit to an agent that uses a UCB heuristic or Thompson sampling. If you could have a few references in the tutorial about this, it would be great." I think this is related to this paper: https://arxiv.org/abs/2009.08111.
I will have to do some more research on these exact relationships and then incorporate that into the text of that demo.