-
Alright, I think I can answer my own first question :-) In the code cell, the environment is reset every time the cell is run, but the agent's beliefs about the reward condition are not. Therefore, if sufficient evidence has already accumulated for either Reward on Right or Reward on Left, and the actual reward condition on the next cell run happens to be the opposite one, it can take much more than 5 time steps to overturn such a strong belief. In fact, if I add the following line just before the active inference loop:
`agent.qs[1] = [0.5, 0.5]  # reset posterior beliefs about reward condition`
then the agent goes to the cue location at the first time step every time I run the cell.
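For context, this is roughly how the loop cell looks with the reset in place. It is only a sketch of that one cell: `agent` and `env` are assumed to have been constructed in the earlier cells of the T-maze demo, and the loop follows the demo's standard pymdp calls (`infer_states`, `infer_policies`, `sample_action`).

```python
import numpy as np

# `agent` (pymdp Agent) and `env` (the T-maze environment) are assumed to come
# from the earlier cells of the demo notebook; only the loop cell is sketched here.

# Reset the posterior over the reward-condition factor to uniform, so that
# evidence accumulated on previous runs of this cell is discarded.
agent.qs[1] = np.array([0.5, 0.5])

T = 5                    # number of time steps, as in the demo
obs = env.reset()        # the environment *is* re-initialised on every run

for t in range(T):
    qs = agent.infer_states(obs)        # update beliefs about hidden states
    q_pi, efe = agent.infer_policies()  # evaluate policies (expected free energy)
    action = agent.sample_action()      # select an action
    obs = env.step(action)              # step the environment
```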
-
Hello,
First of all, my compliments and thanks for developing this resource; it has been incredibly useful for building a working knowledge of active inference concepts. I went through all the demo scripts, but a couple of them show behavior that is unexpected, at least to me.
First, the T-maze demo. Towards the end of the script, when the active inference loop is run for 5 time steps, if I run that cell repeatedly (locally or on Colab), the agent never seems to visit the Cue Location again (except perhaps the first time); furthermore, the agent sometimes stays in the wrong T-maze arm (the one with the loss) for all 5 time steps, which seems quite counterintuitive. Is this behavior correct? I ask especially in light of the comment following the code snippet in the demo, which says that the agent will start by visiting the Cue Location because it "knows" that it contains valuable information.
Second, the computation of gradient descent on VFE via autodifferentiation. In the demo, it is explained that the update of the posterior over states is performed according to the formula:
Q(s)_{t+1} = Q(s)_t - learning_rate * ∂F/∂Q(s)_t
however, since this formula could produce negative values of Q(s), the gradient descent is apparently performed on ln Q(s) instead. Now, if we substitute ln Q(s) for Q(s) in the above formula, we get:
ln(Q(s)_{t+1}) = ln(Q(s)_t) - learning_rate * ∂F/∂(ln Q(s)_t)
which, by the chain rule (since ∂Q(s)/∂(ln Q(s)) = Q(s), so that ∂F/∂(ln Q(s)_t) = Q(s)_t * ∂F/∂Q(s)_t), becomes:
(1) ln(Q(s)_{t+1}) = ln(Q(s)_t) - learning_rate * Q(s)_t * ∂F/∂Q(s)_t
In the script, however, the following formula is used, which does not seem strictly correct:
(2) ln(Q(s)_{t+1}) = ln(Q(s)_t) - learning_rate * ∂F/∂Q(s)_t
In fact, if we plot the VFE over the updates using formula (1) or (2), we see that the curve computed with (1) (in red) stays above the one computed with (2) (in blue); see the attached picture.
I am puzzled by this; I was expecting the two methods to converge in the long run, at least. More importantly, isn't (1) the correct formula? And if it is, why is (2) used?
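To make the comparison concrete, here is a minimal, self-contained sketch of what I mean. It is not the demo's code: the two-state toy model, the hand-coded gradient (standing in for the autodiff one), and the softmax renormalisation after each step are all simplifications of my own.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Toy generative model (illustrative numbers only): 2 hidden states, 2 outcomes.
A = np.array([[0.9, 0.1],     # p(o = 0 | s)
              [0.1, 0.9]])    # p(o = 1 | s)
D = np.array([0.5, 0.5])      # prior over hidden states
o = 0                         # the observed outcome

log_joint = np.log(A[o, :] * D)   # ln P(o, s) for the fixed observation

def vfe(q):
    """Variational free energy F = E_Q[ln Q(s) - ln P(o, s)]."""
    return np.sum(q * (np.log(q) - log_joint))

def dF_dq(q):
    """Analytic gradient of F with respect to Q(s)."""
    return np.log(q) - log_joint + 1.0

lr, n_steps = 0.1, 200
logq1 = np.log(np.array([0.5, 0.5]))   # updated with rule (1)
logq2 = np.log(np.array([0.5, 0.5]))   # updated with rule (2)

for _ in range(n_steps):
    q1, q2 = softmax(logq1), softmax(logq2)
    # (1) ln Q <- ln Q - lr * Q * dF/dQ  (chain rule: gradient w.r.t. ln Q)
    logq1 = logq1 - lr * q1 * dF_dq(q1)
    # (2) ln Q <- ln Q - lr * dF/dQ      (gradient w.r.t. Q applied to ln Q)
    logq2 = logq2 - lr * dF_dq(q2)

print("final F with rule (1):", vfe(softmax(logq1)))
print("final F with rule (2):", vfe(softmax(logq2)))
print("exact posterior      :", softmax(log_joint))
```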
I am attaching the modified script where I used both formulas, in case it is helpful (I changed the extension from .py to .txt, because otherwise I couldn't upload it).
6_my_compute_VEF.txt
Thank you in advance for any comments!