Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]: WTCL: received error code CodePermanentFailure in StateUpdateReply for seqnum=xxx #7949

Closed
MButcho opened this issue Sep 2, 2023 · 17 comments · Fixed by #8239
Closed
Labels
bug Unintended code behaviour P2 should be fixed if one has time watchtower

Comments

@MButcho
Copy link

MButcho commented Sep 2, 2023

Background

WTCL is working fine but I receive this error every 5 mins on the same seqnum:
[ERR] WTCL: (anchor) SessionQueue(xxx) unable to send state update: received error code CodePermanentFailure in StateUpdateReply for seqnum=570

Your environment

  • lnd version 0.16.4-beta commit=v0.16.4-beta
  • Linux 5.15.0-82-generic 91-Ubuntu SMP (22.04)
  • Bitcoin Core version v25.0.0

Steps to reproduce

It happens continually

Expected behaviour

The error should not be in log

Actual behaviour

Error in log

@MButcho MButcho added bug Unintended code behaviour needs triage labels Sep 2, 2023
@ellemouton
Copy link
Collaborator

Hi @MButcho - thanks for the report. Are your debug logs on? if not, mind turning them on for a bit to see if there is more information provided in the debug logs?

Also note that once #7545 is in, you will be able to mark a session as "borked" so that it replays the update onto a different session

@saubyk saubyk added P3 might get fixed, nice to have P2 should be fixed if one has time and removed needs triage P3 might get fixed, nice to have labels Sep 5, 2023
@MButcho
Copy link
Author

MButcho commented Sep 6, 2023

Hey,
here it is:
2023-09-06 13:12:48.154 [INF] WTCL: (anchor) SessionQueue(02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59) attempting to dial tower at x.x.x.x:9911
2023-09-06 13:12:48.241 [DBG] WTCL: (anchor) SessionQueue(02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59) reprocessing committed state update for backup(28312160ca8a24059c86358195d52bfae30792143c53a7850747dd7c8e8726dd, 180585) seqnum=570
2023-09-06 13:12:48.244 [DBG] WTCL: (anchor) Sending Init to 032b32484f7f279638c3fa58f12d4dc255b6d7cbb08bff1f843a8e4ff2b48f79ab@x.x.x.x:9911
2023-09-06 13:12:48.279 [DBG] WTCL: (anchor) Received Init from 032b32484f7f279638c3fa58f12d4dc255b6d7cbb08bff1f843a8e4ff2b48f79ab@x.x.x.x:9911
2023-09-06 13:12:48.280 [DBG] WTCL: (anchor) Sending MsgStateUpdate(seqnum=570 last_applied=569 is_complete=0 hint=af90e258d9ed85417091694ac48aa72c) to 032b32484f7f279638c3fa58f12d4dc255b6d7cbb08bff1f843a8e4ff2b48f79ab@x.x.x.x:9911
2023-09-06 13:12:48.320 [DBG] WTCL: (anchor) Received MsgStateUpdateReply(code=50 last_applied=0) from 032b32484f7f279638c3fa58f12d4dc255b6d7cbb08bff1f843a8e4ff2b48f79ab@x.x.x.x:9911
2023-09-06 13:12:48.320 [WRN] WTCL: (anchor) SessionQueue(02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59) unable to upload state update to tower=032b32484f7f279638c3fa58f12d4dc255b6d7cbb08bff1f843a8e4ff2b48f79ab@x.x.x.x:9911: received error code CodePermanentFailure in StateUpdateReply for seqnum=570
2023-09-06 13:12:48.321 [ERR] WTCL: (anchor) SessionQueue(02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59) unable to send state update: received error code CodePermanentFailure in StateUpdateReply for seqnum=570

@ellemouton
Copy link
Collaborator

Thanks @MButcho - and do you have access to the watchtower's logs too?

@MButcho
Copy link
Author

MButcho commented Sep 6, 2023

No I don't, but may be able to contact the person providing watchtower. Let me ping him over Telegram.

@fbrz76
Copy link

fbrz76 commented Sep 6, 2023

Well, the only lines on my side (watchtower server) with that session string that you mentioned is actually written every 5 minutes and that's it:

2023-09-06 18:29:14.305 [INF] WTWR: Accepted incoming peer 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:49682
2023-09-06 18:29:14.346 [ERR] WTWR: Unable to handle StateUpdate from 2a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59: connection with 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59 failed with code=CodePermanentFailure
2023-09-06 18:29:14.346 [INF] WTWR: Releasing incoming peer 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:49682

The cause I can't really imagine. probably an error has invalidated some data.

On my side info are very poor pherhaps a debug log should help more,

@fbrz76
Copy link

fbrz76 commented Sep 6, 2023

lncli version 0.16.2-beta commit=v0.16.2-beta

And this is the log with command
lncli debuglevel --level=WTWR=debug

2023-09-06 19:39:16.654 [INF] WTWR: Accepted incoming peer 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:38540
2023-09-06 19:39:16.661 [DBG] WTWR: Received Init from 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:38540
2023-09-06 19:39:16.661 [DBG] WTWR: Sending Init to 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:38540
2023-09-06 19:39:16.705 [DBG] WTWR: Received MsgStateUpdate(seqnum=570 last_applied=569 is_complete=0 hint=af90e258d9ed85417091694ac48aa72c) from 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:38540
2023-09-06 19:39:16.705 [DBG] WTWR: Sending MsgStateUpdateReply(code=50 last_applied=0) to 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:38540
2023-09-06 19:39:16.705 [ERR] WTWR: Unable to handle StateUpdate from 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59: connection with 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59 failed with code=CodePermanentFailure
2023-09-06 19:39:16.705 [INF] WTWR: Releasing incoming peer 02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59@x.x.x.x:38540

@saubyk saubyk assigned ellemouton and unassigned ellemouton Sep 13, 2023
@MButcho
Copy link
Author

MButcho commented Sep 13, 2023

After server restart today, another seqnum started to apear:
2023-09-13 09:03:15.443 [ERR] WTCL: (anchor) SessionQueue(02a0a6a24489427882a03c077d556d5981c59478715a4fb4dce6f0f75336d67d59) unable to send state update: received error code CodePermanentFailure in StateUpdateReply for seqnum=570
2023-09-13 09:03:15.460 [ERR] WTCL: (anchor) SessionQueue(0298f40defbbee418bac2f2cf96cf8842d83678aeb891e0406c630e0b0624722a8) unable to send state update: received error code CodePermanentFailure in StateUpdateReply for seqnum=697

@ellemouton
Copy link
Collaborator

Thanks for the extra logs @MButcho 🙏

The first thing I think we should do is prio getting this PR in as this will allow the client to just mark the problem session(s) as borked so that it can continue its operations.

From a quick look at the code, the server will return CodePermanentFailure in one of two cases. The first is if the session being referred to by the client does not exist and the second is if the client tries to send an update using a sequence number that the client claims to already have received an ACK from the server for.

I think the first scenario is very unlikely. So if it is the second scenario, then there is likely some race happening somewhere similar to #6397. I will dig into this a bit.

But I think in the meantime, we should get #7545 in so that things can continue when investigations like this need to happen.

@MButcho
Copy link
Author

MButcho commented Sep 14, 2023

@ellemouton Should I stop lnd, remove wtclient.db and start lnd?

@fbrz76
Copy link

fbrz76 commented Sep 14, 2023

Hi, I read around, that someone does this. Afterwards you have to reconnect the watchtower

@ellemouton
Copy link
Collaborator

Should I stop lnd, remove wtclient.db and start lnd?

Yes @MButcho - I think that is a good idea for now (and yes, like @fbrz76 says - you will need to reconnect to the tower if you do that). It will allow you to get unstuck and continue to backup updates. And your previous backups will remain on the tower 👍

Then when we have this fix in then there will be a better way to move past something like this without having to delete your db.

@MButcho
Copy link
Author

MButcho commented Sep 14, 2023

Stopped lnd, removed wtclient.db, started lnd and added new tower:

lncli wtclient towers

{
    "towers": [
        {
            "pubkey": "somepubkey",
            "addresses": [
                "x.x.x.x:9911"
            ],
            "active_session_candidate": true,
            "num_sessions": 0,
            "sessions": [
            ],
            "session_info": [
                {
                    "active_session_candidate": true,
                    "num_sessions": 0,
                    "sessions": [
                    ],
                    "policy_type": "ANCHOR"
                },
                {
                    "active_session_candidate": true,
                    "num_sessions": 0,
                    "sessions": [
                    ],
                    "policy_type": "LEGACY"
                }
            ]
        }
    ]
}

lncli wtclient stats

{
    "num_backups": 0,
    "num_pending_backups": 0,
    "num_failed_backups": 0,
    "num_sessions_acquired": 0,
    "num_sessions_exhausted": 0
}

2023-09-14 22:26:48.687 [INF] WTCL: (legacy) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
2023-09-14 22:26:48.687 [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)

Lets see if it starts backuping again

@MButcho
Copy link
Author

MButcho commented Sep 15, 2023

Not even trying to connect tower now, no connection from the tower as well. Very strange, tower is present but I get following:

2023-09-15 12:00:31.620 [DBG] WTCL: (anchor) Unable to get new tower candidate, retrying after 5m0s -- reason: exhausted all tower candidates
2023-09-15 12:00:31.621 [DBG] WTCL: (legacy) Unable to get new tower candidate, retrying after 5m0s -- reason: exhausted all tower candidates

Later

2023-09-15 12:05:32.340 [DBG] WTCL: (anchor) Request for session negotiation with tower= failed, trying again -- reason: EOF
2023-09-15 12:05:32.340 [DBG] WTCL: (anchor) Session negotiation with tower=pubkey failed, trying again -- reason: session negotiation unsuccessful
2023-09-15 12:05:32.343 [DBG] WTCL: (legacy) Request for session negotiation with [email protected]:9911 failed, trying again -- reason: EOF
2023-09-15 12:05:32.343 [DBG] WTCL: (legacy) Session negotiation with tower=pubkey failed, trying again -- reason: session negotiation unsuccessful

I have restored old wtclient.db and backing up again. I keep it with the error for now. Better error in log than no backup ;)
lncli wtclient policy

{
    "max_updates": 1024,
    "sweep_sat_per_byte": 10,
    "sweep_sat_per_vbyte": 10
}

lncli wtclient stats

{
    "num_backups": 55,
    "num_pending_backups": 0,
    "num_failed_backups": 0,
    "num_sessions_acquired": 0,
    "num_sessions_exhausted": 0
}

@ellemouton
Copy link
Collaborator

Not even trying to connect tower now, no connection from the tower as well. Very strange, tower is present but I get following:

You should defs connect to a tower though.

Unable to get new tower candidate, retrying after 5m0s -- reason: exhausted all tower candidates

This just means that that the tower you've connected to is not responding. So you perhaps need to give it another address or something?

@MButcho
Copy link
Author

MButcho commented Sep 15, 2023

You maybe missed this part, I have added that to a comment later:

2023-09-15 12:05:32.340 [DBG] WTCL: (anchor) Request for session negotiation with tower= failed, trying again -- reason: EOF
2023-09-15 12:05:32.340 [DBG] WTCL: (anchor) Session negotiation with tower=pubkey failed, trying again -- reason: session negotiation unsuccessful
2023-09-15 12:05:32.343 [DBG] WTCL: (legacy) Request for session negotiation with [email protected]:9911 failed, trying again -- reason: EOF
2023-09-15 12:05:32.343 [DBG] WTCL: (legacy) Session negotiation with tower=pubkey failed, trying again -- reason: session negotiation unsuccessful

@MButcho
Copy link
Author

MButcho commented Sep 15, 2023

Update, there was DNS issue with the tower, I connected with new wtclient.db and all seems fine

@ellemouton
Copy link
Collaborator

Hi all,

Once #8239 makes it into a release (likely 0.18.0), then if you experience something like this issue, you should be able to manually terminate the problem session using the new lncli session terminate command.

Please comment here if the issue persists (once 0.18.0 has been released and the new command has been tried)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unintended code behaviour P2 should be fixed if one has time watchtower
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants