[Bug]: Node AppHash'ing after v19.x.x upgrade #3344

Open
1 task done
a26nine opened this issue Sep 18, 2024 · 11 comments
Labels
status: waiting-triage This issue/PR has not yet been triaged by the team. type: bug Issues that need priority attention -- something isn't working

Comments

@a26nine

a26nine commented Sep 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Our cosmoshub-4 archive nodes stopped progressing after the v19 upgrade, so we downloaded the archive snapshot from QuickSync. The nodes progressed smoothly for a while, but then they AppHash'd. We waited a few days and downloaded another snapshot from the same source, and the result was the same: the nodes AppHash'd after some time. Once more, we waited a few days for a new snapshot, got it, and got AppHash'd again.

The most recent AppHash happened on v19.2.0:

Sep 18 08:42:30 cosmovisor[7778]: 8:42AM ERR Error in validation err="wrong Block.Header.AppHash.  Expected FD8867280ABAD4DF361A7285B14822FC3ABE20CB4BDF8B87DDDF152C9A772450, got A4E3A67B70A3FF6C5C6EA6C9B2BDA694E886C6D2B1710F51182843C5AA5A887B" module=blocksync

I am not sure who/what is the culprit here—the snapshot, the binary, or something else?

We rolled back a few times and cleared the wasm directory before starting gaiad. We also tried running with the pre-built binaries supplied in the Releases section. None of it helped.

Our build process:

Long Version:

./gaiad version --long
build_deps:
- cloud.google.com/[email protected]
- cloud.google.com/go/[email protected]
- cloud.google.com/go/auth/[email protected]
- cloud.google.com/go/compute/[email protected]
- cloud.google.com/go/[email protected]
- cloud.google.com/go/[email protected]
- cosmossdk.io/[email protected] => github.com/informalsystems/cosmos-sdk/[email protected]
- cosmossdk.io/client/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/[email protected]
- cosmossdk.io/tools/[email protected]
- cosmossdk.io/tools/[email protected]
- cosmossdk.io/x/[email protected]
- cosmossdk.io/x/[email protected]
- cosmossdk.io/x/[email protected]
- cosmossdk.io/x/[email protected]
- cosmossdk.io/x/[email protected]
- cosmossdk.io/x/[email protected]
- filippo.io/[email protected]
- github.com/99designs/[email protected] => github.com/cosmos/[email protected]
- github.com/CosmWasm/[email protected]
- github.com/CosmWasm/wasmvm/[email protected]
- github.com/DataDog/[email protected]+incompatible
- github.com/aws/[email protected]
- github.com/beorn7/[email protected]
- github.com/bgentry/[email protected]
- github.com/bgentry/[email protected]
- github.com/bits-and-blooms/[email protected]
- github.com/btcsuite/btcd/btcec/[email protected]
- github.com/cenkalti/backoff/[email protected]
- github.com/cespare/xxhash/[email protected]
- github.com/chzyer/[email protected]
- github.com/cockroachdb/apd/[email protected]
- github.com/cockroachdb/[email protected]
- github.com/cockroachdb/[email protected]
- github.com/cockroachdb/[email protected]
- github.com/coinbase/[email protected]
- github.com/cometbft/[email protected]
- github.com/cometbft/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/[email protected] => github.com/cosmos/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/[email protected]
- github.com/cosmos/ibc-apps/middleware/packet-forward-middleware/[email protected]
- github.com/cosmos/ibc-apps/modules/rate-limiting/[email protected]
- github.com/cosmos/ibc-go/modules/[email protected]
- github.com/cosmos/ibc-go/[email protected]
- github.com/cosmos/ics23/[email protected]
- github.com/cosmos/interchain-security/[email protected]
- github.com/cosmos/[email protected]
- github.com/creachadair/[email protected]
- github.com/creachadair/[email protected]
- github.com/davecgh/[email protected]
- github.com/decred/dcrd/dcrec/secp256k1/[email protected]
- github.com/desertbit/[email protected]
- github.com/distribution/[email protected]
- github.com/dvsekhvalnov/[email protected]
- github.com/emicklei/[email protected]
- github.com/fatih/[email protected]
- github.com/felixge/[email protected]
- github.com/fsnotify/[email protected]
- github.com/getsentry/[email protected]
- github.com/go-kit/[email protected]
- github.com/go-kit/[email protected]
- github.com/go-logfmt/[email protected]
- github.com/go-logr/[email protected]
- github.com/go-logr/[email protected]
- github.com/godbus/[email protected]
- github.com/gogo/[email protected]
- github.com/gogo/[email protected]
- github.com/golang/[email protected]
- github.com/golang/[email protected]
- github.com/golang/[email protected]
- github.com/golang/[email protected]
- github.com/google/[email protected]
- github.com/google/[email protected]
- github.com/google/[email protected]
- github.com/google/[email protected]
- github.com/google/[email protected]
- github.com/google/[email protected]
- github.com/googleapis/[email protected]
- github.com/googleapis/gax-go/[email protected]
- github.com/gorilla/[email protected]
- github.com/gorilla/[email protected]
- github.com/gorilla/[email protected]
- github.com/grpc-ecosystem/[email protected]
- github.com/grpc-ecosystem/[email protected]
- github.com/gsterjov/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/golang-lru/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hashicorp/[email protected]
- github.com/hdevalence/[email protected]
- github.com/huandu/[email protected]
- github.com/iancoleman/[email protected]
- github.com/iancoleman/[email protected]
- github.com/improbable-eng/[email protected]
- github.com/jmespath/[email protected]
- github.com/klauspost/[email protected]
- github.com/kr/[email protected]
- github.com/kr/[email protected]
- github.com/lib/[email protected]
- github.com/magiconair/[email protected]
- github.com/manifoldco/[email protected]
- github.com/mattn/[email protected]
- github.com/mattn/[email protected]
- github.com/minio/[email protected]
- github.com/mitchellh/[email protected]
- github.com/mitchellh/[email protected]
- github.com/mitchellh/[email protected]
- github.com/mtibben/[email protected]
- github.com/oasisprotocol/[email protected]
- github.com/oklog/[email protected]
- github.com/opencontainers/[email protected]
- github.com/pelletier/go-toml/[email protected]
- github.com/pkg/[email protected]
- github.com/pmezard/[email protected]
- github.com/prometheus/[email protected]
- github.com/prometheus/[email protected]
- github.com/prometheus/[email protected]
- github.com/prometheus/[email protected]
- github.com/rakyll/[email protected]
- github.com/rcrowley/[email protected]
- github.com/rogpeppe/[email protected]
- github.com/rs/[email protected]
- github.com/rs/[email protected]
- github.com/sagikazarmark/[email protected]
- github.com/skip-mev/[email protected]
- github.com/spf13/[email protected]
- github.com/spf13/[email protected]
- github.com/spf13/[email protected]
- github.com/spf13/[email protected]
- github.com/spf13/[email protected]
- github.com/stretchr/[email protected]
- github.com/subosito/[email protected]
- github.com/syndtr/[email protected] => github.com/syndtr/[email protected]
- github.com/tendermint/[email protected]
- github.com/tidwall/[email protected]
- github.com/ulikunitz/[email protected]
- [email protected]
- go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]
- go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]
- go.opentelemetry.io/[email protected]
- go.opentelemetry.io/otel/[email protected]
- go.opentelemetry.io/otel/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- golang.org/x/[email protected]
- google.golang.org/[email protected]
- google.golang.org/[email protected]
- google.golang.org/genproto/googleapis/[email protected]
- google.golang.org/genproto/googleapis/[email protected]
- google.golang.org/[email protected]
- google.golang.org/[email protected]
- gopkg.in/[email protected]
- gopkg.in/[email protected]
- gopkg.in/[email protected]
- gotest.tools/[email protected]
- nhooyr.io/[email protected]
- pgregory.net/[email protected]
- sigs.k8s.io/[email protected]
build_tags: netgo
commit: 1d52b3d434d5d78561c3628ef351b88890ec7f7f
cosmos_sdk_version: v0.50.9-lsm
go: go version go1.22.6 linux/amd64
name: gaia
server_name: gaiad
version: v19.2.0

Gaia Version

v19.2.0

How to reproduce?

  • Download archive snapshot from QuickSync
  • Unpack it
  • Build & Install relevant gaiad binary
  • Start gaiad

The node will AppHash after some time.

@a26nine a26nine added status: waiting-triage This issue/PR has not yet been triaged by the team. type: bug Issues that need priority attention -- something isn't working labels Sep 18, 2024
@github-project-automation github-project-automation bot moved this to 🩹 F1: Triage in Cosmos Hub Sep 18, 2024
@MSalopek
Contributor

Thank you for raising this concern, I'm sorry you are facing issues.

Have you tried any other node (default, pruned) from quicksync?

We can try and get in contact with quicksync and help them debug.

It is odd that this is happening so frequently, but we have had reports of AppHash errors tied to wasm directories ever since we introduced CosmWasm.

If possible, it would be beneficial to try and replicate this on a smaller node (to reduce debugging times). If this issue happens with other quicksync snapshots but not on polkachu or nodestake it could point to a slight misconfiguration in quicksync's export procedure that can be mitigated.

@a26nine
Author

a26nine commented Sep 18, 2024

I forgot to mention, we downloaded Polkachu's pruned snapshot, and it's running fine with the same binary without any issues.

@a26nine
Author

a26nine commented Sep 23, 2024

@MSalopek, did you get a chance to check with the QuickSync team?

@mayank-daga

@MSalopek, I am seeing similar issues too.

@MSalopek
Contributor

MSalopek commented Oct 3, 2024

ChainLayer has been contacted. Updates will be posted as they reach us.

@MSalopek
Contributor

The issue seems to be solved on Quicksync's end.

Feel free to resync from the newest snapshot.

@mayank-daga @a26nine

@a26nine
Author

a26nine commented Oct 21, 2024

No, it's not resolved. We are running pruned nodes for now.

@jgrebowicz-ledger

jgrebowicz-ledger commented Oct 28, 2024

I can confirm that it's not resolved yet. I re-downloaded the archive for 2 of our RPC nodes after the message that it had been fixed on Quicksync's end, but it keeps failing.
We've run a pruned node as a backup, but it tends to fail too after a while...

@MSalopek
Contributor

Sorry to hear that this is still persisting.

We could provide instructions for a stop-gap solution that you could execute. The solution would require syncing an old gaia node instance and performing upgrades at designated block heights.

Unfortunately, there is no other action we can take here beyond checking in with quicksync to help troubleshoot.

I will keep this issue open and close all other related issues.

@jgrebowicz-ledger

I'll reach out to you if we decide to follow the stop-gap solution.
Unfortunately, we're having problems with AppHash no matter which snapshot we use. We still have one node that has been running for a long time on the default snapshot, and it runs fine, but if we spin up a new node with the same config and download a new snapshot, it fails after a while. (It's actually the same with archival.)

Yesterday we once again downloaded the latest archival snapshot, but it failed after a while:
12:07AM INF finalized block block_app_hash=24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75 height=22838854 module=state num_txs_res=4 num_val_updates=1
12:07AM INF executed block app_hash=24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75 height=22838854 module=state
12:07AM INF updates to validators module=state updates=5A59DC8746FD727FDDD5CBF5CBB90C6F616CCF9B:3596564
12:07AM INF committed state block_app_hash=0261BEC7EC8EFFF3ABB850402C54B78A81D0B4ABAC9418D2DE3E2D495E09AEA6 height=22838854 module=state
12:07AM ERR Error in validation err="wrong Block.Header.AppHash. Expected 24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75, got 05012E467D0657717BD073AE4A25E3F71B3C85BDF1C1FC8AE35B6AE9391CB372" module=blocksync
12:07AM ERR Stopping peer for error err="reactor validation error: wrong Block.Header.AppHash. Expected 24F4B044B767AFD73F14A5DC1E930CD2E685A80B93347E1216C636E762BDCC75, got 05012E467D0657717BD073AE4A25E3F71B3C85BDF1C1FC8AE35B6AE9391CB372" module=p2p peer="Peer{MConn{141.94.73.39:37656} 2bda8bff758a39916a528c6b70eefad9148d09ce out}"
12:07AM
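
For anyone comparing failures across several nodes, here is a minimal sketch of pulling the relevant values out of log output like the above. It assumes the message format seen in this thread ("wrong Block.Header.AppHash. Expected ..., got ..." and "height=N" fields). The function name `summarize_apphash_error` is hypothetical and not part of gaiad or CometBFT. The height it returns is simply the last `height=` value logged before the error, and the sketch makes no claim about which block actually diverged.

```python
import re

# Matches the mismatch message as it appears in the logs quoted in this thread.
MISMATCH_RE = re.compile(
    r"wrong Block\.Header\.AppHash\.\s+"
    r"Expected (?P<expected>[0-9A-Fa-f]+), got (?P<got>[0-9A-Fa-f]+)"
)
HEIGHT_RE = re.compile(r"\bheight=(\d+)")


def summarize_apphash_error(log_text):
    """Return (last_height_before_error, expected, got), or None if no mismatch.

    last_height_before_error is the last `height=` value that appears in the
    text before the first mismatch message, or None if there is none.
    """
    match = MISMATCH_RE.search(log_text)
    if match is None:
        return None
    heights = HEIGHT_RE.findall(log_text[: match.start()])
    last_height = int(heights[-1]) if heights else None
    return last_height, match.group("expected"), match.group("got")
```

Running this over the journal output of each failing node makes it easy to see whether they all report the same expected/got pair after the same logged height.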

@jgrebowicz-ledger

jgrebowicz-ledger commented Nov 18, 2024

@a26nine
@MSalopek
The issue was that, once wasm was introduced, it became part of the snapshot too.
We were unpacking everything into the /data directory, so we lost the "wasm" directory each time we spun up a node from a snapshot. After fixing that, it's working without issues again.

Before wasm, the downloaded snapshot contained just one directory: data.
Now there are two, data and wasm, so if somebody runs into this issue, please verify how you unpack the snapshot ;)
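
A rough way to check for this before starting the node is sketched below. It rests on a few assumptions: the snapshot is available as a tar archive that Python's `tarfile` can open (an lz4-compressed download would need to be decompressed first), and the node home is expected to contain `data` and `wasm` side by side, as described above. The helper names `top_level_entries` and `missing_after_unpack` are made up for illustration.

```python
import tarfile
from pathlib import Path

# Per the comment above: current snapshots ship both directories.
REQUIRED_DIRS = ("data", "wasm")


def top_level_entries(archive_path):
    """Return the set of top-level names inside a tar archive."""
    names = set()
    with tarfile.open(archive_path) as tar:
        for member_name in tar.getnames():
            parts = Path(member_name).parts
            if parts:
                names.add(parts[0])
    return names


def missing_after_unpack(node_home, required=REQUIRED_DIRS):
    """List required directories that are absent from the node home."""
    home = Path(node_home)
    return [name for name in required if not (home / name).is_dir()]
```

If `top_level_entries` reports both `data` and `wasm` but `missing_after_unpack` on the node home returns `["wasm"]`, the archive was most likely extracted one level too deep (for example into the data directory itself).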

Projects
Status: 🩹 F1: Triage
Development

No branches or pull requests

4 participants