Investigate and fix panic #1208

Open
karlem opened this issue Nov 19, 2024 · 3 comments

Comments

@karlem
Contributor

karlem commented Nov 19, 2024

thread 'tokio-runtime-worker' panicked at /app/fendermint/vm/interpreter/src/fvm/state/snapshot.rs:252:56:
blocktore stores IPLD encoded data: Error { description: "InvalidUtf8(Utf8Error { valid_up_to: 3, error_len: Some(1) })", protocol: Cbor }
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <fendermint_vm_interpreter::fvm::state::snapshot::StateTreeStreamer<BS> as futures_core::stream::Stream>::poll_next
   4: <tokio_stream::stream_ext::fuse::Fuse<T> as futures_core::stream::Stream>::poll_next
   5: <tokio_stream::stream_ext::merge::Merge<T,U> as futures_core::stream::Stream>::poll_next
   6: futures_util::stream::stream::StreamExt::poll_next_unpin
   7: fendermint_vm_interpreter::fvm::state::snapshot::Snapshot<BS>::write_car::{{closure}}::{{closure}}
   8: tokio::runtime::task::core::Core<T,S>::poll
   9: tokio::runtime::task::harness::Harness<T,S>::poll
  10: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run
  12: tokio::runtime::context::runtime::enter_runtime
  13: tokio::runtime::scheduler::multi_thread::worker::run
  14: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  15: tokio::runtime::task::core::Core<T,S>::poll
  16: tokio::runtime::task::harness::Harness<T,S>::poll
  17: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-11-10T04:30:04.469282Z ERROR fendermint/app/src/main.rs:24: panicking stacktrace="   0: fendermint::init_panic_handler::{{closure}}\n   1: std::panicking::rust_panic_with_hook\n   2: std::panicking::begin_panic_handler::{{closure}}\n   3: std::sys::backtrace::__rust_end_short_backtrace\n   4: rust_begin_unwind\n   5: core::panicking::panic_fmt\n   6: core::result::unwrap_failed\n   7: <fendermint_vm_interpreter::fvm::state::snapshot::StateTreeStreamer<BS> as futures_core::stream::Stream>::poll_next\n   8: <tokio_stream::stream_ext::fuse::Fuse<T> as futures_core::stream::Stream>::poll_next\n   9: <tokio_stream::stream_ext::merge::Merge<T,U> as futures_core::stream::Stream>::poll_next\n  10: futures_util::stream::stream::StreamExt::poll_next_unpin\n  11: fendermint_vm_interpreter::fvm::state::snapshot::Snapshot<BS>::write_car::{{closure}}::{{closure}}\n  12: tokio::runtime::task::core::Core<T,S>::poll\n  13: tokio::runtime::task::harness::Harness<T,S>::poll\n  14: tokio::runtime::scheduler::multi_thread::worker::Context::run_task\n  15: tokio::runtime::scheduler::multi_thread::worker::Context::run\n  16: tokio::runtime::context::runtime::enter_runtime\n  17: tokio::runtime::scheduler::multi_thread::worker::run\n  18: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\n  19: tokio::runtime::task::core::Core<T,S>::poll\n  20: tokio::runtime::task::harness::Harness<T,S>::poll\n  21: tokio::runtime::blocking::pool::Inner::run\n  22: std::sys::backtrace::__rust_begin_short_backtrace\n  23: core::ops::function::FnOnce::call_once{{vtable.shim}}\n  24: std::sys::pal::unix::thread::Thread::new::thread_start\n  25: <unknown>\n  26: __clone\n" info="panicked at /app/fendermint/vm/interpreter/src/fvm/state/snapshot.rs:252:56:\nblocktore stores IPLD encoded data: Error { description: \"InvalidUtf8(Utf8Error { valid_up_to: 3, error_len: Some(1) })\", protocol: Cbor }"
2024-11-10T04:30:04.470016Z  WARN fendermint/vm/snapshot/src/manager.rs:141: failed to create snapshot error=failed to write CAR file

Caused by:
    task 606061 panicked with message "blocktore stores IPLD encoded data: Error { description: \"InvalidUtf8(Utf8Error { valid_up_to: 3, error_len: Some(1) })\", protocol: Cbor }"

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: fendermint_vm_snapshot::manager::SnapshotManager<BS>::create_snapshot::{{closure}}
   2: fendermint_vm_snapshot::manager::SnapshotManager<BS>::run::{{closure}}
   3: tokio::runtime::task::core::Core<T,S>::poll
   4: tokio::runtime::task::harness::Harness<T,S>::poll
   5: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   6: tokio::runtime::scheduler::multi_thread::worker::Context::run
   7: tokio::runtime::context::runtime::enter_runtime
   8: tokio::runtime::scheduler::multi_thread::worker::run
   9: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  10: tokio::runtime::task::core::Core<T,S>::poll
  11: tokio::runtime::task::harness::Harness<T,S>::poll
  12: tokio::runtime::blocking::pool::Inner::run

The panic originates here: https://github.com/consensus-shipyard/ipc/blob/main/fendermint/vm/interpreter/src/fvm/state/snapshot.rs#L252

@raulk
Contributor

raulk commented Nov 20, 2024

This panic happens during snapshot generation. Fendermint traverses the state tree to dump it into a CAR file. Since we use the DAG-CBOR codec extensively, we traverse links within objects addressed by CIDs with the DAG-CBOR multicodec in order to make sure we save all reachable objects.
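
To make the traversal concrete, here is a minimal sketch of walking reachable objects while following links only out of DAG-CBOR blocks. It is not the Fendermint implementation: `Cid`, `Block`, and `reachable` are hypothetical stand-ins, and the store holds pre-extracted links instead of decoding real IPLD data. The only fact taken from the multicodec table is that DAG-CBOR is `0x71` (raw bytes are `0x55`).

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Multicodec code for DAG-CBOR.
const DAG_CBOR: u64 = 0x71;

/// Hypothetical stand-in for a CID: a codec plus an opaque key.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Cid {
    pub codec: u64,
    pub key: u32,
}

/// Hypothetical stand-in for a stored block. Real code would decode the
/// block bytes to discover links; here the links are given up front.
pub struct Block {
    pub links: Vec<Cid>,
}

/// Breadth-first walk from `root`, returning every reachable CID in visit
/// order. Links are followed only out of DAG-CBOR blocks; blocks under any
/// other codec are treated as leaves.
pub fn reachable(store: &HashMap<Cid, Block>, root: Cid) -> Vec<Cid> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([root]);
    let mut out = Vec::new();
    while let Some(cid) = queue.pop_front() {
        // Skip anything already visited so shared subtrees are emitted once.
        if !seen.insert(cid.clone()) {
            continue;
        }
        if cid.codec == DAG_CBOR {
            if let Some(block) = store.get(&cid) {
                queue.extend(block.links.iter().cloned());
            }
        }
        out.push(cid);
    }
    out
}
```

In this sketch the codec in the CID is trusted to decide whether a block is decoded at all, which is why a block that claims DAG-CBOR but holds other data becomes a problem during the walk.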

What's happening here is that an object was saved with the state tree that happened to be invalid DAG-CBOR, yet it was addressed with that multicodec. So our assertion fails, which is correct, although it's rather disruptive to panic here.
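
For reading the error itself: the `Utf8Error { valid_up_to: 3, error_len: Some(1) }` in the log has the same shape as the standard library's `std::str::Utf8Error`. Assuming the CBOR decoder surfaces that error when validating a text string (an assumption, not confirmed in this thread), it would mean three bytes of the string were valid UTF-8 and the fourth was a single invalid byte. The helper name `check_text` and the sample payload below are illustrative only.

```rust
use std::str::Utf8Error;

/// Validate a byte payload as UTF-8, as a decoder has to do for any value
/// that is declared to be a text string.
pub fn check_text(payload: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(payload)
}

/// Turn the error fields into a human-readable description.
pub fn describe(payload: &[u8]) -> String {
    match check_text(payload) {
        Ok(s) => format!("valid text: {s:?}"),
        Err(e) => format!(
            "invalid UTF-8 after {} valid bytes (bad sequence length: {:?})",
            e.valid_up_to(),
            e.error_len()
        ),
    }
}
```

A payload such as `b"abc\xff"` fails this check with `valid_up_to() == 3` and `error_len() == Some(1)`, which is consistent with arbitrary binary data having been stored where text was expected.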

Instead of panicking, we should log the error as a warning and abort snapshot generation.
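
A rough sketch of that change, under simplifying assumptions: the real code is a `Stream` implementation whose `poll_next` unwraps the decode result, whereas this version uses a plain loop, and UTF-8 validation stands in for DAG-CBOR decoding. `DecodeError`, `decode`, `write_snapshot`, and `try_snapshot` are hypothetical names, not Fendermint APIs.

```rust
use std::fmt;

/// Hypothetical error for a block that does not decode under its declared codec.
#[derive(Debug, PartialEq)]
pub struct DecodeError {
    pub block_index: usize,
    pub reason: String,
}

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "block {} is not valid under its codec: {}", self.block_index, self.reason)
    }
}

/// Stand-in for IPLD decoding: a block counts as valid if it is UTF-8.
fn decode(index: usize, bytes: &[u8]) -> Result<String, DecodeError> {
    std::str::from_utf8(bytes)
        .map(str::to_owned)
        .map_err(|e| DecodeError { block_index: index, reason: e.to_string() })
}

/// Write blocks in order, stopping at the first one that fails to decode.
/// The failure is returned to the caller instead of being unwrapped.
pub fn write_snapshot(blocks: &[Vec<u8>], out: &mut Vec<String>) -> Result<usize, DecodeError> {
    for (i, bytes) in blocks.iter().enumerate() {
        // `?` replaces the `expect(..)`-style call that would panic here.
        out.push(decode(i, bytes)?);
    }
    Ok(blocks.len())
}

/// Caller side: log a warning and abandon this snapshot attempt.
pub fn try_snapshot(blocks: &[Vec<u8>]) -> Option<Vec<String>> {
    let mut out = Vec::new();
    match write_snapshot(blocks, &mut out) {
        Ok(_) => Some(out),
        Err(e) => {
            eprintln!("WARN failed to create snapshot: {e}");
            None
        }
    }
}
```

The design point is that the worker task stays alive: a bad block ends the current snapshot attempt with a logged warning, and the error carries enough context (here the block index) to help locate the offending object.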

That won't solve the original problem though. From what I understand this was reported by the Basin team (cc @sanderpick), so it's very likely that they're saving state tree objects under a DAG-CBOR multicodec that are not proper CBOR.

@raulk
Contributor

raulk commented Nov 27, 2024

@sam701 @stbrody were you able to hunt down the root cause and confirm that it relates to user code and not IPC code?

@stbrody

stbrody commented Nov 27, 2024

@raulk not yet. @joewagner is looking into this but hasn't been able to reproduce it. He's going to sync up with @sam701 soon to take another look.

Projects
Status: Backlog