Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: thread has overflowed its stack #816

Open
enitrat opened this issue Sep 26, 2024 · 8 comments
Open

bug: thread has overflowed its stack #816

enitrat opened this issue Sep 26, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@enitrat
Copy link

enitrat commented Sep 26, 2024

Repro:
scarb 2.8.3
clone https://github.com/kkrt-labs/kakarot-ssj on main
scarb build -p contracts - builds the required contract with some debug print logs

clone https://github.com/kkrt-labs/ef-tests
ensure correct branch git checkout feat/cairo-native
make
make setup-kakarot
copy the contracts built previously:cp ../kakarot-ssj/target/dev/contracts_* build/v1
cargo test test_baseFeeDiffPlaces_d34g0v0_Cancun --features v1,native -- --nocapture

output:

thread 'stEIP1559::test_baseFeeDiffPlaces_d34g0v0_Cancun' has overflowed its stack
fatal runtime error: stack overflow
@azteca1998
Copy link
Collaborator

Could you check where is the stack overflowing? Depending on where it is, it could mean different things:

  • If it's overflowing in LLVM it may be a bug on their part.
  • If it's overflowing in the generated library then the contract probably has a recursive function that is not tail-recursive.

In the first case we should report it to LLVM and wait until it gets solved (or try to solve it ourselves, of course).

The latter case is just a limitation of the current architectures and there's not much more we can do other than just invoking the contract with a bigger stack, but you could end up with gigabytes of stack depending on the program.

Some lore about this: Sierra doesn't have the concept of loops; all loops are implemented as recursive functions. Every iteration of the recursive function will use part of the stack. We implemented a tail recursion to iteration conversion for tail-recursive functions so that we can run loop with more than a few hundred/thousand iterations, but this transformation only works for tail-recursive functions. Non-tail-recursive functions are still implemented normally. More information in src/metadata/tail_recursion.rs.

A workaround for tests which are known to fail with stack overflow can be to run the test in a separate thread with a custom size stack. Rust provides an API to do that.

@enitrat
Copy link
Author

enitrat commented Sep 26, 2024

@azteca1998 im not super familiar with debugging rust panics like this: what's the usual process to find where the stack is overflowing?

@enitrat
Copy link
Author

enitrat commented Sep 26, 2024

ok - more update on this:

I ran the test with export RUST_MIN_STACK=16777216 to increase the stack size.
Now, it passes.

More details:

this specific test has a lot of recursions, running 1024 nested ethereum calls, and is quite heavy. Thus I think that's why we hit the default rust stack limits.

@azteca1998
Copy link
Collaborator

Oh, I see. If I understood correctly it's not that there's a non-tail-recursive loop but rather that there's a lot of nested contract calls, right? We hadn't seen any crash because of that yet.

This is (more or less) the same case as a stack overflow due to non-tail-recursive loop; that is, we can't do much more than increase the stack size. There is one difference though: we can do something about it. Once rust stabilizes coroutines we could make the runner use the coroutine, which should have its own separate stack instead of the main one. A different approach would be to start them in threads but they're not as cheap.

@enitrat
Copy link
Author

enitrat commented Sep 26, 2024

Oh, I see. If I understood correctly it's not that there's a non-tail-recursive loop but rather that there's a lot of nested contract calls, right? We hadn't seen any crash because of that yet.

Actually, there's only a single Starknet contract call to the Kakarot main contract. However, there are roughly 1024 EVM contract calls executed inside this test, which is perform a lot of operations. So in the end it's just a very heavy computation

@azteca1998
Copy link
Collaborator

Then it might be that the VM's main loop is not tail recursive.

@rodrigo-pino
Copy link
Contributor

I have a question, if the VM is more resource intensive than Native (thinking on the fact that segments can only grow), shouldn't VM execution crash as well?

@azteca1998
Copy link
Collaborator

Not really. As you said, the VM uses segments which are just Vec<_>. Every .push(_) or (safe) operation will check the vector length and act accordingly.

This is not the case in native because we're using the CPU's stack, which is just a pointer. The program doesn't know how much stack it has remaining, and even if we know I'm not sure how we could manipulate the MLIR to check that and avoid crashes.

What could work is hooking into the SIGSEGV process signal and "guess" if it was a stack overflow or a different cause for the segfault, then try to recover the execution from there. I think I tried something like that once for a personal project, but that's definitely 100% undefined behavior which we do not want in native.

@igaray igaray added the bug Something isn't working label Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants