Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy callstack API #4033

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

g0djan
Copy link
Contributor

@g0djan g0djan commented Jan 17, 2025

New WAMR public API to copy runtime call stack frames.

CAUTION: this APIs is not thread safe, that's why it's hidden behind feature flag for now. If you need to call it from another thread ensure the passed exec_env is suspended.

Our use case

Sometimes WAMR runtime gets stuck in production and we have no data where in the code compiled to WASM it happens. We currently only track such situations in a separate native thread. To increase visibility into the problem we developed internal solution that requires presence of this API in WAMR. If a separate thread finds that the WASM VM thread has stuck, it interrupts it with a user defined signal and calls this API to collect callstack. The main complexity is maintaining async-signal-safety and avoiding segfaults. For that we're maintaining atomic copies of exec_env, exec_env->module_inst, exec_env->module_inst->module. Those copies are always set to NULL before the referenced memory is freed. Before a call to this API those copies are always checked for validity. In our use case scenario we guarantee ourselves only absence of crashes but we realize that the frame data that we collect might be invalidated due to a signal interruption. However it's highly unlikely and is not a concern for us.

Have we tried existing WAMR APIs for our usecase?

Yes, we've tried suggested by maintainers wasm_cluster_suspend_thread and wasm_runtime_terminate.

  1. In our production runtime often recovers from being stuck, so wasm_runtime_terminate is not a good option for us to report the call stack
  2. The wasm_cluster_suspend_thread doesn't suit us either. Even if it did we'd still need API to iterate over stackframes.

@g0djan
Copy link
Contributor Author

g0djan commented Jan 28, 2025

@loganek addressed all your comments and rebased to fix the checks. The last failing check is CI issue and there's another PR that will fix it.
Let me know if you have more comments

Copy link
Collaborator

@loganek loganek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, if possible, consider adding tests.

@g0djan
Copy link
Contributor Author

g0djan commented Jan 28, 2025

@lum1n0us could you take a look at this PR?

@g0djan
Copy link
Contributor Author

g0djan commented Jan 30, 2025

@yamt could you please take a look at this PR?

* interruption from another thread if next variables hold valid pointers
* - exec_env
* - exec_env->module_inst
* - exec_env->module_inst->module
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, making this complex functionality async-signal-safe is too much maintenance burden, especially when wamr itself doesn't rely on the property at all.

given that you need to suspend the target thread anyway, why don't you call this from another thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

async-signal-safe is too much maintenance burden

Yeah, we understand that and I kept it simple as much as possible. Basically non async-signal-safe implementation would be different only in a few checks removed and there wouldn't be comments that I added in the code.

Particularly this comment about validity of pointers is a theoretical problem atm. We don't know any platform yet where updating pointer variable might be interrupted by a signal, possibly after launch we will see that it never happens.

given that you need to suspend the target thread anyway, why don't you call this from another thread?

Do you mean using wasm_cluster_suspend_thread?
I tried that and there're 2 problems for us:

  • Now there’s no awaiting till thread actually gets suspended
  • Suspension happens only after certain checks so we're not getting stacktraces that we need. E.g. if there's sleep somewhere, there won't be sleep in stacktrace reported, it'd report calls after sleep has finished. If there's a deadlock stacktrace won't be reported at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@g0djan g0djan Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yamt @TianlongLiang @wenyongh Hey guys could someone reply here or provide further review please? Sorry for pinging multiple times

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

async-signal-safe is too much maintenance burden

Yeah, we understand that and I kept it simple as much as possible. Basically non async-signal-safe implementation would be different only in a few checks removed and there wouldn't be comments that I added in the code.

Particularly this comment about validity of pointers is a theoretical problem atm. We don't know any platform yet where updating pointer variable might be interrupted by a signal, possibly after launch we will see that it never happens.

given that you need to suspend the target thread anyway, why don't you call this from another thread?

Do you mean using wasm_cluster_suspend_thread?

i was not thinking about a specific way to suspend a thread.
this api is not expected to work on a running thread, is it?

I tried that and there're 2 problems for us:

* Now there’s no awaiting till thread actually gets suspended

* Suspension happens only after certain checks so we're not getting stacktraces that we need. E.g. if there's sleep somewhere, there won't be sleep in stacktrace reported, it'd report calls after sleep has finished. If there's a deadlock stacktrace won't be reported at all

wrt how to suspend a thread, i thought you were planning to use a signal to interrupt blocking system calls, right?
to avoid the signal affecting the target application behavior, depending on the kind of system calls, you might need to somehow wrap the system calls.
you might want to adapt the existing blocking-syscall wrapper api: https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/iwasm/common/wasm_blocking_op.c
after all, i suspect we will end up with something which shares the guts of the underlying machinery with wasm_runtime_terminate.

or, maybe things like ptrace can be more flexible for your purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this api is not expected to work on a running thread, is it?

It's not. But I don't see that much benefit in calling this API from another thread compared to signal handler. Ensuring validity of pointers that are about to be dereferenced is the main restriction and it remains due to asynchronous nature of signal delivery

Copy link
Contributor Author

@g0djan g0djan Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on the kind of system calls, you might need to somehow wrap the system calls

is there an example that I could check out?

@lum1n0us
Copy link
Collaborator

  • It looks like we could refactor the existing APIs wasm_runtime_dump_call_stack_to_buf() and wasm_runtime_dump_call_stack() using the new API wasm_iterate_callstack(). At least print.
  • Rather than "interrupting it with a user-defined signal and calling this API to collect the call stack," I would prefer if the runtime, whether interpreter or AOT, could receive the signal and begin dumping the call stack on its own. Even better would be if the runtime could detect an infinite loop, terminate it, and then dump the call stack. With something like --fuel or --timeout. Allow finer control of start function on instantiation #4072 (comment)
  • I strongly hope this API can be made thread-safe. As a general solution, it would be extremely valuable.

@g0djan
Copy link
Contributor Author

g0djan commented Feb 14, 2025

It looks like we could refactor the existing APIs wasm_runtime_dump_call_stack_to_buf() and wasm_runtime_dump_call_stack() using the new API wasm_iterate_callstack(). At least print.

Should be possible yeah, I can try

I would prefer if the runtime, whether interpreter or AOT, could receive the signal and begin dumping the call stack on its own.

It's possible to bring this logic to WAMR, but if we are to do this I would like to have wasm_iterate_callstack API available first and then open a separate PR for dumping callstack on signal interruption

I strongly hope this API can be made thread-safe

This doesn't go well with async-signal-safety, that's why it's not part of it. But it seems you'd like to have a general approach to inspect stack in a safe manner rather than thread-safety itself

@g0djan
Copy link
Contributor Author

g0djan commented Feb 17, 2025

It looks like we could refactor the existing APIs wasm_runtime_dump_call_stack_to_buf() and wasm_runtime_dump_call_stack() using the new API wasm_iterate_callstack(). At least print.

I started refactoring that, wasm_runtime_dump_call_stack can't just call proposed wasm_iterate_callstack as taking lock and callback both assume aot/wasm specifics.

So possible change comes down to replacing this loop for wasm_interp
or aot with [aot/wasm_interp]_iterate_callstack call and internals of the loop are moved inside of the callback.

But it doesn't make code simpler imo, it just adds more boilerplate internally. Do you see it useful/meaningful to do so @lum1n0us ?

@g0djan g0djan changed the title Iterate callstack API Copy callstack API Feb 24, 2025
@g0djan
Copy link
Contributor Author

g0djan commented Feb 24, 2025

@lum1n0us I addressed security concerns and made API read-only by reducing it to a copy API instead of using user defined callback. Also I hid the feature behind a feature flag to avoid accidental use of a non thread safe API

Could you please take a look at it?

Copy link
Collaborator

@lum1n0us lum1n0us left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few questions regarding the design of the APIs

Copy link
Collaborator

@lum1n0us lum1n0us left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few comments.

bool is_top_index_in_range =
top_boundary >= top && top >= (bottom + sizeof(AOTTinyFrame));
if (!is_top_index_in_range) {
return count;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a critical error; please make sure to log it.

bool is_top_aligned_with_bottom =
(unsigned long)(top - bottom) % sizeof(AOTTinyFrame) == 0;
if (!is_top_aligned_with_bottom) {
return count;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a critical error; please make sure to log it.

if (!cur_frame->function) {
cur_frame = cur_frame->prev_frame;
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I observed that the PR only skips this case in interpreter mode. Is this intentional?

@g0djan
Copy link
Contributor Author

g0djan commented Feb 27, 2025

just a few comments.

addressed all

@g0djan g0djan requested a review from lum1n0us February 27, 2025 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants