Gateway segfault when stressing the system #5

Open
suraj44 opened this issue Nov 7, 2021 · 7 comments

@suraj44

suraj44 commented Nov 7, 2021

I have 4 machines each with 12 CPU cores and 64GB RAM. I deploy the Nightcore gateway on one of them and on each of the other three, I deploy an instance of the engine and a launcher for a hello-world function.

I have 3 other machines which act as clients and invoke the hello-world function by sending HTTP POST requests to the gateway. The segfault occurs only when there is a large number of client threads (10 or 14 client threads on each client machine). In the middle of the experiment, the gateway returns the following error in the log:

When I first encountered the problem, the segfault happened at uv__count_bufs, but in my latest attempt to reproduce the error I got the following message from gdb:

Thread 3 "gateway" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7228700 (LWP 275299)]
__GI___libc_free (mem=0xffffffff00000000) at malloc.c:3102

and in the tail of the log, there was this message:

3102    malloc.c: No such file or directory.

Any ideas as to what causes this problem and how to resolve it?

Some more info that might be useful: When I deploy the engine and launcher on only 2 other machines (instead of 3), then this error does not show up regardless of the number of client threads I use to stress the system. The minWorkers and maxWorkers config parameters for the function are 20 and 80 respectively.

@zhipeng-jia
Member

This is an interesting issue. I cannot immediately think of which part of the code could segfault under heavy load.

One question: what is the (rough) aggregate QPS from the 3 client machines? I'll see if I can reproduce this problem on my machines.

@suraj44
Author

suraj44 commented Nov 11, 2021

When using 10 threads on each client machine, the aggregate QPS was about 17k. The segfault occurs once in a while with this many threads. If I increase the number of threads to 14 per client machine, it happens every time I rerun the experiment, about 10 to 20 seconds into each run.

Thanks for looking into this!

@zhipeng-jia
Member

Could you verify whether the segfault is caused by an rlimit, e.g., the maximum number of open file descriptors?
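For reference, the limit in question can be read from inside a process with the POSIX `getrlimit` call and the `RLIMIT_NOFILE` resource; this is the same value that `ulimit -n` reports in a shell. The snippet below is only a minimal sketch: the helper names `QueryFdLimit` and `PrintFdLimit` are made up for illustration and are not part of Nightcore.

```cpp
#include <sys/resource.h>

#include <cassert>
#include <cstdio>

// Hypothetical helper (not part of Nightcore): reads the soft and hard
// limits on open file descriptors for the calling process.
// Returns false if getrlimit() fails.
bool QueryFdLimit(rlim_t* soft, rlim_t* hard) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return false;
    }
    *soft = lim.rlim_cur;  // limit currently enforced (what `ulimit -n` shows)
    *hard = lim.rlim_max;  // ceiling up to which the soft limit may be raised
    return true;
}

// Hypothetical helper: prints both limits to stderr.
// A value equal to RLIM_INFINITY means "no limit".
void PrintFdLimit() {
    rlim_t soft = 0;
    rlim_t hard = 0;
    if (!QueryFdLimit(&soft, &hard)) {
        std::perror("getrlimit");
        return;
    }
    std::fprintf(stderr, "RLIMIT_NOFILE: soft=%llu hard=%llu\n",
                 static_cast<unsigned long long>(soft),
                 static_cast<unsigned long long>(hard));
}
```

Calling something like `PrintFdLimit()` at gateway startup would show the limit the process actually runs with, which can differ from the limit of the shell where `ulimit -n` was checked.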

@DCsunset

DCsunset commented Aug 11, 2023

Hi @zhipeng-jia, I encountered the same error. I tried increasing the maximum number of open file descriptors (e.g. ulimit -n 102400), but the error still occurs. The error message has only two lines:

PC: @     0x55d4462a8180  (unknown)  uv__count_bufs
    @ ... and at least 1 more frames

It seems to be related to the libuv library. Any ideas about what causes it?

Thanks

@DCsunset

DCsunset commented Aug 16, 2023

Hi, I enabled debug mode and AddressSanitizer, and I was able to locate the bug:

=================================================================
==3106988==ERROR: AddressSanitizer: heap-use-after-free on address 0x61b0000f51b0 at pc 0x563fc372d9b6 bp 0x7fc2f66fb340 sp 0x7fc2f66fb330
READ of size 8 at 0x61b0000f51b0 thread T3
    #0 0x563fc372d9b5 in faas::server::IOWorker::PipeWriteCallback(uv_write_s*, int) src/server/io_worker.cpp:223
    #1 0x563fc38575b0 in uv__write_callbacks (nightcore/bin/debug/gateway+0x28c5b0)
    #2 0x563fc38580af in uv__stream_io (nightcore/bin/debug/gateway+0x28d0af)
    #3 0x563fc38550ec in uv_run (nightcore/bin/debug/gateway+0x28a0ec)
    #4 0x563fc372a984 in faas::server::IOWorker::EventLoopThreadMain() src/server/io_worker.cpp:168
    #5 0x563fc374c37c in decltype (((*((declval<faas::server::IOWorker*&>)())).*((declval<void (faas::server::IOWorker::*&)()>)()))()) absl::lts_2020_02_25::base_internal::MemFunAndPtr::Invoke<void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&>(void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&) deps/out/include/absl/base/internal/invoke.h:105
    #6 0x563fc374966e in decltype (absl::lts_2020_02_25::base_internal::Invoker<void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&>::type::Invoke((declval<void (faas::server::IOWorker::*&)()>)(), (declval<faas::server::IOWorker*&>)())) absl::lts_2020_02_25::base_internal::Invoke<void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&>(void (faas::server::IOWorker::*&)(), faas::server::IOWorker*&) deps/out/include/absl/base/internal/invoke.h:180
    #7 0x563fc3742f2a in void absl::lts_2020_02_25::functional_internal::Apply<void, absl::lts_2020_02_25::container_internal::CompressedTuple<void (faas::server::IOWorker::*)(), faas::server::IOWorker*>&, 0ul, 1ul>(absl::lts_2020_02_25::container_internal::CompressedTuple<void (faas::server::IOWorker::*)(), faas::server::IOWorker*>&, absl::lts_2020_02_25::integer_sequence<unsigned long, 0ul, 1ul>) deps/out/include/absl/functional/internal/front_binder.h:36
    #8 0x563fc37397e7 in void absl::lts_2020_02_25::functional_internal::FrontBinder<void (faas::server::IOWorker::*)(), faas::server::IOWorker*>::operator()<, void>() & deps/out/include/absl/functional/internal/front_binder.h:56
    #9 0x563fc3734187 in std::_Function_handler<void (), absl::lts_2020_02_25::functional_internal::FrontBinder<void (faas::server::IOWorker::*)(), faas::server::IOWorker*> >::_M_invoke(std::_Any_data const&) /usr/include/c++/9/bits/std_function.h:300
    #10 0x563fc3731c07 in std::function<void ()>::operator()() const /usr/include/c++/9/bits/std_function.h:688
    #11 0x563fc3808558 in faas::base::Thread::Run() src/base/thread.cpp:41
    #12 0x563fc38086ae in faas::base::Thread::StartRoutine(void*) src/base/thread.cpp:90
    #13 0x7fc2fb274608 in start_thread /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_create.c:477
    #14 0x7fc2fae47292 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x122292)

0x61b0000f51b0 is located 304 bytes inside of 1608-byte region [0x61b0000f5080,0x61b0000f56c8)
freed by thread T9 here:
    #0 0x7fc2fb39f025 in operator delete(void*, unsigned long) (/lib/x86_64-linux-gnu/libasan.so.5+0x111025)
    #1 0x563fc364a430 in faas::gateway::HttpConnection::~HttpConnection() (nightcore/bin/debug/gateway+0x7f430)
    #2 0x563fc37ff6ff in std::_Sp_counted_ptr<faas::gateway::HttpConnection*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/include/c++/9/bits/shared_ptr_base.h:377
    #3 0x563fc365befe in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/9/bits/shared_ptr_base.h:155
    #4 0x563fc36577ef in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/9/bits/shared_ptr_base.h:730
    #5 0x563fc3714457 in std::__shared_ptr<faas::server::ConnectionBase, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/9/bits/shared_ptr_base.h:1169
    #6 0x563fc3714477 in std::shared_ptr<faas::server::ConnectionBase>::~shared_ptr() /usr/include/c++/9/bits/shared_ptr.h:103
    #7 0x563fc371e847 in std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >::~pair() /usr/include/c++/9/bits/stl_pair.h:208
    #8 0x563fc371e86b in void __gnu_cxx::new_allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >::destroy<std::pair<int, std::shared_ptr<faas::server::ConnectionBase> > >(std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >*) /usr/include/c++/9/ext/new_allocator.h:153
    #9 0x563fc371d127 in decltype (({parm#2}.destroy)({parm#3})) absl::lts_2020_02_25::allocator_traits<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::destroy_impl<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >, std::pair<int, std::shared_ptr<faas::server::ConnectionBase> > >(int, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >&, std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >*) deps/out/include/absl/memory/memory.h:587
    #10 0x563fc371c7ab in void absl::lts_2020_02_25::allocator_traits<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::destroy<std::pair<int, std::shared_ptr<faas::server::ConnectionBase> > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >&, std::pair<int, std::shared_ptr<faas::server::ConnectionBase> >*) (nightcore/bin/debug/gateway+0x1517ab)
    #11 0x563fc371bc86 in void absl::lts_2020_02_25::container_internal::map_slot_policy<int, std::shared_ptr<faas::server::ConnectionBase> >::destroy<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >*, absl::lts_2020_02_25::container_internal::map_slot_type<int, std::shared_ptr<faas::server::ConnectionBase> >*) (nightcore/bin/debug/gateway+0x150c86)
    #12 0x563fc371a458 in void absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >::destroy<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >*, absl::lts_2020_02_25::container_internal::map_slot_type<int, std::shared_ptr<faas::server::ConnectionBase> >*) deps/out/include/absl/container/flat_hash_map.h:561
    #13 0x563fc3718128 in void absl::lts_2020_02_25::container_internal::hash_policy_traits<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, void>::destroy<std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >(std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > >*, absl::lts_2020_02_25::container_internal::map_slot_type<int, std::shared_ptr<faas::server::ConnectionBase> >*) deps/out/include/absl/container/internal/hash_policy_traits.h:84
    #14 0x563fc3717885 in absl::lts_2020_02_25::container_internal::raw_hash_set<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, absl::lts_2020_02_25::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::erase(absl::lts_2020_02_25::container_internal::raw_hash_set<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, absl::lts_2020_02_25::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::iterator) deps/out/include/absl/container/internal/raw_hash_set.h:1175
    #15 0x563fc3715c7a in unsigned long absl::lts_2020_02_25::container_internal::raw_hash_set<absl::lts_2020_02_25::container_internal::FlatHashMapPolicy<int, std::shared_ptr<faas::server::ConnectionBase> >, absl::lts_2020_02_25::hash_internal::Hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::shared_ptr<faas::server::ConnectionBase> > > >::erase<int>(int const&) deps/out/include/absl/container/internal/raw_hash_set.h:1152
    #16 0x563fc37b7683 in faas::gateway::Server::OnConnectionClose(faas::server::ConnectionBase*) src/gateway/server.cpp:125
    #17 0x563fc3767935 in operator() src/server/server_base.cpp:132
    #18 0x563fc376c2fe in _M_invoke /usr/include/c++/9/bits/std_function.h:300
    #19 0x563fc376c5ef in std::function<void (faas::server::ConnectionBase**)>::operator()(faas::server::ConnectionBase**) const /usr/include/c++/9/bits/std_function.h:688
    #20 0x563fc376a3f0 in void faas::utils::ReadMessages<faas::server::ConnectionBase*>(faas::utils::AppendableBuffer*, char const*, unsigned long, std::function<void (faas::server::ConnectionBase**)>) src/utils/appendable_buffer.h:128
    #21 0x563fc3767bf3 in faas::server::ServerBase::OnReturnConnection(long, uv_buf_t const*) src/server/server_base.cpp:129
    #22 0x563fc376764c in faas::server::ServerBase::ReturnConnectionCallback(uv_stream_s*, long, uv_buf_t const*) src/server/server_base.cpp:121
    #23 0x563fc3857cce in uv__read (nightcore/bin/debug/gateway+0x28ccce)

previously allocated by thread T9 here:
    #0 0x7fc2fb39d947 in operator new(unsigned long) (/lib/x86_64-linux-gnu/libasan.so.5+0x10f947)
    #1 0x563fc37beecb in faas::gateway::Server::OnHttpConnection(int) src/gateway/server.cpp:430
    #2 0x563fc37be9b7 in faas::gateway::Server::HttpConnectionCallback(uv_stream_s*, int) src/gateway/server.cpp:424
    #3 0x563fc38584fa in uv__server_io (nightcore/bin/debug/gateway+0x28d4fa)

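It seems that a pending write callback (IOWorker::PipeWriteCallback) reads from an HttpConnection that has already been destroyed in Server::OnConnectionClose, which causes the crash. Do you have any ideas on how to fix that?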

@zhipeng-jia
Member

My guess is that there is an ongoing write when the connection is closed. Maybe try not destructing the connection object (by removing line 125 in gateway/server.cpp).
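Skipping the erase presumably leaves every closed connection in the map for the lifetime of the gateway. A possible alternative, sketched here under assumptions, is to let each in-flight write hold its own `std::shared_ptr` to the connection, so that erasing the map entry only drops one reference and the object is destroyed after the last pending callback has run. Everything in this sketch (`Connection`, `ToyLoop`, `ToyServer`) is a hypothetical stand-in written for illustration; it does not use libuv and is not Nightcore's actual code.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <unordered_map>
#include <utility>

// Hypothetical connection object. The counter lets a caller observe
// when the destructor actually runs.
struct Connection {
    Connection(int id, int* destroyed) : id(id), destroyed(destroyed) {}
    ~Connection() { ++*destroyed; }
    int id;
    int bytes_acked = 0;
    int* destroyed;
};

// Toy event loop: callbacks are queued now and run later, which mimics
// how a write-completion callback fires after the write was issued.
class ToyLoop {
public:
    void Defer(std::function<void()> cb) { pending_.push(std::move(cb)); }
    void RunAll() {
        while (!pending_.empty()) {
            std::function<void()> cb = std::move(pending_.front());
            pending_.pop();
            cb();
        }  // cb (and anything it captured) is released here
    }
private:
    std::queue<std::function<void()>> pending_;
};

class ToyServer {
public:
    explicit ToyServer(ToyLoop* loop) : loop_(loop) {}

    void Open(int id, int* destroyed) {
        conns_[id] = std::make_shared<Connection>(id, destroyed);
    }

    // The completion callback captures a shared_ptr copy, so the
    // connection outlives Close() until the callback has finished.
    void StartWrite(int id, int nbytes) {
        auto it = conns_.find(id);
        if (it == conns_.end()) return;
        std::shared_ptr<Connection> keep_alive = it->second;
        loop_->Defer([keep_alive, nbytes] { keep_alive->bytes_acked += nbytes; });
    }

    // Drops the server's reference only; pending writes keep theirs.
    void Close(int id) { conns_.erase(id); }

private:
    ToyLoop* loop_;
    std::unordered_map<int, std::shared_ptr<Connection>> conns_;
};
```

With this ownership scheme, calling `Close` while a write is still queued does not run the destructor; it runs once the queued callback has completed and released its reference. Whether this maps cleanly onto the gateway depends on how `IOWorker` tracks its write requests, which this sketch does not model.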

@DCsunset

Yeah, I tried the same thing a few days ago and it did fix the crash. At least it works now! Thanks for your reply anyway.
