Gateway segfault when stressing the system #5
Comments
This is an interesting issue. I cannot immediately think of which part of the code could segfault under heavy load. A question: what is the (rough) aggregated QPS from the 3 client machines? I'll try to see if I can reproduce this problem on my machines.
When using 10 threads on each client machine, the aggregated QPS was about 17k. The segfault occurs once in a while with this many threads. If I increase the number of threads to 14 per client machine, it happens every time I rerun the experiment, about 10 to 20 seconds into each run. Thanks for looking into this!
Could you verify whether the segfault is caused by an rlimit, e.g., the maximum number of file descriptors?
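One way to check the file descriptor limit from inside the process is sketched below. It assumes a POSIX system and uses only `getrlimit`/`setrlimit`; the helper names (`GetFdLimits`, `RaiseFdSoftLimitToHard`, `PrintFdLimits`) are hypothetical and are not part of Nightcore.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_NOFILE (POSIX)

#include <cassert>
#include <cstdio>

// Hypothetical helper: reads the soft and hard limits on open file
// descriptors for the current process. Returns false if getrlimit() fails.
bool GetFdLimits(struct rlimit* out) {
  assert(out != nullptr);
  return getrlimit(RLIMIT_NOFILE, out) == 0;
}

// Hypothetical helper: raises the soft limit up to the hard limit.
// An unprivileged process is allowed to do this; only raising the hard
// limit requires extra privileges.
bool RaiseFdSoftLimitToHard() {
  struct rlimit lim;
  if (!GetFdLimits(&lim)) return false;
  lim.rlim_cur = lim.rlim_max;
  return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}

// Hypothetical helper: prints the current limits, for example at startup.
void PrintFdLimits() {
  struct rlimit lim;
  if (GetFdLimits(&lim)) {
    std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
                static_cast<unsigned long long>(lim.rlim_cur),
                static_cast<unsigned long long>(lim.rlim_max));
  }
}
```

If the soft limit printed here is far above the number of concurrent client connections, descriptor exhaustion is an unlikely cause, which would be consistent with the later reports in this thread.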
Hi @zhipeng-jia, I encountered the same error. I tried increasing the number of max file descriptors (e.g.
It seems to be related to the uv library. Any ideas about what causes it? Thanks
Hi, I enabled the debug mode and address sanitizer and I'm able to locate the bug now:
It seems that it tries to write to an HTTP connection that has already been destroyed, which causes the crash. Do you have any ideas on how to fix that?
My guess is that there is an ongoing write, but the connection has already been closed. Maybe try not destructing the connection class (by removing line 125 in gateway/server.cpp).
Yeah, I tried the same thing a few days ago and it did fix the crash. At least it works now! Thanks for your reply anyway.
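The workaround above avoids the crash by never destroying the connection object. An alternative pattern, sketched here as an assumption and not as Nightcore's actual code, is to keep the object alive only until its pending writes have completed. The sketch does not use libuv: `FakeLoop` is a hypothetical stand-in for an event loop, and `Connection` is a hypothetical class whose write callback holds a `shared_ptr` to itself.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop: callbacks are queued and run later,
// the way a write-completion callback would be.
class FakeLoop {
 public:
  void Post(std::function<void()> cb) { pending_.push_back(std::move(cb)); }

  // Runs queued callbacks in order; each callback is destroyed after it runs.
  void RunAll() {
    while (!pending_.empty()) {
      std::function<void()> cb = std::move(pending_.front());
      pending_.pop_front();
      cb();
    }
  }

 private:
  std::deque<std::function<void()>> pending_;
};

// Hypothetical connection class. The destructor sets an external flag so the
// caller can observe exactly when the object is destroyed.
class Connection : public std::enable_shared_from_this<Connection> {
 public:
  Connection(FakeLoop* loop, bool* destroyed_flag)
      : loop_(loop), destroyed_flag_(destroyed_flag) {}
  ~Connection() { *destroyed_flag_ = true; }

  // Queues a write. The callback captures a shared_ptr to this object, so the
  // connection cannot be destroyed while the write is still pending.
  void AsyncWrite(std::string data) {
    std::shared_ptr<Connection> self = shared_from_this();
    loop_->Post([self, data = std::move(data)]() {
      self->bytes_written_ += data.size();
    });
  }

  std::size_t bytes_written() const { return bytes_written_; }

 private:
  FakeLoop* loop_;
  bool* destroyed_flag_;
  std::size_t bytes_written_ = 0;
};
```

With this design the server can drop its own reference as soon as the connection closes: the object stays valid until the last queued callback has run, and is freed right after. Whether this maps cleanly onto the gateway's connection handling would need to be checked against the real code.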
I have 4 machines each with 12 CPU cores and 64GB RAM. I deploy the Nightcore gateway on one of them and on each of the other three, I deploy an instance of the engine and a launcher for a hello-world function.
I have 3 other machines which act as clients and invoke the hello-world function by sending HTTP POST requests to the gateway. The segfault occurs only when there are a large number of client threads (10 or 14 client threads on each client machine). What happens is that in the middle of the experiment, the gateway returns the following error in the log:
When I first encountered the problem, the segfault happened at uv__count_bufs, but in my latest attempt to reproduce the error I got the following message from gdb:
and in the tail of the log, there was this message:
Any ideas as to what causes this problem and how to resolve it?
Some more info that might be useful: When I deploy the engine and launcher on only 2 other machines (instead of 3), then this error does not show up regardless of the number of client threads I use to stress the system. The minWorkers and maxWorkers config parameters for the function are 20 and 80 respectively.