Long delay from publish to consumer event #1252
@bording @bollhals @stebet @danielmarbach - pinging you just to see if anything immediately comes to mind. The issue is that the first message published after a connection is established shows a long delay before the consumer receives it.
The time taken to establish the connection is not part of this calculation, of course.
The first time, all the code will have to be jitted by the runtime; this will for sure take some time. To test this you'd have to send one message, disconnect, and build up the connection again in the same process before starting the test sends.
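A minimal sketch of that warm-up idea, assuming the 6.x client API (the host name, queue name, and structure are illustrative, not taken from the actual test programs in this thread):

```csharp
using System.Text;
using RabbitMQ.Client;

class WarmupTest
{
    static void PublishOnce(ConnectionFactory factory)
    {
        // Open a connection and channel, publish one message, then dispose both.
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();
        channel.QueueDeclare(queue: "warmup", durable: false, exclusive: false, autoDelete: true);
        channel.BasicPublish(exchange: "", routingKey: "warmup",
                             basicProperties: null,
                             body: Encoding.UTF8.GetBytes("warmup"));
    }

    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };

        // Warm-up pass: publish and tear down so the publish path gets jitted in this process.
        PublishOnce(factory);

        // Second pass: connection setup and publishing now run against already-jitted code,
        // so any remaining first-message latency is not JIT overhead. The real, timed test
        // would start here.
        PublishOnce(factory);
    }
}
```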
A workload that opens a connection per message published should not be considered typical for any RabbitMQ client. Should we spend our time on workloads that use reasonably long-lived connections and services?
The Native Image Generator (Ngen.exe) tool could be used to verify this assumption.
Thanks @bollhals and @boskjoett, I'll start investigating there.
@lukebakken Since you mention that the Python version of this test doesn't show a delay, have you tried mixing the Python and C# versions to see which pairs also demonstrate the problem? That should help narrow down where the problem is.
@lukebakken could it be related to the flow control synchronization?
No, but that's an excellent suggestion. Thank you.
Indeed it could. It is exactly those lines you highlighted that we suspect are causing the delays.
If clients wish to do something on `channel.flow`, the handler should be used. This library should not do anything on its own. Fixes #1252
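For reference, a handler along the lines the commit describes might look like this. It is only a sketch, assuming the `IModel.FlowControl` event and `FlowControlEventArgs` of the 6.x client; the host name and message are illustrative:

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// React to channel.flow in application code instead of having the library
// block publishers on its own.
channel.FlowControl += (sender, args) =>
{
    // args.Active == false means the broker asked us to pause publishing.
    Console.WriteLine($"channel.flow received, active={args.Active}");
};
```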
I see the additional latency when combining Python / .NET both ways, but the latency is greater when the publisher is .NET and the consumer is Python. Removing the wait event around flow control didn't have an effect (commit), but it's something I'm going to remove anyway since no other client tries to handle `channel.flow` on its own. Next I'm going to try using …
Hm, using …
Next I'm going to test what @bollhals suggests in this comment:
This is the consumer output from this branch, where the consumer and producer both do a "first iteration" in which a connection and channel are created to send one message and then both are closed before the "real" test begins. I did not use … Notice that the first iteration of the real test still shows more latency, but not nearly as much as in the above test or in @boskjoett's initial report. It seems the "warm up" of the library has something to do with this. I'm going to port these tests to Java to see whether that library and runtime environment show the same behavior.
I have updated my test programs in this repo. It shows two things:
Output from the test
How do we define "a lot"? Publisher rate will vary, in particular with classic queues v1, both because of flow control and because they move messages to disk in batches according to certain internal metrics. Use a quorum queue or a CQv2 queue to reduce this variability to primarily internal flow control.
You are right. It is probably not fair to say "a lot" when it is in the milliseconds range. Our queues are not durable and messages are transient, so I don't expect writes to disk.
Thanks for the additional testing code, @boskjoett. I will update my Python code to do the same to see if the same behavior exists.
As I suspected, the Python code is not affected by the same sort of delays - lukebakken/rabbitmq-dotnet-client-1060@e847411. At this time I'll try @boskjoett's latest code with the …
What was the output for it?
We have discovered that the long random delays (more than a second) we have experienced in our code using RabbitMQ were caused by a lack of threads in the .NET thread pool. Creating new threads on demand can take several hundred milliseconds. We were inspired by this line of code in your test application …
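For anyone following along, the workaround being described amounts to raising the thread pool minimums at process startup. A sketch (the values are purely illustrative; note the caution in the next comment):

```csharp
using System;
using System.Threading;

// Read the current minimums, then raise the worker-thread minimum so the pool
// does not have to inject new threads on demand (which can take hundreds of
// milliseconds per thread) when many callbacks block at once.
ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);
Console.WriteLine($"defaults: worker={workerThreads}, iocp={completionPortThreads}");

// 64 is an arbitrary illustrative value, not a recommendation.
ThreadPool.SetMinThreads(64, completionPortThreads);
```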
Threads are not blocked for the BasicPublish method, nor anywhere in the receiving path. So if this fixes the issue for you, you either do more than just BasicPublish (e.g. QueueDeclare) or the threads are blocked by something else in your application. Do you know what blocks them?
Generally, changing the thread pool limits is something I have rarely seen be useful in production. On the flip side, I've seen problems actually get worse in certain scenarios, because people who override the limits no longer benefit from the default "smarts" of the thread pool's ramp-up and the optimizations done in the thread pool over time. Of course the thread pool is general purpose and mileage may vary, but the defaults are battle-tested against hundreds of scenarios. There is a reason there is a caution warning in the remarks section of that page.
Have you tried switching to the async consumer? Have you tried not copying the body, to see whether things are related to GC pauses? Also, how is the synthetic benchmark you have in the sample representative of the problem you are trying to solve? What are you trying to achieve where a few ms of delay would actually matter? In your real production system, what are you planning to do inside the receiving code?
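A minimal sketch of the async consumer being suggested, assuming the 6.x API (queue name and handler body are illustrative):

```csharp
using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory
{
    HostName = "localhost",
    DispatchConsumersAsync = true // required for async consumers in 6.x
};
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();
channel.QueueDeclare(queue: "test", durable: false, exclusive: false, autoDelete: true);

var consumer = new AsyncEventingBasicConsumer(channel);
consumer.Received += async (sender, ea) =>
{
    // Handle the delivery without blocking a thread pool thread.
    Console.WriteLine($"received {ea.Body.Length} bytes");
    await Task.CompletedTask;
};
channel.BasicConsume(queue: "test", autoAck: true, consumer: consumer);
Console.ReadLine();
```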
The thread pool issue is not in the RabbitMQ .NET client, but in our application code using it, where it sends and receives messages. We have code that waits on ManualResetEvents, etc. You are right that the correct solution is not to fiddle with the thread pool size, but to use async/await properly.
@bollhals this is the edited output from the first two batches of messages sent via the Python producer/consumer programs:
You may close this issue. It is not a real issue, as long as it only occurs during startup.
This is something to investigate for 7.0.
I'm re-opening this just to be sure that version 7.0 fixes this issue.
Hi everyone, I have just tagged version https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v7.0.0-alpha.1. Once this is pushed to NuGet by GitHub Actions, it would be really great if you could confirm that the async API resolves this issue in your environment (provided that it's relatively easy to update your test code to the new API). I will take some time today to re-run my test project with … Thanks!
It looks like the initial delay is still there using 7.0.0-alpha.1. The delay is reduced as @bollhals theorized here and as can be demonstrated in this branch. It looks like this initial latency is due to runtime startup, so I'm going to close this issue.
Have you tried playing around with the JIT settings to further investigate whether it is indeed JIT overhead? https://learn.microsoft.com/en-us/dotnet/core/runtime-config/compilation You can also control tiered JIT as part of the csproj:

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>
</Project>
```
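The same knob can also be flipped without rebuilding, e.g. via an environment variable documented on the page linked above (the `COMPlus_` prefix works on older runtimes):

```
DOTNET_TieredCompilation=0 dotnet run -c Release
```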
I have not! I'll give that a try.
Removing from the version 7 milestone since it is not version 7 specific.
Reported by @boskjoett
Discussed in #1249
Originally posted by boskjoett September 9, 2022
We are consistently, but with random frequency, experiencing long delays (more than a second) from the time a publisher delivers a message on the bus with BasicPublish() until a consumer (EventingBasicConsumer) is notified by a Received event.
We have tried both with and without auto-ack and using multiple consumers, but without any effect.
Our message bus is not overloaded; we can reproduce it with very few short messages.
We don't believe it is network delays, because we can also reproduce the random delays in a docker-compose stack where the producer, the consumer, and RabbitMQ are all running in containers in the same stack and the same Docker network.
Code to reproduce:
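(The reproduction code referenced here was attached to the original discussion and is not reproduced in this thread. Purely as an illustration of the kind of measurement described above, not the original code, a publish-to-receive timing test might look like the following, assuming the 6.x API and an embedded send timestamp:)

```csharp
using System;
using System.Text;
using System.Threading;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();
channel.QueueDeclare(queue: "latency-test", durable: false, exclusive: false, autoDelete: true);

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    // The publisher puts its send timestamp (UTC ticks) in the message body.
    long sentTicks = long.Parse(Encoding.UTF8.GetString(ea.Body.ToArray()));
    var delay = DateTime.UtcNow - new DateTime(sentTicks, DateTimeKind.Utc);
    Console.WriteLine($"publish -> receive delay: {delay.TotalMilliseconds:F1} ms");
};
channel.BasicConsume(queue: "latency-test", autoAck: true, consumer: consumer);

for (int i = 0; i < 10; i++)
{
    var body = Encoding.UTF8.GetBytes(DateTime.UtcNow.Ticks.ToString());
    channel.BasicPublish(exchange: "", routingKey: "latency-test",
                         basicProperties: null, body: body);
    Thread.Sleep(500);
}
```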