Long delay from publish to consumer event #1252
@bording @bollhals @stebet @danielmarbach - pinging you just to see if anything immediately comes to mind. The issue is that the first message published after a connection is established shows a long delay before the consumer receives it.
The time taken to establish the connection is not part of this calculation, of course.
The first time, all the code will have to be jitted by the runtime; this will for sure take some time. To test this you'd have to send one message, disconnect, and build up the connection again in the same process before starting the test sends.
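A minimal sketch of that warm-up idea, assuming the 6.x client API (the host name, queue name, and structure are illustrative, not taken from the actual test programs in this thread):

```csharp
using System.Text;
using RabbitMQ.Client;

class WarmupTest
{
    static void PublishOnce(ConnectionFactory factory)
    {
        // Open a connection and channel, publish one message, then dispose both.
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();
        channel.QueueDeclare(queue: "warmup", durable: false, exclusive: false, autoDelete: true);
        channel.BasicPublish(exchange: "", routingKey: "warmup",
                             basicProperties: null,
                             body: Encoding.UTF8.GetBytes("warmup"));
    }

    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };

        // Warm-up pass: publish and tear down so the publish path gets jitted in this process.
        PublishOnce(factory);

        // Second pass: connection setup and publishing now run against already-jitted code,
        // so any remaining first-message latency is not JIT overhead. The real, timed test
        // would start here.
        PublishOnce(factory);
    }
}
```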
A workload that opens a connection per message published should not be considered typical for any RabbitMQ client. Should we spend our time on workloads that use reasonably long-lived connections and services?
The Native Image Generator (Ngen.exe) tool could be used to verify this assumption.
Thanks @bollhals and @boskjoett, I'll start investigating there.
@lukebakken Since you mention that the Python version of this test doesn't show a delay, have you tried mixing the Python and C# versions to see which pairs also demonstrate the problem? That should help narrow down where the problem is.
@lukebakken could it be related to the flow control synchronization?
No, but that's an excellent suggestion. Thank you.
Indeed it could. It is exactly those lines you highlighted that we suspect are causing the delays.
If clients wish to do something on `channel.flow`, the handler should be used. This library should not do anything on its own. Fixes #1252
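For reference, a handler along the lines the commit describes might look like this. It is only a sketch, assuming the `IModel.FlowControl` event and `FlowControlEventArgs` of the 6.x client; the host name and message are illustrative:

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// React to channel.flow in application code instead of having the library
// block publishers on its own.
channel.FlowControl += (sender, args) =>
{
    // args.Active == false means the broker asked us to pause publishing.
    Console.WriteLine($"channel.flow received, active={args.Active}");
};
```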
I see the additional latency when combining Python / .NET both ways, but the latency is greater when the publisher is .NET and the consumer is Python. Removing the wait event around flow control didn't have an effect (commit), but it's something I'm going to remove anyway since no other client tries to handle `channel.flow` on its own. Next I'm going to try using …
Hm, using …
Next I'm going to test what @bollhals suggests in this comment:
This is the consumer output from this branch, where the consumer and producer both do a "first iteration" in which a connection and channel are created to send one message and then both are closed before the "real" test begins. I did not use … Notice that the first iteration of the real test still shows more latency, but not nearly as much as in the above test or in @boskjoett's initial report. It seems the "warm up" of the library has something to do with this. I'm going to port these tests to Java to see whether that library and runtime environment show the same behavior.
I have updated my test programs in this repo. It shows two things:
Output from the test
How do we define "a lot"? Publisher rate will vary, in particular with classic queues v1, both because of flow control and because they move messages to disk in batches according to certain internal metrics. Use a quorum queue or a CQv2 queue to reduce this variability to primarily internal flow control.
You are right. It is probably not fair to say "a lot" when it is in the milliseconds range. Our queues are not durable and messages are transient, so I don't expect writes to disk.
Thanks for the additional testing code, @boskjoett. I will update my Python code to do the same to see if the same behavior exists.
As I suspected, the Python code is not affected by the same sort of delays - lukebakken/rabbitmq-dotnet-client-1060@e847411. At this time I'll try @boskjoett's latest code with the …
What was the output for it?
We have discovered that the long random delays (more than a second) we have experienced in our code using RabbitMQ were caused by a lack of threads in the .NET thread pool. Creating new threads on demand can take several hundred milliseconds. We were inspired by this line of code in your test application …
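For anyone following along, the workaround being described amounts to raising the thread pool minimums at process startup. A sketch (the values are purely illustrative; note the caution in the next comment):

```csharp
using System;
using System.Threading;

// Read the current minimums, then raise the worker-thread minimum so the pool
// does not have to inject new threads on demand (which can take hundreds of
// milliseconds per thread) when many callbacks block at once.
ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);
Console.WriteLine($"defaults: worker={workerThreads}, iocp={completionPortThreads}");

// 64 is an arbitrary illustrative value, not a recommendation.
ThreadPool.SetMinThreads(64, completionPortThreads);
```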
Threads are not blocked for the BasicPublish method, nor anywhere in the receiving path. So if this fixes the issue for you, you either do more than just BasicPublish (e.g. QueueDeclare) or the threads are blocked by something else in your application. Do you know what blocks them?
Generally, changing the thread pool limits is something I have rarely seen be useful in production. On the flip side, I've seen problems actually get worse in certain scenarios, because people who override the limits no longer benefit from the default "smarts" of the thread pool's ramp-up and the optimizations done in the thread pool over time. Of course the thread pool is general purpose and mileage may vary, but the defaults are battle-tested against hundreds of scenarios. There is a reason there is a caution warning in the remarks section of that page.
Have you tried switching to the async consumer? Have you tried not copying the body, to see whether things are related to GC pauses? Also, how is the synthetic benchmark you have in the sample representative of the problem you are trying to solve? What are you trying to achieve where a few ms of delay would actually matter? In your real production system, what are you planning to do inside the receiving code?
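A minimal sketch of the async consumer being suggested, assuming the 6.x API (queue name and handler body are illustrative):

```csharp
using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory
{
    HostName = "localhost",
    DispatchConsumersAsync = true // required for async consumers in 6.x
};
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();
channel.QueueDeclare(queue: "test", durable: false, exclusive: false, autoDelete: true);

var consumer = new AsyncEventingBasicConsumer(channel);
consumer.Received += async (sender, ea) =>
{
    // Handle the delivery without blocking a thread pool thread.
    Console.WriteLine($"received {ea.Body.Length} bytes");
    await Task.CompletedTask;
};
channel.BasicConsume(queue: "test", autoAck: true, consumer: consumer);
Console.ReadLine();
```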
The thread pool issue is not in the RabbitMQ .NET client, but in our application code using it, where it sends and receives messages. We have code that waits on ManualResetEvents, etc. You are right that the correct solution is not to fiddle with the thread pool size, but to use async/await properly.
@bollhals this is the edited output from the first two batches of messages sent via the Python producer/consumer programs:
You may close this issue. It is not a real issue, as long as it only occurs during startup.
This is something to investigate for 7.0.
I'm re-opening this just to be sure that version 7.0 fixes this issue.
Hi everyone, I have just tagged version https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v7.0.0-alpha.1. Once this is pushed to NuGet by GitHub Actions, it would be really great if you could confirm that the async API resolves this issue in your environment (provided that it's relatively easy to update your test code to the new API). I will take some time today to re-run my test project with … Thanks!
It looks like the initial delay is still there using 7.0.0-alpha.1. The delay is reduced as @bollhals theorized here and as can be demonstrated in this branch. It looks like this initial latency is due to runtime startup, so I'm going to close this issue.
Have you tried playing around with the JIT settings to further investigate whether it is indeed JIT overhead? https://learn.microsoft.com/en-us/dotnet/core/runtime-config/compilation You can also control tiered JIT as part of the csproj:

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>
</Project>
```
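The same knob can also be flipped without rebuilding, e.g. via an environment variable documented on the page linked above (the `COMPlus_` prefix works on older runtimes):

```
DOTNET_TieredCompilation=0 dotnet run -c Release
```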
I have not! I'll give that a try.
Removing from the version 7 milestone since it is not version 7 specific.
Reported by @boskjoett
Discussed in #1249
Originally posted by boskjoett September 9, 2022
We are consistently, but with random frequency, experiencing long delays (more than a second) from the time a publisher delivers a message on the bus with BasicPublish() until a consumer (EventingBasicConsumer) is notified by a Received event.
We have tried both with and without auto-ack and using multiple consumers, but without any effect.
Our message bus is not overloaded; we can reproduce it with very few short messages.
We don't believe it is network delays, because we can also reproduce the random delays in a docker-compose stack where the producer, the consumer, and RabbitMQ are all running in containers in the same stack and the same Docker network.
Code to reproduce:
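(The reproduction code referenced here was attached to the original discussion and is not reproduced in this thread. Purely as an illustration of the kind of measurement described above, not the original code, a publish-to-receive timing test might look like the following, assuming the 6.x API and an embedded send timestamp:)

```csharp
using System;
using System.Text;
using System.Threading;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();
channel.QueueDeclare(queue: "latency-test", durable: false, exclusive: false, autoDelete: true);

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    // The publisher puts its send timestamp (UTC ticks) in the message body.
    long sentTicks = long.Parse(Encoding.UTF8.GetString(ea.Body.ToArray()));
    var delay = DateTime.UtcNow - new DateTime(sentTicks, DateTimeKind.Utc);
    Console.WriteLine($"publish -> receive delay: {delay.TotalMilliseconds:F1} ms");
};
channel.BasicConsume(queue: "latency-test", autoAck: true, consumer: consumer);

for (int i = 0; i < 10; i++)
{
    var body = Encoding.UTF8.GetBytes(DateTime.UtcNow.Ticks.ToString());
    channel.BasicPublish(exchange: "", routingKey: "latency-test",
                         basicProperties: null, body: body);
    Thread.Sleep(500);
}
```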