
Prevent OOM by introducing a max on total buffer capacity #944

Closed
erikvanoosten wants to merge 7 commits into master from global-queue-limit

Conversation

erikvanoosten (Collaborator)

No description provided.

// By shuffling the streams we prevent read-starvation for streams at the end of the list.
val streams =
if (maxTotalQueueSize == Int.MaxValue) state.assignedStreams
else scala.util.Random.shuffle(state.assignedStreams)
erikvanoosten (Collaborator, Author) commented on Jun 24, 2023

Instead of shuffling, we can also sort the streams by queueSize (smallest first) 🤔
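For reference, a rough sketch of that alternative (illustrative only; it assumes the surrounding Runloop context, i.e. state.assignedStreams and a queueSize: UIO[Int] accessor on each stream, and is not the actual code):

// Hypothetical alternative to shuffling: order the assigned streams by their
// current queue size, smallest first, so the emptiest partitions are resumed first.
val streamsBySize =
  ZIO
    .foreach(state.assignedStreams)(stream => stream.queueSize.map(stream -> _))
    .map(_.sortBy { case (_, size) => size }.map { case (stream, _) => stream })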

erikvanoosten (Collaborator, Author) commented on Jun 24, 2023

Never mind: when we do that, behavior becomes unpredictable again. I would not be surprised if sorting by queueSize caused read-starvation instead of preventing it.

Base automatically changed from review_803 to master June 24, 2023 15:57
svroonland (Collaborator) left a comment

Could you add motivation for why this cannot be controlled by reducing maxPartitionQueueSize such that mPQS * number of partitions < maxTotalQueueSize?

erikvanoosten (Collaborator, Author)

> Could you add motivation for why this cannot be controlled by reducing maxPartitionQueueSize such that mPQS * number of partitions < maxTotalQueueSize?

Of course!

Suppose there are 2000 partitions to consume from and messages are around 30 kB. With maxPartitionQueueSize set to the default of 1024, that means we need up to 2000 * 1024 * 30 kB ≈ 63 GB of heap. Let's say we want to stay below 2 GB for all queues in total. With your suggestion we would need to set maxPartitionQueueSize to 2 GB / 2000 / 30 kB ≈ 35.

35 is so low that once a queue reaches this size, the remaining messages will be processed well before a new poll is executed. This leads to low throughput.

The introduction of a global maximum allows a higher maxPartitionQueueSize (the default of 1024 is fine). For example, to stay within 2 GB we can set maxTotalQueueSize to 2 GB / 30 kB ≈ 70k. This allows roughly 70k / 1024 ≈ 70 partitions to get high throughput. At every poll a different selection of partitions benefits, so overall there is always high-throughput progress.
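To make the arithmetic easier to follow, here are the same numbers as a small worked example (rounded, binary units):

// Numbers from the comment above.
val partitions            = 2000
val recordBytes           = 30L * 1024                              // ~30 kB per record
val maxPartitionQueueSize = 1024
val heapBudget            = 2L * 1024 * 1024 * 1024                 // 2 GB for all queues together

// Worst case with only the per-partition limit:
val worstCase = partitions * maxPartitionQueueSize * recordBytes    // ≈ 63 GB

// With a global limit instead:
val maxTotalQueueSize = heapBudget / recordBytes                    // ≈ 70k records
val fastPartitions    = maxTotalQueueSize / maxPartitionQueueSize   // ≈ 68 partitions at full prefetch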

Note: the numbers in this comment are not made up; they are taken from a service we deploy at my company.

svroonland (Collaborator) commented on Jun 28, 2023

> 35 is so low that once a queue reaches this size, the remaining messages will be processed well before a new poll is executed. This leads to low throughput.

If you have 2000 partitions, that's 70,000 messages to be processed before the next poll, right? So aren't there plenty of messages in the queue then? Or are you assuming an unequal number of messages on the partitions?

erikvanoosten (Collaborator, Author)

> 35 is so low that once a queue reaches this size, the remaining messages will be processed well before a new poll is executed. This leads to low throughput.
>
> If you have 2000 partitions, that's 70,000 messages to be processed before the next poll, right? So aren't there plenty of messages in the queue then? Or are you assuming an unequal number of messages on the partitions?

Indeed, in our application there is a huge imbalance between the partitions. Some partitions receive thousands of messages per second while others receive a handful per day. Currently, when we need to process a backlog, those high-throughput partitions take a long time to catch up.

@@ -245,17 +246,23 @@ private[consumer] final class Runloop private (
s"Starting poll with ${state.pendingRequests.size} pending requests and ${state.pendingCommits.size} pending commits"
)
_ <- currentStateRef.set(state)
partitionsToFetch <-
partitionsToFetch <- {
Collaborator

So we do a random shuffling of the partitions to resume. To me that sounds like it becomes very unpredictable when to expect new messages on some partition; that's treading into dangerous / unpredictable territory if you ask me.

At the very least we'd need some good unit tests to validate behavior in some scenarios.

Collaborator Author

Without shuffling, the partitions at the beginning of the list have a higher chance of getting data than partitions at the end of the list. By shuffling at each poll, each partition gets (on average) an equal chance to get data. Because there are many polls, and we shuffle differently every time, the average behavior is very predictable.

Note that we still resume a lot of partitions (roughly 70 in the example above), so it is still up to the broker to select which partitions to send data for.
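A minimal sketch of the idea, with hypothetical names (not the PR's actual Runloop code): shuffle the assigned partitions, then resume partitions until the total buffered-record budget is used up.

import org.apache.kafka.common.TopicPartition

// Hypothetical helper: choose which partitions to resume for the next poll,
// keeping the total number of buffered records under maxTotalQueueSize.
// Shuffling makes the selection fair on average over many polls.
def partitionsToFetch(
  queueSizes: List[(TopicPartition, Int)],
  maxPartitionQueueSize: Int,
  maxTotalQueueSize: Int
): Set[TopicPartition] = {
  val alreadyQueued = queueSizes.map(_._2.toLong).sum
  var budget        = maxTotalQueueSize - alreadyQueued
  val picked        = Set.newBuilder[TopicPartition]
  scala.util.Random.shuffle(queueSizes).foreach { case (tp, size) =>
    // Only resume partitions that still have room, and only while there is
    // budget left for the records a poll could add to them.
    val headroom = (maxPartitionQueueSize - size).toLong
    if (headroom > 0 && budget >= headroom) {
      picked += tp
      budget -= headroom
    }
  }
  picked.result()
}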

What do you think might go wrong?

Collaborator

I don't know what can go wrong. Can you write a unit test that validates the on-average fairness in throughput for the partitions and that the total queue size is always smaller than the config setting?

Collaborator Author

That should be possible (though not easy). I did not want to make the effort of writing unit tests if no one likes the idea to begin with. But if I can convince you with a test, I will write it :)
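As a starting point, such a test could be a plain simulation over many polls that checks both properties: the total stays under the limit, and every partition is resumed about equally often. A sketch with made-up parameters (hypothetical model, not actual zio-kafka test code):

import scala.util.Random

// Simulate many polls: assert the total buffered-record count never exceeds
// maxTotalQueueSize, and that partitions are resumed roughly equally often.
object FairnessSimulation extends App {
  val partitions            = 200
  val maxPartitionQueueSize = 100
  val maxTotalQueueSize     = 5000
  val polls                 = 10000
  val rnd                   = new Random(42)

  val queues       = Array.fill(partitions)(0)   // buffered records per partition
  val resumeCounts = Array.fill(partitions)(0L)  // how often each partition was resumed

  for (_ <- 1 to polls) {
    // Select partitions to resume under the total budget, in random order.
    var budget = maxTotalQueueSize - queues.sum
    val resumed = rnd.shuffle((0 until partitions).toList).filter { p =>
      val headroom = maxPartitionQueueSize - queues(p)
      val take     = headroom > 0 && budget >= headroom
      if (take) budget -= headroom
      take
    }
    // "Poll": each resumed partition receives up to its headroom in new records.
    resumed.foreach { p =>
      resumeCounts(p) += 1
      queues(p) += rnd.nextInt(maxPartitionQueueSize - queues(p) + 1)
    }
    assert(queues.sum <= maxTotalQueueSize, "total queue size exceeded the limit")
    // "Processing": every partition drains some of its buffered records.
    for (p <- 0 until partitions) queues(p) -= math.min(queues(p), rnd.nextInt(101))
  }

  // Loose fairness check: every partition got resumed, and no partition got
  // resumed more than twice as often as the least-resumed one. A real test
  // would use a tighter, statistically motivated tolerance.
  assert(resumeCounts.min > 0, "some partition was never resumed")
  assert(resumeCounts.max < 2 * resumeCounts.min, s"uneven resumes: ${resumeCounts.min} .. ${resumeCounts.max}")
  println(s"OK: resume counts between ${resumeCounts.min} and ${resumeCounts.max}")
}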

Collaborator

Yeah, well, it is quite an extreme use case to have that many partitions for one consumer, so I would prefer a mechanism outside of zio-kafka, or at least an optional component outside the Runloop.

Collaborator Author

I do not agree that 2000 partitions is extreme. It is well within the capabilities of a Kafka cluster.

Do you have an idea on how else to build this feature?

Collaborator Author

BTW, even with 200 partitions and 30 kB messages the memory requirements are quite large (~6 GB).

Collaborator

We could create a trait PrefetchStrategy that the user can supply an instance for.
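A rough idea of what such a trait could look like (hypothetical signatures, shown only to make the suggestion concrete; not an existing zio-kafka API):

import org.apache.kafka.common.TopicPartition
import zio._

// Pluggable strategy: given the current queue size of every assigned
// partition, decide which partitions to resume before the next poll.
trait PrefetchStrategy {
  def partitionsToFetch(queueSizes: Map[TopicPartition, Int]): UIO[Set[TopicPartition]]
}

// The per-partition behavior discussed above, expressed as a strategy:
// resume every partition whose queue is below the per-partition threshold.
final case class PerPartitionPrefetch(maxPartitionQueueSize: Int) extends PrefetchStrategy {
  override def partitionsToFetch(queueSizes: Map[TopicPartition, Int]): UIO[Set[TopicPartition]] =
    ZIO.succeed(queueSizes.filter { case (_, size) => size < maxPartitionQueueSize }.keySet)
}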

Collaborator

That would also be good for testing the prefetch strategy in unit tests.

erikvanoosten (Collaborator, Author)

Replaced by #970.

erikvanoosten deleted the global-queue-limit branch on July 10, 2023 at 20:00