
Make worker pool more resilient to slow tasks #5751

Merged (3 commits into main on Nov 13, 2023)

Conversation

@thampiotr (Contributor) commented Nov 10, 2023

PR Description

The current worker pool distributes tasks among goroutines using a hash of the task's key. In the presence of a very slow task, e.g. due to a bug, one goroutine can become effectively stuck, and all other tasks that are unfortunate enough to hash to the same goroutine will be affected as well.

We could prevent this by not relying on the hash and instead assigning tasks to any available goroutine. The problem with that approach, however, is that a slow task may come in again and again, and soon every goroutine in the pool would be trying to execute the same very slow task. To prevent this, we must allow only one task with a given key to be executing at any given time.

This PR changes the worker pool implementation:

  • We maintain the existing behaviour that only one task with a given key can be waiting to be executed at a time - this reduces duplicate work.
  • We add a mechanism that tracks the task keys that are currently running and allows the goroutines to run only one task with a given key at a time, to prevent the problem described above.
  • We assign tasks that are eligible to run to a random available goroutine.

As a result, a very slow task a) cannot take over all the goroutines and b) does not block other tasks from executing. A simplified sketch of this scheduling logic is shown below.
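
To make these rules concrete, here is a minimal sketch of the bookkeeping they imply. It is illustrative only: the names (Task, queue, tryEnqueue, dequeue, taskDone) are assumptions made for this example and may not match the actual implementation in pkg/flow/internal/worker/worker_pool.go.

```go
package worker

import "sync"

// Task is a unit of work identified by a key. At most one task per key may be
// waiting, and at most one task per key may be running, at any given time.
type Task struct {
	Key string
	Run func()
}

// queue holds the scheduling state described in the PR description.
type queue struct {
	mut          sync.Mutex
	waiting      map[string]Task     // at most one waiting task per key
	waitingOrder []string            // FIFO order of waiting keys
	running      map[string]struct{} // keys that are currently executing
}

func newQueue() *queue {
	return &queue{
		waiting: map[string]Task{},
		running: map[string]struct{}{},
	}
}

// tryEnqueue deduplicates waiting work: a task whose key is already waiting
// replaces the waiting entry instead of being queued a second time.
func (q *queue) tryEnqueue(t Task) {
	q.mut.Lock()
	defer q.mut.Unlock()
	if _, alreadyWaiting := q.waiting[t.Key]; !alreadyWaiting {
		q.waitingOrder = append(q.waitingOrder, t.Key)
	}
	q.waiting[t.Key] = t
}

// dequeue hands out the oldest waiting task whose key is not already running.
// Any idle worker goroutine may call it, so there is no key-to-goroutine
// affinity, but two goroutines never run the same key concurrently.
func (q *queue) dequeue() (Task, bool) {
	q.mut.Lock()
	defer q.mut.Unlock()
	for i, key := range q.waitingOrder {
		if _, isRunning := q.running[key]; isRunning {
			continue // a stuck task with this key only blocks itself
		}
		t := q.waiting[key]
		delete(q.waiting, key)
		q.waitingOrder = append(q.waitingOrder[:i], q.waitingOrder[i+1:]...)
		q.running[key] = struct{}{}
		return t, true
	}
	return Task{}, false // nothing eligible right now
}

// taskDone releases the key so a queued task with the same key can run next.
func (q *queue) taskDone(key string) {
	q.mut.Lock()
	defer q.mut.Unlock()
	delete(q.running, key)
}
```

In this sketch a worker goroutine would loop: call dequeue, run the task, then call taskDone for its key.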

As discussed in previous PRs, the worker pool the Agent needs has quite unique behaviour, as described above, and we were not able to find an existing library that fulfils these requirements.

Notes to the Reviewer

I tested this in a dev cluster, where, as expected, I can see a drop in the time tasks have to wait for a goroutine to pick them up:

[Screenshot: task wait times in the dev cluster - left half before the change, right half after the change]

The size of this effect is expected to be larger in other large clusters that have slower components.

PR Checklist

  • CHANGELOG.md updated
  • Tests updated

require.Equal(t, 2, pool.QueueSize())
})

t.Run("should not block when one task is stuck", func(t *testing.T) {
thampiotr (Contributor, Author) commented:

In the previous implementation, this test would fail.

}, 3*time.Second, 1*time.Millisecond)
})

t.Run("should NOT run concurrently tasks with the same key", func(t *testing.T) {
thampiotr (Contributor, Author) commented:

This is an important invariant that held in the previous implementation, but was not explicitly tested (although some other tests would become flaky if it was broken).

@thampiotr marked this pull request as ready for review on November 10, 2023 16:00
@thampiotr requested a review from a team as a code owner on November 10, 2023 16:00
@thampiotr force-pushed the thampiotr/resilient-worker-pool branch from 754f423 to 47ae000 on November 10, 2023 16:21

// Wait for the first task to be running already and blocking the worker
<-firstTaskRunning
- require.Equal(t, 0, pool.QueueSize())
+ require.Equal(t, 1, pool.QueueSize())
thampiotr (Contributor, Author) commented:

The definition of QueueSize changed to also include the tasks that are currently executing. This makes the implementation a bit simpler, while the metric can still perform its function: telling us when the queue is building up.
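
For illustration only, with the new definition QueueSize amounts to something like the following, reusing the hypothetical fields from the sketch in the PR description (not the actual code):

```go
// QueueSize counts both tasks waiting for a goroutine and tasks currently
// executing, so it still shows when the queue is building up.
func (q *queue) QueueSize() int {
	q.mut.Lock()
	defer q.mut.Unlock()
	return len(q.waitingOrder) + len(q.running)
}
```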

@rfratto (Member) left a comment:

LGTM, nice improvement! I have one question about performance implications, but I think this is good to go.

}

// Remove the task from waiting and add it to running set
w.waitingOrder = append(w.waitingOrder[:index], w.waitingOrder[index+1:]...)
rfratto (Member) commented:

What's the behavior of append when removing an element in this way - is it copying the whole slice? I'm curious whether this has any performance implications for allocations long term, and whether a traditional linked list might be more appropriate.

thampiotr (Contributor, Author) replied:

It's a good point, as a linked list would be a natural choice for this algorithm.

The above code will copy the second argument element-by-element, as per https://go.dev/blog/slices-intro, but it will NOT allocate, since we're reducing the size of the slice. This is slower than a linked-list approach, where removal is O(1) instead of O(N). However, this code is not a bottleneck for two reasons: 1) the slice copying is dominated by other, slower operations - such as evaluating components, which can take multiple milliseconds, compared to a slice copy that takes a few nanoseconds; 2) the queue usually holds fewer than 10 elements, even in our largest clusters.
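
For reference, here is the removal idiom in isolation - a generic sketch, not the PR's code:

```go
// removeAt drops the element at index i by shifting the tail left by one
// position. append reuses the existing backing array, so nothing is
// allocated; the cost is the O(N) copy of the elements after index i.
func removeAt(s []string, i int) []string {
	return append(s[:i], s[i+1:]...)
}
```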

If we had a linked list in the Go standard library, I'd use it, but we don't, and this is not a bottleneck, so I'm proposing we stick with this more idiomatic implementation.

rfratto (Member) replied:

I'm fine with that; we can always return to this later if it turns into a bottleneck :)

thampiotr (Contributor, Author) replied:

> If we had a linked list in the Go standard library, I'd use it, but we don't, and this is not a bottleneck, so I'm proposing we stick with this more idiomatic implementation.

There is a linked list in the Go standard library (thanks!), but it allocates memory, so it would not be more performant here... and it's still not a bottleneck :) I have some benchmarks in #5765, which we can merge if we want to keep them for posterity.
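
As a rough illustration of why container/list would not obviously be faster here (this is not taken from the benchmarks in #5765, and it continues the hypothetical worker package sketched in the PR description): removal of a known element is O(1), but every insertion allocates a new list.Element on the heap, whereas appending a key to a slice usually reuses existing capacity.

```go
package worker

import "container/list"

// enqueueKey allocates a *list.Element per key pushed onto the list.
func enqueueKey(l *list.List, key string) *list.Element {
	return l.PushBack(key)
}

// dequeueKey removes a known element in O(1); the stored value comes back as
// `any` and has to be type-asserted.
func dequeueKey(l *list.List, e *list.Element) string {
	return l.Remove(e).(string)
}
```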

@thampiotr merged commit c2c40f8 into main on Nov 13, 2023
8 checks passed
@thampiotr deleted the thampiotr/resilient-worker-pool branch on November 13, 2023 10:52
@tpaschalis (Member) left a comment:

Reviewing after the fact, but LGTM; I like the new semantics and the readability 🙂

Adding a small nit below about the tryEnqueue semantics and whether they matter in the context of Flow graph updates.

pkg/flow/internal/worker/worker_pool.go (comment resolved)
thampiotr added a commit that referenced this pull request Nov 20, 2023
* Make worker pool more resilient to slow tasks

* CHANGELOG.md

* fix invalid pointer
@github-actions bot locked this pull request as resolved due to inactivity on Feb 21, 2024.