
add Concurrency entity for worker #1405

Open · wants to merge 11 commits into base: master

Conversation

@shijiesheng (Contributor) commented Nov 21, 2024

What changed?

  • added a worker package for modularity
  • added a resizable Permit entity wrapping the underlying semaphore implementation; these permits are a synchronization primitive for concurrency control over poller and task-processing goroutines (a minimal sketch follows this list)
  • replaced the buffered channel with the Permit implementation for task concurrency
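A minimal sketch of the Permit shape described above, with signatures inferred from the review discussion later in this thread; the quota methods are assumed names, not necessarily the merged API:

```go
package worker

import (
	"context"
	"sync"
)

// Permit is a resizable synchronization primitive that bounds how many poller
// and task-processing goroutines run concurrently, wrapping an underlying
// semaphore implementation.
type Permit interface {
	// Acquire blocks until count permits are available or ctx is canceled.
	Acquire(ctx context.Context, count int) error
	// AcquireChan returns a channel that yields one value once a permit is
	// acquired; callers must Release(1) afterward (full body quoted below).
	AcquireChan(ctx context.Context, wg *sync.WaitGroup) <-chan struct{}
	// Release returns count previously acquired permits.
	Release(count int)
	// Quota and SetQuota (assumed names) read and resize the permit limit.
	Quota() int
	SetQuota(count int)
}
```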

Why?

Adds a dynamic params component as a step toward worker auto-configuration.

How did you test it?

  • unit tests
  • integration tests

Potential risks

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 96.07843% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.55%. Comparing base (e3802b7) to head (e6d7036).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/worker/concurrency.go | 92.85% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/internal_poller_autoscaler.go | 92.70% <100.00%> (ø) |
| internal/internal_worker_base.go | 82.46% <100.00%> (+0.51%) ⬆️ |
| internal/worker/concurrency.go | 92.85% <92.85%> (ø) |

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@taylanisikdemir (Member) left a comment

Changes look safe as long as the underlying semaphore library is not buggy; it will be the critical component determining the client SDK's concurrency controls. I recommend deep-diving into that library to understand the implementation, checking for potential deadlock cases, and writing comprehensive unit/concurrency tests for the wrapper permit implementation.

Resolved (outdated) review threads: internal/worker/dynamic_params.go ×5, internal/internal_worker_base.go ×2
@shijiesheng changed the title from "add DynamicParams for worker" to "add Concurrency entity for worker and replace implementation for poller request channel" (Nov 26, 2024)
@shijiesheng changed the title from "add Concurrency entity for worker and replace implementation for poller request channel" to "add Concurrency entity for worker" (Nov 26, 2024)
```diff
@@ -74,7 +77,7 @@ func (p *permit) AcquireChan(ctx context.Context, wg *sync.WaitGroup) <-chan struct{} {
 	}
 	select { // try to send to channel, but don't block if listener is gone
 	case ch <- struct{}{}:
-	default:
+	case <-time.After(10 * time.Millisecond): // wait time is needed to avoid race condition of channel sending
```
@shijiesheng (Contributor, Author) commented:
Found a race condition in the channel send. Adding a wait time should be more reliable. (An illustrative reproduction follows.)
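A minimal runnable sketch of the race being described, assuming the original `default:` branch; names and timing are illustrative, not from the PR:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan struct{})

	// Stands in for the AcquireChan goroutine after a successful sem.Acquire.
	go func() {
		select {
		case ch <- struct{}{}: // succeeds only if a reader is already parked on <-ch
			fmt.Println("signal delivered")
		default: // reader not parked yet: the one-shot signal is silently dropped
			fmt.Println("signal dropped")
		}
	}()

	time.Sleep(time.Millisecond) // the reader typically arrives a moment too late
	select {
	case <-ch:
		fmt.Println("reader got the permit")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("reader would block forever") // the usual outcome here
	}
}
```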

@Groxx (Member) commented Nov 26, 2024:
hmm :\ unfortunately this means that if processing is delayed by 10ms, it will block the chan-reader forever. That's not too unlikely with big CPU spikes, and definitely not impossible.

tbh I think that might rule this impl out entirely. Though I think it's possible to build an AcquireChan(...) (<-chan struct{}, cancel func()) that doesn't have this issue, and that might be worth doing (see the sketch below).

Or we might have to embrace the atomic-like behavior around this and add retries to (*baseWorker).runPoller / anything using AcquireChan. That wouldn't be a fatal constraint afaict, though it's not ideal.
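For reference, a rough sketch of that (<-chan struct{}, cancel func()) shape, assuming the PR's permit wrapper around marusama/semaphore/v2; the method name and details are illustrative, not code from this PR:

```go
package worker

import (
	"context"
	"sync"

	"github.com/marusama/semaphore/v2"
)

type permit struct {
	sem semaphore.Semaphore
}

// AcquireChanWithCancel is a hypothetical misuse-resistant variant: the caller
// must invoke cancel once it stops reading, and cancel returns the permit to
// the semaphore if one was acquired but never consumed.
func (p *permit) AcquireChanWithCancel(ctx context.Context) (<-chan struct{}, func()) {
	ch := make(chan struct{})
	done := make(chan struct{})
	go func() {
		if err := p.sem.Acquire(ctx, 1); err != nil {
			return // ctx canceled before a permit was acquired
		}
		select {
		case ch <- struct{}{}: // reader took the permit and must Release(1) later
		case <-done: // caller gave up: hand the permit back
			p.sem.Release(1)
		}
	}()
	var once sync.Once
	return ch, func() { once.Do(func() { close(done) }) }
}
```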

@taylanisikdemir (Member) commented:
If we expect the chan-reader to always consume from the returned ch unless ctx is canceled, then we can replace this goroutine implementation with:

```go
defer wg.Done()
if err := p.sem.Acquire(ctx, 1); err != nil {
	return // assuming Acquire only returns err if ctx.Done
}
select {
case ch <- struct{}{}:
case <-ctx.Done():
	p.sem.Release(1)
}
```

@shijiesheng (Contributor, Author) commented:
Fixed through Taylan's suggestion. It should be safe now.

@Groxx (Member) commented Nov 26, 2024

Reading through marusama/semaphore/v2:

  • yep, looks likely correct to me. so 👍 for that library.
  • some cautions worth noting:
    • golang.org/x/sync/semaphore does FIFO acquiring, and:
      • it probably performs MUCH better under heavy load, as marusama wakes up all waiters every time and they all race to acquire
      • it is strictly FIFO, which prevents starvation of >1-count acquires; marusama does not prevent that
      • x/sync/semaphore does not resize, so it's not really an option anyway
    • there are very clear uint32 overflow possibilities that it does not check
      • many of these will likely cause an invalid state (e.g. count underflow) and panic, but I suspect it's not guaranteed, and I'm pretty confident it's not guaranteed that it'll panic somewhere ~immediately.

So... I think we're probably fine. For pollers we won't run into any of these in practice.
It's probably worth leaving some comments on our use though, to make sure the cautions above are checked before using it elsewhere, and before using it with any unchecked config (e.g. a user sets 2 billion concurrent as "unlimited", and behavior is undefined).

In the meantime we should probably get some integer overflow checks added to it, and a new release. There's no reason to allow those to occur. (A sketch of an interim guard follows.)
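For reference, a sketch of the kind of interim guard meant here, assuming the marusama/semaphore/v2 API; the cap value and helper name are illustrative, not from this PR:

```go
package worker

import (
	"fmt"

	"github.com/marusama/semaphore/v2"
)

// maxSafeLimit is an illustrative cap kept far below the library's unchecked
// uint32 arithmetic; the exact safe bound has not been verified.
const maxSafeLimit = 1 << 20

// setLimitChecked rejects limits that could trip the overflow cases noted
// above (e.g. a user passing ~2 billion to mean "unlimited").
func setLimitChecked(sem semaphore.Semaphore, limit int) error {
	if limit < 1 || limit > maxSafeLimit {
		return fmt.Errorf("semaphore limit %d outside safe range [1, %d]", limit, maxSafeLimit)
	}
	sem.SetLimit(limit)
	return nil
}
```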

@Groxx (Member) left a comment

I think I'm gonna have to block this for https://github.com/cadence-workflow/cadence-go-client/pull/1405/files#r1859368368 since that's potentially fatal.

Overall though:

  • library looks fine (more details above)
  • changes look reasonable
    • four-layer-deep bw.concurrency.TaskPermit.AcquireChan accesses are a bit dubious in principle, but I think they make sense here. The added structure/layers seem useful. (A hypothetical shape is sketched after this comment.)
    • wrapper as a whole seems good to have, though do we have any plans to do non-1 values? I'd get rid of count if not, but I suppose that depends on the "resource". (but see library comment for risks there, e.g. we definitely cannot use it for "bytes" in many cases)
  • I have not checked the tests in detail but the high level approach seems useful. might need some fine-tuning, but it's reasonably "meaningful" and that's a very solid start.

I'm... not entirely sure what to do to resolve the core "resizable semaphore but we also need chans" issue tbh. We might end up needing / badly-wanting chans, so that may be important. I suspect there's some other library, but I haven't hunted for one or thought too much on what it'd need to be correct 🤔
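A hypothetical sketch of the shape behind that four-layer access, reusing the Permit interface sketched in the PR description; field names are inferred from bw.concurrency.TaskPermit.AcquireChan and may differ from internal/worker/concurrency.go:

```go
package worker

// ConcurrencyLimit groups the permits a worker uses to bound its goroutines,
// producing the bw.concurrency.TaskPermit.AcquireChan access path noted above.
type ConcurrencyLimit struct {
	PollerPermit Permit // bounds concurrent poll requests
	TaskPermit   Permit // bounds concurrent task processing
}
```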

@shijiesheng (Contributor, Author) commented Nov 27, 2024

> (Quoting @Groxx's review comment above in full.)

Double-checked: we don't use non-one values in Acquire anywhere, so I'll remove count in the next PR. Some of the pollerAutoScaler interface methods are redundant and can be removed as well.

Comment on lines +67 to +84

```go
// AcquireChan returns a permit ready channel. Similar to Acquire, but non-blocking.
// Remember to call Release(1) to release the permit after usage.
func (p *permit) AcquireChan(ctx context.Context, wg *sync.WaitGroup) <-chan struct{} {
	ch := make(chan struct{})
	wg.Add(1)
	go func() {
		defer wg.Done()
		if err := p.sem.Acquire(ctx, 1); err != nil {
			return
		}
		select { // try to send to channel, but don't block if listener is gone
		case ch <- struct{}{}:
		case <-ctx.Done():
			p.sem.Release(1)
		}
	}()
	return ch
}
```
@Groxx (Member) commented Nov 27, 2024:

hmm. This is possible to use correctly, and doesn't leak goroutines beyond ctx.Done(), which is good (and possibly good enough), but it still leaves the chan permanently blocking in some cases.

That's correct as long as the AcquireChan(...) caller also listens to the same / a derived ctx.Done() and stops using the chan if that occurs, and ideally also cancels the context when it stops reading in all other branches.

func (bw *baseWorker) runPoller() does this currently, because bw.shutdownCh closes when bw.limiterContext is canceled and there is effectively no timeout, but it feels kinda error-prone 🤔
It also feels like there might be an alternative that isn't as risky... though I'm not yet sure what that might be. (A sketch of the currently-safe pattern follows.)
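A minimal sketch of the safe pattern being described, reusing the hypothetical types sketched earlier in this thread; the loop body and field names are assumptions, not the PR's exact code:

```go
package worker

import (
	"context"
	"sync"
)

// baseWorker here is a hypothetical stand-in for the PR's struct.
type baseWorker struct {
	concurrency    *ConcurrencyLimit
	limiterContext context.Context
	shutdownCh     chan struct{}
	shutdownWG     sync.WaitGroup
}

func (bw *baseWorker) runPoller() {
	for {
		// Safe only because bw.limiterContext is canceled together with the
		// close of bw.shutdownCh, and the select below has no other branch.
		permitCh := bw.concurrency.TaskPermit.AcquireChan(bw.limiterContext, &bw.shutdownWG)
		select {
		case <-permitCh:
			// poll and dispatch; Release(1) once processing completes
		case <-bw.shutdownCh:
			// the AcquireChan goroutine sees ctx.Done() and releases the permit itself
			return
		}
	}
}
```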

@Groxx (Member) commented Nov 28, 2024:

From trying to build some alternatives, and thinking about it some more, two thoughts:

1: "Permanently block if the context is timed out" is, I think, the correct choice, because the alternative is to close the chan, and that can easily lead to infinite loops and might be interpreted as "no limit". You'd also need a "detect if closed" check everywhere, all the time, rather than just adding the one case needed to know when to stop. So 👍 that's fine here.

2: With resizing being allowed, I'm growing convinced that either "one-use chan + goroutine + release func if not read, per AcquireChan call" or "a background maintenance goroutine" is unavoidable. Limit and count changes have to be synchronized with reads, so reads can't be buffered; and since we can't tell whether a reader is gone forever or just delayed, we can't pair releases with acquires synchronously. So we need some kind of buffer somewhere else.

So... this might work, though the "correct use requires passing a context that you also wait on in the same select, and also cancel when you stop reading" detail still strikes me as a moderate footgun.
E.g. the core runPoller loop is only safe right now because there is both no timeout and no other branch in the select; if we add a branch, it'll leak goroutines ~forever every time that other branch is taken (illustrated below).
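Concretely, the kind of added branch that would introduce the leak (a hypothetical fragment, not proposed code):

```go
select {
case <-permitCh:
	// ok: permit consumed
case <-bw.shutdownCh:
	return // ok: ctx is canceled alongside shutdown, so the sender releases the permit
case <-time.After(pollBackoff): // hypothetical new branch
	// permitCh is abandoned while ctx stays live: the AcquireChan goroutine
	// blocks on its unbuffered send forever, leaking a goroutine and a permit
}
```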

But that is something we could document thoroughly and probably not run into. And using marusama/semaphore/v2 does make it a much simpler implementation than doing it by hand.


Mind if I sit on this over the long weekend, and maybe we can grab others / discuss an alternative I made that's a bit more misuse-resistant? With some careful docs I think this is acceptable and looks functionally-correct, but I'm not entirely sure it's worth keeping...
