-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fn: add goroutine manager #9141
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
74b59e2
to
d381817
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
func (g *GoroutineManager) Go(f func(ctx context.Context)) error { | ||
g.mu.Lock() | ||
defer g.mu.Unlock() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit add a comment, that calling Add when already a Wait() is acitve will cause the golang env to panic ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also added a similar comment in GoroutineManager.Stop method.
fn/goroutine_manager_test.go
Outdated
) | ||
|
||
// TestGoroutineManager tests that GoroutineManager starts goroutines, until ctx | ||
// expires fails to start after it expires and waits for already started |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: // TestGoroutineManager tests that the GoroutineManager starts goroutines until ctx
// expires. It also makes sure it fails to start new goroutines after the context expired and the GoroutineManager is in the process of waiting for already started goroutines in the Stop method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks!
|
||
stopChan := make(chan struct{}) | ||
|
||
time.AfterFunc(1*time.Millisecond, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the panic happens in the golang runtime if the counter is actually 0 and we all Add(), or doesn't it matter but as soon as an Add() is called after the Wait was called the panic happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From https://pkg.go.dev/sync#WaitGroup.Add
Note that calls with a positive delta that occur when the counter is zero must happen before a Wait. Calls with a negative delta, or calls with a positive delta that start when the counter is greater than zero, may happen at any time.
So if the counter is 0, Add(1)
and Wait()
must not be called in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked the code, and it even panics if you call Add() while Wait is already waiting for all goroutine to finish. Then Wait() will panic, in your above example Add() will panic if the counter is zero and we already have waiters registered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it even panics if you call Add() while Wait is already waiting for all goroutine to finish
Even with a non-zero counter?
If the counter is 0, Wait() should unblock immediately, not wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes true, what you are referring to is basically that the wait() method returned because the Wait() method compares if the state was changed before returning if that was the case it panics.
d381817
to
d2cda21
Compare
8ff1c2a
to
a47a35e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc to disincentivize the use of this function otherwise lgtm
cancel func() | ||
} | ||
|
||
// NewGoroutineManager constructs and returns a new instance of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding the GoroutineManager
I mean it's good that we are introducing it, but I think we should disincentivize its use, because there is something wrong with the design if you need to use it, maybe we should mark this in the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Added the following note:
// NOTE: while it is safe to use GoroutineManager, it is recommended to change
// the design of the code using it to avoid launching goroutines except at start
// time. Ideally, goroutine launches and Stop() call should not compete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I have to chime in here. Why are we adding an abstraction that we are explicitly discouraging? Seems like this will be counterproductive in the long run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ok, so you are actually fine with this way to use the waitGroup ? I thought it was a bandaid to the current situation to not rewrite the whole htlc switch/payment code, but using the waitgroup in that way is not good design imo, because the waitgroup is inherently thread safe and does not need a mutex generally when it is made sure Adds are called before Wait().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that building infrastructure to support bad code design is just weird. We should fix the design or patch over it. We don't want to make it easier to write bad code in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The common scenario is like this:
Some type
- New() returns new instance of the type
- Some method starts an auxiliary goroutine which may keep working after the method completes
(though this is not needed to have the problem with goroutines)
- Method Stop() instructs all such background running workers to stop and waits for them to
actually stop to avoid goroutine leak
(It is not only htlc swich, also found in ChannelRouter.SendPaymentAsync, maybe in other cases too.)
Such a scenario is not thread-safe to use a WaitGroup (without a mutex) to track goroutines.
But is this always a bad scenario which we should discourage in favor of a design using fixed number of goroutines started in New()
and stopped in Stop()
? The later scenario is thread safe for an unprotected WaitGroup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the issue is with fixed vs variable number of goroutines. This issue arises from having multiple threads responsible for managing the WaitGroup. By sending requests into some object's main event loop (which is responsible for managing the waitgroup) and having it launch the new goroutine would handily solve the issue. However right now we cannot make the assumption that the methods of an object are called from the same thread, which is what causes the problem.
This brings me to a new theorem of mine: WaitGroups that manage a runtime determined number of threads should only exist in the stack memory of a goroutine, not in a field on some object. If the number of goroutines is statically known at compile time, then it may be on the field of the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the theorem! It is a good rule resulting in cleaner code. Also I think it is possible to enforce it at compile time (lint rule?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to write lint rules right now but if that's possible that sounds great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the note.
|
||
// Done returns a channel which is closed when either the context passed to | ||
// NewGoroutineManager expires or when Stop is called. | ||
func (g *GoroutineManager) Done() <-chan struct{} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅
fn/goroutine_manager_test.go
Outdated
// Make sure new goroutines do not start after Stop. | ||
require.ErrorIs(t, m.Go(func(ctx context.Context) {}), ErrStopping) | ||
|
||
// When Stop() is called, m.Done() is closed. Test that it is closed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Test that the internal context is closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
stopChan := make(chan struct{}) | ||
|
||
time.AfterFunc(1*time.Millisecond, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes true, what you are referring to is basically that the wait() method returned because the Wait() method compares if the state was changed before returning if that was the case it panics.
The package provides type GoroutineManager which is used to launch goroutines until context expires or the manager is stopped. Stop method blocks until all started goroutines stop. Original code by Andras https://go.dev/play/p/HhRpE-K2lA0 Adjustments and tests by Boris.
Looking at this a bit more closely while reviewing #9253. The |
Yes I think you are right, if we call from the waitgroup impl. we would panic:
|
@guggero @ziggie1984 Thanks for looking into it! Documentation of sync.WaitGroup.Wait has the following part:
I guess this means, that I also created a stress test testing GoroutineManager trying to catch this race condition, if it exists: // TestGoroutineManageStopsStress launches many Stop() calls in parallel with a
// task exiting. It attempts to catch a race condition between wg.Done() and
// wg.Wait() calls. According to documentation of wg.Wait() this is acceptable,
// therefore this test passes even with -race.
func TestGoroutineManageStopsStress(t *testing.T) {
t.Parallel()
m := NewGoroutineManager(context.Background())
// jobChan is used to make the task to finish.
jobChan := make(chan struct{})
// Start a task and wait inside it until we start calling Stop() method.
err := m.Go(func(ctx context.Context) {
<-jobChan
})
require.NoError(t, err)
// Now launch many gorotines calling Stop() method in parallel.
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func() {
defer wg.Done()
m.Stop()
}()
}
// Exit the task in parallel with Stop() calls.
close(jobChan)
// Wait until all the Stop() calls complete.
wg.Wait()
} The test passes under I think, that the current implementation is correct. |
Make sure there is no race condition between Done() and Wait() methods in the GoroutineManager implementation. See lightningnetwork#9141 (comment)
@starius is it allowed to call |
@Crypt-iQ From "or calls with a positive delta that start when the counter is greater than zero" part this should be allowed. |
@starius thanks for the explanation, I missed that part in the doc. It indeed looks like the current implementation is correct then! |
I think there is still a possibility of a race but not sure how to trigger it via tests: Imagine you call Done() (which leads to a counter of 0) but in the same time a Wait was called (before the goroutine counter was 0) then you cannot be sure that the semaphore will release the waiter because the implementation would panic anyways but if it would not panic not as many semaphore release calls would be triggered.
This is part of the Add implementation |
@ziggie1984 If I'll try to look deeper into the code implementing Add(). It has many branching depending on |
the docs says clearly |
@ziggie1984 I think that "when the counter is greater than zero" part applies only to "calls with a positive delta", not to "Calls with a negative delta" part in that sentence. If counter=0, this means Add(-1) has already been applied. Doesn't this mean, that Add(-1) precedes Wait() in this case?
Can you elaborate on this, please? If Wait() sees counter=0, it doesn't need to be notified, because it return immediately in that case. While writing the response, I made a small table version of how I understand the rules of using Add():
Add(-1) is not allowed when counter=0, because then the counter would become negative. |
@ziggie1984 Also check example of WaitGroup, please. They call wg.Done() from a goroutine and it can run at the same time with Wait(). If this is not allowed, then the example is wrong and it is a bug in Go. |
Ahh ok right I was a bit mislead by this here:
|
After discussing with @ziggie1984 this morning I think this is correct. |
Change Description
The package provides type
GoroutineManager
which is used to launch goroutines until context expires or the manager is stopped. Stop method blocks until all started goroutines stop.Original code by Andras https://go.dev/play/p/HhRpE-K2lA0
Adjustments and tests by Boris.
I also added another unrelated
fn
fix to this PR:fn: generalize type of t in UnwrapOrFail
.Steps to Test
CI runs unit test. Also I used GoroutineManager in #9140
Pull Request Checklist
Testing
Code Style and Documentation
[skip ci]
in the commit message for small changes.