discovery+graph: track job set dependencies in vb #9241
cc: @gijswijs for review
graph/validation_barrier.go
Outdated
func (v *ValidationBarrier) FetchJobSlot() {
	// We'll wait for either a new slot to become open, or for the quit
Since the builder code is called by the gossiper code, which itself also uses a semaphore, why isn't that inheritance enough?
It looks like it is, good catch
func (v *ValidationBarrier) SignalDependants(job interface{}, allow bool) {
// SignalDependents signals to any child jobs that this parent job has
// finished.
func (v *ValidationBarrier) SignalDependents(job interface{}, id JobID) error {
haha sneaky change from British spelling to American 😝
this has been driving me crazy for years
discovery/gossiper.go
Outdated
	"JobID=%v", spew.Sdump(nMsg.msg), jobID)
}
Should we not handle the error similarly to how it is handled for WaitForParents above (including returning after handling)?
changed
graph/validation_barrier.go
Outdated
	info.activeParentJobIDs.Add(annJobID)
} else {
style nit: return after info.activeParentJobIDs.Add(annJobID), then remove the else block and unindent its body
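For readers following along, the shape being suggested is roughly the one below. This is only a sketch: `jobInfo`, `registerParent`, and the string key are hypothetical stand-ins, and what the original `else` branch actually did is not visible in this thread, so the "create a fresh entry" path is an assumption.

```go
package main

import "fmt"

// JobID is a stand-in for the identifier type used in the PR.
type JobID uint64

// jobInfo is a hypothetical, trimmed-down version of the struct under review.
type jobInfo struct {
	activeParentJobIDs map[JobID]struct{}
}

// registerParent records annJobID as an active parent for key. When an entry
// already exists we add to it and return early, so the "create" path below
// needs no else block and sits at the function's base indentation.
func registerParent(infos map[string]*jobInfo, key string, annJobID JobID) {
	if info, ok := infos[key]; ok {
		info.activeParentJobIDs[annJobID] = struct{}{}
		return
	}

	// Assumed behavior of the former else branch: create a fresh entry.
	infos[key] = &jobInfo{
		activeParentJobIDs: map[JobID]struct{}{annJobID: {}},
	}
}

func main() {
	infos := make(map[string]*jobInfo)
	registerParent(infos, "scid-1", 1)
	registerParent(infos, "scid-1", 2)
	fmt.Println(len(infos["scid-1"].activeParentJobIDs))
}
```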
graph/validation_barrier.go
Outdated
signals, ok = v.chanEdgeDependencies[msg.ShortChannelID]
annID = msg.ShortChannelID

// TODO: If ok is false, we have serious issues.
throw the error? (if it is really impossible then panic)
Despite the many legitimate uses of panics, they have been rejected every time I have tried to use them (even for the provably impossible scenario). I believe that a critical log is the next best thing.
returning error now
I'm finding the distinction between parent and child jobs here both confusing and unnecessary. What we have here is a dependency graph of undifferentiated job IDs. Once all of our dependencies have run, we can run. Once we run, we want to signal all of our dependents. We should be able to accomplish this with a single removeJob that does this index cleanup and dependent signaling.
The main difficulty I'm noticing in this PR is that we have multiple IDs from disjoint domains that we want to be able to map to JobIDs. My recommendation here is to make the core algebra of this component undifferentiated and then have auxiliary mappings that help recover the relevant JobID from the other unique protocol identifiers.
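To make the suggestion concrete, here is a minimal sketch of an undifferentiated dependency graph with a single `removeJob`. It is not the PR's code: `depGraph`, `addJob`, and the `byChannel` index are hypothetical names, and locking and goroutine signaling are left out so the index bookkeeping is easier to see.

```go
package main

import "fmt"

// JobID identifies a job; no parent/child distinction is encoded in it.
type JobID uint64

// depGraph is a hypothetical core structure: two mirrored indexes.
type depGraph struct {
	// deps maps a job to the set of jobs it is still waiting on.
	deps map[JobID]map[JobID]struct{}
	// dependents maps a job to the set of jobs waiting on it.
	dependents map[JobID]map[JobID]struct{}
}

func newDepGraph() *depGraph {
	return &depGraph{
		deps:       make(map[JobID]map[JobID]struct{}),
		dependents: make(map[JobID]map[JobID]struct{}),
	}
}

// addJob registers id. Only dependencies that are currently registered
// (i.e. still running) are tracked; finished or unknown ones are skipped.
func (g *depGraph) addJob(id JobID, deps ...JobID) {
	g.deps[id] = make(map[JobID]struct{})
	for _, d := range deps {
		if _, running := g.deps[d]; !running || d == id {
			continue
		}
		g.deps[id][d] = struct{}{}
		if g.dependents[d] == nil {
			g.dependents[d] = make(map[JobID]struct{})
		}
		g.dependents[d][id] = struct{}{}
	}
}

// removeJob cleans up both indexes for id and returns the dependents that
// have no remaining dependencies and may therefore run.
func (g *depGraph) removeJob(id JobID) []JobID {
	// Unlink id from anything it was still waiting on.
	for parent := range g.deps[id] {
		delete(g.dependents[parent], id)
	}

	var ready []JobID
	for child := range g.dependents[id] {
		delete(g.deps[child], id)
		if len(g.deps[child]) == 0 {
			ready = append(ready, child)
		}
	}
	delete(g.deps, id)
	delete(g.dependents, id)

	return ready
}

func main() {
	g := newDepGraph()

	// Hypothetical auxiliary mapping from a protocol identifier (here a
	// short channel ID as uint64) to the jobs registered for it.
	byChannel := map[uint64][]JobID{}

	g.addJob(1) // e.g. a channel announcement
	byChannel[42] = append(byChannel[42], 1)

	g.addJob(2, byChannel[42]...) // e.g. a channel update for the same channel

	fmt.Println(g.removeJob(1))
}
```

The core graph only ever sees `JobID`s; the protocol-specific identifiers live entirely in the auxiliary map, which is the separation the comment argues for.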
// length and entries therefore cannot hash to the same keys.
// NOTE: IF OTHER TYPES OF KEYS ARE STORED, CHECK THAT COLLISION WON'T
// OCCUR.
jobInfoMap map[any]*jobInfo
I think we ought to use an explicit closed union (via an interface) in the key here. any is a disaster waiting to happen.
I was trying to figure out how to do this, but couldn't. map won't accept an interface as a key because interfaces aren't comparable
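A note on the comparability question: per the Go spec, interface types are comparable and may be used as map keys; the comparison only panics at run time when a key's dynamic type is itself not comparable (a slice, map, or func). Below is a sketch of a closed union used as a key. `depKey`, `chanKey`, and `nodeKey` are hypothetical stand-ins (for something like `lnwire.ShortChannelID` and `route.Vertex`), not types from the PR.

```go
package main

import "fmt"

// depKey is a closed union: the unexported marker method means only types
// declared in this package can implement it.
type depKey interface {
	isDepKey()
}

// chanKey is a hypothetical stand-in for a short channel ID.
type chanKey uint64

// nodeKey is a hypothetical stand-in for a 33-byte node public key.
type nodeKey [33]byte

func (chanKey) isDepKey() {}
func (nodeKey) isDepKey() {}

func main() {
	// Both dynamic types are comparable (an integer and an array of bytes),
	// so using them behind the interface as map keys is safe.
	m := make(map[depKey]string)
	m[chanKey(1)] = "channel job info"
	m[nodeKey{0x02}] = "node job info"

	fmt.Println(m[chanKey(1)])
	fmt.Println(m[nodeKey{0x02}])
	fmt.Println(len(m))
}
```

Because keys with different dynamic types never compare equal, a `chanKey` and a `nodeKey` cannot collide, which also removes the need for the collision warning in the struct comment.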
graph/validation_barrier.go
Outdated
// should complete after another) for the (childJobID, annID) tuple. This must
// only be called from InitJobDependencies.
why not just define this as a local function inside that scope to enforce that it is only referenceable there?
changed, this was only introduced to not have to deal with line-length issues
graph/validation_barrier.go
Outdated
// Copy over the parent job IDs at this moment for this annID.
// This job must be processed AFTER these parent IDs.
parentJobs := info.activeParentJobIDs.Union(fn.NewSet[JobID]())
I think this reveals a need for a set copying method.
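A copy method along these lines would remove the need for the `Union` with an empty set. This is a sketch on a locally defined `Set` type; it assumes a map-backed generic set and makes no claim about the actual API of the `fn` package.

```go
package main

import "fmt"

// Set is a hypothetical map-backed generic set, defined here only for
// illustration.
type Set[T comparable] map[T]struct{}

// NewSet builds a set from the given elements.
func NewSet[T comparable](elems ...T) Set[T] {
	s := make(Set[T], len(elems))
	for _, e := range elems {
		s[e] = struct{}{}
	}
	return s
}

// Add inserts e into the set.
func (s Set[T]) Add(e T) { s[e] = struct{}{} }

// Contains reports whether e is in the set.
func (s Set[T]) Contains(e T) bool {
	_, ok := s[e]
	return ok
}

// Copy returns a new set with the same elements. Later changes to either
// set do not affect the other.
func (s Set[T]) Copy() Set[T] {
	c := make(Set[T], len(s))
	for e := range s {
		c[e] = struct{}{}
	}
	return c
}

func main() {
	active := NewSet(1, 2)

	// Snapshot the parents as they are right now.
	snapshot := active.Copy()
	active.Add(3)

	fmt.Println(len(active), len(snapshot), snapshot.Contains(3))
}
```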
annID = msg.ShortChannelID

// TODO: If ok is false, we have serious issues.
parentJobIDs, ok = v.jobDependencies[childJobID]
why does this read not need a mutex lock?
it's locked above no?
graph/validation_barrier.go
Outdated
// and cleans up its job dependency mappings. This MUST be called from
// SignalDependents.
// NOTE: MUST be called with the mutex held.
func (v *ValidationBarrier) removeChildJob(annID any, childJobID JobID) {
🫡
graph/validation_barrier.go
Outdated
// We don't want to block when sending out the signal.
select {
case notifyChan <- struct{}{}:
default:
}
Are we ok with swallowing the signal instead? Seems like this could cause jobs to never be run, particularly if lastJob is true
I'm not sure what you mean here. 1. How would jobs not be run in the current scenario? 2. What does swallowing the signal look like here?
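For context on what "swallowing" means with this pattern: a `select` with a `default` branch drops the send whenever the channel cannot accept it immediately. The sketch below shows that behavior in isolation; `trySignal` is a hypothetical helper, and whether the PR's `notifyChan` is buffered, or always has a receiver waiting, is not visible in this excerpt.

```go
package main

import "fmt"

// trySignal performs a non-blocking send and reports whether the signal was
// actually delivered (or buffered) rather than dropped.
func trySignal(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		// Nobody is receiving and there is no free buffer slot, so the
		// signal is silently discarded.
		return false
	}
}

func main() {
	// Unbuffered channel with no receiver: the signal is dropped.
	unbuffered := make(chan struct{})
	fmt.Println(trySignal(unbuffered))

	// Buffer of one: the first signal is stored, the second is dropped.
	buffered := make(chan struct{}, 1)
	fmt.Println(trySignal(buffered))
	fmt.Println(trySignal(buffered))
}
```

Whether a dropped signal can strand a waiter depends on how the receiver is written: if it re-checks shared state under the mutex after every wake-up, one pending signal in a size-one buffer is enough; if it counts signals, drops would matter.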
graph/validation_barrier.go
Outdated
case *lnwire.NodeAnnouncement:
	delete(v.nodeAnnDependencies, route.Vertex(msg.NodeID))
	// Remove child job info.
	v.removeChildJob(route.Vertex(msg.NodeID), id)
I don't think we need two distinct removal functions for parent and child.
changed, using a bool now since they have different requirements
cc: @gijswijs for review
@Crypt-iQ, remember to re-request review from reviewers when ready
Outstanding things to do:
@Crypt-iQ - is this ready for re-review given the itest failures?
Sorry no, it seems like I broke something
This omits calls to InitJobDependencies, SignalDependants, and WaitForDependants. These changes have been made here because the router / builder code does not actually need job dependency management. Calls to the builder code (i.e. AddNode, AddEdge, UpdateEdge) are all blocking in the gossiper. This, combined with the fact that child jobs are run after parent jobs in the gossiper, means that the calls to the router will happen in the proper dependency order.
Ready for review now, had an issue with |
@Crypt-iQ - defs ready given all the failures?
This commit does two things:
- removes the concept of allow / deny. Having this in place was a minor optimization and removing it makes the solution simpler.
- changes the job dependency tracking to track sets of abstract parent jobs rather than individual parent jobs.
As a note, the purpose of the ValidationBarrier is that it allows us to launch gossip validation jobs in goroutines while still ensuring that the validation order of these goroutines is adhered to when it comes to validating ChannelAnnouncement _before_ ChannelUpdate and _before_ NodeAnnouncement.
now it is, the fn change broke it |
This PR changes the ValidationBarrier to track abstract job dependencies. This just means that every time a child job comes in (i.e. channel update or node announcement), we track the set of possible parent jobs that are related to it (channel announcement(s)) that have registered via InitJobDependencies. The goroutines containing the child jobs will then wait to be notified every time one of their parent jobs completes. From the child job's POV, this just works as ref-counting except that you're only counting the parent jobs you're interested in.
With this, we can now extend the ValidationBarrier to track any sort of abstract job that requires both concurrency and waiting for another job to finish. It also makes it possible in a future PR to very easily make node announcements depend on channel announcements. See the commit messages for more details.
TODO:
ValidationBarrier and ensure that all child jobs finish after their related parent jobs.
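The ordering guarantee described above (a child snapshots the parents active when it arrives, then waits for exactly those) can be sketched as follows. This is a simplified illustration, not the PR's implementation: `barrier`, `startParent`, `finishParent`, `snapshotParents`, and `runDemo` are hypothetical names, and it uses one close-on-completion channel per parent where the PR describes notifying children and counting down.

```go
package main

import (
	"fmt"
	"sync"
)

type JobID uint64

// barrier is a hypothetical, simplified stand-in for ValidationBarrier.
type barrier struct {
	mu     sync.Mutex
	nextID JobID
	// active maps a dependency key (e.g. a channel ID) to the done
	// channels of the parent jobs still running for that key.
	active map[uint64]map[JobID]chan struct{}
}

func newBarrier() *barrier {
	return &barrier{active: make(map[uint64]map[JobID]chan struct{})}
}

// startParent registers a running parent job under key and returns its ID.
func (b *barrier) startParent(key uint64) JobID {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.nextID++
	if b.active[key] == nil {
		b.active[key] = make(map[JobID]chan struct{})
	}
	b.active[key][b.nextID] = make(chan struct{})
	return b.nextID
}

// finishParent marks the parent as done, waking every child waiting on it.
func (b *barrier) finishParent(key uint64, id JobID) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if done, ok := b.active[key][id]; ok {
		close(done)
		delete(b.active[key], id)
	}
}

// snapshotParents returns the done channels of the parents active for key
// at this moment. Parents registered later are deliberately not included.
func (b *barrier) snapshotParents(key uint64) []chan struct{} {
	b.mu.Lock()
	defer b.mu.Unlock()

	var parents []chan struct{}
	for _, done := range b.active[key] {
		parents = append(parents, done)
	}
	return parents
}

// runDemo runs one parent and one child and returns the completion order.
func runDemo() []string {
	b := newBarrier()
	var (
		mu    sync.Mutex
		order []string
		wg    sync.WaitGroup
	)
	record := func(s string) {
		mu.Lock()
		order = append(order, s)
		mu.Unlock()
	}

	parentID := b.startParent(42)
	parents := b.snapshotParents(42)

	wg.Add(1)
	go func() {
		defer wg.Done()
		// Child: block until every snapshotted parent has finished.
		for _, done := range parents {
			<-done
		}
		record("child")
	}()

	record("parent")
	b.finishParent(42, parentID)
	wg.Wait()
	return order
}

func main() {
	fmt.Println(runDemo())
}
```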