Fixing the jobrunner #42

Merged

Conversation


@cardil cardil commented Jul 21, 2021

Changes

  • 🐛 Fix the batchv1.Job runner so that it properly waits for the remote job that sends the event to complete.

/kind bug

Fixes #10

Release Note

The "In Cluster Sender" has been fixed to properly show the state of the sent event. When sending an event to cluster-local resources, the sender waits until the message is delivered to the target.

@knative-prow-robot knative-prow-robot added kind/bug Categorizes issue or PR as related to a bug. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 21, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 21, 2021
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 21, 2021

cardil commented Jul 21, 2021

/cc @rhuss
/cc @dsimansk

})
if err != nil {
	return fmt.Errorf("%w: %v", ErrICSenderJobFailed, err)
}
Contributor

Would defer watcher.Stop() be helpful around here to clean up resources?

@dsimansk dsimansk Jul 21, 2021

Also, take a look at RetryWatcher from client-go tools (link below). It tries to restart the watch when the ResultChan channel is closed. That's probably a rare case, but we've seen reports in the client's issues that ResultChan can be closed unexpectedly, though with a recoverable error.

Finally, I wonder whether you could reuse something like the existing RetryWatcher code directly here.

https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L242-L275

Contributor Author

I think that may not be possible in this exact case because of:

https://github.com/kubernetes/client-go/blob/ac207faedfb64acd5b99a2fb309b7044918b4dda/tools/watch/retrywatcher.go#L68-L71

WDYT?

Contributor

Ahh, I recall that part now. Sure, let's keep your watching implementation.

Do you think it's worth adding at least a bit of retry logic for when ResultChan is closed? Of course, it can be done as a future enhancement/hardening.

Contributor Author

I think it's a good idea for an enhancement in a future PR. Would you mind opening an issue?

Contributor

Yep, I'll do it first thing tomorrow.

Contributor Author

#45 tracks the improvement. I'm pretty convinced that #68 is impacted by this.

@dsimansk
Contributor

/lgtm

@cardil feel free to unhold.
/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 21, 2021
@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2021
…b-runner

Conflicts fixed:

* pkg/tests/fakeclients.go
@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2021
@cardil cardil requested a review from dsimansk July 21, 2021 16:42

cardil commented Jul 21, 2021

@dsimansk Thanks for the review.

I had to rebase. Please add the LGTM flag again.

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2021
@knative-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cardil, dsimansk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


cardil commented Jul 21, 2021

/unhold

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 21, 2021
@knative-prow-robot knative-prow-robot merged commit 44b1876 into knative-extensions:main Jul 21, 2021
@cardil cardil deleted the bugfix/10-proper-job-runner branch July 21, 2021 18:52
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

jobRunner.Run isn't waiting properly for the end of the Job it creates
3 participants