🐛 fix work agent performance issue #785

Conversation

elgnay
Contributor

@elgnay elgnay commented Jan 2, 2025

Summary

Related issue(s)

Fixes #

codecov bot commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.

Project coverage is 63.78%. Comparing base (e1b5f88) to head (f5a77b3).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...lers/manifestcontroller/manifestwork_reconciler.go | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #785      +/-   ##
==========================================
+ Coverage   63.77%   63.78%   +0.01%     
==========================================
  Files         192      192              
  Lines       18596    18608      +12     
==========================================
+ Hits        11860    11870      +10     
- Misses       5756     5759       +3     
+ Partials      980      979       -1     
Flag Coverage Δ
unit 63.78% <88.88%> (+0.01%) ⬆️

@elgnay
Contributor Author

elgnay commented Jan 2, 2025

/hold

@elgnay elgnay force-pushed the fix_work_perf_issue branch 2 times, most recently from ce649d1 to 387676c Compare January 6, 2025 02:24
@@ -175,6 +175,9 @@ func (m *ManifestWorkController) sync(ctx context.Context, controllerContext fac
// if needed.
if !mwUpdated && !amwUpdated && requeueTime < MaxRequeueDuration {
controllerContext.Queue().AddAfter(manifestWorkName, requeueTime)
+} else {
+// resync each manifestwork every 5 minutes
+controllerContext.Queue().AddAfter(manifestWorkName, 5*time.Minute)
Member

I do not think we need the else here; we do not need to requeue when the manifestwork or appliedmanifestwork is updated. I think we only need to change this line
if !mwUpdated && !amwUpdated && requeueTime < MaxRequeueDuration
to
if !mwUpdated && !amwUpdated

Contributor Author

Updated as suggested.

-spokeWorkInformerFactory := workinformers.NewSharedInformerFactory(spokeWorkClient, 5*time.Minute)
+// resync with a small interval could result in performance issue when the number of appliedmanifestworks
+// is large.
+spokeWorkInformerFactory := workinformers.NewSharedInformerFactory(spokeWorkClient, 21*time.Hour)
Member

24 hours?

Contributor Author

If both the manifestwork informer and the appliedmanifestwork informer resync every 24 hours, they may resync at the same time, which would cause a huge peak. With different resync intervals, they put less pressure on the work agent.

Member

Could we add some comments on this?

Contributor Author

Comment added.

@elgnay elgnay force-pushed the fix_work_perf_issue branch from 387676c to 3952304 Compare January 6, 2025 03:30
@@ -138,7 +137,7 @@ func (m *ManifestWorkController) sync(ctx context.Context, controllerContext fac
}
newAppliedManifestWork := appliedManifestWork.DeepCopy()

-var requeueTime = MaxRequeueDuration
+var requeueTime = ResyncInterval
Member

ResyncInterval is also set in withResync. Should these two be set to the same value?

Contributor Author

Do you mean ResyncEvery? It already uses ResyncInterval, although I think it is useless for the ManifestWorkAgent controller.

WithSync(controller.sync).ResyncEvery(ResyncInterval).ToController("ManifestWorkAgent", recorder)

Member

Can we add a comment for ResyncInterval, given that it is used in multiple locations?

Contributor Author

Done.

@elgnay elgnay force-pushed the fix_work_perf_issue branch from 3952304 to f5a77b3 Compare January 6, 2025 08:41
@qiujian16
Member

/approve
/lgtm

Contributor

openshift-ci bot commented Jan 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elgnay, qiujian16

@zhujian7
Member

zhujian7 commented Jan 7, 2025

@elgnay So after this change, once a manifestwork containing a configmap has been applied, if I manually delete the configmap on the managed cluster, how long will it take the work agent to recreate it? 24 hours?

@elgnay
Contributor Author

elgnay commented Jan 7, 2025

> @elgnay So after this change, once a manifestwork containing a configmap has been applied, if I manually delete the configmap on the managed cluster, how long will it take the work agent to recreate it? 24 hours?

Each ManifestWork is still resynced at intervals not exceeding 5 minutes. The only change is that this logic no longer relies on the resync of the informers. In other words, the controller requeues the ManifestWork (https://github.com/open-cluster-management-io/ocm/pull/785/files#diff-f0293b9137d86645469d2f7303ca98e1845a7b482671a25427e774a5c56c08abL177) after it has been processed successfully.

@zhujian7
Member

zhujian7 commented Jan 7, 2025

/lgtm

@elgnay
Contributor Author

elgnay commented Jan 8, 2025

/unhold

@openshift-merge-bot openshift-merge-bot bot merged commit 9af100f into open-cluster-management-io:main Jan 8, 2025
16 checks passed