Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: compute and emit realtime metrics while node not fulfilled #13441

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

fyp711
Copy link
Contributor

@fyp711 fyp711 commented Aug 8, 2024

Fixes #13440

Motivation

When I use the realtime metrics in templates level. The realtime metrics not emit in every template execution.

Explanation:
Before the modification, the node would only calculate the real-time logic during its first creation. However, if the conditions specified in the "when" clause are not met at that time, it would have to wait until the node finishes running before recalculating the "when" conditions for real-time. This does not align with the intended meaning of real-time.

Modifications

compute and then emit realtime metrics while node not fulfilled

How to test ?

Such as this template, I add a prometheus metrics at templates level, It will emit the metrics when {{duration}} > 100.
If workflow keeps running, it never stops. When the workflowResyncPeriod time is reached, workflow is scheduled again, but because the node is not fulfilled, the indicator cannot be calculated again, resulting in that the indicator cannot be emitted when the condition is fulfilled.

 spec:
  templates:
    - name: gen-random-int
      metrics:
        prometheus:
          - name:  node_timeout_metrics
            help: Duration gauge by name
            when: '{{duration}} > 100'
            gauge:
              value: '{{duration}}'
              realtime: true

My expectation

compute and then emit realtime metrics while node not fulfilled

@agilgur5 agilgur5 changed the title fix: should emit realtime metrics in every execution fix: emit realtime metrics in every execution Aug 8, 2024
fyp711 added 2 commits August 9, 2024 11:32
Signed-off-by: Yuping Fan <[email protected]>
Signed-off-by: Yuping Fan <[email protected]>
@fyp711 fyp711 changed the title fix: emit realtime metrics in every execution fix: emit realtime metrics while node not fulfilled Aug 9, 2024
@agilgur5 agilgur5 requested a review from Joibel August 10, 2024 19:22
@fyp711 fyp711 changed the title fix: emit realtime metrics while node not fulfilled fix: compute and emit realtime metrics while node not fulfilled Aug 12, 2024
@fyp711
Copy link
Contributor Author

fyp711 commented Aug 14, 2024

@Joibel Could you take a view for this pr, thanks!

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unconvinced that when clauses should apply to realtime metrics. It seems bizarre to have a gauge which disappears and reappears. If my car did that I'd think it was broken and want to fix it. My preference would be to warn on when clauses being used in conjunction with realtime, and just ignore the when clause.

Given that when clauses do apply, they shouldn't only be evaluated on workflowResyncPeriod. This PR as is doesn't fix that.

My suggestion for a better fix would be to change the interface to RealTimeValueFunc to add a boolean return based on the when clause:

type RealTimeValueFunc func() (bool, float64)

and then fix the fallout from this such that the when clause for realtime is not evaluated in ComputeMetrics and is checked in the func instead.

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 15, 2024

and then fix the fallout from this such that the when clause for realtime is not evaluated in ComputeMetrics and is checked in the func instead.

@Joibel Thanks for review, the point I want to improve is not the calculation method of "when", but if the node will not call woc.computeMetrics(..., true) again during operation.

If it is a workflow-level metrics, this woc.computeMetrics(..., true) is performed every time workflow is scheduled (see

woc.computeMetrics(ctx, woc.execWf.Spec.Metrics.Prometheus, localScope, realTimeScope, true)
). So, I think the template level metrics also needs to be calculated every time the template is scheduled.

@Joibel
Copy link
Member

Joibel commented Aug 15, 2024

and then fix the fallout from this such that the when clause for realtime is not evaluated in ComputeMetrics and is checked in the func instead.

@Joibel Thanks for review, the point I want to improve is not the calculation method of "when", but if the node will not call woc.computeMetrics(..., true) again during operation.

Why do you want to call it again during operation for realtime metrics? The only reason I can see to call it a second time is because the first time the when clause evaluated to false so the realtime metric didn't get created.

If it is a workflow-level metrics, this woc.computeMetrics(..., true) is performed every time workflow is scheduled (see

woc.computeMetrics(ctx, woc.execWf.Spec.Metrics.Prometheus, localScope, realTimeScope, true)

).

This should be unnecessary, I'm not sure what this does that is useful. It should probably only happen once under the guard of if woc.wf.Status.Phase == wfv1.WorkflowUnknown, by moving it down a few lines.

So, I think the template level metrics also needs to be calculated every time the template is scheduled.

Why? What is special about a template being scheduled that means realtime metrics should be updated. Its an arbitrary time point.

@fyp711
Copy link
Contributor Author

fyp711 commented Aug 15, 2024

The only reason I can see to call it a second time is because the first time the when clause evaluated to false so the realtime metric didn't get created.

yes, that's the reason

Comment on lines 2251 to 2253
// Check if the node was just created, if it was emit realtime metrics.
// Check if the node was just created or not fulfilled, if it was emit realtime metrics.
// If the node did not previously exist, we can infer that it was created during the current operation, emit real time metrics.
if _, ok := woc.preExecutionNodePhases[node.ID]; !ok {
if _, ok := woc.preExecutionNodePhases[node.ID]; !ok || !node.Fulfilled() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Joibel What I want to modify is at the template level, my reason is that I specified when duration > 100, which means that the realtime metric is actually sent out when it is greater than 100 seconds. In fact, this metric will not be collected when it is greater than 100 seconds, because the duration is 0 when the template is just created. When calculated as false, I permanently lost this metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does not conform to the definition of "when".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My expectation is that I can collect this metrics when the when condition is met.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you read my suggestion again. I believe it does what you're looking for here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you read my suggestion again. I believe it does what you're looking for here.

I understand, thank you.

@fyp711 fyp711 marked this pull request as draft August 15, 2024 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

should emit realtime metrics while node not fulfilled
3 participants