Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watch agent metadata service #5017

Merged
merged 22 commits into from
Jun 8, 2024
Merged

Watch agent metadata service #5017

merged 22 commits into from
Jun 8, 2024

Conversation

pingsutw
Copy link
Member

@pingsutw pingsutw commented Mar 6, 2024

Tracking issue

#3936

Why are the changes needed?

Monitor the agent metadata service to sync the metadata (support_task_type, task_type_version, is_sync, etc.) so the Agent doesn't need to be started before Propeller.

What changes were proposed in this pull request?

Add a watcher to the agent plugin to periodically sync the metadata.

  • If the agent doesn't implement metadata service, the plugin will use metadata (support task type) in the config map
  • if Propeller gets the metadata from the agent, this metadata will override the existing metadata in the config map.

How was this patch tested?

Setup process

  agent-service:
    defaultAgent:
      defaultTimeout: 10s
      endpoint: localhost:8005
      insecure: true
    agents:
      myAgent:
        endpoint: localhost:8000
        insecure: true

Screenshots

The agent is not stated. propeller sends the request to the default agent localhost:8005

image

Start running agent. propeller sends the request to the default myAgent localhost:8000

image

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

NA

Docs link

NA

pingsutw added 3 commits March 5, 2024 17:51
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Mar 6, 2024
Signed-off-by: Kevin Su <[email protected]>
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 6, 2024
Copy link

codecov bot commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 47.61905% with 33 lines in your changes missing coverage. Please review.

Project coverage is 60.98%. Comparing base (38883c7) to head (e91de3b).
Report is 137 commits behind head on master.

Files with missing lines Patch % Lines
...yteplugins/go/tasks/plugins/webapi/agent/client.go 50.00% 14 Missing and 3 partials ⚠️
...yteplugins/go/tasks/plugins/webapi/agent/plugin.go 44.82% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5017      +/-   ##
==========================================
- Coverage   60.99%   60.98%   -0.02%     
==========================================
  Files         793      793              
  Lines       51295    51313      +18     
==========================================
+ Hits        31289    31294       +5     
- Misses      17125    17134       +9     
- Partials     2881     2885       +4     
Flag Coverage Δ
unittests-datacatalog 69.31% <ø> (ø)
unittests-flyteadmin 58.66% <ø> (+0.02%) ⬆️
unittests-flytecopilot 17.79% <ø> (ø)
unittests-flytectl 67.97% <ø> (ø)
unittests-flyteidl 79.04% <ø> (ø)
unittests-flyteplugins 61.84% <47.61%> (-0.10%) ⬇️
unittests-flytepropeller 57.32% <ø> (ø)
unittests-flytestdlib 65.82% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Kevin Su <[email protected]>
@Future-Outlier
Copy link
Member

Wow, the watcher is amazing, will do more test about it.
image

Signed-off-by: Kevin Su <[email protected]>
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this in this scenario.
It shows me that it works well.

  1. start the propeller (Agent server is closed)
  2. trigger a sensor task to print my current agentRegistry
  3. start an agent server
  4. trigger a sensor task to print my current agentRegistry

Print agentRegistry before the agent server started
image

before the agent server started
image

after the agent server started
image

flyteplugins/go/tasks/plugins/webapi/agent/client.go Outdated Show resolved Hide resolved
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some tests on it.
I think even we can use the watcher to update support_task_type, it can't help use to update propeller's metadata.
Try this case.

  agent-service:
  #   supportedTaskTypes:
  #     - sensor
    defaultAgent:
      endpoint: "dns:///localhost:8000"
      insecure: true
      timeouts:
        CreateTask: 100s
        GetTask: 100s
      defaultTimeout: 100s

Even you start the agent server, it will still fail to execute agent task.
image

So here's a follow up question, is it possible to update the supportedTaskType for propeller?
I've traced the code and found that it has conflict with current propeller's load plugin mechanism.

@Future-Outlier
Copy link
Member

cc @honnix , this is the current watcher implementation.
To save your time, let me give you a summary of how this watcher works.

  • It will not panic even the agent server is not started.
  • It will update AgentRegistry every 10 seconds, so it is available that you can start the agent server after propeller started
  • you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType

Does it look well to you?
Do you have any thoughts to enable the propeller to get supportedTaskType after agent servers start?
Thank you for your patience ~ ! : )

pingsutw added 4 commits March 7, 2024 11:34
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@honnix
Copy link
Member

honnix commented Mar 8, 2024

@Future-Outlier Thank you for summarizing. This approach looks good to me.

However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

Signed-off-by: Kevin Su <[email protected]>
@Future-Outlier
Copy link
Member

@Future-Outlier Thank you for summarizing. This approach looks good to me.

However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix
Yes, you are correct, sorry for this breaking change.
Do you think it is not ok to have this breaking change?
Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go.
However, the code will be a little bit confusing.
Are you okay with the impact of this breaking change?

@honnix
Copy link
Member

honnix commented Mar 9, 2024

@Future-Outlier Thank you for summarizing. This approach looks good to me.
However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.

@Future-Outlier
Copy link
Member

@Future-Outlier Thank you for summarizing. This approach looks good to me.
However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.

@honnix , thank you, we will try to add more information in the release notes.
Also, there's a change about readiness probe for the agent pod, which means that in the upcoming release,
agent pod needs to have health check servicer.
(It can be turned off by the configmap)
flyte config map: #4990
flytekit agent server implementation: flyteorg/flytekit#2232

If there's anything we can help, please leave comments, thank you!

@pingsutw pingsutw marked this pull request as draft March 11, 2024 10:47
@honnix
Copy link
Member

honnix commented Mar 11, 2024

@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.

@Future-Outlier
Copy link
Member

@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.

Yes, you are right!

@pingsutw
Copy link
Member Author

you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType

I'm going to update this PR, so we're also able to reload support task type at runtime.

@honnix
Copy link
Member

honnix commented Apr 12, 2024

Any plan to get this going? Thanks.

@Future-Outlier
Copy link
Member

Any plan to get this going? Thanks.

Hi, sorry for the late reply, I will discuss this with Kevin and update our discussion this week, thank you for your patience!

@Future-Outlier
Copy link
Member

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now.
He says he can list some details for me, and I can try implementing this.
However, I am busy with school homework now, and I can come back after 5/12.
Will it be too late for you?

@honnix
Copy link
Member

honnix commented Apr 23, 2024

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.

@Future-Outlier
Copy link
Member

Future-Outlier commented May 19, 2024

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.

Doing it, will make some variables from private to public so that we can dynamically update support task types,
thank you for waiting, will ask for your review after finish it.

@honnix
Copy link
Member

honnix commented Jun 7, 2024

To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.

@Future-Outlier
Copy link
Member

To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.

No problem, fighting on it with Kevin.

@pingsutw pingsutw marked this pull request as ready for review June 8, 2024 09:23
pingsutw added 5 commits June 8, 2024 02:26
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@Future-Outlier Future-Outlier self-requested a review June 8, 2024 09:54
pingsutw added 2 commits June 8, 2024 02:57
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is not easy, thank you for the effort!

@pingsutw pingsutw merged commit cd37d1b into master Jun 8, 2024
48 of 50 checks passed
@pingsutw pingsutw deleted the watch-agent branch June 8, 2024 10:19
robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this pull request Jul 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants