Watch agent metadata service #5017

pingsutw · 2024-03-06T22:06:05Z

Tracking issue

Why are the changes needed?

Monitor the agent metadata service to sync the metadata (support_task_type, task_type_version, is_sync, etc.) so the Agent doesn't need to be started before Propeller.

What changes were proposed in this pull request?

Add a watcher to the agent plugin to periodically sync the metadata.

If the agent doesn't implement metadata service, the plugin will use metadata (support task type) in the config map
if Propeller gets the metadata from the agent, this metadata will override the existing metadata in the config map.

How was this patch tested?

Setup process

  agent-service:
    defaultAgent:
      defaultTimeout: 10s
      endpoint: localhost:8005
      insecure: true
    agents:
      myAgent:
        endpoint: localhost:8000
        insecure: true

Screenshots

The agent is not stated. propeller sends the request to the default agent localhost:8005

Start running agent. propeller sends the request to the default myAgent localhost:8000

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

NA

Docs link

NA

Signed-off-by: Kevin Su <[email protected]>

codecov · 2024-03-06T23:08:09Z

Codecov Report

Attention: Patch coverage is 47.61905% with 33 lines in your changes missing coverage. Please review.

Project coverage is 60.98%. Comparing base (38883c7) to head (e91de3b).
Report is 137 commits behind head on master.

Files with missing lines	Patch %	Lines
...yteplugins/go/tasks/plugins/webapi/agent/client.go	50.00%	14 Missing and 3 partials ⚠️
...yteplugins/go/tasks/plugins/webapi/agent/plugin.go	44.82%	16 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5017      +/-   ##
==========================================
- Coverage   60.99%   60.98%   -0.02%     
==========================================
  Files         793      793              
  Lines       51295    51313      +18     
==========================================
+ Hits        31289    31294       +5     
- Misses      17125    17134       +9     
- Partials     2881     2885       +4

Flag	Coverage Δ
unittests-datacatalog	`69.31% <ø> (ø)`
unittests-flyteadmin	`58.66% <ø> (+0.02%)`	⬆️
unittests-flytecopilot	`17.79% <ø> (ø)`
unittests-flytectl	`67.97% <ø> (ø)`
unittests-flyteidl	`79.04% <ø> (ø)`
unittests-flyteplugins	`61.84% <47.61%> (-0.10%)`	⬇️
unittests-flytepropeller	`57.32% <ø> (ø)`
unittests-flytestdlib	`65.82% <ø> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier · 2024-03-07T04:37:28Z

Wow, the watcher is amazing, will do more test about it.

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier

I've tested this in this scenario.
It shows me that it works well.

start the propeller (Agent server is closed)
trigger a sensor task to print my current agentRegistry
start an agent server
trigger a sensor task to print my current agentRegistry

Print agentRegistry before the agent server started

before the agent server started

after the agent server started

flyteplugins/go/tasks/plugins/webapi/agent/client.go

Future-Outlier

I've done some tests on it.
I think even we can use the watcher to update support_task_type, it can't help use to update propeller's metadata.
Try this case.

  agent-service:
  #   supportedTaskTypes:
  #     - sensor
    defaultAgent:
      endpoint: "dns:///localhost:8000"
      insecure: true
      timeouts:
        CreateTask: 100s
        GetTask: 100s
      defaultTimeout: 100s

Even you start the agent server, it will still fail to execute agent task.

So here's a follow up question, is it possible to update the supportedTaskType for propeller?
I've traced the code and found that it has conflict with current propeller's load plugin mechanism.

Future-Outlier · 2024-03-07T15:52:27Z

cc @honnix , this is the current watcher implementation.
To save your time, let me give you a summary of how this watcher works.

It will not panic even the agent server is not started.
It will update AgentRegistry every 10 seconds, so it is available that you can start the agent server after propeller started
you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType

Does it look well to you?
Do you have any thoughts to enable the propeller to get supportedTaskType after agent servers start?
Thank you for your patience ~ ! : )

Signed-off-by: Kevin Su <[email protected]>

honnix · 2024-03-08T17:00:08Z

@Future-Outlier Thank you for summarizing. This approach looks good to me.

However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier · 2024-03-09T07:53:53Z

@Future-Outlier Thank you for summarizing. This approach looks good to me.

However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix
Yes, you are correct, sorry for this breaking change.
Do you think it is not ok to have this breaking change?
Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go.
However, the code will be a little bit confusing.
Are you okay with the impact of this breaking change?

honnix · 2024-03-09T09:30:48Z

@Future-Outlier Thank you for summarizing. This approach looks good to me.
However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.

Future-Outlier · 2024-03-09T12:57:40Z

@Future-Outlier Thank you for summarizing. This approach looks good to me.
However I noticed a breaking change (not introduced by this PR). For create operation, it used to be synchronous when resources is returned directly; but now all configured (in config file) agents is by default async, and that seems to break the contract because we can no longer have sync creation unless implementing the metadata endpoint. Is this correct understanding?

@honnix Yes, you are correct, sorry for this breaking change. Do you think it is not ok to have this breaking change? Maybe I can discuss with Kevin about integrating the breaking change to agent/plugin.go. However, the code will be a little bit confusing. Are you okay with the impact of this breaking change?

@Future-Outlier Thank you for the confirmation. No it is not a problem for us, and I think in general it looks like a good change (being explicit is good). As long as it is noted in the release notes, I think it should be fine.

@honnix , thank you, we will try to add more information in the release notes.
Also, there's a change about readiness probe for the agent pod, which means that in the upcoming release,
agent pod needs to have health check servicer.
(It can be turned off by the configmap)
flyte config map: #4990
flytekit agent server implementation: flyteorg/flytekit#2232

If there's anything we can help, please leave comments, thank you!

honnix · 2024-03-11T11:50:02Z

@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.

Future-Outlier · 2024-03-11T12:36:25Z

@Future-Outlier Thank you for the heads-up. If I understood correctly, this only matters if we deploy the agent service as a pod.

Yes, you are right!

pingsutw · 2024-03-11T17:37:28Z

you sill needs to write supportedTaskType in your configmap, since propeller will not wait your agent server give you all supportedTaskType

I'm going to update this PR, so we're also able to reload support task type at runtime.

honnix · 2024-04-12T07:05:02Z

Any plan to get this going? Thanks.

Future-Outlier · 2024-04-15T14:57:42Z

Any plan to get this going? Thanks.

Hi, sorry for the late reply, I will discuss this with Kevin and update our discussion this week, thank you for your patience!

Future-Outlier · 2024-04-23T04:10:52Z

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now.
He says he can list some details for me, and I can try implementing this.
However, I am busy with school homework now, and I can come back after 5/12.
Will it be too late for you?

honnix · 2024-04-23T07:05:34Z

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.

Future-Outlier · 2024-05-19T03:34:44Z

Any plan to get this going? Thanks.

Hi, @honnix Kevin is busy with other issues now. He says he can list some details for me, and I can try implementing this. However, I am busy with school homework now, and I can come back after 5/12. Will it be too late for you?

Hi @Future-Outlier . Thank you for the update. This is not urgent so please take your time.

Doing it, will make some variables from private to public so that we can dynamically update support task types,
thank you for waiting, will ask for your review after finish it.

honnix · 2024-06-07T15:45:21Z

To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.

Future-Outlier · 2024-06-08T00:50:34Z

To give a real world incident: we just had a flytepropeller startup trouble due to one agent service being unavailable. This feature will be highly appreciated.

No problem, fighting on it with Kevin.

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier

This PR is not easy, thank you for the effort!

Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Kevin Su <[email protected]>

pingsutw added 3 commits March 5, 2024 17:51

wip

0b1955c

Signed-off-by: Kevin Su <[email protected]>

Watch agent service

0cffe52

Signed-off-by: Kevin Su <[email protected]>

lint

ac2a89f

Signed-off-by: Kevin Su <[email protected]>

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. enhancement New feature or request labels Mar 6, 2024

nit

b88ecb9

Signed-off-by: Kevin Su <[email protected]>

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 6, 2024

Fix test

d29998e

Signed-off-by: Kevin Su <[email protected]>

lint

36e6f87

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier reviewed Mar 7, 2024

View reviewed changes

flyteplugins/go/tasks/plugins/webapi/agent/client.go Outdated Show resolved Hide resolved

Future-Outlier reviewed Mar 7, 2024

View reviewed changes

pingsutw added 4 commits March 7, 2024 11:34

nit

063846f

Signed-off-by: Kevin Su <[email protected]>

updateAgentClientSets instead

7187ade

Signed-off-by: Kevin Su <[email protected]>

lock

3dca1a7

Signed-off-by: Kevin Su <[email protected]>

nit

60377d9

Signed-off-by: Kevin Su <[email protected]>

defer

32a907e

Signed-off-by: Kevin Su <[email protected]>

pingsutw marked this pull request as draft March 11, 2024 10:47

pingsutw added 4 commits June 8, 2024 00:02

resolve conflict

150fc33

Signed-off-by: Kevin Su <[email protected]>

lint

24feb98

Signed-off-by: Kevin Su <[email protected]>

Add getter and setter

d83e51f

Signed-off-by: Kevin Su <[email protected]>

refactor(webapi): Improve agent client handling and logging

86c0ccc

Signed-off-by: Kevin Su <[email protected]>

pingsutw marked this pull request as ready for review June 8, 2024 09:23

pingsutw added 5 commits June 8, 2024 02:26

errorf

4e9e7fb

Signed-off-by: Kevin Su <[email protected]>

update

8e3e805

Signed-off-by: Kevin Su <[email protected]>

remove logger

a1d2a2c

Signed-off-by: Kevin Su <[email protected]>

nit

ac06415

Signed-off-by: Kevin Su <[email protected]>

nit

6fb31ab

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier self-requested a review June 8, 2024 09:54

pingsutw added 2 commits June 8, 2024 02:57

lint

26731cd

Signed-off-by: Kevin Su <[email protected]>

lint

e91de3b

Signed-off-by: Kevin Su <[email protected]>

Future-Outlier approved these changes Jun 8, 2024

View reviewed changes

pingsutw merged commit cd37d1b into master Jun 8, 2024
48 of 50 checks passed

pingsutw deleted the watch-agent branch June 8, 2024 10:19

robert-ulbrich-mercedes-benz pushed a commit to robert-ulbrich-mercedes-benz/flyte that referenced this pull request Jul 2, 2024

Watch agent metadata service (flyteorg#5017)

746a3a4

Signed-off-by: Kevin Su <[email protected]> Signed-off-by: Kevin Su <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Watch agent metadata service #5017

Watch agent metadata service #5017

pingsutw commented Mar 6, 2024 •

edited

Loading

codecov bot commented Mar 6, 2024 •

edited

Loading

Future-Outlier commented Mar 7, 2024

Future-Outlier left a comment

Future-Outlier left a comment

Future-Outlier commented Mar 7, 2024

honnix commented Mar 8, 2024

Future-Outlier commented Mar 9, 2024

honnix commented Mar 9, 2024

Future-Outlier commented Mar 9, 2024

honnix commented Mar 11, 2024

Future-Outlier commented Mar 11, 2024

pingsutw commented Mar 11, 2024

honnix commented Apr 12, 2024

Future-Outlier commented Apr 15, 2024

Future-Outlier commented Apr 23, 2024

honnix commented Apr 23, 2024

Future-Outlier commented May 19, 2024 •

edited

Loading

honnix commented Jun 7, 2024

Future-Outlier commented Jun 8, 2024

Future-Outlier left a comment

Watch agent metadata service #5017

Watch agent metadata service #5017

Conversation

pingsutw commented Mar 6, 2024 • edited Loading

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

Related PRs

Docs link

codecov bot commented Mar 6, 2024 • edited Loading

Codecov Report

Future-Outlier commented Mar 7, 2024

Future-Outlier left a comment

Choose a reason for hiding this comment

Future-Outlier left a comment

Choose a reason for hiding this comment

Future-Outlier commented Mar 7, 2024

honnix commented Mar 8, 2024

Future-Outlier commented Mar 9, 2024

honnix commented Mar 9, 2024

Future-Outlier commented Mar 9, 2024

honnix commented Mar 11, 2024

Future-Outlier commented Mar 11, 2024

pingsutw commented Mar 11, 2024

honnix commented Apr 12, 2024

Future-Outlier commented Apr 15, 2024

Future-Outlier commented Apr 23, 2024

honnix commented Apr 23, 2024

Future-Outlier commented May 19, 2024 • edited Loading

honnix commented Jun 7, 2024

Future-Outlier commented Jun 8, 2024

Future-Outlier left a comment

Choose a reason for hiding this comment

pingsutw commented Mar 6, 2024 •

edited

Loading

codecov bot commented Mar 6, 2024 •

edited

Loading

Future-Outlier commented May 19, 2024 •

edited

Loading