Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite training component using kubeflow-training library #231

Merged
merged 1 commit into from
Jan 24, 2025

Conversation

Shreyanand
Copy link
Member

@Shreyanand Shreyanand commented Dec 4, 2024

This PR changes the pytorch_manifest_op to pytorch_job_launcher_op with the following changes:

  1. Replace the manifest yaml args, env vars, volumes, volume mounts to kubeflow-training sdk functions
  2. Add master pod train logs to the launcher component so that it is visible in the RHOAI pipeline dashboard
  3. Add a flag to delete the pytorch job after completion

To do

  • The image used in the component needs to be changed to the Python image being created by the RHOAI team.
  • Align with RHEL image 1.3
  • Change formatting of env vars, volumes, volume mounts

@Shreyanand Shreyanand marked this pull request as ready for review December 5, 2024 15:59
@Shreyanand Shreyanand changed the title WIP: Rewrite training component using kubeflow-training library Rewrite training component using kubeflow-training library Dec 5, 2024
@Shreyanand Shreyanand requested review from MichaelClifford, tumido and leseb and removed request for tumido December 5, 2024 15:59
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Show resolved Hide resolved
@MichaelClifford
Copy link
Collaborator

@Shreyanand be sure to run ruff format <file> to overcome the pre-commit-check errors

training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
@Shreyanand
Copy link
Member Author

@Shreyanand be sure to run ruff format <file> to overcome the pre-commit-check errors

@MichaelClifford I did run it and it doesn't change any file but somehow the pre-commit test fails 🤔
Screenshot 2024-12-10 at 3 11 29 PM

@leseb
Copy link
Collaborator

leseb commented Dec 11, 2024

@Shreyanand be sure to run ruff format <file> to overcome the pre-commit-check errors

@MichaelClifford I did run it and it doesn't change any file but somehow the pre-commit test fails 🤔 Screenshot 2024-12-10 at 3 11 29 PM

It's the imports that are failing so you need to run: ruff check --select I --fix ., thanks!

tumido
tumido previously requested changes Dec 11, 2024
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s more of a stylistic preference, but since the current function is quite bloated, breaking it into smaller sub-functions to handle the creation of master and worker containers could enhance readability. Feel free to implement or not.

training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
training/components.py Outdated Show resolved Hide resolved
@tumido tumido dismissed their stale review December 16, 2024 17:18

Dismissing stale change requests... Seems like adding a new review with "comment" action is not enough here. 🤷

@Shreyanand
Copy link
Member Author

/hold till the official RHOAI python image is released.

Copy link
Collaborator

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job!

Copy link
Member

@tumido tumido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit. Can we get this one formatted better:
https://github.com/opendatahub-io/ilab-on-ocp/pull/231/files#r1887046992

Otherwise LGTM 🙂

@Shreyanand Shreyanand force-pushed the umain branch 2 times, most recently from 503492a to 67532f0 Compare January 20, 2025 22:53
Copy link
Member

@tumido tumido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Update launcher component with correct params

Remove namespace argument

Fix formatting

Add PR suggestions

Add changes for RHEL AI image 1.3

Add changes for RHEL AI image 1.3

Add ruff changes

Fix pipeline errors

Add RHEL 1.3.1 image

Add ruff changes

Add uv make pipeline.yaml

Change launcher image and get logs func

Add make pipeline changes

Add latest RHOAI python image

Fix rebase errors

Change formatting errors and add right image shas

Rebase over latest changes

Signed-off-by: Shreyanand <[email protected]>
@Shreyanand
Copy link
Member Author

@tumido @HumairAK @MichaelClifford
I rebased this over the latest image fixes in #255 and a completed test run is here: https://rhods-dashboard-redhat-ods-applications.apps.ocp-beta-test.nerc.mghpcc.org/experiments/ilab/5d5a1755-0df4-4a08-9a46-85b1582be2f5/runs/e8c6ab40-1944-4b64-9b21-7be6a00b910f

@HumairAK HumairAK merged commit 555d222 into opendatahub-io:main Jan 24, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants