Update PkgCI test_amd to use MI300x conductor cluster #19517

Merged
yamiyysu merged 21 commits into main from users/yamiyysu/mi300-runner-label
Jan 13, 2025

Contributor

@yamiyysu yamiyysu commented Dec 18, 2024

We want to migrate the workflows that use MI300 and do not require cache support to our conductor cluster. A new runner with one GPU has been created.

This PR updates the runner label.

@yamiyysu yamiyysu requested a review from ScottTodd as a code owner December 18, 2024 18:17
@saienduri saienduri self-requested a review December 18, 2024 18:19
@saienduri saienduri changed the title Update one workflow to use conductor runner Update one workflow to use MI300x conductor cluster Dec 18, 2024
@saienduri saienduri changed the title Update one workflow to use MI300x conductor cluster Update PkgCI test_amd to use MI300x conductor cluster Dec 18, 2024
@ScottTodd ScottTodd self-requested a review January 9, 2025 15:33
Member

@ScottTodd ScottTodd left a comment

Diff'd the logs against a baseline

Branch | Logs
-- | --
this PR | https://github.com/iree-org/iree/actions/runs/12682084250/job/35347142938?pr=19517
main | https://github.com/iree-org/iree/actions/runs/12681354456/job/35345175460

Same number of tests running and passing, similar time taken. LGTM!
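A comparison like the one described above could be scripted roughly as follows. This is only a sketch: it assumes the job logs end with a pytest-style summary line (for example `120 passed, 3 skipped in 45.67s`), which the PR does not confirm, and `parse_summary`, `compare_logs`, and the 25% time tolerance are hypothetical names and values chosen for illustration.

```python
import re

# Assumption: the logs contain a pytest-style summary line such as
# "==== 120 passed, 3 skipped in 45.67s ====". Adjust for other test runners.
SUMMARY_RE = re.compile(r"(\d+) (passed|failed|skipped|errors?|xfailed|xpassed)")
TIME_RE = re.compile(r"in (\d+(?:\.\d+)?)s")


def parse_summary(log_text):
    """Return (counts, seconds) from the last summary-looking line of a log."""
    for line in reversed(log_text.splitlines()):
        time_match = TIME_RE.search(line)
        if time_match and SUMMARY_RE.search(line):
            counts = {kind: int(n) for n, kind in SUMMARY_RE.findall(line)}
            return counts, float(time_match.group(1))
    raise ValueError("no pytest-style summary line found")


def compare_logs(pr_log, main_log, time_tolerance=0.25):
    """Return (same_counts, similar_time) for a PR log against a baseline log.

    time_tolerance is a hypothetical threshold: the PR run counts as "similar"
    if its duration is within that fraction of the baseline duration.
    """
    pr_counts, pr_seconds = parse_summary(pr_log)
    main_counts, main_seconds = parse_summary(main_log)
    same_counts = pr_counts == main_counts
    similar_time = abs(pr_seconds - main_seconds) <= time_tolerance * main_seconds
    return same_counts, similar_time
```

Feeding it the raw text of the two job logs would give a quick yes/no on test counts and runtime; anything beyond that (for example, which individual tests changed) would still need a real diff.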

.github/workflows/pkgci_test_amd_mi300.yml (outdated, resolved)
@ScottTodd ScottTodd added infrastructure Relating to build systems, CI, or testing hal/hip Runtime HIP HAL backend labels Jan 9, 2025
@saienduri saienduri self-requested a review January 12, 2025 04:15
Collaborator

@saienduri saienduri left a comment

LGTM, just DCO and then good to merge.

@yamiyysu yamiyysu force-pushed the users/yamiyysu/mi300-runner-label branch from 2a49e0c to 3ea1bdf Compare January 12, 2025 22:53
@yamiyysu yamiyysu merged commit 88d5f59 into main Jan 13, 2025
39 checks passed
@yamiyysu yamiyysu deleted the users/yamiyysu/mi300-runner-label branch January 13, 2025 16:32
runs-on: linux-mi300-gpu-1
container:
  image: rocm/dev-ubuntu-22.04:6.3
  options: --user root --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined
Member

This had a failure after merge:

Error response from daemon: error gathering device information while adding custom device "/dev/kfd": no such file or directory
  Error: failed to start containers: 4bec79a3d99b9a5f5195ceb3ae3bf4ebb508b078b46d1de1d9b0366f0be18241

https://github.com/iree-org/iree/actions/runs/12751728442/job/35540191551

We can retry the job if you think that's a flake. Otherwise, want to revert for now?

Member

A retry succeeded 🤔
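The thread does not establish why `/dev/kfd` was missing on the first attempt. If it was a transient condition on the runner host, a pre-flight check along these lines could tell a flake apart from a misconfigured machine. This is a minimal sketch, not something from the PR: `missing_devices` and `wait_for_devices` are hypothetical helpers, the retry count and delay are arbitrary, and the script would have to run on the host before the container starts, since the container itself fails to launch without the device nodes.

```python
import os
import time

# Device nodes taken from the container `options` in this PR.
REQUIRED_DEVICES = ("/dev/kfd", "/dev/dri")


def missing_devices(paths=REQUIRED_DEVICES):
    """Return the subset of paths that do not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]


def wait_for_devices(paths=REQUIRED_DEVICES, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Re-check up to `attempts` times; return whatever is still missing.

    An empty list means every device node was present. `sleep` is injectable
    so the function can be exercised without real delays.
    """
    missing = missing_devices(paths)
    for _ in range(attempts - 1):
        if not missing:
            break
        sleep(delay_s)
        missing = missing_devices(paths)
    return missing


if __name__ == "__main__":
    still_missing = wait_for_devices()
    if still_missing:
        raise SystemExit(f"missing device nodes: {still_missing}")
```

If the nodes appear after a short wait, the failure was likely a flake; if they never appear, the runner itself probably needs attention.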
