Add support for OFED #254

Merged
merged 28 commits into from
Apr 24, 2024

sjpb (Collaborator) commented Mar 8, 2023

Adds an ofed role, which is not enabled by default. It is intended to be used only during image build, for speed. Note that package updates should be done before the role runs, and updates should not be run once OFED has been installed.

NB: CI uses the OFED image for the default RL9. If the optional RL8 CI is run, it uses a non-OFED image (as this is what will be deployed for Arcus).

TODO before merge:

  • Add ofed to CI build, probably producing both ofed and non-ofed images
  • Fix image name extraction from manifest now that there are two
  • Full GUI checks
  • Check it's idempotent if run on an image with OFED already in it

Image size differences

$ qemu-img info openhpc-RL8-240404-1149-db810918
virtual size: 12 GiB (12884901888 bytes)
disk size: 5.84 GiB

$ qemu-img info openhpc-ofed-RL8-240404-1149-db810918
virtual size: 15 GiB (16106127360 bytes)
disk size: 7 GiB
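
To put numbers on the difference, the two `qemu-img info` outputs above can be compared with a short script. This is only a sketch: the `parse_sizes` helper and its regular expressions are illustrative assumptions, and they handle only the two fields shown above, with the disk size given in GiB.

```python
import re

# Output copied from the two qemu-img info runs above.
NON_OFED = """\
virtual size: 12 GiB (12884901888 bytes)
disk size: 5.84 GiB
"""
OFED = """\
virtual size: 15 GiB (16106127360 bytes)
disk size: 7 GiB
"""


def parse_sizes(text):
    """Return (virtual_bytes, disk_gib) parsed from qemu-img info text.

    Hypothetical helper: it only understands the two lines shown above.
    """
    virtual = re.search(r"virtual size: .*\((\d+) bytes\)", text)
    disk = re.search(r"disk size: ([\d.]+) GiB", text)
    if virtual is None or disk is None:
        raise ValueError("unexpected qemu-img info output")
    return int(virtual.group(1)), float(disk.group(1))


base_virtual, base_disk = parse_sizes(NON_OFED)
ofed_virtual, ofed_disk = parse_sizes(OFED)

# Virtual size is reported in bytes, so convert the delta to GiB.
virtual_delta_gib = (ofed_virtual - base_virtual) / 2**30
disk_delta_gib = ofed_disk - base_disk

print(f"virtual size: +{virtual_delta_gib:.0f} GiB")
print(f"disk size: +{disk_delta_gib:.2f} GiB ({disk_delta_gib / base_disk:.0%} larger)")
```

For the RL8 images above, this works out to a 3 GiB larger virtual size and about 1.16 GiB more disk used by the OFED image.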

sjpb changed the title from "Add support for OFED" to "WIP: Add support for OFED" Mar 14, 2024

sjpb commented Mar 20, 2024

Testing build here: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/8358239578

Edit: Cancelled, didn't run OFED build

sjpb commented Mar 20, 2024

Testing build here: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/8358708640.

The ofed build failed because of extravars. The other build timed out during server deletion.

sjpb commented Apr 9, 2024

NB: images in this branch are built in e9fe323.

GUI tests at 5b64a7c

  • OOD shell: OK
  • OOD desktop: OK
  • OOD jupyter: OK
  • Monitoring: OK

sjpb marked this pull request as ready for review April 9, 2024 14:34
sjpb requested a review from a team as a code owner April 9, 2024 14:34

sjpb commented Apr 9, 2024

Checked the ofed role is idempotent by:

  • adding the following to environments/.stackhpc/inventory/extra_groups:

        ...
        [ofed:children]
        compute

  • running ansible-playbook ansible/bootstrap.yml

No changes reported.
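
A check like this could be automated by scanning the playbook output for hosts that report changes. The sketch below assumes uncoloured `ansible-playbook` output ending in the standard `PLAY RECAP` section; the `changed_hosts` helper and the sample recap are hypothetical, not taken from the actual run.

```python
import re

# One recap line per host, e.g. "host : ok=5 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def changed_hosts(playbook_output):
    """Return {host: changed_count} for hosts whose recap shows changed > 0."""
    result = {}
    in_recap = False
    for line in playbook_output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue
        # Turn "ok=12 changed=0 ..." into {"ok": "12", "changed": "0", ...}.
        counters = dict(item.split("=") for item in match.group("counters").split())
        changed = int(counters.get("changed", 0))
        if changed:
            result[match.group("host")] = changed
    return result


# Hypothetical recap for illustration only.
SAMPLE = """\
PLAY RECAP *********************************************************************
rl9-compute-0              : ok=12   changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
rl9-compute-1              : ok=12   changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
"""

print(changed_hosts(SAMPLE))  # prints {}
```

An empty result corresponds to "No changes reported"; any host in the returned dict would indicate the role is not idempotent on that host.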

sjpb commented Apr 12, 2024

Tested on the Arcus ilab-60 network using vm.ska.cpu.general.small. The instances were on different hypervisors:

$ openstack server show rl9-compute-0 -c host_id -f value
56d8c2aa7f7ac6bee454fde46d4072b5cd2876ed9ed4ffce170063a4
$ openstack server show rl9-compute-1 -c host_id -f value
2850a0dc4e262d0fb20382f5fb7ff14bdef9be4f4b2f4a94624c507b

Non-OFED: image openhpc-RL9-240327-1026-4812f852

1: 1.61 us 45.60496 Gbit/s
2: 1.66 us 45.59008 Gbit/s
3: 1.62 us 45.64624 Gbit/s
4: 1.62 us 45.59952 Gbit/s
5: 1.67 us 45.60176 Gbit/s

example plot: [image]

OFED: image openhpc-ofed-RL9-240404-1503-e9fe3235

1: 1.62 us 45.59592 Gbit/s
2: 1.62 us 45.59344 Gbit/s
3: 1.62 us 45.63064 Gbit/s
4: 1.63 us 45.62312 Gbit/s
5: 1.65 us 45.63504 Gbit/s

The OFED image has:

[rocky@rl9-compute-0 ~]$ ofed_info -s
MLNX_OFED_LINUX-24.01-0.3.3.1:
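
The two sets of runs can be summarised by averaging each column. This is a rough sketch that assumes the first figure on each result line is latency (us) and the second is bandwidth (Gbit/s); the `summarise` helper is illustrative, and five runs per image is too few to draw statistical conclusions.

```python
from statistics import mean

# (latency in us, bandwidth in Gbit/s) for runs 1-5, copied from the results above.
NON_OFED = [(1.61, 45.60496), (1.66, 45.59008), (1.62, 45.64624),
            (1.62, 45.59952), (1.67, 45.60176)]
OFED = [(1.62, 45.59592), (1.62, 45.59344), (1.62, 45.63064),
        (1.63, 45.62312), (1.65, 45.63504)]


def summarise(runs):
    """Return (mean latency, mean bandwidth) for a list of runs."""
    latencies, bandwidths = zip(*runs)
    return mean(latencies), mean(bandwidths)


base_lat, base_bw = summarise(NON_OFED)
ofed_lat, ofed_bw = summarise(OFED)

print(f"non-OFED: {base_lat:.3f} us, {base_bw:.3f} Gbit/s")
print(f"OFED:     {ofed_lat:.3f} us, {ofed_bw:.3f} Gbit/s")
print(f"latency change:   {(ofed_lat - base_lat) / base_lat:+.2%}")
print(f"bandwidth change: {(ofed_bw - base_bw) / base_bw:+.3%}")
```

On these figures the means differ by under 1% in latency and under 0.1% in bandwidth, so the two images perform essentially the same in this test.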

sjpb commented Apr 23, 2024

Test CI cancelled, image build RL8/RL9: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/8798490505

sjpb commented Apr 23, 2024

RL8 (non-ofed) and RL9 (ofed) CI PASSING: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/8800566275

sjpb changed the title from "WIP: Add support for OFED" to "Add support for OFED" Apr 24, 2024
sjpb merged commit 7c6f48d into main Apr 24, 2024
3 of 4 checks passed
sjpb deleted the ofed branch April 24, 2024 10:26