feat: add call for machines #360

Merged
merged 1 commit into from
Feb 12, 2025
78 changes: 78 additions & 0 deletions docs/sponsor/call-for-machines.md
@@ -0,0 +1,78 @@
# Call for Machines

## Expected Usage

- Periodic report: run the training and validation workflow once per week. Each run can take up to 1 hour, depending on the machine's processing speed.
The frequency and duration of runs may change; any change will be announced in advance and applied only with approval.

- Release execution: run the training, validation, and export workflow on each Kepler release.

## Type of Machines

As several research works have noted, power consumption behavior varies with several factors.
Using the right power model to predict power consumption on a machine that has no power meter is critical to the precision of the reported values.

We are seeking machines with characteristics commonly found in cloud computing environments, including but not limited to the following:

### Processor
- Intel: Cascade Lake, Skylake (or newer)
- Ampere: Altra, AmpereOne (or newer)
- AMD: Opteron, EPYC (or newer)

### Accelerator Chips

- NVML-supported GPU

|Architecture|GPU chips|Supported GPUs|
|---|---|---|
|Kepler|GK104, GK106, GK107|GeForce GTX 600 series (e.g., GTX 660, GTX 670), GTX 700 series|
|Maxwell|GM204, GM206|GeForce GTX 900 series (e.g., GTX 970, GTX 980)|
|Pascal|GP100, GP102|GeForce GTX 10 series (e.g., GTX 1080, GTX 1070), Titan X|
|Volta|GV100|Tesla V100, Titan V|
|Turing|TU102, TU104|GeForce RTX 20 series (e.g., RTX 2080 Ti, RTX 2070)|
|Ampere|GA100|A100, A40, RTX 30 series (e.g., RTX 3090)|
|Ada Lovelace|AD102|RTX 40 series (e.g., RTX 4090)|
|Hopper|GH100|H100|


- DCGM-supported GPU

|Architecture|Supported GPUs|
|---|---|
|Kepler|Tesla K80 and newer|
|Maxwell|Tesla M10, Tesla M60|
|Pascal|Tesla P100, Tesla P40, Tesla P4|
|Volta|Tesla V100|
|Turing|T4, Quadro RTX 4000, RTX A4000|
|Ampere|A100, A40, A30, A10, A2, RTX A6000, RTX A5000|
|Hopper|H100, H200|

- HLML-supported Intel Gaudi

> Please note that power source modules for other accelerator cards, such as AMD GPUs, Google TPUs, and IBM AIUs, are not yet supported by Kepler.
> Contributions to support these devices are always welcome within the Kepler community.
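
For the NVML-supported GPUs listed above, one way to check whether a candidate machine actually reports a power reading is to query `nvidia-smi`, which is built on NVML. The following is a minimal sketch, assuming the NVIDIA driver and `nvidia-smi` are installed; the function names `parse_gpu_power` and `query_gpu_power` are hypothetical helpers for illustration and are not part of Kepler.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple


def parse_gpu_power(csv_text: str) -> List[Tuple[str, Optional[float]]]:
    """Parse `name, power.draw` CSV rows into (name, watts) pairs.

    watts is None when the driver returns no numeric reading
    (for example "[N/A]").
    """
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so the power field is always isolated.
        name, _, power = line.rpartition(",")
        try:
            watts = float(power.strip())
        except ValueError:
            watts = None
        gpus.append((name.strip(), watts))
    return gpus


def query_gpu_power() -> List[Tuple[str, Optional[float]]]:
    """Run nvidia-smi and return per-GPU power draw; empty list if absent."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_power(out)


if __name__ == "__main__":
    for name, watts in query_gpu_power():
        status = f"{watts:.1f} W" if watts is not None else "no power reading"
        print(f"{name}: {status}")
```

A GPU that shows "no power reading" here is unlikely to be useful as a power source for model training, even if its architecture appears in the tables.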

### Power meters
- rapl
- powerclamp
- amd-pstate
- acpi
- hmc
- redfish
- nvml (nvidia)
- hlml (habana)
- dcgm
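
As a rough illustration of how a sponsor might confirm that a machine exposes the `rapl` meter, the sketch below lists the RAPL zones published through the Linux powercap sysfs interface. It assumes a Linux host where zones appear as `intel-rapl:*` directories under `/sys/class/powercap`; the helper name `list_rapl_zones` is hypothetical, and this is not how Kepler itself discovers power sources.

```python
from pathlib import Path
from typing import Dict

# Assumed location of the powercap interface on Linux.
POWERCAP_ROOT = Path("/sys/class/powercap")


def list_rapl_zones(root: Path = POWERCAP_ROOT) -> Dict[str, str]:
    """Map each RAPL zone directory (e.g. intel-rapl:0) to its domain name."""
    zones: Dict[str, str] = {}
    if not root.is_dir():
        return zones
    for zone in sorted(root.glob("intel-rapl:*")):
        name_file = zone / "name"
        if name_file.is_file():
            # The name file holds the domain, e.g. "package-0" or "dram".
            zones[zone.name] = name_file.read_text().strip()
    return zones


if __name__ == "__main__":
    found = list_rapl_zones()
    if not found:
        print("No RAPL zones found")
    for zone, domain in found.items():
        print(f"{zone}: {domain}")
```

The domain names printed here (package, core, dram) correspond to the kind of detail a sponsor would report alongside the meter name.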

### Available machines on CI

The following bare-metal machines are currently available for the power model training and validation CI pipeline in [kepler-metal-ci](https://github.com/sustainable-computing-io/kepler-metal-ci/tree/main).

|processor|cores|chips|memory (GiB)|power meters|source|sponsor|key contact|
|---|---|---|---|---|---|---|---|
|intel_xeon_e_2278g| 16 | 1 | - | RAPL (package, core, dram)|equinix c3.small.x86|CNCF|@rootfs|
|intel_xeon_platinum_8259cl| 96 | 2 | 377| RAPL (package, dram)|ec2 m5.metal|RedHat|@rootfs|
|intel_xeon_e5_2686v4| 72 | 2 | 503| RAPL (package, dram)|ec2 i3.metal|RedHat|@rootfs|
|intel_xeon_platinum_8275cl| 96 | 2 | 188| RAPL (package, dram)|ec2 c5.metal|RedHat|@rootfs|
|intel_xeon_platinum_8259cl| 96 | 2 | 755| RAPL (package, dram)|ec2 r5.metal|RedHat|@rootfs|
|intel_xeon_platinum_8252c| 48 | 2 | 188 | RAPL (package, dram)|ec2 m5zn.metal|RedHat|@rootfs|
|intel_xeon_platinum_8488c| 96 | 1 | 377 | RAPL (package, dram)|ec2 m7i.metal-24xl|RedHat|@rootfs|
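
A prospective sponsor could gather the processor, cores, chips, and memory fields for a new row with something like the sketch below. It is only an illustration under stated assumptions: an x86 Linux host whose `/proc/cpuinfo` contains `model name` and `physical id` lines, "cores" counted as logical CPUs, and memory taken from `MemTotal` and rounded down to GiB. The helper name `describe_machine` is hypothetical.

```python
import re
from typing import Any, Dict


def describe_machine(cpuinfo: str, meminfo: str) -> Dict[str, Any]:
    """Summarise processor, cores, chips, and memory from /proc text."""
    models = re.findall(r"^model name\s*:\s*(.+)$", cpuinfo, flags=re.M)
    sockets = set(re.findall(r"^physical id\s*:\s*(\d+)$", cpuinfo, flags=re.M))
    mem_kb = re.search(r"^MemTotal:\s*(\d+) kB", meminfo, flags=re.M)
    return {
        "processor": models[0].strip() if models else "unknown",
        # Assumption: one "model name" entry per logical CPU.
        "cores": len(models),
        # Fall back to 1 when no "physical id" lines are present.
        "chips": len(sockets) or 1,
        # MemTotal is reported in kB; convert to whole GiB.
        "memory_gib": int(mem_kb.group(1)) // (1024 * 1024) if mem_kb else None,
    }


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        cpu_text = f.read()
    with open("/proc/meminfo") as f:
        mem_text = f.read()
    print(describe_machine(cpu_text, mem_text))
```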