feat: move to federated cluster #56

Merged on Aug 5, 2024 (29 commits)
Commits
8888527
feat: update to new IP of federated cluster
IgnacioHeredia Nov 23, 2023
712b7cb
feat: endpoints are now node-dependent
IgnacioHeredia Nov 23, 2023
61bb92e
feat: deploy only on nodes serving that namespace
IgnacioHeredia Nov 23, 2023
18e37af
feat: avoid deploying on same nodes as system jobs
IgnacioHeredia Nov 23, 2023
d1f56e1
feat: properly configure disk limits
IgnacioHeredia Nov 24, 2023
a7599ce
fix: disable endpoints if client disconnected
IgnacioHeredia Nov 24, 2023
8be583a
docs: improve comment
IgnacioHeredia Nov 24, 2023
08b3162
fix: set max RAM memory
IgnacioHeredia Nov 24, 2023
a7d9e8f
Merge branch 'master' into federated
IgnacioHeredia Feb 24, 2024
16b4183
feat: add job type (module/tool) to metadata (#43)
MartaOB Apr 5, 2024
2534b06
Merge branch 'master' into federated
IgnacioHeredia Apr 11, 2024
67bcdca
Merge branch 'master' into federated
IgnacioHeredia Apr 22, 2024
dae82d2
fix: available GPU models should be filtered by VO
IgnacioHeredia Apr 24, 2024
4cc053d
feat!: update naming in jobs and tasks
IgnacioHeredia Apr 29, 2024
13c1ddd
refactor: remove dangling `job_type`
IgnacioHeredia Apr 29, 2024
97742ec
Merge branch 'master' into federated
IgnacioHeredia May 15, 2024
08c8611
Merge branch 'master' into federated
IgnacioHeredia May 15, 2024
a676ca6
refactor: aggregate cluster gpu models at the end
IgnacioHeredia May 24, 2024
2fa1372
fix: fix job_num in stats
IgnacioHeredia May 24, 2024
55961d6
feat: filter node stats by VO
IgnacioHeredia May 24, 2024
a0f6a64
fix(stats): avoid overwriting global stats var
IgnacioHeredia Jun 19, 2024
bf26c68
Merge branch 'master' into federated
IgnacioHeredia Jun 20, 2024
654d65a
feat: add anti affinity for `ai4eosc` nodes
IgnacioHeredia Jun 20, 2024
eafa4fd
feat: increase RAM limit
IgnacioHeredia Jun 20, 2024
b0f3bb5
Merge branch 'master' into federated
IgnacioHeredia Jul 8, 2024
bb8ff7e
feat: enforce `node.meta.status=ready`
IgnacioHeredia Jul 9, 2024
0709f3b
feat(stats): add node status to stats
IgnacioHeredia Jul 10, 2024
c15cb2b
Merge branch 'master' into federated
IgnacioHeredia Jul 16, 2024
a95b957
fix(stats): do not aggregate node status
IgnacioHeredia Jul 17, 2024
fix: available GPU models should be filtered by VO
IgnacioHeredia committed Apr 24, 2024
commit dae82d2c29b5c740a9842f217f8e62fb53c6a5d3
11 changes: 9 additions & 2 deletions ai4papi/nomad/common.py
@@ -17,6 +17,7 @@
 from nomad.api import exceptions
 import requests

+import ai4papi.conf as papiconf
 import ai4papi.nomad.patches as nomad_patches


@@ -367,13 +368,19 @@ def delete_deployment(
     return {'status': 'success'}


-def get_gpu_models():
+@cached(cache=TTLCache(maxsize=1024, ttl=1*60*60))
+def get_gpu_models(vo):
     """
-    Retrieve available GPU models in the cluster.
+    Retrieve available GPU models in the cluster, filtering nodes by VO.
     """
     gpu_models = set()
     nodes = Nomad.nodes.get_nodes(resources=True)
     for node in nodes:
+        # Discard nodes that don't belong to the requested VO
+        meta = Nomad.node.get_node(node['ID'])['Meta']
+        if papiconf.MAIN_CONF['nomad']['namespaces'][vo] not in meta['namespace']:
+            continue
+
         # Discard GPU models of nodes that are not eligible
         if node['SchedulingEligibility'] != 'eligible':
             continue
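The two filters in this hunk can be tried without a live Nomad client. The following is a minimal sketch on plain dictionaries, not the project's implementation: the `filter_gpu_models` helper, the `gpu_models` key, and the namespace values are hypothetical, and the step that extracts GPU models from node resources is not shown in the diff. It also assumes `Meta['namespace']` is a string.

```python
def filter_gpu_models(nodes, namespace):
    """Collect GPU models from eligible nodes that serve `namespace`."""
    gpu_models = set()
    for node in nodes:
        # Discard nodes that don't serve the requested namespace
        # (assumption: the meta value is a string, so `in` is a substring test)
        if namespace not in node["Meta"].get("namespace", ""):
            continue

        # Discard nodes that are not eligible for scheduling
        if node["SchedulingEligibility"] != "eligible":
            continue

        # Hypothetical key standing in for the real resource lookup
        gpu_models.update(node["gpu_models"])
    return sorted(gpu_models)


# Mock nodes, loosely shaped like the fields used in the diff
nodes = [
    {"ID": "n1", "Meta": {"namespace": "ns-a"},
     "SchedulingEligibility": "eligible", "gpu_models": ["Tesla T4"]},
    {"ID": "n2", "Meta": {"namespace": "ns-b"},
     "SchedulingEligibility": "eligible", "gpu_models": ["Tesla V100"]},
    {"ID": "n3", "Meta": {"namespace": "ns-a"},
     "SchedulingEligibility": "ineligible", "gpu_models": ["A100"]},
    {"ID": "n4", "Meta": {"namespace": "ns-a,ns-b"},
     "SchedulingEligibility": "eligible", "gpu_models": ["Tesla T4", "RTX 3090"]},
]

models_a = filter_gpu_models(nodes, "ns-a")
models_b = filter_gpu_models(nodes, "ns-b")
```

If the meta value really is a plain string, the `in` check matches substrings, so one namespace name that is a prefix of another would also match; splitting on a separator first would avoid that.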
2 changes: 1 addition & 1 deletion ai4papi/routers/v1/catalog/modules.py
@@ -73,7 +73,7 @@ def get_config(
     )

     # Fill with available GPU models in the cluster
-    models = nomad.common.get_gpu_models()
+    models = nomad.common.get_gpu_models(vo)
     if models:
         conf["hardware"]["gpu_type"]["options"] += models
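Because `get_gpu_models` is now decorated with `@cached(cache=TTLCache(maxsize=1024, ttl=1*60*60))` and takes `vo` as its only argument, repeated calls with the same VO within an hour should reuse the stored result, which matters since the function issues one `get_node` request per node. The sketch below imitates that per-argument, time-limited caching using only the standard library. `ttl_cached` is a hypothetical stand-in, not the `cachetools` API, and the VO names and return values are made up.

```python
import time
from functools import wraps


def ttl_cached(ttl, clock=time.monotonic):
    """Cache results per positional-argument tuple for `ttl` seconds."""
    def decorator(func):
        store = {}  # maps args -> (expiry_time, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # entry still fresh: skip the real call
            value = func(*args)
            store[args] = (now + ttl, value)
            return value
        return wrapper
    return decorator


calls = []  # records every real (uncached) lookup


@ttl_cached(ttl=1 * 60 * 60)
def get_gpu_models(vo):
    # Stand-in for the cluster query; the real function talks to Nomad
    calls.append(vo)
    return [f"gpu-for-{vo}"]


first = get_gpu_models("vo.example-a")
second = get_gpu_models("vo.example-a")  # same VO: served from the cache
other = get_gpu_models("vo.example-b")   # different VO: separate cache entry
```

One consequence of this design is that a newly added or drained GPU node may not show up in the options for a given VO until the cached entry expires.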