Real-time Inference dynamic loading/unloading of models #211

johnml1135 · 2023-11-06T13:19:52Z

Assuming that we use clearml-serving real time inferencing, we may need to spin our own dynamic loading/unloading algorithm the reason is that the core Triton inference server from NVIDIA does not do it without buying the enterprise plan.

Dynamic model loading/unloading depending on requests? triton-inference-server/server#5345
automatic model load / unload or a lockable store extension triton-inference-server/server#3583
https://www.nvidia.com/en-us/ai-data-science/products/triton-management-service/
I made a request to ClearML to implement this in their slack channel, but am assuming it will not be done.

If we were to do this ourselves, we would need to make the explicit calls to the management API and implement a simple algorithm such as:

Assume:
- All models are of the same size when loaded
- The max number of instances of an individual model is 1
Config:
- Number of seconds to assess usage over (rule of thumb -> 5x model loading time?)
- Auto-unload model if not being used for x minutes (default 5?)
- Number of models that need to be unloaded before x minutes required to adding new auto-scaled instance (default 5?)
Algorithm:
- Load in the model with the largest number of elements in it's queue - and only pull in one at a time
- If not enough space, unload the model with the oldest "last inference" time if it is over n (60?) seconds ago
- Else, unload the model that has an empty queue and also has the least number of incoming requests over the past n (60?) seconds
- If the frequency of unloading models is greater than the threshold, add another auto-scaled instance
- If the loaded models can fit on fewer instances than are currently scaled, gracefully consolidate

ddaspit · 2023-11-06T18:12:51Z

We have something like this implemented for SMT models.

johnml1135 · 2023-11-06T18:14:50Z

ClearML may have it already implemented:

We support that, but this is Not dynamically loaded, this is just removing and adding models, this does not unload them from the GRAM.
That's the main issue. when we unload the model, it is unloaded, to do dynamic, they need to be able to save it in RAM and unload it from GRAM, that's the feature that is missing on all Triton deployments.
Does that make sense ?

johnml1135 · 2023-11-06T18:39:14Z

Just to confirm, yes, ClearML can do automatic loading/unloading, but each load/unload will take time: there is also deserializing CPU time (imaging unpickling 20GB file, this takes time... and actually this is the main bottle neck not just IO).

robosina · 2024-01-15T18:10:54Z

@johnml1135 @ddaspit, Hello, I have a question about this matter, will this loading/unloading happen automatically, or do we need to do something to enable it? Also, if you could provide me with some links/documentation, I would appreciate it.

johnml1135 · 2024-01-16T14:34:14Z

@robosina - this is currently a wish list item and a conceptual design, it has not been implemented into Serval. The core technology that would perform the loading/unloading would be https://github.com/allegroai/clearml-serving, which is a layer on top of https://www.nvidia.com/en-us/ai-data-science/products/triton-management-service/. I would review those products for dynamic loading/unloading.

johnml1135 added this to Serval Nov 6, 2023

github-project-automation bot moved this to 🆕 New in Serval Nov 6, 2023

johnml1135 added this to the 1.4 NMT Dynamic Suggestions milestone Nov 6, 2023

johnml1135 added the enhancement New feature or request label Nov 6, 2023

johnml1135 removed this from the Serval API 1.3 milestone Dec 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Real-time Inference dynamic loading/unloading of models #211

Real-time Inference dynamic loading/unloading of models #211

johnml1135 commented Nov 6, 2023

ddaspit commented Nov 6, 2023

johnml1135 commented Nov 6, 2023

johnml1135 commented Nov 6, 2023

robosina commented Jan 15, 2024 •

edited

Loading

johnml1135 commented Jan 16, 2024

Real-time Inference dynamic loading/unloading of models #211

Real-time Inference dynamic loading/unloading of models #211

Comments

johnml1135 commented Nov 6, 2023

ddaspit commented Nov 6, 2023

johnml1135 commented Nov 6, 2023

johnml1135 commented Nov 6, 2023

robosina commented Jan 15, 2024 • edited Loading

johnml1135 commented Jan 16, 2024

robosina commented Jan 15, 2024 •

edited

Loading