ModelMesh is a mature, general-purpose model serving management/routing layer designed for high-scale, high-density and frequently-changing model use cases. It works with existing or custom-built model servers and acts as a distributed LRU cache for serving runtime models.
For full Kubernetes-based deployment and management of ModelMesh clusters and models, see the ModelMesh Serving repo. This includes a separate controller and provides K8s custom resource based management of ServingRuntimes and InferenceServices along with common, abstracted handling of model repository storage and ready-to-use integrations with some existing OSS model servers.
For more information on supported features and design details, see these charts.
In ModelMesh, a model is an abstraction of a machine learning model; ModelMesh itself is not aware of the underlying model format. There are two model types: regular models and vmodels. Regular models in ModelMesh are assumed and required to be immutable. VModels add a mutable layer of indirection in front of the immutable regular models. See the VModels Reference for further reading.
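To make the indirection concrete, here is a purely conceptual sketch (not ModelMesh code or its API): a vmodel behaves like a mutable alias that always resolves to exactly one immutable model, so rolling out a new version means re-pointing the alias rather than changing a model in place.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual illustration only -- not ModelMesh code. Regular models are
// immutable; a vmodel is a mutable alias that resolves to one of them.
public class VModelSketch {
    private final Map<String, String> vmodelToModel = new ConcurrentHashMap<>();

    // Point (or re-point) a vmodel at an immutable model id.
    void setVModel(String vmodelId, String targetModelId) {
        vmodelToModel.put(vmodelId, targetModelId);
    }

    String resolve(String vmodelId) {
        return vmodelToModel.get(vmodelId);
    }

    public static void main(String[] args) {
        VModelSketch registry = new VModelSketch();
        registry.setVModel("sentiment", "sentiment-v1"); // clients address "sentiment"
        registry.setVModel("sentiment", "sentiment-v2"); // upgrade by re-pointing the alias
        System.out.println(registry.resolve("sentiment")); // -> sentiment-v2
    }
}
```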
- Wrap your model-loading and invocation logic in this model-runtime.proto gRPC service interface (a minimal server sketch follows this list):
  - `runtimeStatus()` - called only during startup to obtain some basic configuration parameters from the runtime, such as version, capacity, and model-loading timeout
  - `loadModel()` - load the specified model into memory from backing storage, returning when complete
  - `modelSize()` - determine the size (memory usage) of a previously-loaded model. If very fast, this can be omitted and provided instead in the response from `loadModel()`
  - `unloadModel()` - unload a previously loaded model, returning when complete
- Use a separate, arbitrary gRPC service interface for model inferencing requests. It can have any number of methods, and they are assumed to be idempotent. See predictor.proto for a very simple example.
- The methods of your custom applier interface will be called only for already fully-loaded models.
- Build a gRPC server Docker container which exposes these interfaces on localhost port 8085 or via a mounted Unix domain socket.
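As a starting point for the runtime side, here is a minimal Java sketch assuming stub classes generated from model-runtime.proto (the `ModelRuntime` service) are on the classpath. The generated package, message, and field names depend on your proto build settings, so every identifier below should be treated as illustrative rather than authoritative.

```java
import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.stub.StreamObserver;

// Classes such as ModelRuntimeGrpc and LoadModelRequest are assumed to be
// generated from model-runtime.proto; adjust names to match your build.
public class MyModelRuntime extends ModelRuntimeGrpc.ModelRuntimeImplBase {

    @Override
    public void runtimeStatus(RuntimeStatusRequest request,
                              StreamObserver<RuntimeStatusResponse> observer) {
        // Report readiness plus basic capacity/config parameters at startup.
        observer.onNext(RuntimeStatusResponse.newBuilder()
                .setStatus(RuntimeStatusResponse.Status.READY)
                .setCapacityInBytes(8L * 1024 * 1024 * 1024) // illustrative values
                .setModelLoadingTimeoutMs(90_000)
                .build());
        observer.onCompleted();
    }

    @Override
    public void loadModel(LoadModelRequest request,
                          StreamObserver<LoadModelResponse> observer) {
        // Load the model from backing storage into memory, returning
        // only once it is fully ready to serve.
        long sizeInBytes = doLoad(request.getModelId());
        observer.onNext(LoadModelResponse.newBuilder()
                .setSizeInBytes(sizeInBytes) // allows modelSize() to be skipped
                .build());
        observer.onCompleted();
    }

    @Override
    public void unloadModel(UnloadModelRequest request,
                            StreamObserver<UnloadModelResponse> observer) {
        doUnload(request.getModelId());
        observer.onNext(UnloadModelResponse.getDefaultInstance());
        observer.onCompleted();
    }

    private long doLoad(String modelId) { /* your loading logic */ return 1L; }
    private void doUnload(String modelId) { /* your unloading logic */ }

    public static void main(String[] args) throws Exception {
        Server server = ServerBuilder.forPort(8085) // localhost port expected by ModelMesh
                .addService(new MyModelRuntime())
                // .addService(yourInferencingService) // the separate, arbitrary inference interface
                .build()
                .start();
        server.awaitTermination();
    }
}
```

Note that `loadModel()` should not return until the model is fully ready to serve, since the methods of your inferencing interface are only ever invoked for fully-loaded models.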
- Extend the Kustomize-based Kubernetes manifests to use your Docker image and to set appropriate memory and CPU resource allocations for your container.
- Deploy to a Kubernetes cluster as a regular Service, which will expose the ModelMesh gRPC service interface via kube-dns. You do not implement this interface yourself; consume it from your upstream service components using the gRPC client of your choice (a client sketch follows this list). It comprises:
  - `registerModel()` and `unregisterModel()` for registering/removing models managed by the cluster
  - Any custom inferencing interface methods to make a runtime invocation of a previously-registered model, making sure to set an `mm-model-id` or `mm-vmodel-id` metadata header (or `-bin` suffix equivalents for UTF-8 ids)
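On the consuming side, a hedged Java client sketch is below. It assumes stubs generated from model-mesh.proto (the management interface) and from the example predictor.proto; the target address, port, and message/field names are illustrative assumptions, while the `mm-model-id` metadata key is the documented routing header.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Metadata;
import io.grpc.stub.MetadataUtils;

public class ModelMeshClientSketch {
    public static void main(String[] args) {
        // kube-dns name/port of your deployed ModelMesh Service (illustrative).
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("dns:///modelmesh-service:8033")
                .usePlaintext()
                .build();

        // 1) Register a model via the management interface (stub assumed to be
        //    generated from model-mesh.proto; field names are assumptions).
        ModelMeshGrpc.ModelMeshBlockingStub mgmt = ModelMeshGrpc.newBlockingStub(channel);
        mgmt.registerModel(RegisterModelRequest.newBuilder()
                .setModelId("example-model")
                .build());

        // 2) Invoke the model via your custom inferencing interface, routing
        //    the request with the mm-model-id metadata header.
        Metadata headers = new Metadata();
        headers.put(Metadata.Key.of("mm-model-id", Metadata.ASCII_STRING_MARSHALLER),
                "example-model");
        PredictorGrpc.PredictorBlockingStub predictor =
                PredictorGrpc.newBlockingStub(channel)
                        .withInterceptors(MetadataUtils.newAttachHeadersInterceptor(headers));
        PredictResponse response = predictor.predict(
                PredictRequest.newBuilder().setText("some input").build()); // fields assumed
        System.out.println(response);
    }
}
```

For model ids that are not valid ASCII metadata values, use the binary key variant noted above (`mm-model-id-bin` with `Metadata.BINARY_BYTE_MARSHALLER`).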
Please see the Developer Guide for details.