Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup model registry serving reconciliation #124

Conversation

lampajr
Copy link
Contributor

@lampajr lampajr commented Nov 27, 2023

Fixes https://github.com/opendatahub-io/model-registry/issues/174
Fixes https://github.com/opendatahub-io/model-registry/issues/175

Description

Implement the Model Registry <--> Model Serving interaction described in issue opendatahub-io/model-registry#104.

Setup a new Reconciler that is triggered whenever a ServingRuntime is create/updated/deleted and in a periodic fashion (i.e., every N seconds where N could be specified as startup flag, default 60 seconds).

The summarized reconciliation workflow is the following:

  • For each <Namespace>/<ServingRuntime> do:
    • Fetch all Inference services stored in the model registry and associated to the current NS (modeled as ServingEnvironment) and having as Runtime the current ServingRuntime name.
    • For each model registry <InferenceService> do:
      • Check if an InferenceService CR already exists for the current IS (use a custom label to keep track of these associations)
      • Check the model registry IS state
      • Then:
        • if STATE==DEPLOYED and not existing in cluster --> Create new ISVC
        • if STATE==UNDEPLOYED and not existing in cluster --> Do nothing, skip as there is no intention to deploy and there is no deployment in the cluster for that IS
        • if STATE==DEPLOYED and existing in cluster --> Check if there is any delta, if so apply updated CR
        • if STATE==UNDEPLOYED and existing in cluster --> This is an intention to undeploy the model, therefore delete the ISVC

ODH model controller keep track of those operation in the model registry itself, specifically as ServeModel recods (one per deployment).
These records can have different states, for now I used the following:

  • RUNNING: deployed model (associated to the ISVC) is running in the namespace
  • FAILED: Something went wrong creating/updating/deleting the ISVC - more details can be found in the CR itself
  • UNKNOWN: Temporary and intermediate steps when the actual state of the operation is not known yet
  • COMPLETE: The ISVC has been correctly removed, the deployment is removed - completed succesfully and not running anymore
  • NEW: Not used for now

How Has This Been Tested?

Setup cluster

  1. Install ODH 2.4.0
  2. Install ODH component using the following DataScienceCluster CRD
kind: DataScienceCluster
apiVersion: datasciencecluster.opendatahub.io/v1
metadata:
  name: default
  labels:
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/instance: default
    app.kubernetes.io/part-of: opendatahub-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: opendatahub-operator
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    kserve:
      managementState: Managed
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/odh
            uri: https://github.com/opendatahub-io/kserve/tarball/master
          - contextDir: config
            uri: https://github.com/lampajr/odh-model-controller/tarball/mr_serving_reconciler_test
    modelmeshserving:
      managementState: Managed
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/odh
            uri: https://github.com/opendatahub-io/modelmesh-serving/tarball/main
          - contextDir: config
            uri: https://github.com/lampajr/odh-model-controller/tarball/mr_serving_reconciler_test
    ray:
      managementState: Removed
    workbenches:
      managementState: Managed
    trustyai:
      managementState: Removed

NOTE: mr_serving_reconciler_test contains manifest changes to make use of the correct odh-model-controller image: alampare/odh-model-controller

By default this new model-registry for model-serving reconciler/controller is disabled in odh-model-controller, in order to properly enable it you should just start the controller with --model-registry-enabled flag.
https://github.com/lampajr/odh-model-controller/tarball/mr_serving_reconciler_test already contains this change in the configuration.

  1. Install model registry operator
  2. Clone model-registry-operator repository: git clone https://github.com/opendatahub-io/model-registry-operator.git
  3. Install CRDs: make KUBECTL=oc install
  4. Deploy the operator: make KUBECTL=oc deploy

Setup DS project

  1. Create a Data Science project using the ODH dashboard, e.g., demo-model-registry-20231211
  2. Setup a data connection (you could install a local minio instance in the newly create DS project)
  3. Create a model server using the ODH dashboard
  4. Apply the sample model registry CR from the model-registry-operator folder: kustomize build config/samples | oc apply -f - (make sure oc is pointing to the data science project create in step 1.)
  5. Create a new model server using the ODH dashboard, in my case named model-server

Upload models

For the sake of simplicity let's upload some onnx files in the local minio instance (bucket name is models):

  • models/mnist/v12/mnist-12.onnx
  • models/mnist/v8/mnist-8.onnx

Fill up the model registry

NOTE: IDs could change depending on the order of operations and on the number of models you are going to register.

  1. Set the model registry proxy url, you might have to setup a custom manual route from port :8080 of the model registry proxy
MR_HOSTNAME="<endpoint>"
  1. Register an empty model, e.g, MNIST
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/registered_models" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "MNIST model for recognizing handwritten digits",
  "name": "MNIST"
}'
  1. Add a new model version
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "MNIST model version n. 8",
  "name": "v8",
  "registeredModelID": "2"
}'
  1. Add the model artifact to the created model version
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions/3/artifacts" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "Model artifact for MNIST v8",
  "uri": "s3://models/mnist/v8/mnist-8.onnx",
  "state": "UNKNOWN",
  "name": "v8-model",
  "modelFormatName": "onnx",
  "modelFormatVersion": "1",
  "storageKey": "aws-connection-models",
  "storagePath": "mnist/v8",
  "artifactType": "model-artifact"
}'
  1. [Optional] If for some reason you need to update the artifact you can run:
curl --silent -X 'PATCH' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_artifacts/1" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '  {
    "description": "Model artifact for MNIST v8",
    "modelFormatName": "onnx",
    "modelFormatVersion": "1",
    "state": "UNKNOWN",
    "storageKey": "aws-connection-models",
    "storagePath": "mnist/v8",
    "uri": "s3://models/mnist/v8/mnist-8.onnx"
  }
'

NOTE: modelFormatName, modelFormatVersion, storageKey and storagePath are all those information needed to create the InferenceService so they should be valid values. E.g., the storageKey must match the existing data connection name (which is aws-connection-{{connection name}}).

You can use the following curls to inspect model registry content:

# Get registered models
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/registered_models?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

# Get model versions for a specific registered model
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

# Get model artifacts for a specific model version
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions/3/artifacts?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

# Get all inference services stored in model registry
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

Test model controller workflow

As soon as you create a model server in the DS project, the odh-model-controller will create the ServingEnvironment in the model registry having the name == namespace (i.e., DS project), you can inspect its ID:

curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/serving_environments?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'
  1. Set the intention to deploy a model, i.e., create the inference service in the model registry:
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/serving_environments/1/inference_services" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "mnist-deployment",
  "runtime": "model-server",
  "registeredModelId": "2",
  "servingEnvironmentId": "1",
  "state": "DEPLOYED"
}'

The runtime must match the name of the model server create in the ODH platform for the current DS project.

In this specific example we are going to store in the model registry the intention to deploy a model into model-server Openvino server:

  • servingEnvironmentId identifies the namespace where the deployment should occur, in this case id 1 which identifies DS project demo-model-registry-20231211
  • registeredModelId identifies the registered model we would like to deploy
  • modelVersionId identifies the specific version to deploy, if omitted (as in this case) let's deploy the latest version for that registered model

NOTE: the IDs must match, therefore ensure you are providing the corrects IDs.

At this point, the odh-model-controller will periodically inspect all existing ServingRuntime CRs (identifying configured model servers) and for each of them it will retrieve all Inference Services stored in model registry (using ServingEnv and Runtime as search criteria), then for each IS in model registry will do:

  • if STATE==DEPLOYED and not existing in cluster --> Create new ISVC
  • if STATE==UNDEPLOYED and not existing in cluster --> Do nothing, skip as there is no intention to deploy and there is no deployment in the cluster for that IS
  • if STATE==DEPLOYED and existing in cluster --> Check if there is any delta, if so apply updated CR
  • if STATE==UNDEPLOYED and existing in cluster --> This is an intention to undeploy the model, therefore delete the ISVC
❯ curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'
[
  {
    "createTimeSinceEpoch": "1702298833638",
    "customProperties": {},
    "id": "4",
    "lastUpdateTimeSinceEpoch": "1702298833638",
    "name": "mnist-deployment",
    "registeredModelId": "2",
    "runtime": "model-server",
    "servingEnvironmentId": "1",
    "state": "DEPLOYED"
  }
]

mnist-v8-deployment

  1. Tag a new version

Let's try to tag a new version and see if the reconciliation will automatically update the CR, triggering the atuomated deployment of the latest version (the new one)

curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "MNIST model version n. 12",
  "name": "v12",
  "registeredModelID": "2"
}'
curl --silent -X 'POST' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/model_versions/5/artifacts" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "description": "Model artifact for MNIST v12",
  "uri": "s3://models/mnist/v12/mnist-12.onnx",
  "state": "UNKNOWN",
  "name": "v12-model",
  "modelFormatName": "onnx",
  "modelFormatVersion": "1",
  "storageKey": "aws-connection-models",
  "storagePath": "mnist/v12",
  "artifactType": "model-artifact"
}'

The next reconciliation loop should find that there is a Delta since the latest version is different, therefore the desired CR does not match the existing one. Once th eCR is updated, the model server should automatically react to deploy the new model:

1.702304193389716e+09 INFO OpenVINO Adapter.Openvino Model Server Adapter Server Both bucket and default_bucket params were provided in S3 storage config, ignoring default_bucket {"bucket": "models", "default_bucket": "models"}
1.7023041933897424e+09 DEBUG OpenVINO Adapter.Openvino Model Server Adapter Server found existing client in cache {"type": "s3", "cacheKey": "s3|0xc5ece145e9ad40f"}
1.7023041933943875e+09 DEBUG OpenVINO Adapter.Openvino Model Server Adapter Server found objects to download {"type": "s3", "cacheKey": "s3|0xc5ece145e9ad40f", "path": "mnist/v12", "count": 1}
1.7023041933945765e+09 DEBUG OpenVINO Adapter.Openvino Model Server Adapter Server downloading object {"type": "s3", "cacheKey": "s3|0xc5ece145e9ad40f", "path": "mnist/v12/mnist-12.onnx", "filename": "/models/mnist-deployment__isvc-8755919844/v12/mnist-12.onnx"}
1.702304193396579e+09 INFO OpenVINO Adapter.Openvino Model Server Adapter Server Calculated disk size {"modelFullPath": "/models/mnist-deployment__isvc-8755919844/v12", "disk_size": 26143}
  1. Set the intention to undeploy the model by setting the state to UNDEPLOYED:
curl --silent -X 'PATCH' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services/4" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "runtime": "model-server",
  "state": "UNDEPLOYED"
}'

Form the odh-model-controller logs:

2023-12-11T14:21:31Z	INFO	controllers.ModelRegistry	ISVC existing and IS state is UNDEPLOYED, removing ISVC: mnist-deployment	{"ServingRuntime": "model-server", "namespace": "demo-model-registry-20231211", "InferenceService": "mnist-deployment"}
2023-12-11T14:21:32Z	DEBUG	controllers.ModelRegistry	Delta found	{"ServingRuntime": "model-server", "namespace": "demo-model-registry-20231211", "InferenceService": "mnist-deployment", "delete": "mnist-deployment"}
2023-12-11T14:21:32Z	INFO	controllers.ModelRegistry	Deleted ISVC: mnist-deployment	{"ServingRuntime": "model-server", "namespace": "demo-model-registry-20231211", "InferenceService": "mnist-deployment"}

As a result the model deployment should disappear from the ODH dashboard too.

mnist-v12-undeployment

  1. Deploy a specific version rather than the latest one, i.e., v8
curl --silent -X 'PATCH' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services/4" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "runtime": "model-server",
  "modelVersionId": "3",
  "state": "DEPLOYED"
}'

Here some logs from ovms-adapter:

1.7023048054436572e+09 DEBUG OpenVINO Adapter.Openvino Model Server Adapter Server creating new repository client {"type": "s3", "cacheKey": "s3|0xc5ece145e9ad40f"}
1.7023048054478123e+09 DEBUG OpenVINO Adapter.Openvino Model Server Adapter Server found objects to download {"type": "s3", "cacheKey": "s3|0xc5ece145e9ad40f", "path": "mnist/v8", "count": 1}
1.7023048054479847e+09 DEBUG OpenVINO Adapter.Openvino Model Server Adapter Server downloading object {"type": "s3", "cacheKey": "s3|0xc5ece145e9ad40f", "path": "mnist/v8/mnist-8.onnx", "filename": "/models/mnist-deployment__isvc-36ecdf0658/v8/mnist-8.onnx"}
  1. The state and operations performed by odh-model-controller are audited in the model registry as ServeModel records (associated to a specific model registry inference service) that you can inspect as follows:
curl --silent -X 'GET' \
  "$MR_HOSTNAME/api/model_registry/v1alpha1/inference_services/4/serves?pageSize=100&orderBy=ID&sortOrder=DESC&nextPageToken=" \
  -H 'accept: application/json' | jq '.items'

[
  {
    "createTimeSinceEpoch": "1702304792636",
    "customProperties": {},
    "id": "2",
    "lastKnownState": "RUNNING",
    "lastUpdateTimeSinceEpoch": "1702304852848",
    "modelVersionId": "3",
    "name": "mnist-deployment/68cf3949-b9d6-4510-b5e6-4994ffdcd873"
  },
  {
    "createTimeSinceEpoch": "1702298852648",
    "customProperties": {},
    "id": "1",
    "lastKnownState": "COMPLETE",
    "lastUpdateTimeSinceEpoch": "1702304492637",
    "modelVersionId": "3",
    "name": "mnist-deployment/86cdc3d7-86e3-4ef9-a7c0-bbd5df7ba460"
  }
]

NOTE step Upload models and Fill up the model registry could be automated using either notebooks or pipelines, please follow a more complete e2e demo at https://github.com/tarilabs/demo20231121#demo

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

openshift-ci bot commented Nov 27, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Nov 27, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lampajr
Once this PR has been reviewed and has the lgtm label, please assign spolti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lampajr lampajr changed the title Lampajr 20231120 mr serving reconciler Setup model registry serving reconciliation Nov 27, 2023
@lampajr lampajr force-pushed the lampajr_20231120_mr_serving_reconciler branch from 8e5397e to f0e4560 Compare November 27, 2023 12:55
@lampajr lampajr force-pushed the lampajr_20231120_mr_serving_reconciler branch 2 times, most recently from aaa6e2d to d99d986 Compare November 28, 2023 11:49
@lampajr
Copy link
Contributor Author

lampajr commented Nov 30, 2023

I am struggling finding a way to test this new reconciliation as part of the unit testing already setup, i.e., using envtest.

I setup the new reconciler in the existing tests (this should guarantee all other reconcilers are not affected by this new one) but to correctly test the behavior I should be able to connect to an existing Model Registry, this is a sort of preconditions otherwise the reconcile loop won't execute anything at all. Therefore it would be more like an e2e test rather than a unit test.

I was thinking to make use of a TestContainer to setup the Model Registry and connect to that, not sure if that is really feasible and if that is the correct approach here. Any suggestion or advice is really appreciated here!!

@lampajr lampajr force-pushed the lampajr_20231120_mr_serving_reconciler branch from d99d986 to ca2d6a3 Compare December 11, 2023 14:41
@lampajr
Copy link
Contributor Author

lampajr commented Dec 11, 2023

For the records I was able to properly setup test for the new controller by making use of TestContainer to simulate the model-registry (actually the ml-metadata) service.

@lampajr
Copy link
Contributor Author

lampajr commented Dec 11, 2023

/test all

@lampajr
Copy link
Contributor Author

lampajr commented Dec 11, 2023

For the records I was able to properly setup test for the new controller by making use of TestContainer to simulate the model-registry (actually the ml-metadata) service.

Ok it looks like this is not working on openshift-ci as it seems there is no docker daemon running there:

       Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?: failed to create container
        {
            msg: "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?: failed to create container",
            err: <errdefs.errUnknown>{
                error: <client.errConnectionFailed>{
                    host: "unix:///var/run/docker.sock",
                },
            },
        }

Is that not running for a reason? does it could be enabled somehow?
Do you see any other alternative to properly test this use case?

@lampajr lampajr marked this pull request as ready for review December 11, 2023 15:09
@openshift-ci openshift-ci bot requested review from Jooho and spolti December 11, 2023 15:09
@lampajr
Copy link
Contributor Author

lampajr commented Dec 12, 2023

Do you see any other alternative to properly test this use case?

One thing that came to my mind was to mock the model registry library by providing static content, this way I can check reconciler behavior based on that content. This might solve the unit test part, wdyt?

On the other hand setting up some ITs where there is a real connection to a MLMD service could be very helpful as in this case the real model registry library would be used.

@lampajr lampajr force-pushed the lampajr_20231120_mr_serving_reconciler branch from 1671f36 to 1064261 Compare December 12, 2023 10:31
@lampajr
Copy link
Contributor Author

lampajr commented Dec 12, 2023

With commit 1064261 I updated the unit testing for this new reconciliation loop, in particular I removed the TestContainer approach and I mocked the model registry service.

Will evaluate/setup e2e test in a separate PR to avoid blocking this for too long.

Signed-off-by: Andrea Lamparelli <[email protected]>
@lampajr lampajr force-pushed the lampajr_20231120_mr_serving_reconciler branch from 1064261 to 4e17ea0 Compare December 12, 2023 10:41
Signed-off-by: Andrea Lamparelli <[email protected]>
@lampajr
Copy link
Contributor Author

lampajr commented Dec 15, 2023

With commit 177f9af I just updated the model registry service to the latest one which included some fixes.

@spolti @Jooho the PR is ready for a review so whenever you guys have some time, do you mind to check it? Thanks a lot!
I tried to add as much information as possible in the PR description but let me know if you need any other detail or if you have any doubt.

As I also mentioned in the description this PR is actually the implementation of the workflow diagram described and agreed here opendatahub-io/model-registry#104

@@ -167,8 +167,11 @@ rules:
resources:
- inferenceservices
verbs:
- create

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this 2 additional verbs?

Copy link
Contributor Author

@lampajr lampajr Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, that is because the new reconciliation will create/delete/update inferenceservices objects based on the Model Registry content which is actually the main purpose of this new reconciler

@lampajr
Copy link
Contributor Author

lampajr commented Jan 4, 2024

Pull request drafted as we are still considering which direction we would like to implement first, this one or #135 - since these two PRs are conflicting each other, once we decide which one will go first I will adapt the other one as well.

israel-hdez added a commit that referenced this pull request Feb 15, 2024
Implement the model registry InferenceService reconciliation, the implemented workflow is the opposite of the one proposed in #124. Here the direction goes from Cluster to ModelRegistry.

The new reconciler will monitor InferenceService CRs having pre-defined labels, based on those labels will sync the model registry by keeping track of every deployment that occurred in the cluster.
Then will update the InferenceService CR by linking it to the model registry record using a specific label.

* Fix overlays/dev deployment
* Setup InferenceService reconciliation to model registry
* Setup e2e test for new reconciler
* Enable new reconciler
* Create serving env when not found
* Retrieve MR namespace from ISVC label
* Added Kind cluster configuration for local e2e tests
* Check if IS already exists in MR
* Improve e2e testing
* Refactor dev manifests

Signed-off-by: Andrea Lamparelli <[email protected]>
Co-authored-by: Edgar Hernández <[email protected]>
@lampajr lampajr closed this Feb 15, 2024
@lampajr
Copy link
Contributor Author

lampajr commented Feb 15, 2024

Since #135 was merged, I would close this one and if we will need to implement again this workflow I'd suggest to start fresh new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants