Merge pull request #132 from redhat-et/mixtral_serving_pvc
serving with a pvc and directions
MichaelClifford authored Oct 30, 2024
2 parents abe3dc3 + 90604a7 commit c807fb2
Showing 6 changed files with 290 additions and 91 deletions.
127 changes: 127 additions & 0 deletions kubernetes_yaml/mixtral_serve/README.md
@@ -0,0 +1,127 @@
## Mixtral serving
Mixtral is required at several points in the InstructLab process. The following describes how to provide Mixtral and its LoRA adapters.

### Secret
Because we need to run `oras` inside a container to download the various artifacts, we must provide a `.dockerconfigjson` to the Kubernetes job with authentication for registry.redhat.io.
It is suggested to use a service account; one can be created at https://access.redhat.com/terms-based-registry/accounts.

Create a secret based on the service account credentials.

secret.yaml

```
apiVersion: v1
kind: Secret
metadata:
  name: 7033380-ilab-pull-secret
data:
  .dockerconfigjson: sadfassdfsadfasdfasdfasdfasdfasdfasdf=
type: kubernetes.io/dockerconfigjson
```

Create the secret:

```
oc create -f secret.yaml
```
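
Alternatively, `oc` can generate the same kind of secret directly from the service account credentials, without hand-editing a base64 blob. The username and token below are placeholders for your own service account values:

```
oc create secret docker-registry 7033380-ilab-pull-secret \
  --docker-server=registry.redhat.io \
  --docker-username='7033380|ilab-sa' \
  --docker-password='<service-account-token>'
```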

### Kubernetes Job
Depending on the name of your secret, the file `../mixtral_pull/pull_kube_job.yaml` will need to be modified to reference it.

```
...redacted...
- name: docker-config
  secret:
    secretName: 7033380-ilab-pull-secret
...redacted...
```

With `secretName` now reflecting your secret, the job can be launched:

```
kubectl create -f ./mixtral_pull
```

This will create a job with three containers, each using `oras` to pull one artifact onto the PVC: the knowledge adapter, the skills adapter, and the Mixtral model itself.
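
The model pull in particular is large, so the job can take a while. Something like the following can be used to watch for completion and to follow an individual pull (container names as defined in the job):

```
kubectl get job oras-copy-job -w
kubectl logs -f job/oras-copy-job -c oras-copy-model
```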

### Mixtral serving
This may seem odd, but it is the only way discovered so far to ensure that a token is generated to work with the model. Using the RHODS model serving UI, define a model to be served named `mixtral`. Ensure that external access and token authentication are selected, as the token is the piece not yet reproducible using the CLI alone.

We will now use the PVC from the previous step to serve the model, replacing the runtime defined in the UI:

```
kubectl apply -f ./mixtral_serve/runtime.yaml
```
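
As a sanity check, confirm the runtime exists before editing the inference service (assuming the KServe/ODH CRDs resolve the `servingruntime` resource name):

```
oc get servingruntime mixtral
```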

Modify the inference service, replacing its entire `spec` field with the one from `./mixtral_serve/inference.yaml`:

```
oc edit inferenceservice mixtral
```

```
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
      - --dtype=bfloat16
      - --tensor-parallel-size=4
      - --enable-lora
      - --max-lora-rank=64
      - --lora-dtype=bfloat16
      - --fully-sharded-loras
      - --lora-modules
      - skill-classifier-v3-clm=/mnt/skills
      - text-classifier-knowledge-v3-clm=/mnt/knowledge
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 40Gi
          nvidia.com/gpu: "4"
        requests:
          cpu: "4"
          memory: 40Gi
          nvidia.com/gpu: "4"
      runtime: mixtral
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
```
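
After saving the edit, KServe rolls out a new predictor pod. Watching the InferenceService should eventually show it as ready along with its external URL (column output varies by KServe version):

```
oc get inferenceservice mixtral -w
```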


Follow the log of the `kserve-container` while the model loads. One way to tail it (assuming KServe's standard `serving.kserve.io/inferenceservice` pod label; adjust for your namespace) is:
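
```
oc logs -f -l serving.kserve.io/inferenceservice=mixtral -c kserve-container
```

Wait for the following log entries: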

```
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```


### Testing
To interact with the model, grab the inference endpoint from the RHOAI UI and retrieve the token:

```
oc get secret -o yaml default-name-mixtral-sa | grep token: | awk -F: '{print $2}' | tr -d ' ' | base64 -d
```

Export that value as a variable named `TOKEN`:

```
export TOKEN=BLOBOFLETTERSANDNUMBERS
```
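
Alternatively, the extraction and export can be combined into one step with `jsonpath` (assuming the same secret name as above):

```
export TOKEN=$(oc get secret default-name-mixtral-sa -o jsonpath='{.data.token}' | base64 -d)
```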

Using `curl`, you can verify that the model is accepting connections:
```
curl -X POST "https://mixtral-labels.apps.hulk.octo-emerging.redhataicoe.com/v1/completions" -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" -d '{"model": "mixtral", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 }'
{"id":"cmpl-ecd5bd72a947438b805e25134bbdf636","object":"text_completion","created":1730231625,"model":"mixtral","choices":[{"index":0,"text":" city that is known for its steep","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}%
```
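
Because the adapters were registered with `--lora-modules`, vLLM should also accept each adapter name as its own `model` value. For example, against the skills classifier adapter (same endpoint and token):

```
curl -X POST "https://mixtral-labels.apps.hulk.octo-emerging.redhataicoe.com/v1/completions" -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" -d '{"model": "skill-classifier-v3-clm", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0 }'
```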
44 changes: 44 additions & 0 deletions kubernetes_yaml/mixtral_serve/mixtral_pull/pull_kube_job.yaml
@@ -0,0 +1,44 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: oras-copy-job
spec:
  template:
    spec:
      containers:
      - name: oras-copy-knowledge
        image: ghcr.io/oras-project/oras:v1.2.0
        command: ["oras", "pull", "registry.redhat.io/rhelai1/knowledge-adapter-v3:1.2-1728663941", "--output", "/mnt/knowledge", "--registry-config", "/workspace/.docker"]
        volumeMounts:
        - name: docker-config
          mountPath: /workspace/.docker
          subPath: .dockerconfigjson # Mount the Docker config as config.json
        - name: model-pvc
          mountPath: /mnt
      - name: oras-copy-skills
        image: ghcr.io/oras-project/oras:v1.2.0
        command: ["oras", "pull", "registry.redhat.io/rhelai1/skills-adapter-v3:1.2-1728663941", "--output", "/mnt/skills", "--registry-config", "/workspace/.docker"]
        volumeMounts:
        - name: docker-config
          mountPath: /workspace/.docker
          subPath: .dockerconfigjson # Mount the Docker config as config.json
        - name: model-pvc
          mountPath: /mnt
      - name: oras-copy-model
        image: ghcr.io/oras-project/oras:v1.2.0
        command: ["oras", "pull", "registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1:1.2-1728663941", "--output", "/mnt/model", "--registry-config", "/workspace/.docker"]
        volumeMounts:
        - name: docker-config
          mountPath: /workspace/.docker
          subPath: .dockerconfigjson # Mount the Docker config as config.json
        - name: model-pvc
          mountPath: /mnt
      restartPolicy: Never
      volumes:
      - name: model-pvc
        persistentVolumeClaim:
          claimName: mixtral-serving-ilab
      - name: docker-config
        secret:
          secretName: 7033380-ilab-pull-secret
  backoffLimit: 4
11 changes: 11 additions & 0 deletions kubernetes_yaml/mixtral_serve/mixtral_pull/pvc.yaml
@@ -0,0 +1,11 @@
## PVC to be used for model storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mixtral-serving-ilab
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
91 changes: 0 additions & 91 deletions kubernetes_yaml/mixtral_serve/mixtral_serve.yaml

This file was deleted.

51 changes: 51 additions & 0 deletions kubernetes_yaml/mixtral_serve/mixtral_serve/inference.yaml
@@ -0,0 +1,51 @@
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: mixtral
    security.opendatahub.io/enable-auth: "true"
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  creationTimestamp: "2024-10-29T19:14:46Z"
  finalizers:
  - inferenceservice.finalizers
  generation: 5
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
  namespace: labels
  resourceVersion: "8869282"
  uid: 433d76da-6c52-4b47-a3cd-ba3765e7b5bf
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
      - --dtype=bfloat16
      - --tensor-parallel-size=4
      - --enable-lora
      - --max-lora-rank=64
      - --lora-dtype=bfloat16
      - --fully-sharded-loras
      - --lora-modules
      - skill-classifier-v3-clm=/mnt/skills
      - text-classifier-knowledge-v3-clm=/mnt/knowledge
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 40Gi
          nvidia.com/gpu: "4"
        requests:
          cpu: "4"
          memory: 40Gi
          nvidia.com/gpu: "4"
      runtime: mixtral
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
57 changes: 57 additions & 0 deletions kubernetes_yaml/mixtral_serve/mixtral_serve/runtime.yaml
@@ -0,0 +1,57 @@
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: vLLM ServingRuntime for KServe
    opendatahub.io/template-name: vllm-runtime
    openshift.io/display-name: mixtral
  creationTimestamp: "2024-10-25T15:59:12Z"
  generation: 3
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
  namespace: labels
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
  - args:
    - --port=8080
    - --model=/mnt/model
    - --served-model-name={{.Name}}
    - --distributed-executor-backend=mp
    command:
    - python
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: HF_HOME
      value: /tmp/hf_home
    image: quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
    name: kserve-container
    ports:
    - containerPort: 8080
      protocol: TCP
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
    - mountPath: /mnt
      name: mixtral-serve
  multiModel: false
  storageHelper:
    disabled: true
  supportedModelFormats:
  - autoSelect: true
    name: vLLM
  volumes:
  - name: mixtral-serve
    persistentVolumeClaim:
      claimName: mixtral-serving-ilab
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: shm
