tests_gaudi: Added L2 vllm workload
Signed-off-by: vbedida79 <[email protected]>
vbedida79 committed Oct 31, 2024
1 parent 550eed8 commit c7d75b9
Showing 3 changed files with 190 additions and 0 deletions.
79 changes: 79 additions & 0 deletions tests/gaudi/l2/README.md
@@ -74,4 +74,83 @@ Welcome to HCCL demo
[BENCHMARK] NW Bandwidth : 258.209121 GB/s
[BENCHMARK] Algo Bandwidth : 147.548069 GB/s
####################################################################################################
```

## vLLM
vLLM is a serving engine for LLMs. The following workload deploys a vLLM server with an LLM on Intel Gaudi. Refer to the [Intel Gaudi vLLM fork](https://github.com/HabanaAI/vllm-fork.git) for more details.

Build the workload container image:
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_buildconfig.yaml
```
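The build can take several minutes. Optionally, its progress can be followed with standard `oc` commands (a minimal sketch; the build name `vllm-workload-1` assumes this is the first build triggered by the BuildConfig):
```
# List builds in the gaudi-validation namespace and check their status
$ oc get builds -n gaudi-validation

# Follow the build logs until the image is pushed to the internal registry
$ oc logs -f build/vllm-workload-1 -n gaudi-validation
```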
Deploy the workload:
* Update the Hugging Face token and the PVC in the manifest according to your cluster setup (see the sketch after the command below)
```
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/tests/gaudi/l2/vllm_deployment.yaml
```
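The `hf-token` value in the Secret must be base64-encoded, and the PVC's `storageClassName` must name a storage class that exists in your cluster. A minimal sketch for preparing both values before applying the manifest (the placeholder token below is an example, not a real credential):
```
# Base64-encode the Hugging Face token for the Secret's hf-token field
$ echo -n "<your-hugging-face-token>" | base64 -w0

# List the storage classes available in the cluster for the PVC
$ oc get storageclass
```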
Create the vLLM service:
```
$ oc expose deploy/vllm-workload
```
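The command above creates a ClusterIP service that is only reachable from inside the cluster. Optionally, the service can also be exposed through an OpenShift route for external access (a sketch, not required for the in-cluster tests below):
```
# Create a route for the vllm-workload service
$ oc expose svc/vllm-workload -n gaudi-validation

# Print the generated route host
$ oc get route vllm-workload -n gaudi-validation -o jsonpath='{.spec.host}{"\n"}'
```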
Verify the output:
```
$ oc get pods
NAME READY STATUS RESTARTS AGE
vllm-workload-1-build 0/1 Completed 0 19m
vllm-workload-55f7c6cb7b-cwj2b 1/1 Running 0 8m36s
$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vllm-workload ClusterIP 1.2.3.4 <none> 8000/TCP 114s
```
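If the pod is still starting up, it can optionally be waited on before checking the logs (a minimal sketch using the `app=vllm-workload` label from the deployment):
```
# Block until the vLLM pod reports Ready, or give up after 10 minutes
$ oc wait --for=condition=Ready pod -l app=vllm-workload -n gaudi-validation --timeout=600s
```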
```
$ oc logs vllm-workload-55f7c6cb7b-cwj2b
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MIN=32 (default:min)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_STEP=32 (default:step)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MAX=256 (default:max)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:min)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:step)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:max)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:min)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:step)
INFO 10-30 19:35:53 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:max)
INFO 10-30 19:35:53 habana_model_runner.py:691] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 10-30 19:35:53 habana_model_runner.py:696] Decode bucket config (min, step, max_warmup) bs:[32, 32, 256], block:[128, 128, 4096]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM : 1056371848 KB
------------------------------------------------------------------------------
INFO 10-30 19:35:56 selector.py:85] Using HabanaAttention backend.
INFO 10-30 19:35:56 loader.py:284] Loading weights on hpu ...
INFO 10-30 19:35:56 weight_utils.py:224] Using model weights format ['*.safetensors', '*.bin', '*.pt']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:11, 3.87s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.71s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:10<00:03, 3.59s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00, 2.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00, 2.93s/it]
```
Run inference requests using the service URL:
```
sh-5.1# curl "http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/models"
{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B","object":"model","created":1730317412,"owned_by":"vllm","root":"meta-llama/Llama-3.1-8B","parent":null,"max_model_len":131072,"permission":[{"id":"modelperm-452b2bd990834aa5a9416d083fcc4c9e","object":"model_permission","created":1730317412,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```

```
sh-5.1# curl http://vllm-workload.gaudi-validation.svc.cluster.local:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-3.1-8B",
"prompt": "A constellation is a",
"max_tokens": 10
}'
{"id":"cmpl-9a0442d0da67411081837a3a32a354f2","object":"text_completion","created":1730321284,"model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"text":" group of individual stars that forms a pattern or figure","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":15,"completion_tokens":10}}
```
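The curl commands above use the in-cluster service URL, so they must be run from inside the cluster (for example, from another pod in the `gaudi-validation` namespace). To optionally send the same requests from a local machine, the service port can be forwarded (a sketch; port 8000 matches the container port in the deployment):
```
# Forward the vLLM service port to localhost
$ oc port-forward svc/vllm-workload 8000:8000 -n gaudi-validation

# In another terminal, query the models endpoint locally
$ curl http://localhost:8000/v1/models
```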
35 changes: 35 additions & 0 deletions tests/gaudi/l2/vllm_buildconfig.yaml
@@ -0,0 +1,35 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: vllm-workload
  namespace: gaudi-validation
spec: {}
---
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: vllm-workload
  namespace: gaudi-validation
spec:
  triggers:
    - type: "ConfigChange"
    - type: "ImageChange"
  runPolicy: "Serial"
  source:
    git:
      uri: https://github.com/opendatahub-io/vllm.git
      ref: main-gaudi
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile.hpu.ubi
      buildArgs:
        - name: "BASE_IMAGE"
          value: "vault.habana.ai/gaudi-docker/1.18.0/rhel9.4/habanalabs/pytorch-installer-2.4.0:1.18.0-524"
  output:
    to:
      kind: ImageStreamTag
      name: vllm-workload:latest
76 changes: 76 additions & 0 deletions tests/gaudi/l2/vllm_deployment.yaml
@@ -0,0 +1,76 @@
# Copyright (c) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: gaudi-validation
type: Opaque
data:
  hf-token: # Add your token
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: vllm-workload-pvc
  namespace: gaudi-validation
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 60Gi
  storageClassName: "" # Add your storage class
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-workload
  namespace: gaudi-validation
  labels:
    app: vllm-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-workload
  template:
    metadata:
      labels:
        app: vllm-workload
    spec:
      containers:
        - name: vllm-container
          image: image-registry.openshift-image-registry.svc:5000/gaudi-validation/vllm-workload:latest
          command: [ "/bin/bash", "-c", "--" ]
          args: ["vllm serve meta-llama/Llama-3.1-8B"] # Add the model
          ports:
            - containerPort: 8000
          resources:
            limits:
              habana.ai/gaudi: 4
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf-token
            - name: HF_HOME
              value: /home/vllm/.cache/huggingface
          imagePullPolicy: Always
          volumeMounts:
            - name: hf-cache
              mountPath: /home/vllm/.cache
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: hf-cache
          persistentVolumeClaim:
            claimName: vllm-workload-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
