Eventually exchange ollama with vllm and use https://www.nvidia.com/en-us/ai/ for Nvidia #21

Open
jbaumgartl opened this issue Jan 22, 2025 · 2 comments

Comments

@jbaumgartl (Contributor)

https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/vllm/test.py

jbaumgartl changed the title from "Eventually exchange ollama with vllm" to "Eventually exchange ollama with vllm and use https://www.nvidia.com/en-us/ai/ for Nvidia" on Jan 27, 2025
@makoit (Collaborator) commented Feb 2, 2025

vLLM testing

https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/vllm

Run vLLM on the Jetson via the terminal:
jetson-containers run $(autotag vllm)

OR

jetson-containers run -it dustynv/vllm:0.6.3-r36.4.0

-> uses the image: https://hub.docker.com/r/dustynv/vllm
-> opens a container shell
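
For a quick sanity check before serving anything, you can run a small offline generation inside the container shell, loosely modeled on the test.py linked above. This is a minimal sketch; the model name and the memory fraction are example values, not a verified configuration:

```python
# Minimal offline-generation check inside the container, loosely modeled on the
# jetson-containers vLLM test.py. Model and gpu_memory_utilization are example
# values; adjust for the GPU memory available on the Jetson.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.3)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```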

Serve a model in the container:

-> Models are fetched from the HF model hub
vllm serve Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization=0.3
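
Once the server is up, it exposes an OpenAI-compatible API (port 8000 by default). A minimal sketch of querying it from another terminal; the endpoint and model name below assume the Qwen command above and the default port:

```python
# Query the OpenAI-compatible endpoint started by `vllm serve`.
# Assumes the default host/port (localhost:8000) and the Qwen model from above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello from the Jetson!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```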

If you want to run Llama 3.2, you first need to create an HF account, request access to the Meta models, and use an API access token inside the vLLM container, because otherwise the model cannot be downloaded.
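
One way to make the token available inside the container is to log in via huggingface_hub before serving; a sketch (the token string is a placeholder for your own access token):

```python
# Log in to the HF Hub inside the container so gated models (e.g. meta-llama)
# can be downloaded. The token string below is a placeholder.
from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")  # alternatively: export HF_TOKEN=... in the shell
```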

vllm serve meta-llama/Llama-3.2-1B --gpu-memory-utilization=0.7 --max_model_len=6000

(these flags reduce the memory utilization, see dusty-nv/jetson-containers#704)

I tried to serve multiple models (Llama 3.2 1B, Qwen2.5 0.5B, DeepSeek) and was not able to run any of them on the hardware. Either the model is too big to run or exceptions are thrown. There also seems to be a memory allocation overhead in vLLM: dusty-nv/jetson-containers#795.

Here is the exception:


INFO 02-02 17:46:30 model_runner.py:1099] Loading model weights took 2.3185 GB
INFO 02-02 17:46:34 worker.py:241] Memory profiling takes 4.56 seconds
INFO 02-02 17:46:34 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.70) = 5.21GiB
INFO 02-02 17:46:34 worker.py:241] model weights take 2.32GiB; non_torch_memory takes 0.26GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 1.44GiB.
INFO 02-02 17:46:35 gpu_executor.py:76] # GPU blocks: 2943, # CPU blocks: 8192
INFO 02-02 17:46:35 gpu_executor.py:80] Maximum concurrency for 6000 tokens per request: 7.85x
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 201, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 42, in serve
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 740, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 118, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 223, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@ubuntu:/# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
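
For reference, the KV-cache number in the log follows directly from the budget formula vLLM itself reports; a small sketch reproducing the arithmetic with the values copied from the log above:

```python
# Reproduce the memory-budget arithmetic from the worker.py log lines above
# (all values in GiB, copied from the log).
total_gpu_memory = 7.44
gpu_memory_utilization = 0.70
weights, non_torch, activation_peak = 2.32, 0.26, 1.19

budget = total_gpu_memory * gpu_memory_utilization           # ~5.21 GiB usable by vLLM
kv_cache = budget - weights - non_torch - activation_peak    # ~1.44 GiB left for the KV cache
print(f"KV cache budget: {kv_cache:.2f} GiB")
```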

@makoit (Collaborator) commented Feb 5, 2025

@jbaumgartl this is the issue with ollama: dusty-nv/jetson-containers#814 (comment)

Ollama is broken on all platforms; I added some comments to that issue describing how you can run the service and the models.
