Eventually exchange ollama with vllm and use https://www.nvidia.com/en-us/ai/ for Nvidia #21

Open
jbaumgartl opened this issue Jan 22, 2025 · 2 comments

Comments

@jbaumgartl (Contributor)

https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/vllm/test.py

jbaumgartl changed the title from "Eventually exchange ollama with vllm" to "Eventually exchange ollama with vllm and use https://www.nvidia.com/en-us/ai/ for Nvidia" on Jan 27, 2025
@makoit (Collaborator) commented Feb 2, 2025

vLLM testing

https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/vllm

Run vLLM on the Jetson via the terminal:
jetson-containers run $(autotag vllm)

OR

jetson-containers run -it dustynv/vllm:0.6.3-r36.4.0

-> uses the image: https://hub.docker.com/r/dustynv/vllm
-> opens a container shell
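
For a quick sanity check before serving anything, you can run a small offline generation inside the container shell, loosely modeled on the test.py linked above. This is a minimal sketch; the model name and the memory fraction are example values, not a verified configuration:

```python
# Minimal offline-generation check inside the container, loosely modeled on the
# jetson-containers vLLM test.py. Model and gpu_memory_utilization are example
# values; adjust for the GPU memory available on the Jetson.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.3)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```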

Serve a model in the container:

-> Models are fetched from the HF model hub
vllm serve Qwen/Qwen2.5-0.5B-Instruct --gpu-memory-utilization=0.3
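
Once the server is up, it exposes an OpenAI-compatible API (port 8000 by default). A minimal sketch of querying it from another terminal; the endpoint and model name below assume the Qwen command above and the default port:

```python
# Query the OpenAI-compatible endpoint started by `vllm serve`.
# Assumes the default host/port (localhost:8000) and the Qwen model from above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "Hello from the Jetson!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```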

If you want to run Llama 3.2, you first need to create an HF account, request access to the Meta models, and use an API access token inside the vLLM container, because otherwise the model cannot be downloaded.
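
One way to make the token available inside the container is to log in via huggingface_hub before serving; a sketch (the token string is a placeholder for your own access token):

```python
# Log in to the HF Hub inside the container so gated models (e.g. meta-llama)
# can be downloaded. The token string below is a placeholder.
from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")  # alternatively: export HF_TOKEN=... in the shell
```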

vllm serve meta-llama/Llama-3.2-1B --gpu-memory-utilization=0.7 --max_model_len=6000

(these flags reduce the memory utilization, see dusty-nv/jetson-containers#704)

I tried to serve multiple models (Llama 3.2 1B, Qwen2.5 0.5B, DeepSeek) and was not able to run any of them on the hardware. Either the model is too big to run or exceptions are thrown. There also seems to be a memory allocation overhead in vLLM: dusty-nv/jetson-containers#795.

Here is the exception:


INFO 02-02 17:46:30 model_runner.py:1099] Loading model weights took 2.3185 GB
INFO 02-02 17:46:34 worker.py:241] Memory profiling takes 4.56 seconds
INFO 02-02 17:46:34 worker.py:241] the current vLLM instance can use total_gpu_memory (7.44GiB) x gpu_memory_utilization (0.70) = 5.21GiB
INFO 02-02 17:46:34 worker.py:241] model weights take 2.32GiB; non_torch_memory takes 0.26GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 1.44GiB.
INFO 02-02 17:46:35 gpu_executor.py:76] # GPU blocks: 2943, # CPU blocks: 8192
INFO 02-02 17:46:35 gpu_executor.py:80] Maximum concurrency for 6000 tokens per request: 7.85x
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Task exception was never retrieved
future: <Task finished name='Task-4' coro=<MQLLMEngineClient.run_output_handler_loop() done, defined at /usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py:178> exception=ZMQError('Operation not supported')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 184, in run_output_handler_loop
    while await self.output_socket.poll(timeout=VLLM_RPC_TIMEOUT
  File "/usr/local/lib/python3.10/dist-packages/zmq/_future.py", line 400, in poll
    raise _zmq.ZMQError(_zmq.ENOTSUP)
zmq.error.ZMQError: Operation not supported
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 201, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/scripts.py", line 42, in serve
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 740, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 118, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 223, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
root@ubuntu:/# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
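
For reference, the KV-cache number in the log follows directly from the budget formula vLLM itself reports; a small sketch reproducing the arithmetic with the values copied from the log above:

```python
# Reproduce the memory-budget arithmetic from the worker.py log lines above
# (all values in GiB, copied from the log).
total_gpu_memory = 7.44
gpu_memory_utilization = 0.70
weights, non_torch, activation_peak = 2.32, 0.26, 1.19

budget = total_gpu_memory * gpu_memory_utilization           # ~5.21 GiB usable by vLLM
kv_cache = budget - weights - non_torch - activation_peak    # ~1.44 GiB left for the KV cache
print(f"KV cache budget: {kv_cache:.2f} GiB")
```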

@makoit (Collaborator) commented Feb 5, 2025

@jbaumgartl this is the issue with ollama: dusty-nv/jetson-containers#814 (comment)

Ollama is broken on all platforms; I added some comments to that issue describing how you can run the service and the models.
