Running Qwen2-VL-2B-Instruct on TGI is giving an error #2955

Open
ashwani-bhat opened this issue Jan 27, 2025 · 0 comments

System Info

docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0

The above command fails with the following error:


Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 268, in serve_inner
    model = get_model_with_lora_adapters(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 1336, in get_model_with_lora_adapters
    model = get_model(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/__init__.py", line 1184, in get_model
    return VlmCausalLM(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/vlm_causal_lm.py", line 290, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1287, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/qwen2_vl.py", line 392, in __init__
    self.text_model = Qwen2Model(prefix=None, config=config, weights=weights)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 290, in __init__
    [
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 291, in <listcomp>
    Qwen2Layer(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 227, in __init__
    self.self_attn = Qwen2Attention(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 101, in __init__
    self.num_groups = self.num_heads // self.num_key_value_heads
ZeroDivisionError: integer division or modulo by zero

When I run the same command with the 7B model, it works fine.
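
My guess at the mechanism (unverified): with four GPUs visible, the launcher shards the model across all of them by default, and TGI's Qwen2 attention splits the KV heads across shards with integer division before computing num_heads // num_key_value_heads. Qwen2-VL-2B-Instruct appears to have only 2 KV heads in its config.json (the 7B has 4), so 2 // 4 floors to 0 and the division at flash_qwen2_modeling.py line 101 blows up. A standalone sketch of that arithmetic (hypothetical helper name, not TGI source; the head counts are assumptions taken from the models' configs):

# Sketch only (hypothetical names, not TGI code): how a small KV-head
# count can floor to zero under tensor-parallel sharding.

def shard_kv_heads(num_key_value_heads: int, world_size: int) -> int:
    # Integer division splits KV heads across shards; fewer KV heads
    # than shards floors to 0.
    return num_key_value_heads // world_size

for model, kv_heads in [("Qwen2-VL-2B", 2), ("Qwen2-VL-7B", 4)]:
    per_shard = shard_kv_heads(kv_heads, world_size=4)  # 4 visible GPUs
    print(f"{model}: {kv_heads} KV heads across 4 shards -> {per_shard}")
    # 2B -> 0, so the later num_heads // num_key_value_heads raises
    # ZeroDivisionError, matching the traceback; 7B -> 1, which would
    # explain why the same command works there.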

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0
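
If the sharding guess above is right, a possible workaround (untested) is to keep the shard count at or below the model's KV-head count, e.g. force a single shard:

docker run --gpus all --shm-size 1g -p 8080:80 -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
ghcr.io/huggingface/text-generation-inference:2.4.1 \
--model-id Qwen/Qwen2-VL-2B-Instruct --trust-remote-code \
--quantize bitsandbytes-nf4 --cuda-graphs 0 --num-shard 1

--num-shard 2 should also leave a whole KV head per shard (2 // 2 = 1) if two-way parallelism is wanted.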

Expected behavior

The 2B model should load and serve requests without errors, just as the 7B model does with the same command.
