qwen coder32b run on colab t4 #682
Comments
Problem solved after uninstalling flash-attention.
!mkdir my_model2
!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram
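For anyone repeating this fix, here is a minimal sketch of the uninstall step in Python. It assumes the package was installed from PyPI under the name `flash-attn` (imported as `flash_attn`, as the traceback paths in this issue show). The helper names are made up for illustration, and the script defaults to a dry run so nothing is removed unless asked.

```python
import importlib.util
import subprocess
import sys


def flash_attn_installed() -> bool:
    """Return True if the flash_attn module can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def uninstall_command(package: str = "flash-attn") -> list:
    """Build the pip command that removes the package for the current interpreter."""
    return [sys.executable, "-m", "pip", "uninstall", "-y", package]


def remove_flash_attn(dry_run: bool = True) -> bool:
    """Uninstall flash-attn if present. With dry_run=True, only print the command."""
    if not flash_attn_installed():
        print("flash_attn is not installed; nothing to do")
        return False
    cmd = uninstall_command()
    print("command:", " ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    remove_flash_attn(dry_run=True)
```

In a Colab cell the equivalent one-liner is `!pip uninstall -y flash-attn`. Restart the runtime afterwards so that the already imported module is dropped.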
Does not run on a Colab T4 with
bartowski/Qwen2.5-Coder-32B-Instruct-exl2
!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram -cq4
Using bartowski/Qwen2.5-Coder-32B-Instruct-exl2 at 2.2 bits per weight.
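As a rough back-of-the-envelope check of why the 2.2 bpw quantization is the one to try on a T4, the sketch below estimates the size of the weights alone. The parameter count (about 32.5 billion) and the T4's 16 GB of memory are assumptions not stated in this issue. The figure also ignores the KV cache, activations, and any layers stored at a higher bit width, so the real footprint is larger.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the quantized weights, in bytes."""
    return n_params * bits_per_weight / 8


# Assumed values for illustration, not taken from the issue thread.
N_PARAMS = 32.5e9      # approximate parameter count of a 32B-class model
BITS_PER_WEIGHT = 2.2  # average bits per weight of the exl2 quantization

if __name__ == "__main__":
    gib = weight_bytes(N_PARAMS, BITS_PER_WEIGHT) / 2**30
    print(f"weights only: about {gib:.1f} GiB")
```

Under these assumptions the weights come to roughly 8.3 GiB. That leaves some room on a 16 GB card for the cache, which is the part that the cache-related options in the commands above affect.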
usage: chat.py [-h] [-dm DRAFT_MODEL_DIR] [-nds] [-dn DRAFT_N_TOKENS] [-modes]
Run Qwen Coder 32B on a Colab T4 with -c8:
!cd exllamav2 && python examples/chat.py -m ../my_model2 -mode llama -pt -ncf -ngram -c8
The T4 is too old a GPU to use Flash Attention 2.
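The error in the log below says FlashAttention "only supports Ampere GPUs or newer". Ampere corresponds to CUDA compute capability 8.0, while the T4 is a Turing card at 7.5. Here is a minimal sketch of a pre-flight check along those lines. The function names are hypothetical and this is not how exllamav2 itself performs the detection. The PyTorch part is optional and returns an empty list when PyTorch or CUDA is unavailable.

```python
def supports_flash_attention_2(major: int, minor: int = 0) -> bool:
    """Return True if a CUDA compute capability is Ampere (8.0) or newer."""
    return (major, minor) >= (8, 0)


def detect_local_gpus() -> list:
    """List (name, major, minor) for each visible CUDA device, or [] if none."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        gpus.append((torch.cuda.get_device_name(index), major, minor))
    return gpus


if __name__ == "__main__":
    # A T4 reports compute capability 7.5, so this prints False.
    print("T4 (7.5) supported:", supports_flash_attention_2(7, 5))
    for name, major, minor in detect_local_gpus():
        ok = supports_flash_attention_2(major, minor)
        print(f"{name} ({major}.{minor}) supported: {ok}")
```

Running a check like this before installing flash-attn on Colab would avoid the situation in this issue, where the package is present but unusable on the assigned GPU.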
OS
Linux
GPU Library
CUDA 12.x
Python version
3.10
Pytorch version
xxxxxxxxxxx
Model
turboderp/Mistral-7B-instruct-exl2
Describe the bug
Warning: Flash Attention is installed but unsupported GPUs were detected
Reproduction steps
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Expected behavior
not run
Logs
!cd exllamav2 && python examples/chat.py -m ../my_model -mode llama -pt -ncf -ngram
Loading exllamav2_ext extension (JIT)...
Building C++/CUDA extension ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:13:45 0:00:00
Warning: Flash Attention is installed but unsupported GPUs were detected.
-- Model: ../my_model
-- Options: []
-- Loading tokenizer...
-- Loading model...
-- Loading model...
-- Prompt format: llama
-- System prompt:
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
User: welcome
Traceback (most recent call last):
File "/content/exllamav2/examples/chat.py", line 310, in <module>
generator.begin_stream_ex(active_context, settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 363, in begin_stream_ex
self._gen_begin_reuse(input_ids, gen_settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 731, in _gen_begin_reuse
self._gen_begin(in_tokens, gen_settings)
File "/content/exllamav2/exllamav2/generator/streaming.py", line 692, in _gen_begin
self.model.forward(self.sequence_ids[:, :-1],
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/exllamav2/exllamav2/model.py", line 898, in forward
r = self.forward_chunk(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/content/exllamav2/exllamav2/model.py", line 1004, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
File "/content/exllamav2/exllamav2/attn.py", line 1125, in forward
attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
File "/content/exllamav2/exllamav2/attn.py", line 929, in _attn_flash
attn_output = flash_attn_func(
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 1163, in flash_attn_func
return FlashAttnFunc.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 810, in forward
out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_forward(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 324, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/_library/custom_ops.py", line 367, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py", line 91, in _flash_attn_forward
out, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Additional context
No response