
[Bug]: Qwen2.5-72B model hang/crash #61

Open
tstescoTT opened this issue Feb 5, 2025 · 4 comments
Labels: bug (Something isn't working)

@tstescoTT commented Feb 5, 2025

Your current environment

Ubuntu 20.04
tt-metal (hf-llama branch yesterday): https://github.com/tenstorrent/tt-metal/tree/aac80d111840ccc324a105d499060e814ca7f2c0
vllm (dev pinned commit): https://github.com/tenstorrent/vllm/tree/b9564bf364e95a3850619fc7b2ed968cc71e30b7

running using: https://github.com/tenstorrent/tt-inference-server/blob/tstesco/qwen25-support/vllm-tt-metal-llama3/src/run_vllm_api_server.py

Model Input Dumps

No response

🐛 Describe the bug

After a few hundred completions during a BBH eval run, I got the crash shown below.

Repro:

  1. run vLLM + model using: https://github.com/tenstorrent/tt-inference-server/blob/tstesco/qwen25-support/vllm-tt-metal-llama3/src/run_vllm_api_server.py
  2. run BBH eval via:
# meta_bbh (3 shot COT)
lm_eval \
--model local-completions \
--model_args model=${HF_MODEL_REPO_ID},base_url=http://127.0.0.1:7000/v1/completions,num_concurrent=32,max_retries=4,tokenized_requests=False,add_bos_token=True,timeout=None \
--gen_kwargs model=${HF_MODEL_REPO_ID},stream=False \
--tasks meta_bbh \
--batch_size auto \
--output_path /home/container_app_user/cache_root/eval_output \
--include_path ./work_dir \
--seed 42  \
--log_samples

Crash log (full log attached):

INFO:     127.0.0.1:49498 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-05 07:57:34 logger.py:37] Received request cmpl-0b4977cfb4cb480aaf1a5c9eb97d7184-0: prompt: 'A logical deduction task which requires deducing the order of a ', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, stop=['<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [32, 19819, 38843, 3383, 892, 7460, 7681, 58044, 279, 1973, 315, 264, 8500, 315, 6171, 382, 48, 25, 576, 2701, 42643, 1817, 7512, 264, 738, 315, 2326, 6171, 27802, 304, 264, 8356, 1973, 13, 576, 12239, 525, 73045, 12966, 2878, 1817, 14311, 13, 758, 264, 19120, 16129, 11, 1052, 1033, 2326, 19120, 388, 25, 28693, 11, 33262, 11, 323, 31360, 13, 31360, 8060, 3403], lora_request: None, prompt_adapter_request: None.
DEBUG 02-05 07:57:34 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 02-05 07:57:34 engine.py:291] Added request cmpl-0b4977cfb4cb480aaf1a5c9eb97d7184-0.
2025-02-05 07:57:35.147 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 1
2025-02-05 07:57:35.937 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 2
2025-02-05 07:57:36.725 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:80 - Finished prefill for all users up to 861 tokens, Starting decode...
INFO:     127.0.0.1:49428 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-05 07:57:36 logger.py:37] Received request cmpl-ab41b0c0fe1c42e5bd9cbc85e4d811fc-0: prompt: 'Detect the type of error in an English translation of a German s', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, stop=['<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [57193, 279, 943, 315, 1465, 304, 458, 6364, 14468, 315, 264, 5938, 2530, 11652, 382, 48, 25, 576, 2701, 36693, 504, 5938, 311, 6364, 6644, 264, 3953, 1465, 13, 2938, 1465, 686, 387, 825, 315, 279, 2701, 4494, 25, 40459, 52249, 25, 1527, 5387, 320, 11400, 11, 7482, 11, 10468, 11, 4992, 6138, 374, 5497, 311, 264, 2155, 5387, 13, 47124, 938, 24979, 25], lora_request: None, prompt_adapter_request: None.
DEBUG 02-05 07:57:36 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 02-05 07:57:36 engine.py:291] Added request cmpl-ab41b0c0fe1c42e5bd9cbc85e4d811fc-0.
INFO:     127.0.0.1:49396 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-05 07:57:37 logger.py:37] Received request cmpl-bff1abf4d7454c71b662967a4d4836a5-0: prompt: 'Name geometric shapes from their SVG paths.\n\nQ: This SVG path el', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, stop=['<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [675, 52484, 20816, 504, 862, 39800, 12716, 382, 48, 25, 1096, 39800, 1815, 2392, 366, 2343, 294, 428, 44, 220, 18, 16, 13, 15, 15, 11, 22, 18, 13, 15, 15, 444, 220, 18, 17, 13, 15, 15, 11, 20, 24, 13, 15, 15, 444, 220, 19, 19, 13, 15, 15, 11, 20, 15, 13, 15, 15, 444, 220, 19, 24, 13, 15, 15], lora_request: None, prompt_adapter_request: None.
DEBUG 02-05 07:57:37 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 02-05 07:57:37 engine.py:291] Added request cmpl-bff1abf4d7454c71b662967a4d4836a5-0.
2025-02-05 07:57:37.484 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 1
2025-02-05 07:57:39.331 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 2
...
ERROR 02-05 08:12:08 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 02-05 08:12:08 client.py:250] NoneType: None
DEBUG 02-05 08:12:08 client.py:144] Shutting down MQLLMEngineClient check health loop due to timeout
DEBUG 02-05 08:12:37 client.py:170] Waiting for output from MQLLMEngine.
CRITICAL 02-05 08:12:37 launcher.py:99] MQLLMEngine is already dead, terminating server process
...
CRITICAL 02-05 08:12:37 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [33427]
                 Device | INFO     | Closing user mode device drivers
libc++abi: terminating due to uncaught exception of type std::runtime_error: Timed out after waiting 1 seconds for device 4 ARC to respond to message 0xaa54
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)

2025-02-05-qwen25-72b-crash.log

tstescoTT added the bug label Feb 5, 2025
@skhorasganiTT

cc @uaydonat

@tstescoTT (Author)

Looks like it happened again. I rebooted the system and reset the cards before running.

2025-02-05 11:20:42.758 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:80 - Finished prefill for all users up to 1560 tokens, Starting decode...
INFO 02-05 11:20:42 metrics.py:396] Avg prompt throughput: 243.3 tokens/s, Avg generation throughput: 12.2 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 34.3%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:38136 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-05 11:20:45 logger.py:37] Received request cmpl-126eda5643b242e3894e552562064e6d-0: prompt: 'Name geometric shapes from their SVG paths.\n\nQ: This SVG path el', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, stop=['\n\nQ: '], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [675, 52484, 20816, 504, 862, 39800, 12716, 382, 48, 25, 1096, 39800, 1815, 2392, 366, 2343, 294, 428, 44, 220, 18, 16, 13, 15, 15, 11, 22, 18, 13, 15, 15, 444, 220, 18, 17, 13, 15, 15, 11, 20, 24, 13, 15, 15, 444, 220, 19, 19, 13, 15, 15, 11, 20, 15, 13, 15, 15, 444, 220, 19, 24, 13, 15, 15], lora_request: None, prompt_adapter_request: None.
DEBUG 02-05 11:20:45 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 02-05 11:20:45 engine.py:291] Added request cmpl-126eda5643b242e3894e552562064e6d-0.
2025-02-05 11:20:45.814 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 1
2025-02-05 11:20:47.675 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:80 - Finished prefill for all users up to 1239 tokens, Starting decode...
INFO 02-05 11:20:48 metrics.py:396] Avg prompt throughput: 484.7 tokens/s, Avg generation throughput: 27.9 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 35.7%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:38206 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-05 11:20:49 logger.py:37] Received request cmpl-c7295fae81e54f39a00e5b0dea752d41-0: prompt: 'Detect the type of error in an English translation of a German s', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=1234, stop=['\n\nQ: '], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [57193, 279, 943, 315, 1465, 304, 458, 6364, 14468, 315, 264, 5938, 2530, 11652, 382, 48, 25, 576, 2701, 36693, 504, 5938, 311, 6364, 6644, 264, 3953, 1465, 13, 2938, 1465, 686, 387, 825, 315, 279, 2701, 4494, 25, 40459, 52249, 25, 1527, 5387, 320, 11400, 11, 7482, 11, 10468, 11, 4992, 6138, 374, 5497, 311, 264, 2155, 5387, 13, 47124, 938, 24979, 25], lora_request: None, prompt_adapter_request: None.
DEBUG 02-05 11:20:49 async_llm_engine.py:523] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 02-05 11:20:49 engine.py:291] Added request cmpl-c7295fae81e54f39a00e5b0dea752d41-0.
2025-02-05 11:20:49.227 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 1
2025-02-05 11:20:51.082 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 2
2025-02-05 11:20:54.724 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:80 - Finished prefill for all users up to 2365 tokens, Starting decode...
INFO 02-05 11:20:54 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.8 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 15 reqs, GPU KV cache usage: 36.1%, CPU KV cache usage: 0.0%.
2025-02-05 11:20:56.265 | INFO     | models.demos.llama3.tt.generator:prefill_forward_text:58 - Prefilling User 1

This was after 147 prompts completed in BBH:

2025-02-05:11:03:07,898 WARNING  [task.py:337] [Task: meta_bbh] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2025-02-05:11:03:07,906 WARNING  [task.py:337] [Task: meta_bbh] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2025-02-05:11:03:08,392 INFO     [evaluator.py:279] Setting fewshot random generator seed to 42
2025-02-05:11:03:08,392 WARNING  [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2025-02-05:11:03:08,394 INFO     [task.py:423] Building contexts for meta_bbh on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6511/6511 [00:00<00:00, 23359.87it/s]
2025-02-05:11:03:09,180 INFO     [evaluator.py:465] Running generate_until requests
Requesting API:   2%|████▏      

@skhorasganiTT

I have not been able to reproduce this hang (which appears to occur during prefill based on the logs) on sjc-snva-t3018 (with the same commits) after testing the offline inference script with the async engine and several hundred prompts for each input sequence length of 2k/3k/4k. It's possible that this is machine specific. @tstescoTT if any other machines with reworked cards become available would you be able to try them out?
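For comparison, a minimal sketch of this kind of offline async-engine stress test. This is not the actual offline inference script; the model name, engine arguments, and prompt construction are assumptions (the TT backend will need additional device-specific engine args).

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def main(num_prompts: int = 300, input_len: int = 4096, output_len: int = 128):
    # Engine args are placeholders for illustration only.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="Qwen/Qwen2.5-72B-Instruct", max_model_len=8192)
    )
    params = SamplingParams(temperature=0.0, max_tokens=output_len)
    # Repeat a filler word to roughly approximate the target input sequence length
    # (actual token count depends on the tokenizer).
    prompt = "hello " * input_len

    async def one_request(i: int):
        final = None
        # generate() is an async generator yielding incremental RequestOutputs.
        async for out in engine.generate(prompt, params, request_id=str(i)):
            final = out
        return final

    results = await asyncio.gather(*(one_request(i) for i in range(num_prompts)))
    print(f"completed {len(results)} requests")

if __name__ == "__main__":
    asyncio.run(main())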

@tstescoTT (Author)

I have not been able to reproduce this hang (which appears to occur during prefill based on the logs) on sjc-snva-t3018 (with the same commits) after testing the offline inference script with the async engine and several hundred prompts for each input sequence length of 2k/3k/4k. It's possible that this is machine specific. @tstescoTT if any other machines with reworked cards become available would you be able to try them out?

Will do; it might be a while before that, but thanks for trying to repro.

For reference in this channel, I was able to repro 4x by sending ISL=3900, OSL=128 and ISL=4096, OSL=128, both at batch 32, over multiple batches of requests. The hang typically occurred after 64 to 256 requests.
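A rough load-generation sketch approximating that repro against the running server, in case it helps anyone else reproduce. The prompt construction, model id, and request counts below are assumptions, not the exact script used.

import concurrent.futures

import requests

BASE_URL = "http://127.0.0.1:7000/v1/completions"
MODEL = "Qwen/Qwen2.5-72B-Instruct"  # assumed HF repo id
BATCH_SIZE = 32
NUM_BATCHES = 8  # hang typically appeared after 64-256 requests

def one_request(i: int) -> int:
    # Repeat a filler word to roughly approximate ISL~3900 tokens (tokenizer-dependent).
    prompt = "hello " * 3900
    resp = requests.post(
        BASE_URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 128, "temperature": 0.0},
        timeout=600,
    )
    return resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
    for batch in range(NUM_BATCHES):
        codes = list(pool.map(one_request, range(BATCH_SIZE)))
        print(f"batch {batch}: {codes.count(200)}/{BATCH_SIZE} returned 200 OK")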
