deepseek v2.5 w4a16 fails to run on vllm-0.6.3.post1 #99

Open
cxmt-ai-tc opened this issue Dec 9, 2024 · 1 comment
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m vllm.entrypoints.openai.api_server --model /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/ -tp 4 --max-model-len 4096 --trust-remote-code
INFO 12-08 23:21:23 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 12-08 23:21:23 api_server.py:529] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, 
max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 12-08 23:21:23 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/0417594d-6be5-47ed-a972-83686b6fd559 for IPC Path.
INFO 12-08 23:21:23 api_server.py:179] Started engine process with PID 23409
INFO 12-08 23:21:23 config.py:107] Replacing legacy 'type' key with 'rope_type'
INFO 12-08 23:21:26 config.py:107] Replacing legacy 'type' key with 'rope_type'
INFO 12-08 23:21:26 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:21:26 config.py:905] Defaulting to use mp for distributed inference
WARNING 12-08 23:21:26 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 12-08 23:21:30 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-08 23:21:30 config.py:905] Defaulting to use mp for distributed inference
WARNING 12-08 23:21:30 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 12-08 23:21:30 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', speculative_config=None, tokenizer='/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 12-08 23:21:30 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 12-08 23:21:30 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=23680) INFO 12-08 23:21:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23682) INFO 12-08 23:21:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23681) INFO 12-08 23:21:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 12-08 23:21:31 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23682) INFO 12-08 23:21:31 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23680) INFO 12-08 23:21:31 utils.py:1008] Found nccl from library libnccl.so.2
INFO 12-08 23:21:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=23681) INFO 12-08 23:21:31 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23682) INFO 12-08 23:21:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=23680) INFO 12-08 23:21:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=23681) INFO 12-08 23:21:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=23680) WARNING 12-08 23:21:31 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=23681) WARNING 12-08 23:21:31 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=23682) WARNING 12-08 23:21:31 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 12-08 23:21:31 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 12-08 23:21:32 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f0470aa4490>, local_subscribe_port=54089, remote_subscribe_port=None)
INFO 12-08 23:21:32 model_runner.py:1056] Starting to load model /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/...
(VllmWorkerProcess pid=23681) INFO 12-08 23:21:32 model_runner.py:1056] Starting to load model /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/...
(VllmWorkerProcess pid=23680) INFO 12-08 23:21:32 model_runner.py:1056] Starting to load model /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/...
(VllmWorkerProcess pid=23682) INFO 12-08 23:21:32 model_runner.py:1056] Starting to load model /nas_data/userdata/tc/models/deepseek/deepseek-coder-v2-instruct-awq/...
Cache shape torch.Size([163840, 64])
(VllmWorkerProcess pid=23680) Cache shape torch.Size([163840, 64])
(VllmWorkerProcess pid=23682) Cache shape torch.Size([163840, 64])
(VllmWorkerProcess pid=23681) Cache shape torch.Size([163840, 64])
Loading safetensors checkpoint shards: 0% Completed | 0/26 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 4% Completed | 1/26 [00:00<00:20, 1.25it/s]
Loading safetensors checkpoint shards: 8% Completed | 2/26 [00:01<00:22, 1.05it/s]
Loading safetensors checkpoint shards: 12% Completed | 3/26 [00:02<00:22, 1.02it/s]
Loading safetensors checkpoint shards: 15% Completed | 4/26 [00:03<00:21, 1.03it/s]
Loading safetensors checkpoint shards: 19% Completed | 5/26 [00:04<00:21, 1.01s/it]
Loading safetensors checkpoint shards: 23% Completed | 6/26 [00:06<00:21, 1.06s/it]
Loading safetensors checkpoint shards: 27% Completed | 7/26 [00:07<00:20, 1.08s/it]
Loading safetensors checkpoint shards: 31% Completed | 8/26 [00:08<00:19, 1.06s/it]
Loading safetensors checkpoint shards: 35% Completed | 9/26 [00:09<00:18, 1.08s/it]
Loading safetensors checkpoint shards: 38% Completed | 10/26 [00:10<00:17, 1.09s/it]
Loading safetensors checkpoint shards: 42% Completed | 11/26 [00:11<00:16, 1.08s/it]
Loading safetensors checkpoint shards: 46% Completed | 12/26 [00:12<00:15, 1.07s/it]
Loading safetensors checkpoint shards: 50% Completed | 13/26 [00:13<00:14, 1.09s/it]
Loading safetensors checkpoint shards: 54% Completed | 14/26 [00:14<00:13, 1.09s/it]
Loading safetensors checkpoint shards: 58% Completed | 15/26 [00:15<00:11, 1.08s/it]
Loading safetensors checkpoint shards: 62% Completed | 16/26 [00:16<00:10, 1.07s/it]
Loading safetensors checkpoint shards: 65% Completed | 17/26 [00:17<00:09, 1.05s/it]
Loading safetensors checkpoint shards: 69% Completed | 18/26 [00:18<00:08, 1.06s/it]
Loading safetensors checkpoint shards: 73% Completed | 19/26 [00:20<00:07, 1.07s/it]
Loading safetensors checkpoint shards: 77% Completed | 20/26 [00:21<00:06, 1.08s/it]
Loading safetensors checkpoint shards: 81% Completed | 21/26 [00:22<00:05, 1.08s/it]
Loading safetensors checkpoint shards: 85% Completed | 22/26 [00:23<00:04, 1.10s/it]
Loading safetensors checkpoint shards: 88% Completed | 23/26 [00:24<00:03, 1.12s/it]
Loading safetensors checkpoint shards: 92% Completed | 24/26 [00:25<00:02, 1.12s/it]
Loading safetensors checkpoint shards: 96% Completed | 25/26 [00:26<00:01, 1.12s/it]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:27<00:00, 1.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 26/26 [00:27<00:00, 1.06s/it]

(VllmWorkerProcess pid=23682) INFO 12-08 23:22:13 model_runner.py:1067] Loading model weights took 30.3855 GB
(VllmWorkerProcess pid=23680) INFO 12-08 23:22:13 model_runner.py:1067] Loading model weights took 30.3855 GB
INFO 12-08 23:22:13 model_runner.py:1067] Loading model weights took 30.3855 GB
(VllmWorkerProcess pid=23681) INFO 12-08 23:22:14 model_runner.py:1067] Loading model weights took 30.3855 GB
(VllmWorkerProcess pid=23680) WARNING 12-08 23:22:14 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
(VllmWorkerProcess pid=23682) WARNING 12-08 23:22:14 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
(VllmWorkerProcess pid=23681) WARNING 12-08 23:22:14 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
(VllmWorkerProcess pid=23680) INFO 12-08 23:22:14 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl...
(VllmWorkerProcess pid=23682) INFO 12-08 23:22:14 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl...
(VllmWorkerProcess pid=23681) INFO 12-08 23:22:14 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl...
WARNING 12-08 23:22:14 fused_moe.py:323] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=160,N=10240,device_name=NVIDIA_A40.json
INFO 12-08 23:22:14 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl...
(VllmWorkerProcess pid=23682) INFO 12-08 23:22:14 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl.
(VllmWorkerProcess pid=23680) INFO 12-08 23:22:14 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl.
(VllmWorkerProcess pid=23681) INFO 12-08 23:22:14 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl.
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return func(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1658, in execute_model
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 510, in forward
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 465, in forward
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 402, in forward
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 149, in forward
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     final_hidden_states = self.experts(
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 474, in forward
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     final_hidden_states = self.quant_method.apply(
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 452, in apply
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     return fused_marlin_moe(
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]     ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return fused_marlin_moe(
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return fn(*args, **kwargs)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return fn(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 1061, in call
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return self
._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return fn(*args, **kwargs)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in call
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/custom_ops.py", line 844, in moe_align_block_size
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return self
._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/ops.py", line 1061, in call
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return self
._op(*args, **(kwargs or {}))
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] RuntimeError: CUDA error: invalid argument
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] self.model_runner.profile_run()
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1305, in profile_run
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] self.model_runner.profile_run()
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] self.model_runner.profile_run()
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1305, in profile_run
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] raise type(err)(
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1305, in profile_run
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-232214.pkl): CUDA error: invalid argument
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=23680) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] return func(*args, **kwargs)
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] raise type(err)(
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-232214.pkl): CUDA error: invalid argument
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] raise type(err)(
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-232214.pkl): CUDA error: invalid argument
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(VllmWorkerProcess pid=23681) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
(VllmWorkerProcess pid=23682) ERROR 12-08 23:22:14 multiproc_worker_utils.py:229]
INFO 12-08 23:22:14 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241208-232214.pkl.
(VllmWorkerProcess pid=23680) INFO 12-08 23:22:14 multiproc_worker_utils.py:240] Worker exiting
(VllmWorkerProcess pid=23681) INFO 12-08 23:22:14 multiproc_worker_utils.py:240] Worker exiting
(VllmWorkerProcess pid=23682) INFO 12-08 23:22:14 multiproc_worker_utils.py:240] Worker exiting
INFO 12-08 23:22:14 multiproc_worker_utils.py:120] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1658, in execute_model
hidden_or_intermediate_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 510, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 465, in forward
hidden_states, residual = layer(positions, hidden_states,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 402, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 149, in forward
final_hidden_states = self.experts(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 474, in forward
final_hidden_states = self.quant_method.apply(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 452, in apply
return fused_marlin_moe(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 219, in fused_marlin_moe
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 45, in wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 844, in moe_align_block_size
torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1061, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 348, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 483, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1305, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-232214.pkl): CUDA error: invalid argument
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
root@s0pgpuap12:/workspace/sglang# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
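For context on where the crash happens: `moe_align_block_size` takes the flattened (token, expert) assignments in `topk_ids`, groups them by expert, and pads each expert's group to a multiple of `block_size` so the Marlin MoE kernel can process fixed-size blocks. A simplified pure-Python sketch of those semantics (an illustration only, not vLLM's CUDA implementation; the function name, padding sentinel, and return layout here are assumptions):

```python
# Simplified CPU sketch of the moe_align_block_size semantics
# (illustrative, not vLLM's implementation).
def moe_align_block_size_ref(topk_ids, num_experts, block_size):
    # Total number of (token, expert) pairs; also used as the
    # out-of-range sentinel index that fills padded slots.
    numel = sum(len(row) for row in topk_ids)
    pad = numel

    # Bucket each flattened token-expert pair by its assigned expert.
    buckets = [[] for _ in range(num_experts)]
    for flat_idx, expert in enumerate(e for row in topk_ids for e in row):
        buckets[expert].append(flat_idx)

    sorted_ids, expert_ids = [], []
    for expert, ids in enumerate(buckets):
        if not ids:
            continue
        # Pad this expert's bucket up to a multiple of block_size.
        padded = ids + [pad] * (-len(ids) % block_size)
        sorted_ids.extend(padded)
        # One expert id per block of block_size slots.
        expert_ids.extend([expert] * (len(padded) // block_size))
    return sorted_ids, expert_ids, len(sorted_ids)
```

For example, with `topk_ids = [[0, 2], [1, 2]]`, four experts, and `block_size = 4`, each of the three occupied experts gets one padded block, giving 12 slots in total.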

@cxmt-ai-tc (Author)
The DeepSeek V2.5 W4A16 checkpoint was generated with llm-compressor, in GPTQ format.
