
[BUG]: ColossalAI Inference example returns an empty result without error #6112

GuangyaoZhang opened this issue Nov 4, 2024 · 2 comments
Labels: bug (Something isn't working)

GuangyaoZhang commented Nov 4, 2024

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

Git commit: 2f583c1 (current master branch)

Code (the example code from the ColossalAI inference README):

import torch
import transformers
import colossalai
from colossalai.inference import InferenceEngine, InferenceConfig
from pprint import pprint

colossalai.launch_from_torch()


model_path = "lmsys/vicuna-7b-v1.3"
model = transformers.LlamaForCausalLM.from_pretrained(model_path).cuda()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)


inference_config = InferenceConfig(
    dtype=torch.float16,
    max_batch_size=4,
    max_input_len=1024,
    max_output_len=512,
    use_cuda_kernel=True,
)


engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)

prompts = ['Who is the best player in the history of NBA?']
response = engine.generate(prompts)
pprint(response)
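The README snippet above passes no sampling or length settings to generate. As a variation worth trying when reproducing (this assumes the engine's generate accepts prompts and generation_config keywords at this commit, which I have not verified here), an explicit Hugging Face GenerationConfig rules out an unintended zero-length generation:

# Hypothetical variation, not from the original report; continues the snippet above.
# Assumes engine.generate(...) accepts a `generation_config` keyword at this commit.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
    do_sample=False,
)
response = engine.generate(prompts=prompts, generation_config=generation_config)
pprint(response)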

Run command:

colossalai run --nproc_per_node 1 speed.py
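For reference (not part of the original report), colossalai run launches the script with torchrun-style environment variables (RANK, WORLD_SIZE, MASTER_ADDR, ...), which is what colossalai.launch_from_torch() reads, so an equivalent single-node, single-GPU launch should be:

torchrun --standalone --nproc_per_node 1 speed.py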

Output:


/data/miniconda/envs/torch/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/data/coding/ColossalAI/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
[11/04/24 11:04:32] INFO     colossalai - colossalai - INFO: /data/coding/ColossalAI/colossalai/initialize.py:75 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
/data/miniconda/envs/torch/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.83s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
/data/miniconda/envs/torch/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[extension] Time taken to load inference_ops_cuda op: 0.16129255294799805 seconds
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.001485586166381836 seconds
[11/04/24 11:05:06] WARNING  colossalai - colossalai.inference.utils - WARNING: /data/coding/ColossalAI/colossalai/inference/utils.py:162 can_use_flash_attn2
                    WARNING  colossalai - colossalai.inference.utils - WARNING: flash_attn2 has not been installed yet, we will use triton flash attn instead.
[11/04/24 11:05:06] INFO     colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:158 init_model
                    INFO     colossalai - colossalai.inference.core.llm_engine - INFO: the device is cuda:0
                    INFO     colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:163 init_model
                    INFO     colossalai - colossalai.inference.core.llm_engine - INFO: Before the shard, Rank: [0], model size: 12.551277160644531 GB, model's device is: cuda:0
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.0019431114196777344 seconds
... (the two [extension] lines above repeat 63 more times while the model is sharded, each load taking under 2 ms) ...
[11/04/24 11:05:08] INFO     colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:193 init_model
                    INFO     colossalai - colossalai.inference.core.llm_engine - INFO: After the shard, Rank: [0], model size: 12.551277160644531 GB, model's device is: cuda:0
                    INFO     colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:208 init_model
                    INFO     colossalai - colossalai.inference.core.llm_engine - INFO: Rank [0], Model Weight Max Occupy 2.33984375 GB, Model size: 12.551277160644531 GB
[11/04/24 11:05:08] INFO     colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: /data/coding/ColossalAI/colossalai/inference/kv_cache/kvcache_manager.py:98 __init__
                    INFO     colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: Allocating K cache with shape: (384, 32, 16, 16, 8), V cache with shape: (384, 32, 16, 128) consisting of 384 blocks.
                    INFO     colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: /data/coding/ColossalAI/colossalai/inference/kv_cache/kvcache_manager.py:115 __init__
                    INFO     colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: Allocated 3.00 GB of KV cache on device cuda:0.
[]

====== Training on All Nodes =====
127.0.0.1: success

====== Stopping All Nodes =====
127.0.0.1: finish
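As a side note on the numbers in the log above, the reported 3.00 GB of KV cache is consistent with the printed shapes, assuming they are per transformer layer (32 layers for vicuna-7b) and elements are fp16. A quick back-of-the-envelope check (hypothetical, not from the report):

# Hypothetical size check: reproduce the "Allocated 3.00 GB of KV cache" figure,
# assuming the logged shapes are per layer and elements are torch.float16 (2 bytes).
from math import prod

num_layers = 32                    # vicuna-7b / LLaMA-7B
k_shape = (384, 32, 16, 16, 8)     # (blocks, kv_heads, block_size, head_dim // 8, 8)
v_shape = (384, 32, 16, 128)       # (blocks, kv_heads, block_size, head_dim)
bytes_per_elem = 2

total_bytes = num_layers * bytes_per_elem * (prod(k_shape) + prod(v_shape))
print(total_bytes / 1024**3)       # -> 3.0, matching the log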

Environment

pytorch=2.3.1
python=3.10
GPU (nvidia-smi): V100 32 GB, CUDA 12.4
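
For comparison, a minimal sanity check with plain transformers (hypothetical, not part of the original run) confirms whether the checkpoint itself generates text, which would localize the empty list to the InferenceEngine path rather than the model:

# Hypothetical baseline using only the transformers API, independent of ColossalAI.
import torch
import transformers

model_path = "lmsys/vicuna-7b-v1.3"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).cuda()

inputs = tokenizer("Who is the best player in the history of NBA?", return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))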

GuangyaoZhang added the bug (Something isn't working) label on Nov 4, 2024
GuangyaoZhang (Contributor, Author) commented:

@ver217
Issues-translate-bot commented:

Bot detected the issue body's language is not English, translated it automatically.

@ver217
