🐛 Describe the bug

Git commit: 2f583c1 (current master branch)
Code (the example from the ColossalAI inference README):
import torch
import transformers
from pprint import pprint

import colossalai
from colossalai.inference import InferenceEngine, InferenceConfig

# Initialize the distributed environment (expects torchrun-style env vars).
colossalai.launch_from_torch()

# Load the model and tokenizer from Hugging Face.
model_path = "lmsys/vicuna-7b-v1.3"
model = transformers.LlamaForCausalLM.from_pretrained(model_path).cuda()
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)

# Inference settings, exactly as in the README example.
inference_config = InferenceConfig(
    dtype=torch.float16,
    max_batch_size=4,
    max_input_len=1024,
    max_output_len=512,
    use_cuda_kernel=True,
)

engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)

prompts = ["Who is the best player in the history of NBA?"]
response = engine.generate(prompts)
pprint(response)  # prints [] instead of the generated answer (see Output below)
Run command:
colossalai run --nproc_per_node 1 speed.py
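(Side note, not in the original report: colossalai.launch_from_torch() reads the standard torchrun environment variables, so "torchrun --nproc_per_node 1 speed.py" should be an equivalent single-process launch.)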
Output:
/data/miniconda/envs/torch/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/data/coding/ColossalAI/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused RMSNorm kernel")
[11/04/24 11:04:32] INFO colossalai - colossalai - INFO: /data/coding/ColossalAI/colossalai/initialize.py:75 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
/data/miniconda/envs/torch/lib/python3.10/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00, 8.83s/it]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
/data/miniconda/envs/torch/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[extension] Time taken to load inference_ops_cuda op: 0.16129255294799805 seconds
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.001485586166381836 seconds
[11/04/24 11:05:06] WARNING colossalai - colossalai.inference.utils - WARNING: /data/coding/ColossalAI/colossalai/inference/utils.py:162 can_use_flash_attn2
WARNING colossalai - colossalai.inference.utils - WARNING: flash_attn2 has not been installed yet, we will use triton flash attn instead.
[11/04/24 11:05:06] INFO colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:158 init_model
INFO colossalai - colossalai.inference.core.llm_engine - INFO: the device is cuda:0
INFO colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:163 init_model
INFO colossalai - colossalai.inference.core.llm_engine - INFO: Before the shard, Rank: [0], model size: 12.551277160644531 GB, model's device is: cuda:0
[extension] Loading the JIT-built inference_ops_cuda kernel during runtime now
[extension] Time taken to load inference_ops_cuda op: 0.0019431114196777344 seconds
(... the two lines above repeat 64 times in total, each load taking under 2 ms; repetitions elided ...)
[11/04/24 11:05:08] INFO colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:193 init_model
INFO colossalai - colossalai.inference.core.llm_engine - INFO: After the shard, Rank: [0], model size: 12.551277160644531 GB, model's device is: cuda:0
INFO colossalai - colossalai.inference.core.llm_engine - INFO: /data/coding/ColossalAI/colossalai/inference/core/llm_engine.py:208 init_model
INFO colossalai - colossalai.inference.core.llm_engine - INFO: Rank [0], Model Weight Max Occupy 2.33984375 GB, Model size: 12.551277160644531 GB
[11/04/24 11:05:08] INFO colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: /data/coding/ColossalAI/colossalai/inference/kv_cache/kvcache_manager.py:98 __init__
INFO colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: Allocating K cache with shape: (384, 32, 16, 16, 8), V cache with shape: (384, 32, 16, 128) consisting of 384 blocks.
INFO colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: /data/coding/ColossalAI/colossalai/inference/kv_cache/kvcache_manager.py:115 __init__
INFO colossalai - colossalai.inference.kv_cache.kvcache_manager - INFO: Allocated 3.00 GB of KV cache on device cuda:0.
[]
(this is the pprint(response) output: engine.generate returned an empty list instead of any generated text)
====== Training on All Nodes =====
127.0.0.1: success
====== Stopping All Nodes =====
127.0.0.1: finish
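For what it's worth (my own back-of-the-envelope check, not part of the run): the logged cache shapes are consistent with the reported 3.00 GB allocation, assuming the per-layer V-cache layout is (num_blocks, num_kv_heads, block_size, head_dim) and Llama-7B's 32 layers. So initialization appears to complete normally, and the empty result happens at generation time.

# Hypothetical sanity check of the logged KV-cache allocation (fp16 = 2 bytes).
# Assumed per-layer layout: (num_blocks, num_kv_heads, block_size, head_dim).
num_blocks, num_kv_heads, block_size, head_dim = 384, 32, 16, 128
num_layers, bytes_per_elem = 32, 2  # Llama-7B depth, float16
per_layer = num_blocks * num_kv_heads * block_size * head_dim * bytes_per_elem
print(per_layer * 2 * num_layers / 1024**3)  # x2 for K and V -> 3.0 GB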
Environment
pytorch=2.3.1
python=3.10
GPU (nvidia-smi): V100 32 GB, CUDA 12.4
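A minimal cross-check that might help triage (my sketch, not from the original run): bypass the ColossalAI engine and generate with plain Hugging Face APIs on the same checkpoint. If this prints a normal answer, the empty list is specific to InferenceEngine.generate rather than to the weights or tokenizer. (The triton fallback itself should be harmless here; as far as I know, FlashAttention-2 does not support Volta GPUs such as the V100 anyway.)

import torch
import transformers

model_path = "lmsys/vicuna-7b-v1.3"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).cuda()

# Same prompt as the failing script, generated without ColossalAI.
inputs = tokenizer(
    "Who is the best player in the history of NBA?", return_tensors="pt"
).to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))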