Inference Problem #42

Open
LKAMING97 opened this issue Sep 11, 2024 · 13 comments

Comments

@LKAMING97

Why does it take so long to run inference for just two pictures?


@LKAMING97
Author

python interleaved_generation.py -i 'Please introduce the city of Gyumri with pictures.' -s "./test/" 

@LKAMING97
Author

It was running for ages so I stopped

Instruction: draw a dog
Batch size: 2
VQModel loaded from data/tokenizer/vqgan.ckpt
^CTraceback (most recent call last):
  File "/root/autodl-tmp/anole/text2image.py", line 71, in <module>
    main(args)
  File "/root/autodl-tmp/anole/text2image.py", line 46, in main
    image_tokens: torch.LongTensor = model.generate(
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 665, in generate
    tokens = [t.id for t in self.stream(*args, **kwargs)]
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 665, in <listcomp>
    tokens = [t.id for t in self.stream(*args, **kwargs)]
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 649, in stream
    while key_token := self.dctx.res_q.get():
  File "/root/miniconda3/lib/python3.10/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
^CException ignored in atexit callback: <function _exit_function at 0x7f7291b91b40>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/multiprocessing/util.py", line 334, in _exit_function
    _run_finalizers(0)
  File "/root/miniconda3/lib/python3.10/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/root/miniconda3/lib/python3.10/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.10/multiprocessing/managers.py", line 674, in _finalize_manager
    process.join(timeout=1.0)
  File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/root/miniconda3/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt: 
^C
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:98:00.0 Off |                  N/A |
|  0%   25C    P8             22W /  370W |     564MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:B1:00.0 Off |                  N/A |
|  0%   26C    P8             15W /  370W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
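The table above shows both GPUs idle once the run was stopped. As a minimal sketch, free memory per GPU can also be read programmatically before launching a run. The helper names below (parse_gpu_memory, query_free_memory) are hypothetical and not part of the Anole repository; the nvidia-smi query flags are the standard CSV query options.

```python
import subprocess

# Standard nvidia-smi CSV query: one line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Map each GPU index to its free memory in MiB (total minus used)."""
    free = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        free[index] = total - used
    return free


def query_free_memory():
    """Run nvidia-smi and return {gpu_index: free_mib}. Needs an NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


# Same used/total values as in the table above, in the CSV layout.
sample = "0, 564, 24576\n1, 4, 24576\n"
free_by_gpu = parse_gpu_memory(sample)
```

Here free_by_gpu maps GPU 0 to 24012 MiB and GPU 1 to 24572 MiB, so nothing else was holding memory on either card at that moment.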

@LKAMING97
Author

I can't generate anything in your example. What did I do wrong?

@EthanC111
Collaborator

Thank you for your interest! Inference on Anole-7b requires at least 20GB of memory, so this might be a memory issue. Would you mind trying another GPU with more memory? Thanks!
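As a rough sanity check on the 20GB figure, here is back-of-the-envelope arithmetic for the memory taken by the weights alone. The parameter count (about 7 billion, inferred from the "7b" in the model name) and the bytes-per-parameter values are assumptions, not numbers taken from the Anole repository. Activations, caches, and the image tokenizer come on top of this.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed to hold the weights only, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 7e9  # assumption: "7b" means roughly 7 billion parameters

fp32_gib = weight_memory_gib(N_PARAMS, 4)  # 32-bit floats
fp16_gib = weight_memory_gib(N_PARAMS, 2)  # 16-bit floats (fp16 or bf16)
int8_gib = weight_memory_gib(N_PARAMS, 1)  # 8-bit quantized weights

for name, gib in [("fp32", fp32_gib), ("fp16/bf16", fp16_gib), ("int8", int8_gib)]:
    print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions, 16-bit weights already take about 13 GiB, which leaves limited headroom on a 24 GiB card once everything else is loaded.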

@LKAMING97
Author

LKAMING97 commented Sep 13, 2024 via email

@EthanC111
Collaborator

Hi, quantization might be helpful: https://github.com/GAIR-NLP/anole/pull/21/files
If not, please let us know! Thanks.

@LKAMING97
Author

I reinstalled the environment and ran it according to the steps, but this still happens.

Traceback (most recent call last):
  File "interleaved_generation.py", line 5, in <module>
    from chameleon.inference.chameleon import ChameleonInferenceModel, Options
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 32, in <module>
    from chameleon.inference import loader
  File "/root/autodl-tmp/anole/chameleon/inference/loader.py", line 13, in <module>
    from chameleon.inference.transformer import ModelArgs, Transformer
  File "/root/autodl-tmp/anole/chameleon/inference/transformer.py", line 19, in <module>
    class ModelArgs:
  File "/root/autodl-tmp/anole/chameleon/inference/transformer.py", line 24, in ModelArgs
    n_kv_heads: int | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

@LKAMING97
Author

How can I fix this?


@LKAMING97
Author

I have been installing according to your steps, but I keep having problems, which makes me very frustrated.

@Chaoran-F

Use Python 3.10, or change those annotations to the form "rank: Union[int, None] = None". I recommend using Python 3.10, because I found a lot of places that would need to change.
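The TypeError above is what Python versions before 3.10 raise when they evaluate an annotation such as int | None, because the X | Y union syntax for types was added in Python 3.10. A minimal sketch of the older spelling follows, using a hypothetical cut-down ModelArgs. The field set, the default values, and the use of a dataclass are assumptions for illustration, not the repository's actual class.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ModelArgs:
    # Hypothetical subset of fields, for illustration only.
    n_heads: int = 32
    # Optional[int] means the same as "int | None" and works before Python 3.10.
    n_kv_heads: Optional[int] = None
    # Union[int, None] is the equivalent long form suggested in the comment above.
    rank: Union[int, None] = None


default_args = ModelArgs()
custom_args = ModelArgs(n_kv_heads=8)
```

Every annotation written with | in the code base would need the same change, which is why switching the interpreter to Python 3.10 is the less invasive route.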

@Lulahei

Lulahei commented Sep 25, 2024

Hi, quantization might be helpful: https://github.com/GAIR-NLP/anole/pull/21/files If not, please let us know! Thanks.

After I use the quantization function, the program still raises OutOfMemoryError:

Instruction: draw a dog
Batch size: 10
VQModel loaded from /data/mjl/model_zoo/Anole-7b-v0.1/tokenizer/vqgan.ckpt
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/mjl/anole-main/chameleon/inference/chameleon.py", line 495, in _worker_impl
    model = loader.load_model(model, rank=rank)
  File "/data/mjl/anole-main/chameleon/inference/loader.py", line 61, in load_model
    return _convert(
  File "/data/mjl/anole-main/chameleon/inference/loader.py", line 23, in _convert
    torch.load(str(consolidated_path), map_location='cuda'),
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1392, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1366, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1296, in restore_location
    return default_restore_location(storage, map_location)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
    result = fn(storage, location)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 279, in _cuda_deserialize
    return obj.cuda(device)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/_utils.py", line 114, in _cuda
    untyped_storage = torch.UntypedStorage(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 18.75 MiB is free. Process 46513 has 558.00 MiB memory in use. Including non-PyTorch memory, this process has 23.12 GiB memory in use. Of the allocated memory 22.83 GiB is allocated by PyTorch, and 1.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
^CTraceback (most recent call last):
  File "/data/mjl/anole-main/text2image.py", line 83, in <module>
    main(args)
  File "/data/mjl/anole-main/text2image.py", line 29, in main
    unquantized_model = ChameleonInferenceModel(
  File "/data/mjl/anole-main/chameleon/inference/chameleon.py", line 569, in __init__
    self.dctx.ready_barrier.wait()
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/threading.py", line 668, in wait
    self._wait(timeout)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/threading.py", line 703, in _wait
    if not self._cond.wait_for(lambda : self._state != 0, timeout):
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/synchronize.py", line 313, in wait_for
    self.wait(waittime)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt

My device is also an RTX 3090, and I don't know how to solve this problem. If you can help me solve it, I would be extremely grateful.

@Lulahei

Lulahei commented Sep 25, 2024

Also, the free memory is 18.75 MiB before quantization and 18.75 MiB after quantization too. Is the function not working?

@XiaoShuhong

Same device as yours. The OutOfMemoryError is raised during model initialization (unquantized_model = ChameleonInferenceModel()), before the quantization step runs, which is why quantization has not taken effect.
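In the traceback above, the worker process (SpawnProcess-2) exits with OutOfMemoryError while the main process keeps blocking in ready_barrier.wait() until it is interrupted, so the run looks like a hang rather than a failure. The first traceback in this thread shows a similar blocking call in res_q.get(). A minimal sketch of one way to surface such a failure sooner is to poll the result queue with a timeout and check whether the worker is still alive. get_result and WorkerDied are hypothetical names for illustration and are not part of the Anole or Chameleon code.

```python
import queue


class WorkerDied(RuntimeError):
    """Raised when the worker exits without producing a result."""


def get_result(res_q, worker, poll_seconds=1.0):
    """Wait for one item from res_q, but give up if the worker has exited.

    res_q needs get(timeout=...) and get_nowait(); queue.Queue and
    multiprocessing.Queue both provide them. worker needs is_alive();
    threading.Thread and multiprocessing.Process both provide it.
    """
    while True:
        try:
            return res_q.get(timeout=poll_seconds)
        except queue.Empty:
            if worker.is_alive():
                continue  # still working, keep waiting
            try:
                # The worker may have queued a result just before exiting.
                return res_q.get_nowait()
            except queue.Empty:
                raise WorkerDied("worker exited without a result") from None
```

With a check like this, a worker that dies while loading the checkpoint would produce an error in the main process within about one polling interval instead of waiting indefinitely.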
