This repository has been archived by the owner on Mar 23, 2023. It is now read-only.

ColossalAI cannot run the shufflenet_v2_x1_0 model as torch does #139

Open
songyuc opened this issue Jun 16, 2022 · 5 comments

Comments

songyuc commented Jun 16, 2022

🐛 Describe the bug

With plain PyTorch, models.shufflenet_v2_x1_0 can be trained with BATCH_SIZE = 16384, but the same setting cannot be run successfully with ColossalAI.
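For reference, the plain-torch baseline I am comparing against looks roughly like this (a minimal sketch with illustrative shapes and hyperparameters, not my actual script):

# Minimal sketch, not the actual script: shufflenet_v2_x1_0 with torch AMP
# at BATCH_SIZE = 16384 on CIFAR-sized (32x32) inputs.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.shufflenet_v2_x1_0(num_classes=10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

img = torch.randn(16384, 3, 32, 32, device="cuda")
label = torch.randint(0, 10, (16384,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = F.cross_entropy(model(img), label)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()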
The log of the failing ColossalAI run is below:

(conda-general) user@user:~/research/Experiments/ColossalAI-Examples/image/resnet$ colossalai run --nproc_per_node 1 train.py
[06/16/22 13:30:42] INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:3 to store for rank: 0
...
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:5 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:6 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:6 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:7 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:7 with 1 nodes.
INFO colossalai - torch.distributed.distributed_c10d - INFO: Added key: store_based_barrier_key:8 to store for rank: 0
INFO colossalai - torch.distributed.distributed_c10d - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:8 with 1 nodes.
INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[06/16/22 13:30:43] INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Files already downloaded and verified
[06/16/22 13:30:44] INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:266 initialize
INFO colossalai - colossalai - INFO:
========== Your Config ========
{'BATCH_SIZE': 16384,
 'CONFIG': {'fp16': {'mode': <AMP_TYPE.TORCH: 'torch'>}},
 'NUM_EPOCHS': 200}
================================
INFO colossalai - colossalai - INFO: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:278 initialize
INFO colossalai - colossalai - INFO: cuDNN benchmark = True, deterministic = False
WARNING colossalai - colossalai - WARNING: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:304 initialize
WARNING colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING: /home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/initialize.py:436 initialize
WARNING colossalai - colossalai - WARNING: No PyTorch DDP or gradient handler is set up, please make sure you do not need to all-reduce the gradients after a training step.
 25%|██▌       | 1/4 [00:05<00:16,  5.59s/it]
Traceback (most recent call last):
  File "/home/user/research/Experiments/ColossalAI-Examples/image/resnet/train.py", line 157, in <module>
    main()
  File "/home/user/research/Experiments/ColossalAI-Examples/image/resnet/train.py", line 103, in main
    output = engine(img)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
    return self.model(*args, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 156, in forward
    return self._forward_impl(x)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 147, in _forward_impl
    x = self.stage2(x)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torchvision/models/shufflenetv2.py", line 85, in forward
    out = torch.cat((x1, self.branch2(x2)), dim=1)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 10.76 GiB total capacity; 9.54 GiB already allocated; 9.00 MiB free; 9.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2549731) of binary: /home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/python
Fatal Python error: Segmentation fault

Thread 0x00007ff209a3e700 (most recent call first):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 324 in wait
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 600 in wait
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/utils.py", line 254 in _run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 946 in run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/threading.py", line 966 in _bootstrap

Current thread 0x00007ff2e1d5a740 (most recent call first):
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/home/user/software/python/anaconda/anaconda3/envs/conda-general/bin/torchrun", line 33 in <module>

Extension modules: torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, mkl._mklinit, mkl._py_mkl_service, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg.lapack_lite, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 22)
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train.py on 127.0.0.1

Environment

CUDA: 11.4
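
As a side note, the OOM message itself suggests tuning the caching allocator; I have not confirmed whether this helps here (the 128 MiB value below is just a starting point, not a verified fix):

# Illustrative workaround taken from the error message's own hint:
# cap the allocator's max split size to reduce fragmentation, then rerun.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
colossalai run --nproc_per_node 1 train.py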

BoxiangW commented Jun 21, 2022

Hi, could you provide your training code for us to reproduce this bug? Also, could you double-check your dataset settings?

BoxiangW commented Jun 21, 2022

I have tried our code with a simple change of model from resnet to shufflenet. It takes about 32521 MiB with BATCH_SIZE = 16384, and no OOM occurred.
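
The change was essentially one line in the example's train.py (a sketch; the surrounding names follow the resnet example and may differ in the current code):

# In ColossalAI-Examples/image/resnet/train.py, swap the model construction;
# the rest of the example stays unchanged.
from torchvision import models

# model = resnet18(num_classes=10)                   # original example model
model = models.shufflenet_v2_x1_0(num_classes=10)    # shufflenet swap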

songyuc commented Jun 22, 2022

Hi @BoxiangW, here is the script I used as train.py

BoxiangW commented Jun 23, 2022

Hi @songyuc, you can uninstall your current colossalai and install our latest version with

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# install dependency
pip install -r requirements/requirements.txt

# install colossalai
pip install .

There was a bug in the previous release that took up extra GPU memory. With our latest version, BATCH_SIZE = 16384 only takes about 10605 MiB. Hope this solves your issue.
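
For completeness, the uninstall step mentioned above, plus a quick check that the new build is the one being imported (plain pip commands):

# remove the old release before installing from source
pip uninstall -y colossalai

# after `pip install .`, confirm the version and install location
pip show colossalai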

songyuc commented Jun 23, 2022

Thank you for the guide! I will try it later.
