🐛 Describe the bug

Description:
We are conducting pretraining on our own data, launched with torchrun. The pretraining works as expected on a single node with multiple GPUs. However, when scaling to multiple nodes, we encounter the following error:
Traceback (most recent call last):
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 263, in <module>
    main(cfg)
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 109, in main
    train_loader = build_train_dataloader(cfg)
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/__init__.py", line 99, in build_train_dataloader
    IterableDataset(
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 70, in __init__
    self._build_and_save_global_indices()
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 79, in _build_and_save_global_indices
    global_indices_mmap = np.memmap(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/numpy/core/memmap.py", line 230, in __new__
    with f_ctx as fid:
OSError: [Errno 5] Input/output error
Additional Details:
The error seems unrelated to GPU settings or data preparation.
We suspect the issue is linked to the storage backend, since the checkpoint directory is saved to a mounted Azure Blob Storage container (a minimal check is sketched below).
The error occurs specifically during multi-node execution, which suggests a problem with I/O handling in the distributed environment.
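As a rough check of the storage suspicion (not part of the original report), the following sketch tries to create and flush an `np.memmap` file on the mounted blob path, mirroring the kind of call made in `iterable_dataset.py`. The mount path is a placeholder assumption and should be replaced with the actual save/checkpoint directory from the training config:

```python
import numpy as np

# Hypothetical mount location; point this at the real checkpoint directory.
mount_path = "/mnt/blob/checkpoints/global_indices_test.npy"

try:
    # Create a writable memory-mapped array backed by a file on the mount.
    arr = np.memmap(mount_path, dtype=np.uint32, mode="w+", shape=(1024,))
    arr[:] = np.arange(1024, dtype=np.uint32)
    arr.flush()
    print("memmap on the mounted path succeeded")
except OSError as e:
    # Blobfuse-style mounts often have limited mmap support, which can surface
    # as Errno 5 (Input/output error), matching the traceback above.
    print(f"memmap on the mounted path failed: {e}")
```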
Request:
Has anyone on the team encountered a similar issue during development? Any insights or suggestions on troubleshooting this error would be greatly appreciated. We suspect that it may be related to the compatibility of the storage system with multi-node distributed training.
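For reference, a hedged workaround sketch under the assumption that the blob mount is the culprit: write the global indices to node-local disk from a single process per node, and have the other ranks wait at a barrier before memory-mapping the finished file read-only. This is not the OLMo implementation; the path, shape, and dtype below are illustrative only.

```python
import os
import numpy as np
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
# Node-local scratch disk instead of the blob mount (illustrative path).
indices_path = "/scratch/global_indices.npy"
num_examples = 1_000_000  # illustrative size

if local_rank == 0:
    # Only one process per node writes the memmap file.
    indices = np.memmap(indices_path, dtype=np.uint64, mode="w+", shape=(num_examples,))
    indices[:] = np.random.permutation(num_examples).astype(np.uint64)
    indices.flush()

# Assumes torchrun has already initialized the process group; skip otherwise.
if dist.is_initialized():
    dist.barrier()

# All ranks open the finished file read-only.
global_indices = np.memmap(indices_path, dtype=np.uint64, mode="r", shape=(num_examples,))
```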
Versions
accelerate==0.34.2
-e git+https://github.com/Zehui127/OLmo-GFM.git@cd9edbb980a245aab29210d32a66e0e8b33ee4a5#egg=ai2_olmo
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
asttokens==2.4.1
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
beautifulsoup4==4.12.3
biopython==1.84
biotite==0.41.2
black==23.12.1
boltons==24.0.0
boto3==1.35.5
botocore==1.35.5
Brotli==1.1.0
build==1.2.1
cached_path==1.6.3
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cloudpathlib==0.18.1
contourpy==1.2.1
cryptography==43.0.0
cycler==0.12.1
datasets==2.21.0
decorator==5.1.1
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
einops==0.8.0
esm==3.0.2
evaluate==0.4.3
exceptiongroup==1.2.2
executing==2.0.1
face==20.1.1
filelock==3.13.4
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
ftfy==6.2.3
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.34.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.4.0
iniconfig==2.0.0
ipython==8.26.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.0.2
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
kiwisolver==1.4.5
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
matplotlib-inline==0.1.7
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgpack==1.0.8
msgpack-numpy==0.4.8
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
parso==0.8.4
pathspec==0.12.1
peft==0.12.0
petname==2.6
pexpect==4.9.0
pillow==10.4.0
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==17.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyparsing==3.1.2
pyproject_hooks==1.1.0
PySocks==1.7.1
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.11.0
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.6.2
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
seaborn==0.13.2
SecretStorage==3.3.3
sentry-sdk==2.13.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.4.1
torchmetrics==1.4.1
torchtext==0.18.0
torchvision==0.19.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.0
triton==3.0.0
trouting==0.3.3
twine==5.1.1
types-setuptools==73.0.0.20240822
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.7
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.5.0
yarl==1.9.4
zipp==3.20.0