Error Encountered During Multi-Node Pretraining with Torchrun #737

Open
Zehui127 opened this issue Oct 21, 2024 · 0 comments
Labels
type/bug An issue about a bug

🐛 Describe the bug

Description:

We are conducting pretraining using our own data with the following torchrun command:

torchrun --nnodes=$NODES --nproc_per_node=$GPUS --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT /scripts/train.py /configs/official/Olmo-7B.yaml

The pretraining works as expected on a single node with multiple GPUs. However, when scaling to multiple nodes, we encounter the following error:

Traceback (most recent call last):
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 263, in <module>
    main(cfg)
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 109, in main
    train_loader = build_train_dataloader(cfg)
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/__init__.py", line 99, in build_train_dataloader
    IterableDataset(
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 70, in __init__
    self._build_and_save_global_indices()
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 79, in _build_and_save_global_indices
    global_indices_mmap = np.memmap(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/numpy/core/memmap.py", line 230, in __new__
    with f_ctx as fid:
OSError: [Errno 5] Input/output error

Additional Details:

  • The error seems unrelated to GPU settings or data preparation: the traceback ends in np.memmap, called from _build_and_save_global_indices, so it is raised while the data loader writes its global indices file.
  • We suspect that the issue is linked to the storage backend, because we save the checkpoint directory to a mounted Azure Blob Storage container.
  • The error occurs only during multi-node execution, which points to a problem with I/O handling on that storage in the distributed setting.
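To help separate a storage problem from a training-code problem, the failing call can be exercised on its own. The following is a minimal sketch, not part of OLMo: probe_memmap is a hypothetical helper, and the dtype and array size are assumptions chosen only for illustration. It creates a writable np.memmap in a given directory, flushes it, reads it back, and reports whether that succeeded.

```python
import os

import numpy as np


def probe_memmap(directory: str, n: int = 1024) -> bool:
    """Return True if a writable np.memmap can be created and re-read in `directory`.

    Hypothetical diagnostic helper; dtype and size are illustrative assumptions.
    """
    # Include the PID so that several processes can probe the same directory.
    path = os.path.join(directory, f"memmap_probe_{os.getpid()}.bin")
    expected = np.arange(n, dtype=np.uint32)
    try:
        # mode="w+" creates (or overwrites) the file and maps it for writing.
        mm = np.memmap(path, dtype=np.uint32, mode="w+", shape=(n,))
        mm[:] = expected
        mm.flush()
        del mm  # release the mapping before reopening

        # Reopen read-only and verify the contents survived the round trip.
        back = np.memmap(path, dtype=np.uint32, mode="r", shape=(n,))
        ok = bool((back == expected).all())
        del back
        return ok
    except OSError as err:
        print(f"memmap failed in {directory!r}: {err}")
        return False
    finally:
        # Best-effort cleanup of the probe file.
        try:
            os.remove(path)
        except OSError:
            pass
```

Running probe_memmap once per node, first on a node-local directory and then on the mounted Blob Storage path, would show whether the OSError can be reproduced outside of training. If it fails only on the mounted path, the storage mount is the more likely cause.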

Request:

Has anyone on the team encountered a similar issue during development? Any insights or suggestions for troubleshooting this error would be greatly appreciated, in particular on whether this kind of mounted storage is compatible with multi-node distributed training.

Versions

accelerate==0.34.2
-e git+https://github.com/Zehui127/OLmo-GFM.git@cd9edbb980a245aab29210d32a66e0e8b33ee4a5#egg=ai2_olmo
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
asttokens==2.4.1
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
beautifulsoup4==4.12.3
biopython==1.84
biotite==0.41.2
black==23.12.1
boltons==24.0.0
boto3==1.35.5
botocore==1.35.5
Brotli==1.1.0
build==1.2.1
cached_path==1.6.3
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cloudpathlib==0.18.1
contourpy==1.2.1
cryptography==43.0.0
cycler==0.12.1
datasets==2.21.0
decorator==5.1.1
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
einops==0.8.0
esm==3.0.2
evaluate==0.4.3
exceptiongroup==1.2.2
executing==2.0.1
face==20.1.1
filelock==3.13.4
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
ftfy==6.2.3
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.34.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.4.0
iniconfig==2.0.0
ipython==8.26.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.0.2
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
kiwisolver==1.4.5
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
matplotlib-inline==0.1.7
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgpack==1.0.8
msgpack-numpy==0.4.8
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
parso==0.8.4
pathspec==0.12.1
peft==0.12.0
petname==2.6
pexpect==4.9.0
pillow==10.4.0
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==17.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyparsing==3.1.2
pyproject_hooks==1.1.0
PySocks==1.7.1
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.11.0
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.6.2
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
seaborn==0.13.2
SecretStorage==3.3.3
sentry-sdk==2.13.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.4.1
torchmetrics==1.4.1
torchtext==0.18.0
torchvision==0.19.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.0
triton==3.0.0
trouting==0.3.3
twine==5.1.1
types-setuptools==73.0.0.20240822
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.7
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.5.0
yarl==1.9.4
zipp==3.20.0

@Zehui127 Zehui127 added the type/bug An issue about a bug label Oct 21, 2024