🐛 Describe the bug

Description:
We are conducting pretraining on our own data, launched with torchrun. The pretraining works as expected on a single node with multiple GPUs. However, when scaling to multiple nodes, we encounter the following error:
Traceback (most recent call last):
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 263, in <module>
    main(cfg)
  File "/scratch/amlt_code/OLmo-GFM/scripts/train.py", line 109, in main
    train_loader = build_train_dataloader(cfg)
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/__init__.py", line 99, in build_train_dataloader
    IterableDataset(
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 70, in __init__
    self._build_and_save_global_indices()
  File "/scratch/amlt_code/OLmo-GFM/olmo/data/iterable_dataset.py", line 79, in _build_and_save_global_indices
    global_indices_mmap = np.memmap(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/numpy/core/memmap.py", line 230, in __new__
    with f_ctx as fid:
OSError: [Errno 5] Input/output error
Additional Details:
The error seems unrelated to GPU settings or data preparation.
We suspect the issue is linked to the storage backend, since the checkpoint directory is saved to a mounted Azure Blob Storage container (a minimal check is sketched below).
The error occurs specifically during multi-node execution, which suggests a problem with I/O handling in the distributed environment.
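As a rough check of the storage suspicion (not part of the original report), the following sketch tries to create and flush an `np.memmap` file on the mounted blob path, mirroring the kind of call made in `iterable_dataset.py`. The mount path is a placeholder assumption and should be replaced with the actual save/checkpoint directory from the training config:

```python
import numpy as np

# Hypothetical mount location; point this at the real checkpoint directory.
mount_path = "/mnt/blob/checkpoints/global_indices_test.npy"

try:
    # Create a writable memory-mapped array backed by a file on the mount.
    arr = np.memmap(mount_path, dtype=np.uint32, mode="w+", shape=(1024,))
    arr[:] = np.arange(1024, dtype=np.uint32)
    arr.flush()
    print("memmap on the mounted path succeeded")
except OSError as e:
    # Blobfuse-style mounts often have limited mmap support, which can surface
    # as Errno 5 (Input/output error), matching the traceback above.
    print(f"memmap on the mounted path failed: {e}")
```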
Request:
Has anyone on the team encountered a similar issue during development? Any insights or suggestions on troubleshooting this error would be greatly appreciated. We suspect that it may be related to the compatibility of the storage system with multi-node distributed training.
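For reference, a hedged workaround sketch under the assumption that the blob mount is the culprit: write the global indices to node-local disk from a single process per node, and have the other ranks wait at a barrier before memory-mapping the finished file read-only. This is not the OLMo implementation; the path, shape, and dtype below are illustrative only.

```python
import os
import numpy as np
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
# Node-local scratch disk instead of the blob mount (illustrative path).
indices_path = "/scratch/global_indices.npy"
num_examples = 1_000_000  # illustrative size

if local_rank == 0:
    # Only one process per node writes the memmap file.
    indices = np.memmap(indices_path, dtype=np.uint64, mode="w+", shape=(num_examples,))
    indices[:] = np.random.permutation(num_examples).astype(np.uint64)
    indices.flush()

# Assumes torchrun has already initialized the process group; skip otherwise.
if dist.is_initialized():
    dist.barrier()

# All ranks open the finished file read-only.
global_indices = np.memmap(indices_path, dtype=np.uint64, mode="r", shape=(num_examples,))
```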
Versions
accelerate==0.34.2
-e git+https://github.com/Zehui127/OLmo-GFM.git@cd9edbb980a245aab29210d32a66e0e8b33ee4a5#egg=ai2_olmo
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
asttokens==2.4.1
async-timeout==4.0.3
attrs==24.2.0
backports.tarfile==1.2.0
beaker-gantry==1.8.3
beaker-py==1.31.2
beautifulsoup4==4.12.3
biopython==1.84
biotite==0.41.2
black==23.12.1
boltons==24.0.0
boto3==1.35.5
botocore==1.35.5
Brotli==1.1.0
build==1.2.1
cached_path==1.6.3
cachetools==5.5.0
certifi==2024.7.4
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
click-help-colors==0.9.4
cloudpathlib==0.18.1
contourpy==1.2.1
cryptography==43.0.0
cycler==0.12.1
datasets==2.21.0
decorator==5.1.1
dill==0.3.8
docker==7.1.0
docker-pycreds==0.4.0
docutils==0.21.2
einops==0.8.0
esm==3.0.2
evaluate==0.4.3
exceptiongroup==1.2.2
executing==2.0.1
face==20.1.1
filelock==3.13.4
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
ftfy==6.2.3
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
glom==23.5.0
google-api-core==2.19.1
google-auth==2.34.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.5.0
google-resumable-media==2.7.2
googleapis-common-protos==1.63.2
huggingface-hub==0.23.5
idna==3.7
importlib_metadata==8.4.0
iniconfig==2.0.0
ipython==8.26.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.0.2
jedi==0.19.1
jeepney==0.8.0
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
keyring==25.3.0
kiwisolver==1.4.5
lightning-utilities==0.11.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.2
matplotlib-inline==0.1.7
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
msgpack==1.0.8
msgpack-numpy==0.4.8
msgspec==0.18.6
multidict==6.0.5
multiprocess==0.70.16
mypy==1.3.0
mypy-extensions==1.0.0
necessary==0.4.3
networkx==3.3
nh3==0.2.18
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
omegaconf==2.3.0
packaging==24.1
pandas==2.2.2
parso==0.8.4
pathspec==0.12.1
peft==0.12.0
petname==2.6
pexpect==4.9.0
pillow==10.4.0
pkginfo==1.10.0
platformdirs==4.2.2
pluggy==1.5.0
prompt_toolkit==3.0.47
proto-plus==1.24.0
protobuf==5.27.3
psutil==6.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==17.0.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pyparsing==3.1.2
pyproject_hooks==1.1.0
PySocks==1.7.1
pytest==8.3.2
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytz==2024.1
PyYAML==6.0.2
readme_renderer==44.0
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
requirements-parser==0.11.0
rfc3986==2.0.0
rich==13.7.1
rsa==4.9
ruff==0.6.2
s3transfer==0.10.2
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
seaborn==0.13.2
SecretStorage==3.3.3
sentry-sdk==2.13.0
setproctitle==1.3.3
six==1.16.0
smart-open==7.0.4
smashed==0.21.5
smmap==5.0.1
soupsieve==2.6
stack-data==0.6.3
sympy==1.13.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.4.1
torchmetrics==1.4.1
torchtext==0.18.0
torchvision==0.19.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.44.0
triton==3.0.0
trouting==0.3.3
twine==5.1.1
types-setuptools==73.0.0.20240822
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wandb==0.17.7
wcwidth==0.2.13
wrapt==1.16.0
xxhash==3.5.0
yarl==1.9.4
zipp==3.20.0