2.19.1 miscellaneous fixes (#929)
Co-authored-by: Kavish Gandhi <[email protected]>
aws-rxgupta and kvshbg-aws authored Jul 22, 2024
1 parent c6feb18 commit 800d00f
Showing 5 changed files with 6 additions and 7 deletions.
@@ -171,7 +171,7 @@ Currently, NeuronCache default root directory is /var/tmp which is local to the

.. code:: bash
-KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-kaena-training-2-1-e859998e-3035-5df63dab5ce63'
+KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-training-2-1-e859998e-3035-5df63dab5ce63'
This is a result of limitations of file locking on NFS; EFS and FSx exhibit similar limitations. The workaround is to set up a separate NeuronCache root directory for each worker instance, for example ``NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/bert/`hostname`"``, where the home directory is shared among worker instances, as in ParallelCluster.
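The per-worker cache setup described above can be sketched as follows. This is a minimal illustration: only the ``--cache_dir`` flag and the per-``hostname`` path come from the text above; the ``bert`` path component and the explicit ``mkdir`` step are assumptions.

```shell
# Give each worker instance its own NeuronCache root on the shared filesystem.
# $(hostname) makes the path unique per instance, sidestepping the NFS/EFS/FSx
# file-locking limitation. The "bert" subdirectory is illustrative.
cache_root="$HOME/neuron_cache/bert/$(hostname)"
mkdir -p "$cache_root"
export NEURON_CC_FLAGS="--cache_dir=$cache_root"
echo "$NEURON_CC_FLAGS"
```

Each worker then reads and writes only its own cache directory, so no two instances contend for the same lock file.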

1 change: 0 additions & 1 deletion libraries/neuronx-distributed/setup/index.rst
@@ -12,7 +12,6 @@ You can install the ``neuronx-distributed`` package using the following command:
python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
-Make sure the transformers version is set to ``4.26.0``
@@ -73,7 +73,7 @@ Download the Llama2-7B pre-trained checkpoint from HuggingFace.

.. code:: ipython3
-ssh compute1-dy-kaena-training-0-1
+ssh compute1-dy-training-0-1
source ~/aws_neuron_venv_pytorch/bin/activate
cd ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl
python3 get_model.py
@@ -74,7 +74,7 @@ If you want to pre-train Llama2 7B, run the following steps -
.. code:: ipython3
python3 -m pip install -r requirements.txt
-chmod +x tp_zero1_llama2_7b_hf_pretrain.sh
+chmod +x tp_zero1_llama2_7B_hf_pretrain.sh
To tokenize the data, we must request the tokenizer from Hugging Face and Meta by following the instructions at the following link: `HuggingFace Llama 3 8B Model <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__ .
@@ -105,10 +105,10 @@ Next let’s download and pre-process the dataset:

.. code:: ipython3
-cd ~/examples/tp_zero1_llama2_7b_hf_pretrain
+cd ~/examples/tp_zero1_llama_hf_pretrain
python3 get_dataset.py --llama-version 3 # change the version number to 2 for Llama-2 models
-`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama2_7b_hf_pretrain'. Use `repo_type` argument if needed.``
+`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama_hf_pretrain'. Use `repo_type` argument if needed.``
This could be because of a stale cache. Try deleting the cache using:

.. code:: ipython3
2 changes: 1 addition & 1 deletion neuron-runtime/nrt-troubleshoot.rst
@@ -597,7 +597,7 @@ Name resolution failure

.. code:: bash
-WARN Invalid NCCL_COMM_ID [compute1-st-kaena-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>
+WARN Invalid NCCL_COMM_ID [compute1-dy-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>
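As the warning indicates, ``NCCL_COMM_ID`` must be an ``<ipv4>:<port>`` pair, not a hostname. A minimal sketch of resolving the head node's name first; the hostname and port are illustrative (taken from the log line above), and ``getent`` assumes a glibc-based Linux host:

```shell
# NCCL_COMM_ID must be <ipv4>:<port>; resolve the head node's hostname first.
head_node=localhost            # e.g. compute1-dy-training-0-1 on a real cluster
head_ip=$(getent ahostsv4 "$head_node" | awk '{ print $1; exit }')
export NCCL_COMM_ID="${head_ip}:41211"
echo "$NCCL_COMM_ID"
```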
.. _solution-11:

