diff --git a/frameworks/torch/torch-neuronx/training-troubleshooting.rst b/frameworks/torch/torch-neuronx/training-troubleshooting.rst
index be192e97..edf6efca 100644
--- a/frameworks/torch/torch-neuronx/training-troubleshooting.rst
+++ b/frameworks/torch/torch-neuronx/training-troubleshooting.rst
@@ -171,7 +171,7 @@ Currently, NeuronCache default root directory is /var/tmp which is local to the
 
 .. code:: bash
 
-    KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-kaena-training-2-1-e859998e-3035-5df63dab5ce63'
+    KeyError: 'neff_cache2/neuron-compile-cache/USER_neuroncc-1.0.48875.0+7437fbf18/MODULE_7223055628515330524/MODULE_0_SyncTensorsGraph.14_7223055628515330524_compute1-dy-training-2-1-e859998e-3035-5df63dab5ce63'
 
 This is a result of limitations to file locking on NFS. EFS/FSx also exhibit similar limitation. The workaround is to setup separate NeuronCache root directories for each worker instance, such as ``NEURON_CC_FLAGS="--cache_dir=$HOME/neuron_cache/bert/`hostname`"``, where the home directory is shared among worker instances as in ParallelCluster.
 
diff --git a/libraries/neuronx-distributed/setup/index.rst b/libraries/neuronx-distributed/setup/index.rst
index 6c20e752..7b86bc25 100644
--- a/libraries/neuronx-distributed/setup/index.rst
+++ b/libraries/neuronx-distributed/setup/index.rst
@@ -12,7 +12,6 @@ You can install the ``neuronx-distributed`` package using the following command:
 
    python -m pip install neuronx_distributed --extra-index-url https://pip.repos.neuron.amazonaws.com
 
-Make sure the transformers version is set to ``4.26.0``
 
 
diff --git a/libraries/neuronx-distributed/tutorials/finetuning_llama2_7b_ptl.rst b/libraries/neuronx-distributed/tutorials/finetuning_llama2_7b_ptl.rst
index 35425517..1a220394 100644
--- a/libraries/neuronx-distributed/tutorials/finetuning_llama2_7b_ptl.rst
+++ b/libraries/neuronx-distributed/tutorials/finetuning_llama2_7b_ptl.rst
@@ -73,7 +73,7 @@ Download the Llama2-7B pre-trained checkpoint from HuggingFace.
 
 .. code:: ipython3
 
-    ssh compute1-dy-kaena-training-0-1
+    ssh compute1-dy-training-0-1
     source ~/aws_neuron_venv_pytorch/bin/activate
     cd ~/examples/tp_zero1_llama2_7b_hf_finetune_ptl
     python3 get_model.py
diff --git a/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.rst b/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.rst
index 44dd0210..b760de83 100644
--- a/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.rst
+++ b/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.rst
@@ -74,7 +74,7 @@ If you want to pre-train Llama2 7B, run the following steps -
 
 .. code:: ipython3
 
     python3 -m pip install -r requirements.txt
-    chmod +x tp_zero1_llama2_7b_hf_pretrain.sh
+    chmod +x tp_zero1_llama2_7B_hf_pretrain.sh
 
 To tokenize the data, we must request the tokenizer from hugging face and meta by following the instructions at the following link: `HuggingFace Llama 3 8B Model `__ .
@@ -105,10 +105,10 @@ Next let’s download and pre-process the dataset:
 .. code:: ipython3
 
-    cd ~/examples/tp_zero1_llama2_7b_hf_pretrain
+    cd ~/examples/tp_zero1_llama_hf_pretrain
     python3 get_dataset.py --llama-version 3 # change the version number to 2 for Llama-2 models
 
-`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama2_7b_hf_pretrain'. Use `repo_type` argument if needed.``
+`Note:` In case you see an error of the following form when downloading data: ``huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/ubuntu/examples/tp_zero1_llama_hf_pretrain'. Use `repo_type` argument if needed.``
 
 This could be because of a stale cache. Try deleting the cache using:
 
 .. code:: ipython3
diff --git a/neuron-runtime/nrt-troubleshoot.rst b/neuron-runtime/nrt-troubleshoot.rst
index 57ec1ed2..d8e0bb97 100644
--- a/neuron-runtime/nrt-troubleshoot.rst
+++ b/neuron-runtime/nrt-troubleshoot.rst
@@ -597,7 +597,7 @@ Name resolution failure
 
 .. code:: bash
 
-    WARN Invalid NCCL_COMM_ID [compute1-st-kaena-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>
+    WARN Invalid NCCL_COMM_ID [compute1-dy-training-0-1.pcluster-trn1-24-pdx80-2n.pcluster:41211], please use format: <ipv4>:<port> or [<ipv6>]:<port>
 
 
 .. _solution-11: