Neuron SDK Release 2.20.1
---------

Co-authored-by: Geeta Gharpure
Co-authored-by: Eddy Varela
Co-authored-by: Jeffrey Huynh
Co-authored-by: roopgran
Co-authored-by: Esha Lakhotia
Co-authored-by: Roopnath
Co-authored-by: musunita
Co-authored-by: mounchin
9 people committed Oct 26, 2024
1 parent 9c301c9 commit 3ab9d96
Showing 17 changed files with 179 additions and 99 deletions.
2 changes: 1 addition & 1 deletion conf.py
@@ -157,7 +157,7 @@

#top_banner_message="<span>&#9888;</span><a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/setup-troubleshooting.html#gpg-key-update'> Neuron repository GPG key for Ubuntu installation has expired, see instructions how to update! </a>"

top_banner_message="Neuron 2.20.0 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"
top_banner_message="Neuron 2.20.1 is released! check <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/index.html#latest-neuron-release'> What's New </a> and <a class='reference internal' style='color:white;' href='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/announcements/index.html'> Announcements </a>"

html_theme = "sphinx_book_theme"
html_theme_options = {
22 changes: 9 additions & 13 deletions containers/tutorials/k8s-default-scheduler.rst
@@ -2,32 +2,28 @@
.. _k8s-default-scheduler:

* Make sure :ref:`Neuron device plugin<k8s-neuron-device-plugin>` is running
* Download the scheduler config map :download:`k8s-neuron-scheduler-configmap.yml </src/k8/k8s-neuron-scheduler-configmap.yml>`
* Download the scheduler extension :download:`k8s-neuron-scheduler.yml </src/k8/k8s-neuron-scheduler.yml>`
* Enable the kube-scheduler option to use a ConfigMap for the scheduler policy. In your cluster.yml, update the spec section with the following:

::
.. code:: bash
spec:
kubeScheduler:
usePolicyConfigMap: true
* Launch the cluster

::
.. code:: bash
kops create -f cluster.yml
kops create secret --name neuron-test-1.k8s.local sshpublickey admin -i ~/.ssh/id_rsa.pub
kops update cluster --name neuron-test-1.k8s.local --yes
* Apply the k8s-neuron-scheduler-configmap.yml [Registers neuron-scheduler-extension with kube-scheduler]
* Install the neuron-scheduler-extension [Registers neuron-scheduler-extension with kube-scheduler]

::
.. code:: bash
kubectl apply -f k8s-neuron-scheduler-configmap.yml

* Launch the neuron-scheduler-extension

::

kubectl apply -f k8s-neuron-scheduler.yml
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
--set "scheduler.enabled=true" \
--set "scheduler.customScheduler.enabled=false" \
--set "scheduler.defaultScheduler.enabled=true" \
--set "npd.enabled=false"
33 changes: 11 additions & 22 deletions containers/tutorials/k8s-multiple-scheduler.rst
@@ -5,35 +5,28 @@ In cluster environments where there is no access to default scheduler, the neuro
use this new scheduler. Neuron scheduler extension is added to this new scheduler. EKS natively does not yet support the neuron scheduler extension and so in the EKS environment this is the only way to add the neuron scheduler extension.

* Make sure :ref:`Neuron device plugin<k8s-neuron-device-plugin>` is running
* Download the my scheduler :download:`my-scheduler.yml </src/k8/my-scheduler.yml>`
* Download the scheduler extension :download:`k8s-neuron-scheduler-eks.yml </src/k8/k8s-neuron-scheduler-eks.yml>`
* Apply the neuron-scheduler-extension
* Install the neuron-scheduler-extension

::
.. code:: bash
kubectl apply -f k8s-neuron-scheduler-eks.yml


* Apply the my-scheduler.yml

::

kubectl apply -f my-scheduler.yml
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
--set "scheduler.enabled=true" \
--set "npd.enabled=false"
* Check there are no errors in the my-scheduler pod logs and the k8s-neuron-scheduler pod is bound to a node

::
.. code:: bash
kubectl logs -n kube-system my-scheduler-79bd4cb788-hq2sq
::
.. code:: bash
I1012 15:30:21.629611 1 scheduler.go:604] "Successfully bound pod to node" pod="kube-system/k8s-neuron-scheduler-5d9d9d7988-xcpqm" node="ip-192-168-2-25.ec2.internal" evaluatedNodes=1 feasibleNodes=1
* When running new pods that need to use the neuron scheduler extension, make sure they use my-scheduler as the scheduler. A sample pod spec is below

::
.. code:: bash
apiVersion: v1
kind: Pod
@@ -57,20 +50,19 @@ use this new scheduler. Neuron scheduler extension is added to this new schedule
* Once the neuron workload pod is run, make sure the logs of the k8s neuron scheduler show successful filter/bind requests


::
.. code:: bash
kubectl logs -n kube-system k8s-neuron-scheduler-5d9d9d7988-xcpqm
::
.. code:: bash
2022/10/12 15:41:16 POD nrt-test-5038 fits in Node:ip-192-168-2-25.ec2.internal
2022/10/12 15:41:16 Filtered nodes: [ip-192-168-2-25.ec2.internal]
2022/10/12 15:41:16 Failed nodes: map[]
2022/10/12 15:41:16 Finished Processing Filter Request...
::
.. code:: bash
2022/10/12 15:41:16 Executing Bind Request!
2022/10/12 15:41:16 Determine if the pod %v is NeuronDevice podnrt-test-5038
@@ -96,6 +88,3 @@ use this new scheduler. Neuron scheduler extension is added to this new schedule
2022/10/12 15:41:16 Return aws.amazon.com/neuroncore
2022/10/12 15:41:16 Succesfully updated the DevUsageMap [true true true true true true true true true false false false false false false false] and otherDevUsageMap [true true true false] after alloc for node ip-192-168-2-25.ec2.internal
2022/10/12 15:41:16 Finished executing Bind Request...



12 changes: 5 additions & 7 deletions containers/tutorials/k8s-neuron-device-plugin.rst
@@ -6,27 +6,25 @@ Deploy Neuron Device Plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Make sure the :ref:`prerequisites<k8s-prerequisite>` are satisfied
* Download the neuron device plugin yaml file. :download:`k8s-neuron-device-plugin.yml </src/k8/k8s-neuron-device-plugin.yml>`
* Download the neuron device plugin rbac yaml file. This enables permissions for device plugin to update the node and Pod annotations. :download:`k8s-neuron-device-plugin-rbac.yml </src/k8/k8s-neuron-device-plugin-rbac.yml>`
* Apply the Neuron device plugin as a daemonset on the cluster with the following command

.. code:: bash
kubectl apply -f k8s-neuron-device-plugin-rbac.yml
kubectl apply -f k8s-neuron-device-plugin.yml
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
--set "npd.enabled=false"
* Verify that neuron device plugin is running

.. code:: bash
kubectl get ds neuron-device-plugin-daemonset --namespace kube-system
kubectl get ds neuron-device-plugin -n kube-system
Expected result (with 2 nodes in cluster):

.. code:: bash
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
neuron-device-plugin-daemonset 2 2 2 2 2 <none> 27h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
neuron-device-plugin 2 2 2 2 2 <none> 18h
* Verify that the node has allocatable neuron cores and devices with the following command

22 changes: 8 additions & 14 deletions containers/tutorials/k8s-neuron-problem-detector-and-recovery.rst
@@ -4,30 +4,24 @@ Neuron node problem detector and recovery artifact checks the health of Neuron d

* The Neuron node problem detector and recovery requires Neuron driver 2.15+, and it requires the runtime to be at SDK 2.18 or later.
* Make sure prerequisites are satisfied. This includes prerequisites for getting started with Kubernetes containers and prerequisites for the Neuron node problem detector and recovery.
* Download the Neuron node problem detector and recovery YAML file: :download:`k8s-neuron-problem-detector-and-recovery.yml </src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery.yml>`.
* Install the Neuron node problem detector and recovery as a DaemonSet on the cluster with the following command:

.. note::

This YAML pulls the container image from the upstream repository for node problem detector registry.k8s.io/node-problem-detector.

* Download the Neuron node problem detector and recovery configuration file: :download:`k8s-neuron-problem-detector-and-recovery-config.yml </src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-config.yml>`.
* Download the Neuron node problem detector and recovery RBAC YAML file. This enables permissions for the Neuron node problem detector and recovery to update the node condition: :download:`k8s-neuron-problem-detector-and-recovery-rbac.yml </src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-rbac.yml>`.
* By default, the Neuron node problem detector and recovery has monitor only mode enabled. To enable the recovery functionality, update the environment variable in the YAML file:
The installation pulls the container image from the upstream repository for node problem detector registry.k8s.io/node-problem-detector.

.. code:: bash
- name: ENABLE_RECOVERY
value: "true"
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart
Apply the Neuron node problem detector and recovery as a DaemonSet on the cluster with the following command:
* By default, the Neuron node problem detector and recovery has monitor only mode enabled. To enable the recovery functionality:

.. code:: bash
kubectl apply -f k8s-neuron-problem-detector-and-recovery-rbac.yml
kubectl apply -f k8s-neuron-problem-detector-and-recovery-config.yml
kubectl apply -f k8s-neuron-problem-detector-and-recovery.yml
helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
--set "npd.nodeRecovery.enabled=true"
Verify that the Neuron device plugin is running:
* Verify that the Neuron device plugin is running:

.. code:: bash
@@ -44,4 +38,4 @@ Verify that the Neuron device plugin is running:
node-problem-detector-vpjtk 1/1 Running 0 59s
When any unrecoverable error occurs, Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. It also reflects in NodeCondition and can be seen with kubectl describe node.
* When any unrecoverable error occurs, Neuron node problem detector and recovery publishes a metric under the CloudWatch namespace NeuronHealthCheck. It also reflects in NodeCondition and can be seen with kubectl describe node.
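The NodeCondition check mentioned above can be sketched as a small shell helper. This is a minimal illustration, not part of the Neuron tooling: the function name ``show_node_conditions`` and the ``KUBECTL`` override variable are hypothetical, and the condition type names that the Neuron node problem detector reports are not listed here, so the helper simply prints every condition on the node.

```shell
# Hypothetical helper (not part of the Neuron SDK): list a node's conditions.
# KUBECTL can be overridden, for example to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# Print "<condition type><TAB><status>" for every condition on the given node.
show_node_conditions() {
  local node="$1"
  "$KUBECTL" get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}
```

For example, ``show_node_conditions ip-192-168-2-25.ec2.internal`` prints one line per condition; ``kubectl describe node`` shows the same data with messages and timestamps.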
7 changes: 5 additions & 2 deletions containers/tutorials/k8s-neuron-scheduler.rst
@@ -25,13 +25,16 @@ could be assigned to a container given a request for 2 devices.

Devices on Trn1.32xlarge and Trn1n.32xlarge nodes are connected via a 2D torus topology. On Trn1 nodes
containers can request 1, 4, 8, or all 16 devices. In the case you request an invalid number of devices, such as 7,
your pod will not be scheduled and you will receive a warning
``Instance type trn1.32xlarge does not support requests for device: 7. Please request a different number of devices.```.
your pod will not be scheduled and you will receive a warning:

``Instance type trn1.32xlarge does not support requests for device: 7. Please request a different number of devices.``

When requesting 4 devices, your container will be allocated one of the following sets of devices if they are available.

|eks-trn1-device-set4|

When requesting 8 devices, your container will be allocated one of the following sets of devices if they are available.

|eks-trn1-device-set8|

For all instance types, requesting one or all Neuron cores or devices is valid.
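The Trn1 request rule above (1, 4, 8, or all 16 devices) can be checked before a pod spec is submitted. The following is a minimal sketch, assuming a hypothetical helper name ``is_valid_trn1_device_request``; the scheduler extension performs the real validation, and this pre-flight check does not model which device sets are actually available.

```shell
# Hypothetical pre-flight check (not part of the Neuron scheduler extension).
# Returns 0 when the requested device count is one the text lists as valid
# for Trn1.32xlarge / Trn1n.32xlarge nodes (1, 4, 8, or 16), 1 otherwise.
is_valid_trn1_device_request() {
  case "$1" in
    1|4|8|16) return 0 ;;
    *) return 1 ;;
  esac
}
```

A call such as ``is_valid_trn1_device_request 7 || echo "unsupported device count"`` flags the invalid request from the example warning before the pod is created.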
33 changes: 6 additions & 27 deletions dlami/index.rst
@@ -37,6 +37,9 @@ Multi Framework DLAMIs supported
* - Ubuntu 22.04
- Inf1, Inf2, Trn1, Trn1n
- Deep Learning AMI Neuron (Ubuntu 22.04)
* - Amazon Linux 2023
- Inf1, Inf2, Trn1, Trn1n
- Deep Learning AMI Neuron (Amazon Linux 2023)



@@ -154,23 +157,10 @@ Virtual Environments pre-installed
- torch-neuron
- /opt/aws_neuron_venv_pytorch_inf1

* - Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)
- torch-neuronx, neuronx-distributed
- /opt/aws_neuron_venv_pytorch

* - Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)
- torch-neuron
- /opt/aws_neuron_venv_pytorch_inf1

* - Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04)
- tensorflow-neuronx
- /opt/aws_neuron_venv_tensorflow

* - Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2)
- tensorflow-neuronx
- /opt/aws_neuron_venv_tensorflow


You can easily get started with the single framework DLAMI through the AWS console by following one of the corresponding setup guides. If you are looking to
use the Neuron DLAMI in your cloud automation flows, Neuron also supports :ref:`SSM parameters <ssm-parameter-neuron-dlami>` to easily retrieve the latest DLAMI id.

@@ -203,11 +193,6 @@ Base DLAMIs supported
- Inf1, Inf2, Trn1, Trn1n
- Deep Learning Base Neuron AMI (Ubuntu 20.04)

* - Amazon Linux 2
- Inf1, Inf2, Trn1, Trn1n
- Deep Learning Base Neuron AMI (Amazon Linux 2)



.. _ssm-parameter-neuron-dlami:

@@ -251,6 +236,9 @@ SSM Parameter Prefix

* - Deep Learning AMI Neuron (Ubuntu 22.04)
- /aws/service/neuron/dlami/multi-framework/ubuntu-22.04

* - Deep Learning AMI Neuron (Amazon Linux 2023)
- /aws/service/neuron/dlami/multi-framework/amazon-linux-2023

* - Deep Learning AMI Neuron PyTorch 2.1 (Ubuntu 22.04)
- /aws/service/neuron/dlami/pytorch-2.1/ubuntu-22.04
@@ -261,18 +249,9 @@ SSM Parameter Prefix
* - Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04)
- /aws/service/neuron/dlami/pytorch-1.13/ubuntu-20.04

* - Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2)
- /aws/service/neuron/dlami/pytorch-1.13/amazon-linux-2

* - Deep Learning AMI Neuron TensorFlow 2.10 (Ubuntu 20.04)
- /aws/service/neuron/dlami/tensorflow-2.10/ubuntu-20.04

* - Deep Learning AMI Neuron TensorFlow 2.10 (Amazon Linux 2)
- /aws/service/neuron/dlami/tensorflow-2.10/amazon-linux-2

* - Deep Learning Base Neuron AMI (Amazon Linux 2)
- /aws/service/neuron/dlami/base/amazon-linux-2

* - Deep Learning Base Neuron AMI (Ubuntu 22.04)
- /aws/service/neuron/dlami/base/ubuntu-22.04

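The SSM parameter prefixes listed above can be turned into a lookup of the latest DLAMI id. The sketch below is an illustration under assumptions: the function names are hypothetical, the ``/latest/image_id`` suffix is assumed rather than stated in this table, and the default region is a placeholder.

```shell
# Hypothetical helpers (not shipped with Neuron).
# Build the full SSM parameter name from one of the prefixes in the table.
# Assumption: the image id is stored under "<prefix>/latest/image_id".
neuron_dlami_param() {
  local prefix="${1%/}"   # drop one trailing slash, if present
  printf '%s/latest/image_id\n' "$prefix"
}

# Look up the AMI id with the AWS CLI (requires configured credentials).
# The second argument is the region; us-east-1 is only a placeholder default.
neuron_dlami_image_id() {
  aws ssm get-parameter \
    --region "${2:-us-east-1}" \
    --name "$(neuron_dlami_param "$1")" \
    --query "Parameter.Value" \
    --output text
}
```

For instance, ``neuron_dlami_image_id /aws/service/neuron/dlami/multi-framework/amazon-linux-2023 us-west-2`` would query the Amazon Linux 2023 multi-framework DLAMI added in this commit.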
2 changes: 1 addition & 1 deletion general/devflows/plugins/npd-ecs-flows.txt
@@ -77,7 +77,7 @@ Follow these steps to create a task definition for NPD and recovery:
},
{
"name": "recovery",
"image": "public.ecr.aws/neuron/neuron-node-recovery:1.2.0",
"image": "public.ecr.aws/neuron/neuron-node-recovery:1.3.0",
"cpu": 0,
"portMappings": [],
"essential": true,
16 changes: 16 additions & 0 deletions release-notes/containers/neuron-dlami.rst
@@ -0,0 +1,16 @@
.. _neuron-dlami-release-notes:

Neuron DLAMI Release Notes
===============================

.. contents:: Table of contents
:local:
:depth: 1


Neuron 2.20.1
-------------

Date: 10/25/2024

- Added support for Amazon Linux 2023 to Neuron Multi Framework DLAMI. Customers will have two operating system options when using the multi framework DLAMI. See :ref:`neuron-dlami-overview`.
7 changes: 7 additions & 0 deletions release-notes/containers/neuron-dlc.rst
@@ -8,6 +8,13 @@ Neuron DLC Release Notes
:depth: 1


Neuron 2.20.1
-------------

Date: 10/25/2024
- Neuron 2.20.1 DLC includes prerequisites for the `Neuronx Distributed Training framework <https://github.com/aws-neuron/neuronx-distributed-training/blob/main/docs/general/installation_guide.rst#building-apex>`_. Customers can expect to use NxDT out of the box.


Neuron 2.20.0
-------------

23 changes: 18 additions & 5 deletions release-notes/index.rst
@@ -10,6 +10,19 @@ What's New
.. _latest-neuron-release:
.. _neuron-2.20.0-whatsnew:


Neuron 2.20.1 (10/25/2024)
---------------------------

The Neuron 2.20.1 release addresses an issue with the Neuron Persistent Cache that was introduced in the 2.20 release. In 2.20, loading a previously compiled Neuron Executable File Format (NEFF) from a different path or Python environment than the one used for the initial Neuron SDK installation and NEFF compilation resulted in a cache miss. This release resolves the cache-miss problem, ensuring that NEFFs can be loaded correctly regardless of the path or Python environment used to install the Neuron SDK, as long as they were compiled using the same Neuron SDK version.

This release also addresses the excessive lock wait time issue during neuron_parallel_compile graph extraction for large cluster training. See :ref:`torch-neuronx-rn` and :ref:`libneuronxla-rn`.

Additionally, Neuron 2.20.1 introduces a new Multi Framework DLAMI for Amazon Linux 2023 (AL2023) that customers can use to easily get started with the latest Neuron SDK on the multiple frameworks that Neuron supports. See :ref:`neuron-dlami-release-notes`.

The Neuron 2.20.1 Training DLC is also updated to pre-install the necessary dependencies and support the NxD Training library out of the box. See :ref:`neuron-dlc-release-notes`.


Neuron 2.20.0 (09/16/2024)
---------------------------
.. contents:: Table of contents
@@ -386,27 +399,27 @@ Release Artifacts
Trn1 packages
^^^^^^^^^^^^^^

.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=trn1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=trn1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.1

Inf2 packages
^^^^^^^^^^^^^^

.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.1

Inf1 packages
^^^^^^^^^^^^^^

.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.1

Supported Python Versions for Inf1 packages
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. program-output:: python3 src/helperscripts/n2-helper.py --list=pyversions --instance=inf1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=pyversions --instance=inf1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.1

Supported Python Versions for Inf2/Trn1 packages
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. program-output:: python3 src/helperscripts/n2-helper.py --list=pyversions --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=pyversions --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.20.1

Supported Numpy Versions
^^^^^^^^^^^^^^^^^^^^^^^^