Neuron SDK Release 2.14.1
Release notes for Neuron SDK 2.14.1
awsjoshir authored Sep 26, 2023
1 parent 45e1209 commit a390205
Showing 12 changed files with 345 additions and 27 deletions.
@@ -59,7 +59,7 @@ Available Commands:
[--auto-cast <cast_mode>]
[--auto-cast-type <data_type>]
[--distribution-strategy <distribution_type>]
[--O <opt-level>]
[--optlevel <opt-level>], or [-O <opt-level>]
[--enable-saturate-infinity]
[--enable-fast-context-switch>]
[--enable-fast-loading-neuron-binaries]
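The synopsis above lists the compiler's CLI flags. As a rough sketch of driving the compiler from Python (the input artifact ``model.hlo`` and the ``--framework``, ``--target``, and ``--output`` arguments below are illustrative assumptions, not taken from this page):

.. code-block:: python

    # Hedged sketch: invoke neuronx-cc via subprocess; paths and target are placeholders.
    import subprocess

    subprocess.run(
        [
            "neuronx-cc", "compile", "model.hlo",   # placeholder input artifact
            "--framework", "XLA",
            "--target", "trn1",
            "--optlevel", "2",                      # or the short form: -O 2
            "--output", "model.neff",
        ],
        check=True,
    )
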
@@ -125,7 +125,7 @@ Available Commands:

- ``NEMO``: Enable the compiler to perform optimizations applicable to models that use the `NeMo <https://github.com/NVIDIA/NeMo>`_ APIs to shard parameters, gradients, and optimizer states across data-parallel workers.

- :option:`--O <opt_level>`: Specify the level of optimization the compiler should perform. Possible numeric values are {1, 2, 3}. (Default: ``2``)
- :option:`--optlevel <opt_level>`: Specify the level of optimization the compiler should perform. Possible numeric values are {1, 2, 3}. (Default: ``2``)

Valid values:

8 changes: 8 additions & 0 deletions general/benchmarks/trn1/trn1-performance.rst
@@ -19,9 +19,17 @@ Training Performance (Trn1 / Trn1n)
:file: trn1_trn1n_nlp_data.csv
:header-rows: 1

.. note::
    **TP (Tensor Parallel), PP (Pipeline Parallel), and DP (Data Parallel)**: the topology configuration refers to the degrees of 3D parallelism, i.e. how the model and data are sharded across NeuronCores.

    TP and PP are specified in the run script, and DP is calculated by dividing the **world size** (number of nodes/instances * number of NeuronCores per instance) by the product of the TP and PP degrees.
    For example, with TP = 4, PP = 4, and 32 trn1.32xlarge instances, the world size is 32 (instances) * 32 (NeuronCores per instance) = 1024, and the DP degree = 1024 (world size) / (4 (TP) * 4 (PP)) = 64.
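
    A quick sketch of this arithmetic in Python (the values mirror the example above; the variable names are illustrative only):

    .. code-block:: python

        tp_degree = 4            # tensor parallel degree (set in the run script)
        pp_degree = 4            # pipeline parallel degree (set in the run script)
        num_instances = 32       # trn1.32xlarge instances
        cores_per_instance = 32  # NeuronCores per trn1.32xlarge instance

        world_size = num_instances * cores_per_instance      # 32 * 32 = 1024
        dp_degree = world_size // (tp_degree * pp_degree)    # 1024 / 16 = 64
        print(world_size, dp_degree)                         # 1024 64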

.. note::
    Read more about strong vs. weak scaling in :ref:`neuron-training-faq`.


Inference Performance
---------------------

Empty file added inference-features-support.rst
Empty file.
18 changes: 13 additions & 5 deletions release-notes/compiler/neuronx-cc/index.rst
@@ -7,19 +7,27 @@ Neuron Compiler (``neuronx-cc``) release notes

:depth: 2

Neuron Compiler [2.10.0.35]
-----------------------------
Date: 09/26/2023

* This release addresses a compilation regression in which certain configurations of Llama and Llama-2 inference models failed to compile with the error ``IndirectLoad/Save requires contiguous indirect access per partition``.

There is still a known issue for some configurations of these models that fails with the error ``Too many instructions after unroll for function sg0000``. To mitigate this, please try the ``-O1`` compiler option (or ``--optlevel 1``). A complete fix that does not require this option will be included in a future release.
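
As an illustration of the mitigation (the environment-variable approach mirrors the ``transformers-neuronx`` known-issues note elsewhere in this commit; set the variable before compilation is triggered):

.. code-block:: python

    import os

    # Ask the Neuron compiler to use optimization level 1 for subsequent compilations
    os.environ["NEURON_CC_FLAGS"] = "-O1"   # equivalently: "--optlevel 1"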

Neuron Compiler [2.10.0.34]
-----------------------------
Date: 09/15/2023

* This release introduces a new ``--O`` compiler option. This option allows the user to balance between compile-time and optimizations performed.
Three levels are supported. Level ``--O1`` aims to minimize compile-time and allow for a more rapid model development cycle. Model execution
time may be reduced. Level ``--O3`` performs whole-model optimization. This level will deliver the best performance however there will be longer
* This release introduces a new ``--optlevel (-O)`` compiler option. This option allows the user to balance compile time against the optimizations performed.
  Three levels are supported. Level ``--optlevel 1 (-O1)`` aims to minimize compile time and allow for a more rapid model development cycle. Model execution
  time may be reduced. Level ``--optlevel 3 (-O3)`` performs whole-model optimization. This level delivers the best performance; however, compile times will be longer
  and the compiler will use more host DRAM, potentially requiring a larger instance to compile the model.
  The default is ``--optlevel 2 (-O2)``, which provides a balance between model performance and compile time.

  The previous ``--enable-experimental-O1`` flag introduced in the 02/08/2023 Neuron Compiler [2.4.0.21] release is now deprecated. Using this flag
  will generate a message similar to:
  WARNING: Option --enable-experimental-O1 is deprecated and will be removed in a future release. Use ``--O1`` instead.
  WARNING: Option --enable-experimental-O1 is deprecated and will be removed in a future release. Use ``--optlevel 1 (-O1)`` instead.
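
For PyTorch workflows, the chosen level is typically forwarded to ``neuronx-cc`` at trace time. A minimal sketch (assuming the ``torch_neuronx.trace`` API and its ``compiler_args`` parameter, neither of which appears in this diff):

.. code-block:: python

    import torch
    import torch_neuronx

    # Toy model and example input; real models are traced the same way
    model = torch.nn.Linear(4, 4).eval()
    example = torch.rand(1, 4)

    # Forward the optimization level to the Neuron compiler during tracing
    traced = torch_neuronx.trace(model, example, compiler_args=["--optlevel", "1"])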

Neuron Compiler [2.9.0.16]
-----------------------------
21 changes: 18 additions & 3 deletions release-notes/index.rst
@@ -11,6 +11,21 @@ What's New
.. _neuron-2.14.0-whatsnew:


Neuron 2.14.1 (09/26/2023)
--------------------------

This is a patch release that fixes compiler issues in certain configurations of ``Llama`` and ``Llama-2`` model inference using ``transformers-neuronx``.

.. note::

    There is still a known compiler issue for inference of some configurations of ``Llama`` and ``Llama-2`` models; it will be addressed in a future Neuron release.
    Customers are advised to use the ``--optlevel 1`` (or ``-O1``) compiler flag to mitigate this known compiler issue.

    See :ref:`neuron-compiler-cli-reference-guide` for usage of the ``--optlevel 1`` compiler flag. For more details on the compiler fix and the remaining known issues, see :ref:`neuronx-cc-rn` and :ref:`transformers-neuronx-rn`.




Neuron 2.14.0 (09/15/2023)
--------------------------

@@ -67,7 +82,7 @@ This release introduces the following:
- Trn1/Trn1n,Inf2

* - Neuron Compiler (neuronx-cc)
- * New ``--O`` compiler option that enables different optimizations with tradeoff between faster model compile time and faster model execution. See more at :ref:`neuron-compiler-cli-reference-guide`
- * New ``--optlevel`` (or ``-O``) compiler option that enables different optimizations with a tradeoff between faster model compile time and faster model execution. See more at :ref:`neuron-compiler-cli-reference-guide`
* See more at :ref:`neuronx-cc-rn`
- Inf2/Trn1/Trn1n

@@ -299,11 +314,11 @@ Release Artifacts

Trn1 packages

.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=trn1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=trn1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.1

Inf2 packages

.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.0
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.1

Inf1 packages

16 changes: 16 additions & 0 deletions release-notes/prev/content.rst
@@ -8,6 +8,22 @@ Previous Releases Artifacts (Neuron 2.x)
:depth: 1


Neuron 2.14.0 (09/15/2023)
--------------------------------------

Trn1 packages
^^^^^^^^^^^^^

.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=trn1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.0

Inf2 packages
^^^^^^^^^^^^^
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf2 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.0

Inf1 packages
^^^^^^^^^^^^^
.. program-output:: python3 src/helperscripts/n2-helper.py --list=packages --instance=inf1 --file=src/helperscripts/n2-manifest.json --neuron-version=2.14.0


Neuron 2.13.2 (09/01/2023)
--------------------------------------
4 changes: 4 additions & 0 deletions release-notes/torch/transformers-neuronx/index.rst
@@ -142,6 +142,10 @@ Resolved Issues
Known Issues and Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Some configurations of LLaMA and LLaMA-2 inference models fail compilation with the error ``IndirectLoad/Save requires contiguous indirect access per partition``. This is fixed in compiler version 2.10.0.35 (Neuron SDK 2.14.1).

- Some configurations of LLaMA and LLaMA-2 inference models fail compilation with the error ``Too many instructions after unroll for function sg0000``. To mitigate this, please try the ``-O1`` compiler option (or ``--optlevel 1``) by adding ``os.environ["NEURON_CC_FLAGS"] = "-O1"`` to your script or setting it in the environment. A complete fix that does not require this option will be included in a future release. Note: using ``-O1`` in the LLaMA-2 13B tutorial results in about a 50% increase in latency compared to Neuron SDK 2.13.2. If this is not acceptable, please use the compiler version from Neuron SDK 2.13.2.


Release [0.6.106]
----------------------
@@ -194,9 +194,14 @@
"source": [
"import t5_models \n",
"import neuronx_distributed\n",
"import time \n",
"\n",
"# This can take up to 20 minutes\n",
"encoder_compile_start_time = time.time()\n",
"traced_encoder = t5_models.parallel_trace_encoder(model_name, max_length, num_beams, tp_degree)\n",
"neuronx_distributed.trace.parallel_model_save(traced_encoder, \"TracedParallelEncoder.pt\")\n"
"print(\"Encoder compilation time {}\".format(time.time() - encoder_compile_start_time))\n",
"\n",
"neuronx_distributed.trace.parallel_model_save(traced_encoder, \"TracedParallelEncoder.pt\")"
]
},
{
@@ -205,7 +210,11 @@
"metadata": {},
"outputs": [],
"source": [
"# This can take up to 15 minutes\n",
"decoder_compile_start_time = time.time()\n",
"traced_decoder = t5_models.parallel_trace_decoder(model, model_name, num_beams, max_length, tp_degree)\n",
"print(\"Decoder compilation time {}\".format(time.time() - decoder_compile_start_time))\n",
"\n",
"neuronx_distributed.trace.parallel_model_save(traced_decoder, \"TracedParallelDecoder.pt\")"
]
},
@@ -266,6 +275,192 @@
"for i, summary in enumerate(results):\n",
" print(i + 1, summary)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmarking\n",
"\n",
"Let us benchmark the per token decoder latency"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let us install NeuronPerf. We will use it to measure the performance.\n",
"! pip install neuronperf --extra-index-url=https://pip.repos.neuron.amazonaws.com"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os \n",
"import neuronperf as npf\n",
"\n",
"d_model = model.config.d_model\n",
"model_dir = \"TracedParallelDecoder.pt\"\n",
"decoder_run_count = 128\n",
"\n",
"def load_fn(model_path, **kwargs):\n",
" return neuronx_distributed.trace.parallel_model_load(model_path)\n",
" \n",
"# NeuronPerf can't see tp_degree at the moment, so just expose all cores\n",
"def env_setup_fn(*_):\n",
" del os.environ[\"NEURON_RT_VISIBLE_CORES\"]\n",
"\n",
"def benchmark():\n",
"\n",
" # Create some sample inputs for the decoder\n",
" decoder_input_ids = torch.ones((num_beams, 1), dtype=torch.int64)\n",
" decoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int32)\n",
" encoder_attention_mask = torch.ones((num_beams, max_length), dtype=torch.int64)\n",
" encoder_hidden_states = torch.ones((num_beams, max_length, d_model), dtype=torch.float32)\n",
" beam_idx = torch.arange(0, num_beams, dtype=torch.int64)\n",
" beam_scores = torch.zeros((num_beams,), dtype=torch.float)\n",
"\n",
" inputs = (decoder_input_ids,\n",
" decoder_attention_mask,\n",
" encoder_hidden_states,\n",
" encoder_attention_mask,\n",
" beam_idx,\n",
" beam_scores)\n",
"\n",
" reports = npf.benchmark(\n",
" load_fn,\n",
" model_dir,\n",
" [inputs], \n",
" batch_sizes=1,\n",
" n_models=1,\n",
" max_infers=decoder_run_count,\n",
" workers_per_model=1, # no bottleneck on model inputs, so 1 is fine\n",
" env_setup_fn=env_setup_fn,\n",
" multiprocess=False,\n",
" )\n",
" \n",
" report = reports[0]\n",
"\n",
" # let's update throughput to be tokens / second and add a new recor\n",
" latency_in_s = report[\"latency_ms_avg\"] / 1000\n",
" tokens_per_s = decoder_run_count / latency_in_s\n",
" report[\"throughput_avg\"] = tokens_per_s\n",
" \n",
" # display and save results\n",
" npf.print_reports(reports, cols=[\"throughput_avg\", \"latency_ms_p50\", \"latency_ms_p99\"])\n",
" print(f\"Results saved to: {npf.write_json(reports[0])}\")\n",
"\n",
"benchmark()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now lets benchmark inference as a whole including sampling. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import torch\n",
"import neuronx_distributed\n",
"import neuronperf as npf\n",
"\n",
"from transformers import T5Tokenizer\n",
"from wrapper import T5Wrapper\n",
"\n",
"tokenizer = T5Tokenizer.from_pretrained(model_name)\n",
"\n",
"generated_token_count = 0\n",
"\n",
"class Wrapper(torch.nn.Module):\n",
" def __init__(self, \n",
" traced_encoder,\n",
" traced_decoder):\n",
" super().__init__()\n",
" self.model = T5Wrapper.from_pretrained(model_name)\n",
" self.model.encoder = traced_encoder\n",
" self.model.decoder = traced_decoder\n",
" setattr(self.model.encoder, 'main_input_name', 'input_ids') # Attribute required by beam search\n",
"\n",
" def forward(self, *inputs):\n",
" input_ids = inputs[0]['input_ids']\n",
" attention_mask = inputs[0]['attention_mask']\n",
" return self.model.parallel_infer(input_ids=input_ids,\n",
" attention_mask=attention_mask,\n",
" max_length=max_length,\n",
" num_beams=num_beams,\n",
" num_return_sequences=num_return_sequences)\n",
"\n",
"def load_fn(filename, **kwargs):\n",
" traced_encoder = neuronx_distributed.trace.parallel_model_load(filename + \"TracedParallelEncoder.pt\")\n",
" traced_decoder = neuronx_distributed.trace.parallel_model_load(filename + \"TracedParallelDecoder.pt\")\n",
" return Wrapper(traced_encoder, traced_decoder)\n",
"\n",
"# NeuronPerf can't see tp_degree at the moment, so just expose all cores\n",
"def env_setup_fn(*_):\n",
" del os.environ[\"NEURON_RT_VISIBLE_CORES\"]\n",
"\n",
"def preprocess_fn(inputs):\n",
" \n",
" encoding = []\n",
" for text in inputs:\n",
" batch_encoding = tokenizer(text, \n",
" max_length=max_length, \n",
" truncation=True, \n",
" padding='max_length',\n",
" return_tensors=\"pt\")\n",
" input_ids = batch_encoding['input_ids']\n",
" attention_mask = batch_encoding['attention_mask']\n",
" encoding.append({\"input_ids\": input_ids,\n",
" \"attention_mask\": attention_mask})\n",
" return encoding\n",
"\n",
"def postprocess_fn(outputs):\n",
" output = [tokenizer.decode(seq) for seq in outputs]\n",
" global generated_token_count \n",
" generated_token_count = len(outputs[0])\n",
" return output\n",
"\n",
"def benchmark():\n",
" inputs = [\"summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes.\"]\n",
" reports = npf.benchmark(\n",
" load_fn,\n",
" \"\", # Model dir\n",
" [inputs], \n",
" batch_sizes=1,\n",
" n_models=1,\n",
" max_infers=5,\n",
" max_duration=0, # sampling can take a while, so let's not timeout\n",
" workers_per_model=1, \n",
" env_setup_fn=env_setup_fn,\n",
" preprocess_fn=preprocess_fn,\n",
" postprocess_fn=postprocess_fn,\n",
" multiprocess=False,\n",
" )\n",
" \n",
" report = reports[0]\n",
"\n",
" report[\"throughput_avg\"] = round(generated_token_count / (report[\"latency_ms_avg\"] / 1000), 2)\n",
" report[\"latency_per_token_ms_p50\"] = round((report[\"latency_ms_p50\"])/generated_token_count, 2)\n",
" report[\"latency_per_token_ms_p99\"] = round((report[\"latency_ms_p99\"])/generated_token_count, 2)\n",
"\n",
" # display and save results\n",
" npf.print_reports(reports, cols=[\"throughput_avg\", \"latency_per_token_ms_p50\", \"latency_per_token_ms_p99\"])\n",
" print(f\"Results saved to: {npf.write_json(report)}\")\n",
"\n",
"benchmark()"
]
}
],
"metadata": {