[Pytorch] Nvidia-DLFramework-Inspect support #1441

pggPL · 2025-01-30T11:49:40Z

Description

Nvidia-DLFramework-Inspect will be the common debug/logging API for Nvidia frameworks. Integrating it into Transformer Engine has three aims:

allow enabling or disabling FP8 in particular GEMMs, running current scaling in selected GEMMs, etc.,
make it easy to log statistics for each tensor in every GEMM,
make it easier to test new precisions/recipes integrated with TE.

Link to the nvidia-dlframework-inspect. IMPORTANT: To run this PR, one needs to use the branch from that PR.
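To make the first two aims concrete, here is a minimal pure-Python sketch of per-GEMM control and per-tensor statistics logging. It is not the nvidia-dlframework-inspect API and not the code in this PR: `DebugConfig`, `DebugRegistry`, the layer names, and the statistic names are all hypothetical, and tensors are stood in for by plain lists of floats. Only the GEMM names (`fprop`, `dgrad`, `wgrad`) follow common Transformer Engine terminology.

```python
import re
from dataclasses import dataclass, field
from statistics import fmean

# The three GEMMs of a linear layer: forward, data gradient, weight gradient.
GEMMS = ("fprop", "dgrad", "wgrad")


@dataclass
class DebugConfig:
    """Hypothetical per-layer debug settings (not the nvdlfw_inspect schema)."""
    layer_pattern: str                            # regex matched against the full layer name
    fp8_disabled_gemms: frozenset = frozenset()   # GEMMs forced to high precision
    log_stats: tuple = ("min", "max", "mean")     # statistics to record per tensor


@dataclass
class DebugRegistry:
    """Hypothetical registry that a layer could consult before each GEMM."""
    configs: list
    log: list = field(default_factory=list)

    def _match(self, layer_name):
        return [c for c in self.configs if re.fullmatch(c.layer_pattern, layer_name)]

    def fp8_enabled(self, layer_name, gemm):
        # FP8 stays on unless some matching config disables it for this GEMM.
        return all(gemm not in c.fp8_disabled_gemms for c in self._match(layer_name))

    def log_tensor(self, layer_name, gemm, tensor_name, values):
        stat_fns = {"min": min, "max": max, "mean": fmean}
        for c in self._match(layer_name):
            for stat in c.log_stats:
                self.log.append((layer_name, gemm, tensor_name, stat, stat_fns[stat](values)))


# Disable FP8 only in the wgrad GEMM of every "fc1" layer, and log stats there.
registry = DebugRegistry([DebugConfig(r"decoder\.\d+\.fc1", frozenset({"wgrad"}))])

for gemm in GEMMS:
    print(gemm, registry.fp8_enabled("decoder.0.fc1", gemm))

registry.log_tensor("decoder.0.fc1", "fprop", "activation", [0.5, -1.0, 2.0])
for entry in registry.log:
    print(entry)
```

Keeping the decision in a registry keyed by layer name and GEMM name means the layer code only has to ask one question before each GEMM, while the choice of what to disable or log lives in configuration.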

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring
Testing

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

* Fix linter warnings in basic linear op
  Signed-off-by: Tim Moon <[email protected]>
* Fix linter warnings in grouped linear module
  Signed-off-by: Tim Moon <[email protected]>
* Disable Userbuffers support in te.Sequential
  Signed-off-by: Tim Moon <[email protected]>
---------
Signed-off-by: Tim Moon <[email protected]>

* Add path to disable cudnn norm for mxfp8
  Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Create scale_inv for block scaling already padded
  Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* fix
  Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Remove old file, fix CG test
  Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Fixes
  Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Change default value of env
  Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
---------
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

…ons (NVIDIA#1437)
* Generalized MXFP8 fused kernels w.r.t. input tensor dimensions
  Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update transformer_engine/common/common.cu
  Co-authored-by: Tim Moon <[email protected]>
  Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Removed unnecessary test scenarios
  Signed-off-by: Oleg Goncharov <[email protected]>
* Reverted the previous commit as it generated a compilation error (caused by to string conversion)
  Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update transformer_engine/common/common.cu
  Co-authored-by: Tim Moon <[email protected]>
  Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Update test_cast_mxfp8.cu
  Signed-off-by: Oleg Goncharov <[email protected]>
* Fixed the bug with partial dbias writes in trimmed chunks
  Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Generalized MXFP8 dequantize kernel
  Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>

* Skip MXFP8 dequantize tests with invalid alignment
  Signed-off-by: Tim Moon <[email protected]>
* Remove test case with unaligned rows
  Signed-off-by: Tim Moon <[email protected]>
---------
Signed-off-by: Tim Moon <[email protected]>

* Relax FP8 gated activations requirements
  Expanded MXFP8 and FP8 tests coverage
  Signed-off-by: Przemek Tredak <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fix scale_inv check in test
  Signed-off-by: Przemek Tredak <[email protected]>
* Update tests/cpp/operator/test_cast_mxfp8.cu
  Co-authored-by: Tim Moon <[email protected]>
  Signed-off-by: Przemyslaw Tredak <[email protected]>
* Changes from review
  Signed-off-by: Przemek Tredak <[email protected]>
* Lift the 2D restriction on MXFP8 scales
  Signed-off-by: Przemek Tredak <[email protected]>
* Fix the scale_inv dimension check for MXFP8
  Signed-off-by: Przemek Tredak <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Skip columnwise MXFP8 tests for 1D tensors
  Signed-off-by: Przemek Tredak <[email protected]>
* Skip 2x MXFP8 tests with 1D tensors
  Signed-off-by: Przemek Tredak <[email protected]>
* Adjusting tolerances for dbias
  Signed-off-by: Przemek Tredak <[email protected]>
* Smaller test cases
  Signed-off-by: Przemek Tredak <[email protected]>
---------
Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>

ptrendx · 2025-02-07T22:52:10Z

Please move this PR to be against main.

ptrendx and others added 15 commits January 27, 2025 13:53

TE 2.0 code drop

819a752

Signed-off-by: Przemek Tredak <[email protected]>

[PyTorch] Respect existing quantizer usages in functional linear API (N…

5904a80

…VIDIA#1440) Respect existing quantizer usages in functional linear API Signed-off-by: Tim Moon <[email protected]>

Nvidia-DLFramework-Inspect support

6a5c9ff

Update FE from 1.10-rc to 1.10 (NVIDIA#1438)

058540e

Update FE 1.10-rc to 1.10 Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

479556b

for more information, see https://pre-commit.ci Signed-off-by: Pawel Gadzinski <[email protected]>

removed unnecesssary files

ac8c225

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

af57274

for more information, see https://pre-commit.ci Signed-off-by: Pawel Gadzinski <[email protected]>

removed unnecesssary files

7bb2e32

Signed-off-by: Pawel Gadzinski <[email protected]>

fixes

3bf946a

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7f3a668

for more information, see https://pre-commit.ci Signed-off-by: Pawel Gadzinski <[email protected]>

lint fix

daa7ccc

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f940ba3

for more information, see https://pre-commit.ci Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL force-pushed the nvdlfw_inspect_support branch from 8f6dbd5 to f940ba3 on January 30, 2025 21:31

pggPL and others added 3 commits January 30, 2025 13:35

license fix

a58a5e6

Signed-off-by: Pawel Gadzinski <[email protected]>

[PyTorch] Debug NeMo distributed optimizer (NVIDIA#1444)

b5e6657

Debug errors with NeMo distributed optimizer Avoid internal quantized tensor class in params and when setting data attr. Debug view function in MXFP8Tensor. Signed-off-by: Tim Moon <[email protected]>

Rename block scaling recipe (NVIDIA#1442)

5955f7e

Rename MXFP8 recipe Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

timmoon10 requested changes Jan 31, 2025

transformer_engine/pytorch/module/linear.py (outdated)

Oleg-Goncharov and others added 10 commits January 31, 2025 16:13

Add the virtual destructor to the Quantizer class (NVIDIA#1446)

f5f2872

Add the virtual destructor to the Quantizer Signed-off-by: Przemek Tredak <[email protected]>

one test api fix

a4ffdf1

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

ed79ea3

for more information, see https://pre-commit.ci

fixes

fa0719a

Signed-off-by: Pawel Gadzinski <[email protected]>

fixes

fbf5b53

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

94c56d9

for more information, see https://pre-commit.ci

Merge remote-tracking branch 'upstream/release_v2.0' into nvdlfw_insp…

46fdc51

…ect_support

pggPL and others added 8 commits February 7, 2025 18:53

fix

f229d5a

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

eb55420

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

63d3da1

for more information, see https://pre-commit.ci

fixes all tests

fadc1ad

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

65dfd63

for more information, see https://pre-commit.ci

fixes

3047d57

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

197f806

for more information, see https://pre-commit.ci

fix

08440f9

Signed-off-by: Pawel Gadzinski <[email protected]>

Merge remote-tracking branch 'upstream/main' into nvdlfw_inspect_support

9e9359f

pggPL changed the base branch from release_v2.0 to main February 7, 2025 23:16

pggPL and others added 18 commits February 7, 2025 15:40

fixes

67cbcb5

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6be4c0f

for more information, see https://pre-commit.ci

fixes

97833ca

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a64ef56

for more information, see https://pre-commit.ci

fixes

559beec

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d7cbfd3

for more information, see https://pre-commit.ci

fixes

f183eb1

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

7c88c2d

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

1ad58d0

for more information, see https://pre-commit.ci

fix

af99722

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

1b65185

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

ab22ff5

for more information, see https://pre-commit.ci

fix

d8e48d5

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

e167cea

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

3a62d2e

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

3dbcebe

Signed-off-by: Pawel Gadzinski <[email protected]>

lint fix

b3970fb

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a89a4be

for more information, see https://pre-commit.ci

pggPL marked this pull request as ready for review February 10, 2025 11:46
