[Pytorch] Nvidia-DLFramework-Inspect support #1441

Open
wants to merge 84 commits into
base: main
Conversation

pggPL
Collaborator

@pggPL pggPL commented Jan 30, 2025

Description

Nvidia-DLFramework-Inspect will be the common debug/logging API for Nvidia frameworks. Integrating it into Transformer Engine has three aims:

  • allow enabling or disabling FP8 in particular GEMMs, running current scaling in selected GEMMs, etc.,
  • make it easy to log statistics for each tensor in every GEMM,
  • make it easier to test new precisions and recipes integrated with TE.
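
To make the first two aims concrete, here is a minimal, self-contained sketch of per-GEMM control and per-tensor statistics. It is not the nvidia-dlframework-inspect API and not the code in this PR: the config layout, the layer-name patterns, the GEMM labels (`fprop`, `dgrad`, `wgrad`) and the helper names `fp8_enabled` and `tensor_stats` are all hypothetical, chosen only to illustrate the idea.

```python
from fnmatch import fnmatch
from statistics import fmean

# Hypothetical per-layer debug configuration. The keys and layout are
# illustrative assumptions, not the nvidia-dlframework-inspect format.
DEBUG_CONFIG = {
    "decoder.*.fc1": {"disable_fp8_gemms": {"wgrad"}, "log_stats": ["min", "max", "mean"]},
    "decoder.0.*": {"disable_fp8_gemms": {"fprop", "dgrad", "wgrad"}, "log_stats": []},
}

# Statistics that a rule may request, keyed by name.
STAT_FUNCS = {"min": min, "max": max, "mean": fmean}


def matching_rules(layer_name):
    """Return every rule whose glob pattern matches the layer name."""
    return [rule for pattern, rule in DEBUG_CONFIG.items() if fnmatch(layer_name, pattern)]


def fp8_enabled(layer_name, gemm):
    """Return False if any matching rule disables FP8 for this GEMM."""
    return all(gemm not in rule["disable_fp8_gemms"] for rule in matching_rules(layer_name))


def tensor_stats(layer_name, tensor_name, values):
    """Compute the statistics requested for this layer over a flat list of values."""
    stats = {}
    for rule in matching_rules(layer_name):
        for name in rule["log_stats"]:
            stats[f"{layer_name}.{tensor_name}.{name}"] = STAT_FUNCS[name](values)
    return stats
```

In this sketch a layer would call `fp8_enabled("decoder.1.fc1", "wgrad")` before choosing the precision of a GEMM, and `tensor_stats(...)` on the tensors it wants logged. Layers that match no pattern keep FP8 enabled and log nothing.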

Link to nvidia-dlframework-inspect. IMPORTANT: To run this PR, one needs to use the branch from that PR.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring
  • Testing

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 15 commits January 27, 2025 13:53
Signed-off-by: Przemek Tredak <[email protected]>
* Fix linter warnings in basic linear op

Signed-off-by: Tim Moon <[email protected]>

* Fix linter warnings in grouped linear module

Signed-off-by: Tim Moon <[email protected]>

* Disable Userbuffers support in te.Sequential

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
* Add path to disable cudnn norm for mxfp8

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Create scale_inv for block scaling already padded

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* fix

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Remove old file, fix CG test

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Change default value of env

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
…VIDIA#1440)

Respect existing quantizer usages in functional linear API

Signed-off-by: Tim Moon <[email protected]>
Update FE 1.10-rc to 1.10

Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL force-pushed the nvdlfw_inspect_support branch from 8f6dbd5 to f940ba3 January 30, 2025 21:31
pggPL and others added 3 commits January 30, 2025 13:35
Signed-off-by: Pawel Gadzinski <[email protected]>
Debug errors with NeMo distributed optimizer

Avoid internal quantized tensor class in params and when setting data attr. Debug view function in MXFP8Tensor.

Signed-off-by: Tim Moon <[email protected]>
Rename MXFP8 recipe

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Oleg-Goncharov and others added 10 commits January 31, 2025 16:13
…ons (NVIDIA#1437)

* Generalized MXFP8 fused kernels w.r.t. input tensor dimensions

Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/common.cu

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed unnecessary test scenarios

Signed-off-by: Oleg Goncharov <[email protected]>

* Reverted the previous commit as it generated a compilation error (caused by to string conversion)

Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/common.cu

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update test_cast_mxfp8.cu

Signed-off-by: Oleg Goncharov <[email protected]>

* Fixed the bug with partial dbias writes in trimmed chunks

Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Generalized MXFP8 dequantize kernel

Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Oleg Goncharov <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Add the virtual destructor to the Quantizer

Signed-off-by: Przemek Tredak <[email protected]>
* Skip MXFP8 dequantize tests with invalid alignment

Signed-off-by: Tim Moon <[email protected]>

* Remove test case with unaligned rows

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
* Relax FP8 gated activations requirements
Expanded MXFP8 and FP8 tests coverage

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix scale_inv check in test

Signed-off-by: Przemek Tredak <[email protected]>

* Update tests/cpp/operator/test_cast_mxfp8.cu

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>

* Changes from review

Signed-off-by: Przemek Tredak <[email protected]>

* Lift the 2D restriction on MXFP8 scales

Signed-off-by: Przemek Tredak <[email protected]>

* Fix the scale_inv dimension check for MXFP8

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skip columnwise MXFP8 tests for 1D tensors

Signed-off-by: Przemek Tredak <[email protected]>

* Skip 2x MXFP8 tests with 1D tensors

Signed-off-by: Przemek Tredak <[email protected]>

* Adjusting tolerances for dbias

Signed-off-by: Przemek Tredak <[email protected]>

* Smaller test cases

Signed-off-by: Przemek Tredak <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
pggPL and others added 8 commits February 7, 2025 18:53
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@ptrendx
Member

ptrendx commented Feb 7, 2025

Please move this PR to be against main.

@pggPL pggPL changed the base branch from release_v2.0 to main February 7, 2025 23:16
pggPL and others added 18 commits February 7, 2025 15:40
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL marked this pull request as ready for review February 10, 2025 11:46
7 participants