PR #21708: NUMA-pin host memory buffers for D2H/H2D transfers #22243

Open · wants to merge 1 commit into main from olupton:numa
Conversation

copybara-service[bot]

PR #21708: NUMA-pin host memory buffers for D2H/H2D transfers

Imported from GitHub PR #21708

This ensures that the pinned host buffers used for transfers between host and device are pinned to the NUMA node closest to the device. It had a previous life as #15216.
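As a minimal illustration (not the implementation in this PR), the sketch below shows how a host staging buffer could be bound to a chosen NUMA node with libnuma and then page-locked via the CUDA runtime; the helper names `AllocatePinnedOnNode` and `FreePinned` are hypothetical and error handling is abbreviated.

```cpp
// Hedged sketch: NUMA-bound, page-locked host buffer. Assumes libnuma
// (-lnuma) and the CUDA runtime (-lcudart) are available.
#include <cstddef>

#include <cuda_runtime.h>  // cudaHostRegister / cudaHostUnregister
#include <numa.h>          // numa_available / numa_alloc_onnode / numa_free

// Allocate `bytes` of host memory bound to `numa_node` and register it with
// the CUDA driver so the pages are pinned (page-locked) for fast D2H/H2D.
void* AllocatePinnedOnNode(std::size_t bytes, int numa_node) {
  if (numa_available() < 0) return nullptr;  // no NUMA support on this system
  void* ptr = numa_alloc_onnode(bytes, numa_node);
  if (ptr == nullptr) return nullptr;
  if (cudaHostRegister(ptr, bytes, cudaHostRegisterDefault) != cudaSuccess) {
    numa_free(ptr, bytes);
    return nullptr;
  }
  return ptr;
}

// Release a buffer obtained from AllocatePinnedOnNode.
void FreePinned(void* ptr, std::size_t bytes) {
  cudaHostUnregister(ptr);
  numa_free(ptr, bytes);
}
```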

In a benchmark that triggers large, concurrent copies from all devices to the host, the achieved D2H throughput is around 33 GiB/s with NUMA pinning on a DGX H100 node (2x CPU, 8x H100). Without pinning, the same benchmark achieves around 13.5 GiB/s.

While it is already possible to achieve the correct NUMA pinning in process-per-GPU and process-per-NUMA-node configurations using `numactl` or similar, achieving correct pinning in a process-per-node configuration requires logic inside XLA.
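To make that last point concrete, here is a hedged sketch (again, not this PR's code) of the kind of in-process logic needed: looking up the NUMA node closest to a CUDA device from its PCI bus ID via Linux sysfs. The helper name `NumaNodeForDevice`, the lower-casing of the bus ID, and the sysfs path layout are assumptions about a typical Linux system.

```cpp
// Hedged sketch: map a CUDA device ordinal to its nearest NUMA node by
// reading /sys/bus/pci/devices/<bus-id>/numa_node on Linux.
#include <cctype>
#include <fstream>
#include <string>

#include <cuda_runtime.h>  // cudaDeviceGetPCIBusId

// Returns the NUMA node reported for `device`, or -1 if it cannot be
// determined (sysfs also reports -1 when no affinity is known).
int NumaNodeForDevice(int device) {
  char bus_id[32] = {};
  if (cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), device) != cudaSuccess) {
    return -1;
  }
  std::string id(bus_id);
  for (char& c : id) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  std::ifstream f("/sys/bus/pci/devices/" + id + "/numa_node");
  int node = -1;
  f >> node;
  return node;
}
```

In a process-per-node setup, the result of such a lookup would then be used to pick the NUMA node for each device's staging buffers, instead of relying on an external `numactl` binding.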
Copybara import of the project:

--
1a2d98b by Olli Lupton <[email protected]>:

NUMA-pin host memory buffers for D2H/H2D transfers

--
60a4659 by Olli Lupton <[email protected]>:

256 byte alignment for host allocations when NUMA is not enabled

--
839da45 by Olli Lupton <[email protected]>:

Address review comments

--
b61ce94 by Olli Lupton <[email protected]>:

Drop TENSORFLOW_USE_NUMA

--
793fde0 by Olli Lupton <[email protected]>:

std::string_view -> absl::string_view

Merging this change closes #21708

FUTURE_COPYBARA_INTEGRATE_REVIEW=#21708 from olupton:numa 793fde0
PiperOrigin-RevId: 722688719