PR #21708: NUMA-pin host memory buffers for D2H/H2D transfers #22243

Open · wants to merge 1 commit into main from olupton:numa
Conversation

copybara-service[bot]

PR #21708: NUMA-pin host memory buffers for D2H/H2D transfers

Imported from GitHub PR #21708

This ensures that the pinned host buffers used for transfers between host and device are pinned to the NUMA node closest to the device. It had a previous life as #15216.
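As a minimal illustration (not the implementation in this PR), the sketch below shows how a host staging buffer could be bound to a chosen NUMA node with libnuma and then page-locked via the CUDA runtime; the helper names `AllocatePinnedOnNode` and `FreePinned` are hypothetical and error handling is abbreviated.

```cpp
// Hedged sketch: NUMA-bound, page-locked host buffer. Assumes libnuma
// (-lnuma) and the CUDA runtime (-lcudart) are available.
#include <cstddef>

#include <cuda_runtime.h>  // cudaHostRegister / cudaHostUnregister
#include <numa.h>          // numa_available / numa_alloc_onnode / numa_free

// Allocate `bytes` of host memory bound to `numa_node` and register it with
// the CUDA driver so the pages are pinned (page-locked) for fast D2H/H2D.
void* AllocatePinnedOnNode(std::size_t bytes, int numa_node) {
  if (numa_available() < 0) return nullptr;  // no NUMA support on this system
  void* ptr = numa_alloc_onnode(bytes, numa_node);
  if (ptr == nullptr) return nullptr;
  if (cudaHostRegister(ptr, bytes, cudaHostRegisterDefault) != cudaSuccess) {
    numa_free(ptr, bytes);
    return nullptr;
  }
  return ptr;
}

// Release a buffer obtained from AllocatePinnedOnNode.
void FreePinned(void* ptr, std::size_t bytes) {
  cudaHostUnregister(ptr);
  numa_free(ptr, bytes);
}
```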

In a benchmark that triggers large, concurrent copies from all devices to the host, the achieved D2H throughput is around 33 GiB/s with NUMA pinning on a DGX H100 node (2x CPU, 8x H100). Without pinning, the same benchmark achieves around 13.5 GiB/s.

While it is already possible to achieve the correct NUMA pinning in process-per-GPU and process-per-NUMA-node configurations using `numactl` or similar, achieving correct pinning in a process-per-node configuration requires logic inside XLA.
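To make that last point concrete, here is a hedged sketch (again, not this PR's code) of the kind of in-process logic needed: looking up the NUMA node closest to a CUDA device from its PCI bus ID via Linux sysfs. The helper name `NumaNodeForDevice`, the lower-casing of the bus ID, and the sysfs path layout are assumptions about a typical Linux system.

```cpp
// Hedged sketch: map a CUDA device ordinal to its nearest NUMA node by
// reading /sys/bus/pci/devices/<bus-id>/numa_node on Linux.
#include <cctype>
#include <fstream>
#include <string>

#include <cuda_runtime.h>  // cudaDeviceGetPCIBusId

// Returns the NUMA node reported for `device`, or -1 if it cannot be
// determined (sysfs also reports -1 when no affinity is known).
int NumaNodeForDevice(int device) {
  char bus_id[32] = {};
  if (cudaDeviceGetPCIBusId(bus_id, sizeof(bus_id), device) != cudaSuccess) {
    return -1;
  }
  std::string id(bus_id);
  for (char& c : id) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  std::ifstream f("/sys/bus/pci/devices/" + id + "/numa_node");
  int node = -1;
  f >> node;
  return node;
}
```

In a process-per-node setup, the result of such a lookup would then be used to pick the NUMA node for each device's staging buffers, instead of relying on an external `numactl` binding.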
Copybara import of the project:

--
1a2d98b by Olli Lupton <[email protected]>:

NUMA-pin host memory buffers for D2H/H2D transfers

--
60a4659 by Olli Lupton <[email protected]>:

256 byte alignment for host allocations when NUMA is not enabled

--
839da45 by Olli Lupton <[email protected]>:

Address review comments

--
b61ce94 by Olli Lupton <[email protected]>:

Drop TENSORFLOW_USE_NUMA

--
793fde0 by Olli Lupton <[email protected]>:

std::string_view -> absl::string_view

Merging this change closes #21708

FUTURE_COPYBARA_INTEGRATE_REVIEW=#21708 from olupton:numa 793fde0
PiperOrigin-RevId: 722688719