Spark RAPIDS tuning on Nvidia HGX #11431
Replies: 4 comments 1 reply
-
Hi @melvin-koh, please note that the Spark RAPIDS plugin bundles cuDF, so I am a little confused about where cuDF 22.02.0 comes from: if you are running the Spark RAPIDS plugin 24.06.1, you are running the cuDF that is included in it (24.06.x).

That said, for the purposes of performance testing, I would run the CSV->Parquet transcode first. For an HGX (8 GPUs) we will want to run 8 executors (1 per GPU). I'd recommend running with JVMs that use 16 cores for Spark tasks and 16 GB of JVM heap. A driver with the same amount or a larger JVM may be good, but we can cross that bridge if we see high GC in the driver.

We also recommend adjusting other Spark configs. I'd call out the large maxPartitionBytes, since we want our map tasks to be fairly large, and the adaptive configs that let AQE coalesce shuffle partitions; these values are sensible defaults we have found through experimentation with NDS at SF3K (3 TB). concurrentGpuTasks=4 is a sensible default for this case and shows the best performance when we have GPU memory to spare.

The shuffle manager configs use CPU threads to encode/decode shuffle data. That would be 256 threads in aggregate for 8 executors, so 32 per executor may be a bit high. That said, you can think of it as having 2 threads per executor task thread, able to serialize/compress and deserialize/decompress shuffle data about twice as fast. In the future we may suggest UCX shuffle (over NVLink), but right now the multi-threaded approach is going to be better.
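A minimal sketch of how these suggestions might be wired into a SparkSession, assuming the 24.06.1 plugin jar is already on the classpath; all values are illustrative, and the RapidsShuffleManager class name is Spark-version specific (the spark341 package below is an assumption that must match your Spark build):

```python
from pyspark.sql import SparkSession

# Illustrative settings following the advice above: 8 executors (1 GPU each),
# 16 cores / 16 GB heap per executor, concurrentGpuTasks=4, large map tasks,
# AQE enabled so it can coalesce partitions, and the multi-threaded shuffle manager.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "16")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.0625")   # 1 / 16 executor cores
    .config("spark.rapids.sql.concurrentGpuTasks", "4")
    .config("spark.sql.files.maxPartitionBytes", "512m")  # keep map tasks fairly large
    .config("spark.sql.adaptive.enabled", "true")
    # Class name is version specific; spark341 here is an assumption.
    .config("spark.shuffle.manager",
            "com.nvidia.spark.rapids.spark341.RapidsShuffleManager")
    .config("spark.rapids.shuffle.multiThreaded.writer.threads", "32")
    .config("spark.rapids.shuffle.multiThreaded.reader.threads", "32")
    .getOrCreate()
)
```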
-
I forgot to ask a couple of things:
-
Hi @abellina, thanks for all the help, great tips! Sorry about the cuDF version: it is not 22.02.0, that was a mistake in my copy-paste. I am also aware of the UTF-8 encoding. After performing more testing, I realised that the bottleneck is my HDFS/network, and it is limiting the speed of the GPU acceleration. The Spark job is running locally on the HGX, reading from HDFS in another cluster. A question on spark.rapids.sql.concurrentGpuTasks: does a GPU task have any direct relation to an executor task? If my executor has 8 cores, meaning 8 active tasks, should I also be setting concurrentGpuTasks to 8?
-
See the FAQ entry at https://docs.nvidia.com/spark-rapids/user-guide/24.06/faq.html#how-are-spark-executor-cores-spark-task-resource-gpu-amount-and-spark-rapids-sql-concurrentgputasks-related which should provide some more details.
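To make the relationship from that FAQ entry concrete, here is a small sketch in plain Python, using the values from this thread (the exact numbers are illustrative):

```python
# How the three settings interact, per the FAQ linked above:
# spark.executor.cores and spark.task.resource.gpu.amount decide how many tasks
# run concurrently in an executor (and therefore share its single GPU), while
# spark.rapids.sql.concurrentGpuTasks caps how many of those tasks can hold the
# GPU at the same time; the rest do CPU-side work (I/O, decode) or wait.

executor_cores = 16        # spark.executor.cores
task_gpu_amount = 0.0625   # spark.task.resource.gpu.amount (1 / executor_cores)
concurrent_gpu_tasks = 4   # spark.rapids.sql.concurrentGpuTasks

tasks_sharing_gpu = int(1 / task_gpu_amount)   # 16 tasks scheduled per executor/GPU
print(f"{tasks_sharing_gpu} tasks share the GPU; "
      f"at most {concurrent_gpu_tasks} of them occupy the GPU at once.")
```

So concurrentGpuTasks does not need to equal the executor core count; it only bounds how many of the in-flight tasks are actively on the GPU.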
-
Hi experts, I am performing benchmark testing using Spark RAPIDS and need some advice on tuning properties. The hardware is an HGX with 8x H100 GPUs. Right now, I am just running a transcoding workload similar to the nds_transcode.py script from the NDS benchmark (a simplified sketch of the transcode is included below).

My aim right now is to push the utilisation of the GPUs as high as possible. My question is around spark.rapids.sql.concurrentGpuTasks. I have tried setting this to 4 and 8, but from nvidia-smi output I see that most GPUs are at most around 15% utilised. The property spark.task.resource.gpu.amount is set to 0.0625 (I am running with 16-core Spark executors). Since the H100 has so many cores, should I be setting the concurrent GPU tasks much higher? Are there any other properties I should be looking at?
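For reference, a simplified sketch of the kind of transcode I am running (the real nds_transcode.py applies the NDS table schemas and writes every table; the HDFS paths and CSV options here are placeholders):

```python
from pyspark.sql import SparkSession

# Minimal CSV -> Parquet transcode, reading raw pipe-delimited data from a
# remote HDFS cluster and writing Parquet back out. Paths are placeholders.
spark = SparkSession.builder.appName("csv-to-parquet-transcode").getOrCreate()

df = (spark.read
      .option("header", "false")
      .option("delimiter", "|")
      .csv("hdfs://remote-cluster/nds/raw/store_sales/"))        # placeholder input

(df.write
   .mode("overwrite")
   .parquet("hdfs://remote-cluster/nds/parquet/store_sales/"))   # placeholder output
```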
Software versions: