Spark RAPIDS tuning on Nvidia HGX #11431
Replies: 4 comments 1 reply
-
Hi @melvin-koh, please note that the Spark RAPIDS plugin bundles cuDF, so I am a little confused about where cuDF 22.02.0 comes from: if you are running the Spark RAPIDS plugin 24.06.1, you are running the cuDF that is included in it (24.06.x).

That said, for the purposes of performance testing, I would run the CSV->Parquet transcode first. For an HGX (8 GPUs) we will want to run 8 executors (1 per GPU). I'd recommend running with JVMs that use 16 cores for Spark tasks and 16 GB of JVM heap. A driver with the same amount or a larger JVM may be good, but we can cross that bridge if we see high GC in the driver.

We also recommend adjusting other Spark configs. I'd call out the large maxPartitionBytes, since we want our map tasks to be fairly large, and the adaptive configs that let AQE coalesce shuffle partitions; these values are sensible defaults we have found through experimentation with NDS at SF3K (3 TB). concurrentGpuTasks=4 is a sensible default for this case and shows the best performance when we have GPU memory to spare.

The shuffle manager configs use CPU threads to encode/decode shuffle data. That would be 256 threads in aggregate for 8 executors, so 32 per executor may be a bit high. That said, you can think of it as having 2 threads per executor task thread, able to serialize/compress and deserialize/decompress shuffle data about twice as fast. In the future we may suggest UCX shuffle (over NVLink), but right now the multi-threaded approach is going to be better.
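A minimal sketch of how these suggestions might be wired into a SparkSession, assuming the 24.06.1 plugin jar is already on the classpath; all values are illustrative, and the RapidsShuffleManager class name is Spark-version specific (the spark341 package below is an assumption that must match your Spark build):

```python
from pyspark.sql import SparkSession

# Illustrative settings following the advice above: 8 executors (1 GPU each),
# 16 cores / 16 GB heap per executor, concurrentGpuTasks=4, large map tasks,
# AQE enabled so it can coalesce partitions, and the multi-threaded shuffle manager.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "16")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.0625")   # 1 / 16 executor cores
    .config("spark.rapids.sql.concurrentGpuTasks", "4")
    .config("spark.sql.files.maxPartitionBytes", "512m")  # keep map tasks fairly large
    .config("spark.sql.adaptive.enabled", "true")
    # Class name is version specific; spark341 here is an assumption.
    .config("spark.shuffle.manager",
            "com.nvidia.spark.rapids.spark341.RapidsShuffleManager")
    .config("spark.rapids.shuffle.multiThreaded.writer.threads", "32")
    .config("spark.rapids.shuffle.multiThreaded.reader.threads", "32")
    .getOrCreate()
)
```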
-
I forgot to ask a couple of things:
-
Hi @abellina, thanks for all the help, great tips! Sorry about the cuDF version: it is not 22.02.0, that was a mistake in my copy-paste. I am also aware of the UTF-8 encoding. After performing more testing, I realised that the bottleneck is my HDFS/network, and it is limiting the speed of the GPU acceleration. The Spark job is running locally on the HGX, reading from HDFS in another cluster. A question on spark.rapids.sql.concurrentGpuTasks: does a GPU task have any direct relation to an executor task? If my executor has 8 cores, meaning 8 active tasks, should I also be setting concurrentGpuTasks to 8?
-
See the FAQ entry at https://docs.nvidia.com/spark-rapids/user-guide/24.06/faq.html#how-are-spark-executor-cores-spark-task-resource-gpu-amount-and-spark-rapids-sql-concurrentgputasks-related which should provide some more details.
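To make the relationship from that FAQ entry concrete, here is a small sketch in plain Python, using the values from this thread (the exact numbers are illustrative):

```python
# How the three settings interact, per the FAQ linked above:
# spark.executor.cores and spark.task.resource.gpu.amount decide how many tasks
# run concurrently in an executor (and therefore share its single GPU), while
# spark.rapids.sql.concurrentGpuTasks caps how many of those tasks can hold the
# GPU at the same time; the rest do CPU-side work (I/O, decode) or wait.

executor_cores = 16        # spark.executor.cores
task_gpu_amount = 0.0625   # spark.task.resource.gpu.amount (1 / executor_cores)
concurrent_gpu_tasks = 4   # spark.rapids.sql.concurrentGpuTasks

tasks_sharing_gpu = int(1 / task_gpu_amount)   # 16 tasks scheduled per executor/GPU
print(f"{tasks_sharing_gpu} tasks share the GPU; "
      f"at most {concurrent_gpu_tasks} of them occupy the GPU at once.")
```

So concurrentGpuTasks does not need to equal the executor core count; it only bounds how many of the in-flight tasks are actively on the GPU.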
-
Hi experts, I am performing benchmark testing using Spark RAPIDS and need some advice on tuning properties. The hardware is an HGX with 8x H100 GPUs. Right now, I am just running a transcoding workload similar to the nds_transcode.py script from the NDS benchmark (a simplified sketch of the transcode is included below).

My aim right now is to push the utilisation of the GPUs as high as possible. My question is around spark.rapids.sql.concurrentGpuTasks. I have tried setting this to 4 and 8, but from nvidia-smi output I see that most GPUs are at most around 15% utilised. The property spark.task.resource.gpu.amount is set to 0.0625 (I am running with 16-core Spark executors). Since the H100 has so many cores, should I be setting the concurrent GPU tasks much higher? Are there any other properties I should be looking at?
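For reference, a simplified sketch of the kind of transcode I am running (the real nds_transcode.py applies the NDS table schemas and writes every table; the HDFS paths and CSV options here are placeholders):

```python
from pyspark.sql import SparkSession

# Minimal CSV -> Parquet transcode, reading raw pipe-delimited data from a
# remote HDFS cluster and writing Parquet back out. Paths are placeholders.
spark = SparkSession.builder.appName("csv-to-parquet-transcode").getOrCreate()

df = (spark.read
      .option("header", "false")
      .option("delimiter", "|")
      .csv("hdfs://remote-cluster/nds/raw/store_sales/"))        # placeholder input

(df.write
   .mode("overwrite")
   .parquet("hdfs://remote-cluster/nds/parquet/store_sales/"))   # placeholder output
```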
Software versions: