Replies: 2 comments
-
Note that this feature currently mostly helps in cloud environments. I'm working on other improvements for reading from local disk or node-local HDFS-type file systems.
Answer selected by sameerz
-
Reopen if there are further questions.
-
1. spark.rapids.sql.format.parquet.multiThreadedRead.enabled
q1: What does "multiple small files within a partition" mean?
q2: Is there a size threshold to judge whether a file counts as a small file?
q3: Is this thread pool async or sync with respect to the Spark task? (I mean: do the threads in this pool start reading only when a Spark task demands a batch, or can they read files asynchronously with the Spark task's execution?)
q4: How much data will be read before the threads in this pool stop reading (a batch? a row group? or as much as possible until the buffer is full)?
2. spark.rapids.sql.format.parquet.multiThreadedRead.numThreads
q5: Are the thread pool and the number of threads in this pool per-executor or per-Spark-task?
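For context, both settings named above are ordinary Spark SQL configs, so they can be passed at submit time. A minimal sketch (the application file name is a placeholder, and the thread count is illustrative rather than a recommendation):

```shell
# Hedged sketch: enable the RAPIDS Accelerator's multi-threaded Parquet reader
# and size its reader thread pool. Verify defaults and semantics against the
# spark-rapids documentation for your plugin version.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.format.parquet.multiThreadedRead.enabled=true \
  --conf spark.rapids.sql.format.parquet.multiThreadedRead.numThreads=20 \
  your_app.py   # placeholder application, not from the original thread
```

The same keys can also be set in spark-defaults.conf or via `spark.conf.set(...)` where the config is allowed to change at runtime.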