Replies: 2 comments
-
Note that this feature currently mostly helps in cloud environments. I'm working on other improvements for reading from local disk or node-local HDFS-type file systems.
Answer selected by sameerz
-
Reopen if there are further questions.
-
1. spark.rapids.sql.format.parquet.multiThreadedRead.enabled
q1: What does "multiple small files within a partition" mean?
q2: Is there a size threshold to judge whether a file counts as a small file?
q3: Is this thread pool async or sync with respect to the Spark task? (I mean: do the threads in this pool start reading only when a Spark task demands a batch, or can they read files asynchronously with the Spark task's execution?)
q4: How much data will be read before the threads in this pool stop reading (a batch? a row group? or as much as possible until the buffer is full)?
2. spark.rapids.sql.format.parquet.multiThreadedRead.numThreads
q5: Are the thread pool and the number of threads in this pool per-executor or per-Spark-task?
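For context, both settings named above are ordinary Spark SQL configs, so they can be passed at submit time. A minimal sketch (the application file name is a placeholder, and the thread count is illustrative rather than a recommendation):

```shell
# Hedged sketch: enable the RAPIDS Accelerator's multi-threaded Parquet reader
# and size its reader thread pool. Verify defaults and semantics against the
# spark-rapids documentation for your plugin version.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.format.parquet.multiThreadedRead.enabled=true \
  --conf spark.rapids.sql.format.parquet.multiThreadedRead.numThreads=20 \
  your_app.py   # placeholder application, not from the original thread
```

The same keys can also be set in spark-defaults.conf or via `spark.conf.set(...)` where the config is allowed to change at runtime.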