Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Parquet reader list microkernel (#16538)
This PR refactors fixed-width parquet list reader decoding into its own set of micro-kernels, templatizing the existing fixed-width microkernels. When skipping rows for lists, this will skip ahead the decoding of the definition, repetition, and dictionary rle_streams as well. The list kernel uses 128 threads per block and 71 registers per thread, so I've changed the launch_bounds to enforce a minimum of 8 blocks per SM. This causes a small register spill but the benchmarks are still faster, as seen below: DEVICE_BUFFER list benchmarks (decompress + decode, not bound by IO): run_length 1, cardinality 0, no byte_limit: 24.7% faster run_length 32, cardinality 1000, no byte_limit: 18.3% faster run_length 1, cardinality 0, 500kb byte_limit: 57% faster run_length 32, cardinality 1000, 500kb byte_limit: 53% faster Compressed list of ints on hard drive: 5.5% faster Sample real data on hard drive (many columns not lists): 0.5% faster Authors: - Paul Mattione (https://github.com/pmattione-nvidia) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - https://github.com/nvdbaranec - Nghia Truong (https://github.com/ttnghia) URL: #16538
- Loading branch information