Commit
Fix `contiguous_split` performance (#13342)
This fixes a performance issue in `contiguous_split` caused by `pack_metadata` not being implemented efficiently. In particular, the output bytes are copied from the internal buffer to the output buffer byte by byte, through `std::back_inserter`:

```
std::copy(metadata_begin,
          metadata_begin + (metadata.size() * sizeof(detail::serialized_column)),
          std::back_inserter(metadata_bytes));
```

The compiler probably used to optimize this copy somehow, but recent refactors changed the code and probably prevent that optimization.

### Benchmark

Latest cudf commit:

```
----------------------------------------------------------------------------------------------------------------------------------------------------
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time 46.1 ms 46.1 ms 8 bytes_per_second=260.086G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time 48.1 ms 48.0 ms 8 bytes_per_second=257.527G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time 27.4 ms 27.4 ms 8 bytes_per_second=438.188G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time 28.5 ms 28.5 ms 8 bytes_per_second=434.381G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time 34.5 ms 34.5 ms 8 bytes_per_second=231.825G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time 37.4 ms 37.4 ms 8 bytes_per_second=220.521G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time 18.9 ms 18.9 ms 8 bytes_per_second=422.259G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time 19.4 ms 19.4 ms 8 bytes_per_second=424.595G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time 4.35 ms 4.35 ms 8 bytes_per_second=474.47G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time 4.35 ms 4.36 ms 8 bytes_per_second=473.665G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time 22.2 ms 22.2 ms 8 bytes_per_second=90.1502G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time 25.1 ms 25.1 ms 8 bytes_per_second=82.1379G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time 5.08 ms 5.08 ms 8 bytes_per_second=393.98G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time 5.28 ms 5.28 ms 8 bytes_per_second=390.85G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time 4.34 ms 4.35 ms 8 bytes_per_second=474.715G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time 4.47 ms 4.47 ms 8 bytes_per_second=461.788G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time 98.1 ms 98.0 ms 8 bytes_per_second=81.6345G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time 89.5 ms 89.5 ms 8 bytes_per_second=90.843G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time 28.9 ms 29.9 ms 8 bytes_per_second=290.261G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time 20.4 ms 20.4 ms 8 bytes_per_second=417.033G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time 6.70 ms 7.32 ms 8 bytes_per_second=335.9G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time 4.35 ms 4.36 ms 8 bytes_per_second=524.386G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time 77.8 ms 77.8 ms 8 bytes_per_second=25.7184G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time 79.2 ms 79.1 ms 8 bytes_per_second=25.6833G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time 8.57 ms 8.81 ms 8 bytes_per_second=245.062G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time 7.83 ms 6.15 ms 8 bytes_per_second=272.089G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time 6.66 ms 9.17 ms 8 bytes_per_second=450.551G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time 4.41 ms 4.41 ms 8 bytes_per_second=687.88G/s
```

With this fix:

```
----------------------------------------------------------------------------------------------------------------------------------------------------
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time 38.5 ms 38.4 ms 8 bytes_per_second=311.981G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time 42.8 ms 42.7 ms 8 bytes_per_second=289.289G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time 27.6 ms 27.5 ms 8 bytes_per_second=435.365G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time 28.4 ms 28.3 ms 8 bytes_per_second=436.145G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time 27.2 ms 27.2 ms 8 bytes_per_second=293.677G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time 29.9 ms 29.9 ms 8 bytes_per_second=276.137G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time 19.0 ms 19.0 ms 8 bytes_per_second=421.185G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time 19.1 ms 19.1 ms 8 bytes_per_second=431.306G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time 4.35 ms 4.35 ms 8 bytes_per_second=474.311G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time 4.34 ms 4.35 ms 8 bytes_per_second=475.281G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time 14.6 ms 14.6 ms 8 bytes_per_second=137.131G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time 17.2 ms 17.2 ms 8 bytes_per_second=119.946G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time 4.89 ms 4.89 ms 8 bytes_per_second=409.281G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time 5.09 ms 5.10 ms 8 bytes_per_second=404.981G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time 4.40 ms 4.41 ms 8 bytes_per_second=469.011G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time 4.40 ms 4.41 ms 8 bytes_per_second=468.577G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time 76.0 ms 75.9 ms 8 bytes_per_second=105.396G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time 70.6 ms 70.5 ms 8 bytes_per_second=115.205G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time 28.6 ms 29.6 ms 8 bytes_per_second=293.253G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time 19.0 ms 19.0 ms 8 bytes_per_second=448.676G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time 6.69 ms 7.32 ms 8 bytes_per_second=336.342G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time 4.40 ms 4.39 ms 8 bytes_per_second=518.755G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time 55.4 ms 55.4 ms 8 bytes_per_second=36.1167G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time 57.0 ms 56.9 ms 8 bytes_per_second=35.6588G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time 8.48 ms 8.73 ms 8 bytes_per_second=247.664G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time 5.99 ms 6.00 ms 8 bytes_per_second=355.742G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time 6.69 ms 9.30 ms 8 bytes_per_second=448.359G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time 4.33 ms 4.33 ms 8 bytes_per_second=700.639G/s
```

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - David Wendt (https://github.com/davidwendt)

URL: #13342
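
### Illustration

The byte-by-byte copy through `std::back_inserter` described in the message above can be contrasted with a bulk copy in a small standalone sketch. This is not the code from the commit, and the fix may well use a different technique. `fake_serialized_column`, `pack_bytewise`, and `pack_bulk` are hypothetical names introduced only for illustration; the real layout of `detail::serialized_column` is not shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for detail::serialized_column (assumed layout, illustration only).
struct fake_serialized_column {
  std::int32_t type;
  std::int32_t size;
  std::int64_t data_offset;
};

// Pattern described in the commit message: std::back_inserter calls push_back
// once per byte, so the vector may grow and reallocate repeatedly.
std::vector<std::uint8_t> pack_bytewise(std::vector<fake_serialized_column> const& metadata)
{
  auto const* begin = reinterpret_cast<std::uint8_t const*>(metadata.data());
  std::vector<std::uint8_t> bytes;
  std::copy(begin,
            begin + (metadata.size() * sizeof(fake_serialized_column)),
            std::back_inserter(bytes));
  return bytes;
}

// One possible bulk alternative (an assumption, not necessarily the actual fix):
// the range constructor knows the total size up front, so it can allocate once
// and copy the whole byte range in one pass.
std::vector<std::uint8_t> pack_bulk(std::vector<fake_serialized_column> const& metadata)
{
  auto const* begin = reinterpret_cast<std::uint8_t const*>(metadata.data());
  return std::vector<std::uint8_t>(begin,
                                   begin + (metadata.size() * sizeof(fake_serialized_column)));
}
```

Both functions produce the same bytes; they differ only in how the destination vector is filled, which is the kind of difference the benchmark numbers above are sensitive to when there are many columns and splits.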