Untested ideas for performance improvements:

`hpx::thread`, `hpx::lock`, ... instead of `std::thread`.
Hard to configure on systems though, so only a few might profit from this.
rpmalloc has support for this, but it needs to be enabled, tested, and benchmarked.
Better I/O for reading and writing using manually managed mmap or io_uring. Note that, according to trace results, the read call already uses mmap in the background.
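A minimal sketch of the manually managed mmap variant, assuming POSIX; `MappedFile` is a made-up helper for illustration, not existing rapidgzip code:

```cpp
#include <cstddef>
#include <stdexcept>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and hint the kernel about sequential access.
 * Only illustrates the idea; error handling is minimal. */
class MappedFile
{
public:
    explicit MappedFile( const char* path )
    {
        const int fd = ::open( path, O_RDONLY );
        if ( fd < 0 ) {
            throw std::runtime_error( "open failed" );
        }

        struct stat fileStat{};
        ::fstat( fd, &fileStat );
        m_size = static_cast<size_t>( fileStat.st_size );

        m_data = ::mmap( nullptr, m_size, PROT_READ, MAP_PRIVATE, fd, 0 );
        ::close( fd );  /* The mapping stays valid after closing the descriptor. */
        if ( m_data == MAP_FAILED ) {
            throw std::runtime_error( "mmap failed" );
        }

        ::madvise( m_data, m_size, MADV_SEQUENTIAL );  /* Encourage read-ahead. */
    }

    ~MappedFile() { ::munmap( m_data, m_size ); }

    [[nodiscard]] const std::byte* data() const { return static_cast<const std::byte*>( m_data ); }
    [[nodiscard]] size_t size() const { return m_size; }

private:
    void*  m_data{ nullptr };
    size_t m_size{ 0 };
};
```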
Merge the Huffman decoding with deflate decoding in a shared lookup table.
Done by delegating this to ISA-L. This showed a speedup of 0% (random base64, no back-references), 20% (FASTQ), and 40% (Silesia).
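A merged table would pack everything the deflate loop needs next into one entry, similar in spirit to what libdeflate and ISA-L style decoders do. This is only a guess at a possible layout; the field names and widths are invented, not rapidgzip's or ISA-L's actual format:

```cpp
#include <cstdint>

/* One lookup yields not only the Huffman symbol but also whether it is a
 * literal, how many bits were consumed, and how many extra bits follow,
 * so the hot loop avoids a second decoding step. Widths are illustrative. */
struct PackedLUTEntry
{
    uint16_t baseValue;         /* Literal byte or length/distance base value. */
    uint8_t  codeLength;        /* Bits consumed from the bit buffer. */
    uint8_t  extraBits    : 5;  /* Extra bits to read for lengths/distances. */
    uint8_t  isLiteral    : 1;  /* Entry decodes directly to an output byte. */
    uint8_t  isEndOfBlock : 1;
    uint8_t  isInvalid    : 1;
};

static_assert( sizeof( PackedLUTEntry ) == 4, "Keep entries small so the table stays in cache." );
```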
Try automatic SIMD dispatch with ISPC as mentioned on HN?
Automatically determine the best of the implemented Huffman decoders given the compression ratio, the number of back-references, or other metadata.
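A sketch of what such a selection could look like; the decoder names and the thresholds are invented for illustration and would have to come from benchmarks:

```cpp
#include <cstddef>

enum class HuffmanDecoderKind
{
    SINGLE_SYMBOL_LUT,  /* Cheap to build, decodes one symbol per lookup. */
    DOUBLE_SYMBOL_LUT,  /* Expensive to build, can emit two literals per lookup. */
};

/* Pick a decoder from metadata gathered for a chunk. Purely illustrative. */
inline HuffmanDecoderKind
chooseHuffmanDecoder( std::size_t decodedSize,
                      std::size_t backReferenceCount,
                      double      compressionRatio )
{
    const auto backReferenceDensity =
        decodedSize == 0 ? 0.0
                         : static_cast<double>( backReferenceCount ) / static_cast<double>( decodedSize );

    /* Few back-references and a low compression ratio (e.g. base64) mean long
     * literal runs, which is where a double-literal table might pay off despite
     * its higher setup cost. */
    if ( ( backReferenceDensity < 0.05 ) && ( compressionRatio < 2.0 ) ) {
        return HuffmanDecoderKind::DOUBLE_SYMBOL_LUT;
    }
    return HuffmanDecoderKind::SINGLE_SYMBOL_LUT;
}
```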
Use SIMD shuffle to resolve references. Maybe test first how much this could possibly improve by completely skipping the reference resolving and/or using memset with some symbol or marker value. The marker-resolving implementation in pugz doesn't look very fast, so I am wondering why pugz without synchronized output is so fast. But as performance degrades for rapidgzip on more than one socket for FASTQ, I assume that it really is a NUMA issue, which would be solved by having the markers resolved by the same threads that also did the decompression, or at least by threads on the same NUMA domain, if I can get that information.
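The suggested upper-bound measurement could look like the following sketch: a stand-in for the real marker resolution that writes a constant instead of consulting the window. Timing it against the real pass shows the most a faster (e.g. SIMD) implementation could ever save. Function and buffer names are made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

/* Timing experiment only: produces wrong data on purpose. */
inline void
fakeApplyWindow( const std::vector<std::uint16_t>& markerBuffer,
                 std::vector<std::uint8_t>&        output )
{
    output.resize( markerBuffer.size() );
    /* Never look anything up in the previous window, just overwrite everything. */
    std::fill( output.begin(), output.end(), std::uint8_t{ 0 } );
}
```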
Currently, first-stage decompression and marker replacement might happen on different threads. Maybe trying harder to get the markers replaced on the same thread would help on NUMA architectures. This would also benefit memory allocations and deallocations because memory deallocated on the same thread it was allocated on should be much cheaper!
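One way to try this would be explicit CPU affinity, so that the thread that decompressed a chunk is also the one that later replaces its markers. A minimal Linux/glibc sketch (requires `_GNU_SOURCE`, which g++ defines by default); how chunks are assigned to CPUs is left out and would be the hard part:

```cpp
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU so that decompression of a chunk and the
 * later marker replacement for that chunk run on the same core and therefore
 * on the same NUMA node. */
inline bool
pinCurrentThreadToCpu( int cpuId )
{
    cpu_set_t cpuSet;
    CPU_ZERO( &cpuSet );
    CPU_SET( cpuId, &cpuSet );
    return pthread_setaffinity_np( pthread_self(), sizeof( cpu_set_t ), &cpuSet ) == 0;
}
```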
I have the feeling that memory allocations are still slowing rapidgzip down, especially for files that need to hold lots of large marker buffers, which are twice the size of the actual decompressed data. And statistics have shown that, e.g., for FASTQ and Silesia, those marker buffers actually only contain about 15% markers; the other 85% of the 16-bit values have a zero high byte and would only require 8 bits to store. See this code comment:
```
The results of these statistics for:

SRR22403185_2.fastq.gz:
    Replaced marker buffers : 329 MiB 550 KiB 191 B
    Actual marker count     : 46 MiB 705 KiB 331 B (14.168 %)

silesia.tar.gz:
    Replaced marker buffers : 70 MiB 575 KiB 654 B
    Actual marker count     : 21 MiB 523 KiB 94 B (30.4849 %)

4GiB-base64.gz:
    Replaced marker buffers : 21 MiB 582 KiB 252 B
    Actual marker count     : 158 KiB 538 B (0.717756 %)

CTU-13-Dataset.tar.gz:
    Replaced marker buffers : 22 GiB 273 MiB 34 KiB 574 B
    Actual marker count     : 2 GiB 687 MiB 490 KiB 119 B (11.9972 %)

An alternative format that uses a mix of 8-bit and 16-bit values and maybe a separate 1-bit buffer
to store which byte is which would reduce memory usage, and therefore also allocation overhead,
by 80%! Or maybe run-length encode it à la: <n 8-bit values> <8-bit value> ... <m 16-bit values>.
This would hopefully speed up window application because long runs of 8-bit values could
simply be copied with memcpy and even runs of 16-bit values could be processed in a loop.
This kind of compression would also add overhead though, and it probably would be too difficult
to do inside deflate::Block, so it should probably be applied as a post-processing step in
ChunkData::append( DecodedDataViews ). This might be something that could be optimized with SIMD;
the same applies to the equally necessary new ChunkData::applyWindow method.
The count could be 7-bit so that the 8th bit can be used to store the 8/16-bit value flag.
In the worst case of interleaved 8-bit and 16-bit values, this would add an overhead of 25%
over the current all-16-bit buffers: <8><16hi><16lo> <8>...
Ideally, a format that has no overhead even in the worst case would be nice.
That would be possible by using 4-bit values for the counts, but then the maximum run length
would be 3 bits -> 7, which seems insufficient as it might lead to lots of slow branching
in the applyWindow method.
```
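A sketch of the run-length layout described in that comment, using a 7-bit count plus a flag bit per run; `packMarkerBuffer` is an invented name and this is untested sketch code, not the rapidgzip implementation:

```cpp
#include <cstdint>
#include <vector>

/* Encode a 16-bit marker buffer into runs of plain bytes and runs of 16-bit values.
 * Each run starts with a header byte: bit 7 selects 8-bit (0) or 16-bit (1) payload,
 * bits 0-6 hold the run length (1..127). */
inline std::vector<std::uint8_t>
packMarkerBuffer( const std::vector<std::uint16_t>& markers )
{
    std::vector<std::uint8_t> packed;
    packed.reserve( markers.size() );  /* Optimistic guess: mostly 8-bit runs. */

    size_t i = 0;
    while ( i < markers.size() ) {
        const bool wide = markers[i] > 0xFFU;  /* Values > 255 are unresolved markers. */

        /* Extend the run while the class (8-bit vs. 16-bit) stays the same. */
        size_t runEnd = i + 1;
        while ( ( runEnd < markers.size() )
                && ( ( markers[runEnd] > 0xFFU ) == wide )
                && ( runEnd - i < 127 ) ) {
            ++runEnd;
        }

        const auto runLength = static_cast<std::uint8_t>( runEnd - i );
        packed.push_back( static_cast<std::uint8_t>( ( wide ? 0x80U : 0x00U ) | runLength ) );

        for ( ; i < runEnd; ++i ) {
            if ( wide ) {
                packed.push_back( static_cast<std::uint8_t>( markers[i] >> 8U ) );
            }
            packed.push_back( static_cast<std::uint8_t>( markers[i] & 0xFFU ) );
        }
    }
    return packed;
}
```

A matching applyWindow could then memcpy the 8-bit runs straight to the output and only loop over the 16-bit runs to look markers up in the window.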