
Bad frame decompression performance regression for large data sets in Rust 1.73 #147

Closed
bonsairobo opened this issue Nov 2, 2023 · 9 comments


bonsairobo commented Nov 2, 2023

[line graph: lz4 frame decompression performance, Rust 1.72 vs. 1.73]

Both measurements were taken for lz4_flex version 0.11.1. Only bumping the Rust compiler version caused the regression.

What I find very interesting is that essentially the same regression has occurred for the lz4 crate, and it seems both regressions have only occurred for decompression of the frame format, not the block format.

I think the next step is to investigate which specific scope of code in the frame decompression has gotten slower.

Benchmark:

```rust
use std::{
    fs::File,
    io::Cursor,
    path::{Path, PathBuf},
    time::Instant,
};

fn main() {
    for entry_result in std::fs::read_dir("./data").unwrap() {
        let entry = entry_result.unwrap();
        do_bench(
            "lz4_flex",
            &entry.path(),
            compress_lz4_flex,
            decompress_lz4_flex,
        );
        // do_bench("lz4", &entry.path(), compress_lz4, decompress_lz4);
    }
}

fn do_bench(
    algo_name: &str,
    input_path: &Path,
    compress_fn: impl Fn(&Path, File) -> File,
    decompress_fn: impl Fn(&[u8]) -> Vec<u8>,
) {
    let in_f = std::fs::File::open(input_path).unwrap();
    let start_size = in_f.metadata().unwrap().len();

    let out_path = PathBuf::from(format!(
        "output/out-{}.lz4",
        input_path.file_stem().unwrap().to_str().unwrap()
    ));
    let compress_start = Instant::now();

    let output = compress_fn(&out_path, in_f);

    let compress_elapsed = compress_start.elapsed().as_millis();
    let end_size = output.metadata().unwrap().len();
    println!(
        "[{algo_name} {}] Compression: {:.4}%, Time: {:.2}s",
        input_path.display(),
        100.0 * end_size as f32 / start_size as f32,
        compress_elapsed as f32 / 1000.0,
    );

    let compressed_bytes = std::fs::read(&out_path).unwrap();
    let decode_start = Instant::now();

    let decompressed_buf = decompress_fn(&compressed_bytes);

    println!(
        "Decompress in: {:.5}s",
        decode_start.elapsed().as_millis() as f32 / 1000.0
    );
    let out_path = format!(
        "output/roundtrip-{}.lz4",
        input_path.file_stem().unwrap().to_str().unwrap()
    );
    std::fs::write(out_path, decompressed_buf).unwrap();
}

fn compress_lz4_flex(out_path: &Path, mut in_f: std::fs::File) -> std::fs::File {
    let out_f = std::fs::File::create(out_path).unwrap();
    let mut encoder = lz4_flex::frame::FrameEncoder::new(out_f);
    std::io::copy(&mut in_f, &mut encoder).unwrap();
    encoder.finish().unwrap()
}

fn decompress_lz4_flex(compressed_bytes: &[u8]) -> Vec<u8> {
    let mut decoder = lz4_flex::frame::FrameDecoder::new(Cursor::new(compressed_bytes));
    let mut uncompressed_buf = Vec::new();
    std::io::copy(&mut decoder, &mut uncompressed_buf).unwrap();
    uncompressed_buf
}

// fn compress_lz4(out_path: &Path, mut in_f: File) -> File {
//     let out_f = std::fs::File::create(out_path).unwrap();
//     let mut encoder = lz4::EncoderBuilder::new().build(out_f).unwrap();
//     std::io::copy(&mut in_f, &mut encoder).unwrap();
//     let (out_f, maybe_err) = encoder.finish();
//     maybe_err.unwrap();
//     out_f
// }

// fn decompress_lz4(decompress_bytes: &[u8]) -> Vec<u8> {
//     let mut decoder = lz4::Decoder::new(Cursor::new(decompress_bytes)).unwrap();
//     let mut uncompressed_buf = Vec::new();
//     std::io::copy(&mut decoder, &mut uncompressed_buf).unwrap();
//     uncompressed_buf
// }
```

bonsairobo commented Nov 2, 2023

I just noticed there is a benchmark in this repo using larger files. Here are the decompression results of that:

Rust 1.72

bench_files/dickens - Decompression 2544.32 Mb/s
bench_files/mozilla - Decompression 2188.20 Mb/s
bench_files/mr - Decompression 1765.24 Mb/s
bench_files/nci - Decompression 2934.73 Mb/s
bench_files/ooffice - Decompression 2597.32 Mb/s
bench_files/osdb - Decompression 1508.96 Mb/s
bench_files/reymont - Decompression 2460.89 Mb/s
bench_files/samba - Decompression 2533.86 Mb/s
bench_files/webster - Decompression 2064.81 Mb/s

Rust 1.73

bench_files/dickens - Decompression 1414.35 Mb/s
bench_files/mozilla - Decompression 663.37 Mb/s
bench_files/mr - Decompression 1496.99 Mb/s
bench_files/nci - Decompression 652.04 Mb/s
bench_files/ooffice - Decompression 1887.27 Mb/s
bench_files/osdb - Decompression 1401.62 Mb/s
bench_files/reymont - Decompression 1905.72 Mb/s
bench_files/samba - Decompression 1061.93 Mb/s
bench_files/sao - Decompression 1789.27 Mb/s
bench_files/webster - Decompression 604.82 Mb/s
bench_files/x-ray - Decompression 1446.80 Mb/s
bench_files/xml - Decompression 2472.21 Mb/s


PSeitz commented Nov 3, 2023

I don't see that much change on the benchmark. Some variation is normal, as the tests are quite flaky.

cargo +1.72 bench FrameDecompress/lz4_flex
...
cargo bench FrameDecompress/lz4_flex

FrameDecompress/lz4_flex_rust_indep/725
                        time:   [894.00 ns 911.89 ns 933.52 ns]
                        thrpt:  [740.65 MiB/s 758.22 MiB/s 773.39 MiB/s]
                 change:
                        time:   [+10.272% +12.321% +14.746%] (p = 0.00 < 0.05)
                        thrpt:  [-12.851% -10.969% -9.3148%]
FrameDecompress/lz4_flex_rust_linked/725
                        time:   [903.73 ns 919.16 ns 938.63 ns]
                        thrpt:  [736.62 MiB/s 752.22 MiB/s 765.07 MiB/s]
                 change:
                        time:   [+4.8091% +7.0499% +9.3629%] (p = 0.00 < 0.05)
                        thrpt:  [-8.5613% -6.5856% -4.5884%]
FrameDecompress/lz4_flex_rust_indep/9991663
                        time:   [5.6644 ms 5.6929 ms 5.7234 ms]
                        thrpt:  [1.6259 GiB/s 1.6346 GiB/s 1.6428 GiB/s]
                 change:
                        time:   [-1.2619% -0.2034% +0.8968%] (p = 0.72 > 0.05)
                        thrpt:  [-0.8889% +0.2038% +1.2780%]
FrameDecompress/lz4_flex_rust_linked/9991663
                        time:   [4.2277 ms 4.2424 ms 4.2600 ms]
                        thrpt:  [2.1844 GiB/s 2.1934 GiB/s 2.2011 GiB/s]
                 change:
                        time:   [-11.640% -10.925% -10.235%] (p = 0.00 < 0.05)
                        thrpt:  [+11.401% +12.264% +13.173%]
FrameDecompress/lz4_flex_rust_indep/7023566
                        time:   [1.7817 ms 1.8419 ms 1.9063 ms]
                        thrpt:  [3.4313 GiB/s 3.5513 GiB/s 3.6714 GiB/s]
                 change:
                        time:   [-6.1937% -3.5502% -0.5663%] (p = 0.02 < 0.05)
                        thrpt:  [+0.5695% +3.6808% +6.6027%]
FrameDecompress/lz4_flex_rust_linked/7023566
                        time:   [1.8447 ms 1.8756 ms 1.9146 ms]
                        thrpt:  [3.4164 GiB/s 3.4875 GiB/s 3.5460 GiB/s]
                 change:
                        time:   [-28.078% -25.791% -23.129%] (p = 0.00 < 0.05)
                        thrpt:  [+30.088% +34.754% +39.039%]
FrameDecompress/lz4_flex_rust_indep/6627202
                        time:   [3.9133 ms 3.9359 ms 3.9592 ms]
                        thrpt:  [1.5589 GiB/s 1.5682 GiB/s 1.5772 GiB/s]
                 change:
                        time:   [-0.2018% +0.6744% +1.5541%] (p = 0.13 > 0.05)
                        thrpt:  [-1.5303% -0.6699% +0.2022%]
FrameDecompress/lz4_flex_rust_linked/6627202
                        time:   [3.7983 ms 3.8800 ms 3.9678 ms]
                        thrpt:  [1.5556 GiB/s 1.5908 GiB/s 1.6249 GiB/s]
                 change:
                        time:   [+13.916% +16.699% +19.543%] (p = 0.00 < 0.05)
                        thrpt:  [-16.348% -14.310% -12.216%]
FrameDecompress/lz4_flex_rust_indep/5322575
                        time:   [2.2280 ms 2.2641 ms 2.3020 ms]
                        thrpt:  [2.1534 GiB/s 2.1894 GiB/s 2.2249 GiB/s]
                 change:
                        time:   [+4.0524% +6.5407% +9.2544%] (p = 0.00 < 0.05)
                        thrpt:  [-8.4705% -6.1392% -3.8945%]
FrameDecompress/lz4_flex_rust_linked/5322575
                        time:   [2.0397 ms 2.0732 ms 2.1078 ms]
                        thrpt:  [2.3517 GiB/s 2.3910 GiB/s 2.4303 GiB/s]
                 change:
                        time:   [-12.918% -10.255% -7.4586%] (p = 0.00 < 0.05)
                        thrpt:  [+8.0597% +11.427% +14.834%]

@MichaelBradetLegris

I've hit the same issue. It only shows up for larger files on 1.73; for smaller files the time difference is negligible. The default benchmark files all look to be under 10 MB. I also noticed you used Rust 1.72 for that test above, @PSeitz.


PSeitz commented Nov 3, 2023

Indeed, that is quite unexpected, since lz4_flex operates on chunks of at most 4 MB.

The issue seems to be the new `ensure_init` method on `BorrowedCursor`, which accounts for most of the time.
The method was introduced in this PR: rust-lang/rust#97015, and judging from the graph in the first post, its cost appears to grow superlinearly with input size.

Now: [flamegraph2 image]

Before: [flamegraph image]
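For intuition, here is a small stable-Rust model (hypothetical, not the actual std internals) of the failure mode: if each read call zeroes the entire uninitialized tail of the destination buffer before filling one small chunk, the total bytes zeroed grow quadratically with buffer size.

```rust
/// Counts bytes zeroed if every fill of `chunk` bytes first re-zeroes
/// the whole unfilled tail (an ensure_init-style worst case).
fn total_zeroed(capacity: usize, chunk: usize) -> u64 {
    let mut filled = 0usize;
    let mut zeroed = 0u64;
    while filled < capacity {
        zeroed += (capacity - filled) as u64; // zero the entire tail again
        filled += chunk.min(capacity - filled); // then fill one chunk
    }
    zeroed
}

fn main() {
    // Doubling the buffer roughly quadruples the zeroing work.
    let one_mib = total_zeroed(1 << 20, 4096);
    let two_mib = total_zeroed(1 << 21, 4096);
    println!("1 MiB: {one_mib} bytes zeroed, 2 MiB: {two_mib} bytes zeroed");
    assert!(two_mib > 3 * one_mib && two_mib < 5 * one_mib);
}
```

This would explain why the default sub-10 MB benchmark files barely move while larger inputs regress badly.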


PSeitz commented Nov 3, 2023

I created a ticket in the Rust repo. This seems to be a major performance issue, so it's surprising it wasn't reported before (or I didn't find an existing report). Thanks for the bug report @bonsairobo

@bonsairobo
Author

Looks like a fix was merged: rust-lang/rust#117576

Although it also says milestone 1.75, which is surprising. Wouldn't they want to ship a 1.73 point release to fix this huge regression, given that it affects multiple crates?

@bonsairobo
Author

Apparently 1.74 got the fix. I can try taking a measurement to confirm.

@bonsairobo
Author

Seems to be fixed in 1.74!


PSeitz commented Nov 17, 2023

Thanks for confirming!

@PSeitz PSeitz closed this as completed Nov 17, 2023