Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StringView by default when reading from parquet #11723

Closed
wants to merge 13 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 30, 2024

Draft until:

Which issue does this PR close?

Closes #11682

Rationale for this change

@XiangpengHao @a10y @PsiACE @Weijun-H and others have been working this summer of using StringView: #10918 to improve performance,.

It is currently disabled behind a config setting. Let's turn it on and improve query performance against parquet files

What changes are included in this PR?

  1. set datafusion.execution.parquet.schema_force_string_view to true by default
  2. Update tests

Are these changes tested?

Yes by CI

Benchmark results: Several ClickBench queries get 10-40% faster

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_string_view_default ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.72ms │                    0.70ms │     no change │
│ QQuery 1     │   103.93ms │                   90.46ms │ +1.15x faster │
│ QQuery 2     │   202.20ms │                  199.83ms │     no change │
│ QQuery 3     │   198.99ms │                  200.25ms │     no change │
│ QQuery 4     │  2230.52ms │                 2231.22ms │     no change │
│ QQuery 5     │  1927.29ms │                 2131.28ms │  1.11x slower │
│ QQuery 6     │    79.69ms │                   81.94ms │     no change │
│ QQuery 7     │    95.45ms │                   90.79ms │     no change │
│ QQuery 8     │  2949.84ms │                 3006.49ms │     no change │
│ QQuery 9     │  2341.97ms │                 2345.57ms │     no change │
│ QQuery 10    │   837.30ms │                  611.32ms │ +1.37x faster │
│ QQuery 11    │   919.39ms │                  690.86ms │ +1.33x faster │
│ QQuery 12    │  2047.26ms │                 2022.77ms │     no change │
│ QQuery 13    │  4511.05ms │                 4571.88ms │     no change │
│ QQuery 14    │  2775.56ms │                 2777.79ms │     no change │
│ QQuery 15    │  2439.24ms │                 2452.09ms │     no change │
│ QQuery 16    │  5720.56ms │                 5978.60ms │     no change │
│ QQuery 17    │  5718.81ms │                 5895.96ms │     no change │
│ QQuery 18    │ 11698.18ms │                12226.38ms │     no change │
│ QQuery 19    │   169.70ms │                  172.81ms │     no change │
│ QQuery 20    │  2692.90ms │                 2365.29ms │ +1.14x faster │
│ QQuery 21    │  3405.29ms │                 2968.78ms │ +1.15x faster │
│ QQuery 22    │  8586.02ms │                 8119.18ms │ +1.06x faster │
│ QQuery 23    │ 21805.02ms │                19018.68ms │ +1.15x faster │
│ QQuery 24    │  1309.78ms │                 1144.30ms │ +1.14x faster │
│ QQuery 25    │  1082.19ms │                  917.85ms │ +1.18x faster │
│ QQuery 26    │  1422.32ms │                 1237.54ms │ +1.15x faster │
│ QQuery 27    │  3958.70ms │                 3936.67ms │     no change │
│ QQuery 28    │ 29745.73ms │                29598.31ms │     no change │
│ QQuery 29    │  1018.08ms │                 1019.14ms │     no change │
│ QQuery 30    │  2456.96ms │                 2248.57ms │ +1.09x faster │
│ QQuery 31    │  3064.59ms │                 2844.96ms │ +1.08x faster │
│ QQuery 32    │ 16442.64ms │                16337.92ms │     no change │
│ QQuery 33    │  9399.73ms │                 8291.57ms │ +1.13x faster │
│ QQuery 34    │  9425.59ms │                 8329.78ms │ +1.13x faster │
│ QQuery 35    │  3719.74ms │                 3677.70ms │     no change │
│ QQuery 36    │   341.13ms │                  300.31ms │ +1.14x faster │
│ QQuery 37    │   217.12ms │                  192.43ms │ +1.13x faster │
│ QQuery 38    │   186.82ms │                  183.49ms │     no change │
│ QQuery 39    │  1123.50ms │                 1012.67ms │ +1.11x faster │
│ QQuery 40    │    92.93ms │                   90.38ms │     no change │
│ QQuery 41    │    81.06ms │                   83.14ms │     no change │
│ QQuery 42    │    94.51ms │                   96.22ms │     no change │
└──────────────┴────────────┴───────────────────────────┴───────────────┘

--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ alamb_string_view_default ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3704.72ms │                 3678.85ms │     no change │
│ QQuery 1     │ 1489.44ms │                 1322.15ms │ +1.13x faster │
│ QQuery 2     │ 2959.88ms │                 3117.41ms │  1.05x slower │
└──────────────┴───────────┴───────────────────────────┴───────────────┘

Are there any user-facing changes?

Yes. With this change, it means that all strings will be read from parquet files as StringView by default.
This should result in a significant performance improvement for queries that involve string columns, especially for highly selective ones

It can still be disabled by default if needed

Bug notes

(The idea is I will work through these errors and file tickets)

Cast error

cargo test --test sqllogictests  -- parquet.slt
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-1d806c30cd9016c4)
Running "parquet.slt"
External error: query failed: DataFusion error: Error during planning: Cannot cast file schema field string_col of type BinaryView to table schema field of type Utf8
[SQL] SELECT id, CAST(string_col AS varchar) FROM alltypes_plain
at test_files/parquet.slt:193

Panic about statistics (that extracting parquet statistics as StringView is not implemented):

$ cargo test --test sqllogictests -- predicates
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.14s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-1d806c30cd9016c4)
Running "predicates.slt"
thread 'tokio-runtime-worker' panicked at datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:1047:5:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
External error: query failed: DataFusion error: Join Error
caused by
External error: task 208 panicked
[SQL] SELECT * FROM data_index_bloom_encoding_stats WHERE "String" = 'foo';
at test_files/predicates.slt:525

Clickbench exteded:

Q0: SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;

Running clickbench (1 file) extended benchmark...
    Finished `release` profile [optimized] target(s) in 0.77s
     Running `/home/alamb/arrow-datafusion2/target/release/dfbench clickbench --iterations 5 --path /home/alamb/arrow-datafusion/benchmarks/data/hits.par\
quet --queries-path /home/alamb/arrow-datafusion/benchmarks/queries/clickbench/extended.sql -o /home/alamb/arrow-datafusion/benchmarks/results/alamb_stri\
ng_view_default/clickbench_extended.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: 8192, debug: false \
}, path: "/home/alamb/arrow-datafusion/benchmarks/data/hits.parquet", queries_path: "/home/alamb/arrow-datafusion/benchmarks/queries/clickbench/extended.\
sql", output_path: Some("/home/alamb/arrow-datafusion/benchmarks/results/alamb_string_view_default/clickbench_extended.json") }
Q0: SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
thread 'tokio-runtime-worker' panicked at datafusion/physical-expr-common/src/binary_view_map.rs:220:18:
internal error: entered unreachable code: Utf8/Binary should use `ArrowBytesSet`

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) labels Jul 30, 2024
@@ -901,6 +921,29 @@ macro_rules! get_data_page_statistics {
}
Ok(Arc::new(builder.finish()))
},
// TODO file upstream in Arrowrs --
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed upstream as apache/arrow-rs#6164

Suggested change
// TODO file upstream in Arrowrs --
// https://github.com/apache/arrow-rs/issues/6164

@alamb
Copy link
Contributor Author

alamb commented Aug 18, 2024

superceded by #11862

@alamb alamb closed this Aug 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable reading StringView by default from Parquet (schema_force_string_view) by default
2 participants