Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Log warning for tests that are not really testing dictionary-encoded Parquet files #752

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Aug 1, 2024

Which issue does this PR close?

N/A

Rationale for this change

Many tests use the following pattern to cover testing with dictionary-encoded data.

    Seq("true", "false").foreach { dictionary =>
        withParquetTable(
          (-5 until 5).map(i => (i.toDouble + 0.3, i.toDouble + 0.8)),
          "tbl",
          withDictionary = dictionary.toBoolean) {

However, in many cases, including this example, the data does not contain repeated values and therefore does not cause any dictionary-encoded arrays to be created.

What changes are included in this PR?

Log a warning when withParquetTable is called with withDictionary=true and when the parquet file does not actually contain dictionary-encoded data.

I was originally going to add an assertion instead of a log message but it was too disruptive.

Example output from running tests:

- test multiple pages with mixed PLAIN_DICTIONARY and PLAIN encoding (prefetch enabled) (709 milliseconds)
- test multiple pages with mixed PLAIN_DICTIONARY and PLAIN encoding (524 milliseconds)
WARN: withParquetFile was called with withDictionary=true but did not write any dictionary-encoded columns
- skip vector re-loading (prefetch enabled) (669 milliseconds)
WARN: withParquetFile was called with withDictionary=true but did not write any dictionary-encoded columns

How are these changes tested?

No functional change to test. Manually tested that logging appears.

@andygrove andygrove changed the title chore: Assert that parquet files are written with dictionary-encoded data chore: Log warning for tests that are not really testing dictionary-encoded Parquet files Aug 29, 2024
@andygrove andygrove marked this pull request as ready for review August 29, 2024 03:48
@andygrove andygrove marked this pull request as draft August 29, 2024 03:57
@andygrove andygrove marked this pull request as ready for review September 4, 2024 22:00
Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a lot of tests not have dictionary encoded data?
If the number is not large, might be better to fix the tests and throw an error here.

if (withDictionary) {
// if the test specified to write dictionary-encoded data, we should check that we actually wrote some
// dictionary-encoded data
val files = file.listFiles(new FilenameFilter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a nit: I feel this code checks if any column has RLE encoding, however RLE is on row group level? that means every page of the every column may or may not encoded. If lets say column C1 encoded, C2 nope, than no warning shown but we use exactly C2 for testing

@andygrove andygrove marked this pull request as draft October 25, 2024 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants