-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45283: [C++][Parquet] Omit level histogram when max level is 0 #45285
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -64,14 +64,19 @@ void SizeStatistics::IncrementUnencodedByteArrayDataBytes(int64_t value) { | |
} | ||
|
||
void SizeStatistics::Validate(const ColumnDescriptor* descr) const { | ||
if (repetition_level_histogram.size() != | ||
static_cast<size_t>(descr->max_repetition_level() + 1)) { | ||
throw ParquetException("Repetition level histogram size mismatch"); | ||
} | ||
if (definition_level_histogram.size() != | ||
static_cast<size_t>(descr->max_definition_level() + 1)) { | ||
throw ParquetException("Definition level histogram size mismatch"); | ||
} | ||
auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level, | ||
const std::string& name) { | ||
if (max_level == 0 && histogram.empty()) { | ||
return; | ||
} | ||
wgtmac marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (histogram.size() != static_cast<size_t>(max_level + 1)) { | ||
throw ParquetException(name + " level histogram size mismatch"); | ||
} | ||
}; | ||
validate_histogram(repetition_level_histogram, descr->max_repetition_level(), | ||
"Repetition"); | ||
validate_histogram(definition_level_histogram, descr->max_definition_level(), | ||
"Definition"); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You may want to add a threshold argument to |
||
if (unencoded_byte_array_data_bytes.has_value() && | ||
descr->physical_type() != Type::BYTE_ARRAY) { | ||
throw ParquetException("Unencoded byte array data bytes does not support " + | ||
|
@@ -93,8 +98,15 @@ void SizeStatistics::Reset() { | |
|
||
std::unique_ptr<SizeStatistics> SizeStatistics::Make(const ColumnDescriptor* descr) { | ||
auto size_stats = std::make_unique<SizeStatistics>(); | ||
size_stats->repetition_level_histogram.resize(descr->max_repetition_level() + 1, 0); | ||
size_stats->definition_level_histogram.resize(descr->max_definition_level() + 1, 0); | ||
// If the max level is 0, the level histogram can be omitted because it contains | ||
// only single level (a.k.a. 0) and its count is equivalent to `num_values` of the | ||
// column chunk or data page. | ||
if (descr->max_repetition_level() != 0) { | ||
size_stats->repetition_level_histogram.resize(descr->max_repetition_level() + 1, 0); | ||
} | ||
if (descr->max_definition_level() != 0) { | ||
size_stats->definition_level_histogram.resize(descr->max_definition_level() + 1, 0); | ||
} | ||
if (descr->physical_type() == Type::BYTE_ARRAY) { | ||
size_stats->unencoded_byte_array_data_bytes = 0; | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -376,4 +376,22 @@ TEST_F(SizeStatisticsRoundTripTest, LargePage) { | |
/*byte_array_bytes=*/{90000}})); | ||
} | ||
|
||
TEST_F(SizeStatisticsRoundTripTest, MaxLevelZero) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we add a file to parquet-testing that has max_def==1 and omits the def histogram? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think if any open-sourced Parquet writer is able to create this yet... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I could coerce one into doing so. But with the change allowing empty histograms, it's probably not worth creating one now. |
||
auto schema = | ||
::arrow::schema({::arrow::field("a", ::arrow::utf8(), /*nullable=*/false)}); | ||
WriteFile(SizeStatisticsLevel::PageAndColumnChunk, | ||
::arrow::TableFromJSON(schema, {R"([["foo"],["bar"]])"}), | ||
/*max_row_group_length=*/2, | ||
/*page_size=*/1024); | ||
ReadSizeStatistics(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does it also read the full file? We should ensure we don't crash elsewhere. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
EXPECT_THAT(row_group_stats_, | ||
::testing::ElementsAre(SizeStatistics{/*def_levels=*/{}, | ||
/*rep_levels=*/{}, | ||
/*byte_array_bytes=*/6})); | ||
EXPECT_THAT(page_stats_, | ||
::testing::ElementsAre(PageSizeStatistics{/*def_levels=*/{}, | ||
/*rep_levels=*/{}, | ||
/*byte_array_bytes=*/{6}})); | ||
} | ||
|
||
} // namespace parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: should this be a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this is already clear.