Fix reading MAP_KEY_VALUE Parquet SchemaElement #7966
VELOX_CHECK_EQ(
    schemaElement.repetition_type,
    thrift::FieldRepetitionType::REPEATED);
assert(children.size() == 2);
VELOX_CHECK_EQ
VELOX_CHECK_EQ(children.size(), 1);
auto child = children[0];
auto& child
@Yuhta Resolved the comments, could you please review again? Thank you!
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@@ -94,6 +94,7 @@ class ReaderBase {
uint32_t maxSchemaElementIdx,
uint32_t maxRepeat,
uint32_t maxDefine,
uint32_t parentSchemaIdx,
int32_t, since you are using -1 as the empty value and checking parentSchemaIdx >= 0 below
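The type concern above can be illustrated with a small sketch. The helper names `hasParentUnsigned` and `hasParentSigned` are hypothetical and are not part of Velox; they only show why a `-1` sentinel does not work with an unsigned parameter.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers, not Velox code: they only illustrate the type issue.

// With an unsigned parameter the check can never fail, because -1 wraps
// around to the largest uint32_t value. Compilers may warn that this
// comparison is always true.
constexpr bool hasParentUnsigned(uint32_t parentSchemaIdx) {
  return parentSchemaIdx >= 0;
}

// With a signed parameter, -1 stays negative and the check works as intended.
constexpr bool hasParentSigned(int32_t parentSchemaIdx) {
  return parentSchemaIdx >= 0;
}

static_assert(
    static_cast<uint32_t>(-1) == std::numeric_limits<uint32_t>::max(),
    "-1 wraps to the maximum unsigned value");
static_assert(
    hasParentUnsigned(static_cast<uint32_t>(-1)),
    "the unsigned check accepts the sentinel");
static_assert(!hasParentSigned(-1), "the signed check rejects the sentinel");
```

Either switching to `int32_t` or dropping the sentinel altogether avoids the always-true comparison.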
@Yuhta Thanks for catching this! I forgot to change the type before I submitted the PR. However, I just changed the initial value from -1 to 0 to make the code simpler (no need to check if it's >=0). This is ok because it's never required to check the parent of the root in getParquetColumnInfo(). The root is always named "hive_schema" with annotation of "UTF8" and cannot be any of the MAP_KEY_VALUE, MAP, LIST converted types.
// Setting the parent schema index of the root("hive_schema") to be 0, which
// is the root itself. This is ok because it's never required to check the
// parent of the root in getParquetColumnInfo().
schemaWithId_ = getParquetColumnInfo(
maxSchemaElementIdx, maxRepeat, maxDefine, 0, schemaIdx, columnIdx);
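As a rough sketch of the idea in this comment, the walk below threads a parent index through a depth-first traversal of a flattened schema (Parquet stores the schema as a depth-first list in which each element carries its child count). `FlatElement`, `walk`, and `parentIndices` are hypothetical names for illustration and do not mirror the real getParquetColumnInfo() signature; the root is simply given index 0, itself, as its parent.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified schema element: a name and a child count.
struct FlatElement {
  std::string name;
  uint32_t numChildren;
};

// Visits schema[idx] and its subtree in depth-first order, recording the
// parent index of every element. Returns the index of the next unvisited
// element.
uint32_t walk(
    const std::vector<FlatElement>& schema,
    uint32_t idx,
    uint32_t parentIdx,
    std::vector<uint32_t>& parents) {
  parents[idx] = parentIdx;
  uint32_t next = idx + 1;
  for (uint32_t i = 0; i < schema[idx].numChildren; ++i) {
    next = walk(schema, next, idx, parents);
  }
  return next;
}

// The root has no real parent, so it is passed 0 (itself). This is harmless
// as long as no caller ever inspects the parent of the root.
std::vector<uint32_t> parentIndices(const std::vector<FlatElement>& schema) {
  std::vector<uint32_t> parents(schema.size(), 0);
  if (!schema.empty()) {
    walk(schema, 0, 0, parents);
  }
  return parents;
}
```

For a schema listed as hive_schema, my_map, map, key, value (with child counts 1, 1, 2, 0, 0), this records the parents 0, 0, 1, 2, 2.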
const std::string sample(getExampleFilePath("map_key_value.parquet"));
facebook::velox::dwio::common::ReaderOptions readerOptions{defaultPool.get()};
defaultPool is no longer there in the trunk
defaultPool is no longer there in the trunk
Updated to leafPool_
In Parquet, the map type is normally annotated with the MAP converted type. It should contain a repeated group annotated with MAP_KEY_VALUE, which in turn contains two children, key and value:

    <map-repetition> group <name> (MAP) {
      repeated group key_value (MAP_KEY_VALUE) {
        required <key-type> key;
        <value-repetition> <value-type> value;
      }
    }

But sometimes a group annotated with MAP_KEY_VALUE was incorrectly used in place of MAP:

    <map-repetition> group my_map (MAP_KEY_VALUE) {
      repeated group map {
        required binary key (UTF8);
        optional int32 value;
      }
    }

For backward compatibility, a MAP_KEY_VALUE that is not contained by a MAP should be treated as MAP. This commit makes the following changes:

1. Adds a parentSchemaIdx parameter to the Parquet reader's getParquetColumnInfo() function to pass the parent schema.
2. Differentiates the cases where a MAP_KEY_VALUE's parent is or is not a MAP. If it is, the MAP_KEY_VALUE element should be the repeated group that contains the key and value. If it is not, it should be treated the same as MAP.

For more information please check https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
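The parent-dependent rule described in the commit message could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the Velox implementation: `ConvertedType`, `SchemaNode`, `GroupRole`, and `classifyGroup` are hypothetical stand-ins for the Thrift schema types and the logic inside getParquetColumnInfo().

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the Thrift types; not the Velox definitions.
enum class ConvertedType { kNone, kUtf8, kMap, kMapKeyValue, kList };

struct SchemaNode {
  ConvertedType convertedType;
};

enum class GroupRole { kMap, kKeyValueGroup, kOther };

// Decides how the group at schemaIdx should be read, given its parent.
GroupRole classifyGroup(
    const std::vector<SchemaNode>& schema,
    uint32_t schemaIdx,
    uint32_t parentSchemaIdx) {
  assert(schemaIdx < schema.size() && parentSchemaIdx < schema.size());
  switch (schema[schemaIdx].convertedType) {
    case ConvertedType::kMap:
      return GroupRole::kMap;
    case ConvertedType::kMapKeyValue:
      // Inside a MAP, this is the repeated group holding key and value.
      // Anywhere else, it is the legacy spelling of MAP itself.
      return schema[parentSchemaIdx].convertedType == ConvertedType::kMap
          ? GroupRole::kKeyValueGroup
          : GroupRole::kMap;
    default:
      return GroupRole::kOther;
  }
}
```

With the standard layout, the MAP_KEY_VALUE group has a MAP parent and is classified as the key/value group. With the legacy layout, its parent is the root, so it is classified as a map.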
The benchmark failure is due to #8109
Conbench analyzed the 1 benchmark run on commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
This PR fixes issue #7777
In Parquet, the map type is normally annotated with the MAP converted type.
It should contain a repeated group annotated with MAP_KEY_VALUE,
which in turn contains two children, key and value.
But sometimes a group annotated with MAP_KEY_VALUE was incorrectly
used in place of MAP.
For backward compatibility, a MAP_KEY_VALUE that is not contained by a
MAP should be treated as MAP. This commit makes the following changes:
1. Adds a parentSchemaIdx parameter to the Parquet reader's
getParquetColumnInfo() function to pass the parent schema.
2. Differentiates the cases where a MAP_KEY_VALUE's parent is or
is not a MAP. If it is, the MAP_KEY_VALUE element should be the repeated
group that contains the key and value. If it is not, it should be treated
the same as MAP.
For more information please check https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps