-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parquet reader to handle when MAP_KEY_VALUE is used instead of MAP #7812
Conversation
Hi @hitarth! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
✅ Deploy Preview for meta-velox canceled.
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
std::nullopt, | ||
maxRepeat, | ||
maxDefine); | ||
if (children.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hitarth I don't think this fix is very accurate. According to the spec, the rule to test the backward compatibility is "a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group.". You're checking its children count, not if its parent is a MAP node. This may somewhat work for the map type, but won't work for list.
I think the correct fix would be following exactly the rules defined in the spec, i.e. check its parent node's type. If it's parent is not MAP, then it would be treated as MAP instead of MAP_KEY_VALUE. For that, we will need to pass in the parent schemaIdx. We can add a new parameter "uint32_t parentSchemaIdx" so that when a node is MAP_KEY_VALUE, we can check its parent type by schema[parentSchemaIdx].converted_type. Similarly we can do the same for list. Will you be able to try this approach to see if it works?
Please also try to reduce the amount of repeated code in your new rewrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, let me try adding parentSchemaIdx and rewrite this.
maxDefine); | ||
if (parentSchemaElement.converted_type != | ||
thrift::ConvertedType::MAP) { | ||
// MAP_KEY_VALUE is incorrectly used in place of MAP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems this is assuming here that, if it is not thrift::ConvertedType::MAP, it must be a thrift::ConvertedType::MAP_KEY_VALUE?
For more better robustness, to actually handle unexpected types, would it be better if we change it to explicit checking? e.g.
if (parentSchemaElement.converted_type == thrift::ConvertedType::MAP) {
// do something...
} else if (parentSchemaElement.converted_type == thrift::ConvertedType::MAP_KEY_VALUE) {
// do something
} else {
// throw an unexpected type error
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to follow the spec
For backward-compatibility, a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group.
As per spec the parent element cannot be MAP which is what we are checking here. Spec is not clear on what the actual converted_type should be.
#7966 is to replace this PR |
Some existing parquet files incorrectly uses MAP_KEY_VALUE in place of MAP.
For backward compatibility, a group annotated with MAP_KEY_VALUE that is
not contained by a MAP-annotated group should be handled as a MAP-annotated
group. Details in
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
Fixes issue #7777