Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parquet reader to handle when MAP_KEY_VALUE is used instead of MAP #7812

Closed
wants to merge 1 commit into from

Conversation

hitarth
Copy link
Collaborator

@hitarth hitarth commented Nov 30, 2023

Some existing parquet files incorrectly uses MAP_KEY_VALUE in place of MAP.
For backward compatibility, a group annotated with MAP_KEY_VALUE that is
not contained by a MAP-annotated group should be handled as a MAP-annotated
group. Details in
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
Fixes issue #7777

@facebook-github-bot
Copy link
Contributor

Hi @hitarth!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Copy link

netlify bot commented Nov 30, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 781f3ef
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65721faf5108510008c5b525

@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 30, 2023
@hitarth hitarth changed the title Fixed parquet reader to handle when MAP_KEY_VALUE is used instead of MAP Fix parquet reader to handle when MAP_KEY_VALUE is used instead of MAP Nov 30, 2023
@yingsu00 yingsu00 self-requested a review December 1, 2023 17:23
std::nullopt,
maxRepeat,
maxDefine);
if (children.size() == 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hitarth I don't think this fix is very accurate. According to the spec, the rule to test the backward compatibility is "a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group.". You're checking its children count, not if its parent is a MAP node. This may somewhat work for the map type, but won't work for list.

I think the correct fix would be following exactly the rules defined in the spec, i.e. check its parent node's type. If it's parent is not MAP, then it would be treated as MAP instead of MAP_KEY_VALUE. For that, we will need to pass in the parent schemaIdx. We can add a new parameter "uint32_t parentSchemaIdx" so that when a node is MAP_KEY_VALUE, we can check its parent type by schema[parentSchemaIdx].converted_type. Similarly we can do the same for list. Will you be able to try this approach to see if it works?
Please also try to reduce the amount of repeated code in your new rewrite.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, let me try adding parentSchemaIdx and rewrite this.

maxDefine);
if (parentSchemaElement.converted_type !=
thrift::ConvertedType::MAP) {
// MAP_KEY_VALUE is incorrectly used in place of MAP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this is assuming here that, if it is not thrift::ConvertedType::MAP, it must be a thrift::ConvertedType::MAP_KEY_VALUE?

For more better robustness, to actually handle unexpected types, would it be better if we change it to explicit checking? e.g.


if (parentSchemaElement.converted_type == thrift::ConvertedType::MAP) {
    // do something...
} else if (parentSchemaElement.converted_type == thrift::ConvertedType::MAP_KEY_VALUE) {
    // do something
} else {
    // throw an unexpected type error
}

Copy link
Collaborator Author

@hitarth hitarth Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to follow the spec

For backward-compatibility, a group annotated with MAP_KEY_VALUE that is not contained by a MAP-annotated group should be handled as a MAP-annotated group.

As per spec the parent element cannot be MAP which is what we are checking here. Spec is not clear on what the actual converted_type should be.

velox/dwio/parquet/reader/ParquetReader.cpp Outdated Show resolved Hide resolved
velox/dwio/parquet/tests/reader/ParquetReaderTest.cpp Outdated Show resolved Hide resolved
@yingsu00
Copy link
Collaborator

#7966 is to replace this PR

@yingsu00
Copy link
Collaborator

@hitarth Can we close this PR now? #7966 is already merged.

@hitarth hitarth closed this Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. parquet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants