Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading MAP_KEY_VALUE Parquet SchemaElement #7966

Closed
wants to merge 1 commit into from

Conversation

yingsu00
Copy link
Collaborator

@yingsu00 yingsu00 commented Dec 11, 2023

This PR fixes issue #7777

In Parquet, the map type is annotated as MAP converted type nomally.
It should contain a repeated group annotated with MAP_KEY_VALUE,
which in turn contains two children key and value:

<map-repetition> group <name> (MAP) {
  repeated group key_value (MAP_KEY_VALUE) {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

But sometimes a group annotated with MAP_KEY_VALUE was incorrectly
used in place of MAP.

<map-repetition> group my_map (MAP_KEY_VALUE) {
  repeated group map {
    required binary key (UTF8);
    optional int32 value;
  }
}

For backward-compatibility, a MAP_KEY_VALUE that is not contained by
MAP should be treated as MAP. This commit makes the following changes:

  1. Adds a parentSchemaIdx to Parquet reader's
    getParquetColumnInfo() function to pass the parent schema.

  2. Differenciate the situations where a MAP_KEY_VALUE's parent is or
    is not a MAP. If it is, then it should be the repeated group that
    contains the key and value. If it is not, it should be treated the same
    as MAP.

For more information please check https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps

Copy link

netlify bot commented Dec 11, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 3fc657e
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6581a23abe3d6a0008e9ee69

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 11, 2023
@yingsu00
Copy link
Collaborator Author

cc @hitarth @chliang71 @qqibrow

@yingsu00 yingsu00 force-pushed the ParquetColumnInfo branch 3 times, most recently from 6af7c51 to 371ae04 Compare December 13, 2023 23:59
@yingsu00 yingsu00 marked this pull request as ready for review December 14, 2023 04:07
@yingsu00 yingsu00 requested review from xiaoxmeng and Yuhta December 14, 2023 04:07
VELOX_CHECK_EQ(
schemaElement.repetition_type,
thrift::FieldRepetitionType::REPEATED);
assert(children.size() == 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_CHECK_EQ

VELOX_CHECK_EQ(children.size(), 1);
auto child = children[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto& child

@yingsu00
Copy link
Collaborator Author

@Yuhta Resolved the comments, could you please review again? Thank you!

@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -94,6 +94,7 @@ class ReaderBase {
uint32_t maxSchemaElementIdx,
uint32_t maxRepeat,
uint32_t maxDefine,
uint32_t parentSchemaIdx,
Copy link
Contributor

@Yuhta Yuhta Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int32_t since you are using -1 as empty value and check parentSchemaIdx >= 0 below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta Thanks for catching this! I forgot to change the type before I submitted the PR. However, I just changed the initial value from -1 to 0 to make the code simpler (no need to check if it's >=0). This is ok because it's never required to check the parent of the root in getParquetColumnInfo(). The root is always named "hive_schema" with annotation of "UTF8" and cannot be any of the MAP_KEY_VALUE, MAP, LIST converted types.

  // Setting the parent schema index of the root("hive_schema") to be 0, which
  // is the root itself. This is ok because it's never required to check the
  // parent of the root in getParquetColumnInfo().
  schemaWithId_ = getParquetColumnInfo(
      maxSchemaElementIdx, maxRepeat, maxDefine, 0, schemaIdx, columnIdx);


const std::string sample(getExampleFilePath("map_key_value.parquet"));

facebook::velox::dwio::common::ReaderOptions readerOptions{defaultPool.get()};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultPool is not longer there in the trunk

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaultPool is not longer there in the trunk

Updated to leafPool_

In Parquet, the map type is annotated as MAP converted type nomally.
It should contain a repeated group annotated with MAP_KEY_VALUE,
which in turn contains two children key and value:

<map-repetition> group <name> (MAP) {
  repeated group key_value (MAP_KEY_VALUE) {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

But sometimes a group annotated with MAP_KEY_VALUE was incorrectly
used in place of MAP.

<map-repetition> group my_map (MAP_KEY_VALUE) {
  repeated group map {
    required binary key (UTF8);
    optional int32 value;
  }
}

For backward-compatibility, a MAP_KEY_VALUE that is not contained by
MAP should be treated as MAP. This commit makes the following changes:

1. Adds a parentSchemaIdx to Parquet reader's
getParquetColumnInfo() function to pass the parent schema.

2. Differenciate the situations where a MAP_KEY_VALUE's parent is or
is not a MAP. If it is, then it should be the repeated group that
contains the key and value. If it is not, it should be treated the same
as MAP.

For more information please check https://github.com/apache/parquet-
format/blob/master/LogicalTypes.md#maps
@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@yingsu00
Copy link
Collaborator Author

The benchmark failure is due to #8109

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 32e21c9.

Copy link

Conbench analyzed the 1 benchmark run on commit 32e21c9f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants