Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better E2EFilterTest for parquet in velox #7478

Open
qqibrow opened this issue Nov 8, 2023 · 6 comments
Open

Better E2EFilterTest for parquet in velox #7478

qqibrow opened this issue Nov 8, 2023 · 6 comments
Labels
enhancement New feature or request parquet

Comments

@qqibrow
Copy link
Collaborator

qqibrow commented Nov 8, 2023

Description

Hi, we recently started evaluating velox and notice there are issues in parquet reader. We generate some parquet test files used in presto unit test and try read them using native parquet reader in velox. there are lots of test failures:

 parquet_test/testNestedMaps/test_12.parquet
 parquet_test/testNestedMaps/test_10.parquet
 parquet_test/testNestedMaps/test_18.parquet
 parquet_test/testNestedMaps/test_14.parquet
 parquet_test/testNestedMaps/test_16.parquet
 parquet_test/testArrayOfMapOfStruct/test_12.parquet
 parquet_test/testArrayOfMapOfStruct/test_10.parquet
 parquet_test/testArrayOfMapOfStruct/test_18.parquet
 parquet_test/testArrayOfMapOfStruct/test_14.parquet
 parquet_test/testArrayOfMapOfStruct/test_16.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_6.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_4.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_8.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_0.parquet
 parquet_test/testCustomSchemaArrayOfStructs/test_2.parquet
 parquet_test/testMapSchemas/test_46.parquet
 parquet_test/testMapSchemas/test_34.parquet
 parquet_test/testMapSchemas/test_12.parquet
 parquet_test/testMapSchemas/test_60.parquet
 parquet_test/testMapSchemas/test_36.parquet
 parquet_test/testMapSchemas/test_62.parquet
 parquet_test/testMapSchemas/test_10.parquet
 parquet_test/testMapSchemas/test_18.parquet
 parquet_test/testMapSchemas/test_32.parquet
 parquet_test/testMapSchemas/test_66.parquet
 parquet_test/testMapSchemas/test_14.parquet
 parquet_test/testMapSchemas/test_58.parquet
 parquet_test/testMapSchemas/test_38.parquet
 parquet_test/testMapSchemas/test_42.parquet
 parquet_test/testMapSchemas/test_30.parquet
 parquet_test/testMapSchemas/test_16.parquet
 parquet_test/testMapSchemas/test_64.parquet
 parquet_test/testMapOfSingleLevelArray/test_12.parquet
 parquet_test/testMapOfSingleLevelArray/test_10.parquet
 parquet_test/testMapOfSingleLevelArray/test_18.parquet
 parquet_test/testMapOfSingleLevelArray/test_14.parquet
 parquet_test/testMapOfSingleLevelArray/test_16.parquet
 parquet_test/testArrayOfMapOfArray/test_12.parquet
 parquet_test/testArrayOfMapOfArray/test_10.parquet
 parquet_test/testArrayOfMapOfArray/test_18.parquet
 parquet_test/testArrayOfMapOfArray/test_14.parquet
 parquet_test/testArrayOfMapOfArray/test_16.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_12.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_10.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_18.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_14.parquet
 parquet_test/testStructOfNullableMapBetweenNonNullFields/test_16.parquet
 parquet_test/testStructOfMaps/test_12.parquet
 parquet_test/testStructOfMaps/test_10.parquet
 parquet_test/testStructOfMaps/test_18.parquet
 parquet_test/testStructOfMaps/test_14.parquet
 parquet_test/testStructOfMaps/test_16.parquet
 parquet_test/testMapOfArrayKeys/test_12.parquet
 parquet_test/testMapOfArrayKeys/test_10.parquet
 parquet_test/testMapOfArrayKeys/test_18.parquet
 parquet_test/testMapOfArrayKeys/test_14.parquet
 parquet_test/testMapOfArrayKeys/test_16.parquet
 parquet_test/testMapOfStruct/test_12.parquet
 parquet_test/testMapOfStruct/test_10.parquet
 parquet_test/testMapOfStruct/test_18.parquet
 parquet_test/testMapOfStruct/test_14.parquet
 parquet_test/testMapOfStruct/test_16.parquet
 parquet_test/testMap/test_12.parquet
 parquet_test/testMap/test_10.parquet
 parquet_test/testMap/test_18.parquet
 parquet_test/testMap/test_14.parquet
 parquet_test/testMap/test_16.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_12.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_10.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_18.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_14.parquet
 parquet_test/testSingleLevelArrayOfMapOfArray/test_16.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_12.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_10.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_18.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_14.parquet
 parquet_test/testSingleLevelArrayOfMapOfStruct/test_16.parquet
 parquet_test/testMapWithNullValues/test_12.parquet
 parquet_test/testMapWithNullValues/test_10.parquet
 parquet_test/testMapWithNullValues/test_18.parquet
 parquet_test/testMapWithNullValues/test_14.parquet
 parquet_test/testMapWithNullValues/test_16.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_12.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_10.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_18.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_14.parquet
 parquet_test/testSingleLevelSchemaArrayOfMaps/test_16.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_6.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_4.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_8.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_0.parquet
 parquet_test/testSingleLevelSchemaArrayOfStructs/test_2.parquet
 parquet_test/testMapOfArrayValues/test_12.parquet
 parquet_test/testMapOfArrayValues/test_10.parquet
 parquet_test/testMapOfArrayValues/test_18.parquet
 parquet_test/testMapOfArrayValues/test_14.parquet
 parquet_test/testMapOfArrayValues/test_16.parquet
 parquet_test/testArrayOfMaps/test_12.parquet
 parquet_test/testArrayOfMaps/test_10.parquet
 parquet_test/testArrayOfMaps/test_18.parquet
 parquet_test/testArrayOfMaps/test_14.parquet
 parquet_test/testArrayOfMaps/test_16.parquet 

( name testArrayOfMaps means the test files are generated from testArrayOfMaps() in AbstractTestParquetReader )

There are two errors: one is #7002 and the other one is:

E1107 05:44:17.083400 1412506 Exceptions.h:69] Line: ../../velox/dwio/parquet/reader/ParquetReader.cpp:282, Function:getParquetColumnInfo, Expression: schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED (OPTIONAL vs. REPEATED), Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (OPTIONAL vs. REPEATED)
Retriable: False
Expression: schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED
Function: getParquetColumnInfo
File: ../../velox/dwio/parquet/reader/ParquetReader.cpp
Line: 282

for the time being we only tested meta data part not touch data correctness yet.

Can we think of building a better unit test infra for parquet? Is there some unit test best practice we can learn from orc support in velox? e.g, leveraging existing parquet testing in presto project is a good direction. To better test complex types, there is even a customized hive parquet writer (e.g, SingleLevelArraySchemaConverter) helps surfacing issues (this is the reason why #7002 is blocked at first because it's hard to reproduce)

@qqibrow qqibrow added the enhancement New feature or request label Nov 8, 2023
@qqibrow
Copy link
Collaborator Author

qqibrow commented Nov 28, 2023

We've implemented testing velox parquet using presto unit test. Here is the mini design doc: https://gist.github.com/qqibrow/689ed97b91cc0b58337be96a86291301

@qqibrow
Copy link
Collaborator Author

qqibrow commented Nov 28, 2023

Here are some of the issues we discovered:

add example file to reproduce: #7002
#7617
#7776
#7777
#7778
#7779

@waitinfuture
Copy link
Contributor

waitinfuture commented Dec 19, 2023

Hi @qqibrow , we also encountered data incorrectness when reading Map types, wonder if you also met such bugs, thanks!

@yingsu00
Copy link
Collaborator

@qqibrow Let's use this issue just for adding the new Velox unit tests, and use another separate issue for Presto Parquet reader tests for Velox. I created #8102 for that.

@xumingming
Copy link
Contributor

@qqibrow This is a very informative post, thanks! What's the status of Parquet support in Velox from your perspective?

@yingsu00
Copy link
Collaborator

@qqibrow Let's use this issue for enhancing the Velox e2eFilterTest. Adding some contents here:

Description
The current E2EFilterTests for DWRF and Parquet uses Velox DWRF writer and Arrow Parquet writer to test the readers with the following coverages:

With Varint or not (DWRF only)
With nulls or not
Wrap in struct or not
With filters or not
Test row group skipping or not
numCombinations number of different ScanSpecs
Customizations
What are missing when comparing with Presto's Parquet tests

Different compression schemes
Enable dictionary or not (In Velox E2EFilterTests, the tests with dictionaries were written in separate test cases)
Parquet writer version, V1 and V2
forward order and reverse order
forward order with nulls, reverse order with nulls
Wrap each element in ARRAY or not
With the limit of max Block(vector) size
Allow multiple sessions
More ARRAY tests:
empty arrays
random nesting levels up to 15
More combined types
Array of Structs of more nesting levels
Arrays of Arrays
Array of Array of Structs
Array of Array of Structs of Arrays
More MAP tests (similar to ARRAY)
More STRUCT tests (similar to ARRAY)

@yingsu00 yingsu00 changed the title Better unit test for parquet in velox Better E2EFilterTest for parquet in velox May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request parquet
Projects
None yet
Development

No branches or pull requests

4 participants