-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE #44739
base: main
Are you sure you want to change the base?
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
mind:
|
Updated the existing test to cover this case, and modified the PR title. The tests I found does not seem to validate the parquet file contents, is there another place where I could check that? |
This allows to store binary data of arbitrary length in a parquet file, without having to wrongly declare it as UTF-8. Fixes the writer part of apache#42971 The reader part has already been fixed in 4d82549 and this uses a similar implementation, but with a stricter set of "exceptions" (only byte arrays with NONE type are allowed).
|
"' has converted type[" + | ||
ConvertedTypeToString(node->converted_type()) + "] not '" + | ||
ConvertedTypeToString(converted_type) + "'"); | ||
// The converted type does not always match with the value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root cause should be at this line:
arrow/cpp/src/parquet/stream_writer.cc
Line 145 in 4c2aef7
CheckColumn(Type::BYTE_ARRAY, ConvertedType::UTF8); |
I think a clean fix might be creating a thin wrapper around const char*
, std::string
and std:string_view
for binary data. Just like FixedStringView
for the fixed length type:
arrow/cpp/src/parquet/stream_writer.h
Line 147 in 4c2aef7
StreamWriter& operator<<(FixedStringView v); |
In this approach, we can safely call CheckColumn(Type::BYTE_ARRAY, ConvertedType::NONE);
in it.
This allows to store binary data of arbitrary length in a parquet file, without having to wrongly declare it as UTF-8.
Fixes the writer part of #42971
The reader part has already been fixed in 4d82549 and this uses a similar implementation, but with a stricter set of "exceptions" (only byte arrays with NONE type are allowed).
Rationale for this change
Hello,
We are trying to store binary data (in our case, dump of captured CAN messages) in a parquet file. The data has a variable length (from 0 to 8 bytes) and is not an UTF-8 string (or a text string at all). For this, physical type BYTE_ARRAY and logical type NONE seems appropriate.
Unfortunately, the parquet writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.
What changes are included in this PR?
Allow the parquet stream writer to store data in a BYTE_ARRAY with NONE logical type. The changes are based to similar changes made earlier to the stream reader.
Are these changes tested?
I'm not sure if this is the right way to fix this problem. I'm happy to add tests if needed after the general idea has been validated.
In particular, the NONE type does not assume ASCII text (with no NULL bytes inside), so the
operator<<(const char* v)
method may need to be excluded from this (and only allow UTF-8), what do you think? In that case, what would be the way of implementing this without making slightly different versions of CheckColumn for each case?Are there any user-facing changes?
Parquet stream writer allows using BYTE_ARRAY witn NONE converted type for storage of arbitrary binary data.