Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE #44739

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pulkomandy
Copy link

@pulkomandy pulkomandy commented Nov 15, 2024

This allows to store binary data of arbitrary length in a parquet file, without having to wrongly declare it as UTF-8.

Fixes the writer part of #42971

The reader part has already been fixed in 4d82549 and this uses a similar implementation, but with a stricter set of "exceptions" (only byte arrays with NONE type are allowed).

Rationale for this change

Hello,

We are trying to store binary data (in our case, dump of captured CAN messages) in a parquet file. The data has a variable length (from 0 to 8 bytes) and is not an UTF-8 string (or a text string at all). For this, physical type BYTE_ARRAY and logical type NONE seems appropriate.

Unfortunately, the parquet writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.

What changes are included in this PR?

Allow the parquet stream writer to store data in a BYTE_ARRAY with NONE logical type. The changes are based to similar changes made earlier to the stream reader.

Are these changes tested?

I'm not sure if this is the right way to fix this problem. I'm happy to add tests if needed after the general idea has been validated.

In particular, the NONE type does not assume ASCII text (with no NULL bytes inside), so the operator<<(const char* v) method may need to be excluded from this (and only allow UTF-8), what do you think? In that case, what would be the way of implementing this without making slightly different versions of CheckColumn for each case?

Are there any user-facing changes?

Parquet stream writer allows using BYTE_ARRAY witn NONE converted type for storage of arbitrary binary data.

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@mapleFU
Copy link
Member

mapleFU commented Nov 15, 2024

mind:

  1. a test for this
  2. Mark this pr for parquet stream writer?

@pulkomandy pulkomandy changed the title Allow writing BYTE_ARRAY with converted type NONE GH-#42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE Nov 15, 2024
@pulkomandy
Copy link
Author

Updated the existing test to cover this case, and modified the PR title.

The tests I found does not seem to validate the parquet file contents, is there another place where I could check that?

This allows to store binary data of arbitrary length in a parquet file,
without having to wrongly declare it as UTF-8.

Fixes the writer part of apache#42971

The reader part has already been fixed in 4d82549
and this uses a similar implementation, but with a stricter set of
"exceptions" (only byte arrays with NONE type are allowed).
@mapleFU mapleFU changed the title GH-#42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE Nov 17, 2024
Copy link

⚠️ GitHub issue #42971 has been automatically assigned in GitHub to PR creator.

"' has converted type[" +
ConvertedTypeToString(node->converted_type()) + "] not '" +
ConvertedTypeToString(converted_type) + "'");
// The converted type does not always match with the value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause should be at this line:

CheckColumn(Type::BYTE_ARRAY, ConvertedType::UTF8);

I think a clean fix might be creating a thin wrapper around const char*, std::string and std:string_view for binary data. Just like FixedStringView for the fixed length type:

StreamWriter& operator<<(FixedStringView v);

In this approach, we can safely call CheckColumn(Type::BYTE_ARRAY, ConvertedType::NONE); in it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants