GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE #44739

pulkomandy · 2024-11-15T12:12:29Z

This allows storing binary data of arbitrary length in a Parquet file without having to wrongly declare it as UTF-8.

Fixes the writer part of #42971

The reader part has already been fixed in 4d82549 and this uses a similar implementation, but with a stricter set of "exceptions" (only byte arrays with NONE type are allowed).

Rationale for this change

Hello,

We are trying to store binary data (in our case, a dump of captured CAN messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes) and is not a UTF-8 string (or a text string at all). For this, physical type BYTE_ARRAY and logical type NONE seem appropriate.

Unfortunately, the parquet writer will not let us do that. We can do either fixed length and converted type NONE, or variable length and converted type UTF-8. This change relaxes the type check on byte arrays to allow use of the NONE converted type.
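To make the intended relaxation concrete, here is a minimal toy model of the check. It is only a sketch: `PhysicalType`, `ConvertedType`, `ColumnSpec`, `StrictCheck` and `RelaxedCheck` are hypothetical stand-ins written for illustration, not the Arrow/Parquet types or the actual patch. It assumes the rule described above: an exact match is always accepted, and a BYTE_ARRAY column declared as NONE is the only extra case allowed.

```cpp
// Toy stand-ins for the Parquet physical and converted types; only the
// values needed for this illustration are listed (assumption).
enum class PhysicalType { BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };
enum class ConvertedType { NONE, UTF8 };

// What the schema declares for one column.
struct ColumnSpec {
  PhysicalType physical;
  ConvertedType converted;
};

// Strict rule: the column must match exactly what the writer overload expects.
bool StrictCheck(const ColumnSpec& column, PhysicalType physical,
                 ConvertedType converted) {
  return column.physical == physical && column.converted == converted;
}

// Relaxed rule (assumed shape of the change): exact matches pass, and a
// BYTE_ARRAY column declared with converted type NONE is also accepted.
bool RelaxedCheck(const ColumnSpec& column, PhysicalType physical,
                  ConvertedType converted) {
  if (StrictCheck(column, physical, converted)) return true;
  return physical == PhysicalType::BYTE_ARRAY &&
         column.physical == PhysicalType::BYTE_ARRAY &&
         column.converted == ConvertedType::NONE;
}
```

With this model, a string overload that expects (BYTE_ARRAY, UTF8) is rejected by `StrictCheck` on a (BYTE_ARRAY, NONE) column but accepted by `RelaxedCheck`, while a FIXED_LEN_BYTE_ARRAY column is still rejected.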

What changes are included in this PR?

Allow the Parquet stream writer to store data in a BYTE_ARRAY with NONE logical type. The changes are based on similar changes made earlier to the stream reader.

Are these changes tested?

I'm not sure if this is the right way to fix this problem. I'm happy to add tests if needed after the general idea has been validated.

In particular, the NONE type does not assume ASCII text (with no NULL bytes inside), so the operator<<(const char* v) method may need to be excluded from this change and remain UTF-8 only. What do you think? In that case, how could this be implemented without making slightly different versions of CheckColumn for each case?
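The concern about a bare `const char*` can be illustrated with standard C++ only. This is a small sketch, and the helper names `LengthAsCString` and `LengthAsView` are hypothetical. An interface that receives only a pointer has to find the end of the data by scanning for a zero byte, whereas one that receives a pointer plus a size treats zero bytes as ordinary payload.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Length seen by an API that only receives a const char*: it stops at the
// first zero byte because no explicit size is available.
std::size_t LengthAsCString(const char* data) { return std::strlen(data); }

// Length seen by an API that receives pointer plus size: zero bytes are
// counted like any other byte.
std::size_t LengthAsView(std::string_view data) { return data.size(); }
```

For a 4-byte frame such as `{0x12, 0x00, 0x34, 0x00}`, the first helper reports 1 while the second reports 4 when given `std::string_view(frame, 4)`, so binary payloads like CAN frames would be silently truncated by a pointer-only overload.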

Are there any user-facing changes?

The Parquet stream writer allows using BYTE_ARRAY with the NONE converted type to store arbitrary binary data.

GitHub Issue: [C++][Parquet] Forced UTF8 encoding of BYTE_ARRAY on stream::read/write #42971

github-actions · 2024-11-15T12:12:54Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}


mapleFU · 2024-11-15T12:15:09Z

Would you mind:

- adding a test for this
- marking this PR as being for the Parquet stream writer?

pulkomandy · 2024-11-15T12:33:01Z

Updated the existing test to cover this case, and modified the PR title.

The tests I found do not seem to validate the Parquet file contents. Is there another place where I could check that?

github-actions · 2024-11-17T15:50:28Z

⚠️ GitHub issue #42971 has been automatically assigned in GitHub to PR creator.

wgtmac · 2024-11-18T01:50:36Z

cpp/src/parquet/stream_writer.cc

-                           "' has converted type[" +
-                           ConvertedTypeToString(node->converted_type()) + "] not '" +
-                           ConvertedTypeToString(converted_type) + "'");
+    // The converted type does not always match with the value


The root cause should be at this line:

arrow/cpp/src/parquet/stream_writer.cc

Line 145 in 4c2aef7

CheckColumn(Type::BYTE_ARRAY, ConvertedType::UTF8);

I think a clean fix might be creating a thin wrapper around const char*, std::string and std::string_view for binary data, just like FixedStringView for the fixed-length type:

arrow/cpp/src/parquet/stream_writer.h

Line 147 in 4c2aef7

StreamWriter& operator<<(FixedStringView v);

In this approach, we can safely call CheckColumn(Type::BYTE_ARRAY, ConvertedType::NONE); in it.
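A rough sketch of what such a wrapper could look like, using a toy writer rather than the real parquet::StreamWriter. The names `BinaryView` and `ToyWriter` are hypothetical and chosen only for this illustration, and the toy `CheckColumn` compares converted types only. The point is that the overload selected by the argument type decides which converted type is checked, so the text overloads can stay UTF8-only.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

enum class ConvertedType { NONE, UTF8 };

// Hypothetical wrapper: marks a run of bytes as binary rather than text.
// The explicit size keeps embedded zero bytes intact.
struct BinaryView {
  std::string_view bytes;
  BinaryView(const char* data, std::size_t size) : bytes(data, size) {}
};

// Toy writer holding a single BYTE_ARRAY column; not parquet::StreamWriter.
class ToyWriter {
 public:
  explicit ToyWriter(ConvertedType column_type) : column_type_(column_type) {}

  // Text overload: only valid for a UTF8 column.
  ToyWriter& operator<<(std::string_view text) {
    CheckColumn(ConvertedType::UTF8);
    values_.emplace_back(text);
    return *this;
  }

  // Binary overload: only valid for a NONE column.
  ToyWriter& operator<<(BinaryView binary) {
    CheckColumn(ConvertedType::NONE);
    values_.emplace_back(binary.bytes);
    return *this;
  }

  const std::vector<std::string>& values() const { return values_; }

 private:
  // Throws when the column's converted type differs from the expected one.
  void CheckColumn(ConvertedType expected) const {
    if (column_type_ != expected) {
      throw std::invalid_argument("converted type mismatch");
    }
  }

  ConvertedType column_type_;
  std::vector<std::string> values_;
};
```

In this model, `writer << BinaryView(frame, sizeof(frame))` succeeds on a NONE column, while `writer << std::string_view("text")` on the same column throws, which mirrors how FixedStringView separates the fixed-length case from plain strings.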

pulkomandy requested a review from wgtmac as a code owner November 15, 2024 12:12

github-actions bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Nov 15, 2024

pulkomandy changed the title ~~Allow writing BYTE_ARRAY with converted type NONE~~ GH-#42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE Nov 15, 2024

pulkomandy force-pushed the main branch from e50ec6b to ae87ec3 Compare November 15, 2024 12:31

pulkomandy force-pushed the main branch from ae87ec3 to 7a061c3 Compare November 15, 2024 13:11

pulkomandy force-pushed the main branch from 7a061c3 to 22663c1 Compare November 15, 2024 14:45

mapleFU changed the title ~~GH-#42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE~~ GH-42971: [C++] Parquet stream writer: Allow writing BYTE_ARRAY with converted type NONE Nov 17, 2024

wgtmac reviewed Nov 18, 2024
