Temporary fix Parquet metadata with empty value string being ignored from writing (#14026)

When writing Parquet files, Spark needs to write key-value string pairs into the files' metadata. Sometimes a value is an empty string. Such an empty string is skipped when writing the file, so applications that later read it (such as Spark) interpret the value as `null` instead of the empty string in the original input, as described in #14024. This is incorrect and led to data corruption in my testing.

This PR intentionally replaces an empty value string with a single space character to work around the bug. This is a temporary fix until a better fix is available.
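The workaround can be sketched in isolation as follows. This is a minimal illustration, not the code in `TableJni.cpp`: the function name `build_kv_metadata` and the use of `std::vector` for the inputs are assumptions made so the example is self-contained. It builds the metadata map the same way the patched code does, substituting a single space for any empty value.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of cudf): pairs each key with its value and
// replaces an empty value with a single space so the entry is not skipped
// by the writer.
std::map<std::string, std::string> build_kv_metadata(
    std::vector<std::string> const &meta_keys,
    std::vector<std::string> const &meta_values)
{
  if (meta_keys.size() != meta_values.size()) {
    throw std::invalid_argument("keys and values must have the same length");
  }

  std::map<std::string, std::string> kv_metadata;
  std::transform(meta_keys.begin(), meta_keys.end(), meta_values.begin(),
                 std::inserter(kv_metadata, kv_metadata.end()),
                 [](auto const &key, auto const &value) {
                   // Empty values become " "; all other values pass through.
                   return std::make_pair(key, value.empty() ? std::string(" ") : value);
                 });
  return kv_metadata;
}
```

With this sketch, `build_kv_metadata({"a", "b"}, {"", "x"})` yields a map in which `"a"` maps to `" "` and `"b"` maps to `"x"`. Note that a reader then sees a single space rather than the original empty string, which is why the change is described as temporary.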

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #14026
ttnghia authored Sep 6, 2023
1 parent 1d7a77b commit 609f894
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions java/src/main/native/src/TableJni.cpp
@@ -1592,7 +1592,11 @@ JNIEXPORT long JNICALL Java_ai_rapids_cudf_Table_writeParquetBufferBegin(
     std::map<std::string, std::string> kv_metadata;
     std::transform(meta_keys.begin(), meta_keys.end(), meta_values.begin(),
                    std::inserter(kv_metadata, kv_metadata.end()),
-                   [](auto const &key, auto const &value) { return std::make_pair(key, value); });
+                   [](auto const &key, auto const &value) {
+                     // The metadata value will be ignored if it is empty.
+                     // We modify it into a space character to workaround such issue.
+                     return std::make_pair(key, value.empty() ? std::string(" ") : value);
+                   });

     auto stats = std::make_shared<cudf::io::writer_compression_statistics>();
     chunked_parquet_writer_options opts =
@@ -1638,7 +1642,11 @@ JNIEXPORT long JNICALL Java_ai_rapids_cudf_Table_writeParquetFileBegin(
     std::map<std::string, std::string> kv_metadata;
     std::transform(meta_keys.begin(), meta_keys.end(), meta_values.begin(),
                    std::inserter(kv_metadata, kv_metadata.end()),
-                   [](auto const &key, auto const &value) { return std::make_pair(key, value); });
+                   [](auto const &key, auto const &value) {
+                     // The metadata value will be ignored if it is empty.
+                     // We modify it into a space character to workaround such issue.
+                     return std::make_pair(key, value.empty() ? std::string(" ") : value);
+                   });

     sink_info sink{output_path.get()};
     auto stats = std::make_shared<cudf::io::writer_compression_statistics>();