Introduce binary_as_string parquet option, upgrade to arrow/parquet 53.2.0 #12816

Merged: 4 commits, Oct 25, 2024

@goldmedal (Contributor) commented Oct 8, 2024

Which issue does this PR close?

Closes #12788.
Closes #13042

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the core (Core DataFusion crate), common (Related to common crate), and proto (Related to proto crate) labels Oct 8, 2024
@alamb changed the title from "Introdcue binary_as_string parquet option" to "Introduce binary_as_string parquet option" Oct 8, 2024
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 10, 2024
@alamb (Contributor) commented Oct 10, 2024

Submitted apache/arrow-rs#6539 for it.

@goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ?

Then we could get this PR ready to go (and I could use it to test with string view on by default)

@goldmedal (Contributor, Author)

> @goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ?
>
> Then we could get this PR ready to go (and I could use it to test with string view on by default)

Sure, feel free to do it. Thanks!

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 10, 2024
@alamb (Contributor) commented Oct 10, 2024

> > @goldmedal would you be ok if I pushed a change to this PR to temporarily patch arrow-rs to include the fix for apache/arrow-rs#6539 ?
> > Then we could get this PR ready to go (and I could use it to test with string view on by default)
>
> Sure, feel free to do it. Thanks!

I pushed the change in d7c3565

I am about out of time to work on this today, but if no one else gets a chance to do this I'll try and polish this PR up tomorrow

@goldmedal (Contributor, Author)

> I am about out of time to work on this today, but if no one else gets a chance to do this I'll try and polish this PR up tomorrow

I can help finish the remaining work before I go to sleep today (making sure this PR works well).

field.is_nullable(),
))
}
(Some(DataType::LargeUtf8), DataType::LargeBinary) => {
Contributor Author

Actually, this case isn't covered by the tests because the Arrow reader always marks BYTE_ARRAY as Utf8 or Binary. I'm not sure whether we need it. 🤔

// string-to-view transformation. So we need all binary types to be coerced to `Utf8View` here.
(
Some(DataType::Utf8View),
DataType::Binary | DataType::LargeBinary | DataType::BinaryView,
Contributor Author

Same as above: the tests don't cover the DataType::LargeBinary and DataType::BinaryView cases.
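
For readers following the two review comments above, here is a minimal sketch of the kind of type coercion being discussed. It is not the code from this PR: the `DataType` enum is a simplified stand-in for Arrow's type, `coerce_binary_to_string` is a hypothetical helper name, and the `Utf8`/`Binary` arm is an assumption, since only the `LargeUtf8` and `Utf8View` arms are visible in the diff fragments.

```rust
/// Simplified stand-in for the subset of Arrow's `DataType` mentioned in
/// this thread. This is NOT the real `arrow::datatypes::DataType`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Binary,
    LargeBinary,
    BinaryView,
    Int64,
}

/// Hypothetical helper: given the string type that was requested (if any)
/// and the type the file reader reported for a column, return the type the
/// column should be coerced to, or `None` when no coercion applies.
fn coerce_binary_to_string(
    requested: Option<&DataType>,
    file_type: &DataType,
) -> Option<DataType> {
    match (requested, file_type) {
        // Assumed arm: not shown in the diff fragments above.
        (Some(DataType::Utf8), DataType::Binary) => Some(DataType::Utf8),
        // Mirrors the first fragment: LargeBinary read as LargeUtf8.
        (Some(DataType::LargeUtf8), DataType::LargeBinary) => Some(DataType::LargeUtf8),
        // Mirrors the second fragment: every binary type goes to Utf8View.
        (
            Some(DataType::Utf8View),
            DataType::Binary | DataType::LargeBinary | DataType::BinaryView,
        ) => Some(DataType::Utf8View),
        // Anything else is left untouched.
        _ => None,
    }
}
```

In this sketch the `LargeBinary` and `BinaryView` inputs are handled even though, per the comments above, the Arrow reader may never produce them for BYTE_ARRAY columns, which is why those arms are hard to exercise from a real Parquet file.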

@goldmedal (Contributor, Author)

@alamb
I have confirmed that this feature works well and added some tests for it. My only remaining concerns are the ones in #12816 (comment).
Feel free to push any changes if needed.

@alamb (Contributor) commented Oct 11, 2024

I am starting to play around with this PR / write some tests. Will post my updates shortly

@alamb (Contributor) left a comment

Hi @goldmedal -- I played around with this locally with the hits_partitioned file and I am happy to say it seems to get the correct schema.

I am going to play around with this a bit more / test with some other feature branches

| logical_plan  | Sort: l DESC NULLS FIRST, fetch=25 |
|               |   Projection: regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1")) AS k, avg(character_length(hits_partitioned.Referer)) AS l, count(*) AS c, min(hits_partitioned.Referer) |
|               |     Filter: count(*) > Int64(100000) |
|               |       Aggregate: groupBy=[[regexp_replace(hits_partitioned.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"), Utf8("\1"))]], aggr=[[avg(CAST(character_length(hits_partitioned.Referer) AS Float64)), count(Int64(1)) AS count(*), min(hits_partitioned.Referer)]] |
|               |         Filter: hits_partitioned.Referer != Utf8View("") |
|               |           TableScan: hits_partitioned projection=[Referer], partial_filters=[hits_partitioned.Referer != Utf8View("")] |
| physical_plan | SortPreservingMergeExec: [l@1 DESC], fetch=25 |
|               |   SortExec: TopK(fetch=25), expr=[l@1 DESC], preserve_partitioning=[true] |
|               |     ProjectionExec: expr=[regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))@0 as k, avg(character_length(hits_partitioned.Referer))@1 as l, count(*)@2 as c, min(hits_partitioned.Referer)@3 as min(hits_partitioned.Referer)] |
|               |       CoalesceBatchesExec: target_batch_size=8192 |
|               |         FilterExec: count(*)@2 > 100000 |
|               |           AggregateExec: mode=FinalPartitioned, gby=[regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))@0 as regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))], aggr=[avg(character_length(hits_partitioned.Referer)), count(*), min(hits_partitioned.Referer)] |
|               |             CoalesceBatchesExec: target_batch_size=8192
                                                                                                                                                                                                                  |
|               |               RepartitionExec: partitioning=Hash([regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))@0], 16), input_partitions=16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                  |
|               |                 AggregateExec: mode=Partial, gby=[regexp_replace(Referer@0, ^https?://(?:www\.)?([^/]+)/.*$, \1) as regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1"))], aggr=[avg(character_length(hits_partitioned.Referer)), count(*), min(hits_partitioned.Referer)]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                  |
|               |                   CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                  |
|               |                     FilterExec: Referer@0 !=                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                  |
|               |                       ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/hits_partitioned/hits_0.parquet:0..122446530, Users/andrewlamb/Downloads/hits_partitioned/hits_1.parquet:0..174965044, Users/andrewlamb/Downloads/hits_partitioned/hits_10.parquet:0..101513258, Users/andrewlamb/Downloads/hits_partitioned/hits_11.parquet:0..118419888, Users/andrewlamb/Downloads/hits_partitioned/hits_12.parquet:0..149514164, ...], [Users/andrewlamb/Downloads/hits_partitioned/hits_14.parquet:108113265..151121699, Users/andrewlamb/Downloads/hits_partitioned/hits_15.parquet:0..103098894, Users/andrewlamb/Downloads/hits_partitioned/hits_16.parquet:0..101067219, Users/andrewlamb/Downloads/hits_partitioned/hits_17.parquet:0..116867853, Users/andrewlamb/Downloads/hits_partitioned/hits_18.parquet:0..133119589, ...], [Users/andrewlamb/Downloads/hits_partitioned/hits_21.parquet:3887560..113455196, Users/andrewlamb/Downloads/hits_partitioned/hits_22.parquet:0..79775901, Users/andrewlamb/Downloads/hits_partitioned/hits_23.parquet:0..79631107, Users/andrewlamb/Downloads/hits_partitioned/hits_24.parquet:0..78257049, Users/andrewlamb/Downloads/hits_partitioned/hits_25.parquet:0..144169728, ...], [Users/andrewlamb/Downloads/hits_partitioned/hits_28.parquet:106905624..162772407, Users/andrewlamb/Downloads/hits_partitioned/hits_29.parquet:0..79213288, Users/andrewlamb/Downloads/hits_partitioned/hits_3.parquet:0..192507052, Users/andrewlamb/Downloads/hits_partitioned/hits_30.parquet:0..124187913, Users/andrewlamb/Downloads/hits_partitioned/hits_31.parquet:0..123065410, ...], [Users/andrewlamb/Downloads/hits_partitioned/hits_35.parquet:54087340..153632381, Users/andrewlamb/Downloads/hits_partitioned/hits_36.parquet:0..92487304, Users/andrewlamb/Downloads/hits_partitioned/hits_37.parquet:0..108247781, Users/andrewlamb/Downloads/hits_partitioned/hits_38.parquet:0..132005180, Users/andrewlamb/Downloads/hits_partitioned/hits_39.parquet:0..103522954, ...], ...]}, 
projection=[Referer], predicate=Referer@14 != , pruning_predicate=CASE WHEN Referer_null_count@2 = Referer_row_count@3 THEN false ELSE Referer_min@0 !=  OR  != Referer_max@1 END, required_guarantees=[Referer not in ()] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                  |
+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.

@alamb
Contributor

alamb commented Oct 11, 2024

I am running some benchmarks on this PR

@alamb
Contributor

alamb commented Oct 13, 2024

Here is the performance of this PR. Some queries are slower, some are faster.

I believe once we turn on string view everything will be faster.

--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ feature_12788-binary-as-string-… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.26ms │                           2.22ms │     no change │
│ QQuery 1     │    38.78ms │                          38.16ms │     no change │
│ QQuery 2     │    92.96ms │                          94.66ms │     no change │
│ QQuery 3     │    98.39ms │                         100.72ms │     no change │
│ QQuery 4     │   928.83ms │                         929.22ms │     no change │
│ QQuery 5     │   973.20ms │                         988.39ms │     no change │
│ QQuery 6     │    33.75ms │                          36.60ms │  1.08x slower │
│ QQuery 7     │    42.31ms │                          42.33ms │     no change │
│ QQuery 8     │  1383.69ms │                        1364.35ms │     no change │
│ QQuery 9     │  1323.14ms │                        1361.55ms │     no change │
│ QQuery 10    │   351.92ms │                         413.62ms │  1.18x slower │
│ QQuery 11    │   400.63ms │                         462.88ms │  1.16x slower │
│ QQuery 12    │  1095.05ms │                        1101.34ms │     no change │
│ QQuery 13    │  1753.33ms │                        1676.37ms │     no change │
│ QQuery 14    │  1220.66ms │                        1253.31ms │     no change │
│ QQuery 15    │  1099.67ms │                        1091.86ms │     no change │
│ QQuery 16    │  2530.60ms │                        2536.06ms │     no change │
│ QQuery 17    │  2299.37ms │                        2317.53ms │     no change │
│ QQuery 18    │  5042.06ms │                        4983.43ms │     no change │
│ QQuery 19    │    94.19ms │                          95.74ms │     no change │
│ QQuery 20    │  1720.68ms │                        1493.36ms │ +1.15x faster │
│ QQuery 21    │  2074.60ms │                        1824.64ms │ +1.14x faster │
│ QQuery 22    │  5257.20ms │                        3143.86ms │ +1.67x faster │
│ QQuery 23    │ 10530.69ms │                       10229.16ms │     no change │
│ QQuery 24    │   590.29ms │                         654.78ms │  1.11x slower │
│ QQuery 25    │   489.54ms │                         523.81ms │  1.07x slower │
│ QQuery 26    │   653.48ms │                         714.03ms │  1.09x slower │
│ QQuery 27    │  2585.81ms │                        2285.59ms │ +1.13x faster │
│ QQuery 28    │ 15372.17ms │                       15562.00ms │     no change │
│ QQuery 29    │   530.70ms │                         529.82ms │     no change │
│ QQuery 30    │  1031.82ms │                        1094.30ms │  1.06x slower │
│ QQuery 31    │  1121.86ms │                        1139.77ms │     no change │
│ QQuery 32    │  4358.25ms │                        4290.06ms │     no change │
│ QQuery 33    │  5154.15ms │                        5209.55ms │     no change │
│ QQuery 34    │  5133.94ms │                        5172.32ms │     no change │
│ QQuery 35    │  1947.97ms │                        1895.07ms │     no change │
│ QQuery 36    │   270.40ms │                         262.39ms │     no change │
│ QQuery 37    │   121.52ms │                         126.73ms │     no change │
│ QQuery 38    │   143.79ms │                         141.60ms │     no change │
│ QQuery 39    │   758.06ms │                         767.78ms │     no change │
│ QQuery 40    │    52.74ms │                          56.09ms │  1.06x slower │
│ QQuery 41    │    48.60ms │                          50.59ms │     no change │
│ QQuery 42    │    64.42ms │                          63.09ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
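The Change column can be reproduced from the two timing columns. The following is a minimal sketch, assuming the comparison script treats differences within 5% of the baseline as noise. That threshold is an assumption: it is consistent with every row above but is not stated in this thread. `describe_change` is a hypothetical name, not part of the benchmark tooling.

```rust
/// Hypothetical helper: label the change between a baseline and a new timing.
/// The 5% noise threshold is an assumption inferred from the table above.
fn describe_change(base_ms: f64, new_ms: f64) -> String {
    const NOISE_THRESHOLD: f64 = 0.05;
    // Relative change with respect to the baseline (positive means slower).
    let relative = (new_ms - base_ms) / base_ms;
    if relative.abs() <= NOISE_THRESHOLD {
        "no change".to_string()
    } else if relative > 0.0 {
        // Slower: how many times longer the new run took.
        format!("{:.2}x slower", new_ms / base_ms)
    } else {
        // Faster: how many times longer the baseline took.
        format!("+{:.2}x faster", base_ms / new_ms)
    }
}

fn main() {
    // Timings copied from the rows for Query 0, Query 10 and Query 22.
    for (query, base, new) in [(0, 2.26, 2.22), (10, 351.92, 413.62), (22, 5257.20, 3143.86)] {
        println!("Query {query}: {}", describe_change(base, new));
    }
}
```

With these inputs the sketch yields the same labels as the table for Query 0, Query 10 and Query 22.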

@alamb alamb force-pushed the feature/12788-binary-as-string-opt branch from 3a62740 to abce0e9 Compare October 13, 2024 12:11
@alamb
Contributor

alamb commented Oct 13, 2024

I rebased / squashed all the code in this branch so it would be easier to pull in to test in #12092

@goldmedal
Contributor Author

Here is the performance of this PR. Some queries are slower, some are faster.

I believe once we turn on string view everything will be faster.

Thanks @alamb It's interesting 🤔
Does this benchmark only include the change made by this PR, or does it include others?
It seems there are many queries slowed down by this PR.

Before this PR, the casting flow is

Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)

Now, it's

Binary(parquet) -> StringView(arrow)

In theory, this saves two steps (including the most expensive ones), so I have no idea why those queries would be slower.
I might try profiling the slower cases 🤔

@alamb
Contributor

alamb commented Oct 14, 2024

Here is the performance of this PR. Some queries are slower, some are faster.
I believe once we turn on string view everything will be faster.

Thanks @alamb It's interesting 🤔 Does this benchmark only include the change made by this PR, or does it include others? It seems there are many queries slowed down by this PR.

It only includes changes made by this PR

The results with several other changes are here: #12092 (comment) (and they are all faster 🎉 )

Before this PR, the casting flow is

Binary(parquet) -> Binary(arrow) -> BinaryView(arrow) -> StringView(arrow)

Now, it's

Binary(parquet) -> StringView(arrow)

In theory, this saves two steps (including the most expensive ones), so I have no idea why those queries would be slower. I might try profiling the slower cases 🤔

I think the reason it is slower is that there are some operations in the hash grouping code that have specializations for StringArray/BinaryArray but do not (yet) have specializations for StringView. Specifically

So while this PR makes the scan faster, the total time is slower because those unspecialized paths dominate the query time.

When they are all put together we get the speedup we have been looking for

@alamb
Contributor

alamb commented Oct 14, 2024

Since this PR requires a change to arrow-rs, I think there is no particular rush to merge it in -- I have a few thoughts about how to make the code a bit simpler and hope to propose some changes over the next few days

@alamb
Contributor

alamb commented Oct 16, 2024

I am starting to get this PR ready

parquet_options.schema_force_view_types = self.common.force_view_types;
// The hits_partitioned dataset specifies string columns
// as binary due to how it was written. Force it to strings
parquet_options.binary_as_string = true;
Contributor

we will have to mirror these options in the actual clickbench run

Contributor

Added a note in #13099

Field::new(field.name(), DataType::BinaryView, field.is_nullable())
.with_metadata(field.metadata().to_owned()),
),
DataType::Utf8 | DataType::LargeUtf8 => {
Contributor

I reduced repetition of this code by encapsulating creating a new FieldRef from an existing Field in a function. I also think this will avoid the potential loss of metadata
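As an illustration of the refactor described above, here is a minimal sketch of such a helper. `DataType`, `Field` and `FieldRef` below are simplified stand-ins written for this example, not the real arrow types, and the names `field_with_new_type` and `binary_to_string` as well as the metadata values are assumptions. The point is only that cloning the existing field and swapping its type keeps the name, nullability and metadata without repeating them at every call site.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for arrow's `DataType` and `Field` (NOT the real types).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Binary,
    LargeBinary,
    BinaryView,
    Utf8,
    LargeUtf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

type FieldRef = Arc<Field>;

/// Hypothetical helper: copy `field`, changing only its data type.
/// Name, nullability and metadata are carried over by the clone.
fn field_with_new_type(field: &FieldRef, new_type: DataType) -> FieldRef {
    Arc::new(Field {
        data_type: new_type,
        ..field.as_ref().clone()
    })
}

/// Hypothetical mapping in the spirit of `binary_as_string`:
/// binary-typed fields become the corresponding string type.
fn binary_to_string(field: &FieldRef) -> FieldRef {
    match field.data_type {
        DataType::Binary => field_with_new_type(field, DataType::Utf8),
        DataType::LargeBinary => field_with_new_type(field, DataType::LargeUtf8),
        DataType::BinaryView => field_with_new_type(field, DataType::Utf8View),
        // Everything else is returned unchanged (cheap `Arc` clone).
        _ => Arc::clone(field),
    }
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert("origin".to_string(), "example".to_string());
    let original: FieldRef = Arc::new(Field {
        name: "Referer".to_string(),
        data_type: DataType::Binary,
        nullable: true,
        metadata,
    });
    let converted = binary_to_string(&original);
    println!("{:?} -> {:?}", original.data_type, converted.data_type);
    println!(
        "name={}, nullable={}, metadata={:?}",
        converted.name, converted.nullable, converted.metadata
    );
}
```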

Contributor Author

Thanks! It makes sense to me. 👍

Contributor

I plan to try and accelerate this PR along with a new arrow-rs release: apache/arrow-rs#6341

Contributor

arrow is released. I think this PR is ready for review

@alamb alamb mentioned this pull request Oct 21, 2024
1 task
@alamb
Contributor

alamb commented Oct 24, 2024

Update: we are on track to release a version of arrow with the required fixes today. After that I will merge this PR up and get it ready for review ⏲️

@alamb alamb changed the title Introduce binary_as_string parquet option Introduce binary_as_string parquet option, upgrade to arrow/parquet 53.2.0 Oct 24, 2024
@alamb alamb force-pushed the feature/12788-binary-as-string-opt branch from a20ac87 to 9794e93 Compare October 24, 2024 20:00
@@ -70,22 +70,22 @@ version = "42.1.0"
ahash = { version = "0.8", default-features = false, features = [
"runtime-rng",
] }
-arrow = { version = "53.1.0", features = [
+arrow = { version = "53.2.0", features = [
Contributor

this update is required to have access to apache/arrow-rs#6539

Contributor

@alamb alamb left a comment

I realize it isn't ideal that I am approving a PR that I have most recently worked on. @goldmedal perhaps you can give it a final review to make sure I didn't mess anything else up

Comment on lines 478 to 489
# NB the data is read and displayed a StringView
query error DataFusion error: SQL error: ParserError\("Expected: an SQL statement, found: Utf8View"\)
select
arrow_typeof(binary_col), binary_col,
arrow_typeof(largebinary_col), largebinary_col,
arrow_typeof(binaryview_col), binaryview_col
FROM binary_as_string_both;
----
Utf8View aaa
Utf8View bbb
Utf8View ccc
Utf8View ddd
Contributor Author

Do you know why this query fails 🤔? It gets a ParserError, but I think it is valid SQL.
The test expects this query to fail, yet it still returns some results with the wrong schema.
The query should return 6 columns per row, but it only shows 2.
It's weird. 🤔

Contributor

This is an excellent find @goldmedal -- thank you

I debugged it and the issue was there was an extra space before ---- 🤦

 ----

vs

----

Fixed in a48dce1
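This class of mistake is easy to miss by eye, so a small check can help. One plausible reading of the error (an assumption, not confirmed in this thread) is that the indented line was no longer recognized as the result separator and was handed to the SQL parser together with the expected rows, where `--` starts a comment and `Utf8View` became the first token of the next statement. The sketch below flags such lines; `find_indented_separators` is a hypothetical helper, not part of the sqllogictest tooling.

```rust
/// Hypothetical lint: return the 1-based line numbers of lines that look like
/// a `----` result separator but start with whitespace.
fn find_indented_separators(slt: &str) -> Vec<usize> {
    slt.lines()
        .enumerate()
        .filter(|(_, line)| {
            // Only dashes once trimmed, but the raw line begins with whitespace.
            line.trim() == "----" && line.starts_with(char::is_whitespace)
        })
        .map(|(idx, _)| idx + 1)
        .collect()
}

fn main() {
    // A shortened version of the test from this review, with the stray space.
    let slt = "query TT\n\
               select arrow_typeof(binary_col), binary_col FROM binary_as_string_both;\n \
               ----\n\
               Utf8View aaa\n";
    for line_no in find_indented_separators(slt) {
        println!("line {line_no}: result separator has leading whitespace");
    }
}
```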

Contributor Author

@goldmedal goldmedal left a comment

Thanks, @alamb. I have one question about the test. Others look good to me.

@alamb
Contributor

alamb commented Oct 25, 2024

Thanks, @alamb. I have one question about the test. Others look good to me.

Thanks @goldmedal -- that was an excellent catch. I have fixed the issue

@alamb alamb merged commit 13a4225 into apache:main Oct 25, 2024
27 checks passed
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Performance: Add "read strings as binary" option for parquet
2 participants