Clustering support
#13071
-
I'm thinking something along these lines: https://github.com/apache/arrow-rs/compare/master...adriangb:arrow-rs:clustering?expand=1 (a very, very rough idea).
-
Perhaps this belongs in arrow-rs, since most of the relevant code is over there, but it's really a query performance issue, so I'm sharing it here.
In our use case we take OpenTelemetry metrics and write them to hive-partitioned tables in GCS. We are currently partitioning by `metric_name`, but this has 2 issues: process.runtime.java.... which means that you end up with cardinality explosion from the language in there.
The obvious solution available today would be to not partition by `metric_name` and instead sort by it, but I fear that leads to bad query performance: often you want all of the data points for a given metric, and now you essentially have to do a full table scan to get them. Adding a bloom filter might help skip some row groups, but my gut feel is that it still wouldn't work out all that well.
I think an ideal solution for this would be some sort of clustering. Instead of partitioning per file, what if we could make a row group for each metric? That would play well with statistics since it then becomes a lot cheaper to read all of the data for a given metric (you still need to read the metadata to get the stats but beyond that you skip the row groups for all of the other metrics). I'd think this would work equally well for small and large files.
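To make the statistics argument concrete, here is a toy model in Rust. It is not the parquet crate's API: `RowGroupStats` and `row_groups_to_read` are made-up names for illustration, and the sketch assumes each row group records only the min and max of `metric_name`. It shows which row groups a reader would still have to touch for an equality filter on one metric.

```rust
/// Min/max statistics for the `metric_name` column of one row group.
/// A toy stand-in for the statistics Parquet keeps in the file metadata.
#[derive(Debug, Clone)]
struct RowGroupStats {
    min: String,
    max: String,
}

/// Return the indices of the row groups that may contain `metric`.
/// A row group can be skipped when `metric` falls outside its [min, max] range.
fn row_groups_to_read(stats: &[RowGroupStats], metric: &str) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min.as_str() <= metric && metric <= s.max.as_str())
        .map(|(i, _)| i)
        .collect()
}
```

With one row group per metric, min equals max, so a filter on a single metric selects exactly one row group. If metrics are interleaved across row groups, the ranges overlap and nothing can be skipped.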
Would this make sense as a feature? Something along the lines of `with_row_group_clustering_columns`? I imagine this could be useful in many other use cases.
Clustering is in general a broad topic, and I am also curious what the community thinks about it more broadly. Would it make sense to have something like Delta Lake's Liquid Clustering built into the Parquet writer?
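As a rough sketch of what an option like `with_row_group_clustering_columns` might do internally (that option does not exist today, and `cluster_boundaries` is a hypothetical helper), the writer could take rows already sorted by the clustering column and cut a new row group wherever the value changes. The snippet below only computes those cut points and does not touch the parquet crate.

```rust
use std::ops::Range;

/// Given the clustering column's values for rows that are already sorted by
/// that column, return one half-open row range per distinct value.
/// Each range would become its own row group.
fn cluster_boundaries(sorted_keys: &[&str]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=sorted_keys.len() {
        // Close the current range at the end of input or when the key changes.
        if i == sorted_keys.len() || sorted_keys[i] != sorted_keys[start] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}
```

For example, the keys `["a", "a", "b", "c", "c", "c"]` yield the ranges `0..2`, `2..3` and `3..6`. If I understand the current writer correctly, something similar could be approximated today by writing each range and then calling `ArrowWriter::flush` to force a row group boundary, but I have not verified how well that behaves.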