Clustering support #13071

adriangb · 2024-10-23T05:39:43Z

Perhaps this belongs in arrow-rs, since most of the relevant code is over there, but it's really a query performance issue, so I'm sharing it here.

In our use case we take OpenTelemetry metrics and write them to hive-partitioned tables in GCS. We currently partition by metric_name, but this has two issues:

1. It's too high cardinality for partitioning. OTEL has metric names like process.runtime.java...., so the language name embedded in the metric name causes a cardinality explosion.
2. It results in many small files: we may get 10k data points at a time (still small), but once you split those across 100 metrics you only have 100 rows per file, which is not great against object storage.

The obvious solution available today would be to not partition by metric_name and instead sort by it. I fear that leads to bad query performance, though: you often want all of the data points for a given metric, and now you essentially have to do a full table scan to get them. Adding a bloom filter might help skip some row groups, but my gut feel is it still wouldn't work out all that well.

I think an ideal solution for this would be some sort of clustering. Instead of partitioning per file, what if we could make a row group for each metric? That would play well with statistics since it then becomes a lot cheaper to read all of the data for a given metric (you still need to read the metadata to get the stats but beyond that you skip the row groups for all of the other metrics). I'd think this would work equally well for small and large files.
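To make the statistics argument concrete, here is a minimal std-only sketch of min/max pruning. `RowGroupStats`, `row_groups_to_read`, and `rows_scanned` are hypothetical names used only for illustration; they are not the parquet crate's API, just a simplified model of the per-row-group min/max values stored in the file footer.

```rust
/// Hypothetical, simplified stand-in for the min/max statistics that a
/// Parquet footer keeps for one column of one row group.
#[derive(Debug, Clone)]
struct RowGroupStats {
    min: String,
    max: String,
    num_rows: usize,
}

impl RowGroupStats {
    fn new(min: &str, max: &str, num_rows: usize) -> Self {
        Self { min: min.to_string(), max: max.to_string(), num_rows }
    }
}

/// Indices of the row groups whose [min, max] range could contain `metric`.
/// Every other row group can be skipped without reading its data pages.
fn row_groups_to_read(stats: &[RowGroupStats], metric: &str) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min.as_str() <= metric && metric <= s.max.as_str())
        .map(|(i, _)| i)
        .collect()
}

/// Total number of rows in the row groups that survive pruning.
fn rows_scanned(stats: &[RowGroupStats], metric: &str) -> usize {
    row_groups_to_read(stats, metric)
        .iter()
        .map(|&i| stats[i].num_rows)
        .sum()
}
```

With one row group per metric, min equals max in every row group, so a lookup for a single metric selects exactly one row group. If each row group mixes many metrics, the min/max ranges overlap and nothing can be skipped.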

Would this make sense as a feature? Something along the lines of with_row_group_clustering_columns? I imagine this could be useful in many other use cases.

Clustering is a broad topic in general, and I am also curious what the community thinks about it. Would it make sense to have something like Delta Lake's Liquid Clustering built into the Parquet writer?

adriangb · 2024-10-26T22:58:55Z

Author

I'm thinking something along these lines: https://github.com/apache/arrow-rs/compare/master...adriangb:arrow-rs:clustering?expand=1 (very very rough idea).

2 replies

adriangb Oct 27, 2024
Author

I'm realizing now this can probably be done with existing APIs... will try.

adriangb Oct 30, 2024
Author

We got it working:

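// Imports are assumed; the original post did not show them.
use anyhow::{Context, Result as AnyResult};
use arrow::compute::partition;
use arrow::record_batch::RecordBatch;
use itertools::Itertools; // provides try_collect on stable Rust

/// Splits `batch` into one slice per run of consecutive rows that share the
/// same values in `columns`. `partition` only groups adjacent equal rows, so
/// the batch is expected to be sorted by those columns already.
fn split_batch<'b>(
    batch: &'b RecordBatch,
    columns: &[impl AsRef<str>],
) -> AnyResult<impl Iterator<Item = RecordBatch> + 'b> {
    // Look up each clustering column by name, failing if one is missing.
    let partition_arrays: Vec<_> = columns
        .iter()
        .map(|col| {
            let col = col.as_ref();
            batch
                .column_by_name(col)
                .with_context(|| format!("row_group column {col} not found"))
                .cloned()
        })
        .try_collect()?;

    // Compute the row ranges in which all clustering columns are constant.
    let partitions = partition(&partition_arrays)?;
    // Each range becomes a zero-copy slice of the original batch.
    Ok(partitions
        .ranges()
        .into_iter()
        .map(|range| batch.slice(range.start, range.len())))
}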

(credit to @davidhewitt)

Then we call .flush() on the writer after every batch 😄
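For readers without arrow at hand, the range-splitting step can be modeled with std only. This is a sketch under assumptions, not the arrow-rs implementation: `consecutive_ranges` and `cluster_rows` are hypothetical helpers, and each returned group stands in for one slice that would be written and then flushed as its own row group.

```rust
use std::ops::Range;

/// Ranges of consecutive equal keys, similar in spirit to what
/// `partition(...).ranges()` yields for a single column.
fn consecutive_ranges<T: PartialEq>(keys: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=keys.len() {
        // Close the current run at the end of input or when the key changes.
        if i == keys.len() || keys[i] != keys[start] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}

/// Sort rows by metric name, then split them into one group per metric.
/// Each group models one slice handed to the writer before a flush.
fn cluster_rows(mut rows: Vec<(String, f64)>) -> Vec<Vec<(String, f64)>> {
    // Stable sort: rows of the same metric keep their arrival order.
    rows.sort_by(|a, b| a.0.cmp(&b.0));
    let keys: Vec<&str> = rows.iter().map(|r| r.0.as_str()).collect();
    consecutive_ranges(&keys)
        .into_iter()
        .map(|r| rows[r].to_vec())
        .collect()
}
```

Because only adjacent equal keys are grouped, unsorted input such as a, b, a produces three ranges rather than two. That is why the sketch sorts first, and it suggests the batch passed to split_batch should be sorted by the clustering columns as well.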
