
Introduce new Chunk/Dataframe conversion crate to clean up our handling of arrow metadata #8744

Open · jleibs opened this issue Jan 20, 2025 · 2 comments
jleibs commented Jan 20, 2025

Context

We currently have two different arrow-metadata encoding schemas:

  • In query API results we use:
    • sorbet.path, sorbet.semantic_family, sorbet.logical_type, sorbet.semantic_type
  • In chunk transport we use:
    • rerun.entity_path, rerun.archetype_name, rerun.archetype_field_name, and the field name as the component name

Keeping track of which encoding is required at each point in the pipeline is hard: it adds confusion while bringing no meaningful utility.

Proposal

We will continue to maintain two separate encodings, but we will work to normalize them and align them with the rerun names, since the more generic sorbet concepts currently lead to confusion.

We will stop using the sorbet names until we have cycles to turn sorbet into a more universal spec. We use rerun names even in the dataframe API because these are Rerun-specific APIs that we are currently exposing.

We might as well use this as an opportunity to add versioning. Both variants will include two new schema-metadata-level properties, rerun.schema_version and rerun.batch_variant, so that we can differentiate them.

Proposed v1: RerunChunk encoded data.

Schema-level metadata

  • rerun.schema_version = 1
  • rerun.batch_variant = "chunk"
  • rerun.id = The Chunk Id (required)
  • rerun.entity_path = The entity-path for the whole chunk

Control Column-level metadata

  • rerun.kind = "control"

Index Column-level metadata

  • rerun.kind = "index" | "time" ("time" is kept for backwards compatibility)
  • rerun.is_sorted
  • rerun.index_name = If unset, we use the field name for backwards compatibility
  • rerun.dataframe_column_name = The original column from a converted dataframe (Optional)

Data Column-level metadata

  • rerun.kind = "data"
  • rerun.archetype_name
  • rerun.archetype_field_name
  • rerun.component_name = If unset, we use the field name for backwards compatibility
  • rerun.dataframe_column_name = The original column from a converted dataframe (Optional)

All data columns MUST be wrapped as a ListArray type.
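As a concrete illustration, here is a minimal sketch of a chunk-variant RecordBatch built with arrow-rs. The chunk id, entity path, column names, and archetype/component values are all illustrative placeholders, not values from the actual Rerun codebase, and the RowId control column is omitted for brevity:

```rust
use std::{collections::HashMap, sync::Arc};

use arrow::array::{ArrayRef, Int64Array, ListArray};
use arrow::buffer::OffsetBuffer;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema-level metadata for the proposed "chunk" variant.
    let schema_metadata = HashMap::from([
        ("rerun.schema_version".to_owned(), "1".to_owned()),
        ("rerun.batch_variant".to_owned(), "chunk".to_owned()),
        ("rerun.id".to_owned(), "00000000-0000-0000-0000-000000000000".to_owned()), // placeholder ChunkId
        ("rerun.entity_path".to_owned(), "/points".to_owned()),
    ]);

    // Index column: plain (non-list) timeline values, marked as sorted.
    let index_field = Field::new("log_time", DataType::Int64, false).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "index".to_owned()),
        ("rerun.is_sorted".to_owned(), "true".to_owned()),
        ("rerun.index_name".to_owned(), "log_time".to_owned()),
    ]));
    let index_col: ArrayRef = Arc::new(Int64Array::from(vec![0_i64, 1, 2]));

    // Data column: per the spec above, data MUST be list-wrapped.
    let inner = Arc::new(Field::new("item", DataType::Int64, false));
    let values: ArrayRef = Arc::new(Int64Array::from(vec![10_i64, 20, 21, 30]));
    let offsets = OffsetBuffer::from_lengths([1, 2, 1]); // rows with 1, 2, and 1 instances
    let data_col: ArrayRef = Arc::new(ListArray::try_new(inner.clone(), offsets, values, None)?);
    let data_field = Field::new("scalar", DataType::List(inner), false).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "data".to_owned()),
        ("rerun.archetype_name".to_owned(), "rerun.archetypes.Scalar".to_owned()),
        ("rerun.archetype_field_name".to_owned(), "scalar".to_owned()),
        ("rerun.component_name".to_owned(), "rerun.components.Scalar".to_owned()),
    ]));

    let schema = Arc::new(Schema::new(vec![index_field, data_field]).with_metadata(schema_metadata));
    let batch = RecordBatch::try_new(schema, vec![index_col, data_col])?;
    println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```

Note that all the semantics live in metadata; the field names ("log_time", "scalar") carry no meaning of their own.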

Proposed v1: RerunDataframe encoded data.

On INGEST paths (rr.send_dataframe), we want to be generally forgiving and make a best effort to interpret an arrow payload as a dataframe, even if it is missing top-level metadata.
On OUTPUT paths (dataframe query results), we should always include the full metadata.

Schema-level metadata

  • rerun.schema_version = 1 (Optional) If missing we assume the latest version
  • rerun.batch_variant = "dataframe" (Optional) If missing we assume "dataframe"
  • rerun.entity_path = (Optional) Defines the entity_path for any column where that entity_path is not set

Index Column-level metadata

  • rerun.kind = index
  • rerun.is_sorted
  • rerun.index_name = If unset, we use the field name

Data Column-level metadata

  • rerun.kind = "data"
  • rerun.archetype_name
  • rerun.archetype_field_name
  • rerun.component_name
  • rerun.entity_path (optional)

Ideally, data columns of mono-types in the dataframe representation should NOT need to be list-wrapped. Requiring users to structure data in this way is a significantly larger burden than simply adding metadata tags to their columns. Doing it on-ingest in Rerun is a significantly better experience.
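As a hedged sketch of what that on-ingest wrapping could look like (this is not the actual ingest code, and the helper name is made up): columns that are already list arrays pass through untouched, so the chunk invariant above still holds after ingest.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, ListArray};
use arrow::buffer::OffsetBuffer;
use arrow::datatypes::{DataType, Field};
use arrow::error::ArrowError;

/// Wrap a mono-type column (one instance per row) in a one-element-per-row
/// `ListArray`; columns that are already lists pass through untouched.
fn list_wrap_mono(column: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    if matches!(column.data_type(), DataType::List(_) | DataType::LargeList(_)) {
        return Ok(column.clone());
    }
    let field = Arc::new(Field::new("item", column.data_type().clone(), true));
    // One instance per row: offsets are simply 0, 1, 2, …, len.
    let offsets = OffsetBuffer::from_lengths(std::iter::repeat(1).take(column.len()));
    Ok(Arc::new(ListArray::try_new(field, offsets, column.clone(), None)?))
}
```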

Plan

We will introduce a new standalone crate which includes utilities for identifying, validating, and converting between "Dataframe" and "Chunk" representations.

The primary transformations that need to happen for v1 (see the sketch after this list):

  • If more than one entity is present, split into separate chunks.
    • Any index column is duplicated to each chunk.
  • For each chunk, inject a rerun.id for the ChunkId
  • For each chunk, synthesize a control-column of Rerun RowIds
  • For any datatypes which are not ListArrays, introduce a single-element List wrapper.
  • (Maybe?) If no index column was provided, synthesize a monotonic sequence index.
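A hypothetical public surface for such a crate might look like the following; the names (BatchVariant, identify, dataframe_to_chunks) are illustrative and not the actual re_sorbet API:

```rust
use arrow::record_batch::RecordBatch;

pub enum BatchVariant {
    Chunk,
    Dataframe,
}

/// Identify which encoding a batch uses, falling back to `Dataframe` when
/// the metadata is absent (the forgiving ingest path described above).
pub fn identify(batch: &RecordBatch) -> BatchVariant {
    let schema = batch.schema();
    match schema.metadata().get("rerun.batch_variant").map(String::as_str) {
        Some("chunk") => BatchVariant::Chunk,
        _ => BatchVariant::Dataframe,
    }
}

/// Convert a dataframe-encoded batch into one chunk-encoded batch per entity.
pub fn dataframe_to_chunks(batch: &RecordBatch) -> Vec<RecordBatch> {
    // 1. Group data columns by their effective rerun.entity_path.
    // 2. Duplicate every index column into each group.
    // 3. Inject a fresh rerun.id and synthesize a RowId control column.
    // 4. List-wrap any data column that is not already a ListArray.
    unimplemented!("sketch only")
}
```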

The following places are good candidates for using this new crate:


jleibs commented Jan 21, 2025

Some frameworks have restrictions on what constitutes a valid column name. For example, lance doesn't permit top-level field names to contain a ".".

This is a good argument for Rerun being agnostic about the choice of field-name and encoding all metadata as proper metadata fields.
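For instance, a field could carry a lance-safe name while all of its semantic identity lives in metadata; the field name, datatype, and component value below are illustrative (and the datatype is simplified):

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn lance_safe_field() -> Field {
    // The field name contains no "." at all; the semantic identity lives
    // entirely in metadata, so consumers never need to parse the name.
    Field::new("positions", DataType::Float32, false).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "data".to_owned()),
        (
            "rerun.component_name".to_owned(),
            "rerun.components.Position3D".to_owned(),
        ),
    ]))
}
```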

jleibs changed the title from "Unify sorbet and rerun arrow metadata" to "Introduce new Chunk/Dataframe conversion crate to clean up our handling of arrow metadata" on Jan 22, 2025
emilk added a commit that referenced this issue Jan 27, 2025
### Related
* Part of #8744

### What
Creates a new crate `re_sorbet`. The goal is for it to contain our
canonical ways of converting to-and-from arrow metadata and record
batches. In this initial PR, I mostly move some code around.
emilk added a commit that referenced this issue Jan 28, 2025

emilk commented Jan 28, 2025

We discussed today how to handle mono components, i.e. the common case of single instances (e.g. scalars). We want to make this as ergonomic as possible.

  • We will start supporting RecordBatches with mono-types in them, and do the list-array wrapping on the way into the chunk store so that code in viewer-space never has to worry about mono-types.
  • Future: we can also introduce a sorbet tag, e.g. rerun.mono, and use it to unwrap list arrays when generating query results, so that these datatypes round-trip properly from the user's perspective (see the sketch after this list).
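That future unwrapping step might look roughly like this; the rerun.mono tag and the helper are hypothetical, and the guards are only what this sketch needs to be safe:

```rust
use arrow::array::{Array, ArrayRef, ListArray};
use arrow::datatypes::Field;

/// If a column is tagged `rerun.mono` and every row holds exactly one
/// instance, strip the list wrapper so users get back the flat array
/// they originally sent in.
fn maybe_unwrap_mono(field: &Field, column: &ArrayRef) -> ArrayRef {
    let is_mono = field.metadata().get("rerun.mono").map(String::as_str) == Some("true");
    if let Some(list) = column.as_any().downcast_ref::<ListArray>() {
        // All rows are non-null with length 1, and the child array covers
        // exactly the listed values, so it can be returned as-is.
        if is_mono
            && list.iter().all(|row| row.is_some_and(|r| r.len() == 1))
            && list.values().len() == list.len()
        {
            return list.values().clone();
        }
    }
    column.clone()
}
```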
