
Introduce new Chunk/Dataframe conversion crate to clean up our handling of arrow metadata #8744

Open · jleibs opened this issue Jan 20, 2025 · 2 comments
jleibs commented Jan 20, 2025

Context

We currently have two different arrow-metadata encoding schemas:

  • In query API results we use:
    • sorbet.path, sorbet.semantic_family, sorbet.logical_type, sorbet.semantic_type
  • In chunk transport we use:
    • rerun.entity_path, rerun.archetype_name, rerun.archetype_field_name, and the field name as the component name

Keeping track of which encoding is required at each point in the pipeline is hard: it adds confusion while bringing no meaningful utility.

Proposal

We will continue to maintain two separate encodings, but we will work to normalize them and align them with the rerun names, since the more generic sorbet concepts currently lead to confusion.

We will stop using the sorbet names until we have cycles to turn sorbet into a more universal spec. We use rerun names even in the dataframe API because these are Rerun-specific APIs that we are currently exposing.

We might as well use this as an opportunity to add versioning. Both variants will include two new schema-metadata-level properties, rerun.schema_version and rerun.batch_variant, so that we can differentiate them.

Proposed v1: RerunChunk encoded data.

Schema-level metadata

  • rerun.schema_version = 1
  • rerun.batch_variant = "chunk"
  • rerun.id = The Chunk Id (required)
  • rerun.entity_path = The entity-path for the whole chunk

Control Column-level metadata

  • rerun.kind = "control"

Index Column-level metadata

  • rerun.kind = "index" | "time" ("time" is kept for backwards compatibility)
  • rerun.is_sorted
  • rerun.index_name = If unset, we use the field name for backwards compatibility
  • rerun.dataframe_column_name = The original column from a converted dataframe (Optional)

Data Column-level metadata

  • rerun.kind = "data"
  • rerun.archetype_name
  • rerun.archetype_field_name
  • rerun.component_name = If unset, we use the field name for backwards compatibility
  • rerun.dataframe_column_name = The original column from a converted dataframe (Optional)

All data columns MUST be wrapped as a ListArray type.
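As a concrete illustration, here is a minimal sketch of a chunk-variant RecordBatch built with arrow-rs. The chunk id, entity path, column names, and archetype/component values are all illustrative placeholders, not values from the actual Rerun codebase, and the RowId control column is omitted for brevity:

```rust
use std::{collections::HashMap, sync::Arc};

use arrow::array::{ArrayRef, Int64Array, ListArray};
use arrow::buffer::OffsetBuffer;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema-level metadata for the proposed "chunk" variant.
    let schema_metadata = HashMap::from([
        ("rerun.schema_version".to_owned(), "1".to_owned()),
        ("rerun.batch_variant".to_owned(), "chunk".to_owned()),
        ("rerun.id".to_owned(), "00000000-0000-0000-0000-000000000000".to_owned()), // placeholder ChunkId
        ("rerun.entity_path".to_owned(), "/points".to_owned()),
    ]);

    // Index column: plain (non-list) timeline values, marked as sorted.
    let index_field = Field::new("log_time", DataType::Int64, false).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "index".to_owned()),
        ("rerun.is_sorted".to_owned(), "true".to_owned()),
        ("rerun.index_name".to_owned(), "log_time".to_owned()),
    ]));
    let index_col: ArrayRef = Arc::new(Int64Array::from(vec![0_i64, 1, 2]));

    // Data column: per the spec above, data MUST be list-wrapped.
    let inner = Arc::new(Field::new("item", DataType::Int64, false));
    let values: ArrayRef = Arc::new(Int64Array::from(vec![10_i64, 20, 21, 30]));
    let offsets = OffsetBuffer::from_lengths([1, 2, 1]); // rows with 1, 2, and 1 instances
    let data_col: ArrayRef = Arc::new(ListArray::try_new(inner.clone(), offsets, values, None)?);
    let data_field = Field::new("scalar", DataType::List(inner), false).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "data".to_owned()),
        ("rerun.archetype_name".to_owned(), "rerun.archetypes.Scalar".to_owned()),
        ("rerun.archetype_field_name".to_owned(), "scalar".to_owned()),
        ("rerun.component_name".to_owned(), "rerun.components.Scalar".to_owned()),
    ]));

    let schema = Arc::new(Schema::new(vec![index_field, data_field]).with_metadata(schema_metadata));
    let batch = RecordBatch::try_new(schema, vec![index_col, data_col])?;
    println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```

Note that all the semantics live in metadata; the field names ("log_time", "scalar") carry no meaning of their own.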

Proposed v1: RerunDataframe encoded data.

On INGEST paths (rr.send_dataframe), we want to be generally forgiving and make a best effort to interpret an arrow payload as a dataframe, even if it is missing top-level metadata.
On OUTPUT paths (dataframe query results), we should always include the full metadata.

Schema-level metadata

  • rerun.schema_version = 1 (Optional) If missing we assume the latest version
  • rerun.batch_variant = "dataframe" (Optional) If missing we assume "dataframe"
  • rerun.entity_path = (Optional) Defines the entity_path for any column where that entity_path is not set

Index Column-level metadata

  • rerun.kind = index
  • rerun.is_sorted
  • rerun.index_name = If unset, we use the field name

Data Column-level metadata

  • rerun.kind = "data"
  • rerun.archetype_name
  • rerun.archetype_field_name
  • rerun.component_name
  • rerun.entity_path (optional)

Ideally, data columns of mono-types in the dataframe representation should NOT need to be list-wrapped. Requiring users to structure data in this way is a significantly larger burden than simply adding metadata tags to their columns. Doing it on-ingest in Rerun is a significantly better experience.
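As a hedged sketch of what that on-ingest wrapping could look like (this is not the actual ingest code, and the helper name is made up): columns that are already list arrays pass through untouched, so the chunk invariant above still holds after ingest.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, ListArray};
use arrow::buffer::OffsetBuffer;
use arrow::datatypes::{DataType, Field};
use arrow::error::ArrowError;

/// Wrap a mono-type column (one instance per row) in a one-element-per-row
/// `ListArray`; columns that are already lists pass through untouched.
fn list_wrap_mono(column: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    if matches!(column.data_type(), DataType::List(_) | DataType::LargeList(_)) {
        return Ok(column.clone());
    }
    let field = Arc::new(Field::new("item", column.data_type().clone(), true));
    // One instance per row: offsets are simply 0, 1, 2, …, len.
    let offsets = OffsetBuffer::from_lengths(std::iter::repeat(1).take(column.len()));
    Ok(Arc::new(ListArray::try_new(field, offsets, column.clone(), None)?))
}
```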

Plan

We will introduce a new standalone crate which includes utilities for identifying, validating, and converting between "Dataframe" and "Chunk" representations.

The primary transformations that need to happen for v1 (see the sketch after this list):

  • If more than one entity is present, split into separate chunks.
    • Any index column is duplicated to each chunk.
  • For each chunk, inject a rerun.id for the ChunkId
  • For each chunk, synthesize a control-column of Rerun RowIds
  • For any datatypes which are not ListArrays, introduce a single-element List wrapper.
  • (Maybe?) If no index column was provided, synthesize a monotonic sequence index.
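A hypothetical public surface for such a crate might look like the following; the names (BatchVariant, identify, dataframe_to_chunks) are illustrative and not the actual re_sorbet API:

```rust
use arrow::record_batch::RecordBatch;

pub enum BatchVariant {
    Chunk,
    Dataframe,
}

/// Identify which encoding a batch uses, falling back to `Dataframe` when
/// the metadata is absent (the forgiving ingest path described above).
pub fn identify(batch: &RecordBatch) -> BatchVariant {
    let schema = batch.schema();
    match schema.metadata().get("rerun.batch_variant").map(String::as_str) {
        Some("chunk") => BatchVariant::Chunk,
        _ => BatchVariant::Dataframe,
    }
}

/// Convert a dataframe-encoded batch into one chunk-encoded batch per entity.
pub fn dataframe_to_chunks(batch: &RecordBatch) -> Vec<RecordBatch> {
    // 1. Group data columns by their effective rerun.entity_path.
    // 2. Duplicate every index column into each group.
    // 3. Inject a fresh rerun.id and synthesize a RowId control column.
    // 4. List-wrap any data column that is not already a ListArray.
    unimplemented!("sketch only")
}
```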

The following places are good candidates for using this new crate:


jleibs commented Jan 21, 2025

Some frameworks have restrictions on what constitutes a valid column name. For example, lance doesn't permit top-level field names to contain a ".".

This is a good argument for Rerun being agnostic about the choice of field-name and encoding all metadata as proper metadata fields.
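For instance, a field could carry a lance-safe name while all of its semantic identity lives in metadata; the field name, datatype, and component value below are illustrative (and the datatype is simplified):

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn lance_safe_field() -> Field {
    // The field name contains no "." at all; the semantic identity lives
    // entirely in metadata, so consumers never need to parse the name.
    Field::new("positions", DataType::Float32, false).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "data".to_owned()),
        (
            "rerun.component_name".to_owned(),
            "rerun.components.Position3D".to_owned(),
        ),
    ]))
}
```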

jleibs changed the title from "Unify sorbet and rerun arrow metadata" to "Introduce new Chunk/Dataframe conversion crate to clean up our handling of arrow metadata" on Jan 22, 2025
emilk added a commit that referenced this issue Jan 27, 2025
### Related
* Part of #8744

### What
Creates a new crate `re_sorbet`. The goal is for it to contain our
canonical ways of converting to-and-from arrow metadata and record
batches. In this initial PR, I mostly move some code around.
emilk added a commit that referenced this issue Jan 28, 2025

emilk commented Jan 28, 2025

We discussed today how to handle mono components, i.e. the common case of single instances (e.g. scalars). We want to make this as ergonomic as possible.

  • We will start supporting RecordBatches with mono-types in them, and do the list-array wrapping on the way into the chunk store so that code in viewer-space never has to worry about mono-types.
  • Future: we can also introduce a sorbet tag, e.g. rerun.mono, and use it to unwrap list arrays when generating query results, so that these datatypes round-trip properly from the user's perspective (see the sketch after this list).
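That future unwrapping step might look roughly like this; the rerun.mono tag and the helper are hypothetical, and the guards are only what this sketch needs to be safe:

```rust
use arrow::array::{Array, ArrayRef, ListArray};
use arrow::datatypes::Field;

/// If a column is tagged `rerun.mono` and every row holds exactly one
/// instance, strip the list wrapper so users get back the flat array
/// they originally sent in.
fn maybe_unwrap_mono(field: &Field, column: &ArrayRef) -> ArrayRef {
    let is_mono = field.metadata().get("rerun.mono").map(String::as_str) == Some("true");
    if let Some(list) = column.as_any().downcast_ref::<ListArray>() {
        // All rows are non-null with length 1, and the child array covers
        // exactly the listed values, so it can be returned as-is.
        if is_mono
            && list.iter().all(|row| row.is_some_and(|r| r.len() == 1))
            && list.values().len() == list.len()
        {
            return list.values().clone();
        }
    }
    column.clone()
}
```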
