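Context
We currently have 2 different arrow-metadata encoding schemas:
sorbet.path, sorbet.semantic_family, sorbet.logical_type, sorbet.semantic_type
rerun.entity_path, rerun.archetype_name, rerun.archetype_field_name, and the name of the column as the component name
Keeping track of which encoding is required at different points in the pipeline is hard and only adds confusion while bringing no meaningful utility.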
Proposal
We will continue to maintain two separate encodings, but we will work to normalize them and align with the rerun names, as more generic sorbet concepts currently lead to confusion.
We will stop using the sorbet names until we have cycles to make this a more universal spec. We use the rerun names even in the dataframe API because these are Rerun-specific APIs that we are currently exposing.
We might as well use this as an opportunity to add versioning. Both types will include two new schema-metadata-level properties, rerun.schema_version and rerun.batch_variant, so that we can differentiate them.
Proposed v1: RerunChunk encoded data.
Schema-level metadata
rerun.schema_version = 1
rerun.batch_variant = "chunk"
rerun.id = The Chunk Id (required)
rerun.entity_path = The entity-path for the whole chunk
Control Column-level metadata
rerun.kind = "control"
Index Column-level metadata
rerun.kind = "index" | "time" (for backwards compat)
rerun.is_sorted
rerun.index_name = If unset, we use the Field-name for backwards compatibility
rerun.dataframe_column_name = The original column from a converted dataframe (Optional)
Data Column-level metadata
rerun.kind = "data"
rerun.archetype_name
rerun.archetype_field_name
rerun.component_name = If unset, we use the Field-name for backwards compatibility
rerun.dataframe_column_name = The original column from a converted dataframe (Optional)
All data columns MUST be wrapped as a ListArray type.
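To make the chunk rules above concrete, here is a minimal validation sketch. It is not the proposed crate's API: metadata is modeled as plain string-to-string dicts (as Arrow metadata would be), and `validate_chunk`, the column dict layout, and the `is_list` flag are all hypothetical names chosen for illustration. It also assumes that rerun.schema_version and rerun.batch_variant are mandatory for chunks, which the list above implies but does not state outright.

```python
# Hypothetical sketch: validate chunk-level and column-level metadata.
# Metadata is modeled as plain dicts of strings; no Arrow library is used.
CHUNK_KINDS = {"control", "index", "time", "data"}  # "time" kept for backwards compat


def validate_chunk(schema_meta, columns):
    """Return a list of human-readable problems (empty if the chunk looks valid).

    `columns` is a list of dicts with keys "name", "metadata" and "is_list"
    (a hypothetical stand-in for "the Arrow datatype is a ListArray").
    """
    errors = []
    if schema_meta.get("rerun.schema_version") != "1":
        errors.append("rerun.schema_version must be '1'")
    if schema_meta.get("rerun.batch_variant") != "chunk":
        errors.append("rerun.batch_variant must be 'chunk'")
    if not schema_meta.get("rerun.id"):
        errors.append("rerun.id (the Chunk Id) is required")

    for col in columns:
        kind = col["metadata"].get("rerun.kind")
        if kind not in CHUNK_KINDS:
            errors.append(f"{col['name']}: unknown rerun.kind {kind!r}")
        elif kind == "data" and not col["is_list"]:
            errors.append(f"{col['name']}: data columns must be ListArray-wrapped")
    return errors


example_schema = {
    "rerun.schema_version": "1",
    "rerun.batch_variant": "chunk",
    "rerun.id": "chunk-0001",  # placeholder, not a real ChunkId format
    "rerun.entity_path": "/world/points",
}
example_columns = [
    {"name": "row_id", "metadata": {"rerun.kind": "control"}, "is_list": False},
    {"name": "frame", "metadata": {"rerun.kind": "index"}, "is_list": False},
    {"name": "positions", "metadata": {"rerun.kind": "data"}, "is_list": True},
]
```

Calling `validate_chunk(example_schema, example_columns)` on the example above reports no problems, while dropping the list wrapper from `positions` would produce one.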
Proposed v1: RerunDataframe encoded data.
On INGEST paths (rr.send_dataframe), we want to be generally forgiving and make a best-effort to interpret an arrow payload as a dataframe, even if it's missing top-level metadata.
On OUTPUT paths (dataframe query results), we should always include the full metadata.
Schema-level metadata
rerun.schema_version = 1 (Optional) If missing we assume the latest version
rerun.batch_variant = "dataframe" (Optional) If missing we assume a "dataframe"
rerun.entity_path = (Optional) Defines the entity_path for any column where that entity_path is not set
Index Column-level metadata
rerun.kind = "index"
rerun.is_sorted
rerun.index_name = If unset, we use the Field-name
Data Column-level metadata
rerun.kind = "data"
rerun.archetype_name
rerun.archetype_field_name
rerun.component_name
rerun.entity_path (optional)
Ideally, data columns of mono-types in the dataframe representation should NOT need to be list-wrapped. Requiring users to structure data in this way is a significantly larger burden than simply adding metadata tags to their columns. Doing it on-ingest in Rerun is a significantly better experience.
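The forgiving ingest behavior described above could be sketched roughly as follows. This is an illustration under assumptions, not the crate's design: metadata is again modeled as plain dicts, and `normalize_dataframe_metadata` and the `(field_name, metadata)` tuple layout are hypothetical. Only the fallbacks the spec actually lists are applied (latest schema version, "dataframe" variant, schema-level entity path, Field-name as index name).

```python
# Hypothetical sketch: fill in the optional dataframe metadata on ingest.
LATEST_SCHEMA_VERSION = "1"


def normalize_dataframe_metadata(schema_meta, columns):
    """Return (schema_meta, columns) with the documented defaults applied.

    `columns` is a list of (field_name, metadata) tuples. Inputs are not mutated.
    """
    schema = dict(schema_meta)
    # Missing version: assume the latest. Missing variant: assume a dataframe.
    schema.setdefault("rerun.schema_version", LATEST_SCHEMA_VERSION)
    schema.setdefault("rerun.batch_variant", "dataframe")
    default_path = schema.get("rerun.entity_path")

    normalized = []
    for name, meta in columns:
        meta = dict(meta)
        kind = meta.get("rerun.kind")
        if kind == "index":
            # If unset, the Field-name is used as the index name.
            meta.setdefault("rerun.index_name", name)
        elif kind == "data" and default_path is not None:
            # The schema-level entity path applies to columns that lack their own.
            meta.setdefault("rerun.entity_path", default_path)
        normalized.append((name, meta))
    return schema, normalized
```

A column that already carries its own rerun.entity_path keeps it; the schema-level value is only a fallback.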
Plan
We will introduce a new standalone crate which includes utilities for identifying, validating, and converting between "Dataframe" and "Chunk" representations.
The primary transformation that needs to happen for v1 is Dataframe to Chunk:
If more than one entity is present, split into separate chunks.
Any index column is duplicated to each chunk.
For each chunk, inject a rerun.id for the ChunkId
For each chunk, synthesize a control-column of Rerun RowIds
For any data columns whose datatype is not a ListArray, introduce a single-element List wrapper.
(Maybe?) If no index column was provided, synthesize a monotonic sequence index.
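The steps above can be sketched end to end on a toy in-memory representation. Everything here is an assumption made for illustration: columns are dicts with "name", "metadata" and "values", `dataframe_to_chunks` and the `row_id` column name are hypothetical, and random UUID hex strings stand in for real ChunkIds and RowIds. The sketch assumes every data column already has rerun.entity_path set, and it leaves out the "(Maybe?)" synthesized sequence index.

```python
# Hypothetical sketch of the Dataframe -> Chunk transformation on toy data.
import uuid


def dataframe_to_chunks(columns):
    """Split a toy dataframe into one chunk per entity path."""
    index_cols = [c for c in columns if c["metadata"].get("rerun.kind") == "index"]

    # 1. Group data columns by entity path (one chunk per entity).
    by_entity = {}
    for col in columns:
        if col["metadata"].get("rerun.kind") == "data":
            by_entity.setdefault(col["metadata"]["rerun.entity_path"], []).append(col)

    chunks = []
    for entity_path, data_cols in by_entity.items():
        num_rows = len(data_cols[0]["values"])
        # 4. Synthesize a control column of row ids (placeholder id format).
        row_ids = {
            "name": "row_id",
            "metadata": {"rerun.kind": "control"},
            "values": [uuid.uuid4().hex for _ in range(num_rows)],
        }
        # 5. Wrap every non-list value in a single-element list.
        wrapped = [
            {**col, "values": [v if isinstance(v, list) else [v] for v in col["values"]]}
            for col in data_cols
        ]
        chunks.append({
            "metadata": {
                "rerun.schema_version": "1",
                "rerun.batch_variant": "chunk",
                "rerun.id": uuid.uuid4().hex,  # 3. Inject a chunk id.
                "rerun.entity_path": entity_path,
            },
            # 2. Every index column is duplicated into each chunk.
            "columns": [row_ids]
            + [{**c, "values": list(c["values"])} for c in index_cols]
            + wrapped,
        })
    return chunks
```

For a dataframe with one index column and data columns for two entities, this yields two chunks, each carrying its own copy of the index, its own chunk id, and list-wrapped data values.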
The following places are good candidates for using this new crate:
rerun/crates/store/re_grpc_client/src/lib.rs (line 407 in 343da4a)
Some frameworks have restrictions on what defines a valid column-name. For example: lance doesn't permit top-level field names to contain a ".".
This is a good argument for Rerun being agnostic about the choice of field-name and encoding all metadata as proper metadata fields.
jleibs changed the title from "Unify sorbet and rerun arrow metadata" to "Introduce new Chunk/Dataframe conversion crate to clean up our handling of arrow metadata" on Jan 22, 2025
### Related
* Part of #8744
### What
Creates a new crate `re_sorbet`. The goal is for it to contain our
canonical ways of converting to-and-from arrow metadata and record
batches. In this initial PR, I mostly move some code around.
We discussed a bit today how to handle mono components, i.e. the common case of single instances (e.g. scalars). We want to make this as ergonomic as possible.
We will start supporting RecordBatches with Mono-types in them, and do the list-array wrapping on the way into the chunkstore so that code in viewer-space never has to worry about mono-types.
Future: we can also introduce a sorbet tag, e.g. rerun.mono, and use it to unwrap list arrays when generating query results, so that these datatypes round-trip properly from the perspective of users.
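A rough sketch of how such a round trip might look, using plain Python lists in place of Arrow arrays. The rerun.mono tag is only floated as an example above, so its name and its "true" value are assumptions here, as are the helper names `wrap_mono` and `unwrap_mono`. The sketch assumes a column is uniformly either scalars or lists and ignores nulls.

```python
# Hypothetical sketch: list-wrap mono columns on ingest, unwrap on query output.
MONO_TAG = "rerun.mono"  # assumed tag name, taken from the example in the comment


def wrap_mono(values, metadata):
    """Wrap scalar values in single-element lists and tag the column as mono."""
    if all(isinstance(v, list) for v in values):
        # Already list-shaped: nothing to wrap, nothing to tag.
        return values, dict(metadata)
    return [[v] for v in values], {**metadata, MONO_TAG: "true"}


def unwrap_mono(values, metadata):
    """Undo wrap_mono for columns that carry the mono tag."""
    if metadata.get(MONO_TAG) != "true":
        return values, dict(metadata)
    untagged = {k: v for k, v in metadata.items() if k != MONO_TAG}
    return [v[0] for v in values], untagged
```

With this shape, code inside the store only ever sees list-wrapped columns, while a user who sent scalars gets scalars back from a query.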