
Improve OCF encoder/decoder handling of dynamic types #467

Merged: 5 commits into hamba:main on Oct 11, 2024

Conversation

@jhump (Contributor) commented on Oct 9, 2024:

There are a few issues when dealing with dynamic types:

  1. The existing code always uses avro.DefaultSchemaCache when parsing schema JSON documents. But this pollutes a global variable with temporary, non-global state. If the schemas being processed have a high cardinality of distinct namespaces and/or types, this could be a memory pressure issue, since none of the cached schemas are ever purged.
  2. The existing code requires a JSON document for the schema, but when dynamically building a schema, an avro.Schema is more natural and more type-safe.
  3. The existing code stores the schema in the file's "avro.schema" metadata key using the output of Schema.String(). However, this schema omits a lot of the schema details: in particular, it drops logical types and extra properties. I want to create files for use with Apache Iceberg, which requires that the file's Avro schema retain both logical types and some extra properties (mainly to capture Iceberg "field IDs" for everything). So in this branch, the logic now uses json.Compact on the input string when the schema is provided as a JSON document, and json.Marshal on the avro schema when the schema is provided as an avro.Schema. This makes sure that none of these details are lost. (I could use json.Marshal on the parsed schema in both cases, but it is more efficient to use json.Compact on an existing JSON document than to generate a new one.)
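A minimal sketch of that normalization logic, assuming a hypothetical helper named schemaMetadata (illustrative only, not the PR's actual code):

```go
package ocfsketch

import (
	"bytes"
	"encoding/json"

	"github.com/hamba/avro/v2"
)

// schemaMetadata produces the bytes stored under the "avro.schema"
// metadata key (hypothetical helper). When the caller supplied raw JSON,
// compacting it is cheaper than re-marshalling the parsed schema; when
// the caller supplied a parsed avro.Schema, json.Marshal keeps logical
// types and extra properties that the canonical Schema.String() form
// would drop.
func schemaMetadata(rawJSON []byte, schema avro.Schema) ([]byte, error) {
	if rawJSON != nil {
		var buf bytes.Buffer
		if err := json.Compact(&buf, rawJSON); err != nil {
			return nil, err
		}
		return buf.Bytes(), nil
	}
	return json.Marshal(schema)
}
```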

Lastly, this PR provides direct access to the parsed schema for users of an ocf.Decoder. Without this new accessor, users must re-parse the schema by getting the JSON from dec.Metadata()["avro.schema"] and calling avro.ParseBytes.
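For example, assuming the new accessor is named Schema() (a sketch, not confirmed API):

```go
package main

import (
	"log"
	"os"

	"github.com/hamba/avro/v2"
	"github.com/hamba/avro/v2/ocf"
)

func main() {
	f, err := os.Open("data.avro") // any OCF file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec, err := ocf.NewDecoder(f)
	if err != nil {
		log.Fatal(err)
	}

	// Before: re-parse the schema JSON out of the file metadata.
	oldSchema, err := avro.ParseBytes(dec.Metadata()["avro.schema"])
	if err != nil {
		log.Fatal(err)
	}

	// After: the decoder exposes the schema it already parsed from the
	// header (accessor name assumed here).
	newSchema := dec.Schema()

	log.Println(oldSchema.Type(), newSchema.Type())
}
```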

I haven't added tests yet because I was hoping for some guidance. This doesn't really touch any meaningful logic. So maybe just a simple test to encode and decode files using a different schema cache, and verifying the default cache is not touched? And also a simple test to verify that the schema encoded into file metadata retains custom properties and logical types?

1. Allow control of schema cache (to not pollute global default cache)
2. Accept parsed schema as alternative to string
3. Use schemas' MarshalJSON instead of String to preserve logical types and add'l properties
4. Make schema available to callers of NewDecoder
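A usage sketch of points 1 and 2 above; avro.ParseBytesWithCache and ocf.NewEncoderWithSchema reflect my reading of the API surface and should be treated as assumptions:

```go
package main

import (
	"bytes"
	"log"

	"github.com/hamba/avro/v2"
	"github.com/hamba/avro/v2/ocf"
)

// rec matches the record schema below.
type rec struct {
	A int64 `avro:"a"`
}

func main() {
	// A private cache keeps temporary, dynamically built schemas out of
	// the process-wide avro.DefaultSchemaCache, so they can be freed
	// once this work is done.
	cache := &avro.SchemaCache{}

	// Parse against the private cache instead of the global default
	// (assumed name for the cache-aware parser).
	schema, err := avro.ParseBytesWithCache(
		[]byte(`{"type":"record","name":"rec","fields":[{"name":"a","type":"long"}]}`),
		"", cache)
	if err != nil {
		log.Fatal(err)
	}

	// Passing the parsed schema straight to the encoder avoids a
	// redundant re-parse (ocf.NewEncoderWithSchema is an assumed name).
	var buf bytes.Buffer
	enc, err := ocf.NewEncoderWithSchema(schema, &buf)
	if err != nil {
		log.Fatal(err)
	}

	if err := enc.Encode(rec{A: 1}); err != nil {
		log.Fatal(err)
	}
	if err := enc.Close(); err != nil {
		log.Fatal(err)
	}
}
```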
@nrwiersma (Member) commented:
Thanks for this.

Firstly, a clarification.

> However, this schema omits a lot of the schema details: in particular, it drops logical types and extra properties.

String() does not drop logical types, as seen here: https://github.com/hamba/avro/blob/main/schema.go#L1355
In fact it produces the canonical schema, a standardised, comparable form of the schema with only the properties needed to read the data.

I like the schema cache addition; it makes a ton of sense. The rub comes with changing the way the header schema is created: 1) this is backwards breaking, and 2) while this may be desirable for Iceberg, it is not desirable for general Avro en/decoding. I would suggest you provide an option here that allows customisation of the JSON marshalling function, but defaults to the "standard" String() method.
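One possible shape for that option; the names MarshalSchemaFunc, canonical, and fullFidelity are illustrative, not the library's API:

```go
package ocfsketch

import (
	"encoding/json"

	"github.com/hamba/avro/v2"
)

// MarshalSchemaFunc renders the schema stored in the OCF header
// (hypothetical type for the suggested option).
type MarshalSchemaFunc func(avro.Schema) ([]byte, error)

// canonical is the backwards-compatible default: the canonical form
// produced by Schema.String().
func canonical(s avro.Schema) ([]byte, error) {
	return []byte(s.String()), nil
}

// fullFidelity keeps logical types and extra properties, as Iceberg
// requires.
func fullFidelity(s avro.Schema) ([]byte, error) {
	return json.Marshal(s)
}
```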

@jhump (Contributor, Author) commented on Oct 10, 2024:

> String() does not drop logical types, as seen here: https://github.com/hamba/avro/blob/main/schema.go#L1355

Ah, I think the issue then is that String sometimes drops logical types. In the example I was seeing, I was using an array with a "logicalType": "map" (which is how Iceberg represents maps that do not have string keys: as an array of key-value tuples/map entries). When I looked at ArraySchema and noticed it didn't preserve the logical type, I looked at a handful of others and saw they also don't preserve it (it seems only FixedSchema and PrimitiveSchema do).

> 1) this is backwards breaking

I didn't realize this would be the case. Do you mean that consumers of OCF files expect the schema to be deterministic and byte-for-byte the same from one file to the next for equivalent schemas? Or is it because emitting the extra properties makes the schema not equivalent to previously produced files?

In any event, I will make this an option.

@nrwiersma (Member) commented:
> Ah, I think the issue then is that String sometimes drops logical types. In the example I was seeing, I was using an array with a "logicalType": "map" (which is how Iceberg represents maps that do not have string keys: as an array of key-value tuples/map entries). When I looked at ArraySchema and noticed it didn't preserve the logical type, I looked at a handful of others and saw they also don't preserve it (it seems only FixedSchema and PrimitiveSchema do).

These do not have logical types in the Avro spec, and so the logical types are not recognised. I am working on some thoughts on how we can extend this to allow for Iceberg while not breaking Avro spec compliance. There is also a PR around this (see #448).

> I didn't realize this would be the case. Do you mean that consumers of OCF files expect the schema to be deterministic and byte-for-byte the same from one file to the next for equivalent schemas? Or is it because emitting the extra properties makes the schema not equivalent to previously produced files?

People using pure Avro generally only rely on the canonical form in the header, but it is breaking because it would produce a different file.

Commit: … for encoders); fix issue where array types won't preserve logicalType as a property
Review thread on schema.go (outdated, resolved).
@jhump (Contributor, Author) commented on Oct 10, 2024:

> These do not have logical types in the Avro spec, and so the logical types are not recognised.

Sure. The spec says this:

> Language implementations must ignore unknown logical types when reading

While it is true that this implementation was ignoring them, it also seems a bit problematic to discard them. That means you lose fidelity to the original schema and can't successfully round-trip it through de-serialization and re-serialization, even though neither step reported any errors. That's a bit counter-intuitive in my opinion. (This last paragraph is mostly justifying the latest change I made, which I already commented on above: making it so that the logicalType attribute is actually preserved for array types.)
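A small round-trip check of that fidelity argument, parsing a schema that carries "logicalType": "map" on an array and re-marshalling it (a sketch, assuming schemas expose MarshalJSON as the commit list above describes):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/hamba/avro/v2"
)

func main() {
	// An array carrying an unknown logical type, as Iceberg uses for
	// maps with non-string keys.
	in := `{"type":"array","logicalType":"map","items":{"type":"record","name":"kv","fields":[{"name":"key","type":"long"},{"name":"value","type":"string"}]}}`

	schema, err := avro.Parse(in)
	if err != nil {
		log.Fatal(err)
	}

	// With the fix, re-marshalling the parsed schema keeps the
	// "logicalType" attribute instead of silently dropping it.
	out, err := json.Marshal(schema)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```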

Review thread on ocf/ocf.go (outdated, resolved).
Review thread on ocf/ocf_test.go (resolved).
@nrwiersma (Member) left a review:

LGTM

@nrwiersma merged commit 4966106 into hamba:main on Oct 11, 2024. 13 checks passed.