Improve OCF encoder/decoder handling of dynamic types #467
Conversation
1. Allow control of the schema cache (to not pollute the global default cache)
2. Accept a parsed schema as an alternative to a string
3. Use schemas' `MarshalJSON` instead of `String` to preserve logical types and additional properties
4. Make the schema available to callers of `NewDecoder`
Thanks for this. Firstly, a clarification.
I like the schema cache addition, this makes a ton of sense. The rub comes with changing the way the header schema is created. 1) this is backwards breaking and 2) while this may be desirable for Iceberg, it is not desirable for general Avro en/decoding. I would suggest you provide an option here that allows customisation of the JSON marshalling function, but defaults to the "standard"
Ah, I think the issue then is that I didn't realize this would be the case. Do you mean that consumers of OCF files expect the schema to be deterministic and byte-for-byte the same from one file to the next for equivalent schemas? Or is it because emitting the extra properties makes the schema not equivalent to previously produced files? In any event, I will make this an option.
…ON; add option to customize
These do not have logical types in the Avro spec, and so the logical types are not recognised. I am working on some thoughts on how we can extend this to allow for Iceberg while not breaking Avro Spec compliance. There is also a PR around this (See #448)
People using pure Avro generally only rely on the canonical form in the header, but it is breaking because it would produce a different file.
… for encoders); fix issue where array types won't preserve logicalType as a property
Sure. The spec says this:
While it is true that this implementation was ignoring them, it also seems a bit problematic to discard them. That means you end up losing fidelity to the original schema and can't successfully round-trip the schema through de-serialization and re-serialization, even though neither step reported any errors. That's a bit counter-intuitive in my opinion. (This last paragraph is mostly justifying the latest change I made, that I already commented on above -- making it so that the
LGTM
There are a few issues when dealing with dynamic types:

1. The OCF encoder and decoder use `avro.DefaultSchemaCache` when parsing schema JSON documents. But this pollutes a global variable with temporary, non-global state. If the schemas being processed have a high cardinality of distinct namespaces and/or types, this could be a memory pressure issue, since none of the cached schemas are ever purged.
2. When the caller has already parsed the schema, accepting an `avro.Schema` is more natural and more type-safe than requiring a string.
3. The schema written to the file header is produced via `Schema.String()`. However, this schema omits a lot of the schema details: in particular, it drops logical types and extra properties. I want to create files for use with Apache Iceberg, which has requirements that the file's Avro schema retain both logical types and some extra properties (mainly to capture Iceberg "field IDs" for everything). So in this branch, the logic now uses `json.Compact` on the input string, when the schema is provided as a JSON document, and `json.Marshal` on the avro schema, when the schema is provided as an `avro.Schema`. This makes sure that none of these details are lost. (I could use `json.Marshal` on the parsed schema in both cases, but it is more efficient to use `json.Compact` on an existing JSON document than to generate a new document.)

Lastly, this PR provides direct access to the parsed schema to users of an `ocf.Decoder`. Without this new accessor, users must re-parse the schema by getting the JSON from `dec.Metadata()["avro.schema"]` and calling `avro.ParseBytes...`.

I haven't added tests yet because I was hoping for some guidance. This doesn't really touch any meaningful logic. So maybe just a simple test to encode and decode files using a different schema cache, and verifying the default cache is not touched? And also a simple test to verify that the schema encoded into file metadata retains custom properties and logical types?