
Improve OCF encoder/decoder handling of dynamic types #467

Merged: 5 commits into hamba:main on Oct 11, 2024

Conversation

@jhump (Contributor) commented on Oct 9, 2024:

There are a few issues when dealing with dynamic types:

  1. The existing code always uses avro.DefaultSchemaCache when parsing schema JSON documents. But this pollutes a global variable with temporary, non-global state. If the schemas being processed have a high cardinality of distinct namespaces and/or types, this could be a memory pressure issue, since none of the cached schemas are ever purged.
  2. The existing code requires a JSON document for the schema, but when dynamically building a schema, an avro.Schema is more natural and more type-safe.
  3. The existing code stores the schema in the file's "avro.schema" metadata key using the output of Schema.String(). However, this schema omits a lot of the schema details: in particular, it drops logical types and extra properties. I want to create files for use with Apache Iceberg, which requires that the file's Avro schema retain both logical types and some extra properties (mainly to capture Iceberg "field IDs" for everything). So in this branch, the logic now uses json.Compact on the input string when the schema is provided as a JSON document, and json.Marshal on the avro schema when the schema is provided as an avro.Schema. This makes sure that none of these details are lost. (I could use json.Marshal on the parsed schema in both cases, but it is more efficient to use json.Compact on an existing JSON document than to generate a new one.)
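A minimal sketch of that normalization logic, assuming a hypothetical helper named schemaMetadata (illustrative only, not the PR's actual code):

```go
package ocfsketch

import (
	"bytes"
	"encoding/json"

	"github.com/hamba/avro/v2"
)

// schemaMetadata produces the bytes stored under the "avro.schema"
// metadata key (hypothetical helper). When the caller supplied raw JSON,
// compacting it is cheaper than re-marshalling the parsed schema; when
// the caller supplied a parsed avro.Schema, json.Marshal keeps logical
// types and extra properties that the canonical Schema.String() form
// would drop.
func schemaMetadata(rawJSON []byte, schema avro.Schema) ([]byte, error) {
	if rawJSON != nil {
		var buf bytes.Buffer
		if err := json.Compact(&buf, rawJSON); err != nil {
			return nil, err
		}
		return buf.Bytes(), nil
	}
	return json.Marshal(schema)
}
```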

Lastly, this PR provides direct access to the parsed schema for users of an ocf.Decoder. Without this new accessor, users must re-parse the schema by getting the JSON from dec.Metadata()["avro.schema"] and calling avro.ParseBytes.
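For example, assuming the new accessor is named Schema() (a sketch, not confirmed API):

```go
package main

import (
	"log"
	"os"

	"github.com/hamba/avro/v2"
	"github.com/hamba/avro/v2/ocf"
)

func main() {
	f, err := os.Open("data.avro") // any OCF file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec, err := ocf.NewDecoder(f)
	if err != nil {
		log.Fatal(err)
	}

	// Before: re-parse the schema JSON out of the file metadata.
	oldSchema, err := avro.ParseBytes(dec.Metadata()["avro.schema"])
	if err != nil {
		log.Fatal(err)
	}

	// After: the decoder exposes the schema it already parsed from the
	// header (accessor name assumed here).
	newSchema := dec.Schema()

	log.Println(oldSchema.Type(), newSchema.Type())
}
```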

I haven't added tests yet because I was hoping for some guidance. This doesn't really touch any meaningful logic. So maybe just a simple test to encode and decode files using a different schema cache, and verifying the default cache is not touched? And also a simple test to verify that the schema encoded into file metadata retains custom properties and logical types?

1. Allow control of schema cache (to not pollute global default cache)
2. Accept parsed schema as alternative to string
3. Use schemas' MarshalJSON instead of String to preserve logical types and add'l properties
4. Make schema available to callers of NewDecoder
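A usage sketch of points 1 and 2 above; avro.ParseBytesWithCache and ocf.NewEncoderWithSchema reflect my reading of the API surface and should be treated as assumptions:

```go
package main

import (
	"bytes"
	"log"

	"github.com/hamba/avro/v2"
	"github.com/hamba/avro/v2/ocf"
)

// rec matches the record schema below.
type rec struct {
	A int64 `avro:"a"`
}

func main() {
	// A private cache keeps temporary, dynamically built schemas out of
	// the process-wide avro.DefaultSchemaCache, so they can be freed
	// once this work is done.
	cache := &avro.SchemaCache{}

	// Parse against the private cache instead of the global default
	// (assumed name for the cache-aware parser).
	schema, err := avro.ParseBytesWithCache(
		[]byte(`{"type":"record","name":"rec","fields":[{"name":"a","type":"long"}]}`),
		"", cache)
	if err != nil {
		log.Fatal(err)
	}

	// Passing the parsed schema straight to the encoder avoids a
	// redundant re-parse (ocf.NewEncoderWithSchema is an assumed name).
	var buf bytes.Buffer
	enc, err := ocf.NewEncoderWithSchema(schema, &buf)
	if err != nil {
		log.Fatal(err)
	}

	if err := enc.Encode(rec{A: 1}); err != nil {
		log.Fatal(err)
	}
	if err := enc.Close(); err != nil {
		log.Fatal(err)
	}
}
```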
@nrwiersma (Member) commented:
Thanks for this.

Firstly, a clarification.

> However, this schema omits a lot of the schema details: in particular, it drops logical types and extra properties.

String() does not drop logical types, as seen here: https://github.com/hamba/avro/blob/main/schema.go#L1355
In fact it produces the canonical schema, a standardised, comparable form of the schema with only the properties needed to read the data.

I like the schema cache addition; it makes a ton of sense. The rub comes with changing the way the header schema is created: 1) this is backwards breaking, and 2) while this may be desirable for Iceberg, it is not desirable for general Avro en/decoding. I would suggest you provide an option here that allows customisation of the JSON marshalling function, but defaults to the "standard" String() method.
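One possible shape for that option; the names MarshalSchemaFunc, canonical, and fullFidelity are illustrative, not the library's API:

```go
package ocfsketch

import (
	"encoding/json"

	"github.com/hamba/avro/v2"
)

// MarshalSchemaFunc renders the schema stored in the OCF header
// (hypothetical type for the suggested option).
type MarshalSchemaFunc func(avro.Schema) ([]byte, error)

// canonical is the backwards-compatible default: the canonical form
// produced by Schema.String().
func canonical(s avro.Schema) ([]byte, error) {
	return []byte(s.String()), nil
}

// fullFidelity keeps logical types and extra properties, as Iceberg
// requires.
func fullFidelity(s avro.Schema) ([]byte, error) {
	return json.Marshal(s)
}
```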

@jhump (Contributor, Author) commented on Oct 10, 2024:

> String() does not drop logical types, as seen here: https://github.com/hamba/avro/blob/main/schema.go#L1355

Ah, I think the issue then is that String sometimes drops logical types. In the example I was seeing, I was using an array with a "logicalType": "map" (which is how Iceberg represents maps that do not have string keys: as an array of key-value tuples/map entries). When I looked at ArraySchema and noticed it didn't preserve the logical type, I looked at a handful of others and saw they also don't preserve it (it seems only FixedSchema and PrimitiveSchema do).

> 1) this is backwards breaking

I didn't realize this would be the case. Do you mean that consumers of OCF files expect the schema to be deterministic and byte-for-byte the same from one file to the next for equivalent schemas? Or is it because emitting the extra properties makes the schema not equivalent to previously produced files?

In any event, I will make this an option.

@nrwiersma (Member) commented:
> Ah, I think the issue then is that String sometimes drops logical types. In the example I was seeing, I was using an array with a "logicalType": "map" (which is how Iceberg represents maps that do not have string keys: as an array of key-value tuples/map entries). When I looked at ArraySchema and noticed it didn't preserve the logical type, I looked at a handful of others and saw they also don't preserve it (it seems only FixedSchema and PrimitiveSchema do).

These do not have logical types in the Avro spec, and so the logical types are not recognised. I am working on some thoughts on how we can extend this to allow for Iceberg while not breaking Avro spec compliance. There is also a PR around this (see #448).

> I didn't realize this would be the case. Do you mean that consumers of OCF files expect the schema to be deterministic and byte-for-byte the same from one file to the next for equivalent schemas? Or is it because emitting the extra properties makes the schema not equivalent to previously produced files?

People using pure Avro generally only rely on the canonical form in the header, but it is breaking because it would produce a different file.

Commit: … for encoders); fix issue where array types won't preserve logicalType as a property
Review thread on schema.go (outdated, resolved).
@jhump (Contributor, Author) commented on Oct 10, 2024:

> These do not have logical types in the Avro spec, and so the logical types are not recognised.

Sure. The spec says this:

> Language implementations must ignore unknown logical types when reading

While it is true that this implementation was ignoring them, it also seems a bit problematic to discard them. That means you lose fidelity to the original schema and can't successfully round-trip it through de-serialization and re-serialization, even though neither step reported any errors. That's a bit counter-intuitive in my opinion. (This last paragraph is mostly justifying the latest change I made, which I already commented on above: making it so that the logicalType attribute is actually preserved for array types.)
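A small round-trip check of that fidelity argument, parsing a schema that carries "logicalType": "map" on an array and re-marshalling it (a sketch, assuming schemas expose MarshalJSON as the commit list above describes):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/hamba/avro/v2"
)

func main() {
	// An array carrying an unknown logical type, as Iceberg uses for
	// maps with non-string keys.
	in := `{"type":"array","logicalType":"map","items":{"type":"record","name":"kv","fields":[{"name":"key","type":"long"},{"name":"value","type":"string"}]}}`

	schema, err := avro.Parse(in)
	if err != nil {
		log.Fatal(err)
	}

	// With the fix, re-marshalling the parsed schema keeps the
	// "logicalType" attribute instead of silently dropping it.
	out, err := json.Marshal(schema)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```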

Review thread on ocf/ocf.go (outdated, resolved).
Review thread on ocf/ocf_test.go (resolved).
@nrwiersma (Member) left a review:

LGTM

@nrwiersma merged commit 4966106 into hamba:main on Oct 11, 2024. 13 checks passed.