❗ This is quite barebones. Not all avro structures/types will be handled correctly since it was made with a specific avro schema catalog in mind. If you find anything missing, please submit a PR or create an issue.
This library aims to create dataclass and/or TypedDict definitions from avro schemas, which can be used to type check the creation of messages with a static type checker such as mypy. The aim is to catch bugs before they occur at runtime and to provide better IDE support.
As the name suggests, it was thought of as an optional extension to use with the excellent fastavro library. Fastavro allows users to read avro schemas, create messages and write them as avro messages, or validate against a schema.
Fastavro-gen uses fastavro to read .avsc files and, from the schema object generated, creates classes. Classes are written one per file, using the namespace to create a directory structure.
For example, the class generated from the following record will be written to ./example/avro/user.py.
{
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
    ]
}
Building a User message would normally be done by building a dictionary:
{
    "name": "My User",
    "favorite_number": "1",
    "favorite_color": "green",
}
Notice that the favorite number field in the schema has type int (or None) but the one we created has a string. This would cause a runtime error when writing or validating the record.
Using the generated dataclass we can get IDE support (screenshot using VSCode with the Pylance language server). Notice the underlined "1". Hovering over it shows the relevant error.
Mypy will also catch this issue:
test.py:9: error: Argument "favorite_number" to "User" has incompatible type "str"; expected "Optional[int]"
Found 1 error in 1 file (checked 1 source file)
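The exact code that fastavro-gen emits is not reproduced here, so the following is only a hand-written sketch of what a dataclass for the User schema might look like. Its shape is inferred from the mypy message above (favorite_number typed as Optional[int]); the real generated file may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


# Hand-written stand-in for the generated example/avro/user.py (assumed shape).
@dataclass
class User:
    name: str
    favorite_number: Optional[int]
    favorite_color: Optional[str]


# Correctly typed: accepted by a type checker and easy to turn into a dict.
ok = User(name="My User", favorite_number=1, favorite_color="green")
message = asdict(ok)

# The mistake from the dictionary example. A type checker reports the str
# argument here, but Python itself does not complain at this point: dataclasses
# perform no runtime type validation.
bad = User(name="My User", favorite_number="1", favorite_color="green")
```

The second instantiation shows why the static check matters: without it, the wrong type would only surface later, when the record is written or validated.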
The library offers the user two different output class types, dataclasses and TypedDicts.
Each has its pros and cons that have to be weighed for the user's use cases.
TypedDicts are only valuable during type checking; at runtime they are simply treated as normal dicts.
As such they can be built using common python dict syntax with an added type annotation, or using a class instantiation syntax:
from typing import TypedDict

class A(TypedDict, total=True):
    field1: int
    field2: str

...
instance1: A = {
    "field1": 1,
    "field2": "2",
}
instance2 = A(
    field1=1,
    field2="2",
)
➕Messages can be built using python dictionary syntax
➕Fastavro expects messages as dictionaries
➖All fields of the dictionary have to be given at the time of creation, unless the total option is given as False.
Having total=False however restricts some aspects of the type checking, e.g. checking if some keys are set or not.
Currently this library has the total option hardcoded as False but that might be configurable at a later time.
➖No ability to specify defaults.
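To illustrate the effect of total=False described above, here is a minimal sketch using only the standard library. The User class is written by hand for illustration and is not output from fastavro-gen.

```python
from typing import TypedDict


# total=False makes every key optional from the type checker's point of view.
class User(TypedDict, total=False):
    name: str
    favorite_number: int


# Omitting favorite_number is accepted because of total=False.
partial: User = {"name": "My User"}

# At runtime the instance is a plain dict: nothing is validated, and the
# type checker can no longer tell us that a key is missing.
print(type(partial) is dict)
print("favorite_number" in partial)
```

With total=True the assignment to partial would be reported by the type checker as missing a key, which is the trade-off the list above describes.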
Dataclasses allow for easy declaration of python classes.
➕Can handle default values for fields. As such, only fields without a default have to be provided at instantiation.
➕Easy to transform to dictionaries with the provided fastavro_gen.asdict function. It is simply a wrapper around dataclasses.asdict.
➖Complex nested schemas mean a lot of objects being created
➖Extra overhead transforming messages to dictionaries
➖Overhead transforming dictionaries to dataclasses using fastavro_gen.fromdict.
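The trade-offs above can be sketched with the standard library alone. This example uses dataclasses.asdict directly (which fastavro_gen.asdict is said to wrap) and rebuilds the objects by hand instead of calling fastavro_gen.fromdict, whose signature is not shown in this document. Address and its fields are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Address:  # hypothetical nested record
    city: str
    zip_code: Optional[str] = None  # has a default, so it may be omitted


@dataclass
class User:
    name: str
    addresses: List[Address] = field(default_factory=list)


# Only fields without defaults must be supplied.
user = User(name="My User", addresses=[Address(city="Oslo")])

# asdict recurses into nested dataclasses and lists, copying every level.
as_dict = asdict(user)

# Going the other way means rebuilding each nested object.
rebuilt = User(
    name=as_dict["name"],
    addresses=[Address(**a) for a in as_dict["addresses"]],
)
```

Every nested record becomes its own object, and each conversion walks the whole structure, which is where the overhead mentioned in the list comes from.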
This is a work in progress but is available on PyPI.
pip install fastavro-gen
To generate classes use the CLI or import the generate function from fastavro_gen. The library also exposes fastavro_gen.[asdict, fromdict] to map generated dataclasses to and from dictionaries.
💡 When the ordered option is specified, the file parameter is ignored. Schemas that would otherwise be passed through the file parameter can instead be defined as singletons in the toml file passed to ordered.
usage: fastavro_gen [-h] [-o ORDERED] [--class-type {dataclass,TypedDict}] [--no-black] [--prefix PREFIX] [--output-dir OUTPUT_DIR] [file [file ...]]
Generate dataclasses or TypedDicts from avro schemas
positional arguments:
  file                  file(s) to parse, use '-' for stdin

optional arguments:
  -h, --help            show this help message and exit
  -o ORDERED, --ordered ORDERED
                        Path to a .toml file for multiple schemas or ordered schemas. Overwrites 'file' parameter.
  --class-type {dataclass,TypedDict}
  --no-black            Do not run output files through 'black'
  --prefix PREFIX       Removes this prefix from namespace if it is contained
  --output-dir OUTPUT_DIR
                        Specify the output location
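If the CLI is called from a build script, the command line can be assembled from the flags listed in the help text. The sketch below only builds and prints the command; build_command is a hypothetical helper, and the schema path and output directory are placeholders.

```python
import shlex
from typing import List


def build_command(
    files: List[str],
    class_type: str = "dataclass",
    output_dir: str = ".",
    run_black: bool = True,
) -> List[str]:
    """Assemble a fastavro_gen command line from the documented flags."""
    cmd = ["fastavro_gen", "--class-type", class_type, "--output-dir", output_dir]
    if not run_black:
        cmd.append("--no-black")
    return cmd + files


# Placeholder paths, for illustration only.
cmd = build_command(["schemas/user.avsc"], class_type="TypedDict", output_dir="generated")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once fastavro-gen is installed.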
The --ordered option allows users to specify the order in which files are read, through fastavro's load_schema_ordered function.
This is useful when your files are laid out in a manner that does not follow the structure that the normal load_schema expects.
The option takes as its value a path to a .toml file that describes which schemas to read and what their prerequisites are.
For example, to create classes for a schema A that depends on B and C, your .toml would include:
schemaA = [
    "/path/to/C.avsc",
    "/path/to/B.avsc",
    "/path/to/A.avsc",
]
The toml file can describe multiple schema dependencies, each as their own list.