Skip to content

gudjonragnar/fastavro-gen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Avro scheme generation

❗ This is quite barebones. Not all avro structures/types will be handled correctly since it was made with a specific avro schema catalog in mind. If you find anything missing, please submit a PR or create an issue.

This library aims to create dataclasses and/or typed dict definitions from avro schemes, that can be used to type check creation of messages using a static type checker such as mypy. The aim is to catch bugs before they occurr on runtime, and provide better IDE support.

As the name suggests, it was thought of as an optional extension to use with the excellent fastavro library. Fastavro allows users to read avro schemas, create messages and write them as avro messages, or validate against a schema.

Fastavro-gen uses fastavro to read .avsc files and from the schema object generated, creates classes. Classes are written one per file, using the namespace to create a directory structure. For example, the following record output class will be created under ./example/avro/user.py.

{
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number",  "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
    ]
}

Building a User message would normally be done by building a dictionary:

{
    "name": "My User",
    "favorite_number": "1",
    "favorite_color": "green",
}

Notice that the favorite number field in the schema has type int (or None) but the one we created has a string. This would cause a runtime error when writing or validating the record. Using the generated dataclass we can get IDE support (screenshot using VSCode with the Pylance language server). Notice the underlined "1". Hovering over shows the relevant error.

VSCode IDE support

Mypy will also catch this issue:

test.py:9: error: Argument "favorite_number" to "User" has incompatible type "str"; expected "Optional[int]"
Found 1 error in 1 file (checked 1 source file)

Output

The library offers the user two different output class types, dataclasses and TypedDicts. Each has it's pros and cons that have to be weighed for the user's use cases.

TypedDict

TypedDicts are only valuable during type checking, and on runtime they are simply treated as normal dicts. As such they can be built using common python dict syntax with an added type annotation, or using a class instantiation syntax:

class A(TypedDict, total=True):
    field1: int
    field2: str
    ...

instance1: A = {
    "field1": 1,
    "field2": "2",
}

instance2 = A(
    field1=1,
    field2="2",
)

➕Messages can be built using python dictionary syntax
➕Fastavro expects messages as dictionaries
➖All fields of the dictionary have to be given at the time of creation, unless the total option is given as False. Having total=False however restricts some aspects of the type checking e.g. checking if some keys are set or not. Currently this library has the total option hardcoded as False but that might be configurable at a later time.
➖No ability to specify defaults.

dataclass

Dataclasses allow for easy declaration of python classes.

➕Can handle default values for fields. As such only non-default fields have to be instantiated initially.
➕Easy to transform to dictionaries with the provided fastavro_gen.asdict function. It is simply a wrapper around dataclasses.asdict.
➖Complex nested schemas means a lot of objects being created
➖Extra overhead transforming messages to dictionaries
➖Overhead transforming dictionaries to dataclasses using fastavro_gen.fromdict.

Usage

This is a work in progress but is available on PyPI.

pip install fastavro-gen

To generate classes use the CLI or import the generate function from fastavro_gen. The library also exposes fastavro_gen.[asdict, fromdict] to map generated dataclasses to and from dictionaries.

💡 When the ordered option is specified, the file parameter will be ignored. Instead you can define schemas specified in the file parameter as singletons in the toml file passed to ordered.

usage: fastavro_gen [-h] [-o ORDERED] [--class-type {dataclass,TypedDict}] [--no-black] [--prefix PREFIX] [--output-dir OUTPUT_DIR] [file [file ...]]

Generate dataclasses or TypedDicts from avro schemas

positional arguments:
  file                  file(s) to parse, use '-' for stdin

optional arguments:
  -h, --help            show this help message and exit
  -o ORDERED, --ordered ORDERED
                        Path to a .toml file for multiple schemas or ordered schemas. Overwrites 'file' parameter.
  --class-type {dataclass,TypedDict}
  --no-black            Do not run output files through 'black'
  --prefix PREFIX       Removes this prefix from namespace if it is contained
  --output-dir OUTPUT_DIR
                        Specify the output location

The --ordered option

The option allows users to specify an order of files to read throught fastavro's load_schema_ordered function. This is useful when your files are laid out in a manner that does not follow the structure that the normal load_schema expects.

The option takes as value a path to a .toml file that describes what schemas to read and what their pre-requisites. For example, creating classes for a schema A that depends on B and C your .toml would include:

schemaA = [
    "/path/to/C.avsc",
    "/path/to/B.avsc",
    "/path/to/A.avsc",
]

The toml file can describe multiple schema dependencies, each as their own list.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages