Consider using partitioned parquet files #121

tomalrussell · 2022-11-04T15:56:15Z

Parquet readers and writers can treat a set of files under a directory as a single dataset. Partitions can optionally be specified based on data values, which are then encoded into the directory structure, for example: roads.parquet/highway=primary/slice=1/part0.parquet
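To make the directory encoding concrete, here is a minimal sketch, assuming Hive-style `key=value` directory names as in the example path above. The helper names `partition_path` and `parse_partitions` are hypothetical and use only the Python standard library. In practice a Parquet library would likely handle this itself.

```python
from pathlib import PurePosixPath


def partition_path(root, partitions, filename):
    """Build a Hive-style path such as root/key=value/.../filename."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return PurePosixPath(root, *parts, filename)


def parse_partitions(path, root):
    """Recover the key/value pairs encoded in the directories below root."""
    relative = PurePosixPath(path).relative_to(root)
    pairs = {}
    for directory in relative.parts[:-1]:  # skip the file name itself
        key, _, value = directory.partition("=")
        pairs[key] = value  # values come back as strings
    return pairs


path = partition_path(
    "roads.parquet", {"highway": "primary", "slice": 1}, "part0.parquet"
)
print(path)
print(parse_partitions(path, "roads.parquet"))
```

The partition values are stored in the path rather than in each file, so a reader can skip whole directories when filtering on `highway` or `slice`.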

For Python, the key docs are:

Some questions:

What would be the benefits for open-gira intermediate or results datasets? At the least, it could reduce particularly large file sizes and avoid (or simplify) the concatenation steps as currently implemented.
How would the file/directory structure interact with snakemake? Would we need additional flags or workarounds?
