Binary format choice #34

Open
certik opened this issue Mar 13, 2024 · 2 comments

Comments

@certik
Owner

certik commented Mar 13, 2024

Requirements:

  • Binary
  • High performance: the header and data section can be loaded into memory at once and mapped directly to arrays, so no compression, and the data must be aligned
  • Small file on disk (for small arrays)
  • Self-contained single file
  • Simple: easy to write a reader/writer in C and Python from scratch with no external library dependencies, easy to reverse engineer the format
  • Store multiple arrays
  • For each array store: name, shape, type and data
  • Allow dimensions, say, up to 16
  • Store all numpy types, at least: f32, f64, i8, i16, i32, i64
  • Store metadata (variables of various types, at least "str", f64, i64)
  • Extensible: add new types easily
  • Future proof: old reader should be able to read newer types, at least as i8 arrays
  • Widely used: to get community, ecosystem and 3rd party readers/writers, and most importantly to have the large community create and upload a lot of files online, which cements the structure and the format to live as an "IR", independent of any particular reader/writer library implementation.
  • Bonus: it should be possible to implement a reader in C or Fortran that reads integers at predetermined locations in the metadata section to get the model parameters and the offset of the data section, then jumps to the data section and reads the arrays directly (their sizes are determined by the model parameters). The writer only needs to store the metadata integers and the data arrays in a fixed order with fixed types. The metadata section can still contain other entries between and after the integers; the point is that storing the i64 integers n, m, k in this order (each name exactly one character long) must put them at easy-to-calculate positions in the binary file, so that a reader can be written without actually parsing the headers. For now we can store the data offset as an integer in the metadata.
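To make the "Bonus" item concrete, here is a minimal sketch in Python. The layout is a hypothetical toy one, not the actual GGUF layout: each metadata entry is a u64 key length, the key bytes, a u32 type tag and an i64 value, all little-endian. The tag value, the extra key `o` for the data offset, the 32-byte alignment and the function names are all assumptions made for illustration.

```python
import struct

# Hypothetical toy layout (an assumption for this sketch, not real GGUF):
#   entry = u64 key length | key bytes | u32 type tag | i64 value
I64_TAG = 11                 # arbitrary type tag chosen for this sketch
ENTRY_SIZE = 8 + 1 + 4 + 8   # bytes per entry when the key is exactly 1 char
ALIGN = 32                   # assumed alignment of the data section


def write_entry(name: str, value: int) -> bytes:
    key = name.encode("ascii")
    assert len(key) == 1, "fixed positions rely on 1-character names"
    return struct.pack("<Q", len(key)) + key + struct.pack("<Iq", I64_TAG, value)


def read_value(buf: bytes, index: int) -> int:
    # Jump straight to the value of entry `index`; no key parsing needed.
    offset = index * ENTRY_SIZE + 8 + 1 + 4
    return struct.unpack_from("<q", buf, offset)[0]


def build_file(n: int, m: int, k: int, values) -> bytes:
    header_len = 4 * ENTRY_SIZE
    # Round the data offset up to the next multiple of ALIGN.
    data_offset = (header_len + ALIGN - 1) // ALIGN * ALIGN
    entries = (("n", n), ("m", m), ("k", k), ("o", data_offset))
    header = b"".join(write_entry(name, v) for name, v in entries)
    padding = b"\x00" * (data_offset - header_len)
    data = struct.pack(f"<{len(values)}f", *values)
    return header + padding + data


def read_file(buf: bytes):
    # Read n, m, k and the data offset from their known positions.
    n, m, k, offset = (read_value(buf, i) for i in range(4))
    count = n * m * k
    values = struct.unpack_from(f"<{count}f", buf, offset)
    return (n, m, k), list(values)
```

The reader never looks at key names or type tags; it relies only on the agreed order and sizes, which is the property the requirement asks for. A C or Fortran reader would do the same arithmetic with `fseek`/`fread` or stream access.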

Options:

  • gguf: hits all the requirements except two: it is missing the f64 and i64 types for arrays (those can be stored as variables in metadata, though), and it supports only up to 4 dimensions. Both can be fixed with various workarounds, for example using metadata to note that an array is in fact f64 but storing it as f32, or using metadata to store the full shape of an array with more than 4 dimensions. It would also be easy to extend the format to accommodate these two issues.
  • safetensors: must parse json header
  • hdf5: Not "simple". Not possible to easily fix the complexity, requires a complex library to read/write. Not small for small arrays.
  • asdf: yaml header (harder to parse in C), more complicated than gguf
  • npy: can only store one array, no name (the name is in the filename), no metadata
  • npz: can store multiple arrays, no metadata, zip file (requires complex decompression, or dependency on zlib), creates files on disk, so low performance to read all arrays at once, name only as a filename
  • bjdata: does not seem possible to memory map the data section and read all arrays at once (requires parsing)
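The shape workaround mentioned for gguf could look roughly like the sketch below: an array with more than 4 dimensions is stored flattened, and its full shape goes into a metadata entry. This uses plain Python dictionaries rather than any gguf library, and the key suffix `.full_shape` and the function names are hypothetical.

```python
from math import prod

MAX_DIMS = 4  # dimension limit noted for gguf above


def split_shape(name, shape):
    """Return (shape to store in the array record, extra metadata entries)."""
    if len(shape) <= MAX_DIMS:
        return tuple(shape), {}
    # Too many dimensions: store a flat 1-D array and keep the real
    # shape in metadata under a hypothetical key.
    return (prod(shape),), {f"{name}.full_shape": list(shape)}


def restore_shape(name, stored_shape, metadata):
    """Recover the original shape on the reader side."""
    return tuple(metadata.get(f"{name}.full_shape", stored_shape))
```

A reader that does not know about the extra key still sees a valid 1-D array with the right number of elements, so the data stays readable.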

We chose gguf and wrote our own C reader. There is an existing gguf Python library to read/write. The limitations do not currently affect us, since all arrays have dimensions up to 4, and we don't need f64 and i64 for now.
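As an illustration of how little code the fixed part of a reader needs, here is a sketch that reads the start of a GGUF file. It assumes the header layout of GGUF version 2 or later as I understand it (4-byte magic `GGUF`, u32 version, u64 tensor count, u64 metadata key/value count, little-endian); please check it against the GGUF specification before relying on it. The function name is made up for this example, and it is not our C reader.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 4 + 4 + 8 + 8  # magic, version, tensor count, metadata count


def read_gguf_header(buf: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2+ layout)."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    if buf[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # "<" means little-endian with no implicit padding between fields.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

The metadata entries and tensor descriptions follow this header; parsing those is where most of the reader's work goes.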

@certik
Owner Author

certik commented Mar 13, 2024

Here is a list of minor fixes and issues in GGUF:

@rebcabin
Collaborator

I would spell out a few more tenets:

  1. It must be trivial to compute internal file offsets so that readers can skip blocks they don't parse.

  2. The file should support only one order, say column-major, and one endianness, say little-endian. Readers bear the burden of converting to app- and language-dependent forms, and writers bear the burden of converting from app- and language-dependent forms.

  3. The format should not have "features." It should be as minimal as possible. "Features" are the province of reader and writer implementation software.
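A rough sketch of tenet 1, combined with the single-endianness rule of tenet 2: if every block starts with a fixed-size header that carries its payload size, a reader can skip any block it does not understand by pure offset arithmetic. The block layout here (8-byte tag, u64 size, payload padded to 8 bytes) and all names are hypothetical, chosen only for illustration.

```python
import struct

ALIGN = 8          # assumed payload alignment
HEADER = "<8sQ"    # 8-byte tag (null-padded), u64 payload size, little-endian
HEADER_SIZE = struct.calcsize(HEADER)  # 16 bytes


def pad(n: int) -> int:
    """Number of zero bytes needed to round n up to a multiple of ALIGN."""
    return (-n) % ALIGN


def write_block(tag: bytes, payload: bytes) -> bytes:
    assert len(tag) <= 8
    header = struct.pack(HEADER, tag, len(payload))
    return header + payload + b"\x00" * pad(len(payload))


def iter_blocks(buf: bytes):
    """Yield (tag, payload start, payload size) without parsing payloads."""
    pos = 0
    while pos < len(buf):
        tag, size = struct.unpack_from(HEADER, buf, pos)
        start = pos + HEADER_SIZE
        yield tag.rstrip(b"\x00"), start, size
        pos = start + size + pad(size)  # trivial jump to the next block


def read_f32_blocks(buf: bytes):
    """A reader that only knows 'f32' blocks and skips everything else."""
    arrays = []
    for tag, start, size in iter_blocks(buf):
        if tag == b"f32":
            arrays.append(list(struct.unpack_from(f"<{size // 4}f", buf, start)))
    return arrays
```

An older reader handed a file with a newer block type simply steps over it, or can hand the raw bytes back to the caller as an i8 array.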
