Binary format choice #34

certik · 2024-03-13T17:04:37Z

Requirements:

Binary
High performance (the header and data section can be loaded into memory at once and mapped directly to arrays, so no compression, and the data must be aligned so it can be used as arrays in place)
Small file on disk (for small arrays)
Self-contained single file
Simple: easy to write a reader/writer in C and Python from scratch with no external library dependencies, easy to reverse engineer the format
Store multiple arrays
For each array store: name, shape, type and data
Allow dimensions, say, up to 16
Store all numpy types, at least: f32, f64, i8, i16, i32, i64
Store metadata (variables of various types, at least "str", f64, i64)
Extensible: add new types easily
Future proof: old reader should be able to read newer types, at least as i8 arrays
Widely used: to get a community, an ecosystem and 3rd party readers/writers, and most importantly to have a large community create and upload many files online, which cements the structure and lets the format live as an "IR", independent of any particular reader/writer library implementation.
Bonus: It should be possible to implement a reader in C and Fortran that reads integers from the metadata section at predetermined locations to get the model parameters and an offset to the data section, then jumps to the data section and reads the arrays directly (with sizes determined by the model parameters). The writer just needs to store the metadata integers and data arrays in a certain order and with certain types. The metadata section can still have other data between the integers and after them; the only requirement is that storing i64 integers n, m, k in this order (each name exactly length 1) corresponds to easy-to-calculate positions in the binary file, so that the reader can be written easily, without actually parsing the headers. We can store the offset as an integer in metadata for now.
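
To make the "Bonus" requirement concrete, here is a minimal sketch in Python of a reader that fetches n, m, k and the data offset from fixed byte positions without parsing the header. The layout is a hypothetical one invented for illustration (the `DEMO` magic, the `TYPE_I64` tag, the entry named `o` for the offset and the 32-byte alignment are all assumptions), not the actual GGUF layout.

```python
import struct

# Hypothetical layout, for illustration only (this is not GGUF):
HEADER = struct.Struct("<4sI")    # magic, version
ENTRY = struct.Struct("<Q1sIq")   # name length (=1), 1-byte name, type tag, i64 value
VALUE_OFFSET = 8 + 1 + 4          # bytes that precede the value inside one entry
TYPE_I64 = 1                      # assumed type tag
ALIGN = 32                        # assumed alignment of the data section


def value_pos(i):
    """Byte position of the i-th i64 value, computed without parsing anything."""
    return HEADER.size + i * ENTRY.size + VALUE_OFFSET


def write_blob(n, m, k, values):
    """Store n, m, k and the data offset as the first four metadata entries."""
    assert len(values) == n * m * k
    end_of_meta = HEADER.size + 4 * ENTRY.size
    data_offset = -(-end_of_meta // ALIGN) * ALIGN   # round up to ALIGN
    out = bytearray(HEADER.pack(b"DEMO", 1))
    for name, val in (("n", n), ("m", m), ("k", k), ("o", data_offset)):
        out += ENTRY.pack(1, name.encode(), TYPE_I64, val)
    out += b"\0" * (data_offset - len(out))          # padding up to the data section
    out += struct.pack(f"<{len(values)}f", *values)  # one f32 array of n*m*k items
    return bytes(out)


def read_blob(buf):
    """Read the four integers at fixed positions, then jump to the data."""
    n, m, k, data_offset = (
        struct.unpack_from("<q", buf, value_pos(i))[0] for i in range(4)
    )
    values = struct.unpack_from(f"<{n * m * k}f", buf, data_offset)
    return (n, m, k), list(values)
```

Because every name has length 1 and every value is an i64, each entry has the same size, which is what makes `value_pos` a one-line calculation; a C or Fortran reader could hard-code the same positions.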

Options:

gguf: meets all the requirements except that it is missing f64 and i64 types for arrays (those can be stored as variables in metadata though) and supports dimensions only up to 4. Both can be handled with workarounds, for example using metadata to note that an array stored as f32 is in fact f64, and using metadata to store the full shape when there are more than 4 dimensions. It would also be easy to extend the format to address these two issues.
safetensors: must parse json header
hdf5: Not "simple". Not possible to easily fix the complexity, requires a complex library to read/write. Not small for small arrays.
asdf: yaml header (harder to parse in C), more complicated than gguf
npy: can only store one array, no name (the name is in the filename), no metadata
npz: can store multiple arrays, no metadata, zip file (requires complex decompression, or dependency on zlib), creates files on disk, so low performance to read all arrays at once, name only as a filename
bjdata: does not seem possible to memory map the data section and read all arrays at once (requires parsing)

We chose gguf and wrote our own C reader. There is an existing gguf Python library to read/write. The limitations do not currently affect us, since all arrays have dimensions up to 4, and we don't need f64 and i64 for now.
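
As a rough illustration of how little code a from-scratch reader needs to get started, the sketch below unpacks only the fixed-size part at the start of a GGUF file. It assumes the version 2/3 layout as I understand it (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key-value counts, all little-endian); the function name `parse_gguf_header` is made up for this example and is not part of the gguf Python library.

```python
import struct

# Assumed GGUF v2/v3 fixed header: magic, version, tensor count, metadata KV count.
GGUF_HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(buf):
    """Return the fixed header fields from the first 24 bytes of `buf`."""
    magic, version, n_tensors, n_kv = GGUF_HEADER.unpack_from(buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    if version < 2:
        # Version 1 stored the counts as 32-bit integers; not handled here.
        raise ValueError("unsupported GGUF version")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }
```

The metadata key-value pairs and tensor descriptions that follow are variable-length, so a full reader still has to walk them one by one.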

certik · 2024-03-13T20:23:33Z

Here is a list of minor fixes and issues in GGUF:

gguf : add support for I64 and F64 arrays ggerganov/llama.cpp#6062
gguf-py: add support for I8, I16 and I32 ggerganov/llama.cpp#6045
Add offsets for each header section and for the total header (to make sections easy to skip, and to avoid implementation issues with alignment). For inference in C we can assume a given order and types, so we can go directly to the data section and read it "blindly"; we can use these offsets to skip all header sections and extract only the integer values for model parameters (to know the sizes of all arrays); and a reader can skip header sections it does not understand (future proofing).
Add sizes in bytes for all arrays, to future-proof the format against types that an older reader does not support; the reader should just use I8 to represent unknown types.
Add support for dimensions > 4
GGUF writer reverses array (tensor) dimensions ggerganov/llama.cpp#6040
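
The proposal to store a size in bytes for every array can be sketched as follows. This is a hypothetical record layout (a type tag followed by a byte count, with made-up tag numbers in `KNOWN`), not something GGUF currently provides, and it ignores alignment for brevity. The point is that a reader which meets an unknown tag can still step over the payload and expose it as i8.

```python
import struct

# Hypothetical type tags -> (struct format code, item width in bytes).
KNOWN = {0: ("f", 4), 1: ("d", 8), 2: ("b", 1), 3: ("i", 4)}
REC = struct.Struct("<IQ")  # type tag, payload size in bytes


def pack_array(tag, payload):
    """Prefix a raw payload with its type tag and its size in bytes."""
    return REC.pack(tag, len(payload)) + payload


def read_arrays(buf):
    """Decode every record; unknown tags fall back to a plain i8 array."""
    pos, out = 0, []
    while pos < len(buf):
        tag, nbytes = REC.unpack_from(buf, pos)
        pos += REC.size
        code, width = KNOWN.get(tag, ("b", 1))  # older reader: treat as i8
        count = nbytes // width
        out.append((tag, list(struct.unpack_from(f"<{count}{code}", buf, pos))))
        pos += nbytes  # the stored size is all that is needed to skip ahead
    return out
```

For example, a record written with an unrecognized tag 99 and payload bytes `01 02 ff` comes back as the i8 values 1, 2, -1, and the records after it are still read correctly.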

rebcabin · 2024-03-14T19:45:28Z

I would spell out a few more tenets

It must be trivial to compute internal file offsets so that readers can skip blocks they don't parse.
The file should support only one order, say column-major, and one endianness, say little-endian. Readers bear the burden of converting to app- and language-dependent forms, and writers bear the burden of converting from app- and language-dependent forms.
The format should not have "features." It should be as minimal as possible. "Features" are the province of reader and writer implementation software.
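
A small sketch of the second tenet, assuming f64 data and a 2-D array held as nested Python lists in row-major form: the writer converts to the single on-disk form (column-major, little-endian) and the reader converts back. The function names are hypothetical.

```python
import struct


def to_column_major_le(rows):
    """Writer side: flatten a row-major 2-D list into column-major LE f64 bytes."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [rows[i][j] for j in range(n_cols) for i in range(n_rows)]
    return struct.pack(f"<{len(flat)}d", *flat)


def from_column_major_le(buf, n_rows, n_cols):
    """Reader side: rebuild the row-major nested list from the on-disk form."""
    flat = struct.unpack(f"<{n_rows * n_cols}d", buf)
    return [[flat[j * n_rows + i] for j in range(n_cols)] for i in range(n_rows)]
```

The file itself carries no flag for order or endianness; a column-major, little-endian consumer such as Fortran on x86 can use the bytes as they are, and everyone else converts at the boundary.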
