Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary format documentation #6

Open
nyurik opened this issue Dec 7, 2023 · 6 comments
Open

Binary format documentation #6

nyurik opened this issue Dec 7, 2023 · 6 comments

Comments

@nyurik
Copy link
Contributor

nyurik commented Dec 7, 2023

Hi, I am thinking of using this crate for maplibre/martin#1049 - storing binary diffs for map tile distribution. This is a widely used format with multiple implementations in different languages, so creating a diff may be done with one tool, and patching the diff (in theory) could be done with another. Additionally, it would have to be versioned somehow, so that if the diff language changes in some way, we could track and avoid messing up the data. Thus, I would need to reference a definitive documentation on the diff format specifics, i.e. the "diff language". Is there such a documentation that you know of? Should it be created?

P.S. I will be happy to contribute to your project as well. Thx!

@kornelski
Copy link
Collaborator

This crate uses a bit different format than the popular tool of the same name.

There isn't a proper doc for it, but the code is simple, so it shouldn't be hard to make one. PRs welcome.

@nyurik
Copy link
Contributor Author

nyurik commented Mar 28, 2024

@kornelski thx, took me a bit of time to process :) Do you think it would be possible to generate the same patching format as the output of your crate? From my understanding, it is totally OK for different libs to use different methods of producing patches, but the actual patching format should be relatively easy to make standard... unless I am missing some aspects that require patch format itself to be different?

@kornelski
Copy link
Collaborator

The patch format is just 3 integers and data, repeated. You can standardize around it if you want. You can make alternative patch implementations if you want.

The main difficulty is in finding appropriately similar patches, which uses a particularly clever sorting algorithm, described in the original bsdiff paper.

The original bsdiff tool also adds bz2 compression on top of the patch format. This tool leaves it up to the user to apply further compression, because bz2 is slow and there are newer algorithms to choose from.

@nyurik
Copy link
Contributor Author

nyurik commented Mar 29, 2024

@kornelski exactly! So would it be possible to generate the patch with your lib, but apply it with the standard bsdiff? That would instantly make it standard again -- simply because the standard cannot dictate HOW the patch is created, only the format of the patch itself. Thus, given two files, A and B, the patch between them can be different (depending on which tool was used to create it), but applying a patch generated by any tool should produce identical B, no matter which tool is used to actually apply it.

Do you know if the original bsdiff standardized those 3 integers + data format, e.g. the endness, perhaps 7bit encoding of those integers, a patch version at the beginning, CRC or size expectations, any other aspects?

@kornelski
Copy link
Collaborator

No, it is not compatible with the standard bsdiff.

@nyurik
Copy link
Contributor Author

nyurik commented Apr 5, 2024

I just discovered another similar rust crate - https://github.com/hucsmn/qbsdiff -- it promises to be bsdiff 4.x compatible - I wonder if these two efforts can be combined? This way we all can have the best of both worlds - good algorithm for creating patches, and cross-compatibility between various implementations. Would also be great to have some benchmarks on different approaches (without compression). Regardless, thanks for all the hard work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants