Switch exports to Zstd, from bzip2 #3148

Open: 1 commit to be merged into base branch dev

linkmauve
Contributor

The main benefit of this format for our users is that it decompresses much faster than bzip2, even at high compression levels.

At level 19 it compresses even better than bzip2 for our files. Hopefully the compression time is still acceptable; if not, we can lower the level so as not to overwork the server, at the price of slightly bigger files.

On my i7-8700K, unarchiving sentences.tar.bz2 takes 15.5s, compared to 994ms for sentences.csv.zst compressed at level 19. The file is 183 MiB compared to 197 MiB with bzip2. We could go down to 167 MiB with level 22 (which decompresses in 941ms), but compression time gets much higher, and I am not sure it is worth it.
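To put those figures in relative terms, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above; the helper names are made up for illustration.

```python
# Figures quoted above for the sentences export (measured on an i7-8700K).
BZ2_SIZE_MIB, BZ2_TIME_S = 197, 15.5
ZSTD_RESULTS = {
    19: (183, 0.994),  # level: (size in MiB, decompression time in seconds)
    22: (167, 0.941),
}


def percent_smaller(new_size, old_size):
    """Size reduction of new_size relative to old_size, in percent."""
    return (old_size - new_size) / old_size * 100


def speedup(new_time, old_time):
    """How many times faster new_time is compared to old_time."""
    return old_time / new_time


for level, (size_mib, time_s) in ZSTD_RESULTS.items():
    print(
        f"zstd -{level}: {percent_smaller(size_mib, BZ2_SIZE_MIB):.1f}% smaller, "
        f"{speedup(time_s, BZ2_TIME_S):.1f}x faster to decompress than bzip2"
    )
```

Compression time is not included because no figures for it are given here.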

The only downside I see to this change is that user automation will have to be changed, so we should announce it before deploying.
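For consumers who want their scripts to survive the switch, one option is to detect the format from the file's leading bytes rather than from its name. The following is only a sketch of that idea and is not part of this PR; the function names are hypothetical, and it only identifies the format. Decompressing zstd would still need a zstd library or the `zstd` command-line tool.

```python
import bz2

# Leading bytes of each format's stream.
BZIP2_MAGIC = b"BZh"              # bzip2 stream header
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic 0xFD2FB528, little-endian


def detect_compression(header: bytes) -> str:
    """Return 'zstd', 'bzip2' or 'unknown' based on the first bytes of a file."""
    if header.startswith(ZSTD_MAGIC):
        return "zstd"
    if header.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "unknown"


def detect_file(path: str) -> str:
    """Read just enough of the file at `path` to identify its format."""
    with open(path, "rb") as f:
        return detect_compression(f.read(4))


# Quick self-check using data compressed in memory with the standard library.
print(detect_compression(bz2.compress(b"demo")))
```

This ignores zstd skippable frames, which start with a different magic number, so treat "unknown" as "inspect manually" rather than "not compressed".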

@jiru
Member

jiru commented Dec 10, 2024

Thank you, this is a welcome improvement, but I wonder if the performance benefits are really worth breaking data consumers’ workflows all of a sudden.

I think it would be nicer for data consumers to have a transition period during which we produce both the old bzip2 files and the new zstd files, while only advertising the zstd files on the https://tatoeba.org/downloads page. We can later remove the bzip2 file generation code whenever we think it’s a good time to do so. We could rely on HTTP logs to check how often the bzip2 files are still being used. You could also add a comment in the bzip2 file generation code to remind developers that it should be removed at some point.
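As a rough illustration of the HTTP log check, something like the following could count successful downloads per format. This is only a sketch: it assumes the server writes Common or Combined Log Format, and the sample lines, paths and addresses are invented for the example.

```python
import re
from collections import Counter

# Matches the request and status part of a Common/Combined Log Format line,
# e.g. "GET /some/file.tar.bz2 HTTP/1.1" 200
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')


def count_downloads(lines):
    """Count successful (2xx) requests for .bz2 and .zst files."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path, status = match.group(1), match.group(2)
        if not status.startswith("2"):
            continue
        path = path.split("?", 1)[0]  # ignore any query string
        if path.endswith(".bz2"):
            counts["bzip2"] += 1
        elif path.endswith(".zst"):
            counts["zstd"] += 1
    return counts


# Hypothetical sample lines; real paths and log layout may differ.
sample = [
    '203.0.113.5 - - [10/Dec/2024:10:00:00 +0000] "GET /exports/sentences.tar.bz2 HTTP/1.1" 200 1234',
    '203.0.113.6 - - [10/Dec/2024:10:01:00 +0000] "GET /exports/sentences.csv.zst HTTP/1.1" 200 1234',
    '203.0.113.7 - - [10/Dec/2024:10:02:00 +0000] "GET /exports/sentences.tar.bz2 HTTP/1.1" 404 0',
]
print(count_downloads(sample))
```

Running this over a few weeks of logs would show whether bzip2 downloads are tapering off.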

Also, I can see your PR removes the tar step. Can you elaborate on this change?

The main benefit of this format for our users is that it decompresses
much faster than bzip2, even at high compression levels.

At level 19 it compresses even better than bzip2 for our files.
Hopefully the compression time is still acceptable; if not, we can
lower the level so as not to overwork the server, at the price of
slightly bigger files.

On my i7-8700K, unarchiving sentences.tar.bz2 takes 15.5s, compared to
994ms for sentences.csv.zst compressed at level 19.  The file is 183 MiB
compared to 197 MiB with bzip2.  We could go down to 167 MiB with level
22 (which decompresses in 941ms), but compression time gets much
higher, and I am not sure it is worth it.

The only downside I see to this change is that user automation will have
to be changed, so we should announce it before deploying.

I’ve also removed the tar step, which only added overhead since each
archive only ever contained a single file.
@linkmauve
Contributor Author

I had mentioned such a transition period in the chat, but nobody reacted. This is now added with a TODO comment.

I’ve also edited the commit message to mention why the tar archive was useless: it only ever contained a single file.
