Possible enhancements to ZIM #951

Open
wsdookadr opened this issue Feb 10, 2025 · 2 comments

wsdookadr commented Feb 10, 2025

I've used browsertrix + warc2zim to build a 100GB+ zim. I'm going to share some thoughts on this. Feel free to move this issue to discussions.

Summary:

  • warc2zim could compress images when creating the ZIM, to make the ZIM smaller
  • full-text search (FTS) could index only the relevant parts of the archived web pages
  • extending a ZIM after creation requires recreating it completely, which means the WARC has to be kept around; it would be great if a ZIM could be extended without the WARC

About the first item, it should be possible to offload or delegate this compression to an external process. I do wonder whether it would be better for this to happen in browsertrix or in warc2zim. For larger images (say, over 3 MB), it would make sense to send them to a compressor and then write the result to the WARC/ZIM.
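
As a rough illustration of the delegation idea, here is a minimal Python sketch. It assumes a compressor that reads an image on stdin and writes the result to stdout. `maybe_compress`, `SIZE_THRESHOLD` and `FAKE_COMPRESSOR` are hypothetical names, not part of warc2zim or browsertrix, and the stand-in command only truncates its input so that the example runs anywhere.

```python
import subprocess
import sys

# Hypothetical threshold, taken from the "over 3 MB" figure above.
SIZE_THRESHOLD = 3 * 1024 * 1024


def maybe_compress(payload: bytes, compress_cmd: list[str],
                   threshold: int = SIZE_THRESHOLD) -> bytes:
    """Return a recompressed payload, or the original if that is not a win."""
    if len(payload) <= threshold:
        return payload  # small images are written as-is
    try:
        # Assumption: the external tool reads stdin and writes stdout.
        result = subprocess.run(compress_cmd, input=payload,
                                capture_output=True, check=True, timeout=60)
    except (OSError, subprocess.SubprocessError):
        return payload  # compressor missing, failed, or timed out
    # Keep the original when the output is empty or not smaller.
    if 0 < len(result.stdout) < len(payload):
        return result.stdout
    return payload


# Stand-in "compressor" for demonstration only: it truncates its input.
FAKE_COMPRESSOR = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read()[:10])",
]
```

Falling back to the original bytes on any failure means a broken or missing compressor never blocks the crawl or the conversion; it only costs the size saving.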

The FTS in xapian currently indexes the entire text of each web page, with accents removed. Not all of that text is relevant, which leads to suboptimal results when searching inside the ZIM. There is a library called readability that scores all DOM nodes and selects the one that holds the actual page content ("actual" meaning not things like "Similar products" or "See also" nodes). So it would be possible to index only the relevant content, or even to replace the original content with what readability believes is the relevant content. Either way, the search results would improve.
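
To make the idea concrete, here is a minimal sketch using only the Python standard library. It is a much-simplified stand-in for readability's scoring, not its actual algorithm and not how warc2zim or libzim index content: it credits paragraph text to the nearest enclosing container and keeps the container with the most text. `ParagraphScorer`, `main_text`, `strip_accents` and `SAMPLE_HTML` are hypothetical names, and reasonably well-formed markup is assumed.

```python
import unicodedata
from html.parser import HTMLParser

# Elements treated as candidate content containers (an assumption).
CONTAINERS = {"article", "main", "section", "div", "body"}


class ParagraphScorer(HTMLParser):
    """Collect <p> text per enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # indices of the currently open containers
        self.texts = []   # list of paragraph strings, one list per container
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.in_p = False
            self.texts.append([])
            self.stack.append(len(self.texts) - 1)
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            self.in_p = False
            if self.stack:
                self.stack.pop()
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Credit paragraph text to the innermost open container.
        if self.in_p and self.stack and data.strip():
            self.texts[self.stack[-1]].append(data.strip())


def main_text(html: str) -> str:
    """Return the paragraph text of the container holding the most text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    best = max(scorer.texts,
               key=lambda parts: sum(len(p) for p in parts), default=[])
    return " ".join(best)


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


SAMPLE_HTML = (
    '<body><div id="content"><p>Café crème is a long paragraph about '
    'coffee.</p><p>More text here.</p></div>'
    '<div class="related"><p>See also</p></div></body>'
)

if __name__ == "__main__":
    # Only the main container's text would be handed to the indexer.
    print(strip_accents(main_text(SAMPLE_HTML)))
```

With this sample, the "See also" block is left out of the text that would be indexed, which is the effect described above.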

The last item is about having to keep the WARC around in order to expand the ZIM, and about the high cost of rebuilding it. By high cost I mean a lot of disk I/O and time. I wonder if WACZ can address this. I have yet to learn how the spec would allow it, and I'd like to learn more about how WACZ could help, or whether there are plans to make ZIM extendable.

I also have a question: are there any tools that can replay a large (100GB+) WACZ file?

@benoit74

> About the first item, it should be possible to offload or delegate this compression to an external process. I do wonder whether it would be better for this to happen in browsertrix or in warc2zim. For larger images (say, over 3 MB), it would make sense to send them to a compressor and then write the result to the WARC/ZIM.

This is now tracked in openzim/warc2zim#436.

> The FTS in xapian currently indexes the entire text of each web page, with accents removed. Not all of that text is relevant, which leads to suboptimal results when searching inside the ZIM. There is a library called readability that scores all DOM nodes and selects the one that holds the actual page content ("actual" meaning not things like "Similar products" or "See also" nodes). So it would be possible to index only the relevant content, or even to replace the original content with what readability believes is the relevant content. Either way, the search results would improve.

Please open an issue dedicated to this suggestion so we can track it better.

> The last item is about having to keep the WARC around in order to expand the ZIM, and about the high cost of rebuilding it. By high cost I mean a lot of disk I/O and time. I wonder if WACZ can address this. I have yet to learn how the spec would allow it, and I'd like to learn more about how WACZ could help, or whether there are plans to make ZIM extendable.

There is no plan to make ZIM extendable for now, and it is most probably not going to happen in the near future. It would add too much complexity in terms of development and maintenance compared to the "cost" of hardware. I'm not saying it will never happen, but we have many features in the backlog with a more obvious return on investment.

> I also have a question: are there any tools that can replay a large (100GB+) WACZ file?

I don't know

@wsdookadr

> Please open an issue dedicated to this suggestion so we can track it better.

Yes, I've opened #952 for this.
