Possible enhancements to ZIM #951
Comments
This is now tracked in openzim/warc2zim#436
Please open an issue dedicated to this suggestion so we better track it.
There is no plan to make ZIM extendable for now, and it is most probably not going to happen in the near future. It would add too much complexity in terms of development and maintenance compared to the "cost" of hardware. I'm not saying it will never happen, but we have many features in the backlog with a more obvious return on investment.
I don't know
Yes, I've opened #952 for this
I've used browsertrix + warc2zim to build a 100GB+ ZIM, and I'd like to share some thoughts on the experience. Feel free to move this issue to discussions.
Summary:
About the first item: it should be possible to delegate this compression to an external process. I wonder whether it would be better for this to happen in browsertrix or in warc2zim. For larger images (say >3MB), it would make sense to send them to a compressor before they are written to the WARC/ZIM.
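As a rough illustration of the delegation idea, here is a minimal Python sketch. It is not warc2zim or browsertrix code: `maybe_compress`, `LARGE_IMAGE_BYTES` and the `compress` callable are hypothetical names, and the external compressor is assumed to be anything that takes the raw bytes plus the content type and returns new bytes.

```python
from typing import Callable

# Assumed threshold, taken from the ">3MB" figure mentioned above.
LARGE_IMAGE_BYTES = 3 * 1024 * 1024


def maybe_compress(
    payload: bytes,
    content_type: str,
    compress: Callable[[bytes, str], bytes],
    threshold: int = LARGE_IMAGE_BYTES,
) -> bytes:
    """Return the bytes to store, delegating large images to `compress`.

    `compress` is a hypothetical hook standing in for an external process
    (for example a wrapper around an image optimizer).
    """
    # Only images are candidates for recompression.
    if not content_type.lower().startswith("image/"):
        return payload
    # Small images are stored as-is.
    if len(payload) <= threshold:
        return payload
    try:
        smaller = compress(payload, content_type)
    except Exception:
        # A failing compressor must never cost us the record itself.
        return payload
    # Keep the original when compression brings no gain.
    return smaller if len(smaller) < len(payload) else payload
```

Because the compressor is injected as a callable, the same check could sit on either side (crawler or converter); the sketch deliberately takes no position on which is the better place.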
The full-text search (FTS) in xapian currently indexes the entire text of each webpage, with accents removed. Not all of that text is relevant, which leads to suboptimal results when searching inside the ZIM. There is a library called readability whose algorithm scores all DOM nodes and selects the one that holds the actual webpage content (by "actual" I mean excluding things like "Similar products" or "See also" nodes). It would therefore be possible to index only the relevant content, or even to replace the original content with what readability believes is the relevant content. Either way, the search results should improve.
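To make the idea concrete, here is a small Python sketch of a readability-style heuristic. It is not the readability library's algorithm and not what warc2zim does today: it is a much simplified stand-in that scores container elements by text length and comma count, penalizes link-heavy blocks, and then folds accents on the winning text. All names (`ContentScorer`, `extract_main_text`, `fold_accents`) and the tag lists are assumptions made for illustration.

```python
import unicodedata
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real scorer uses far richer signals.
CANDIDATE_TAGS = {"article", "main", "section", "div", "td"}
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer"}


@dataclass
class Node:
    tag: str
    chunks: list = field(default_factory=list)
    link_chars: int = 0

    def score(self) -> float:
        total = sum(len(c) for c in self.chunks)
        if total == 0:
            return 0.0
        commas = sum(c.count(",") for c in self.chunks)
        link_density = self.link_chars / total
        # Long, comma-rich text scores high; link-only blocks score zero.
        return (total + 10 * commas) * (1.0 - link_density)


class ContentScorer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_nodes = []  # candidate containers currently open
        self.all_nodes = []   # every candidate container seen
        self.skip_depth = 0
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CANDIDATE_TAGS:
            node = Node(tag)
            self.open_nodes.append(node)
            self.all_nodes.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CANDIDATE_TAGS:
            # Close the nearest matching container; tolerate unclosed tags.
            for i in range(len(self.open_nodes) - 1, -1, -1):
                if self.open_nodes[i].tag == tag:
                    del self.open_nodes[i:]
                    break

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not text or not self.open_nodes:
            return
        # Text is credited to the innermost open container only.
        node = self.open_nodes[-1]
        node.chunks.append(text)
        if self.link_depth:
            node.link_chars += len(text)


def extract_main_text(html: str) -> str:
    """Return the text of the highest-scoring container, or ''."""
    parser = ContentScorer()
    parser.feed(html)
    parser.close()
    best = max(parser.all_nodes, key=Node.score, default=None)
    return " ".join(best.chunks) if best else ""


def fold_accents(text: str) -> str:
    """Strip combining marks so that 'café' is indexed as 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With something like this, only `fold_accents(extract_main_text(page_html))` would be handed to the indexer, while the stored page stays untouched. Crediting text only to the innermost container is a deliberate shortcut; scorers of this kind typically also propagate part of a node's score to its ancestors.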
The last item is about having to keep the WARC around in order to expand the ZIM, and the high cost of rebuilding it. By high cost I mean a lot of disk I/O and time. I wonder whether WACZ can address this. I have not yet worked out how the spec would allow it, so I'd like to learn more about how WACZ could help, or whether there are plans to make ZIM extendable.
I also have a question: are there any tools that can replay a large (100GB+) WACZ file?