Possible enhancements to ZIM #951

wsdookadr · 2025-02-10T13:12:31Z

I've used browsertrix + warc2zim to build a 100GB+ ZIM. I'm going to share some thoughts on this. Feel free to move this issue to discussions.

Summary:

- warc2zim could do image compression when creating the ZIM in the interest of making the ZIM smaller
- FTS for relevant parts of the web pages archived
- extensibility of ZIM after creation requires complete recreation which means the WARC needs to be kept around; if it would be extensible without requiring the WARC that would be great

About the first item, it should be possible to offload or delegate this compression to an external process. I do wonder if it would be better for this to happen in browsertrix or in warc2zim. For larger images like >3MB it would make sense for them to be sent to a compressor and then written to WARC/ZIM.
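
To make the idea concrete, here is a minimal sketch of the size-threshold delegation described above. It is not how warc2zim or browsertrix work today; `maybe_compress`, `external_compressor` and `SIZE_THRESHOLD` are hypothetical names, and the external command is assumed to be any tool that reads an image on stdin and writes the compressed image to stdout.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: "larger images" means anything above 3MB, as suggested above.
SIZE_THRESHOLD = 3 * 1024 * 1024


def external_compressor(command: Sequence[str]) -> Callable[[bytes], bytes]:
    """Wrap an external command (hypothetical) that reads stdin, writes stdout."""
    def run(payload: bytes) -> bytes:
        result = subprocess.run(
            list(command), input=payload, stdout=subprocess.PIPE, check=True
        )
        return result.stdout
    return run


def maybe_compress(
    payload: bytes,
    compress: Callable[[bytes], bytes],
    threshold: int = SIZE_THRESHOLD,
) -> bytes:
    """Return the bytes that should be written to the WARC/ZIM."""
    # Small images are written unchanged.
    if len(payload) <= threshold:
        return payload
    try:
        candidate = compress(payload)
    except (subprocess.SubprocessError, OSError):
        # If the external process fails, keep the original image.
        return payload
    # Only keep the result if it is non-empty and actually smaller.
    if candidate and len(candidate) < len(payload):
        return candidate
    return payload
```

The compressor is passed in as a callable so the same check could sit in either tool, and so the external process can be swapped out without touching the size logic.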

The FTS in xapian currently uses the entire text with accents removed in each webpage, but actually not all of it is relevant and this leads to suboptimal search results when searching inside the ZIM. On this, there's this library called readability that has an algorithm for scoring all DOM nodes and selecting the one where the actual webpage content is stored (what is meant by "actual" is not things like "Similar products" or "See also" nodes). So it would be possible to only index the relevant content or even replace the original content with what readability believes is the relevant content. But either way the search results would improve as a result of this.
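
As a rough illustration of "index only the relevant node", here is a stdlib-only sketch. It is not the readability algorithm, which scores nodes far more carefully; it is a crude stand-in that credits text to its innermost container, penalizes link text, and returns the text of the best-scoring container. All names (`MainContentExtractor`, `extract_main_text`, the tag sets) are hypothetical.

```python
from html.parser import HTMLParser

CONTAINERS = {"article", "main", "section", "div", "td"}
SKIPPED = {"script", "style", "nav", "aside", "footer", "header"}
VOID = {"br", "img", "hr", "input", "meta", "link", "area", "base",
        "col", "embed", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []        # currently open tags, innermost last
        self.open_blocks = []  # currently open containers, innermost last
        self.blocks = []       # every container seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get a closing tag
        self.stack.append(tag)
        if tag in CONTAINERS:
            block = {"text": [], "score": 0}
            self.open_blocks.append(block)
            self.blocks.append(block)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag, ignore it
        # Pop up to and including the matching tag (tolerates unclosed tags).
        while self.stack:
            closed = self.stack.pop()
            if closed in CONTAINERS:
                self.open_blocks.pop()
            if closed == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.open_blocks:
            return
        if any(tag in SKIPPED for tag in self.stack):
            return
        block = self.open_blocks[-1]
        if "a" in self.stack:
            block["score"] -= len(text)  # link-heavy blocks are likely navigation
        else:
            block["score"] += len(text)
            block["text"].append(text)


def extract_main_text(html: str) -> str:
    """Return the text of the best-scoring container, or '' if there is none."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    best = max(parser.blocks, key=lambda block: block["score"])
    return " ".join(best["text"])
```

The returned string is what would be handed to the indexer instead of the full page text; the stored page itself would stay untouched.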

The last item is about having to keep around the WARC in order to expand the ZIM and the high cost to rebuild it. When I say high cost what I really mean is a lot of disk I/O and time. I wonder if WACZ can address this. I've yet to learn how the spec would allow this to happen, I'd like to learn more about how WACZ could address this or if there are plans to make ZIM extendable.

Oh I also have a question if there are any tools that can replay a large (100GB+) WACZ file.

benoit74 · 2025-02-10T14:56:56Z

> About the first item, it should be possible to offload or delegate this compression to an external process. I do wonder if it would be better for this to happen in browsertrix or in warc2zim. For larger images like >3MB it would make sense for them to be sent to a compressor and then written to WARC/ZIM.

This is now tracked in openzim/warc2zim#436

> The FTS in xapian currently uses the entire text with accents removed in each webpage, but actually not all of it is relevant and this leads to suboptimal search results when searching inside the ZIM. On this, there's this library called readability that has an algorithm for scoring all DOM nodes and selecting the one where the actual webpage content is stored (what is meant by "actual" is not things like "Similar products" or "See also" nodes). So it would be possible to only index the relevant content or even replace the original content with what readability believes is the relevant content. But either way the search results would improve as a result of this.

Please open an issue dedicated to this suggestion so we better track it.

> The last item is about having to keep around the WARC in order to expand the ZIM and the high cost to rebuild it. When I say high cost what I really mean is a lot of disk I/O and time. I wonder if WACZ can address this. I've yet to learn how the spec would allow this to happen, I'd like to learn more about how WACZ could address this or if there are plans to make ZIM extendable.

There is no plan to make ZIM extendable for now, and it is most probably not going to happen anytime in the near future. It would add too much complexity in terms of development and maintenance compared to the "cost" of hardware. Not saying it will never happen, but we have many features in the backlog with a more obvious return on investment.

> Oh I also have a question if there are any tools that can replay a large (100GB+) WACZ file.

I don't know

wsdookadr · 2025-02-10T15:44:52Z

> Please open an issue dedicated to this suggestion so we better track it.

Yes I've opened #952 for this

benoit74 mentioned this issue Feb 10, 2025

Convert and compress pictures before adding them to the ZIM openzim/warc2zim#436

Open
