Refactor ::getStream in the seeker to use only the CDR data #12

Closed
digitaldogsbody opened this issue Sep 21, 2022 · 10 comments
Labels: cantfix (Things we can't rectify due to some limitation), enhancement (New feature or request)
@digitaldogsbody (Owner)

From the getStream PR:

Thinking more, we might be able to dodge having to request and retrieve the local file header, since we only read it here to get the length of the filename and the extra field, and the type of compression used, all of which we already have from the Central Directory.

So if we can trust that the Central Directory and the Local File Header will always match, there might be a bit more efficiency to be squeezed out here. Probably doesn't make much difference for local files, but it would save one read for remote files.

Since the information about an individual file in its CDR entry and its LFH should be identical (otherwise you have an invalid zip file!), we should just trust the CDR information and skip reading the LFH altogether, saving us a read when retrieving the compressed file stream (since we can calculate the offset from the CDR's copy of the data).
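The CDR-only offset calculation being proposed could look like this Python sketch (field offsets come from the ZIP format's fixed central-directory layout; the function name is illustrative, and note that later comments in this thread find the extra-field assumption doesn't always hold):

```python
import struct

def data_offset_from_cdr(cdr_entry: bytes) -> int:
    """Compute the compressed-data offset using only a Central Directory
    record, assuming the LFH's filename and extra field have the same
    lengths as the CDR's copies (the optimistic case discussed here)."""
    # Little-endian fields in the 46-byte fixed CDR record:
    filename_len = struct.unpack_from("<H", cdr_entry, 28)[0]  # filename length
    extra_len = struct.unpack_from("<H", cdr_entry, 30)[0]     # extra field length
    lfh_offset = struct.unpack_from("<I", cdr_entry, 42)[0]    # offset of local header
    # The LFH fixed part is 30 bytes, followed by filename and extra field.
    return lfh_offset + 30 + filename_len + extra_len
```

This saves the extra seek+read of the LFH entirely, as long as the CDR and LFH agree.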

@DiegoPino (Collaborator)

I also agree. The CDR should be trustworthy (enough); I can't imagine a normal scenario where that is not the case. But real-life data might show us a different reality someday!

@digitaldogsbody (Owner, Author)

Like with the 9000+ PB Zip64, we can deal with it if it ever arises 😂

@digitaldogsbody (Owner, Author)

Also, we could easily add a ::verifyHeaders method to the file information class from #13 that compares the CDR and LFH entries, which could be used for debugging or for verifying the correctness of the zip.
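A sketch of what that ::verifyHeaders comparison might do, in Python for illustration (the function name, field set, and return shape are assumptions; the byte offsets are the ZIP format's fixed CDR and LFH layouts):

```python
import struct

# Offsets of the fields that appear in BOTH the Central Directory record
# and the Local File Header, per the ZIP fixed-header layouts.
CDR_OFFSETS = {"flags": 8, "compression": 10, "crc32": 16,
               "compressed_size": 20, "uncompressed_size": 24}
LFH_OFFSETS = {"flags": 6, "compression": 8, "crc32": 14,
               "compressed_size": 18, "uncompressed_size": 22}
U32_FIELDS = {"crc32", "compressed_size", "uncompressed_size"}

def verify_headers(cdr: bytes, lfh: bytes) -> list:
    """Return the names of shared fields that disagree (empty list = OK)."""
    mismatches = []
    for name, cdr_off in CDR_OFFSETS.items():
        fmt = "<I" if name in U32_FIELDS else "<H"
        cdr_val = struct.unpack_from(fmt, cdr, cdr_off)[0]
        lfh_val = struct.unpack_from(fmt, lfh, LFH_OFFSETS[name])[0]
        if cdr_val != lfh_val:
            mismatches.append(name)
    return mismatches
```

A non-empty result would indicate an inconsistent (arguably invalid) archive, which is exactly the debugging signal described above.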

@DiegoPino (Collaborator)

hahaha. OK. That said: is it even performant to fetch and store the files/CDR in a class if the ZIP has 100,000 gifs? I'm wondering about that now... storing it all up front versus reading on request?

@DiegoPino (Collaborator)

One thing regarding my "wonder": we can optimize at that level later. It's not needed right now!

@digitaldogsbody (Owner, Author) commented Sep 21, 2022

> hahaha. OK. So that said. Is it even performant to get/store in a class the files/CDR if the ZIP has 100,000 gifs? Wonder that now .... that versus on - request - ?

Good question. We have to retrieve and read it all anyway to parse out the information we need for other calls and to return to the user (otherwise it becomes a seek+read every time we need data that we don't store). We could be selective and store only exactly what we need (this is what currently happens, but I am trying to think about extensibility: we can still store what we use right now and add to it later).

The flip side is that each record is only 46 bytes (+ filename + extra field + file comment), so let's call it ~90 bytes per file. That's only about 9 megabytes for the CDR of your 100,000 gifs.
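The back-of-envelope arithmetic above, written out (the ~44-byte average for the variable parts is the assumption behind the "~90 bytes per file" figure):

```python
# Memory estimate for holding the whole Central Directory in memory.
FIXED = 46       # fixed CDR record size in bytes (per the ZIP format)
VARIABLE = 44    # assumed average filename + extra field + comment size

files = 100_000
total_bytes = files * (FIXED + VARIABLE)
print(total_bytes / 1_000_000)  # 9.0 -> roughly 9 MB for 100,000 entries
```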

@digitaldogsbody (Owner, Author)

If every gif has a maximum-length comment (65,535 bytes, since the comment length field is 16 bits), then we would be at about 6.5 gigabytes, which is probably more of an issue! But this could easily be sidestepped by not storing comments and reading them on request, or by only storing them if they are under a certain size and exposing a getter that knows whether to read from memory or perform a seek/read on the file.
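The size-capped getter idea could be sketched like this (Python illustration; the class shape, threshold, and the `read_at(offset, length)` seek+read callback are all assumptions, not the library's actual API):

```python
COMMENT_CACHE_LIMIT = 1024  # bytes; arbitrary cut-off for in-memory storage

class FileEntry:
    """Hypothetical per-file record: caches small comments, defers big ones."""

    def __init__(self, comment_offset, comment_len, comment=None):
        self.comment_offset = comment_offset
        self.comment_len = comment_len
        # Only keep the comment in memory if it is under the threshold.
        if comment is not None and len(comment) <= COMMENT_CACHE_LIMIT:
            self._comment = comment
        else:
            self._comment = None

    def get_comment(self, read_at):
        """read_at(offset, length) performs a seek+read on the archive."""
        if self._comment is not None:
            return self._comment  # served from memory
        return read_at(self.comment_offset, self.comment_len)  # on demand
```

The getter hides the storage decision from the caller, which is the point: callers ask for the comment the same way regardless of where it lives.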

@DiegoPino (Collaborator) commented Sep 21, 2022 via email

@digitaldogsbody (Owner, Author)

A small snag: the extra field data in the CDR and the LFH is not necessarily the same, so we cannot rely on the extra field length reported in the CDR; we must read it from the LFH to get the correct offset of the compressed data stream.

This means we can't avoid the extra read 😔

@digitaldogsbody (Owner, Author)

I have optimised things a little by reading only the two pieces of variable-length data from the LFH, rather than the whole thing.
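One way to keep that unavoidable LFH read small, sketched in Python (the `read_at(offset, length)` helper is an assumption; the byte offsets are the ZIP format's fixed LFH layout, where the filename and extra-field lengths sit at bytes 26-29 of the 30-byte header):

```python
import struct

def data_offset_via_lfh(read_at, lfh_offset: int) -> int:
    """Compute the true compressed-data offset from the LFH's own
    length fields, reading only 4 bytes instead of the whole header."""
    # Two little-endian uint16s: filename length, then extra field length.
    lengths = read_at(lfh_offset + 26, 4)
    filename_len, extra_len = struct.unpack("<HH", lengths)
    # Data stream starts after the 30-byte fixed header + variable parts.
    return lfh_offset + 30 + filename_len + extra_len
```

Because the LFH's extra field can differ from the CDR's, these lengths are the authoritative ones for locating the data stream.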

I have some thoughts about ways to make this pattern more efficient when the user intends to retrieve multiple files from the zip (something like ::getMultipleChunks() that would allow retrieving all, or a requested subset, of the LFH records at once, meaning that remote files could take advantage of the multi-range Range: bytes=A-B,C-D,E-F syntax), but for now, sadly, this is a close as "cantfix".

@digitaldogsbody digitaldogsbody added the cantfix Things we can't rectify due to some limitation label Oct 5, 2022