Refactor ::getStream in the seeker to use only the CDR data #12
Comments
I also agree. The CDR should be trustworthy (enough); I can't imagine a normal scenario where that is not the case. But real-life data might show us a different reality someday!
Like with the 9000+ PB Zip64, we can deal with it if it ever arises 😂
Also, we could easily add a …
Hahaha, OK. That said: is it even performant to fetch and store the files/CDR in a class if the ZIP has 100,000 gifs? Wondering about that now... storing up front versus reading on request?
One thing regarding my "wonder": we can optimize at that level later. Not needed right now!
Good question. We have to retrieve and read the whole CDR anyway to parse out the information we need for other calls and to return to the user (otherwise it would be a seek+read every time we need data that we don't store). We could be selective and only store exactly what we need (this is what currently happens, but I am trying to think about extensibility; we can still store what we use right now and add to it later). The flip side is that each record is only 46 bytes (+ filename + extra field + file comment), so let's call it ~90 bytes per file. That's only 9 megabytes for the CDR of your 100,000 gifs.
If every gif has a maximum-length comment, then we would be at 6.5 gigabytes, which is probably more of an issue! But this could easily be sidestepped by not storing comments and reading them on request, or by only storing them if they are under a certain size and exposing a getter that knows whether to read from memory or order a seek/read from the file.
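To make that arithmetic concrete, here is a rough back-of-envelope sketch; the per-file sizes are the assumed figures from the comments above, not measurements, and the variable names are purely illustrative:

```php
<?php
// Rough memory estimate for keeping the whole CDR in memory (assumed figures).
$files      = 100000; // the hypothetical 100,000 gifs
$fixedEntry = 46;     // fixed part of a CDR entry, in bytes
$variable   = 44;     // assumed filename + extra field, giving ~90 bytes per entry
$maxComment = 65535;  // largest possible per-file comment (16-bit length field)

$typical   = $files * ($fixedEntry + $variable);               // ≈ 9 MB
$worstCase = $files * ($fixedEntry + $variable + $maxComment); // ≈ 6.6 GB

printf("typical: ~%.1f MB, worst case: ~%.1f GB\n", $typical / 1e6, $worstCase / 1e9);
```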
Yep. Request/CPU vs. memory. The eternal struggle of gods and humans.
A small snag: the extra field data in the CDR and the LFH is not the same, so we cannot rely on the length of that field as reported in the CDR; we must read it from the LFH to get the correct offset of the compressed data stream. This means we can't avoid the extra read 😔
I have optimised things a little by only reading the two bits of variable data from the LFH, rather than the whole thing. I have some thoughts about ways to make this pattern more efficient if the user intends to retrieve multiple files from the zip (something like …).
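A minimal sketch of that kind of partial LFH read, assuming a seekable stream over the zip; the function and constant names are hypothetical, not the project's actual API. In the standard Local File Header layout, the file name length and extra field length sit at bytes 26–29, and the compressed data starts right after the 30-byte fixed header, the file name, and the LFH's own extra field:

```php
<?php
// Hypothetical helper: compute the compressed data offset by reading only
// the two length fields from the Local File Header.

const LFH_FIXED_SIZE      = 30; // bytes in the fixed part of the LFH
const LFH_NAME_LEN_OFFSET = 26; // file name length (2 bytes, little-endian);
                                // the extra field length (2 bytes) follows at offset 28

/**
 * @param resource $stream    seekable stream over the zip file
 * @param int      $lfhOffset offset of the Local File Header, taken from the CDR
 * @return int offset of the start of the compressed data stream
 */
function compressedDataOffset($stream, int $lfhOffset): int
{
    // Seek + read just 4 bytes: name length and extra field length.
    fseek($stream, $lfhOffset + LFH_NAME_LEN_OFFSET);
    $raw = fread($stream, 4);
    ['name' => $nameLen, 'extra' => $extraLen] = unpack('vname/vextra', $raw);

    // The LFH's extra field can differ from the CDR copy, so this $extraLen
    // is the one that actually positions the data.
    return $lfhOffset + LFH_FIXED_SIZE + $nameLen + $extraLen;
}
```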
From the getStream PR:
Since the information about an individual file in its CDR entry and its LFH should be identical (otherwise you have an invalid zip file!), we should just trust the CDR information and skip reading the LFH altogether, saving us a read when retrieving the compressed file stream (as we calculate the offset with the CDR's copy of the data).
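For reference, a sketch of that CDR-only calculation (hypothetical field names, not the project's code). As the comments above note, the last term is the one that turned out to be unreliable, because the extra field recorded in the CDR can differ from the one in the LFH:

```php
<?php
// Hypothetical sketch of the offset calculation the PR originally proposed,
// using only values parsed from the central directory entry.

function offsetFromCdrOnly(array $cdrEntry): int
{
    $lfhFixedSize = 30; // fixed portion of the Local File Header

    // $cdrEntry is assumed to hold the parsed central directory values:
    //   'lfh_offset'   relative offset of the local header (CDR offset 42)
    //   'name_length'  file name length                    (CDR offset 28)
    //   'extra_length' extra field length                  (CDR offset 30)
    return $cdrEntry['lfh_offset']
        + $lfhFixedSize
        + $cdrEntry['name_length']
        + $cdrEntry['extra_length']; // <- this CDR value can differ from the LFH's
}
```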