Refactor ::getStream in the seeker to use only the CDR data #12

Closed
digitaldogsbody opened this issue Sep 21, 2022 · 10 comments
Labels: cantfix (Things we can't rectify due to some limitation), enhancement (New feature or request)
@digitaldogsbody (Owner)

From the getStream PR:

Thinking more, we might be able to dodge having to request and retrieve the local file header, since we only read it here to get the length of the filename and the extra field, and the type of compression used, all of which we already have from the Central Directory.

So if we can trust that the Central Directory and the Local File Header will always match, there might be a bit more efficiency to be squeezed out here. Probably doesn't make much difference for local files, but it would save one read for remote files.

Since the information about an individual file in its CDR entry and its LFH should be identical (otherwise you have an invalid zip file!), we should just trust the CDR information and skip reading the LFH altogether, saving us a read when retrieving the compressed file stream (since we can calculate the offset from the CDR's copy of the data).
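The CDR-only offset calculation being proposed could look like this Python sketch (field offsets come from the ZIP format's fixed central-directory layout; the function name is illustrative, and note that later comments in this thread find the extra-field assumption doesn't always hold):

```python
import struct

def data_offset_from_cdr(cdr_entry: bytes) -> int:
    """Compute the compressed-data offset using only a Central Directory
    record, assuming the LFH's filename and extra field have the same
    lengths as the CDR's copies (the optimistic case discussed here)."""
    # Little-endian fields in the 46-byte fixed CDR record:
    filename_len = struct.unpack_from("<H", cdr_entry, 28)[0]  # filename length
    extra_len = struct.unpack_from("<H", cdr_entry, 30)[0]     # extra field length
    lfh_offset = struct.unpack_from("<I", cdr_entry, 42)[0]    # offset of local header
    # The LFH fixed part is 30 bytes, followed by filename and extra field.
    return lfh_offset + 30 + filename_len + extra_len
```

This saves the extra seek+read of the LFH entirely, as long as the CDR and LFH agree.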

@DiegoPino (Collaborator)

I also agree. The CDR should be trustworthy (enough); I can't imagine a normal scenario where that is not the case. But real-life data might show us a different reality someday!

@digitaldogsbody (Owner, Author)

Like with the 9000+ PB Zip64, we can deal with it if it ever arises 😂

@digitaldogsbody (Owner, Author)

Also, we could easily add a ::verifyHeaders method to the file information class from #13 that compares the CDR and LFH entries, which could be used for debugging or for verifying the correctness of the zip.
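A sketch of what that ::verifyHeaders comparison might do, in Python for illustration (the function name, field set, and return shape are assumptions; the byte offsets are the ZIP format's fixed CDR and LFH layouts):

```python
import struct

# Offsets of the fields that appear in BOTH the Central Directory record
# and the Local File Header, per the ZIP fixed-header layouts.
CDR_OFFSETS = {"flags": 8, "compression": 10, "crc32": 16,
               "compressed_size": 20, "uncompressed_size": 24}
LFH_OFFSETS = {"flags": 6, "compression": 8, "crc32": 14,
               "compressed_size": 18, "uncompressed_size": 22}
U32_FIELDS = {"crc32", "compressed_size", "uncompressed_size"}

def verify_headers(cdr: bytes, lfh: bytes) -> list:
    """Return the names of shared fields that disagree (empty list = OK)."""
    mismatches = []
    for name, cdr_off in CDR_OFFSETS.items():
        fmt = "<I" if name in U32_FIELDS else "<H"
        cdr_val = struct.unpack_from(fmt, cdr, cdr_off)[0]
        lfh_val = struct.unpack_from(fmt, lfh, LFH_OFFSETS[name])[0]
        if cdr_val != lfh_val:
            mismatches.append(name)
    return mismatches
```

A non-empty result would indicate an inconsistent (arguably invalid) archive, which is exactly the debugging signal described above.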

@DiegoPino (Collaborator)

hahaha. OK. That said: is it even performant to fetch and store the files/CDR in a class if the ZIP has 100,000 gifs? I'm wondering about that now... storing it all up front versus reading on request?

@DiegoPino (Collaborator)

One thing regarding my "wonder": we can optimize at that level later. It's not needed right now!

@digitaldogsbody (Owner, Author) commented Sep 21, 2022

> hahaha. OK. So that said. Is it even performant to get/store in a class the files/CDR if the ZIP has 100,000 gifs? Wonder that now .... that versus on - request - ?

Good question. We have to retrieve and read it all anyway to parse out the information we need for other calls and to return to the user (otherwise it becomes a seek+read every time we need data that we don't store). We could be selective and store only exactly what we need (this is what currently happens, but I am trying to think about extensibility: we can still store what we use right now and add to it later).

The flip side is that each record is only 46 bytes (+ filename + extra field + file comment), so let's call it ~90 bytes per file. That's only about 9 megabytes for the CDR of your 100,000 gifs.
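The back-of-envelope arithmetic above, written out (the ~44-byte average for the variable parts is the assumption behind the "~90 bytes per file" figure):

```python
# Memory estimate for holding the whole Central Directory in memory.
FIXED = 46       # fixed CDR record size in bytes (per the ZIP format)
VARIABLE = 44    # assumed average filename + extra field + comment size

files = 100_000
total_bytes = files * (FIXED + VARIABLE)
print(total_bytes / 1_000_000)  # 9.0 -> roughly 9 MB for 100,000 entries
```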

@digitaldogsbody (Owner, Author)

If every gif has a maximum-length comment (65,535 bytes, since the comment length field is 16 bits), then we would be at about 6.5 gigabytes, which is probably more of an issue! But this could easily be sidestepped by not storing comments and reading them on request, or by only storing them if they are under a certain size and exposing a getter that knows whether to read from memory or perform a seek/read on the file.
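The size-capped getter idea could be sketched like this (Python illustration; the class shape, threshold, and the `read_at(offset, length)` seek+read callback are all assumptions, not the library's actual API):

```python
COMMENT_CACHE_LIMIT = 1024  # bytes; arbitrary cut-off for in-memory storage

class FileEntry:
    """Hypothetical per-file record: caches small comments, defers big ones."""

    def __init__(self, comment_offset, comment_len, comment=None):
        self.comment_offset = comment_offset
        self.comment_len = comment_len
        # Only keep the comment in memory if it is under the threshold.
        if comment is not None and len(comment) <= COMMENT_CACHE_LIMIT:
            self._comment = comment
        else:
            self._comment = None

    def get_comment(self, read_at):
        """read_at(offset, length) performs a seek+read on the archive."""
        if self._comment is not None:
            return self._comment  # served from memory
        return read_at(self.comment_offset, self.comment_len)  # on demand
```

The getter hides the storage decision from the caller, which is the point: callers ask for the comment the same way regardless of where it lives.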

@DiegoPino (Collaborator) commented Sep 21, 2022 via email

@digitaldogsbody (Owner, Author)

A small snag: the extra field data in the CDR and the LFH is not necessarily the same, so we cannot rely on the extra field length reported in the CDR; we must read it from the LFH to get the correct offset of the compressed data stream.

This means we can't avoid the extra read 😔

@digitaldogsbody (Owner, Author)

I have optimised things a little by reading only the two pieces of variable-length data from the LFH, rather than the whole thing.
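One way to keep that unavoidable LFH read small, sketched in Python (the `read_at(offset, length)` helper is an assumption; the byte offsets are the ZIP format's fixed LFH layout, where the filename and extra-field lengths sit at bytes 26-29 of the 30-byte header):

```python
import struct

def data_offset_via_lfh(read_at, lfh_offset: int) -> int:
    """Compute the true compressed-data offset from the LFH's own
    length fields, reading only 4 bytes instead of the whole header."""
    # Two little-endian uint16s: filename length, then extra field length.
    lengths = read_at(lfh_offset + 26, 4)
    filename_len, extra_len = struct.unpack("<HH", lengths)
    # Data stream starts after the 30-byte fixed header + variable parts.
    return lfh_offset + 30 + filename_len + extra_len
```

Because the LFH's extra field can differ from the CDR's, these lengths are the authoritative ones for locating the data stream.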

I have some thoughts about ways to make this pattern more efficient when the user intends to retrieve multiple files from the zip (something like ::getMultipleChunks() that would allow retrieving all, or a requested subset, of the LFH records at once, meaning that remote files could take advantage of the multi-range Range: bytes=A-B,C-D,E-F syntax), but for now, sadly, this is a close as "cantfix".

@digitaldogsbody digitaldogsbody added the cantfix Things we can't rectify due to some limitation label Oct 5, 2022