Streaming zip data #3

Open
zardoy opened this issue May 1, 2024 · 6 comments
Labels
enhancement New feature or request


zardoy commented May 1, 2024

Issue: I have a 32 GB ZIP archive, and I don't think I should have to read the entire file into RAM just to look at the list of files within it.

AFAIR, with the modern file reader API it is possible to read a file in chunks, but would it be possible to integrate some of these solutions here? Is there a right way to work with large ZIPs efficiently?

james-pre (Member) commented

This could be possible with IndexFS, which is extended by ZipFS. The file reader API is for use with the File System Access API, so I don't know how that would work with ZipFS. Right now, ZipFS loads the entire file into memory.

The zip file is laid out like this:

file 1
...
file n
archive decryption header
archive extra data record
central directory header 1
...
central directory header n
zip64 end of central directory record
zip64 end of central directory locator
end of central directory record

This means that it would be difficult to stream since you would need to start from the end of the zip data.
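To illustrate why a reader has to start from the end: a minimal sketch, not ZipFS code, that scans the tail of an archive backwards for the end of central directory record and reads where the central directory lives. The function and interface names are hypothetical. ZIP64 archives (which a 32 GB file would need) keep the real values in the zip64 records listed above, and this sketch does not parse those.

```typescript
// Signature of the "end of central directory record" ("PK\x05\x06"), read little-endian.
const EOCD_SIGNATURE = 0x06054b50;
// The fixed part of the record is 22 bytes; an optional trailing comment may follow it.
const EOCD_MIN_SIZE = 22;

// Hypothetical result type for this sketch.
interface EndOfCentralDirectory {
	offset: number; // where the record starts inside `tail`
	entryCount: number; // total number of central directory entries
	directorySize: number; // size of the central directory in bytes
	directoryOffset: number; // offset of the central directory from the start of the archive
}

function findEndOfCentralDirectory(tail: Uint8Array): EndOfCentralDirectory | undefined {
	const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
	// Scan backwards, since the record sits at the very end of the archive.
	for (let i = tail.byteLength - EOCD_MIN_SIZE; i >= 0; i--) {
		if (view.getUint32(i, true) !== EOCD_SIGNATURE) continue;
		return {
			offset: i,
			entryCount: view.getUint16(i + 10, true),
			directorySize: view.getUint32(i + 12, true),
			directoryOffset: view.getUint32(i + 16, true),
		};
	}
	return undefined; // no record found in the bytes provided
}
```

Only the last few kilobytes of the archive are needed for this step, which is what makes listing files without loading everything plausible.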

@james-pre james-pre added the enhancement New feature or request label May 2, 2024
@james-pre james-pre changed the title Is streaming possible? Streaming zip data May 2, 2024
james-pre (Member) commented May 3, 2024

@zardoy,

Today I overhauled the internals for processing zip files (check out v0.3.0) and found some interesting things in the zip spec. I've gained a much better understanding of how zip files work. Section 4.3.5 of the spec caught my eye since it mentions streaming.

My thoughts on how streaming can be implemented:

Since the zip "header"/"end of central directory" is located at the end of the file, some metadata will not be known until the entire file has been streamed. However, the LocalFileHeader (which comes almost immediately before the file data) contains the same metadata as the FileEntry (which occurs after all the files). This metadata may be enough to load and decompress an entire file.

FileEntry.data is what actually parses the file data, which primarily uses values from LocalFileHeader. Adapting FileEntry.data to LocalFileHeader is easy enough:

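public get data(): Uint8Array {
	// The file data presumably starts right after this header: header offset + header size
	const data = this.zipData.slice(this._offset + this.size);
	// Pick the decompression routine that matches this entry's compression method
	const decompress = decompressionMethods[this.compressionMethod];
	// decompress validation check not included for readability
	return decompress(data, this.compressedSize, this.uncompressedSize, this.flag);
}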

All that needs to be done is to pass the zip data buffer to LocalFileHeader (since it is not included right now) and to record the offset of this (i.e. the current local file header) within that buffer.
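A rough sketch of that idea, assuming the standard 30-byte fixed layout of a local file header: it reads the metadata straight from the zip data buffer at a given offset and computes where the file data begins, mirroring this._offset + this.size above. readLocalHeader and LocalHeaderInfo are hypothetical names, not part of ZipFS. One caveat from the spec: when bit 3 of the flag is set, the sizes in the local header are zero and the real values follow the data in a data descriptor.

```typescript
const LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04", read little-endian
const LOCAL_HEADER_FIXED_SIZE = 30; // bytes before the variable-length name and extra field

// Hypothetical result type for this sketch.
interface LocalHeaderInfo {
	flag: number;
	compressionMethod: number;
	compressedSize: number;
	uncompressedSize: number;
	name: string;
	size: number; // total header size, including name and extra field
	dataOffset: number; // where the file data starts in zipData
	sizesDeferred: boolean; // true when bit 3 is set and sizes live in a data descriptor
}

function readLocalHeader(zipData: Uint8Array, offset: number): LocalHeaderInfo {
	const view = new DataView(zipData.buffer, zipData.byteOffset + offset, zipData.byteLength - offset);
	if (view.getUint32(0, true) !== LOCAL_HEADER_SIGNATURE) {
		throw new Error('No local file header at offset ' + offset);
	}
	const flag = view.getUint16(6, true);
	const nameLength = view.getUint16(26, true);
	const extraLength = view.getUint16(28, true);
	const size = LOCAL_HEADER_FIXED_SIZE + nameLength + extraLength;
	const nameStart = offset + LOCAL_HEADER_FIXED_SIZE;
	return {
		flag,
		compressionMethod: view.getUint16(8, true),
		compressedSize: view.getUint32(18, true),
		uncompressedSize: view.getUint32(22, true),
		// Simplification: decode as UTF-8 regardless of the encoding flag.
		name: new TextDecoder().decode(zipData.subarray(nameStart, nameStart + nameLength)),
		size,
		dataOffset: offset + size,
		sizesDeferred: (flag & 0x8) !== 0,
	};
}
```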

Note that even if streaming is possible, you still wouldn't be able to use ZipFS without loading it all into memory (since the entire buffer must be passed to the FS).

This is still in the early stages, but adding streaming support is workable. I hope this has helped.

- JP

zardoy (Author) commented May 4, 2024

> This is still in the early stages, but adding streaming support is workable

Wow, great news, thanks! What do you think of adding support for a File option to the ZIP FS backend options? It has .slice() support for retrieving data in chunks...

james-pre (Member) commented May 4, 2024

File.slice is inherited from Blob.slice. From what I can tell, it works off of the Blob which is already in memory. This would mean your zip file would already be loaded into memory.

ZipFS doesn't copy the buffers, though it does copy all of the data when parsing a struct (from the buffer to the members).

Perhaps it would be possible to access the members on the view directly, though that could get complicated since the struct decorators would need to intercept get/set calls. Feel free to look at utilium!Struct and utilium!struct
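One way to picture accessing members on the view directly, without struct decorators: a plain class whose getters read from a DataView on demand, so nothing is copied out of the buffer until a field is used. This is only a sketch using the standard 46-byte central directory header layout. CentralHeaderView is a hypothetical name and not how utilium implements structs.

```typescript
const CENTRAL_HEADER_SIGNATURE = 0x02014b50; // "PK\x01\x02", read little-endian
const CENTRAL_HEADER_FIXED_SIZE = 46; // bytes before name, extra field and comment

// Hypothetical view-backed accessor: fields are read lazily from the underlying buffer.
class CentralHeaderView {
	private readonly bytes: Uint8Array;
	private readonly offset: number;
	private readonly view: DataView;

	constructor(bytes: Uint8Array, offset: number) {
		this.bytes = bytes;
		this.offset = offset;
		this.view = new DataView(bytes.buffer, bytes.byteOffset + offset, bytes.byteLength - offset);
		if (this.view.getUint32(0, true) !== CENTRAL_HEADER_SIGNATURE) {
			throw new Error('No central directory header at offset ' + offset);
		}
	}

	get compressionMethod(): number {
		return this.view.getUint16(10, true);
	}

	get compressedSize(): number {
		return this.view.getUint32(20, true);
	}

	get uncompressedSize(): number {
		return this.view.getUint32(24, true);
	}

	/** Offset of the matching local file header from the start of the archive. */
	get localHeaderOffset(): number {
		return this.view.getUint32(42, true);
	}

	get name(): string {
		const start = this.offset + CENTRAL_HEADER_FIXED_SIZE;
		const length = this.view.getUint16(28, true);
		// Simplification: decode as UTF-8 regardless of the encoding flag.
		return new TextDecoder().decode(this.bytes.subarray(start, start + length));
	}

	/** Total size of this entry, i.e. the distance to the next central directory header. */
	get size(): number {
		const nameLength = this.view.getUint16(28, true);
		const extraLength = this.view.getUint16(30, true);
		const commentLength = this.view.getUint16(32, true);
		return CENTRAL_HEADER_FIXED_SIZE + nameLength + extraLength + commentLength;
	}
}
```

The trade-off is that every access re-reads the buffer, so the buffer must stay alive and unchanged for as long as the view is used.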

zardoy (Author) commented May 15, 2024

> File.slice is inherited from Blob.slice. From what I can tell, it works off of the Blob which is already in memory. This would mean your zip file would already be loaded into memory.

From what I can see, loading a 1 GB zip with file.arrayBuffer() obviously takes 1 GB of memory, whereas calling file.slice() reads the requested range directly from the FS without loading the whole file into memory.
Calling configure with the ZIP backend then pushes usage to 3 GB of RAM (sometimes it doesn't go down even after a reload). I'm really not sure why it goes so high, but if there is a chance .slice can be used, or any other optimizations can be made, I definitely need it! Right now this RAM usage means I can't use this module on iOS at all.
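A minimal sketch of the File.slice approach described above, assuming the browser reads only the requested range from disk: it fetches just the last bytes of the file, which is where the end of central directory record lives. tailRange and readTail are hypothetical helpers, not ZipFS options, and a real implementation for a multi-gigabyte archive would also need to follow the zip64 records.

```typescript
// The end of central directory record is 22 bytes plus a comment of at most 65535 bytes.
const EOCD_MAX_SIZE = 22 + 0xffff;

// Compute the byte range covering the last `maxBytes` of a file of the given size.
function tailRange(fileSize: number, maxBytes: number = EOCD_MAX_SIZE): { start: number; end: number } {
	return { start: Math.max(0, fileSize - maxBytes), end: fileSize };
}

// Read only the tail of a File/Blob. `start` is returned so that positions found
// inside `bytes` can be translated back to absolute offsets in the archive.
async function readTail(file: Blob, maxBytes: number = EOCD_MAX_SIZE): Promise<{ start: number; bytes: Uint8Array }> {
	const { start, end } = tailRange(file.size, maxBytes);
	const buffer = await file.slice(start, end).arrayBuffer();
	return { start, bytes: new Uint8Array(buffer) };
}
```

With a File from an input element, usage might look like `const { start, bytes } = await readTail(file);`, after which only those bytes need to be parsed to locate the central directory.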

james-pre (Member) commented

I'm currently working on releasing core 0.11 and the Emscripten backend. After that, I would be happy to address the ZIP backend. Hopefully that does not delay your project too much.

> ... calling file.slice() reads the requested offset directly from FS without fully loading file into the memory

This actually makes a lot of sense. I apologize if I mistakenly thought the entire blob was preloaded.

> I'm really not sure why it goes so high, but if there is a chance .slice can be used or any other optimizations can be made I definitely need it!

I will see what I can do, though processing ZIP files is convoluted already so I'm not sure what other optimizations I can make.
