Stream RO-Crate Zip #212

dnlbauer · 2025-01-15T12:53:24Z

This is a draft implementation showing how #205 could be solved.

The implementation is fairly straightforward for file entities, where I could simply add a method to stream the file contents.
For dataset entities, finding a good solution was more challenging. Creating a zip stream requires that all file handling happens in one place, where the zip stream object is created (rocrate.stream_zip in this case). However, dataset entities currently handle the file-writing operation internally and don't expose the files they contain in an easy-to-use way. I solved this with a generator that yields not only the data of the contained files but also their file paths. The solution still looks a bit "hacky" to me, but I didn't want to do more refactoring.

dnlbauer · 2025-01-15T12:55:34Z

How it works:

# test_zipstream.py
from rocrate.rocrate import ROCrate

crate = ROCrate(gen_preview=True)

# add a file
testfile = crate.add_file("file.txt", properties={
    "name": "my test file",
}, record_size=True)

# add a folder
crate.add_dataset(
    source="folder",
    dest_path="internal_folder",
)

# add a remote dataset
crate.add_dataset(
    source="https://raw.githubusercontent.com/ResearchObject/ro-crate-py/refs/heads/master/",
    dest_path="remote/folder/",
    validate_url=False,
    fetch_remote=True,
    properties={
        "hasPart": [{"@id": "README.md"}]
    }
)

# Write the zip stream to a file.
# Alternatively, the stream could be used as the response to a URL request,
# streaming the content with a very low memory footprint and no disk usage.
with open("out.zip", "wb") as out:
    for chunk in crate.stream_zip():
        out.write(chunk)

$> python test_zipstream.py
$> unzip -l out.zip
Archive:  out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       19  1980-01-01 00:00   file.txt
        5  1980-01-01 00:00   internal_folder/test.txt
    17395  1980-01-01 00:00   remote/folder/README.md
     6027  1980-01-01 00:00   ro-crate-preview.html
     1371  1980-01-01 00:00   ro-crate-metadata.json
---------                     -------
    24817                     5 files
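
The comment in the script mentions using the stream as the response to a URL request. A minimal sketch of that idea with a plain WSGI application follows. The names make_zip_app and chunk_source are hypothetical and not part of ro-crate-py; chunk_source is assumed to be any zero-argument callable returning an iterable of bytes, such as crate.stream_zip.

```python
def make_zip_app(chunk_source, filename="crate.zip"):
    """Build a WSGI app that streams the chunks produced by chunk_source().

    chunk_source is a hypothetical zero-argument callable that returns an
    iterable of bytes objects (for example, crate.stream_zip).
    """
    def app(environ, start_response):
        headers = [
            ("Content-Type", "application/zip"),
            ("Content-Disposition", f'attachment; filename="{filename}"'),
        ]
        start_response("200 OK", headers)
        # Returning the iterable lets the server send each chunk as it is
        # produced, without building the whole archive first.
        return chunk_source()

    return app
```

With the standard library, such an app could be served for local testing via wsgiref.simple_server.make_server("", 8000, make_zip_app(crate.stream_zip)).serve_forever().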

dnlbauer · 2025-01-15T14:55:18Z

Seems like I broke some test cases, but I think we should first discuss whether this approach is worth pursuing before I start cleaning things up.

simleo · 2025-01-17T11:00:14Z

I tried this:

from rocrate.rocrate import ROCrate

in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
out_crate_zip = "/tmp/ro-crate-galaxy-sortchangecase.zip"
crate = ROCrate(in_crate_dir)
with open(out_crate_zip, "wb") as f:
    for chunk in crate.stream_zip():
        f.write(chunk)

The output crate is missing test/test1/input.bed and test/test1/output_exp.bed. They are missing even if I call crate.write_zip(out_crate_zip) instead of looping over the new method. This is because the rewritten write_zip does not call write, which in turn calls a method that copies files not listed in the metadata.

Looking at the changes, I see that several existing methods have been rewritten, hence the failure of existing tests. Could you implement the new functionality without touching existing methods, or at least with only minimal changes to them? This way, the chance of breaking existing behavior would be much smaller.

dnlbauer · 2025-01-17T12:27:52Z

Yes, there are some implementation changes because I tried to replace the write code. It's probably easier to, as you suggested, take a step back and implement it without changing the write code. This will lead to a lot of code duplication though.

simleo · 2025-01-21T13:19:57Z

> It's probably easier to, as you suggested, take a step back and implement it without changing the write code. This will lead to a lot of code duplication though.

Let's avoid code duplication then. Moving forward, please follow these steps:

Change the code so that all current tests pass, i.e., without changing the test code
Add new tests to test the new functionalities, making sure they pass as well
Check that this PR does not cause significant performance degradation

dnlbauer · 2025-01-22T21:46:55Z

I spent some time today figuring out why so many tests were failing. There were quite a few edge cases I had missed, which required some refactoring to address. (Force-pushed, since I basically started from scratch.)

Updates:

Writing an RO-Crate as a folder when the files are already present elsewhere now uses shutil, just like before this PR. This ensures there is no noticeable performance degradation when writing RO-Crates from existing files.
Streaming is now only used when calling stream, or when writing files to disk that need to be streamed anyway (i.e. the input is an instance of IOBase or the data has to be fetched from a remote location).
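
The split described in these two points could be sketched roughly as follows. This is an illustration only, not the code in this PR: write_entity is a hypothetical helper, remote fetching is left out, and the stream is assumed to yield bytes.

```python
import io
import shutil
from pathlib import Path


def write_entity(source, dest, chunk_size=8192):
    """Write source to dest, streaming only when source is a file-like object.

    Hypothetical helper: source is either a path to an existing file or a
    binary stream (an io.IOBase instance).
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if isinstance(source, io.IOBase):
        # No file on disk to copy from, so consume the stream chunk by chunk.
        with open(dest, "wb") as out:
            while chunk := source.read(chunk_size):
                out.write(chunk)
    else:
        # The file already exists on disk: a plain copy avoids the overhead
        # of pushing the data through Python in small chunks.
        shutil.copy(source, dest)
    return dest
```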

Things to do / to discuss:

Decide whether write_zip should use streaming under the hood. While this is optional, it comes with several advantages:
- Writing a zip to disk would no longer duplicate the raw data, which benefits systems with limited storage or systems where /tmp is mounted in RAM.
- We can reuse the existing unit tests for zip to test the streaming implementation. Otherwise, they would have to be implemented again for streaming zips.
"Extra data" handling does not work for streaming yet: there is no equivalent of _copy_unlisted for streaming.
Unit tests

simleo · 2025-01-23T16:31:56Z

I tried this:

from rocrate.rocrate import ROCrate

in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
out_crate_zip = "/tmp/ro-crate-galaxy-sortchangecase.zip"
crate = ROCrate(in_crate_dir)
with open(out_crate_zip, "wb") as f:
    for chunk in crate.stream_zip():
        f.write(chunk)

I ran the above snippet again. There is something wrong with the zip that gets created:

$ unzip -d ro-crate-galaxy-sortchangecase{,.zip}
Archive:  ro-crate-galaxy-sortchangecase.zip
warning [ro-crate-galaxy-sortchangecase.zip]:  245282 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: ro-crate-galaxy-sortchangecase/sort-and-change-case.ga  
  inflating: ro-crate-galaxy-sortchangecase/LICENSE  
  inflating: ro-crate-galaxy-sortchangecase/README.md  
  inflating: ro-crate-galaxy-sortchangecase/test/test1/sort-and-change-case-test.yml  
  inflating: ro-crate-galaxy-sortchangecase/ro-crate-metadata.json

dnlbauer · 2025-01-24T10:16:33Z

> I ran the above snippet again. There is something wrong with the zip that gets created:

My bad for not testing properly between two commits ... It seems like ZipFile treats a BytesIO differently from a subclass of RawIOBase, or something like that. When using BytesIO, the initial bytes of the buffer are repeatedly written to the stream, even when I tried to flush/truncate 😕 . With a plain in-memory buffer it works, so I reverted to using one in 27457d8.
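
For readers unfamiliar with the technique, here is a minimal sketch of streaming a zip through a plain in-memory buffer. It is not the code from 27457d8: MemoryBuffer, drain and stream_zip_entries are illustrative names. The buffer is write-only and not seekable, so zipfile writes the archive sequentially and the generator can hand out whatever has accumulated after each member.

```python
import io
import zipfile


class MemoryBuffer(io.RawIOBase):
    """Write-only, non-seekable buffer that returns what was written since the last drain."""

    def __init__(self):
        super().__init__()
        self._data = bytearray()

    def writable(self):
        return True

    def write(self, data):
        self._data += data
        return len(data)

    def drain(self):
        # Hand back the pending bytes and forget them, so nothing is yielded twice.
        chunk = bytes(self._data)
        self._data.clear()
        return chunk


def stream_zip_entries(entries):
    """Yield the bytes of a zip archive built from (name, payload) pairs."""
    buffer = MemoryBuffer()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in entries:
            with archive.open(name, mode="w") as member:
                member.write(payload)
            yield buffer.drain()
    # Closing the archive writes the central directory; yield that last.
    yield buffer.drain()
```

Joining the yielded chunks gives an archive that zipfile.ZipFile can read back.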

simleo · 2025-01-24T15:38:47Z

rocrate/model/dataset.py

+                    source = root / name
+                    dest = source.relative_to(Path(self.source).parent)
+                    with open(source, 'rb') as f:
+                        for chunk in f:


This iterates over the file's "lines", i.e. chunks delimited by \n, even though the file is opened in binary mode. It should instead yield chunks of the same size (except for the last one), independent of \n characters and reasonably large (I think you used 8192 elsewhere).
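
A minimal sketch of the fixed-size reading suggested here, assuming a standalone helper (iter_file_chunks is a hypothetical name, not a function in the PR):

```python
def iter_file_chunks(path, chunk_size=8192):
    """Yield the contents of a file in fixed-size binary chunks.

    Every chunk has chunk_size bytes except possibly the last one,
    regardless of where newline characters occur in the data.
    """
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```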

simleo · 2025-01-24T15:47:44Z

The chunks yielded by stream_zip are almost all empty and highly irregular in size:

>>> from rocrate.rocrate import ROCrate
>>> in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
>>> crate = ROCrate(in_crate_dir)
>>> chunks = [_ for _ in crate.stream_zip()]
>>> len(chunks)
310
>>> from collections import Counter
>>> Counter(len(_) for _ in chunks).most_common()
[(0, 304), (53, 1), (916, 1), (3607, 1), (348, 1), (168, 1), (1108, 1)]
>>>

I was expecting a more regular stream, and not to see empty chunks.

dnlbauer · 2025-01-24T16:59:57Z

Good point. The chunk sizes are non-uniform not only because the input streams differ in size, but also because the size of the chunks going into the zip differs from the size of the chunks coming out (compression, zip file format overhead, ...). So even if the input streams yielded nicely sized chunks, the output sizes would still differ.

It shouldn't be too complicated to wait until the buffer has filled up to a given chunk size before yielding, though. It might actually be sensible to make the chunk size a parameter that gets passed through to the underlying read operations.
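
The buffering idea could look roughly like the following sketch, which re-groups an irregular stream into chunks of a fixed size. rechunk is a hypothetical helper, not the implementation that ended up in the PR.

```python
def rechunk(chunks, chunk_size=8192):
    """Re-group an iterable of bytes objects into chunks of chunk_size bytes.

    Empty input chunks are absorbed, and only the final output chunk may be
    shorter than chunk_size.
    """
    pending = bytearray()
    for chunk in chunks:
        pending += chunk
        # Emit as many full chunks as the accumulated data allows.
        while len(pending) >= chunk_size:
            yield bytes(pending[:chunk_size])
            del pending[:chunk_size]
    if pending:
        yield bytes(pending)
```

Wrapping the raw zip output in such a generator removes the empty chunks and makes the sizes predictable, at the cost of holding up to chunk_size bytes in memory.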

dnlbauer · 2025-01-26T13:04:56Z

bf1b990 added unlisted files from the crate to the stream.
5f0a709 switched write_zip to use the stream method, so all the unit tests that work on the zip now also test the streaming functionality.

simleo · 2025-01-27T15:13:12Z

rocrate/rocrate.py

+                    for name in files:
+                        source = root / name
+                        rel = source.relative_to(self.source)
+                        if not self.dereference(str(rel)) and not out_path.samefile(source):


This crashes if out_path is None (or anything that does not have a samefile attribute, actually). Moreover, an out_path argument does not make sense in stream_zip: there is no output path, you're just yielding chunks.

dnlbauer · 2025-01-27

If write_zip should use the streaming implementation, we need a way to pass the out_path to the streaming implementation, since write_zip is currently intended to support the edge case where the zip file is written into the RO-Crate folder. Otherwise, this functionality (tested with the test_no_zip_in_zip test) will break.

What could be done is:

Leave write_zip with the implementation that writes the crate to a temp dir and zips it using shell utils

Document that the out_path parameter is for internal use. (this is what I did)

Hide the parameter by having an external wrapper around the stream method.

I now switched to an external wrapper function approach.

I also added a test that checks that a valid zip is written when streaming without going through write_zip, so that a crash like the one you mentioned is covered.

simleo · 2025-01-29

stream_zip is still not yielding unlisted files. See 6b33a90.

simleo · 2025-01-29

Note that only test_stream fails. test_write_zip_copy_unlisted passes.

dnlbauer force-pushed the stream_crate branch from 604623a to 5e9a597 on January 15, 2025 13:54

dnlbauer added 8 commits January 22, 2025 17:35

add abstract method for streaming data entities

721e43c

implement streaming for file

939abf4

implement streaming for metadata

a23da9c

implement streaming for preview

f34dad8

implement streaming for dataset

f909fbb

fix: dataset should not stream if root entity

883c439

feat: add method to stream zip

77eee32

chore: remove memory buffer class in favor of BytesIO

08eeb02

dnlbauer force-pushed the stream_crate branch from 268fbb4 to 08eeb02 on January 22, 2025 21:11

simleo reviewed Jan 23, 2025

dnlbauer added 3 commits January 24, 2025 11:03

fix: files from datasets should also be streamed in chunks

c034c66

fix: zip stream repeats initial bytes

27457d8

fix: make sure zip stream buffer is always properly closed

0a1c4a1

simleo reviewed Jan 24, 2025

feat: zip stream now yields predictable chunk sizes

e229ccd

dnlbauer force-pushed the stream_crate branch from f47a584 to d4a66cb on January 24, 2025 17:42

chore: remove type hints for consistency

37f2430

dnlbauer force-pushed the stream_crate branch from d4a66cb to 37f2430 on January 24, 2025 17:44

feat: include unlisted files in zip stream

bf1b990

dnlbauer force-pushed the stream_crate branch from df27e0b to 5f0a709 on January 26, 2025 12:58

dnlbauer changed the title from "[DRAFT] Stream RO-Crate Zip" to "Stream RO-Crate Zip" on Jan 27, 2025

feat: write_zip now uses zip streaming

3634e4b

dnlbauer force-pushed the stream_crate branch from 5f0a709 to 3634e4b on January 27, 2025 11:12

feat: add streaming example

1a4253f

dnlbauer force-pushed the stream_crate branch from 3d77677 to 1a4253f on January 27, 2025 11:42

fix: flake8

2fc8fc8

simleo reviewed Jan 27, 2025

dnlbauer and others added 5 commits January 27, 2025 16:52

fix: NPE when no out_path is given for straming

781052f

feat: hide out_path parameter of streaming api in an internal wrapper

0d81d70

feat: add test for streaming without write_zip

c636220

test for unlisted file presence when writing zip

6b33a90

fix+refactor: streaming should not ignore unlisted files

5d49dab
