Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream RO-Crate Zip #212

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open

Conversation

dnlbauer
Copy link

@dnlbauer dnlbauer commented Jan 15, 2025

This is a Draft implementation showing how #205 could be solved.

Implementation is rather straightforward for file entities where I could simply add a method to stream the file contents.
For datasets entities, it was a bit challenging to find a good solution. Creating a zip stream requires that all the file handling need to happen at one place where the zip stream object is created (rocrate.stream_zip in this case). However, dataset entities currently handle the file writing operation internally and don't expose their containing files in an easy to use way. I solved this with a generator that not only yields the data of the contained files, but also yields file paths. However, the solution still looks a bit "hacky" to me. I didn't want to do more refactoring though.

@dnlbauer
Copy link
Author

How it works:

#test_zipstream.py
from rocrate.rocrate import ROCrate

crate = ROCrate(gen_preview=True)

# add a file
testfile = crate.add_file("file.txt", properties={
    "name": "my test file",
}, record_size=True)

# add a folder
crate.add_dataset(
            source="folder",
            dest_path="internal_folder",
    )

# add a remote dataset
crate.add_dataset(
    source="https://raw.githubusercontent.com/ResearchObject/ro-crate-py/refs/heads/master/",
    dest_path="remote/folder/",
    validate_url=False,
    fetch_remote=True,
    properties={
        "hasPart": [{"@id": "README.md"}]
    }
)

# write zip stream to file.
# Instead, it could also be used as response to an url request
# to stream the content with very low memory footprint and no disk usage
with open("out.zip", "wb") as out:
    for chunk in crate.stream_zip():
        out.write(chunk)
$> python test_zipstream.py
$> unzip -l out.zip
Archive:  out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       19  1980-01-01 00:00   file.txt
        5  1980-01-01 00:00   internal_folder/test.txt
    17395  1980-01-01 00:00   remote/folder/README.md
     6027  1980-01-01 00:00   ro-crate-preview.html
     1371  1980-01-01 00:00   ro-crate-metadata.json
---------                     -------
    24817                     5 files

@dnlbauer
Copy link
Author

Seems like I broke some test cases, but I think we can first discuss if this is an approach worth to continue pursuing before I start to clean things up.

@simleo
Copy link
Collaborator

simleo commented Jan 17, 2025

I tried this:

from rocrate.rocrate import ROCrate

in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
out_crate_zip = "/tmp/ro-crate-galaxy-sortchangecase.zip"
crate = ROCrate(in_crate_dir)
with open(out_crate_zip, "wb") as f:
    for chunk in crate.stream_zip():
        f.write(chunk)

And the output crate is missing test/test1/input.bed and test/test1/output_exp.bed. They are missing even if I do crate.write_zip(out_crate_zip) instead of the loop with the new method. This is due to the fact that the rewritten write_zip does not call write, which in turn calls a method to copy files that are unlisted in the metadata. Looking at the changes, I see that several existing methods have been rewritten, hence the failure of existing tests. Could you manage to implement the new functionality without touching existing methods, or at least with only minimal changes to them? This way, the chance of breaking existing stuff would be much smaller.

@dnlbauer
Copy link
Author

Yes there are some implementation because I tried to replace the write code. It's probably easier to, as you suggested, take a step back and implement it without changing the write code. This will lead to a lot of code duplication though.

@simleo
Copy link
Collaborator

simleo commented Jan 21, 2025

It's probably easier to, as you suggested, take a step back and implement it without changing the write code. This will lead to a lot of code duplication though.

Let's avoid code duplication then. Moving forward, please follow these steps:

  1. Change the code so that all current tests pass, i.e., without changing the test code
  2. Add new tests to test the new functionalities, making sure they pass as well
  3. Check that this PR does not cause significant performance degradation

@dnlbauer
Copy link
Author

dnlbauer commented Jan 22, 2025

I spent some time today to figure out why there are so many test failing. There were quite some edge-cases I missed which required some refactoring to address. (force-pushed since I basically started from scratch)

Updates:

  • Writing an ro-crate as folder when the files are already present elsewhere now uses shutil, just like before this PR. This ensures there are no noticeable performance degradations for writing RO-Crates from existing files.
  • Streaming is now only used when calling stream, or when writing files to disk which need to be streamed anyways (i.e. the input is an instance of IOBase or the data has to be fetched from remote)

Things to do / to discuss:

  • Decide whether write_zip should use streaming under the hood: While this is optional, it comes with several advantages:
    • writing a zip to disk would not duplicate raw data anymore, which benefits systems with limited storage or systems where /tmp is mounted as RAM.
    • We can reuse the already existing unit tests for zip to test the streaming implementation. Otherwise,they will have to be implemented for streaming zips in addtion.
  • "extra data" handling does not work for streaming -> There is no equivalent of _copy_unlisted for streaming yet.
  • Unit tests

rocrate/model/dataset.py Outdated Show resolved Hide resolved
@simleo
Copy link
Collaborator

simleo commented Jan 23, 2025

I tried this:

from rocrate.rocrate import ROCrate

in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
out_crate_zip = "/tmp/ro-crate-galaxy-sortchangecase.zip"
crate = ROCrate(in_crate_dir)
with open(out_crate_zip, "wb") as f:
    for chunk in crate.stream_zip():
        f.write(chunk)

I ran the above snippet again. There is something wrong with the zip that gets created:

$ unzip -d ro-crate-galaxy-sortchangecase{,.zip}
Archive:  ro-crate-galaxy-sortchangecase.zip
warning [ro-crate-galaxy-sortchangecase.zip]:  245282 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: ro-crate-galaxy-sortchangecase/sort-and-change-case.ga  
  inflating: ro-crate-galaxy-sortchangecase/LICENSE  
  inflating: ro-crate-galaxy-sortchangecase/README.md  
  inflating: ro-crate-galaxy-sortchangecase/test/test1/sort-and-change-case-test.yml  
  inflating: ro-crate-galaxy-sortchangecase/ro-crate-metadata.json  

@dnlbauer
Copy link
Author

dnlbauer commented Jan 24, 2025

I ran the above snippet again. There is something wrong with the zip that gets created:

My bad on not properly testing between two commits ... Seems like ZipFile treats the BytesIO differently than a subclass from RawIOBase or sth like that. When using BytesIO, the initial bytes of the buffer are repeatedly written to the stream, even when I tried to flush/truncate 😕 . With a plain in memory buffer, it works so I reverted to use one with 27457d8

source = root / name
dest = source.relative_to(Path(self.source).parent)
with open(source, 'rb') as f:
for chunk in f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This iterates over the file's "lines", even if it's opened in binary mode, i.e. with chunks delimited by \n. It should yield chunks of the same size (except for the last one) instead, independent from \n characters and reasonably large (I think you used 8192 elsewhere).

@simleo
Copy link
Collaborator

simleo commented Jan 24, 2025

The chunks yielded by stream_zip are almost all empty and highly irregular in size:

>>> from rocrate.rocrate import ROCrate
>>> in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
>>> crate = ROCrate(in_crate_dir)
>>> chunks = [_ for _ in crate.stream_zip()]
>>> len(chunks)
310
>>> from collections import Counter
>>> Counter(len(_) for _ in chunks).most_common()
[(0, 304), (53, 1), (916, 1), (3607, 1), (348, 1), (168, 1), (1108, 1)]
>>> 

I was expecting a more regular stream, and not to see empty chunks.

@dnlbauer
Copy link
Author

Good point. the reason for the non-uniform chunk sizes is not only because of the different input stream sizes, but also because the chunk size of chunks going into the zip is different then the output chunk sizes (compression, zip file format overhead, ..). So even if the input streams would yield nicely sized chunks, the sizes will be different from the output.

It shouldn't be too complicated to implement to wait for the buffer for a given chunk size to be filled before yielding, though. It might actually be sensible to make the chunk size a parameter that gets passed through to underlying read operations.

@dnlbauer
Copy link
Author

  • bf1b990 added unlisted files from the crate to the stream.
  • 5f0a709 switched write_zip to use the stream method. This makes all the unit tests that work on the zip to also test the streaming functionality.

@dnlbauer dnlbauer changed the title [DRAFT] Stream RO-Crate Zip Stream RO-Crate Zip Jan 27, 2025
for name in files:
source = root / name
rel = source.relative_to(self.source)
if not self.dereference(str(rel)) and not out_path.samefile(source):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This crashes if out_path is None (or anything that does not have a samefile attribute, actually). Moreover, an out_path argument does not make sense in stream_zip: there is no output path, you're just yielding chunks.

Copy link
Author

@dnlbauer dnlbauer Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If write_zip should use the streaming implementation, we need a way to pass the out_path to the streaming implementation since write_zip is currently intended to support the edge-case where the zip file is written into the ro-crate folder. Otherwise, this functionality (tested with the test_no_zip_in_zip test) will break.

What could be done is:

  • Leave write_zip with the implementation that writes the crate to a temp dir and zips it using shell utils
  • Document that the out_path parameter is for internal use. (this is what I did)
  • Hide the parameter by having an external wrapper around the stream method.

I now switched to an external wrapper function approach.

Also added a test that makes sure that a valid zip is written from calling stream without the write_zip method, to make sure a crash like the one you mentioned is tested against.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream_zip is still not yielding unlisted files. See 6b33a90.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that only test_stream fails. test_write_zip_copy_unlisted passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants