Stream RO-Crate Zip #212
base: master
Conversation
How it works:

```python
# test_zipstream.py
from rocrate.rocrate import ROCrate

crate = ROCrate(gen_preview=True)

# add a file
testfile = crate.add_file("file.txt", properties={
    "name": "my test file",
}, record_size=True)

# add a folder
crate.add_dataset(
    source="folder",
    dest_path="internal_folder",
)

# add a remote dataset
crate.add_dataset(
    source="https://raw.githubusercontent.com/ResearchObject/ro-crate-py/refs/heads/master/",
    dest_path="remote/folder/",
    validate_url=False,
    fetch_remote=True,
    properties={
        "hasPart": [{"@id": "README.md"}]
    }
)

# write the zip stream to a file.
# Instead, it could also be used as the response to a URL request,
# streaming the content with a very low memory footprint and no disk usage.
with open("out.zip", "wb") as out:
    for chunk in crate.stream_zip():
        out.write(chunk)
```

```
$> python test_zipstream.py
$> unzip -l out.zip
Archive:  out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       19  1980-01-01 00:00   file.txt
        5  1980-01-01 00:00   internal_folder/test.txt
    17395  1980-01-01 00:00   remote/folder/README.md
     6027  1980-01-01 00:00   ro-crate-preview.html
     1371  1980-01-01 00:00   ro-crate-metadata.json
---------                     -------
    24817                     5 files
```
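As a hedged sketch of the "response to a URL request" idea: a WSGI app may return any iterable of bytes, so a generator like `crate.stream_zip()` can be handed to the server directly. Here `make_zip_chunks` is a hypothetical stand-in for the real generator, and the app function is mine, not part of ro-crate-py:

```python
import io
import zipfile


def make_zip_chunks(chunk_size=8192):
    # Hypothetical stand-in for crate.stream_zip(): builds a tiny zip in
    # memory and yields it in fixed-size chunks.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("file.txt", "my test file")
    buf.seek(0)
    while True:
        chunk = buf.read(chunk_size)
        if not chunk:
            break
        yield chunk


def app(environ, start_response):
    # A WSGI server iterates over the returned generator and sends each
    # chunk as it is produced, keeping memory usage low.
    start_response("200 OK", [
        ("Content-Type", "application/zip"),
        ("Content-Disposition", 'attachment; filename="crate.zip"'),
    ])
    return make_zip_chunks()
```

With the real `stream_zip`, the in-memory buffer would disappear entirely and chunks would be produced on the fly.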
Force-pushed 604623a to 5e9a597.
Seems like I broke some test cases, but I think we should first discuss whether this approach is worth pursuing before I start to clean things up.
I tried this:

```python
from rocrate.rocrate import ROCrate

in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
out_crate_zip = "/tmp/ro-crate-galaxy-sortchangecase.zip"
crate = ROCrate(in_crate_dir)
with open(out_crate_zip, "wb") as f:
    for chunk in crate.stream_zip():
        f.write(chunk)
```

And the output crate is missing
Yes, there are some implementation issues, because I tried to replace the write code. It's probably easier, as you suggested, to take a step back and implement it without changing the write code. That will lead to a lot of code duplication, though.
Let's avoid code duplication then. Moving forward, please follow these steps:
Force-pushed 268fbb4 to 08eeb02.
I spent some time today figuring out why so many tests were failing. There were quite a few edge cases I had missed, which required some refactoring to address. (Force-pushed, since I basically started from scratch.) Updates:
Things to do / to discuss:
I ran the above snippet again. There is something wrong with the zip that gets created:

```
$ unzip -d ro-crate-galaxy-sortchangecase{,.zip}
Archive:  ro-crate-galaxy-sortchangecase.zip
warning [ro-crate-galaxy-sortchangecase.zip]:  245282 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: ro-crate-galaxy-sortchangecase/sort-and-change-case.ga
  inflating: ro-crate-galaxy-sortchangecase/LICENSE
  inflating: ro-crate-galaxy-sortchangecase/README.md
  inflating: ro-crate-galaxy-sortchangecase/test/test1/sort-and-change-case-test.yml
  inflating: ro-crate-galaxy-sortchangecase/ro-crate-metadata.json
```
My bad for not properly testing between the two commits... It seems ZipFile treats a BytesIO differently from a subclass of RawIOBase or something like that. When using BytesIO, the initial bytes of the buffer are repeatedly written to the stream, even when I tried to
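The repeated-initial-bytes symptom fits how `ZipFile` handles seekable vs. unseekable outputs: with a seekable `BytesIO`, naively re-reading the buffer from position 0 after each write re-emits earlier bytes, while an unseekable sink forces `ZipFile` into its streaming (data-descriptor) mode and never seeks back. A common workaround, sketched here under assumed names (this is not the PR's actual code), is a `RawIOBase` subclass that hands each written chunk out exactly once:

```python
import io
import zipfile


class StreamBuffer(io.RawIOBase):
    """Unseekable write sink: ZipFile cannot seek back, so every byte it
    writes passes through here exactly once and can be drained safely."""

    def __init__(self):
        self._chunks = []

    def writable(self):
        return True

    def write(self, b):
        self._chunks.append(bytes(b))
        return len(b)

    def drain(self):
        # Hand out everything written since the last drain.
        out = b"".join(self._chunks)
        self._chunks.clear()
        return out


def stream_zip(members):
    # members: iterable of (archive name, bytes) pairs (an assumption for
    # this sketch; the real code streams file contents instead).
    buf = StreamBuffer()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members:
            zf.writestr(name, data)
            chunk = buf.drain()
            if chunk:
                yield chunk
    # Closing the ZipFile writes the central directory.
    tail = buf.drain()
    if tail:
        yield tail
```

Since `StreamBuffer` reports itself as unseekable, `ZipFile` wraps it internally and uses data descriptors, which is exactly the behavior needed for one-pass streaming.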
rocrate/model/dataset.py (outdated)

```python
source = root / name
dest = source.relative_to(Path(self.source).parent)
with open(source, 'rb') as f:
    for chunk in f:
```
This iterates over the file's "lines", even though it's opened in binary mode, i.e. with chunks delimited by `\n`. It should instead yield chunks of the same size (except for the last one), independent of `\n` characters and reasonably large (I think you used 8192 elsewhere).
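The suggested fix might look like this sketch (the helper name is mine, not the PR's; 8192 mirrors the size mentioned above):

```python
def read_chunks(path, chunk_size=8192):
    # Yield fixed-size binary chunks; only the final chunk may be shorter.
    # Unlike `for chunk in f`, this is independent of b"\n" bytes in the file.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```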
The chunks yielded by `crate.stream_zip()`:

```
>>> from rocrate.rocrate import ROCrate
>>> in_crate_dir = "test/test-data/ro-crate-galaxy-sortchangecase"
>>> crate = ROCrate(in_crate_dir)
>>> chunks = [_ for _ in crate.stream_zip()]
>>> len(chunks)
310
>>> from collections import Counter
>>> Counter(len(_) for _ in chunks).most_common()
[(0, 304), (53, 1), (916, 1), (3607, 1), (348, 1), (168, 1), (1108, 1)]
```

I was expecting a more regular stream, and not to see empty chunks.
Good point. The non-uniform chunk sizes come not only from the different input stream sizes, but also from the fact that the size of chunks going into the zip differs from the output chunk sizes (compression, zip file format overhead, ...). So even if the input streams yielded nicely sized chunks, the output sizes would still differ. It shouldn't be too complicated to wait until the buffer has filled up to a given chunk size before yielding, though. It might actually be sensible to make the chunk size a parameter that gets passed through to the underlying read operations.
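The buffering idea can be sketched as a re-chunking generator (the name `rechunk` and the 8192 default are my assumptions, not the PR's code):

```python
def rechunk(chunks, chunk_size=8192):
    # Re-slice an irregular stream of bytes objects into uniform chunks.
    # Empty input chunks disappear; only the final chunk may be shorter.
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= chunk_size:
            yield bytes(buf[:chunk_size])
            del buf[:chunk_size]
    if buf:
        yield bytes(buf)
```

Wrapping the zip generator, e.g. `rechunk(crate.stream_zip())`, would then smooth out both the empty chunks and the size variation caused by compression.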
Force-pushed f47a584 to d4a66cb.
Force-pushed d4a66cb to 37f2430.
Force-pushed df27e0b to 5f0a709.
Force-pushed 5f0a709 to 3634e4b.
Force-pushed 3d77677 to 1a4253f.
rocrate/rocrate.py (outdated)

```python
for name in files:
    source = root / name
    rel = source.relative_to(self.source)
    if not self.dereference(str(rel)) and not out_path.samefile(source):
```
This crashes if `out_path` is `None` (or anything that does not have a `samefile` attribute, actually). Moreover, an `out_path` argument does not make sense in `stream_zip`: there is no output path, you're just yielding chunks.
If `write_zip` should use the streaming implementation, we need a way to pass the out_path to it, since `write_zip` is currently intended to support the edge case where the zip file is written into the ro-crate folder itself. Otherwise this functionality (covered by the `test_no_zip_in_zip` test) will break.

What could be done is:

- Leave `write_zip` with the implementation that writes the crate to a temp dir and zips it using shell utils.
- Document that the `out_path` parameter is for internal use (this is what I did).
- Hide the parameter behind an external wrapper around the stream method.

I have now switched to the external wrapper function approach. I also added a test that writes a valid zip by calling stream without the `write_zip` method, to make sure a crash like the one you mentioned is tested against.
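The wrapper approach can be sketched as follows; `CrateSketch` and its method names are illustrative stand-ins, not ro-crate-py's actual API. The internal generator takes `out_path` so it can skip a zip being written into the crate itself, while the public `stream_zip` hides that parameter:

```python
import io
import zipfile
from pathlib import Path


class CrateSketch:
    """Toy stand-in for ROCrate; all names here are assumptions."""

    def __init__(self, files):
        self._files = files  # {archive name: bytes}

    def _stream_zip(self, out_path=None):
        # Internal generator: out_path lets write_zip exclude its own
        # output file when the zip lands inside the crate directory
        # (the test_no_zip_in_zip edge case).
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            for name, data in self._files.items():
                if out_path is not None and Path(name) == Path(out_path):
                    continue
                zf.writestr(name, data)
        yield buf.getvalue()

    def stream_zip(self):
        # Public wrapper: hides the internal-only out_path parameter.
        yield from self._stream_zip()

    def write_zip(self, out_path):
        # write_zip reuses the streaming implementation instead of
        # duplicating the zip-building code.
        with open(out_path, "wb") as f:
            for chunk in self._stream_zip(out_path=out_path):
                f.write(chunk)
```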
`stream_zip` is still not yielding unlisted files. See 6b33a90.
Note that only `test_stream` fails. `test_write_zip_copy_unlisted` passes.
This is a draft implementation showing how #205 could be solved.
The implementation is rather straightforward for file entities, where I could simply add a method to stream the file contents.
For dataset entities, it was more challenging to find a good solution. Creating a zip stream requires all file handling to happen in one place, where the zip stream object is created (`rocrate.stream_zip` in this case). However, dataset entities currently handle the file-writing operation internally and don't expose their contained files in an easy-to-use way. I solved this with a generator that yields not only the data of the contained files but also their file paths. The solution still looks a bit "hacky" to me, though; I didn't want to do more refactoring.
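The generator idea for dataset entities might look like this sketch (the function name and chunk size are mine, not the PR's): each contained file is exposed as `(archive_path, chunk)` pairs, so the single place that owns the zip stream object can do all the writing:

```python
import os


def iter_dataset_chunks(source_dir, dest_path, chunk_size=8192):
    # Walk the dataset folder and yield (archive path, data chunk) pairs,
    # leaving ownership of the ZipFile to the caller (e.g. stream_zip).
    for root, _, files in os.walk(source_dir):
        for name in sorted(files):
            src = os.path.join(root, name)
            rel = os.path.relpath(src, source_dir)
            arcname = os.path.join(dest_path, rel)
            with open(src, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    yield arcname, chunk
```

Yielding the path alongside the data is what lets the consumer open a new zip member whenever the path changes, without the dataset entity ever touching the zip object itself.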