Overflow Error with Pickle Protocol 3 #59
Comments
Thanks for sharing this @TylerSpears. We originally picked pickle protocol 3 for backwards compatibility, but as you point out, it has this limitation on data size, and changing the pickle protocol is a breaking change since it changes many hashes for values. We have gathered several design improvements that break hashes and are considering whether some future redun version should adopt them all in one go, but we haven't settled on a plan just yet. Thanks for sharing some suggestions.
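To make the hash-breaking point concrete, here is a small illustration (plain stdlib, not redun code; it only assumes cached values are keyed on a digest of their pickled bytes): the same value serializes to different bytes under different protocols, so every such hash changes.

```python
import hashlib
import pickle

value = {"x": [1, 2, 3]}

h3 = hashlib.sha1(pickle.dumps(value, protocol=3)).hexdigest()
h4 = hashlib.sha1(pickle.dumps(value, protocol=4)).hexdigest()

# Same value, different protocol header/framing, different bytes -> different
# digest, which would invalidate every cached result keyed on the old hash.
print(h3 != h4)  # True
```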
My understanding of cloudpickle is that it is appropriate for over-the-wire serialization, but is problematic for persistence (to a db) because it requires the same python interpreter version for serialization and deserialization.
This is a good idea that we have been considering for other benefits as well. Basically, we have thought through a ValueRef-like object that references the large value. The large value is fetched lazily as needed, and expressions containing the ValueRef can then remain small.
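A rough sketch of that kind of ValueRef-style indirection, assuming the large payload is staged to external storage and only a small reference object flows through expressions (the class name, storage layout, and methods here are hypothetical, not redun API):

```python
import os
import pickle
import tempfile
import uuid


class ValueRef:
    """Hypothetical lazy reference: stores a large value externally and keeps
    only a small path inside expressions, fetching the payload on demand."""

    def __init__(self, path: str):
        self.path = path

    @classmethod
    def put(cls, value, root: str = tempfile.gettempdir()) -> "ValueRef":
        # Stage the large value outside the call graph; only `path` travels around.
        path = os.path.join(root, f"valueref-{uuid.uuid4().hex}.pkl")
        with open(path, "wb") as f:
            pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
        return cls(path)

    def get(self):
        # Fetch the large value lazily, only where it is actually needed.
        with open(self.path, "rb") as f:
            return pickle.load(f)


ref = ValueRef.put(list(range(10)))   # stand-in for a huge array
print(ref.get()[:3])                  # [0, 1, 2]
```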
Subclassing ProxyValue is the official way to customize serialization, hashing, etc. We have one builtin class, …
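Independent of whichever builtin class is meant, a minimal generic sketch of the hooks involved: because serialization and hashing here go through pickle, the standard `__getstate__`/`__setstate__` protocol already controls exactly what bytes get pickled. The class and fields below are hypothetical, not redun API:

```python
import pickle


class ArrayHandle:
    """Hypothetical wrapper whose pickled form excludes the large payload."""

    def __init__(self, path: str, array=None):
        self.path = path      # small, stable reference to where the data lives
        self.array = array    # potentially huge payload, not meant to be pickled

    def __getstate__(self):
        # Only the path ends up in the pickle stream, so the serialized bytes
        # (and any hash computed from them) stay small and stable.
        return {"path": self.path}

    def __setstate__(self, state):
        self.path = state["path"]
        self.array = None     # reloaded on demand by the consumer


handle = ArrayHandle("/data/big_array.npy", array=object())
print(len(pickle.dumps(handle, protocol=3)))  # a few dozen bytes, never 4 GiB
```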
Just chiming in to say my team would also benefit from a ValueRef type of object, both for large objects and for objects with class dependencies not installed in the environment where the main redun scheduler is running.
I am building a data pipeline with redun where large `numpy.ndarray`s are passed through different task functions. However, I can very easily hit this overflow exception:

`OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher`
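For anyone wanting to reproduce it, a minimal sketch (it allocates ~5 GiB of zeros, so it assumes enough free memory):

```python
import pickle
import numpy as np

# ~5 GiB of zeros; anything whose pickled buffer exceeds 4 GiB hits the limit.
big = np.zeros(5 * 1024**3, dtype=np.uint8)

try:
    pickle.dumps(big, protocol=3)   # the protocol redun currently pins
except OverflowError as exc:
    print(exc)
    # -> serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

# Protocol 4 (PEP 3154) added 64-bit sizes, so the same call succeeds with protocol=4.
```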
The cause of this issue seems straightforward: redun limits the pickle protocol to version 3 (redun/redun/utils.py, line 40 in 0cd06c8), which cannot serialize objects beyond 4 GiB in size. Whenever redun tries to serialize a decently sized numpy array, neither the `pickle_dumps` nor the `pickle_dump` function can handle the object. This serialization also seems to be used in value hashing, so any use of a large ndarray breaks redun even without caching the arrays. I can provide code to re-create this, but it's pretty easy to do.

Are there any plans to fix this constraint? Passing large arrays/tensors beyond 4 GiB in size is very common, and this seems like a blind spot in redun that would be hit frequently. It's certainly the largest blocker to us being able to use redun, which we would love to do given how elegant and straightforward the library is so far (and the lovely `script` tasks!).

Some ideas for solutions:
I'm not really an expert in this, so perhaps these ideas are not workable in redun.
Another easy(er) workaround would be to use file-based objects, but this would lead to tons of file I/O, and all arrays would have to be stored on the hard drive, even ones that don't need to be cached. Not to mention that the constant "read file -> perform operation -> save file" loop is very clunky and makes code unreadable and inflexible. I've also tried making a custom `ProxyValue` class that did custom serialization, but I just kept running in circles between the type registry, the argument preprocessing, and the differences between serialization, deserialization, `__setstate__`, `__getstate__`, etc.
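For concreteness, the file-based workaround in question looks roughly like the sketch below, using redun's documented `@task`/`File` API; the paths, task names, and array sizes are made up:

```python
import numpy as np
from redun import File, task

redun_namespace = "example"


@task()
def make_array(out_path: str = "/tmp/big_array.npy") -> File:
    arr = np.zeros((1000, 1000))   # stand-in for a multi-GiB array
    np.save(out_path, arr)         # heavy data goes to disk...
    return File(out_path)          # ...and only a small File value is pickled/hashed


@task()
def array_sum(arr_file: File) -> float:
    arr = np.load(arr_file.path)   # reload inside the downstream task
    return float(arr.sum())


@task()
def main() -> float:
    return array_sum(make_array())

# Run with: redun run workflow.py main
```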
I don't think my software version matters, but just to be thorough: