Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs updates - multithreading/process recommendations + strikethrough of disk serialization #307

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

codeananda
Copy link
Contributor

Given the multithreading/processing issues with rtree and libspatialindex, I thought it made sense to add this info to the README (and also the online docs).

Related discussions that inspired my updates:

@hobu
Copy link
Member

hobu commented Feb 13, 2024

I don't understand why we're adding docs to say "Disk serialization is broken" with a pointer to a PR that updates something that shouldn't break it instead of just fixing the thing.

@codeananda
Copy link
Contributor Author

Is your issue that I've linked to the wrong PR/issue? Or more that we're updating the docs instead of fixing the problem? I can help with the former but not sure I know how to fix the latter.

@hobu
Copy link
Member

hobu commented Feb 14, 2024

Or more that we're updating the docs instead of fixing the problem?

Correct. Is there a ticket that clearly demonstrates the issue? I don't think we can categorically say disk serialization doesn't work. Maybe it doesn't work for you and your situation, but we would be flooded with tickets if it was broken more generally.

@adamjstewart
Copy link
Collaborator

I know pickling doesn't work (which means multiprocessing doesn't work on macOS/Windows): #87

@hobu
Copy link
Member

hobu commented Feb 14, 2024

I know pickling doesn't work

pickling isn't the same as "supports disk serialization". It's always been possible to use Rtree to store whatever data you wanted to with an index entry. That it's not conveniently pickled python objects is orthogonal.

@codeananda
Copy link
Contributor Author

pickling isn't the same as "supports disk serialization"

I didn't know this.

The docs give the impression that pickling is synonymous with disk serialisation. Indeed, so does the code.

  1. The Serializing your index to a file section doesn't actually tell you how to serialize to a file, but rather read an already created file from disk.
  2. The only section in performance relating to disk serialization recommends using cpickle.
  3. The Index class has dumps and loads methods - both one-liners that use pickle.loads/dumps
  4. Index._serialize calls self.dumps on the first line.

I genuinely thought the only way to serialize rtree objects to disk was pickle.

Or am I not understanding the meaning of "disk serialization"?

At the very least, we need to add a snippet under Serializing your index to a file that shows how you write to disk.

@codeananda
Copy link
Contributor Author

@hobu @mwtoews any update?

@FreddieWitherden
Copy link
Contributor

At the very least, we need to add a snippet under Serializing your index to a file that shows how you write to disk.

I believe you can pass None for the data and still have an R-tree stored on disk. This is probably what most people think of in terms of disk serialization. Of course, here, you are on your own in terms of associating Python objects with each entry (just going by their integer id's). This, I believe, is a better option and likely more efficient (and closer to what databases do: indices store pointers to the data rather than the data itself). Keeps your index small and enables you to change your data more freely and have multiple pointers to the same data block.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants