WIP geodetic distance search #1086

Draft: leoger wants to merge 4 commits into base: main


@leoger leoger commented Dec 14, 2024

Why

With some additional work, this branch is intended to fix issue #1079. In its current state, one issue remains unresolved: both NearFilter and WithinFilter check only the minimum bounding box and never test the actual geometry.

My intended next step is to discuss the best path forward. I have left detailed notes in the form of a comment in the code with my understanding of options for expanding the fix.

What

  • Added NearFilter unit test with real lat-long data. The current implementation of NearFilter doesn't return the expected results.
  • Added a dependency on net.sf.geographiclib : GeographicLib-Java : 2.0 in nitrite-spatial/pom.xml
  • Added an alternative "geometry factory" that creates a geodetic "small circle" instead of a cartesian circle as the existing JTS geometry factory does.
    • I'm happy to refactor this in any number of ways. This current implementation is just what I felt was the most straightforward replacement for existing functionality, serving as a proof of concept without forcing a ripple of changed type signatures.
  • Added another unit test to check the behavior of WithinFilter once I stepped through in the debugger and saw that SpatialIndex.java is mistakenly treating the filter work as done after the initial RTree check. Sure enough, the same problem occurs for WithinFilter. I added an ASCII diagram in a comment within the new unit test, testWithinTriangleNotJustTestingBoundingBox.
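To make the bounding-box problem concrete, here is a minimal sketch in plain Java. It does not use the Nitrite or JTS APIs; the class and method names are hypothetical and the triangle coordinates are made up for illustration. It shows a point that passes a bounding-box check but fails the exact point-in-triangle test, which is the kind of false positive the new WithinFilter test is meant to catch.

```java
// Illustrative only: plain-Java stand-ins, not the Nitrite or JTS API.
final class BoundingBoxVsTriangle {

    /** True if (px, py) lies inside the axis-aligned bounding box of the triangle. */
    static boolean inBoundingBox(double[][] tri, double px, double py) {
        double minX = Math.min(tri[0][0], Math.min(tri[1][0], tri[2][0]));
        double maxX = Math.max(tri[0][0], Math.max(tri[1][0], tri[2][0]));
        double minY = Math.min(tri[0][1], Math.min(tri[1][1], tri[2][1]));
        double maxY = Math.max(tri[0][1], Math.max(tri[1][1], tri[2][1]));
        return px >= minX && px <= maxX && py >= minY && py <= maxY;
    }

    /** Z component of the cross product (b - a) x (p - a). */
    static double cross(double[] a, double[] b, double px, double py) {
        return (b[0] - a[0]) * (py - a[1]) - (b[1] - a[1]) * (px - a[0]);
    }

    /** Exact test: the point is inside (or on the edge) if it is on the same side of all three edges. */
    static boolean inTriangle(double[][] tri, double px, double py) {
        double d1 = cross(tri[0], tri[1], px, py);
        double d2 = cross(tri[1], tri[2], px, py);
        double d3 = cross(tri[2], tri[0], px, py);
        boolean hasNegative = d1 < 0 || d2 < 0 || d3 < 0;
        boolean hasPositive = d1 > 0 || d2 > 0 || d3 > 0;
        return !(hasNegative && hasPositive);
    }

    public static void main(String[] args) {
        double[][] triangle = {{0, 0}, {4, 0}, {0, 4}};
        // (3, 3) sits in the empty corner of the bounding box, outside the triangle.
        System.out.println("in bounding box: " + inBoundingBox(triangle, 3, 3));
        System.out.println("in triangle:     " + inTriangle(triangle, 3, 3));
    }
}
```

A filter that stops after the first check would wrongly return the document at (3, 3).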


See this shared custom map on Google Maps for an illustration of the Oslo test case: https://www.google.com/maps/d/viewer?mid=1WXlEa5nBOSvBej3HSUNhsh_LLahoab8
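As a rough illustration of why a Cartesian circle in degree space gives the wrong results at Oslo's latitude, here is a small sketch. It assumes a spherical Earth and uses the haversine formula from the standard library only; it is not the GeographicLib-Java computation this branch depends on, and the coordinates are an approximate location for Oslo rather than the test data.

```java
// Spherical approximation for illustration; the branch itself uses GeographicLib-Java.
final class GreatCircleSketch {

    /** Mean Earth radius in meters (spherical model, an assumption of this sketch). */
    static final double EARTH_RADIUS_M = 6_371_008.8;

    /** Great-circle distance in meters between two lat/long points given in degrees. */
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                + Math.cos(phi1) * Math.cos(phi2)
                * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        double lat = 59.91; // approximate latitude of Oslo, illustrative only
        double lon = 10.75; // approximate longitude of Oslo, illustrative only
        double north = haversineMeters(lat, lon, lat + 1, lon);
        double east = haversineMeters(lat, lon, lat, lon + 1);
        System.out.printf("1 degree north: %.0f m, 1 degree east: %.0f m%n", north, east);
    }
}
```

Near 60 degrees north, one degree of longitude covers roughly half the ground distance of one degree of latitude, so a circle with a radius expressed in degrees is stretched on the ground.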

coderabbitai bot commented Dec 14, 2024

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@@ -0,0 +1,45 @@
package org.locationtech.jts.util;

Let's use the nitrite package name here.


Ah, I thought this needed to be in the same package as the existing GeometricShapeFactory in order to access the protected fields, but I see now that I was misremembering my Java inheritance visibility rules! 😄 I did a quick refactor and it works just fine. Pushing an update...

@anidotnet

Your changes look great to me.

@leoger

leoger commented Dec 14, 2024

Thanks for the quick feedback Anindya!

Since these changes don't actually work yet, due to the bounding box problem, please advise how you'd recommend proceeding. There seem to be some simple options, e.g. we can use the set of NitriteIds that come back from SpatialIndex::findNitriteIds to look up the exact geometry and do the exact test as a second pass...

My concern here is that I don't want to start pulling threads and turning this into a larger change because I haven't taken the time to understand the philosophy of how you split the work of the query up across these different layers. (Hence why this PR is a Draft.)

@anidotnet

anidotnet commented Dec 14, 2024

After a quick review, I see one way we can move forward:

  1. Change the NitriteRTree interface and add Geometry as an argument to the add, remove and find* methods.
  2. Store the Geometry along with the NitriteId as the value in the RTree.
  3. Filter by BoundingBox as happens today. We will now get the Geometry along with the NitriteIds. Here we can do finer filtering and return the exact NitriteIds corresponding to the right Geometry values.

If you find a better way, don't hesitate to share it.
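A minimal sketch of the three steps above, under simplifying assumptions: the stored geometry is a 2D point, the query is a Cartesian circle, and a linear scan stands in for the R-tree. The class GeometryInIndexSketch and its methods are hypothetical and are not the NitriteRTree interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index whose entries carry the geometry next to the id.
final class GeometryInIndexSketch {

    /** Index entry: the document id plus the indexed point geometry. */
    static final class Entry {
        final long id;
        final double x;
        final double y;

        Entry(long id, double x, double y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Steps 1 and 2: add takes the geometry and stores it with the id. */
    void add(long id, double x, double y) {
        entries.add(new Entry(id, x, y));
    }

    /** Step 3: coarse bounding-box filter first, then the exact test on the stored geometry. */
    List<Long> findNear(double cx, double cy, double radius) {
        List<Long> ids = new ArrayList<>();
        for (Entry e : entries) {
            boolean inBox = Math.abs(e.x - cx) <= radius && Math.abs(e.y - cy) <= radius;
            if (!inBox) {
                continue; // rejected by the coarse check, as today
            }
            if (Math.hypot(e.x - cx, e.y - cy) <= radius) {
                ids.add(e.id); // passed the finer check on the actual geometry
            }
        }
        return ids;
    }
}
```

With points (0.5, 0.5) and (0.9, 0.9) and a unit circle at the origin, both pass the box check but only the first passes the exact check.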

@leoger

leoger commented Dec 14, 2024

Thanks, I'll give it a try and see how the code shapes up!

@leoger

leoger commented Dec 14, 2024

Also, how much weight should we be putting on avoiding breaking interface changes? Theoretically, someone out there may have taken a dependency on Nitrite 4.x and created their own implementation class for the NitriteRTree interface.

We could just mark the existing methods as @Deprecated and add additional methods. That way bumping the Nitrite dependency version from 4.3.0 to 4.4.0 would produce warnings rather than compile errors. :-\

@anidotnet

Of course we should keep them backward compatible for the next minor version upgrade and create overloaded methods that take Geometry as an argument. And please put a proper warning message in the @Deprecated annotations.
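One way this could look, as a sketch only: the interface and method signatures below are hypothetical and do not reproduce the real NitriteRTree. Note that Java's @Deprecated annotation has no message element, so the warning text goes in the Javadoc @deprecated tag next to it. Making the new overload a default method means existing third-party implementations keep compiling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for illustration; not the actual NitriteRTree signatures.
interface SketchRTree {

    /**
     * Adds an entry using only its bounding box.
     *
     * @deprecated Use {@link #add(double[], long, Object)} so the index can
     *             test the exact geometry and not only the bounding box.
     */
    @Deprecated
    void add(double[] boundingBox, long id);

    /**
     * New overload that also receives the geometry. The default body falls back
     * to the old method, so implementations written against the old interface
     * still compile and behave as before.
     */
    default void add(double[] boundingBox, long id, Object geometry) {
        add(boundingBox, id);
    }
}

// An implementation written before the overload existed; it needs no changes.
final class LegacyTreeSketch implements SketchRTree {
    final List<Long> ids = new ArrayList<>();

    @Override
    public void add(double[] boundingBox, long id) {
        ids.add(id);
    }
}
```

Callers of the old method get a deprecation warning at compile time, not an error.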

@leoger

leoger commented Dec 15, 2024

As I started looking at the four implementations of NitriteRTree, and in particular the NitriteMVRTreeMap for the MVStore adapter and the RocksDBRTree for the RocksDB adapter, I realized that if we added Geometry to the signature of add and actually added it to the data within the RTreeMap, there are multiple negatives, which I'll enumerate momentarily. (I haven't tested any of these yet, I'm just going on experience and intuition, so let me know if you see any of it differently.)

Considerations

Concrete technical considerations

If we add the Geometry data itself into the RTreeMap, then the actual bytes representing the (serialized) geometry will be written into either the RocksDBMap field backingMap or the MVRTreeMap field mvMap. Depending on the geometry, this could be a significant increase in either the in-memory or on-disk size of the DB.

More importantly, it is my understanding that this would be redundant data. The geometry data could be both in the main collection/repository map as well as in the RTree map. (e.g. I see that both the openMap(...) and the openRTree(...) method in RocksDBStore.java contain calls to new RocksDBMap<>(...).) I expect this means that if I spent enough time looking through a hex-dump of a RocksDB Nitrite DB file, I would be able to spot two copies of each Point/Coordinate that makes up each geometry object. Beyond the simple desirability of using less RAM or storage on behalf of Nitrite users, I would expect any such increase to come with its own CPU performance penalty in the form of an increased rate of L1/L2 cache misses.

Intuitive considerations based on application of general software-design principles

I tried to peel the next layer of the onion to figure out how to judge whether this is an appropriate trade-off in context. This led me to (a) look at the implementation of SingleFieldIndex and CompoundIndex; and (b) think about what I know of how RDBMSs typically implement indexes. In both cases, I believe the pattern is that the additional data taken up by the index itself consists of only the covered fields and some primary key or equivalent. Furthermore, thinking about the unique design of the R-Tree data structure and associated algorithms, it feels like we'd actually be working "against the grain". I may not have deep experience specifically with designing DB indices and being bitten by their trade-offs firsthand, but this feels very much like a situation where we need a stronger motivation to "swim upstream" than we currently have.

Admittedly, on the one hand, for every other type of Index we have, the full data of the field being indexed becomes the content of the index alongside the ids. As such, there's an expectation that those indices would enable a wide range of filter operations because those operations would have access to the "complete" data of the indexed field. On the other hand, it seems that a spatial index is necessarily different by its nature. The R-Tree is built on the idea that using only the bounding boxes comes with significant benefits that outweigh its significant limitations. If we extend it, it's not really just an R-Tree anymore.

Conclusion

On balance, I think this adds up to a strong reason to prefer the approach where we treat each WithinFilter( ..., geometry) as an and(within(boundingBox(geometry)), within(geometry)), where the first will always be powered by the index and the second will go through the non-indexed apply.
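To illustrate the shape of that composition, here is a sketch using java.util.function.Predicate in place of Nitrite filters. The names withinBox, withinCircle and within are hypothetical, and the geometry is reduced to a Cartesian circle tested against 2D points.

```java
import java.util.function.Predicate;

// Hypothetical stand-ins for the two filter stages; not the Nitrite Filter API.
final class TwoStageWithinSketch {

    /** Coarse stage: the kind of check an R-tree index can answer from bounding boxes alone. */
    static Predicate<double[]> withinBox(double minX, double minY, double maxX, double maxY) {
        return p -> p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY;
    }

    /** Exact stage: tests the actual geometry, here a circle. */
    static Predicate<double[]> withinCircle(double cx, double cy, double radius) {
        return p -> Math.hypot(p[0] - cx, p[1] - cy) <= radius;
    }

    /** and(within(boundingBox(geometry)), within(geometry)) for a circular geometry. */
    static Predicate<double[]> within(double cx, double cy, double radius) {
        Predicate<double[]> coarse =
                withinBox(cx - radius, cy - radius, cx + radius, cy + radius);
        // Predicate.and short-circuits, so the exact test runs only for box hits.
        return coarse.and(withinCircle(cx, cy, radius));
    }

    public static void main(String[] args) {
        Predicate<double[]> filter = within(0, 0, 1);
        System.out.println(filter.test(new double[] {0.5, 0.5}));
        System.out.println(filter.test(new double[] {0.9, 0.9}));
    }
}
```

The point (0.9, 0.9) passes the coarse stage but is rejected by the exact stage.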

Sorry for the novella-length analysis. 😄 I look forward to hearing what you think.

@anidotnet

Thanks for the detailed analysis. When filtering via bounding box (using the current algorithm), the resulting set of NitriteIds will always be a superset of the correct NitriteIds. So our job is to further narrow down the set of NitriteIds we get from the index search: pull the Geometry fields corresponding to those ids, perform the exact calculations on those values in a final pass, and return the NitriteIds from there. With this approach you only need to work on SpatialIndex.findNitriteIds, without modifying the NitriteRTree interface.
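A sketch of that final pass, with stand-in types: the Map plays the role of the collection lookup by id, plain long values stand in for NitriteIds, and the refine method is hypothetical and not the actual SpatialIndex.findNitriteIds.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical refinement step; the real code would read Geometry fields from the collection.
final class RefinePassSketch {

    /**
     * Narrows the superset of ids returned by the bounding-box search down to
     * the ids whose actual geometry satisfies the exact test.
     */
    static List<Long> refine(Collection<Long> candidateIds,
                             Map<Long, double[]> geometryById,
                             Predicate<double[]> exactTest) {
        List<Long> result = new ArrayList<>();
        for (Long id : candidateIds) {
            double[] geometry = geometryById.get(id); // look up the stored geometry
            if (geometry != null && exactTest.test(geometry)) {
                result.add(id);
            }
        }
        return result;
    }
}
```

Because the index stays unchanged, the geometry is stored only once, at the cost of one lookup per candidate id.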
