Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial indexing #16

Open
benbovy opened this issue Nov 9, 2023 · 7 comments
Open

Spatial indexing #16

benbovy opened this issue Nov 9, 2023 · 7 comments

Comments

@benbovy
Copy link
Member

benbovy commented Nov 9, 2023

Currently a DGGSIndex only wraps a PandasIndex, which is limited for spatial indexing:

  • dggs.sel_latlon converts the input latitude/longitude points as grid cells and then queries the pandas index of cells. If one of the lat/lon input points after being converted into a cell is not found in the index, the whole selection returns an error. This may be useful in some cases but not in (many) other case where we would want "true" nearest-neighbor lookup (i.e., find the nearest cell present in the index given any lat/lon point).
  • The cell selection by polygon in add support for querying by geometry #9 relies on the healpy.query_polygon function, which returns all grid cells (pixels) contained or intersecting the polygon, irrespective of whether or not those cells are actually present in the index. In add support for querying by geometry #9 there's an extra step using numpy.isin() to avoid getting an error during the selection, which works but is probably not optimal. The more general issue is that other grid backends (H3) may not have this feature provided by healpy. Also, we might want more advanced query options (e.g., within vs. intersect, etc.).

A general solution for spatial indexing might be to follow the approach used in h3ron-polars, which provides a few spatial indexes (kd-tree, r-tree, packed-hilbert-rtree) that work with either the cell centroids or their envelope (rectangle). We could probably do the same here. We might be able to eventually reuse the indexes implemented in xoak once it is refactored to leverage Xarray custom indexes.

Alternatively, we could extract the cell geometries (#10) and then just rely on (or redirect users to) xvec for spatial queries. Xvec's GeometryIndex internally reuses GEOS' STRTree via shapely 2.0.

@danlooo
Copy link
Contributor

danlooo commented May 2, 2024

I tried spatial indicies e.g. a k-d tree for the DGGRID DGGS in danlooo/DGGS.jl#34
Searching for the nearest neighbor i.e. cell center lat/lon coordinate failed in reassembling the cell polygon geometry leading into many points that are assigned to a wrong cell. This is because the cell polygons are not regular anymore after the Snyder Equal Area Projection.

@keewis
Copy link
Collaborator

keewis commented Dec 1, 2024

I've come to the conclusion that I'd try as much as possible to avoid materializing the cell boundaries into memory: the xvec index currently requires loading all geometries into memory at the same time, which may exceed the memory (and even if xvec gets a distributed rtree, materializing polygons is costly).

Instead, how about defining DGGSIndex.sel such that:

  • ds.sel(cell_ids=lat_lon_coords) does point-wise indexing (where the object is possibly a Coordinates or Dataset object? Not sure, could also be a custom dataclass)
  • ds.sel(cell_ids=geometry, method="intersects") does polygon / multi-polygon selection (the method defines whether this should be intersection, strict inclusion, etc.)
  • ds.sel(cell_ids=bbox_tuple) does bbox selection (could also define a specific Bbox dataclass, or introduce a type="spherical / type="planar" kwarg instead that allows switching between great / small-circle selection, or switch depending on whether the geometry is a shapely or spherely object); the method is the same as above

For the latter two, the geometry / bbox would be rasterized using the backend library (e.g. h3ronpy has geometry_to_cells), and then we'd just have to do a cell id selection on a pandas index while ignoring non-existent cell ids. If the backend library does not support this I think we can just create pull requests to make them support that.

It does feel a little awkward to do cell_ids=<geometry or coords>, but unless we'd want to allow cells=<geometry/coords> instead that may be unavoidable? Otherwise we could also define ds.dggs.sel as something like

def sel(self, geometry=None, /, *, method=None, **coords):
    pass

It also won't resolve the concern for method="nearest" above, but I don't think we have to resolve that to make progress on the above.

@benbovy
Copy link
Member Author

benbovy commented Dec 2, 2024

A few thoughts:

  • Storing explicit DGGS cell geometries is at a certain expense but not orders of magnitude more than storing the DGGS cell ids : “only” 6x more memory in the (worst?) case of hexagonal cells. When using a projected CRS it may require more vertices (tesselate) to accurately represent curved cell edges, although this quickly becomes irrelevant for higher refinement levels (small cells). Otherwise cell corners is enough I guess with tools that work with geodetic edges (e.g. spherely).

  • Spherely (via s2geometry) might eventually provide memory-optimized & serializable spatial indexes (see Querying geographies (spatial index) benbovy/spherely#72), although this will require some work.

  • I think that in many cases Xvec will come handy (although sub-optimal) for spatial selection of DGGS cells using arbitrary input geometries. I also think that it might be better to support it transparently, i.e., provide some convenient API here to convert a DGGS data cube to a vector data cube -- or simply to add cells as POLYGON or POINT indexes geometry coordinates -- and let users know they can do spatial queries via Xvec. Perhaps it would be better to have concrete benchmark results or user feedback before trying to implement more optimal and flexible spatial indexing here?

  • I'm not sure that rasterizing arbitrary geometries to DGGS cells is a trivial operation (see h3ronpy's polygon fill modes). It may not be always the most optimized approach either, e.g., rasterizing a large bbox at a high (fixed) refinement level would result in a very large number of cells.

  • DGGS data cubes may have two main configurations: either a "raster-like" coverage of contiguous cells that are already sorted following a certain pattern (that may have some spatial meaning, e.g., healpix's RING vs. NESTED) or a "point-cloud-like" bunch of non-sorted, sparsely distributed cells (I'm here skipping "multi-resolution" / compact-form cell coverage for simplicity). The best approach for spatial indexing will likely differ depending on which of these configurations is considered.

@benbovy
Copy link
Member Author

benbovy commented Dec 2, 2024

Regarding the API, I would avoid trying to support all kinds of data selection via Xarray’s sel() API.

I find ds.sel(cell_id=<cell ids | lat-lon coords | geometries>) potentially very confusing! (on a more general note I'm a bit concerned already by all kinds of behaviors that will result from more custom indexes available in the broader Xarray ecosystem). In fact I would restrict here the indexer type to actual cell ids, i.e., ds.sel(cell_id=<cell ids>), and why not support ids given on different refinement levels sometime in the future (e.g., select all cells given a cell ids at a parent level).

For more advanced queries, I’d rather see it exposed via something like ds.dggs.query(…) similarly to ds.xvec.query(…).

@keewis
Copy link
Collaborator

keewis commented Dec 2, 2024

“only” 6x more memory in the (worst?) case of hexagonal cells

I think that's 12x: cell ids are one 64-bit integer per cell, while boundaries have 5-6 corners per cell with 2 64-bit floats per corner. For the case of healpix, this is 8x (4 corners with 2 floating point values each), which can make the difference between "comfortably fits into memory on a medium sized laptop" and "requires bigger hardware to even get created" (for healpix level 13, this is 6.4GB compared to 51.5GB). And that estimate is a lower bound because it doesn't count the overhead of the polygon objects / RTree.

might eventually provide memory-optimized & serializable spatial indexes

I guess I didn't really point this out, but since the DGGS cells are already arranged in a tree I think it should be possible to make use of that? Not sure how geometry rasterization works for most DGGS.

Either way the RTree will also have to provide the different containment / intersection modes, so there's nothing we would gain here.

rasterizing a large bbox at a high (fixed) refinement level would result in a very large number of cells.

indeed, but this can most likely be optimized, for example by extending the backend library to only return candidate cell ids if they are part of an input set. Or maybe my suggestion was just bad and there's a better way to search cell ids in a DGGS?

Perhaps it would be better to have concrete benchmark results or user feedback before trying to implement more optimal and flexible spatial indexing here?

That's the current state, which has turned out to be somewhat slow (see e.g. the healpix subsetting in the data interpolation script) compared to rasterization (the h3 subsetting in the same script)

However, while I think we can do better for searching, I'm very open to conversion methods (ds.dggs.to_xvec()?), and I'm also trying to push for a way to allow converting to uxarray in UXARRAY/uxarray#1055.

For more advanced queries, I’d rather see it exposed via something like ds.dggs.query(…) similarly to ds.xvec.query(…)

My inspiration was xvec's hook into ds.sel as that also only allows a single geometry as input, but I can see how cell_ids=geometry can be confusing. As mentioned above, I'd be fine with a separate method, though, whatever the name.

@benbovy
Copy link
Member Author

benbovy commented Dec 2, 2024

I think that's 12x

Fair enough! (6x if 32-bit floating point precision is enough :-) ).

I guess my main point is that, although far from optimal (but probably not that bad either in many cases), converting DGGS as vector geometries and then use vector tools for spatial queries based on input arbitrary geometries is a practical solution that can work today and that would provide consistent results for any kind of DGGS. It also works with distance-based queries (find nearest cells).

The opposite approach (i.e., convert the query input geometry into DGGS cells + hash-table lookup) does certainly make sense too and might be better at indexing large datasets. The downside is that to my knowledge the current DGGS implementations wouldn't allow us supporting this approach in a consistent & DGGS-agnostic way. Also I don't think that extending those backend libraries would be an easy task (especially if we want to avoid doing so in a low-level language). That said, probably it is OK supporting this approach via xdggs even though details may vary from one DGGS to another and/or it is only for a subset of the DGGSs available? I.e., Xdggs would just provide a uniform API but would not guarantee a uniform behavior across DGGSs.

Both approaches are complementary IMO.

That's the current state, which has turned out to be somewhat slow

How does it scale? Do you have some results or plots we can look at?

My inspiration was xvec's hook into ds.sel as that also only allows a single geometry as input

Yeah, for xvec it makes more sense IMO since xvec index coordinate labels are geometry objects.

@keewis
Copy link
Collaborator

keewis commented Dec 2, 2024

Do you have some results or plots we can look at?

Nothing systematic, no. If we need a benchmark I can do that once I have a bit more time, though.

Both approaches are complementary IMO.

Agreed. And additionally, going through xvec works today, while there's a lot more work necessary to support the other approach.

For another idea – which may have been mentioned before – is selecting by parent cell ids (and this one ds.sel(cell_ids=parent_cell_id) actually does make sense). For this to work, we need to figure out how to pass both the level and the cell id at the same time, or require an indexing scheme that combines both.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants