Spatial indexing #16
Comments
I tried spatial indices, e.g. a k-d tree for the DGGRID DGGS, in danlooo/DGGS.jl#34
I've come to the conclusion that I'd try as much as possible to avoid materializing the cell boundaries into memory: the Instead, how about defining
For the latter two, the geometry / bbox would be rasterized using the backend library (e.g. It does feel a little awkward to do
    def sel(self, geometry=None, /, *, method=None, **coords):
        pass
It also won't resolve the concern for
A few thoughts:
|
Regarding the API, I would avoid trying to support all kinds of data selection via Xarray’s I find For more advanced queries, I’d rather see it exposed via something like |
I think that's 12x: cell ids are one 64-bit integer per cell, while boundaries have 5-6 corners per cell with 2 64-bit floats per corner. For the case of healpix, this is 8x (4 corners with 2 floating point values each), which can make the difference between "comfortably fits into memory on a medium sized laptop" and "requires bigger hardware to even get created" (for
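As a rough back-of-the-envelope check of these ratios, the sketch below compares the bytes needed for cell ids with those needed for materialized boundaries. The helper name and the example cell count are assumptions made up for illustration, and container overhead is ignored.

```python
# Bytes per 64-bit value, as in the discussion.
INT64_BYTES = 8
FLOAT64_BYTES = 8


def boundary_to_id_ratio(corners_per_cell, float_bytes=FLOAT64_BYTES):
    """Ratio of boundary storage to cell-id storage for a single cell.

    Each corner is stored as 2 floating point values (e.g. lon, lat);
    each cell id is one 64-bit integer.
    """
    boundary_bytes = corners_per_cell * 2 * float_bytes
    return boundary_bytes / INT64_BYTES


hexagon_ratio = boundary_to_id_ratio(6)  # hexagonal cells, 6 corners
healpix_ratio = boundary_to_id_ratio(4)  # healpix cells, 4 corners

# Hypothetical dataset size, chosen only to make the numbers concrete:
# a full healpix grid has 12 * 4**level cells.
n_cells = 12 * 4**10
ids_mib = n_cells * INT64_BYTES / 2**20
boundaries_mib = ids_mib * healpix_ratio

print(f"hexagons: {hexagon_ratio:.0f}x, healpix: {healpix_ratio:.0f}x")
print(f"{n_cells} cells: ids {ids_mib:.0f} MiB, boundaries {boundaries_mib:.0f} MiB")
```

Passing `float_bytes=4` gives the ratio for 32-bit coordinates.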
I guess I didn't really point this out, but since the DGGS cells are already arranged in a tree I think it should be possible to make use of that? Not sure how geometry rasterization works for most DGGS. Either way the RTree will also have to provide the different containment / intersection modes, so there's nothing we would gain here.
indeed, but this can most likely be optimized, for example by extending the backend library to only return candidate cell ids if they are part of an input set. Or maybe my suggestion was just bad and there's a better way to search cell ids in a DGGS?
That's the current state, which has turned out to be somewhat slow (see e.g. the healpix subsetting in the data interpolation script) compared to rasterization (the h3 subsetting in the same script). However, while I think we can do better for searching, I'm very open to conversion methods (
My inspiration was |
Fair enough! (6x if 32-bit floating point precision is enough :-) ). I guess my main point is that, although far from optimal (but probably not that bad either in many cases), converting DGGS cells to vector geometries and then using vector tools for spatial queries on arbitrary input geometries is a practical solution that can work today and that would provide consistent results for any kind of DGGS. It also works with distance-based queries (find nearest cells). The opposite approach (i.e., convert the query input geometry into DGGS cells + hash-table lookup) certainly makes sense too and might be better at indexing large datasets. The downside is that, to my knowledge, the current DGGS implementations wouldn't allow us to support this approach in a consistent and DGGS-agnostic way. Also, I don't think that extending those backend libraries would be an easy task (especially if we want to avoid doing so in a low-level language). That said, it is probably OK to support this approach via xdggs even though details may vary from one DGGS to another and/or it is only available for a subset of DGGSs? I.e., xdggs would just provide a uniform API but would not guarantee uniform behavior across DGGSs. Both approaches are complementary IMO.
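The opposite approach (query geometry converted to cell ids, then a hash-table lookup) could look roughly like the sketch below. It assumes the backend has already rasterized the query geometry into `candidate_ids`; that step and all names here are hypothetical, and a plain `dict` stands in for whatever index structure would actually be used.

```python
def build_lookup(cell_ids):
    """Map each cell id present in the dataset to its integer position."""
    return {cell_id: pos for pos, cell_id in enumerate(cell_ids)}


def select_positions(lookup, candidate_ids):
    """Return positions of the candidates that exist in the dataset.

    Candidates missing from the dataset are skipped instead of raising.
    """
    return [lookup[c] for c in candidate_ids if c in lookup]


# Hypothetical cell ids held by a dataset (not necessarily contiguous).
dataset_ids = [1024, 1025, 1027, 2048, 2050]
lookup = build_lookup(dataset_ids)

# Hypothetical output of rasterizing a query geometry with a backend library.
candidate_ids = [1025, 1026, 1027, 2049, 2050]

positions = select_positions(lookup, candidate_ids)
print(positions)  # [1, 2, 4]
```

Each lookup is O(1) on average, so the cost scales with the number of candidate cells rather than with the size of the dataset.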
How does it scale? Do you have some results or plots we can look at?
Yeah, for xvec it makes more sense IMO since xvec index coordinate labels are geometry objects.
Nothing systematic, no. If we need a benchmark I can do that once I have a bit more time, though.
Agreed. And additionally, going through For another idea – which may have been mentioned before – is selecting by parent cell ids (and this one |
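Selecting by parent cell ids can be sketched as follows, assuming a nested, quadtree-like numbering such as the HEALPix nested scheme, where each cell has four children and the parent id is the child id shifted right by two bits per level. Other DGGSs encode their hierarchy differently, so the function names and the bit-shift rule here are illustrative assumptions, not an xdggs API.

```python
def parent_id(cell_id, level, parent_level):
    """Parent of `cell_id` at a coarser level in a nested quadtree numbering."""
    if parent_level > level:
        raise ValueError("parent_level must not be finer than level")
    return cell_id >> (2 * (level - parent_level))


def select_by_parent(cell_ids, level, parent, parent_level):
    """Positions of all cells whose ancestor at `parent_level` is `parent`."""
    return [
        pos
        for pos, cell_id in enumerate(cell_ids)
        if parent_id(cell_id, level, parent_level) == parent
    ]


# Hypothetical level-3 cell ids; parent 5 at level 2 covers ids 20..23.
cell_ids = [18, 19, 20, 21, 23, 24, 95]
print(select_by_parent(cell_ids, level=3, parent=5, parent_level=2))  # [2, 3, 4]
```

In such a numbering the children of a parent form a contiguous id range, so with sorted cell ids the same selection could also be done as a range query.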
Currently a DGGSIndex only wraps a PandasIndex, which is limited for spatial indexing:
dggs.sel_latlon converts the input latitude/longitude points into grid cells and then queries the pandas index of cells. If any input lat/lon point maps to a cell that is not found in the index, the whole selection returns an error. This may be useful in some cases, but not in the (many) other cases where we would want "true" nearest-neighbor lookup (i.e., find the nearest cell present in the index given any lat/lon point).
... numpy.isin() to avoid getting an error during the selection, which works but is probably not optimal. The more general issue is that other grid backends (H3) may not have this feature provided by healpy. Also, we might want more advanced query options (e.g., within vs. intersect, etc.).
A general solution for spatial indexing might be to follow the approach used in h3ron-polars, which provides a few spatial indexes (kd-tree, r-tree, packed-hilbert-rtree) that work with either the cell centroids or their envelope (rectangle). We could probably do the same here. We might be able to eventually reuse the indexes implemented in xoak once it is refactored to leverage Xarray custom indexes.
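To make the "true" nearest-neighbor lookup concrete, here is a minimal sketch that searches cell centroids by great-circle distance. It uses a brute-force scan as a stand-in for the kd-tree or r-tree indexes mentioned above, and the centroid coordinates, cell ids, and function names are made up for illustration.

```python
import math


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in radians between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * math.asin(math.sqrt(a))


def nearest_cell(centroids, lat, lon):
    """Return the id of the indexed cell whose centroid is closest to (lat, lon).

    `centroids` maps cell id -> (lat, lon) of the cell center. Unlike an
    exact-match lookup, this always returns a cell that is in the index.
    """
    return min(centroids, key=lambda cid: haversine(lat, lon, *centroids[cid]))


# Hypothetical centroids of the cells present in a sparse dataset.
centroids = {
    101: (48.0, 2.0),
    102: (48.5, 2.5),
    205: (51.5, -0.1),
}

print(nearest_cell(centroids, 48.4, 2.4))  # 102
print(nearest_cell(centroids, 52.0, 0.0))  # 205
```

The scan is O(n) per query; a tree-based index would serve the same queries faster on large datasets.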
Alternatively, we could extract the cell geometries (#10) and then just rely on (or redirect users to) xvec for spatial queries. Xvec's GeometryIndex internally reuses GEOS' STRtree via shapely 2.0.