Discussion : writing xarray-compatible zarr stores #66

Open
emileten opened this issue Apr 16, 2024 · 7 comments
@emileten
Collaborator

emileten commented Apr 16, 2024

Context

Hi @joemoorhouse. This topic was briefly mentioned in os-climate/physrisk#263, where we discussed how coordinate labels are stored in, and retrieved from, the output zarr stores in OS-C, in particular the hazard indicators. It was stressed that we do not save coordinate labels directly in the zarr metadata; instead, we save a transform matrix that can be combined with the array indices to recover the coordinates. This makes the output incompatible with native xarray operations, which is why we have a module dedicated to offering an API for reading the output as xarray objects.

Suggestion

Can we write the coordinates to the zarr metadata so that the output is directly xarray-compatible? This would, for example, allow us to experiment with titiler-xarray for serving tiles.

You mentioned performance concerns in the linked issue, but I am not sure I understand why, in principle, this issue should arise. The ideal way to save such data, I believe (and there is some reference to that), would be a single zarr store with latitude, longitude, temporal, model and scenario dimensions (with the coordinate labels for each of those in the zarr store metadata), and with a chunking structure optimized for how the data is used. This also suggests potentially broader changes in how the data is saved, since I think that, for example in the days-above-tas workflow, we are saving separate zarr arrays for each year?
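To make the suggestion concrete, here is a minimal sketch of such a single, consolidated store written via to_zarr so that the coordinate labels travel with the data. This is not the current pipeline; the indicator name, dimensions, resolution and chunk sizes are purely illustrative.

```python
import numpy as np
import xarray as xr

# Illustrative grid and labels; not the actual hazard indicator resolution.
lats = np.linspace(-89.5, 89.5, 180)
lons = np.linspace(-179.5, 179.5, 360)
years = [2030, 2040, 2050]
scenarios = ["ssp245", "ssp585"]  # hypothetical scenario labels

ds = xr.Dataset(
    {
        # hypothetical indicator name
        "days_tas_above": (
            ("scenario", "year", "lat", "lon"),
            np.zeros((len(scenarios), len(years), len(lats), len(lons)), dtype="float32"),
        )
    },
    coords={"scenario": scenarios, "year": years, "lat": lats, "lon": lons},
)

# Chunking chosen for point look-ups: keep all scenarios/years together,
# tile the spatial dimensions.
ds = ds.chunk({"scenario": -1, "year": -1, "lat": 90, "lon": 90})
ds.to_zarr("days_tas_above.zarr", mode="w", consolidated=True)

# The store now opens directly, coordinate labels included:
reopened = xr.open_zarr("days_tas_above.zarr")
```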

@emileten emileten self-assigned this Apr 16, 2024
@sharkinsspatial

@emileten There is currently an additional utility write_data_array method on the OSCZarr class that supports writing with the xarray-compliant _ARRAY_DIMENSIONS convention. This works for me when used in the pytest environment, but it throws an xarray exception on set_index when run with actual nex_gddp_cmip6 data; I'll need to investigate a bit more here.
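For reference, a minimal sketch of the _ARRAY_DIMENSIONS convention itself (zarr-python v2 API assumed; array and dimension names are illustrative and not the OSCZarr implementation):

```python
import numpy as np
import xarray as xr
import zarr

store = zarr.DirectoryStore("indicator.zarr")
root = zarr.group(store=store, overwrite=True)

# Data variable: a plain Zarr array becomes readable by xarray once it
# declares its dimension names via the _ARRAY_DIMENSIONS attribute.
data = root.create_dataset("days_tas_above", shape=(3, 180, 360), chunks=(3, 90, 90), dtype="f4")
data.attrs["_ARRAY_DIMENSIONS"] = ["index", "lat", "lon"]

# Coordinate labels live in sibling arrays named after their dimension.
lat = root.create_dataset("lat", data=np.linspace(-89.5, 89.5, 180))
lat.attrs["_ARRAY_DIMENSIONS"] = ["lat"]
lon = root.create_dataset("lon", data=np.linspace(-179.5, 179.5, 360))
lon.attrs["_ARRAY_DIMENSIONS"] = ["lon"]

zarr.consolidate_metadata(store)
ds = xr.open_zarr(store)  # dims and coords are resolved from _ARRAY_DIMENSIONS
```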

@joemoorhouse if possible, can you elaborate a bit on the use of an index variable in the normalize_array method? I'm not clear on the purpose of indexing on an additional variable here.

Additionally, looking through some of the discussion on the interpolation methods in the physrisk repo, are there selection/interpolation methods being used in physrisk that can't be supported using standard xarray coordinates and interpolation routines?

I may be misunderstanding the intent, but it seems like some of the code could potentially be simplified by working directly in xarray with the zarr data in transform space, rather than in numpy with data in raw pixel space. I know this was discussed a bit here, but can you provide a bit more detail on the advantages of the GeoTIFF-based approach being used?

@joemoorhouse
Collaborator

Hi @emileten, hi @sharkinsspatial,
Some answers in reverse order (and it sounds like a call to discuss might also be useful?):
First, on the question of why in physrisk we are not using xarray (as compared to hazard, where we use it widely) but rather the Zarr library directly to access the Zarr array: I did indeed start by using xarray in physrisk but was fighting to get the performance I wanted, so opted to go lower level. There is not much xarray functionality we need, and performance optimisation was harder. In hazard, in contrast, our use cases already seem a good fit for xarray + Dask.
More precisely, we have a (very) important main use case, which is looking up single points from latitudes and longitudes. There we want to look up on the order of millions of locations as quickly as possible. Here the affine transform gives a computational advantage because we can calculate the pixel locations directly: no look-up is necessary at all. When doing the same thing with xarray labels, I guess (?) we are at best doing a log2(N) search for each look-up along each coordinate axis, where N is the dimension length (from memory: I'd have to debug through this again to check what it is doing!).
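A rough sketch of that direct look-up, using the affine library; the transform coefficients and grid below are made up for illustration:

```python
import numpy as np
from affine import Affine

# e.g. a 0.1-degree global grid, row 0 at 85N, column 0 at 180W (illustrative)
transform = Affine(0.1, 0.0, -180.0, 0.0, -0.1, 85.0)
inv = ~transform  # invert once, reuse for millions of points

lons = np.array([2.35, -74.0, 139.7])
lats = np.array([48.85, 40.7, 35.7])

cols, rows = inv * (lons, lats)  # fractional pixel coordinates, pure arithmetic
cols = np.floor(cols).astype(int)
rows = np.floor(rows).astype(int)

# values = zarr_array[:, rows, cols]  # fancy-index all index values at once
```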
Other considerations too:

  • We want to use Zarr's async behaviour when downloading chunks, but we need the flexibility to control any thread/process pool behaviour on top of that; I'm a bit wary about what xarray is doing there.
  • Nice to avoid loading up and caching labels, e.g. in the API.

@joemoorhouse
Collaborator

The index dimension is just a convention whereby the hazard indicators are, by construction, 3-dimensional Zarr arrays. Two dimensions are spatial and the third is the index dimension, which depends on the nature of the data. Acute hazards are 3D because we always have a number of return periods, e.g. flood depth for 30Y, 100Y, 200Y, ... returns. For some chronic hazards it is also useful to make the data 3D because it depends on a number of thresholds, e.g. days with wet bulb globe temperature above 20, 25, 30, ... °C.

It is important performance-wise to have something 3D, because for the physical climate risk calculations we typically need all of the index values for a particular lat and lon (so we also want them in the same chunk).
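As a sketch, with illustrative shapes and a hypothetical array name, the layout keeps the whole index dimension inside each chunk so one chunk read returns every return period (or threshold) for a queried location:

```python
import zarr

n_index, n_lat, n_lon = 8, 1800, 3600
z = zarr.open(
    "flood_depth.zarr",           # hypothetical array name
    mode="w",
    shape=(n_index, n_lat, n_lon),
    chunks=(n_index, 256, 256),   # full index axis per chunk, spatial tiles
    dtype="f4",
)

# All return periods for one point come from a single chunk:
curve = z[:, 1023, 2047]
```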

@joemoorhouse
Collaborator

On storing the coordinate labels: I think when the xarray [to_zarr method](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) is called, the dataset is actually stored as a Zarr group with arrays for the data itself and for each coordinate. That is, the convention is not to store the labels in the metadata but to have them in separate arrays. @emileten, was that what you had in mind, or was it actually a variation on that?

I suspect that the to_zarr format is trying to keep the metadata (attributes) as small as possible so that it is fast to load an array, although you cannot then create an xarray object without downloading the coordinates. I was also wondering whether to adopt that convention...

I am in two minds. Perhaps it is personal taste, but I find the GeoTIFF-like approach of storing the affine transform much more elegant than storing all of the labels: that way we can keep the labels in the metadata without it getting too big, because we are storing them so efficiently (6 numbers).

I also struggle a bit with the structure we end up with when using the to_zarr method, because an array becomes a group. I tried this out for representing map tile sets, where we end up with a group of, say, 9 zoom levels and each zoom level is itself a group containing the array and coordinates. I thought it seemed an overly complex structure.

On the other hand, adopting that convention does not prevent us from doing any performance optimisations; i.e. we could still ignore the coordinate arrays and regenerate them from the affine transform if needed.
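A sketch of that regenerate-from-transform option, assuming a hypothetical transform_mat3x3 attribute holding the 6 coefficients and an axis-aligned grid (no rotation/shear terms):

```python
import numpy as np
import xarray as xr
import zarr
import dask.array as dsa
from affine import Affine

z = zarr.open("flood_depth.zarr", mode="r")          # the 3D (index, lat, lon) array
a, b, c, d, e, f = z.attrs["transform_mat3x3"][:6]   # hypothetical attribute name
transform = Affine(a, b, c, d, e, f)

n_index, n_lat, n_lon = z.shape
cols = np.arange(n_lon) + 0.5                        # pixel centres
rows = np.arange(n_lat) + 0.5
lons = c + a * cols                                   # assumes b == d == 0 (axis-aligned grid)
lats = f + e * rows

# Lazily wrap the Zarr array and attach the regenerated labels.
da = xr.DataArray(
    dsa.from_zarr(z),
    dims=("index", "lat", "lon"),
    coords={"lat": lats, "lon": lons},
    name="flood_depth",
)
```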

Whatever option we choose, it is still pretty easy to create an xarray object from the Zarr array, and it might not prevent us from using, say, titiler-xarray (e.g. if we can add a hook for the logic that creates the xarray object from a single Zarr array, as opposed to a group).

@emileten emileten changed the title Writing xarray-compatible zarr stores Discussion : writing xarray-compatible zarr stores Apr 18, 2024
@emileten
Collaborator Author

Thanks both for all the input. For now we'll just ensure this module offers the option to write output with coordinates in the metadata: #67. This will be optional; the default remains the legacy behavior.

@emileten
Collaborator Author

Marked this as a 'discussion'; it's not really an issue anymore. Not sure how to convert an issue to an actual GitHub discussion.

@j08lue
Contributor

j08lue commented Aug 13, 2024

Related discussion on Pangeo considering Xarray to support affine transforms in gridded data stores (e.g. Zarr):
