Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example using xarray atmospheric data #32

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

Add example using xarray atmospheric data #32

wants to merge 7 commits into from

Conversation

jni
Copy link
Owner

@jni jni commented Nov 27, 2024

Example to view xarray atmospheric model data in napari.

The example data is too large to display in our gallery, so I'm making the PR
against my own repo for discussion. We can try the same approach with a smaller
dataset, or with a dataset in a cloud-native format like zarr, where therefore
the gallery would only need to load a single timepoint.

Code to download the data in this example:

# download model prediction data
curl -O https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20241104/0000/fc/ml/air_temp.nc
curl -O https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/20241104/0000/fc/ml/spec_hum.nc

# download corresponding 10 days' worth of measurements
mkdir an && cd an  # use 'an' folder for single time points
for day in 04 05 06 07 08 09 10 11 12 13; do
  for hour in 00 06 12 18; do
    curl https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/202411${day}/${hour}00/an/ml/spec_hum.nc -o ${day}-${hour}-spec_hum.nc
    curl https://thredds.nci.org.au/thredds/fileServer/wr45/ops_aps3/access-g/1/202411${day}/${hour}00/an/ml/air_temp.nc -o ${day}-${hour}-air_temp.nc
  done
done

As noted in the example itself, napari's axes ([plane], row, column), with the
origin at the top left, match NumPy arrays but are unsuitable for latitude
data, which starts at 90 at the top and ends at -90 at the bottom. The polarity
of the latitude is therefore inverted for plotting, but we must add the ability
in napari to describe the geometry of the world space relative to canvas space.

@jni
Copy link
Owner Author

jni commented Nov 27, 2024

@tennlee here's an example of grabbing the coordinate metadata from xarray and passing it to napari.

Unfortunately, I'm finding the resampling in xarray to be suuuuper slow. Here's what I mean:

In [6]: ds = xr.open_dataset('spec_hum.nc', chunks={'time': 1})
/Users/jni/micromamba/envs/all/lib/python3.11/site-packages/xarray/core/dataset.py:282: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 1. This could degrade performance. Instead, consider rechunking after loading.

In [26]: ds_reg = ds.interp(coords={'time': np.arange(np.array(ds.time[0]), np.a
    ...: rray(ds.time[-1]), np.array(np.diff(ds.time[:2]))[0])}, method='nearest
    ...: ')

In [39]: %timeit arr = np.asarray(ds_reg.spec_hum[0])
16.4 s ± 214 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [40]: %timeit arr = np.asarray(ds.spec_hum[0])
518 ms ± 94.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [41]: %timeit arr = np.asarray(ds.spec_hum[0, 0])
124 ms ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [42]: %timeit arr = np.asarray(ds_reg.spec_hum[0, 0])
34.4 s ± 7.88 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: %%timeit
    ...: arr = np.asarray(ds.spec_hum[0, 0])
    ...: idxs = np.meshgrid(np.arange(arr.shape[0]), np.arange(arr.shape[1]), in
    ...: dexing='ij')
    ...: arr_res = ndi.map_coordinates(arr, idxs, order=0)
    ...:
    ...:
486 ms ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What's happening behind the scenes in napari is:

  • loading xarray with chunks means that the xarray datasarray is backed by dask
  • when you move a slider in napari, napari materialises the requested array using np.asarray(arr[slice_info])
  • this is super slow currently for reasons I don't understand

(it's also annoying that the model data starts at 0100 and the measurements at 0000 😂)

It's still usable if you turn on async mode in napari, which you can do either with the NAPARI_ASYNC=1 environment variable, or by setting the experimental > render images asynchronously checkbox in the preferences (Cmd+, on Mac when the viewer is in focus).

Overall though, this is a very cool dataset. I like that it shows off napari's ability to overlay data with different time steps and extents, too. We could also load the temperature volumes and treat them the same way, and make them invisible by default. (Pass visible=False to the layer.)

Ideally, I'd like to save (a) the model resampled at 1h intervals, and (b) the measurements as geozarr or some format that xarray can natively read backed by zarr. If we put that online somewhere useful, this could go into the napari sample gallery (must be able to be built without downloading a massive dataset).

Any ideas?

@jni
Copy link
Owner Author

jni commented Nov 27, 2024

Oh I forgot to add some screenshots:
Screenshot 2024-11-27 at 2 07 57 PM
Screenshot 2024-11-27 at 2 59 40 PM

@jni
Copy link
Owner Author

jni commented Nov 28, 2024

Ah, @tennlee I figured out why the sampling is uneven in the raw data, now that I displayed it without resampling in its own viewer. If you hit play on the first viewer (the grayscale one), you will see that partway through the playback it speeds up. So the time interval increases as you get further into the model — which I guess makes sense since the model wouldn't be able to do hourly precision by that point anyway?

@tennlee
Copy link

tennlee commented Nov 28, 2024

I might have to get to this on the weekend. Dealing with 3 days of backlog apparently takes time. :)

@tennlee
Copy link

tennlee commented Nov 28, 2024

Just a note for something I thought would be neat to try. If there's a callback option in napari, it would be interesting to try to install scores into a local install and then register a verification calculation for overlay/visualisation against the main data. I haven't thought about how that would work, but it may be more interactive than doing all the data calc up front, particularly if people are being more investigative in reviewing data. Not saying it's important, I just wanted to share the idea so it's not just locked in my head only.

@jni
Copy link
Owner Author

jni commented Nov 29, 2024

If there's a callback option in napari, it would be interesting to try to install scores into a local install and then register a verification calculation for overlay/visualisation against the main data.

yes, this is easy to do! You can do viewer.dims.events.point.connect() which will fire whenever you move the sliders with the value of in "real" coordinates.

Having said that, in this case I would probably define a dask array based on the scores computation and display that, rather than hook a lower-dimensional array up to the current point.

@jni
Copy link
Owner Author

jni commented Nov 29, 2024

(Also btw — no rush on this, other than the excitement of finding a cool reason to collaborate. 😃 As I mentioned over Signal, I'm happy to come in to Melb some day to continue sprinting, when the time is right for you!)

@tennlee
Copy link

tennlee commented Nov 29, 2024

Cool. Well I've done the setup and reproduced the issue, which is a good first step.

@tennlee
Copy link

tennlee commented Nov 29, 2024

Okay, so I don't know why exactly, but I achieved
image

by removing the chunks argument in the original file load. Napari may need the chunking in that form, but the interpolation is quick without it.

I'll try loading the resulting thing into napari and testing.

(update - napari seems perfectly happy)

@jni
Copy link
Owner Author

jni commented Nov 29, 2024

The chunking is totally optional. I'm not surprised at those results @tennlee because without the chunks argument

  • ds is completely loaded into memory as a NumPy rather than dask array, and
  • ds_reg is computed eagerly (the ds_reg = ... call takes a long time, yes?), and uses a lot of memory in the process, too. I think you are just front-loading the slowness.

At that point, you have a NumPy array and everything should be super fast indeed. But your process will be using heaps of RAM, which is undesirable.

@tennlee
Copy link

tennlee commented Nov 30, 2024

@jni
Copy link
Owner Author

jni commented Dec 9, 2024

Oof, found the culprit:

The current code also has the unfortunate side-effect of merging all chunks too.

from pydata/xarray#6799

@jni
Copy link
Owner Author

jni commented Dec 9, 2024

Given this limitation I think the solution for making this example work nicely is to save the interpolated array to zarr and host it somewhere.

@jni
Copy link
Owner Author

jni commented Jan 3, 2025

Well this is exciting:

In [4]: ds = xr.open_zarr('https://zarr-data.xyz/thredds-20241104-spec_hum-resampled.zarr')
In [5]: ds
Out[5]:
<xarray.Dataset> Size: 12GB
Dimensions:    (theta_lvl: 4, lat: 1536, lon: 2048, time: 239)
Coordinates:
  * lat        (lat) float64 12kB 89.94 89.82 89.71 ... -89.71 -89.82 -89.94
  * lon        (lon) float64 16kB 0.08789 0.2637 0.4395 ... 359.6 359.7 359.9
  * theta_lvl  (theta_lvl) float32 16B 20.0 53.34 100.0 160.0
  * time       (time) datetime64[ns] 2kB 2024-11-04T01:00:00 ... 2024-11-13T2...
Data variables:
    A_theta    (theta_lvl) float32 16B dask.array<chunksize=(1,), meta=np.ndarray>
    B_theta    (theta_lvl) float32 16B dask.array<chunksize=(1,), meta=np.ndarray>
    spec_hum   (time, theta_lvl, lat, lon) float32 12GB dask.array<chunksize=(1, 1, 1536, 2048), meta=np.ndarray>
Attributes:
    Conventions:   CF-1.5,ACDD-1.3
    base_date:     20241104
    base_time:     0
    date_created:  20241104
    expt_id:       0001
    institution:   Australian Bureau of Meteorology
    modl_vrsn:     ACCESS-G
    source:        APS3
    summary:       forecast data
    title:         forecast data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants