Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

interactive plotting with lonboard #67

Merged
merged 32 commits into from
Oct 14, 2024
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
302a97b
draft of the explore functionality based on `lonboard`
keewis Oct 1, 2024
3606fb2
expose the draft
keewis Oct 1, 2024
0851443
docstring for `explore`
keewis Oct 2, 2024
774cf3f
use the proper import from `matplotlib` for the standard normalizer
keewis Oct 2, 2024
2ffbb5f
use `pandas` to compute the half range
keewis Oct 2, 2024
6348ce2
specify `vmin` and `vmax`
keewis Oct 2, 2024
0c2dc44
explicitly specify `skipna`
keewis Oct 2, 2024
41856c1
refactor to construct the polygons as geoarrow
keewis Oct 8, 2024
091a7bd
allow choosing the transparency
keewis Oct 8, 2024
2ef3192
move the healpix-specific code to `xdggs.healpix`
keewis Oct 8, 2024
46411a4
change the signature of `cell_boundaries`
keewis Oct 8, 2024
05495b8
also have h3 support the geoarrow cell boundaries
keewis Oct 8, 2024
5149fa4
fix several typos
keewis Oct 8, 2024
1ec8bf3
adjust the tests to fit the new dateline fix algorithm
keewis Oct 8, 2024
269f684
update the list of extra dependencies for `explore`
keewis Oct 8, 2024
27a6193
mention that this only works for 1D arrays for now
keewis Oct 8, 2024
f7ddf09
add optional dependencies for `explore`
keewis Oct 8, 2024
e2958b7
explicitly use `pyproj` to construct the crs string
keewis Oct 8, 2024
f96a64b
raise an error if the geometry type is wrong
keewis Oct 8, 2024
49b54fe
wrap the output of `crs.to_json()` in a json dict
keewis Oct 8, 2024
20fd427
verify that both backends produce the same polygons
keewis Oct 9, 2024
8117e7d
move the data normalization to a separate function
keewis Oct 12, 2024
0a2f577
move the table creation to a separate function
keewis Oct 12, 2024
71f612a
check that the normalization works properly
keewis Oct 12, 2024
998e6fb
coverage configuration
keewis Oct 12, 2024
0d31059
pass on `center` to the normalizing function
keewis Oct 12, 2024
5a71559
skip the plotting module if `arro3.core` is not available
keewis Oct 12, 2024
76ffe15
use `json` to construct the arrow extension metadata
keewis Oct 12, 2024
84938a1
Revert "skip the plotting module if `arro3.core` is not available"
keewis Oct 12, 2024
6ca24de
workaround for missing macos-arm packages of arro3-core
keewis Oct 12, 2024
6d9b78a
install arro3-core using `pip`
keewis Oct 12, 2024
c891e12
always include the cell ids in the table data
keewis Oct 14, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions xdggs/accessor.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@

from xdggs.grid import DGGSInfo
from xdggs.index import DGGSIndex
from xdggs.plotting import explore


@xr.register_dataset_accessor("dggs")
Expand Down Expand Up @@ -115,3 +116,40 @@ def cell_boundaries(self):
return xr.DataArray(
boundaries, coords={self._name: self.cell_ids}, dims=self.cell_ids.dims
)

def explore(
self, *, cell_boundaries="cell_boundaries", cmap="viridis", center=None
):
"""interactively explore the data using `lonboard`

Requires `geopandas`, `matplotlib`, and `lonboard` to be installed.
keewis marked this conversation as resolved.
Show resolved Hide resolved

Parameters
----------
cell_boundaries : str
The name of the coordinate containing the pre-computed cell boundary polygons.
cmap : str
The name of the color map to use
center : int or float, optional
If set, will choose this as the center point for a diverging color map.

Returns
-------
map : lonboard.Map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like you have docs yet, but if you use sphinx or mkdocs (with mkdocstrings), lonboard implements intersphinx, so you should be able to set it up so the return value is a hyperlink to the lonboard docs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the hint. I'll probably use sphinx to set up the docs, so I'll definitely make use of that!

The rendered map.

Notes
-----
Plotting currently is restricted to `DataArray` objects.
"""
if isinstance(self._obj, xr.Dataset):
raise ValueError("does not work with Dataset objects, yet")

cell_dim = self._obj[self._name].dims[0]
return explore(
self._obj,
cell_boundaries=cell_boundaries,
cell_dim=cell_dim,
cmap=cmap,
center=center,
)
53 changes: 53 additions & 0 deletions xdggs/plotting.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
import numpy as np


def explore(
arr,
cell_boundaries="cell_boundaries",
cell_dim="cells",
cmap="viridis",
center=None,
):
import geopandas as gpd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you have interest, it would be possible to implement this conversion in a way that doesn't require GeoPandas (or pyarrow), by using raw numpy buffers and arro3, a more minimal implementation of Arrow in Python that Lonboard adds as a dependency in the next version.

This would also be faster to render as it doesn't have to go through GeoPandas (and shapely geometries).

But it would be a little more complex code here. I can give you some pointers if you want.

Copy link
Collaborator Author

@keewis keewis Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely be interested.

Right now, what we're doing is ask the user to pre-compute cell vertices from the cell ids (which can have various forms for different DGGS, but most often are int64 or uint64 arrays), then convert them into shapely polygons. This is one of the things that take the longest other than converting to geopandas and constructing the SolidPolygonLayer, so if we could instead extend deck.gl to directly support them (like it supports H3 ids), that would be best. However, I suppose that will either take a long time and require someone to actually do the work, especially since there's so many different kinds of DGGS (and I'm not confident in my TypeScript / JavaScript skills so I won't be able to work on this myself anytime soon).

So as the next best step, should we construct geoarrow polygons from the cell vertex coordinates, possibly using arro3? If you have any examples I'd be happy to try figuring out how to do that.

Copy link
Contributor

@kylebarron kylebarron Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end of the day, we need to create vertex information in Lonboard/deck.gl for rendering. I believe that even deck.gl's H3 layer does not render the h3 cells on the GPU from an ID, but rather uses those IDs on the CPU in JS to create the vertex geometry.

For now, I think it's simplest to convert cell ids to Polygon geometries in Python, and then we can use the SolidPolygonLayer/PolygonLayer directly.

A GeoArrow Polygon array is just:

  • a contiguous buffer of all the coordinates in all the polygons
  • plus an offset array that points into the coordinates array and which says which coordinates are in each ring
  • plus an offset array that points into the previous ring offsets array, and which says how many rings are in each geometry.

Here's a quick code example that might help illustrate:

import arro3.core
import h3.api.numpy_int as h3
import lonboard
import numpy as np
from arro3.core import Array, Field, fixed_size_list_array, list_array

cell = h3.geo_to_h3(-10, 15, 4)
disk = h3.k_ring(cell, 2)

# Overallocates in case there are any pentagons
coords_per_hexagon = 7
total_coords = len(disk) * coords_per_hexagon
# Allocate coordinate buffer
coords = np.zeros((total_coords, 2), dtype=np.float64)
# Allocate ring offsets. The length should be one greater than the number of rings
ring_offsets = np.zeros(len(disk) + 1, dtype=np.int32)
# In the case of h3 cells, we know that every cell will be a simple polygon with a
# single ring, so we can just use `arange`
geom_offsets = np.arange(len(disk) + 1, dtype=np.int32)

coord_offset = 0
for i, h in enumerate(disk):
    boundary = h3.h3_to_geo_boundary(h, geo_json=True)
    num_coords = len(boundary)
    coords[coord_offset : coord_offset + num_coords, :] = boundary
    coord_offset += num_coords
    ring_offsets[i + 1] = coord_offset

# shrink the coordinates array, in case there are any pentagons
last_coord = ring_offsets[-1]
coords = coords[:last_coord, :]


polygon_array = list_array(geom_offsets, list_array(ring_offsets, coords))
# We need to tag the array with extension metadata (`geoarrow.polygon`) so that Lonboard knows that this is a geospatial column.
polygon_array_with_geo_meta = polygon_array.cast(
    polygon_array.field.with_metadata({"ARROW:extension:name": "geoarrow.polygon"})
)
lonboard.viz(polygon_array_with_geo_meta)

(Note that you need to use latest main of arro3, because I just implemented support for using numpy-backed memory for Rust Arrow arrays kylebarron/arro3#204, kylebarron/arro3#208)

image

(Note that this simple implementation probably breaks over the antimeridian.)

h3-py is not itself vectorized, so converting from h3 cell ids to polygon vertices is quite slow. In the longer term, I'd like to update https://github.com/nmandery/h3ronpy to have a fast way to convert from h3 cell ids to GeoArrow polygon array (or MultiPolygonArray in the case of cells that cross the antimeridian)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great @kylebarron !

I believe that even deck.gl's H3 layer does not render the h3 cells on the GPU from an ID, but rather uses those IDs on the CPU in JS to create the vertex geometry.

Interesting. Do you know if deck.gl has some clever optimizations for plotting H3/S2 like, e.g., compute on the fly the cell polygon geometries and/or aggregating data (parent cells), depending on the current zoom level / display extent? This might still have some benefits for plotting very large datasets (common use case) interactively. Or maybe Lonboard allows doing such optimizations on the server side (widget callback)?

I just implemented support for using numpy-backed memory for Rust Arrow arrays

I guess this could really help better integration between arrow and xarray in general? 👀

Copy link
Collaborator Author

@keewis keewis Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the longer term, I'd like to update https://github.com/nmandery/h3ronpy to have a fast way to convert from h3 cell ids to GeoArrow polygon array

h3ronpy has h3ronpy.arrow.vector.cells_to_wkb_polygons which would require you to decode WKB, so not sure if that's better.

In any case, I've translated my example from #11 to use your example (while borrowing code from lonboard.viz to construct the arrow table and construct a SolidPolygonLayer):

import xarray as xr
import arro3.core
import healpy as hp
import lonboard
import numpy as np
from matplotlib import colormaps
from arro3.core import Array, Field, fixed_size_list_array, list_array, Schema, Table, ChunkedArray

resolution = 10

ds = (
    xr.Dataset(
        coords={
            "cell_ids": (
                "cells",
                np.arange(12 * 4**resolution),
                {
                    "grid_name": "healpix",
                    "resolution": resolution,
                    "indexing_scheme": "nested",
                },
            )
        }
    )
    .pipe(xdggs.decode)
    .pipe(lambda ds: ds.merge(ds.dggs.cell_centers()))
    .assign(data=lambda ds: np.sin(np.radians(ds["latitude"])))
)

level = 10
cell_ids = np.arange(12 * 4**level)


def cell_ids2vertices(cell_ids, level, indexing_scheme):
    nest = indexing_scheme == "nested"
    nside = 2**level

    boundary_vectors = hp.boundaries(nside, cell_ids, step=1, nest=nest)
    lon, lat = hp.vec2ang(np.moveaxis(boundary_vectors, 1, -1), lonlat=True)
    lon_ = (lon + 180) % 360 - 180
    boundaries = np.reshape(np.stack((lon_, lat), axis=-1), (-1, 4, 2))

    # fix the dateline / prime meridian issue
    lon_ = boundaries[..., 0]
    to_fix = abs(np.max(lon_, axis=-1) - np.min(lon_, axis=-1)) > 300
    fixed_lon = lon_[to_fix, :] + 360
    boundaries[to_fix, :, 0] = fixed_lon

    return boundaries


def cell_ids2cell_boundaries_geoarrow(cell_ids, level, indexing_scheme):
    vertices = cell_ids2vertices(cell_ids, level, indexing_scheme)
    boundaries = np.concatenate([vertices, vertices[:, :1, :]], axis=1)
    
    coords = np.reshape(boundaries, (-1, 2))
    coords_per_pixel = boundaries.shape[1]
    geom_offsets = np.arange(cell_ids.size + 1, dtype="int32")
    ring_offsets = geom_offsets * coords_per_pixel

    polygon_array = list_array(geom_offsets, list_array(ring_offsets, coords))
    # We need to tag the array with extension metadata (`geoarrow.polygon`) so that Lonboard knows that this is a geospatial column.
    polygon_array_with_geo_meta = polygon_array.cast(
        polygon_array.field.with_metadata({"ARROW:extension:name": "geoarrow.polygon"})
    )
    return polygon_array_with_geo_meta

polygons = cell_ids2cell_boundaries_geoarrow(cell_ids, level=level, indexing_scheme="nested")
values = np.linspace(0, 1, cell_ids.size)
cmap = colormaps["viridis"]
colors = apply_continuous_cmap(values, cmap)

array = Array.from_arrow(polygons)
field = array.field.with_name("geometry")
schema = Schema([field])
table = Table.from_arrays([array], schema=schema)

num_rows = len(array)
if num_rows <= np.iinfo(np.uint8).max:
    arange_col = Array.from_numpy(np.arange(num_rows, dtype=np.uint8))
elif num_rows <= np.iinfo(np.uint16).max:
    arange_col = Array.from_numpy(np.arange(num_rows, dtype=np.uint16))
elif num_rows <= np.iinfo(np.uint32).max:
    arange_col = Array.from_numpy(np.arange(num_rows, dtype=np.uint32))
else:
    arange_col = Array.from_numpy(np.arange(num_rows, dtype=np.uint64))

table = table.append_column("row_index", ChunkedArray([arange_col])).append_column("values", ChunkedArray([Array.from_numpy(values)]))

layer = lonboard.SolidPolygonLayer(table=table, get_fill_color=colors)
map = lonboard.Map(layer)

(this also doesn't fully resolve the dateline issue)

which takes ~19 s to complete, plus about 10s to render. The version that converts to geopandas takes ~54s plus another ~10s to render, so this is a huge improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cool! I should figure out how to use pyinstrument. Was it a one-line thing to create that?

├─ 11.166 SolidPolygonLayer.__init__  lonboard/_layer.py:1498
│     [26 frames hidden]  lonboard, ipywidgets, pyarrow, <__arr...
│        8.384 ParquetWriter.write_table  pyarrow/parquet/core.py:1084
  1. That's a lot of time just to write Parquet. I wouldn't have expected it to be that slow. For a long time I've thought about having a toggle for the user to switch off parquet encoding (e.g. if the Python environment is on the same machine as the browser). We go through Parquet currently because it creates a much smaller download in the case of remote Python environments.
  2. I'm curious what those other frames are. It still shouldn't be 3 seconds of overhead

Copy link
Collaborator Author

@keewis keewis Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it a one-line thing to create that?

it has different modes for how you can run it (see the user guide). I put the code above in a python script, then ran it with python -m pyinstrument lonboard_example.py, but you could also use the context manager:

with pyinstrument.profile():
    ...

or the jupyter / ipython cell magic %%pyinstrument (load first with %load_ext pyinstrument)

Copy link
Collaborator Author

@keewis keewis Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what those other frames are. It still shouldn't be 3 seconds of overhead

the ones beneath SolidPolygonLayer? Here you go:

├─ 11.018 SolidPolygonLayer.__init__  lonboard/_layer.py:1498
│  └─ 11.018 SolidPolygonLayer.__init__  lonboard/_layer.py:266
│     ├─ 9.058 SolidPolygonLayer.__init__  lonboard/_layer.py:80
│     │  └─ 9.058 SolidPolygonLayer.__init__  lonboard/_base.py:16
│     │     └─ 9.058 SolidPolygonLayer.__init__  ipywidgets/widgets/widget.py:500
│     │           [1 frames hidden]  ipywidgets
│     │              8.991 SolidPolygonLayer.get_state  ipywidgets/widgets/widget.py:589
│     │              ├─ 8.330 serialize_table  lonboard/_serialization.py:72
│     │              │  └─ 8.330 serialize_table_to_parquet  lonboard/_serialization.py:25
│     │              │     └─ 8.321 ParquetWriter.write_batch  pyarrow/parquet/core.py:1068
│     │              │           [0 frames hidden]  
│     │              │              8.321 ParquetWriter.write_table  pyarrow/parquet/core.py:1084
│     │              └─ 0.661 serialize_accessor  lonboard/_serialization.py:58
│     │                 └─ 0.661 serialize_pyarrow_column  lonboard/_serialization.py:52
│     │                    └─ 0.661 serialize_table_to_parquet  lonboard/_serialization.py:25
│     │                       └─ 0.641 ParquetWriter.write_batch  pyarrow/parquet/core.py:1068
│     │                             [1 frames hidden]  pyarrow
│     └─ 1.958 default_geoarrow_viewport  lonboard/_layer.py:215
│        └─ 1.865 total_bounds  lonboard/_geoarrow/ops/bbox.py:36
│           └─ 1.865 _total_bounds_nest_2  lonboard/_geoarrow/ops/bbox.py:80
│              └─ 1.865 _coords_bbox  lonboard/_geoarrow/ops/bbox.py:55
│                 ├─ 0.939 amax  <__array_function__ internals>:177
│                 │     [3 frames hidden]  numpy, <built-in>
│                 └─ 0.926 amin  <__array_function__ internals>:177
│                       [3 frames hidden]  numpy, <built-in>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the most recent commits integrate the code we talked about into xdggs. If you have time to look at I'd appreciate another review, but either way thanks for all the help, this is already so much faster than the code I had before.

Copy link
Contributor

@kylebarron kylebarron Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the next version of geoarrow.rust.core will have constructor functions: geoarrow/geoarrow-rs#810. And so you can pass the arrays directly into geoarrow.rust.core.polygons. Using those functions would ensure that the arrays generated are fully spec-compatible.

(As it is, you're generating data whose physical buffers are the same as geoarrow arrays, but you're missing some metadata like the correct field names for nested arrays and the correct nullability flags for nested arrays. Lonboard allows that input but it's not 100% to the geoarrow spec, especially if we enforce it in issues like geoarrow/geoarrow#64)

import lonboard
from lonboard import SolidPolygonLayer
from lonboard.colormap import apply_continuous_cmap
from matplotlib import colormaps
from matplotlib.colors import CenteredNorm, Normalize

if len(arr.dims) != 1 or cell_dim not in arr.dims:
raise ValueError(
f"exploration only works with a single dimension ('{cell_dim}')"
)

if cell_boundaries not in arr.coords:
raise ValueError(
f"cannot find the cell boundaries coordinate: '{cell_boundaries}'"
)

name = arr.name or "__temporary_name__"

gdf = (
arr.to_dataset(name=name)
.to_pandas()
.pipe(gpd.GeoDataFrame, geometry=cell_boundaries, crs=4326)
)

data = gdf[name]

if center is None:
vmin = data.min(skipna=True)
vmax = data.max(skipna=True)
normalizer = Normalize(vmin=vmin, vmax=vmax)
else:
halfrange = np.abs(data).max(skipna=True)
normalizer = CenteredNorm(vcenter=center, halfrange=halfrange)

normalized_data = normalizer(data)

colormap = colormaps[cmap]
colors = apply_continuous_cmap(normalized_data, colormap)

layer = SolidPolygonLayer.from_geopandas(gdf, filled=True, get_fill_color=colors)

return lonboard.Map(layer)