Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add array SVG image and table to _repr_html_ #9301

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

maxrjones
Copy link
Contributor

In the SciPy tutorial, I found that the lack of visual representation for NumPy backed arrays made it more difficult for new users to understand the Xarray data model. This PR uses Dask's approach (started in dask/dask#4794) to add a table with Bytes, Shape, and Data type, along with a SVG for any array whose backend doesn't already implement _repr_html_. The code's pretty rough (e.g., pre-commit fails) and the layout could use improvement but I wanted to find out if it's of interest before working on it more.

Relates to #8690, but only for html representation.

Here's a notebook as an example and here's what the repr currently looks like:
image

cc @scottyhq @JessicaS11 @TomNicholas

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@maxrjones
Copy link
Contributor Author

If jinja is too heavy of a dependency, the next best thing would probably to see if this could go into https://github.com/benbovy/xarray-fancy-repr. Would love to hear any thoughts from @benbovy 🙏

@max-sixty
Copy link
Collaborator

Thanks @maxrjones !

I'd be +0.5 on this if folks think it makes the display clearer.

For maintainability — to what extent is svg.py modified from dask? It might be better to have those split out so we can easily take upgrades. (or only use it if dask is installed, though might be a bit confusing...)

@TomNicholas
Copy link
Member

This is great @maxrjones!

only use it if dask is installed

FYI this is no longer a dask-specific question, as cubed also has these html array reprs (code taken from dask but adjusted enough that it doesn't make sense for them to live in the same place).

@maxrjones
Copy link
Contributor Author

For maintainability — to what extent is svg.py modified from dask? It might be better to have those split out so we can easily take upgrades. (or only use it if dask is installed, though might be a bit confusing...)

The primary modification was adding labels along axes for the dimension names and adjusting offsets to allow space for those labels. Although not used right now, I also added flexibility to the colors with the idea that it would be nice to use different colors for coords vs. data variables. All of that could possibly be handled by modifying the SVGs returned from Dask.

I know cubed more directly vendors this code for its array _html_repr_, so maybe there's a motivation to separate it out from Dask similar to what happened with donfig.

@maxrjones
Copy link
Contributor Author

maxrjones commented Aug 1, 2024

😆 Tom our wavelengths are way too similar today, but you're just faster! I swear I didn't see your comment before hitting enter, helpful to know that you don't directly vendor it though

@TomNicholas
Copy link
Member

The PR which added the html repr to cubed is here: cubed-dev/cubed#216

(code taken from dask but adjusted enough that it doesn't make sense for them to live in the same place).

Looking back at the PR now I have a bit lower confidence in this claim.

Generally though I think having a separate repo just to avoid duplicating code in N=2 places seems like overkill, but if similar code lived in xarray for non-chunked arrays it would be N=3...

@maxrjones
Copy link
Contributor Author

iirc dask doesn't like to have any external dependencies, so right now the main potential users would be xarray and cubed. But maybe it'd be worthwhile asking somewhere if it could be useful as an optional NumPy / CuPy dependency that monkeypatches _repr_html_.

I'm wondering what the plan was with the formatting modules and NamedArray? because the SVG representation could be useful for the NamedArrays which might motivate not including it in modules that would get left behind in Xarray.

@dcherian
Copy link
Contributor

dcherian commented Aug 1, 2024

If jinja is too heavy of a dependency,

This is just one template, I bet you can do it pretty easily without jinja.


My main concern is that at first glance, this looks like a dask array, and that's going to confuse medium->experienced users quite a bit. To that end: we should definitely change some colors, explicitly add the array type, and maybe change some aspect ratio parameters to make it look different visually.

It'd also be nice to figure out how to show the actual values since we have them. It'd be nice not to have to expand one more level to see any small arrays.

@benbovy
Copy link
Member

benbovy commented Aug 2, 2024

Agreed with @dcherian. Also things like dimension labels would be nice to have for dask-baked variables as well.

That's probably beyond the scope of this PR, but I wonder if a more general solution would be to have an additional icon at the right of the variable inline repr (there is still some room for that) to show the same svg representation for all duck arrays, assuming these all have the common metadata we need for this representation: shape, etc. So we can leave the current "database" icon for showing the original repr of the wrapped array.

@maxrjones
Copy link
Contributor Author

Thanks for all the feedback, everyone! I'll have some time next week to work on the next iteration based on these recommendations. My goals will be:

  1. Remove jinja as a dependency
  2. Make SVG representation for non-Dask backed arrays distinct from Dask arrays
  3. Update the table to be more easily configurable and include the array type

I will also explore the feasibility of two additional ideas:

  1. Add dimension labels for dask backed arrays
  2. Separate this into a separate icon from the database icon, which could be used for all arrays meeting a subset of the Array API standard

@benbovy
Copy link
Member

benbovy commented Aug 7, 2024

Separate this into a separate icon from the database icon, which could be used for all arrays meeting a subset of the Array API standard

I'd strongly support exploring this, actually!

While the original reprs of Dask arrays, Numpy arrays, etc. have limitations in the context of Xarray, users are familiar with it. For array containers / wrappers like Dataset and DataArray I think it is important to keep things transparent, i.e., that the original repr(s) of the wrapped array(s) remain easily accessible and unchanged. It is also more future-proof since those 3rd-party array reprs may later change.

In terms of UI/UX a separate icon button next to the other ones has the advantage of not introducing another nested level compared to the Values collapsible section in the screenshot above. I think that in general it is a good thing to keep the number of nested sections as low as possible (especially considering DataTree objects). (EDIT: cf. @dcherian's comment which I didn't read carefully enough)

In terms of implementation a separate icon is pretty easily maintainable. It shouldn't be hard to add a new one as the html formatting has been written in a pretty composable way already, e.g., it would look like adding those lines in summarize_variable:

def summarize_variable(name, var, is_index=False, dtype=None) -> str:
    ...

    cube_id = "cube-" + str(uuid.uuid4())
    ...
    cube_repr = svg_cube_repr_html(variable)
    ...
    cube_icon = _icon("icon-cube")
    
    return (
        ...
        f"<input id='{cube_id}' class='xr-var-cube-in' type='checkbox'>"
        f"<label for='{cube_id}' title='Show/Hide data repr'>"
        f"{cube_icon}</label>"
        ...
        f"<div class='xr-var-cube'>{cube_repr}</div>"
    )

One concern I have with this approach, though, is the performance impact of generating and including another SVG image for each variable in the static html repr, which might become problematic for big Datasets with many variables. Such UI element could be generated on-demand as it is hidden by default, however there is no way to do that other than having a dynamic widget like in https://github.com/benbovy/xarray-fancy-repr.

That all being said, thanks @maxrjones for working on this PR, overall I think it would be a nice improvement!

@maxrjones
Copy link
Contributor Author

Separate this into a separate icon from the database icon, which could be used for all arrays meeting a subset of the Array API standard

I'd strongly support exploring this, actually!

While the original reprs of Dask arrays, Numpy arrays, etc. have limitations in the context of Xarray, users are familiar with it. For array containers / wrappers like Dataset and DataArray I think it is important to keep things transparent, i.e., that the original repr(s) of the wrapped array(s) remain easily accessible and unchanged. It is also more future-proof since those 3rd-party array reprs may later change.

In terms of UI/UX a separate icon button next to the other ones has the advantage of not introducing another nested level compared to the Values collapsible section in the screenshot above. I think that in general it is a good thing to keep the number of nested sections as low as possible (especially considering DataTree objects). (EDIT: cf. @dcherian's comment which I didn't read carefully enough)

In terms of implementation a separate icon is pretty easily maintainable. It shouldn't be hard to add a new one as the html formatting has been written in a pretty composable way already, e.g., it would look like adding those lines in summarize_variable:

def summarize_variable(name, var, is_index=False, dtype=None) -> str:
    ...

    cube_id = "cube-" + str(uuid.uuid4())
    ...
    cube_repr = svg_cube_repr_html(variable)
    ...
    cube_icon = _icon("icon-cube")
    
    return (
        ...
        f"<input id='{cube_id}' class='xr-var-cube-in' type='checkbox'>"
        f"<label for='{cube_id}' title='Show/Hide data repr'>"
        f"{cube_icon}</label>"
        ...
        f"<div class='xr-var-cube'>{cube_repr}</div>"
    )

One concern I have with this approach, though, is the performance impact of generating and including another SVG image for each variable in the static html repr, which might become problematic for big Datasets with many variables. Such UI element could be generated on-demand as it is hidden by default, however there is no way to do that other than having a dynamic widget like in benbovy/xarray-fancy-repr.

That all being said, thanks @maxrjones for working on this PR, overall I think it would be a nice improvement!

Thanks for all these thoughts! A solution to the performance impact could be to make adding that separate icon configurable via xarray.set_options. I'll run couple simple tests first to see if we can understand how the performance hit scales.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants