Handling datasets across HoloViz #340

maximlt · 2022-12-05T16:55:29Z

The datasets used across HoloViz come from various places:

some come from Bokeh's sampledata
some are hosted on S3 (not just in one bucket I believe)
some are hosted directly on the repos
some are linked from other sources (e.g. pd.read_csv('https://...'))
some are referenced in Intake catalogs

I suggest we define a list of datasets we want to use and define a way to manage them (where are they hosted? how do we fetch them?).

Having so many sources isn't ideal:

having more consistency on a single website and across websites helps users focus on HoloViz's features instead of trying to understand what the data is about and how it's structured every time they encounter a new dataset (a typical issue with the current hvPlot gallery)
it can make running the examples more difficult than it should be (users struggle to run the hvPlot examples)
some come bundled with the packages and as such increase their size

This is partially related to how the /examples folder should be handled: holoviz-dev/nbsite#243


maximlt · 2022-12-05T16:55:53Z

xref holoviz/lumen#369 (thanks @hoxbro!)

droumis · 2022-12-05T17:31:38Z

This would definitely be a great benefit to the community.

Here are some related data from our user survey that may inform the course of action. I think it makes sense to balance infrastructure concerns with consideration of the most common (therefore helpful) reported data approaches.

jlstevens · 2022-12-05T19:36:02Z

I agree it is a mess but I think all of the approaches you listed have validity. The one approach I like the least is to make intake an extra dependency (unless you know that intake is required for some other reason, I prefer not to have another dependency just to fetch sample data!).

One idea I've just had: mirror the Bokeh sample data, the samples shipped with the packages and everything else into an S3 bucket. Then have a static page served from S3 describing all the datasets, with the S3 URL for each one. We could then point to this one page every time we need sample data. The biggest downside is that this would require an internet connection for all examples...
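To make the idea a bit more concrete, here is a rough sketch of how such a page could be generated from a single manifest. Everything named here is an assumption for illustration only: the bucket URL, the `Dataset` fields, the entries in `MANIFEST` and the `render_index` helper are all hypothetical, and nothing like this exists in HoloViz as far as this thread shows.

```python
from dataclasses import dataclass
from html import escape

# Hypothetical bucket URL; the thread does not name one.
BUCKET_URL = "https://example-bucket.s3.amazonaws.com"


@dataclass(frozen=True)
class Dataset:
    name: str         # file name inside the bucket
    description: str  # what the data is about
    origin: str       # where the file was mirrored from


# Placeholder entries only, not an agreed-upon list of datasets.
MANIFEST = [
    Dataset("example_a.csv", "Placeholder tabular dataset", "Bokeh sampledata"),
    Dataset("example_b.nc", "Placeholder gridded dataset", "xarray sample data"),
]


def dataset_url(ds, bucket_url=BUCKET_URL):
    """Build the download URL of one dataset."""
    return f"{bucket_url}/{ds.name}"


def render_index(manifest, bucket_url=BUCKET_URL):
    """Render a minimal static HTML page listing each dataset with its URL."""
    rows = "\n".join(
        f'<li><a href="{escape(dataset_url(ds, bucket_url))}">{escape(ds.name)}</a>: '
        f"{escape(ds.description)} (from {escape(ds.origin)})</li>"
        for ds in manifest
    )
    return f"<h1>Sample datasets</h1>\n<ul>\n{rows}\n</ul>\n"


if __name__ == "__main__":
    print(render_index(MANIFEST))
```

Keeping the manifest as the single source of truth would mean the page and any fetch helper cannot drift apart, but that is a design preference, not something we have agreed on.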

jbednar · 2022-12-05T20:10:48Z

For the gridded data examples in hvPlot we also pull from xarray sample data, which was an issue last week for a user who had to install various dependent packages before being able to run the examples, so it would be nice to clean this up. Maybe best would be to have our own S3 bucket to fetch from along with some other fallback for those without live internet access at the time of the examples (some way to pre-fetch the datasets locally such that we find them there before checking on S3).
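A minimal sketch of that "look locally first, then fetch from S3" behaviour, using only the standard library. The bucket URL, the `HOLOVIZ_DATA_DIR` environment variable, the default cache location and the `fetch_dataset` name are all assumptions made up for this example, not an existing API.

```python
import os
import shutil
import urllib.request
from pathlib import Path

# Hypothetical values: the thread does not name a bucket or a cache location.
BASE_URL = "https://example-bucket.s3.amazonaws.com/datasets"
CACHE_DIR = Path(
    os.environ.get("HOLOVIZ_DATA_DIR", Path.home() / ".cache" / "holoviz-datasets")
)


def fetch_dataset(name, base_url=BASE_URL, cache_dir=CACHE_DIR):
    """Return a local path to `name`, downloading it only if it is not cached."""
    local_path = Path(cache_dir) / name
    if local_path.exists():
        # Pre-fetched or previously downloaded: no network access needed.
        return local_path
    local_path.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name so an interrupted transfer is never
    # mistaken for a complete file on the next call.
    tmp_path = local_path.with_name(local_path.name + ".part")
    with urllib.request.urlopen(f"{base_url}/{name}") as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp_path.replace(local_path)
    return local_path
```

An example would then call something like `pd.read_csv(fetch_dataset("example_a.csv"))` (file name hypothetical). Pre-fetching for offline use would amount to calling `fetch_dataset` once for each dataset while online, or copying the files into the cache directory by hand.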

droumis · 2024-04-25T01:15:39Z

superseded by #394

maximlt added this to Infra Dec 5, 2022

maximlt moved this to In Progress in Infra Dec 9, 2022

droumis closed this as completed Apr 25, 2024
