Handling datasets across HoloViz #340

Closed
maximlt opened this issue Dec 5, 2022 · 5 comments

Comments

@maximlt
Member

maximlt commented Dec 5, 2022

The datasets used across HoloViz come from various places:

  • some come from Bokeh's sampledata
  • some are hosted on S3 (not just in one bucket I believe)
  • some are hosted directly on the repos
  • some are linked from other sources (e.g. pd.read_csv('https://...'))
  • some are referenced in Intake catalogs

I suggest we define a list of datasets we want to use and a way to manage them (where are they hosted? how do we fetch them?).

Having so many sources isn't ideal:

  • more consistency within a single website and across websites would help users focus on HoloViz's features, instead of trying to understand what the data is about and how it's structured every time they encounter a new dataset (a typical issue with the current hvPlot gallery)
  • it can make running the examples more difficult than it should be (users struggle running hvPlot examples)
  • some come bundled with the packages and as such increase their size

This is partially related to how the /examples folder should be handled (see holoviz-dev/nbsite#243).

@maximlt
Member Author

maximlt commented Dec 5, 2022

xref holoviz/lumen#369 (thanks @hoxbro!)

@maximlt maximlt added this to Infra Dec 5, 2022
@droumis
Member

droumis commented Dec 5, 2022

This would definitely be a great benefit to the community.

Here are some related data from our user survey that may inform the course of action. I think it makes sense to balance infrastructure concerns with consideration of the most common (therefore helpful) reported data approaches.

[Image: user survey results on reported data approaches]

@jlstevens
Collaborator

I agree it is a mess but I think all of the approaches you listed have validity. The one approach I like the least is to make intake an extra dependency (unless you know that intake is required for some other reason, I prefer not to have another dependency just to fetch sample data!).

One idea I've just had: mirror the Bokeh sample data, the samples shipped with the packages, and everything else into an S3 bucket. Then have a static page served from S3 describing all the datasets, with the S3 URL for each one; we could then point to this one page every time we need sample data. The biggest downside is that this would require an internet connection for all examples...
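As a rough illustration of the "one page describing all the datasets" idea, here is a minimal sketch of a catalog that maps dataset names to S3 URLs and renders a plain-text index. The bucket URL, the dataset names, and the `DatasetEntry` and `render_index` helpers are all hypothetical; nothing here reflects an existing HoloViz bucket or API.

```python
from dataclasses import dataclass

# Hypothetical bucket URL, used only for illustration.
BUCKET_URL = "https://example-bucket.s3.amazonaws.com"


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the dataset index: a name, a file in the bucket, a description."""
    name: str
    filename: str
    description: str

    @property
    def url(self) -> str:
        # Full URL of the file in the (hypothetical) bucket.
        return f"{BUCKET_URL}/{self.filename}"


# Placeholder entries; real dataset names would be agreed on separately.
CATALOG = [
    DatasetEntry("example_tabular", "example_tabular.csv", "Small tabular sample"),
    DatasetEntry("example_gridded", "example_gridded.nc", "Small gridded sample"),
]


def render_index(entries) -> str:
    """Render one line per dataset, sorted by name, for a static index page."""
    ordered = sorted(entries, key=lambda e: e.name)
    return "\n".join(f"{e.name}: {e.description} ({e.url})" for e in ordered)


if __name__ == "__main__":
    print(render_index(CATALOG))
```

Keeping the catalog as plain data means the same list could generate the static page and also be used by examples to look up URLs, without adding a dependency.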

@jbednar
Member

jbednar commented Dec 5, 2022

For the gridded data examples in hvPlot we also pull from xarray sample data. That was an issue last week for a user who had to install various dependent packages before being able to run the examples, so it would be nice to clean this up. Maybe the best option would be to have our own S3 bucket to fetch from, along with a fallback for those without live internet access when running the examples (some way to pre-fetch the datasets locally, so that we find them there before checking S3).
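The "look locally first, then check S3" behaviour could be sketched roughly as below. This is only an assumption about how such a helper might look: `fetch_dataset`, the cache directory, and the bucket URL are hypothetical names, not part of any HoloViz package.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical bucket URL and cache directory; neither is defined in this thread.
BASE_URL = "https://example-bucket.s3.amazonaws.com/datasets"
CACHE_DIR = Path.home() / ".holoviz_datasets"


def fetch_dataset(name, cache_dir=CACHE_DIR, base_url=BASE_URL, download=urlretrieve):
    """Return a local path for `name`, downloading only if it is not already cached."""
    cache_dir = Path(cache_dir)
    local_path = cache_dir / name
    if local_path.exists():
        # A pre-fetched copy was found, so no network access is needed.
        return local_path
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Download under a temporary name so that an interrupted transfer
    # is never mistaken for a cached dataset on the next call.
    tmp_path = local_path.with_name(local_path.name + ".part")
    download(f"{base_url}/{name}", tmp_path)
    tmp_path.replace(local_path)
    return local_path
```

With a helper like this, an example would call `fetch_dataset("some_file.csv")` and pass the returned path to `pd.read_csv`; users without internet access could fill the cache directory ahead of time, for instance by running the same calls once while online.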

@maximlt maximlt moved this to In Progress in Infra Dec 9, 2022
@droumis
Member

droumis commented Apr 25, 2024

superseded by #394

@droumis droumis closed this as completed Apr 25, 2024
4 participants