
Chaining data loaders #332

Open
Fil opened this issue Dec 6, 2023 · 9 comments · May be fixed by #1522
Labels
enhancement New feature or request

Comments

@Fil
Contributor

Fil commented Dec 6, 2023

Suppose we want to 1. download a dataset from an API and 2. analyze it. Currently a data loader must do both at the same time, and will run the download part again whenever we update the analysis code.

Ideally we'd want to separate this into two chained data loaders that would still be live: if a page relies on loader 2, which relies on loader 1, editing 1 would tell the page to rerun the analysis, which would trigger a new download. But editing 2 would only rerun the analysis, not the download.

This would also make it easier to generate several files from a common (and slow) download.

@Fil Fil added the enhancement New feature or request label Dec 6, 2023
@mbostock
Member

mbostock commented Dec 6, 2023

From #325 (comment):

The API for chaining loaders isn’t really about FileAttachment since data loaders can be written in any language. Instead a data loader needs to be able to fetch a file from the preview server (and we need an equivalent server during build). Maybe we set an environment variable which data loaders can read to know the address of the preview server. We’d also need to detect and error on circular dependencies ideally.
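The environment-variable idea could be sketched roughly like this: a downstream loader asks the preview server for the output of an upstream loader. Note this is purely illustrative; the `OBSERVABLE_PREVIEW_URL` variable name and the `/_file/` route are assumptions, not a real Framework API.

```javascript
// Hypothetical downstream loader (e.g. analysis.json.js). It fetches the
// output of an upstream loader (raw.json, generated by e.g. raw.json.py)
// from the preview server, whose address arrives via an assumed env var.
function upstreamUrl(origin, name) {
  return new URL(name, origin).href;
}

async function main() {
  const origin = process.env.OBSERVABLE_PREVIEW_URL; // assumed variable name
  const response = await fetch(upstreamUrl(origin, "/_file/raw.json"));
  if (!response.ok) throw new Error(`upstream loader failed: ${response.status}`);
  const raw = await response.json();
  // ...analysis step on `raw` goes here...
  process.stdout.write(JSON.stringify(raw)); // loaders emit their file on stdout
}

// Only run when a preview server address is actually provided.
if (process.env.OBSERVABLE_PREVIEW_URL) main();
```

The server would run `raw.json`'s loader on demand (with its own cache), which is what makes editing loader 2 cheap while keeping the chain live.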

@cinxmo cinxmo added this to the Future milestone Jan 16, 2024
@mythmon
Member

mythmon commented Jan 21, 2024

I've been working on a small project to get some first-hand experience with the CLI. In it I'm downloading a zip file from some hobbyist website, extracting a couple dozen text files, and then running them through a custom parser. Parsing them takes about 30 seconds, which is a bit longer than I want to run in the markdown file. I'm iterating on the parser itself, so I'm re-running it every few minutes.

The options I see for my case are:

  • Do everything in one data loader, and naively download the zip every time. That seems rude to the small site.
  • Do everything in one data loader, and implement my own caching functionality. This seems tedious and exactly what I'd want a data loader to do for me.
  • Run the parser in the client. This feels slow, though it is nice for development to see the data load in as it generates.

From this, I have two wish list items. One is chained data loaders; the other is incremental data loaders that can somehow stream their results into the client. I don't really know how that would work, and it's probably better suited for Notebooks anyway.

I can sympathize with wanting data loaders to be writable in any language, but it is very jarring to go from writing my code in a JS fenced code block and having it work easily, to working in a .json.js data loader where suddenly all of my imports break and I lose all of the nice tools I was using a moment ago. It makes me feel like to properly use Observable CLI I need to be fluent in three varieties of JS: Markdown code blocks, browser imports, and file attachments.

Perhaps once we have the server-based dataloader workflow that Mike mentioned, we could then wrap that in a FileAttachment facade that makes it feel just like it does in Markdown files.

@Fil
Contributor Author

Fil commented Jan 29, 2024

@mythmon FileAttachment supports streaming, see https://observablehq.com/@observablehq/streaming-shapefiles

@trebor
Contributor

trebor commented Feb 2, 2024

An example of the use case is the chess bump chart example: any changes to the data transformation in the data loader require a full download of all of the data.

@espinielli

This seems to point to something similar to having a dependency graph, like in the targets 📦 in R.
And the dependency is not only for the data loaders but also for assets (computational cells are already covered, aren't they?)

@Fil
Contributor Author

Fil commented Feb 26, 2024

Tangentially related to #918.

@palewire
Contributor

> an example of the use case in the chess bump chart example. any changes to the data transformation in the data loader require a full download of all of the data.

I'm not seeing this implemented in the chess bump example. Am I missing something?

@mythmon
Member

mythmon commented Jun 25, 2024

It's not implemented in the chess bump example. The example is a case where the data loader(s) would improve if Framework gained this feature.

@palewire
Contributor

palewire commented Jun 25, 2024

Gotcha. Here's my use case, for anyone interested.

I'd like Data Loader 1 to be a Python script that downloads a dataframe from s3, transforms the data with filter-y tricks and then writes out a JSON file that's ready to serve.

Then Data Loader 2 would be a Node.js script that would open that very large JSON file, build a D3 graphic in a canvas object, and then write out a PNG file that could ultimately be served by the static site.

@Fil Fil linked a pull request Jul 17, 2024 that will close this issue