
chaining data loaders #1522

Open · wants to merge 6 commits into main
Conversation

@Fil (Contributor) commented Jul 17, 2024

Implemented as a FILE_SERVER variable passed to data loaders (as an environment variable), allowing them to query an asset server.

For instance, a bash data loader mags.txt.sh can query another file (say, quakes.json) by calling:

curl "${FILE_SERVER}quakes.json" | jq '.features[].properties.mag'

and a JavaScript data loader would call:

fetch(`${process.env.FILE_SERVER}quakes.json`).then((response) => response.json())

In preview, when quakes.json is updated, mags.txt gets updated. If quakes.json is in fact generated by a data loader quakes.json.sh, then touching that script live-updates mags.txt. (The interpreter used by any of the data loaders is immaterial: Python can talk to TypeScript, and vice versa.)

We track the dependency "graph" on disk by associating with each file generated by a data loader (say filename.json) a sidecar file src/.observablehq/cache/filename.json__dependencies, which is a simple newline-separated list of the paths that the loader requested from its FILE_SERVER.

TODO:

  • Pass the server address as an env variable to data loaders
    • in preview
    • in build
  • create dependency graph
  • watch dependencies
    • update the dependency graph when one of the files changes
      • support dependencies of dependencies of…:
        • build
        • live reload
    • store the graph information on disk so that we can quit and restart the preview server
  • unit tests
    • unit test that the source is not called twice if it has two dependents
  • error on circular dependencies, maybe?
  • write a separate file server? (it would probably be faster, but needs more code); this would not change the API.
  • documentation
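The "error on circular dependencies" item above could be a depth-first search over the on-disk dependency map (file → set of files it requested). A minimal sketch, assuming the graph is already loaded into a Map; the function name is made up:

```javascript
// Sketch: detect a cycle in the data-loader dependency graph.
// graph: Map<string, string[]> mapping a file to the files it requested.
// Returns the cycle as an array of paths, or null if none is reachable.
function findCycle(graph, start, seen = new Set(), stack = []) {
  // A node already on the current DFS stack means we looped back to it.
  if (stack.includes(start)) return [...stack.slice(stack.indexOf(start)), start];
  if (seen.has(start)) return null; // already fully explored, no cycle here
  seen.add(start);
  for (const dep of graph.get(start) ?? []) {
    const cycle = findCycle(graph, dep, seen, [...stack, start]);
    if (cycle) return cycle;
  }
  return null;
}
```

Reporting the cycle itself (rather than just erroring) would make the diagnostic actionable: the user sees exactly which chain of loaders loops.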

To test this easily within the preview server:

cp test/input/build/chain/chain*.* docs/

then open http://127.0.0.1:3000/chain

You can replace the 3 in chain-source.json.sh with $RANDOM if you want to check that the source data loader runs only once for its two dependents.

Old description

We must use the same file server for browser preview and machine calls (i.e. chained data loaders), because concurrent requests for the same data loader must be joined.
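The "joining" requirement amounts to memoizing the in-flight promise per path, so two dependents that ask for the same loader output while it is running share a single run. A minimal sketch (names are illustrative, not the PR's actual code):

```javascript
// Sketch: join concurrent requests for the same data-loader output.
// While a loader is running, every caller gets the same pending promise.
const running = new Map();

function loadOnce(path, runLoader) {
  if (!running.has(path)) {
    // Start the loader and forget the entry once it settles, so a later
    // change can trigger a fresh run.
    running.set(path, runLoader(path).finally(() => running.delete(path)));
  }
  return running.get(path);
}
```

This is the behavior the $RANDOM test above exercises: with joining, the source loader's side effects happen once no matter how many dependents request it concurrently.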

But the paths are different. The data loader for caller.csv receives a $SERVER environment variable equal to
http://127.0.0.1:3000/_chain/caller.csv::, and might retrieve a dependency by concatenating that variable with the file path it needs, e.g. calling $SERVER/dependency.zip.

This should make it possible to derive the dependency graph, at least after we run the data loaders (not sure how to maintain state when we restart a server and we have a cache). Also, if the file is not found, we send an empty 404 instead of the decorated page intended for the browser; this makes it a bit more foolproof (tip: use curl -f to fail on 404).

The current server information is saved as a global (in process.env—should it be globalThis?) for now. It feels a bit wrong, but at the same time it really is a global state.

closes #332

@mbostock (Member)
This seems like the right direction. 👍

I’m thinking of going a step further and spinning up a temporary “asset server” each time we invoke a data loader, so that each data loader gets a separate port for making requests to load assets/dependent files. I think that’s cleaner than using a prefixed path on the preview server, means build and preview follow the same code path, and anyway it should be cheap to expose a little asset server port temporarily while running a data loader (since there’s no initialization other than opening the port). And the asset server can further restrict what each data loader has access to (only files, not pages). And we can give it clean paths.

I think we could also get away with not tracking the dependencies explicitly and just let circular data loaders deadlock for now…

@mythmon (Member) commented Aug 30, 2024

Could we make the "protocol" of how to load data loaders flexible enough to provide other capabilities? For example, it would be nice to be able to explicitly record dependencies that Framework can't detect, and one way we could do that is by making an HTTP call to the injected environment variable to record a file that we don't need to request but should be watched for changes.

Two ways I could imagine doing this: either make that environment variable less generic, like $DATA_SERVER instead of $SERVER, or add a path prefix, like $SERVER/load/path/to/my/file.csv. That would allow something like $COMMAND_SERVER or $SERVER/track/path/to/library.js. (All very bike-sheddable names.)

(to be clear, I'm not asking you to implement any of that in this PR, just to leave enough room in the naming to allow future extension)

@Fil (Contributor, author) commented Sep 16, 2024

I have rewritten this based on the suggestions in the review. There is a new TODO list.

@Fil marked this pull request as ready for review on September 17, 2024
@Fil (Contributor, author) commented Sep 17, 2024

Some of our examples will benefit from chained data loaders (in lieu of an intermediate archive). This will allow us to split the analysis into small, independent scripts. (Thinking about this in the context of #1667.)
