Fetch and cache data at runtime #839

Closed · net opened this issue Feb 16, 2024 · 6 comments
Labels: enhancement (New feature or request)

net commented Feb 16, 2024

With Observable, we're able to connect an external database, then query that database on notebook load, often interpolating inputs from the notebook into our queries. Observable then proxies those queries through its database proxy service.

This provides three important benefits:

  • Notebooks can use live data.
  • Queries can use data from inputs.
  • Sensitive DB credentials or API keys don't have to be leaked to the browser.

As far as I can tell, Framework currently has no way to handle this. It looks like Framework once had DatabaseClient support #13, but it was removed for unclear reasons #594, #359.


Proposal

Data loaders are awesome. However, I believe a specific detail in their current implementation has certain drawbacks.

To illustrate, let's look at the three types of notebooks (I'm using Observable parlance, but you could equally call these pages, dashboards, or reports):

Point-In-Time Notebooks

These are notebooks where an analysis is built around a fixed snapshot of data. They often have accompanying prose that explains the data, the analysis, and the authors' conclusions. The conclusions of these reports are fixed to a point in time. For these reports, you never want the underlying data to change after it's been analyzed—that could break the notebook! Data in the original source could've since been deleted or modified; the schema in the original source could've changed; the original source may not exist at all!

The current default workflow of Framework does not support these notebooks well, since the cache is .gitignored. If the project is deployed from a different machine than the one that originally deployed the notebook, the built data will change, which will change what's rendered in the notebook. For these notebooks, you'd ideally want to check the data into git. This means an override in .gitignore, or better, writing to a file that's not in cache/ and loading that in the notebook instead.

Live-But-Static Notebooks

These are notebooks where the data should reflect the current state of the underlying source, but where the data is the same for all viewers at a given time. You might want to see revenue over time, updated daily, but the query for that data only needs to be made once a day. The query, or the processing of its result, might be expensive, so you want to cache the results rather than re-query on each page load.

The current default workflow of Framework does not support these notebooks well either.

The current way to handle these is to set up deployment on a cron, for example with a GitHub Action. However, this has the drawback that you're then re-querying and re-processing all your data on whatever cadence you set—every minute, hour, or day—even for notebooks that no one has accessed or will access for months. As projects grow to many notebooks, many of them infrequently viewed, this could become very expensive and slow, both for the underlying source database and for the process building and deploying the project.

Imagine your deployment taking five minutes because of an expensive data loader for a notebook that someone views once a month, all because another notebook that gets viewed every day needs up-to-date data.

This also means you have to choose between always deploying from a CI/CD setup, like a GitHub Action, and being able to deploy from your local machine with yarn deploy, because your local machine will likely have cached files that are older than those from the last CI/CD run.

Dynamic Notebooks

These notebooks make queries at runtime with values provided by the user. For example, a notebook that lets you paste in a user ID and get an analysis of that single user's data. A company might have millions of users, so it would be infeasible to build every user's data into the deployment ahead of time, and so a query—with parameters—at runtime is necessary.

As far as I can tell, the only way to do this with Framework is to expose your DB credentials or API keys by building them into the bundle that gets sent to the browser, then query the database or other source directly from the browser.
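To make that concrete, the browser-side query would look something like this (a sketch against a hypothetical HTTP API; the point is that the key has to ship in the client bundle):

// Hypothetical sketch: querying a third-party API directly from the browser.
// API_KEY is baked into the client bundle at build time, so every viewer of
// the page can read it from devtools or the bundle source.
const API_KEY = "sk-live-..."; // hypothetical secret, exposed to all viewers

const userId = view(Inputs.text({ label: "User ID" }));

const response = await fetch(`https://api.example.com/users/${userId}/recordings`, {
  headers: {Authorization: `Bearer ${API_KEY}`}
});
const recordings = await response.json();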

This is obviously far from ideal. It lets any user with access to the project (and any malicious extensions they might have installed) make arbitrary queries, and retain credentials even after they've been removed from a project. You lose visibility into the custody of your credentials; instead of giving credentials to only Observable, you're giving credentials to every user that views the project and anything they have installed, and after that you have no idea where they could end up.

This also means that there's no built-in way to cache queries. You could cache in the browser, but you have to build that yourself, and it would only be per-client. You could cache in some external store, but you'd have to build that yourself too (and build the credentials for it into the deployment).


These, I believe, are the three main types of notebooks, and, as I've illustrated, the current design of Framework data loaders is not ideal for any of them.

However, there are two complementary changes that make data loaders perfect for all three types while staying true to the concept of data loaders.

1. Support passing arguments to data loaders in the file name.

const userID = view(Inputs.text({ label: "User ID" }));

const data = await FileAttachment(`./data/userRecordings.csv?id=${userID}`).csv();

2. Run data loaders at runtime instead of build time.

Keep the cache, but don't build it into the deployment. Instead, run each data loader for a deployed project on Observable's servers, and cache the data there.

For point-in-time notebooks, this makes it more explicit to users authoring such notebooks that they cannot rely on their local cache to keep data fixed, and must instead write their data to a file that's checked into git. It also lets users always deploy from their local machine, regardless of the requirements of their project's pages.

For live-but-static notebooks, this, in combination with the first change, lets you cache and rebuild your data at arbitrary intervals by passing an argument that changes whenever you want to reload your data.

const currentHour = Math.floor(Date.now() / 3600000) * 3600000; // start of the current hour

const data = await FileAttachment(`./data/topUsers.csv?ms=${currentHour}`).csv();

And for dynamic notebooks, this lets you load data on input change while still using Framework data loaders, and without leaking sensitive credentials to the browser. The data loader would run on Observable's servers, and would be able to access the team's secrets set in their Observable settings.


I'm loving the new direction Framework is taking Observable in. I believe these two changes are both essential and fit perfectly within Framework's philosophy. ❤️

mbostock (Member) commented Feb 17, 2024

Thanks for the feedback!

A quick response, but for the point-in-time data, yes, we recommend you put your data in docs as a static file rather than using a data loader to generate it implicitly. You can still write a data loader to generate the data, you just invoke it manually, like so:

node docs/data/my-data.csv.js > docs/data/my-data.csv

If you ever want to regenerate your data, you just run the command above again. And if you want to switch to generating it automatically during build, you just delete the static file and then Framework will once again implicitly invoke your data loader and cache its output:

rm docs/data/my-data.csv

This works because a static file (e.g., docs/data/my-data.csv) takes precedence over the equivalent data loader (e.g., docs/data/my-data.csv.js) in data loader routing, and because data loaders are by design transparent: indistinguishable from static files from the client’s perspective (i.e., the client doesn’t know whether my-data.csv comes from a static file or a data loader). So you don’t have to futz with any .gitignore settings.
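For concreteness, such a data loader might look like this (a minimal sketch with made-up data; a data loader simply writes its output to stdout and Framework captures it):

// docs/data/my-data.csv.js — a minimal sketch; the data here is made up.
import {csvFormat} from "d3-dsv";

const data = [
  {date: "2024-02-01", value: 42},
  {date: "2024-02-02", value: 17}
];

// Data loaders emit their output on stdout.
process.stdout.write(csvFormat(data));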

mbostock (Member) commented Feb 17, 2024

Regarding suggestion 1, please see #245 on parameterized data loaders.

The essential part of our parameterized data loader approach will be that parameter values are enumerated at build time. A parameterized page will be expressed as a dynamic route with named parameters in square brackets, such as docs/users/[userId]/index.md. From that page you could load FileAttachment("./recordings.csv"), and this would be served by a data loader docs/users/[userId]/recordings.csv.js. When run, this data loader would be passed --userId=XXX for each user.

We require all values of userId to be enumerated at build time so that Framework can compute both the page and the data for each parameter value. (And in the case of multiple parameters, you’ll have to enumerate the expected combinations of parameter values at build time too.) We haven’t decided exactly how we’re going to enumerate parameter values yet, but it’ll be possible to do that in a data loader too (of course — since parameter values typically come from a database and you wouldn’t want to list all your users by hand!). Perhaps a data loader at docs/users/[userId].js.
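As a sketch of what the recordings loader itself could look like (none of this is final; it just reads the --userId flag described above with Node’s built-in argument parser):

// docs/users/[userId]/recordings.csv.js — a sketch only; the exact convention
// for passing parameters is still being designed in #245.
import {parseArgs} from "node:util";

const {values: {userId}} = parseArgs({options: {userId: {type: "string"}}});

// Query your database for this user's recordings here, then write CSV to
// stdout like any other data loader. (Placeholder output below.)
process.stdout.write(`userId,timestamp\n${userId},2024-02-17T00:00:00Z\n`);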

Once we have parameterized data loaders, we should be able to allow FileAttachment(`./users/${userId}/recordings.csv`) too (so you can load data for multiple parameter values from the same page). But we’ll still need to enumerate the allowed values of userId at build time, and I’d like this to still be a syntax error if there isn’t a parameter declared for docs/users/[userId] so that we don’t allow dynamic file names in general; we want to be confident that the site won’t break when deployed.

(And I’ll explain more about why suggestion 2 isn’t compatible with our approach, and respond in more detail to the other two notebook types you describe as I have more time. There are ways to load data at runtime, but our philosophy is very much that it’s better to build the data ahead of time for instant page loads.)

mbostock (Member) commented Feb 17, 2024

Regarding what you’re calling “live-but-static notebooks” (note we don’t call anything created with Framework a “notebook” — it’s a data app, dashboard, report, etc. — and Framework isn’t intended to replace notebooks, which are for lightweight, collaborative, ad hoc exploration):

This use case is our bread and butter. 🍞🧈 So please allow me to disagree respectfully with your assessment that Framework does not support this use case well. This is exactly how we’ve been using Framework since creation and how we intend it to be used. 😅

The current way to handle these is to set up deployment on a cron, for example with a GitHub Action. However, this has the drawback that you're then re-querying and re-processing all your data on the cadence you set…

Right, we expect people to use continuous deployment to automate builds on their desired cadence. But two things.

First, we support incremental builds. If the data loader cache (typically in docs/.observablehq/cache) is populated, then the build will respect the cache and only re-run data loaders that are stale: where the cached output is missing or older than the corresponding data loader. This means you can use the data loader cache to control which data loaders run during build; you can choose to run all, some, or none of your data loaders whenever you build. You can see how we’ve implemented that in our deploy.yml here:

- id: date
  run: echo "date=$(TZ=America/Los_Angeles date +'%Y-%m-%d')" >> $GITHUB_OUTPUT
- id: cache-data
  uses: actions/cache@v4
  with:
    path: |
      docs/.observablehq/cache
      examples/*/docs/.observablehq/cache
    key: data-${{ hashFiles('docs/data/*', 'examples/*/docs/data/*') }}-${{ steps.date.outputs.date }}
- if: steps.cache-data.outputs.cache-hit == 'true'
  run: find docs/.observablehq/cache examples/*/docs/.observablehq/cache -type f -exec touch {} +

This approach allows us to rebuild our docs on every commit to main, but only run data loaders once per day — or when the data loaders change, or when we manually delete the cache. We currently use a daily cadence for data loaders in our docs, but you can write your own logic to determine how to invalidate some or all of the data loader cache, say keeping some cached data for 28 days and other data for only one day. Configuring the cache prior to build is how you specify what you want to build.
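For example, a pre-build script along these lines could enforce per-file max ages (a sketch only: the cache layout, file names, and ages here are assumptions, not Framework APIs):

// invalidate-cache.js — a sketch of selective cache invalidation before build.
// Deleting a cached output forces Framework to re-run the corresponding data
// loader on the next build. File names and max ages below are assumptions.
import {readdirSync, statSync, unlinkSync} from "node:fs";
import {join} from "node:path";

const maxAges = {
  "top-users.csv": 1 * 24 * 60 * 60 * 1000, // refresh daily
  "history.csv": 28 * 24 * 60 * 60 * 1000   // refresh every 28 days
};

const cacheDir = "docs/.observablehq/cache/data"; // assumed cache location
for (const file of readdirSync(cacheDir)) {
  const maxAge = maxAges[file];
  if (maxAge === undefined) continue; // leave other cached outputs alone
  const age = Date.now() - statSync(join(cacheDir, file)).mtimeMs;
  if (age > maxAge) unlinkSync(join(cacheDir, file)); // stale: force a re-run
}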

(Incremental builds currently apply only to data loaders, but I can see us supporting incremental builds of specific pages in the future, too. E.g., you tell Framework to build just a specific page, and it combines it into the dist folder with whatever you built previously, figuring out which data loaders need to run based on the page you asked to build. This hasn’t been a priority yet because building pages is typically instant compared to loading data.)

Second, I’ll acknowledge that the above is a bunch of work for users to figure out on their own! And that it would be especially difficult to have a “smart” cache based on usage — based on which pages are actually being looked at. That’s where the Observable platform comes in, and why we offer a hosted service to make this work easier. We’re currently only offering hosted projects with access control, but our intent in the near future is to make it seamless to set up continuous deployment from your source control, including scheduled builds, with some intuitive and simple way of specifying how frequently you’d like to rebuild specific pages or data. We can also offer monitoring and analytics for projects hosted on Observable, so we could also implement “smart” builds exactly like you describe — changing the cadence of incremental builds based on which pages are actually being viewed.

Of course, you can build all of this yourself since Framework is open-source, but our goal is to offer a compelling service that specializes in the operationalization and productionization of data apps, and we’d like to take care of that oft-tedious work for you.

krosaen commented Feb 20, 2024

Great to hear about parameterized data loaders. I do wonder if it's anywhere on the roadmap to think about some sort of DAG of data loaders along with content-addressable caching. I imagine it could be nice to have a data loader that does some relatively expensive production of data, then later add a dependent data loader that makes use of that intermediate data, with dependent loaders' outputs rebuilt only when the contents of the intermediate data they depend on have changed.

I can see how this kind of complexity might be at odds with ease of use, but thought I'd ask if it was planned or at least philosophically compatible with where you see data loaders going in the future.

mbostock (Member) commented

Yes, we have the DAG idea covered in #332. The idea is that data loaders could be used to create intermediate files that are not (necessarily) published but that are consumed by downstream data loaders.
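As a rough sketch of that idea (hypothetical file names; the actual mechanism is still being discussed in #332):

// docs/data/summary.csv.js — a sketch of a downstream loader consuming an
// intermediate file produced by an upstream loader (paths are hypothetical).
import {readFileSync} from "node:fs";
import {csvParse, csvFormat} from "d3-dsv";

// Assume docs/data/raw.csv was generated by an upstream data loader.
const raw = csvParse(readFileSync("docs/data/raw.csv", "utf-8"));

// Emit a small derived summary as this loader's own output.
process.stdout.write(csvFormat([{rows: raw.length}]));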

mbostock (Member) commented

Closing as I think I’ve responded to the original points, and more concrete feature requests are covered by linked issues. Please reply if you have further thoughts, and thank you for the detailed feedback!
