Fetch and cache data at runtime #839
Thanks for the feedback! A quick response, but for the point-in-time data, yes, we recommend you materialize your data as a static file by running the data loader manually:

```sh
node docs/data/my-data.csv.js > docs/data/my-data.csv
```

If you ever want to regenerate your data, you just run the command above again. And if you want to switch to generating it automatically during build, you just delete the static file, and Framework will once again implicitly invoke your data loader and cache its output:

```sh
rm docs/data/my-data.csv
```

This works because a static file (e.g., docs/data/my-data.csv) takes precedence over a data loader of the same name (docs/data/my-data.csv.js).
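For concreteness, a data loader is just a script whose standard output becomes the file's contents. A minimal, purely hypothetical docs/data/my-data.csv.js might look like this (the data and helper are invented for illustration):

```javascript
// Hypothetical minimal data loader: docs/data/my-data.csv.js.
// A Framework data loader is a script whose stdout becomes the file's
// contents; running it manually (as above) materializes a static file.

// Serialize rows to CSV text (no quoting needed for this simple data).
function toCsv(rows) {
  return rows.map((row) => row.join(",")).join("\n");
}

const rows = [
  ["date", "value"],
  ["2024-01-01", 42],
  ["2024-01-02", 17],
];

process.stdout.write(toCsv(rows));
```

During preview and build, Framework runs this script implicitly whenever a page references `data/my-data.csv` and no static file of that name exists.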
Regarding suggestion 1, please see #245 on parameterized data loaders. The essential part of our parameterized data loader approach is that parameter values are enumerated at build time. A parameterized page will be expressed as a dynamic route with named parameters in square brackets, such as […]. We require all values of […] to be known at build time. Once we have parameterized data loaders, we should be able to allow […].

(And I’ll explain more about why suggestion 2 isn’t compatible with our approach, and respond in more detail to the other two notebook types you describe as I have more time. There are ways to load data at runtime, but our philosophy is very much that it’s better to build the data ahead of time for instant page loads.)
Regarding what you’re calling “live-but-static notebooks” (note we don’t call anything created with Framework a “notebook”; it’s a data app, dashboard, report, etc., and Framework isn’t intended to replace notebooks, which are for lightweight, collaborative, ad hoc exploration): This use case is our bread and butter. 🍞🧈 So please allow me to disagree respectfully with your assessment that Framework does not support this use case well. This is exactly how we’ve been using Framework since its creation and how we intend it to be used. 😅
Right, we expect people to use continuous deployment to automate builds on their desired cadence. But two things. First, we support incremental builds: if the data loader cache (typically in […]) is populated prior to build, the cached files are used instead of re-running the data loaders. You can see how we do this for our own docs in framework/.github/workflows/deploy.yml, lines 26 to 36 at 3f09fab.
This approach allows us to rebuild our docs on every commit to main, but only run data loaders once per day, or when the data loaders change, or when we manually delete the cache. We currently use a daily cadence for data loaders in our docs, but you can write your own logic to determine how to invalidate some or all of the data loader cache, say keeping some cached data for 28 days and other data for only one day. Configuring the cache prior to build is how you specify what you want to build. (Incremental builds currently apply only to data loaders, but I can see us supporting incremental builds of specific pages in the future, too. E.g., you tell Framework to build just a specific page, and it combines it into the […].)

Second, I’ll acknowledge that the above is a bunch of work for users to figure out on their own! And that it would be especially difficult to have a “smart” cache based on usage, i.e., based on what pages are actually being looked at. That’s where the Observable platform comes in, and why we offer a hosted service to make this work easier. We’re currently only offering hosted projects with access control, but our intent in the near future is to make it seamless to set up continuous deployment from your source control, including scheduled builds, with some intuitive and simple way of specifying how frequently you’d like to rebuild specific pages or data. We can also offer monitoring and analytics for projects that are hosted on Observable, so we could implement “smart” builds exactly as you describe, changing the cadence of incremental builds based on which pages are actually being viewed. Of course, you can build all of this yourself since Framework is open source, but our goal is to offer a compelling service that specializes in the operationalization/productionization of data apps, and we’d like to take care of that often-tedious work for you.
Great to hear about parameterized data loaders. I do wonder if it's anywhere on the roadmap to think about some sort of DAG of data loaders along with content-addressable caching. I imagine it could be nice to have one data loader that does some relatively expensive production of data, and to later add a dependent data loader that makes use of that intermediate data, only rebuilding the dependent loader's output when the contents of the intermediate data it depends on have changed. I can see how this kind of complexity might be at odds with ease of use, but thought I'd ask if it was planned, or at least philosophically compatible with where you see data loaders going in the future.
Yes, we have the DAG idea covered in #332. The idea is that data loaders could be used to create intermediate files that are not (necessarily) published but that are consumed by downstream data loaders.
Closing as I think I’ve responded to the original points, and more concrete feature requests are covered by linked issues. Please reply if you have further thoughts, and thank you for the detailed feedback!
With Observable, we're able to connect an external database, then query that database on notebook load, often interpolating inputs from the notebook into our queries. Observable then proxies those queries through its database proxy service.
This provides three important benefits:
As far as I can tell, Framework currently has no way to handle this. It looks like Framework once had DatabaseClient support (#13), but it was removed for unclear reasons (#594, #359).
Proposal
Data loaders are awesome. However, I believe a specific detail in their current implementation has certain drawbacks.
To illustrate, let's look at the three types of notebooks (I'm using the Observable parlance, but you could call these notebooks, pages, or dashboards):
Point-In-Time Notebooks
These are notebooks where an analysis is built around a fixed snapshot of data. They often have accompanying prose that explains the data, the analysis, and the authors' conclusions. The conclusions of these reports are fixed to a point in time. For these reports, you never want the underlying data to change after it's been analyzed: that could break the notebook! Data in the original source could've since been deleted or modified; the schema in the original source could've changed; the original source may no longer exist at all!
The current default workflow of Framework does not support these notebooks well, since the cache is .gitignored. If the project is deployed from a different machine than the one that originally deployed the notebook, the built data will change, which will change what's rendered in the notebook. For these notebooks, you'd ideally want to check the data into git. This means an override in .gitignore, or better, writing to a file that's not in cache/ and loading that in the notebook instead.
Live-But-Static Notebooks
These are notebooks where it is desired that the data reflects the current state of the underlying source, but where the data is the same for all viewers at a given time. You might want to see revenue over time, updated daily, but the query for that data only needs to be made once a day. The query, or the processing of its result, might be expensive, so you want to cache the results rather than recompute them on each page load.
The current default workflow of Framework also does not support these notebooks well.

The current way to handle these is to set up deployment on a cron, for example with a GitHub Action. However, this has the drawback that you're then re-querying and re-processing all your data on the cadence you set (every minute, or hour, or day, etc.), even for notebooks that no one has accessed, or will access, for months. For projects with a lot of notebooks, many of which will be infrequently viewed, this could end up being very expensive and slow, both for the underlying source database and for the process building and deploying the project.

Imagine your deployment taking five minutes because of an expensive data loader for a notebook that someone views once a month, just because you have another notebook that gets viewed every day and needs up-to-date data.
This also means you have to choose between always deploying from a CI/CD setup, like a GitHub Action, and being able to deploy from your local machine with `yarn deploy`. This is because your local machine will likely have cached files that are older than what was last deployed by the last CI/CD run.

Dynamic Notebooks
These notebooks make queries at runtime with values provided by the user. For example, a notebook that lets you paste in a user ID and get an analysis of that single user's data. A company might have millions of users, so it would be infeasible to build every user's data into the deployment ahead of time, and so a query—with parameters—at runtime is necessary.
As far as I can tell, the only way to do this with Framework is to expose your DB credentials or API keys by building them into the bundle that gets sent to the browser, then query the database or other source directly from the browser.
This is obviously far from ideal. It lets any user with access to the project (and any malicious extensions they might have installed) make arbitrary queries and retain credentials even after they've been removed from the project. You lose visibility into the custody of your credentials; instead of giving credentials only to Observable, you're giving credentials to every user that views the project and anything they have installed, and after that you have no idea where they could end up.
This also means that there's no built-in way to cache queries. You could cache in the browser, but you have to build that yourself, and it would only be per-client. You could cache in some external store, but you'd have to build that yourself too (and build the credentials for it into the deployment).
These, I believe, are the three main types of notebooks, and, as I've illustrated, the current design of Framework data loaders is not ideal for any of them.
However, there are two complementary changes that would make data loaders perfect for all three types while staying true to the concept of data loaders:
1. Support passing arguments to data loaders in the file name.
2. Run data loaders at runtime instead of build time.
Keep the cache, but don't build it into the deployment. Instead, run each data loader for a deployed project on Observable's servers, and cache the data there.
For point-in-time notebooks, this makes it more explicit to users authoring such notebooks that they cannot rely on their local cache to keep data fixed, and must instead write their data to a file that's checked into git. It lets users always deploy from their local machine, regardless of the requirements of their project's pages.
For live-but-static notebooks, this, in combination with the first change, lets you cache and rebuild your data at arbitrary intervals by passing an argument that changes whenever you want to reload your data.
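To make suggestion 1 concrete, a parameterized loader might look like the following sketch. How the argument reaches the loader (here, as a CLI argument) is my assumption for illustration; the proposal leaves the mechanism open:

```javascript
// Hypothetical parameterized data loader for a route like user-[id].json:
// the parameter is delivered as a CLI argument (an assumption made for
// this sketch), and the loader writes the user's report to stdout.
function buildReport(id) {
  // A real loader would query a database or API for this user's data here.
  return { id, events: [] };
}

const id = process.argv[2]; // e.g. invoked with "42" as the argument
if (id) process.stdout.write(JSON.stringify(buildReport(id)));
```

The cache-busting variant for live-but-static notebooks would pass, say, the current date as the argument, so a new cache entry is produced once per day.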
And for dynamic notebooks, this lets you load data on input change while still using Framework data loaders, and without having to leak sensitive credentials to the browser. The data loader would run on Observable's servers and would be able to access the team's secrets set in their Observable settings.
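From the page's side, the runtime flow of suggestion 2 might look roughly like this; the endpoint path and response shape are invented purely for illustration:

```javascript
// Illustrative client-side sketch of suggestion 2: the page asks the
// server to run a data loader at runtime instead of fetching a prebuilt
// file. The "/_loader/" endpoint is invented for this sketch; the server
// would hold the credentials, run the loader, and cache the result.
function loaderUrl(name, params) {
  return `/_loader/${name}?${new URLSearchParams(params)}`;
}

async function loadData(name, params) {
  const response = await fetch(loaderUrl(name, params));
  if (!response.ok) throw new Error(`loader ${name} failed: ${response.status}`);
  return response.json();
}

// Usage: const report = await loadData("user-report", { id: "42" });
```

Because the browser only ever sees the loader's output, credentials stay server-side, and the server can cache responses across clients.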
I'm loving the new direction Framework is taking Observable in. I believe these two changes are both essential and fit perfectly within Framework's philosophy. ❤️