At present (2021-03-30),
http://api.crossref.org/v1/works?filter=from-deposit-date:2021-03-01
yields, in the worst case, a pretty manageable 258 kB of JSON per page (see sample below).
Both the data below (which includes records from 2020) and the docs suggest that this returns newly added as well as updated records:
In the snapshot docs:
and from the REST API spec:
from-deposit-date | {date} | metadata last (re)deposited since (inclusive) {date}
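For reference, a minimal sketch of how that request could be built in code. The base URL and the `from-deposit-date` filter are the ones quoted in this issue; the `rows` and `cursor` parameters are Crossref's paging parameters, and the function name `deposit_filter_url` is made up for illustration.

```python
from datetime import date
from urllib.parse import urlencode

# Endpoint as quoted in this issue.
BASE_URL = "http://api.crossref.org/v1/works"


def deposit_filter_url(since: date, rows: int = 100, cursor: str = "*") -> str:
    """Build a works URL for everything (re)deposited since `since`, inclusive.

    `deposit_filter_url` is a hypothetical helper, not part of any library.
    `cursor="*"` asks for the first page when deep paging with cursors.
    """
    params = {
        "filter": f"from-deposit-date:{since.isoformat()}",
        "rows": rows,
        "cursor": cursor,
    }
    # urlencode percent-encodes ":" and "*"; the server decodes them again.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(deposit_filter_url(date(2021, 3, 1)))
```

The filter is date-granular and inclusive, so asking for the same day twice returns overlapping records that have to be deduplicated downstream.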
Fetching the data below takes barely a blink.
If we wanted to put everything in BigQuery and only ever query that, the question is how we can efficiently and regularly add these records to the warehouse.
The update would probably have to be triggered by cron (to cover most days) and at runtime (to cover the last few hours).
The request is also cheap, so fetching at runtime should work.
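One way the cron-plus-runtime split could work, as a rough sketch only: cron loads records into the warehouse up to some date, and at runtime we re-query from that same date and let the freshest deposit win per DOI. The helper names (`runtime_since`, `merge_latest`) are hypothetical, and the code assumes each record carries a `DOI` field and a `deposited.timestamp` field, which would need checking against the sample data.

```python
from datetime import date, datetime


def runtime_since(last_cron_run: datetime) -> date:
    """Date to pass to from-deposit-date at runtime.

    The filter is inclusive and date-granular, so we re-request the whole
    day of the last cron run and rely on merge_latest() to drop overlaps.
    """
    return last_cron_run.date()


def merge_latest(warehouse_rows, fresh_rows):
    """Combine warehouse and REST records, keeping the newest deposit per DOI.

    Assumes each record has "DOI" and "deposited" -> "timestamp" (assumption).
    DOIs are compared case-insensitively.
    """
    latest = {}
    for row in list(warehouse_rows) + list(fresh_rows):
        doi = row["DOI"].lower()
        ts = row["deposited"]["timestamp"]
        # On ties, the later source (the fresh REST rows) wins.
        if doi not in latest or ts >= latest[doi]["deposited"]["timestamp"]:
            latest[doi] = row
    return latest


if __name__ == "__main__":
    old = [{"DOI": "10.1/a", "deposited": {"timestamp": 1}, "title": "v1"}]
    new = [{"DOI": "10.1/A", "deposited": {"timestamp": 2}, "title": "v2"}]
    print(merge_latest(old, new))
```

Callers would only ever see the merged result, which is the part that hides the REST vs BigQuery distinction.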
Perhaps something via the federated query functions in BigQuery?
This might still be a bad idea; I'm not sure.
Though if there's an elegant way, it would be great to be able to completely abstract away the whole update and REST vs BigQuery stuff.
Lots of complexity could be cut.
The diffed data, it appears, is readily available: