Prepare for migration of report queries #30
base: main
Conversation
bytesTotal: {
  name: 'Total Kilobytes',
I found the reports config file - it seems like a good idea to keep all the configs in one place (more transparent for future contributors).
I copied it over here (to experiment with) and added the queries.
I wouldn't be able to add the queries unless the format supported multiline strings - so I just saved it as JS.
Actually, it is required to be readable from Python - YAML?
I couldn't make it work with a remote config (e.g. reading from the httparchive repo).
The file needs to be stored in this repo at runtime.
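To make the trade-off concrete, here is a minimal sketch of what a JS config with multiline query strings could look like. The metric shape, the SQL, and the ${date} placeholder are illustrative assumptions, not the actual definitions in this PR.

```js
// Illustrative JS reports config: template literals keep multiline SQL readable,
// which is the reason for JS over YAML here. Structure and SQL are assumptions.
const metrics = [
  {
    id: 'bytesTotal',
    name: 'Total Kilobytes',
    SQL: [
      {
        type: 'timeseries',
        query: `
          SELECT
            date,
            client,
            SUM(total_kbytes) AS value
          FROM \`crawl.pages\`
          WHERE date = '\${date}'
          GROUP BY date, client
        `
      }
    ]
  }
]

module.exports = { metrics }
```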
publish(sql.type, {
  type: 'table',
  schema: 'reports',
  tags: ['crawl_reports']
}).query(ctx => constants.fillTemplate(sql.query, params))
In the reports_* datasets we could store intermediate aggregated data - it's easier to check for data issues in BQ than in GCS.
A Cloud Function will then pick up fresh row batches and save them to GCS.
Currently it's configured to have one table per metric per chart type, e.g. httparchive.reports_timeseries.totalBytes.
We could instead store all the metrics for one chart type in a single table (clustered by metric), but that seems a bit more complicated to maintain and query.
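For comparison, here is a rough sketch of that single-table alternative, assuming the metrics config shape from the earlier sketch; the clusterBy wiring and query assembly below are illustrative, not part of this PR.

```js
// Sketch: one table per chart type (here: timeseries), clustered by metric,
// instead of one table per metric per chart type. Illustrative only.
const timeseriesQueries = metrics
  .flatMap(metric => metric.SQL
    .filter(sql => sql.type === 'timeseries')
    .map(sql => `SELECT '${metric.id}' AS metric, * FROM (${constants.fillTemplate(sql.query, params)})`)
  )

publish('timeseries', {
  type: 'table',
  schema: 'reports',
  tags: ['crawl_reports'],
  bigquery: {
    clusterBy: ['metric']
  }
}).query(ctx => timeseriesQueries.join('\nUNION ALL\n'))
```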
const params = {
  date: constants.currentMonth,
  rankFilter: constants.devRankFilter
}
Query parameters. I found only date.
We need to list all the required ones and add the queries to test them with.
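For context, a hypothetical sketch of how a fillTemplate helper could substitute these parameters into a query string; the actual implementation in this repo's constants include may differ.

```js
// Hypothetical fillTemplate: replaces ${key} placeholders in a query string
// with values from a params object. Not necessarily the repo's implementation.
function fillTemplate (query, params) {
  return Object.entries(params).reduce(
    (sql, [key, value]) => sql.replaceAll('${' + key + '}', String(value)),
    query
  )
}

// Example (values are illustrative):
// fillTemplate(
//   'SELECT * FROM crawl.pages WHERE date = "${date}" AND ${rankFilter}',
//   { date: '2024-06-01', rankFilter: 'rank <= 1000' }
// )
```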
@tunetheweb here is a demo version that needs to be discussed. And I have no idea what to do with lenses and the 2 other requests (see the description).
  schema: 'reports_' + sql.type,
  tags: ['crawl_reports']
}).query(ctx =>
  `/* {"dataform_trigger": "reports_complete", "date": "${params.date}", "metric": "${metric.id}", "type": "${sql.type}"} */` +
I'm using this part of the SQL query as trigger event metadata for BQ to GCS (Firestore) exports.
Every time the table update query succeeds, it triggers the export.
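To illustrate the idea, a rough sketch of how an export function could pull that metadata back out of a completed job's query text; the function name and shape here are illustrative, not the actual dataform-export code.

```js
// Sketch: extract the leading JSON comment from a finished query's SQL text
// (e.g. taken from a BigQuery job-completion event) and use it to decide
// whether to export. Illustrative only.
function parseTriggerMetadata (queryText) {
  const match = queryText.match(/^\/\* (\{.*?\}) \*\//)
  if (!match) {
    return null
  }
  const metadata = JSON.parse(match[1])
  return metadata.dataform_trigger === 'reports_complete' ? metadata : null
}

// parseTriggerMetadata('/* {"dataform_trigger": "reports_complete", "date": "2024-06-01", "metric": "bytesTotal", "type": "timeseries"} */ SELECT ...')
// => { dataform_trigger: 'reports_complete', date: '2024-06-01', metric: 'bytesTotal', type: 'timeseries' }
```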
We want to replace the legacy reports script: https://github.com/HTTPArchive/bigquery/tree/master/sql
This is the implementation following the previous discussion.
TODO list:
- for reports: run the reports tag on crawl_complete
- for tech reports: run the cwv_tech_report tag after the CrUX update

Supports features:
- Run monthly histograms SQLs when crawl is finished: SQLs and exports run as part of a tag workflow invocation (when CrUX data is updated)
- [?] Run longer term time series SQLs when crawl is finished: example needed
- Be able to run the time series in an incremental fashion: we can rerun any previous month by redefining the date variable in the corresponding actions (see the sketch after this list)
- [?] Handle different lenses (Top X, WordPress, Drupal, Magento): to clarify
- [?] Handle CrUX reports (monthly histograms and time series) having to run later: need to clarify
- Be able to upload to Cloud Storage in GCP to allow it to be hosted on our CDN: implemented with the dataform-export function (together with the Firestore export)
- Be able to run only the reports missing (histograms) or missing dates (time series): the actions to run can be selected manually in the Dataform console
- Be able to force a rerun (to override any existing reports): use the crawl-data repo in Dataform to rerun
- Be able to run a subset of reports: select actions manually and rerun via the Dataform console
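A minimal sketch of the incremental rerun item above, assuming the date can be overridden via a Dataform project variable when backfilling a past month; the overrideDate variable name is illustrative, and the actual mechanism in this repo may simply be editing the date constant in the corresponding action.

```js
// Sketch: default to the current crawl month, but let a backfill run override
// the date via a project variable. Names are illustrative assumptions.
const params = {
  date: dataform.projectConfig.vars.overrideDate || constants.currentMonth,
  rankFilter: constants.devRankFilter
}
```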
Resolves: crawl dataset httparchive.org#938