
Prepare for migration of report queries #30

Draft · max-ostapenko wants to merge 38 commits into main

Conversation
max-ostapenko (Contributor) commented Nov 16, 2024

We want to replace the legacy reports scripts in https://github.com/HTTPArchive/bigquery/tree/master/sql
This implementation follows the previous discussion.

TODO list for reports:

  • run aggregated table updates with fresh reports data (dataset reports) on crawl_complete
  • create a reports configuration file with a timeseries and a histogram
  • trigger the GCS upload whenever data is updated in BQ

TODO list for tech reports:

  • run aggregated table updates on the cwv_tech_report tag after the CrUX update
  • create a reports configuration file
  • update Firestore collections whenever data is updated in BQ

Supported features:

  • Run monthly histogram SQLs when the crawl is finished
    SQLs and exports run as part of a tag workflow invocation (when CrUX data is updated).

  • [?] Run longer-term time series SQLs when the crawl is finished
    Example needed.

  • Be able to run the time series in an incremental fashion
    We can rerun any previous month by redefining the date variable in the corresponding actions.

  • [?] Handle different lenses (Top X, WordPress, Drupal, Magento)
    To be clarified.

  • [?] Handle CrUX reports (monthly histograms and time series) having to run later
    To be clarified.

  • Be able to upload to Cloud Storage in GCP so reports can be hosted on our CDN
    Implemented with the dataform-export function (together with the Firestore export).

  • Be able to run only missing reports (histograms) or missing dates (time series)
    The actions to run can be selected manually in the Dataform console.

  • Be able to force a rerun (to override any existing reports)
    Use the crawl-data repo in Dataform to rerun.

  • Be able to run a subset of reports
    Select the actions manually and rerun via the Dataform console.

Resolves:

@max-ostapenko max-ostapenko changed the title Preparing data for reports Prepare for migration of report queries Nov 16, 2024
Comment on lines +110 to +111
bytesTotal: {
name: 'Total Kilobytes',
max-ostapenko (Contributor Author):

I found a reports config file; it seems like a good idea to keep all the configs in one place (more transparent for future contributors).

I copied it over here (to experiment with) and added the queries.
I wouldn't be able to add the queries unless the format supported multiline strings, so I saved it as JS.
Actually, it also needs to be readable from Python, so maybe YAML?
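For illustration, a JS config entry with the query inlined next to the metric metadata might look like this. This is only a sketch: the bytesTotal metric name and 'Total Kilobytes' label come from this thread, while the exact config shape and the SQL itself are assumptions.

```javascript
// Hypothetical sketch of a metrics config kept in JS so that multiline
// SQL template strings can live next to the metric metadata.
const metrics = {
  bytesTotal: {
    name: 'Total Kilobytes',
    type: 'timeseries',
    // Multiline SQL is the reason for choosing JS over JSON here.
    // The table and column names below are illustrative assumptions.
    query: `
      SELECT
        date,
        APPROX_QUANTILES(bytesTotal, 1001)[OFFSET(500)] / 1024 AS median
      FROM \`httparchive.example.pages\`
      GROUP BY date
    `
  }
};
```

A JSON or YAML version of the same structure would need explicit "\n" escapes or block scalars to carry the multiline SQL.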

max-ostapenko (Contributor Author):

I couldn't make it work with a remote config (e.g. reading from the httparchive repo).
The file needs to be stored in this repo at runtime.

Comment on lines 10 to 14
publish(sql.type, {
type: 'table',
schema: 'reports',
tags: ['crawl_reports']
}).query(ctx => constants.fillTemplate(sql.query, params))
max-ostapenko (Contributor Author) commented Nov 17, 2024:

In the reports_* datasets we could store intermediate aggregated data; it's easier to check for data issues in BQ than in GCS.
A Cloud Function will then pick up fresh row batches and save them to GCS.

Currently it's configured to have one table per metric per chart type, e.g. httparchive.reports_timeseries.totalBytes.
We could instead store all the metrics for one chart type in a single table (clustered by metric), but that seems a bit more complicated to maintain and query.
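For comparison, the single-table-per-chart-type alternative could be declared in Dataform roughly like this. Dataform's bigquery.clusterBy option is real, but the table name, column name, and surrounding wiring here are assumptions, not code from this PR.

```javascript
// Sketch: one table per chart type, clustered by metric, instead of
// one table per metric. Names below are illustrative assumptions.
publish('timeseries', {
  type: 'table',
  schema: 'reports',
  tags: ['crawl_reports'],
  bigquery: {
    clusterBy: ['metric'] // keeps per-metric reads cheap in one wide table
  }
}).query(ctx => constants.fillTemplate(sql.query, params));
```

The trade-off is that exports and backfills then need a WHERE metric = ... filter instead of addressing a table by name.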

Comment on lines 2 to 5
const params = {
date: constants.currentMonth,
rankFilter: constants.devRankFilter
}
max-ostapenko (Contributor Author) commented Nov 17, 2024:

Query parameters: I found only date.
We need to list all the required parameters and add the queries to test them with.
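To make the parameter handling concrete, here is a minimal sketch of what a fillTemplate helper could do. The real constants.fillTemplate implementation is not shown in this PR, so everything below (placeholder syntax, error behaviour, example values) is an assumption.

```javascript
// Hypothetical sketch of a fillTemplate helper: replaces ${name}
// placeholders in a SQL string with values from a params object,
// and fails loudly on any parameter the query needs but params lacks.
function fillTemplate(query, params) {
  return query.replace(/\$\{(\w+)\}/g, (match, name) => {
    if (!(name in params)) {
      throw new Error(`Missing query parameter: ${name}`);
    }
    return params[name];
  });
}

// Example values standing in for constants.currentMonth and
// constants.devRankFilter (both assumptions).
const params = {
  date: '2024-11-01',
  rankFilter: 'rank <= 1000'
};

const sql = fillTemplate(
  'SELECT * FROM pages WHERE date = "${date}" AND ${rankFilter}',
  params
);
```

Listing the required parameters then reduces to scanning each query for its ${...} placeholders.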

max-ostapenko (Contributor Author):

@tunetheweb here is a demo version that needs to be discussed.
Once we see that it covers all the requirements and agree on the feasibility of the 3 topics in the comments above, I'll finalise the part with uploading to GCS.

And I have no idea what to do with the lenses and the 2 other requests (see the description).

schema: 'reports_' + sql.type,
tags: ['crawl_reports']
}).query(ctx =>
`/* {"dataform_trigger": "reports_complete", "date": "${params.date}", "metric": "${metric.id}", "type": "${sql.type}"} */` +
max-ostapenko (Contributor Author):

I'm using this part of the SQL query as trigger event metadata for the BQ-to-GCS (and Firestore) exports.
Every time the table update query succeeds, it triggers the export.
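On the export side, the function would need to recover that metadata from the query text. A minimal sketch of that parsing step follows; the comment format is taken from this PR, but the function itself is an assumption.

```javascript
// Sketch: extract the JSON trigger metadata embedded as a leading
// SQL comment, e.g. /* {"dataform_trigger": "reports_complete", ...} */
function parseTriggerMetadata(query) {
  const match = query.match(/^\/\*\s*(\{.*?\})\s*\*\//);
  return match ? JSON.parse(match[1]) : null;
}

const meta = parseTriggerMetadata(
  '/* {"dataform_trigger": "reports_complete", "date": "2024-11-01", ' +
  '"metric": "bytesTotal", "type": "timeseries"} */ SELECT 1'
);
```

Queries without the leading comment yield null, so unrelated jobs are ignored by the exporter.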
