Contains code that processes M-Lab data and provides it in various formats for other use.

m-lab/stats-pipeline


Statistics Pipeline Service

This repository contains code that processes NDT data and provides aggregate metrics by day for standard global and some national geographies. The resulting aggregations are made available in JSON format for use by other applications.

The stats-pipeline service is written in Go, runs on GKE, and generates and updates daily aggregate statistics. Access is provided in public BigQuery tables and in per-year JSON formatted files hosted on GCS.
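Consuming the per-year JSON files can be sketched in Go. The record fields below (`date`, `download_MED`, `upload_MED`, `tests`) are illustrative assumptions for this sketch, not the pipeline's actual output schema.

```go
package main

// Sketch: decoding per-day aggregate rows from the per-year JSON files
// hosted on GCS. The struct fields are illustrative assumptions, not
// the pipeline's actual schema.

import (
	"encoding/json"
	"fmt"
)

type DailyStat struct {
	Date               string  `json:"date"`
	DownloadMedianMbps float64 `json:"download_MED"`
	UploadMedianMbps   float64 `json:"upload_MED"`
	Tests              int     `json:"tests"`
}

// decodeStats unmarshals a JSON array of per-day aggregate rows.
func decodeStats(raw []byte) ([]DailyStat, error) {
	var rows []DailyStat
	err := json.Unmarshal(raw, &rows)
	return rows, err
}

func main() {
	raw := []byte(`[{"date":"2021-01-01","download_MED":42.7,"upload_MED":9.3,"tests":1250}]`)
	rows, err := decodeStats(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %.1f Mbps down, %.1f Mbps up, %d tests\n",
		rows[0].Date, rows[0].DownloadMedianMbps, rows[0].UploadMedianMbps, rows[0].Tests)
}
```

The same struct tags work unchanged whether the rows come from a GCS download or a local file.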

Documentation Provided for the Statistics Pipeline Service

General Recommendations for All Aggregations of NDT data

In general, our recommendations for research aggregating NDT data are:

  • Don't oversimplify
  • Aggregate by ASN in addition to time/date and location
  • Be aware of, and illustrate, multimodal distributions
  • Use histograms and logarithmic scales
  • Take into account, and compensate for, client bias and population drift
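The histogram and log-scale recommendations above can be sketched in Go: bucketing measured speeds into base-10 logarithmic bins keeps a multimodal distribution visible where a single mean would hide it. The bin edges here are illustrative, not ones the pipeline uses.

```go
package main

// Sketch of the "histograms and logarithmic scales" recommendation:
// count NDT download speeds (Mbps) into decade-wide logarithmic bins.
// The bin edges are illustrative assumptions.

import (
	"fmt"
	"math"
)

// logBin maps a speed in Mbps to a decade bin index:
// bin 0 = [0.1, 1), bin 1 = [1, 10), bin 2 = [10, 100), bin 3 = [100, 1000).
// Values outside that range are clamped to the nearest bin.
func logBin(mbps float64) int {
	if mbps < 0.1 {
		return 0
	}
	b := int(math.Floor(math.Log10(mbps))) + 1
	if b < 0 {
		b = 0
	}
	if b > 3 {
		b = 3
	}
	return b
}

func main() {
	// A bimodal sample: one cluster of slow tests, one of fast tests.
	speeds := []float64{0.5, 3.2, 7.8, 45.0, 52.1, 480.0}
	hist := make([]int, 4)
	for _, s := range speeds {
		hist[logBin(s)]++
	}
	labels := []string{"0.1-1", "1-10", "10-100", "100-1000"}
	for i, c := range hist {
		fmt.Printf("%-9s Mbps: %d\n", labels[i], c)
	}
}
```

On a linear scale the 0.5 Mbps and 480 Mbps clusters would collapse into one skewed tail; the log bins separate them.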

Roadmap

Below we list additional features, methods, geographies, etc. which may be considered for future versioned releases of stats-pipeline.

Geographies

  • US Zip Codes, US Congressional Districts, Block Groups, Blocks

Output Formats

  • histogram_daily_stats.csv - The same data as the JSON, but in CSV format. Useful for importing into a spreadsheet.
  • histogram_daily_stats.sql - A SQL query that returns the same rows as the corresponding .json and .csv files. Useful for verifying the exported data against the source and for tweaking the query to suit different use cases.
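Since the CSV would carry the same rows as the JSON, the conversion is a straightforward pass through Go's encoding/csv. The column names in this sketch are hypothetical, not the pipeline's actual CSV header.

```go
package main

// Sketch: writing per-day histogram rows as CSV for spreadsheet import.
// Column names are illustrative assumptions, not the actual header of
// histogram_daily_stats.csv.

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// toCSV renders rows (header first) as an RFC 4180 CSV string.
func toCSV(rows [][]string) (string, error) {
	var b strings.Builder
	w := csv.NewWriter(&b)
	err := w.WriteAll(rows) // WriteAll flushes before returning
	return b.String(), err
}

func main() {
	rows := [][]string{
		{"date", "bucket_min", "bucket_max", "dl_samples"},
		{"2021-01-01", "10", "100", "312"},
		{"2021-01-01", "100", "1000", "57"},
	}
	out, err := toCSV(rows)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```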