Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
capo.sql		capo.sql

README.md

Capo on BigQuery

capo.sql is a BigQuery function that uses Cheerio to parse strings of HTML.

Usage

To analyze pages in HTTP Archive, pass the HTML response body to the httparchive.fn.CAPO function:

SELECT
  page,
  httparchive.fn.CAPO(response_body) AS capo
FROM
  `httparchive.all.requests` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE
  date = '2023-05-01' AND
  client = 'desktop' AND
  is_main_document
LIMIT
  1

Results:

page	vizWeight	weight	element
https://www.example.com/	██████████	9	`<title>Example Domain</title>`
https://www.example.com/	███████████	10	`<meta charset="utf-8">`
https://www.example.com/	███████████	10	`<meta http-equiv="Content-type" content="text/html; charset=utf-8">`
https://www.example.com/	███████████	10	`<meta name="viewport" content="width=device-width, initial-scale=1">`
https://www.example.com/	█████	4	`<style type="text/css">...</style>`

Note that the HTML for some pages might be so large that it causes UDF timeouts/OOMs. One workaround is to extract only the <head> content before passing it into the function:

httparchive.fn.CAPO(REGEXP_EXTRACT(response_body, r'(?i)(.*</head>)'))

Testing

If needed, you can pass arbitrary strings of HTML to the function on BigQuery without incurring any processing costs:

SELECT httparchive.fn.CAPO('''
<html>
  <head>
    <title>Example</title>
    <link rel="manifest" href="/manifest.json">
    <style></style>
    <script defer src="script.js"></script>
    <meta charset="utf-8">
  </head>
</html>
''')

Results:

vizWeight	weight	element
██████████	9	`<title>Example</title>`
█	0	`<link rel="manifest" href="/manifest.json">`
█████	4	`<style></style>`
███	2	`<script defer="" src="script.js"></script>`
███████████	10	`<meta charset="utf-8">`

Aggregate statistics

Using capo.sql it's possible to analyze the entire HTTP Archive corpus.

Here's an example of counting the number of pages whose <head> starts with an element of a given weight:

WITH firstHeadElements AS (
  SELECT
    httparchive.fn.CAPO(REGEXP_EXTRACT(response_body, r'(?i)(.*</head>)'))[SAFE_OFFSET(0)] AS capo
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-05-01' AND
    client = 'mobile' AND
    is_main_document
)

SELECT
  capo.weight,
  COUNT(0) AS pages
FROM
  firstHeadElements
WHERE
  capo.weight IS NOT NULL
GROUP BY
  weight
ORDER BY
  weight DESC

Results:

weight	pages
10	800,752
9	202,438
8	30,555
7	98,241
6	14,210
5	600,997
4	1,546,590
3	142,417
2	85,653
1	22,148
0	1,211,124

Not every page will necessarily have an element of weight 10 (META) but these results definitely show a suprisingly high number of sites that lead with lower weights like 4 (SYNC_STYLES) and 0 (OTHER).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bigquery

bigquery

README.md

Capo on BigQuery

Usage

Testing

Aggregate statistics

Files

bigquery

Directory actions

More options

Directory actions

More options

Latest commit

History

bigquery

Folders and files

parent directory

README.md

Capo on BigQuery

Usage

Testing

Aggregate statistics