Focus on high scale technologies and improve rules confidence #33

Open
max-ostapenko opened this issue May 12, 2024 · 0 comments

max-ostapenko commented May 12, 2024

@rviscomi @pmeenan @tunetheweb, to wrap up the topic of maintenance efforts...
Would this idea be helpful?

Is your feature request related to a problem? Please describe.
The list of technologies has grown to more than 3K entries.

To keep improving the scale and quality of the insights provided by HTTP Archive crawls, it may be better to focus on the most impactful technologies.

Describe the solution you'd like

  1. The HTTP Archive team could define a profile of technologies that serve the goals of the crawls.

  2. Here is an example of an analysis that could be run quickly every month, checking:

  • changes in tech popularity (which may be due to the tech lifecycle or to degraded freshness, quality, or completeness of the rules),
  • that no rules are maintained for technologies (deprecated?) that fall below a popularity threshold.

E.g.:

WITH tech_report AS (
  SELECT
    tech.technology,
    COUNT(DISTINCT IF(date = "2024-03-01", root_page, NULL)) AS pages_20240301,
    COUNT(DISTINCT IF(date = "2024-04-01", root_page, NULL)) AS pages_20240401
  FROM `httparchive.all.pages` AS t
  CROSS JOIN UNNEST (t.technologies) AS tech
  WHERE
    date IN ("2024-03-01", "2024-04-01") -- only the two crawls being compared
    AND client = 'desktop'
    AND is_root_page = TRUE
  GROUP BY 1
),
tech_list AS (
  SELECT
    DISTINCT name AS technology
  FROM `max-ostapenko.wappalyzer.apps` -- to migrate to httparchive project
)

SELECT
  COALESCE( tech_list.technology, tech_report.technology ) AS technology,
  pages_20240301,
  pages_20240401,
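  -- Change relative to the April count: 1 - March/April = (April - March) / April
  ROUND(1-SAFE_DIVIDE(pages_20240301,pages_20240401), 2) AS diff_perc,
  -- TRUE when the technology was detected on at most 100 pages, or not at all, in April
  (pages_20240401 <= 100 OR pages_20240401 IS NULL) AS low_reach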
FROM tech_list
FULL OUTER JOIN tech_report
ON tech_list.technology = tech_report.technology
ORDER BY
  pages_20240301 DESC,
  pages_20240401 ASC

The final reports should of course be actionable (example), and should probably cover 3-4 months to increase confidence.
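To illustrate how the same checks might extend beyond two months, here is a minimal sketch in Python that works on per-technology page counts already exported from a query like the one above. The function names, the `steady_decline` rule (the count drops in every consecutive month), and the sample numbers are assumptions made for illustration, not part of the proposal. The `diff_perc` and `low_reach` fields mirror the query's columns.

```python
LOW_REACH_THRESHOLD = 100  # same cut-off as the low_reach column in the query


def safe_divide(numerator, denominator):
    """Mimic BigQuery SAFE_DIVIDE: None for a NULL input or a zero denominator."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def summarize(counts):
    """Summarize one technology.

    counts: monthly page counts, oldest first, at least two entries.
    None stands for a month with no detections (NULL after the outer join).
    """
    latest = counts[-1]
    ratio = safe_divide(counts[-2], latest)
    observed = [c or 0 for c in counts]  # treat missing months as zero pages
    return {
        # Same formula as the query: 1 - previous / latest
        "diff_perc": None if ratio is None else round(1 - ratio, 2),
        "low_reach": latest is None or latest <= LOW_REACH_THRESHOLD,
        # Hypothetical extra signal: the count fell in every consecutive month
        "steady_decline": all(a > b for a, b in zip(observed, observed[1:])),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not real crawl data
    print(summarize([1000, 900, 800, 700]))
    print(summarize([50, 60, None]))
```

A technology that is both low-reach and in steady decline over several months would be a stronger candidate for review than one flagged by a single month-over-month comparison.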

Additional context

  1. Assists with analysis of particularly noticeable web trends.
  2. Makes issues more visible.
  3. Slightly faster tech detection in crawls.
  4. The BQ tech list table can be updated on PR merge.