Is your feature request related to a problem? Please describe.
The list of technologies has grown to more than 3K entries.
To keep improving the scale and quality of the insights provided by HTTP Archive crawls, it may be better to focus on the most impactful technologies.
Describe the solution you'd like
The HTTP Archive team could define a profile of technologies that serves the goals of the crawls.
Here is an example of an analysis that could be quickly verified on a monthly basis:
changes in tech popularity (possibly due to the tech lifecycle, or to degraded freshness, quality, or completeness of the rules),
no rules are maintained for technologies below a popularity threshold (deprecated?).
E.g.:
WITH tech_report AS (
  SELECT
    tech.technology,
    COUNT(DISTINCT IF(date = "2024-03-01", root_page, NULL)) AS pages_20240301,
    COUNT(DISTINCT IF(date = "2024-04-01", root_page, NULL)) AS pages_20240401
  FROM `httparchive.all.pages` AS t
  CROSS JOIN UNNEST(t.technologies) AS tech
  WHERE date >= "2024-03-01" AND client = 'desktop' AND is_root_page = TRUE
  GROUP BY 1
),
tech_list AS (
  SELECT DISTINCT name AS technology
  FROM `max-ostapenko.wappalyzer.apps` -- to migrate to httparchive project
)
SELECT
  COALESCE(tech_list.technology, tech_report.technology) AS technology,
  pages_20240301,
  pages_20240401,
  ROUND(1 - SAFE_DIVIDE(pages_20240301, pages_20240401), 2) AS diff_perc,
  IF((pages_20240401 <= 100 OR pages_20240401 IS NULL), TRUE, FALSE) AS low_reach
FROM tech_list
FULL OUTER JOIN tech_report
  ON tech_list.technology = tech_report.technology
ORDER BY
  pages_20240301 DESC,
  pages_20240401 ASC
Obviously, the final reports should be actionable (example), and they should probably cover 3-4 months to increase confidence.
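To illustrate the multi-month idea, here is a minimal sketch of the decision logic in Python. It is not part of any existing HTTP Archive tooling. It assumes the monthly page counts have already been exported from a query like the one above, and it reuses that query's cutoff of 100 pages. The function names and the sample technologies and numbers are hypothetical.

```python
from typing import Dict, List, Optional

# Same cutoff as the low_reach column in the query above (an assumption,
# not an agreed HTTP Archive threshold).
LOW_REACH_THRESHOLD = 100


def is_low_reach(monthly_pages: List[Optional[int]],
                 threshold: int = LOW_REACH_THRESHOLD) -> bool:
    """Return True only if every month is missing (None) or at/below the threshold."""
    return all(p is None or p <= threshold for p in monthly_pages)


def low_reach_technologies(report: Dict[str, List[Optional[int]]],
                           threshold: int = LOW_REACH_THRESHOLD) -> List[str]:
    """List technologies that stayed low-reach across all months in the report."""
    return sorted(tech for tech, pages in report.items()
                  if is_low_reach(pages, threshold))


# Hypothetical page counts for three consecutive monthly crawls.
report = {
    "ExampleTechA": [50000, 51000, 49500],
    "ExampleTechB": [80, 120, 90],       # crossed the threshold once: keep
    "ExampleTechC": [None, 12, None],    # consistently low or undetected
}

print(low_reach_technologies(report))
```

Requiring every month to be below the cutoff is one possible choice. It avoids flagging a technology because of a single bad crawl, at the cost of reacting more slowly to real declines.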
Additional context
Assists with analysis of particularly noticeable web trends.
Makes issues more visible.
A bit faster tech detection in crawls.
BQ tech list table can be updated on PR merge
@rviscomi @pmeenan @tunetheweb to wrap up the topic of maintenance efforts...
Is this idea of any help?