
[Bug]: GUI resource replay numbers (e.g. # of URLs, videos, and PDFs) differ greatly from numbers found in crawl-log analysis #2239

Open
tuehlarsen opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@tuehlarsen

Browsertrix Version

v1.12.0-6032e28

What did you expect to happen? What happened instead?

The only way to get a complete log for a 142 GB local harvest in Browsertrix is to go to the error logs and manually click "Download all logs". After a while this yields a 400 MB complete log with 1.4 million lines: https://drive.google.com/file/d/1i8zsIkDIw-xr1FEbz-K2qoGSvQUlMz4F/view?usp=sharing

In the GUI, under resources, it shows e.g.:
171 video/audio URLs harvested (you have to scroll to the bottom and press END a couple of times; otherwise it only shows 100)
13084 URLs in total
1807 HTML URLs
140 PDFs
These numbers differ greatly from the QA Assistance results and my crawl-log analysis:

(screenshot: QA results)

Here are some numbers from my crawl-log analysis.
Video/audio:
prod [prod@kb-prod-adm-001 browsertrix_tests]$ grep "mp4" manual-20241202144717-f117fb01-3be.log | wc -l
20583 lines, but these are duplicates:
/2 = roughly 10,000 mp4 URLs

PDFs:
prod [prod@kb-prod-adm-001 browsertrix_tests]$ grep ".pdf" manual-20241202144717-f117fb01-3be.log | awk -F, '{ print $5 }' | grep .pdf | awk '{ print $3 }' | sed 's/}}$//' | sed 's/"//g'| grep -v "pdf&" | sort -u | wc -l
356 unique PDF URLs
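The manual divide-by-two for the mp4 lines can be avoided by deduplicating the URLs before counting, the same way the PDF pipeline does with sort -u. A minimal sketch, assuming JSON-lines log entries that carry the URL in a quoted "url" field (the sample log below is fabricated for illustration; real Browsertrix crawl-log fields may differ):

```shell
#!/bin/sh
# Count distinct .mp4 URLs in a crawl log by extracting the quoted
# "url" values, deduplicating with sort -u, then counting lines.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
{"logLevel":"info","details":{"url":"https://example.org/a.mp4"}}
{"logLevel":"info","details":{"url":"https://example.org/a.mp4"}}
{"logLevel":"info","details":{"url":"https://example.org/b.mp4"}}
{"logLevel":"info","details":{"url":"https://example.org/page.html"}}
EOF

# grep -o prints each match on its own line; tr -d ' ' strips the
# leading padding some wc implementations add.
count=$(grep -o '"url":"[^"]*\.mp4"' "$LOG" | sort -u | wc -l | tr -d ' ')
echo "distinct mp4 urls: $count"
rm -f "$LOG"
```

This counts each URL once regardless of how many log lines mention it, so repeated requests for the same video do not inflate the total.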

There are about 14K lines with "Skipping URL from unknown frame", but almost all of those URLs are present in the archive replay.
Only 26 PDF URLs with &nbsp or &gt were not harvested, because of the &* suffix.

Reproduction instructions

see above

Screenshots / Video

see above

Environment

No response

Additional details

No response

@tuehlarsen tuehlarsen added the bug Something isn't working label Dec 13, 2024
@ikreymer
Member

These are different numbers. The QA analysis considers only 'Pages', i.e. anchor links that the browser loads as top-level pages and captures as pages. You are looking at all resources loaded during the crawl.

It is a question whether we should count non-HTML resources here at all; currently we do. The purpose of QA is only to go through HTML pages, so we need to exclude pages that are PDFs, etc. This is not meant to be a breakdown of all the different types of resources in the crawl.

Projects
Status: Triage
Development

No branches or pull requests

2 participants