
[Bug]: GUI resource replay numbers (e.g. # of URLs, videos, and PDFs) differ greatly from numbers found in crawl-log analysis #2239

Open
tuehlarsen opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@tuehlarsen

Browsertrix Version

v1.12.0-6032e28

What did you expect to happen? What happened instead?

The only way to get a complete log for a 142 GB local harvest in Browsertrix is to go to the error logs and manually click "Download all logs". After a while this yields a 400 MB complete log with 1.4 million lines: https://drive.google.com/file/d/1i8zsIkDIw-xr1FEbz-K2qoGSvQUlMz4F/view?usp=sharing

In the GUI, under resources, it shows e.g.:
171 video/audio URLs harvested (you have to scroll to the bottom and press END a couple of times; otherwise it only shows 100)
13084 URLs in total
1807 HTML URLs
140 PDFs
These numbers differ greatly from the QA Assistance results and my crawl-log analysis:

(screenshot: QA results)

Here are some numbers from my crawl-log analysis.
Video/audio:
prod [prod@kb-prod-adm-001 browsertrix_tests]$ grep "mp4" manual-20241202144717-f117fb01-3be.log | wc -l
20583 lines, but these are duplicates:
/2 = roughly 10,000 mp4 URLs

PDFs:
prod [prod@kb-prod-adm-001 browsertrix_tests]$ grep ".pdf" manual-20241202144717-f117fb01-3be.log | awk -F, '{ print $5 }' | grep .pdf | awk '{ print $3 }' | sed 's/}}$//' | sed 's/"//g'| grep -v "pdf&" | sort -u | wc -l
356 unique PDF URLs
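The manual divide-by-two for the mp4 lines can be avoided by deduplicating the URLs before counting, the same way the PDF pipeline does with sort -u. A minimal sketch, assuming JSON-lines log entries that carry the URL in a quoted "url" field (the sample log below is fabricated for illustration; real Browsertrix crawl-log fields may differ):

```shell
#!/bin/sh
# Count distinct .mp4 URLs in a crawl log by extracting the quoted
# "url" values, deduplicating with sort -u, then counting lines.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
{"logLevel":"info","details":{"url":"https://example.org/a.mp4"}}
{"logLevel":"info","details":{"url":"https://example.org/a.mp4"}}
{"logLevel":"info","details":{"url":"https://example.org/b.mp4"}}
{"logLevel":"info","details":{"url":"https://example.org/page.html"}}
EOF

# grep -o prints each match on its own line; tr -d ' ' strips the
# leading padding some wc implementations add.
count=$(grep -o '"url":"[^"]*\.mp4"' "$LOG" | sort -u | wc -l | tr -d ' ')
echo "distinct mp4 urls: $count"
rm -f "$LOG"
```

This counts each URL once regardless of how many log lines mention it, so repeated requests for the same video do not inflate the total.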

There are about 14K lines with "Skipping URL from unknown frame", but almost all of those URLs are present in the archive replay.
Only 26 PDF URLs with &nbsp or &gt were not harvested, because of the &* suffix.

Reproduction instructions

see above

Screenshots / Video

see above

Environment

No response

Additional details

No response

@tuehlarsen tuehlarsen added the bug Something isn't working label Dec 13, 2024
@ikreymer
Member

These are different numbers. The QA analysis considers only 'Pages', i.e. anchor links that the browser loads as top-level pages and captures as pages. You are looking at all resources loaded during the crawl.

It is a question whether we should count non-HTML resources here at all; currently we do. The purpose of QA is only to go through HTML pages, so we need to exclude pages that are PDFs, etc. This is not meant to be a breakdown of all the different types of resources in the crawl.

Projects
Status: Triage
Development

No branches or pull requests

2 participants