These are different numbers: the QA analysis considers only 'Pages', which are anchor links that the browser loads as a top-level page, i.e. captured as pages. You are looking at all resources loaded in the crawl.
It is an open question whether we should count non-HTML resources here at all; currently we do. The purpose of QA is to go through HTML pages only, so we need to exclude pages that are PDFs, etc. This is not meant to be a breakdown of all the different types of resources in the crawl.
Browsertrix Version
v1.12.0-6032e28
What did you expect to happen? What happened instead?
The only way to get a complete log for a 142 GB local harvest in Browsertrix is to go to the error logs and manually click "Download all logs". After a while I get a complete 400 MB log with 1.4 million lines: https://drive.google.com/file/d/1i8zsIkDIw-xr1FEbz-K2qoGSvQUlMz4F/view?usp=sharing
In the GUI (see Resources), for example:
171 video/audio URLs have been harvested (you have to scroll down to the bottom and press END a couple of times, otherwise it only shows 100)
13084 URLs in total
1807 HTML URLs
140 PDFs
These numbers differ greatly from the QA Assistance results and from my crawl log analysis.
Here are some numbers from my crawl log analysis:
Video/audio:
prod [prod@kb-prod-adm-001 browsertrix_tests]$ grep "mp4" manual-20241202144717-f117fb01-3be.log | wc -l
20583, but each one appears twice (duplicates):
/2 = ~10,000 mp4
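Dividing the line count by two only approximates the number of distinct files. A minimal sketch of a more direct count is shown below, assuming the URLs appear in the log as plain `http(s)://` strings; the helper name `count_unique_urls` is made up for illustration and is not part of Browsertrix. It extracts every URL up to the given extension, drops anything after it (such as a query string), and counts the distinct values.

```shell
# Hypothetical helper: count distinct URLs with a given file extension in a log.
# $1: extension without the dot (e.g. mp4 or pdf)
# $2: path to the crawl log
count_unique_urls() {
  # grep -o prints only the matched URL, one per line.
  # sort -u removes duplicates before counting.
  # tr strips the padding that some wc implementations add.
  grep -oE "https?://[^\"[:space:]]+\.$1" "$2" | sort -u | wc -l | tr -d '[:space:]'
}
```

For example, `count_unique_urls mp4 manual-20241202144717-f117fb01-3be.log` would give the number of distinct mp4 URLs, and passing `pdf` would do the same for PDFs. URLs that differ only in their query string are counted once here, which may or may not be what you want.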
PDFs:
prod [prod@kb-prod-adm-001 browsertrix_tests]$ grep ".pdf" manual-20241202144717-f117fb01-3be.log | awk -F, '{ print $5 }' | grep .pdf | awk '{ print $3 }' | sed 's/}}$ //' | sed 's/"//g'| grep -v "pdf&" | sort -u | wc -l
356 PDFs
There are about 14K lines with "Skipping URL from unknown frame", but almost all of those URLs are present in the archive replay.
Only 26 PDF URLs with `&nbsp;` or `&gt;` were not harvested, because of the `&*` suffix.
Reproduction instructions
see above
Screenshots / Video
see above
Environment
No response
Additional details
No response