[Archive.org] Not grabbing all URLs onscreen #109

Tenome · 2024-11-19T16:22:50Z

Site Link

Details

I'm trying to grab all the URLs in this search result, but the extension only grabs some of the links even though I loaded far more than that into the browser. The filter used is https://archive.org/details/. I scrolled down to load the subsequent pages, but it seems to grab only what is roughly on screen (around 177 URLs) instead of the entire list. You can test this yourself by scrolling down, letting it load more pages, and then searching for a URL from the top of the search results (it won't appear in the extracted list).

Support Information

Link Extractor - 0.7.8
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
permissions.origins: []
options: {"linksDisplay":-1,"flags":"ig","lazyLoad":true,"removeDuplicates":true,"defaultFilter":true,"saveState":true,"linksTruncate":false,"linksNoWrap":false,"contextMenu":true,"showUpdate":false}
links-table: {"time":1732033215784,"start":0,"length":-1,"order":[[0,"asc"]],"search":{"caseInsensitive":true,"search":"","regex":true,"smart":true,"return":false,"_hungarianMap":{}},"columns":[{"visible":true,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}}],"childRows":[]}
domains-table: {"time":1732033215788,"start":0,"length":-1,"order":[[0,"asc"]],"search":{"caseInsensitive":true,"search":"","regex":true,"smart":true,"return":false,"_hungarianMap":{}},"columns":[{"visible":true,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}}],"childRows":[]}


smashedr · 2024-11-19T21:00:59Z

There is no limit to the number of links that can be extracted. While there may be some upper limit that I have not tested, I have tested over 2,000 links when I added datatables.

On further review, it seems that archive.org is adding/removing nodes as you scroll, regardless of direction. You can watch in the network tab in developer tools (Ctrl+Shift+I). Nodes are being added when you scroll down or up. It seems the maximum number of links it shows at one time is 300.

That being said, I can possibly see a feature request coming out of this. You could click Start Collecting and the extension would start collecting links in the current tab; then, when you're done navigating in that tab, you would click Stop Collecting, at which point it would open the results with all collected links.

This would require a MutationObserver to listen for added elements. That is non-trivial on its own, and I would also have to figure out how to get it to work on all shadowRoots (like those used by archive.org). This feature request may sit on the back burner for a while, but let me know your thoughts...
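To make the Start Collecting / Stop Collecting idea concrete, here is a minimal sketch, not the extension's actual code. `LinkCollector` and `TreeNode` are hypothetical names invented for illustration. `TreeNode` is a small structural stand-in for DOM nodes so the traversal can also run outside a browser. The sketch assumes open shadow roots only (closed ones are not reachable from a content script) and assumes each shadow root has to be observed separately, since mutations inside a shadow tree are not reported to an observer attached outside it.

```typescript
// Hypothetical structural view of a DOM node (real Elements and ShadowRoots fit this shape).
interface TreeNode {
  href?: string;                  // present on <a> elements
  children?: ArrayLike<TreeNode>; // element children
  shadowRoot?: TreeNode | null;   // non-null only for open shadow roots
}

class LinkCollector {
  private seen = new Set<string>();
  private filter: RegExp;
  private observer: any = null;

  constructor(filterSource: string) {
    // No "g" flag here: with "g", RegExp.test() keeps state between calls.
    this.filter = new RegExp(filterSource, "i");
  }

  // Start Collecting: reset, attach an observer if the browser API exists, scan the page once.
  start(root: TreeNode): void {
    this.seen.clear();
    const Observer = (globalThis as any).MutationObserver;
    if (Observer) {
      this.observer = new Observer((records: Array<{ addedNodes: ArrayLike<TreeNode> }>) => {
        for (const record of records) {
          for (let i = 0; i < record.addedNodes.length; i++) this.scan(record.addedNodes[i]);
        }
      });
      this.watch(root);
    }
    this.scan(root);
  }

  // Record matching links in a subtree, descending into open shadow roots.
  scan(node: TreeNode): void {
    if (typeof node.href === "string" && this.filter.test(node.href)) {
      this.seen.add(node.href); // the Set keeps links even after their nodes are removed
    }
    if (node.shadowRoot) {
      this.watch(node.shadowRoot);
      this.scan(node.shadowRoot);
    }
    const kids = node.children ?? [];
    for (let i = 0; i < kids.length; i++) this.scan(kids[i]);
  }

  // Stop Collecting: detach the observer and return every unique link seen so far.
  stop(): string[] {
    if (this.observer) this.observer.disconnect();
    this.observer = null;
    return Array.from(this.seen);
  }

  private watch(target: TreeNode): void {
    if (this.observer) this.observer.observe(target, { childList: true, subtree: true });
  }
}
```

In a content script this would presumably be used as `collector.start(document.documentElement)` followed later by `collector.stop()`. Because links are accumulated in a Set rather than read from the page at extraction time, links whose nodes archive.org has already removed while scrolling would still be in the result.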

I need the MutationObserver for another feature I want to implement, Live Links, where the toolbar icon always shows the number of links on the page and can optionally extract them to a sidebar that auto-updates with collected links, but this is way on the back burner. The MutationObserver is a good start tho.

Tenome · 2024-11-19T22:25:01Z

I see, thanks for the explanation. I figured something weird was up, since this is the only extension of this type that actually manages to grab any URLs at all (I suspect it's because of how IA truncates their URLs into h4 headers). Oh well, for now the alternate method I use is to load all the cover thumbnails and then replace the URLs in the network tab. So while that feature would definitely be useful, since I doubt I'm the only one who wants to do something like this, I'm not in any particular rush. Thanks.

Tenome added the bug label Nov 19, 2024

Tenome assigned smashedr Nov 19, 2024

Tenome changed the title ~~Is 177 links the maximum amount this will extract?~~ [Archive.org] Not grabbing all URLs onscreen Nov 19, 2024
