[Archive.org] Not grabbing all URLs onscreen #109

Open
Tenome opened this issue Nov 19, 2024 · 2 comments

Tenome commented Nov 19, 2024

Site Link

https://archive.org/details/software?tab=collection&query=-subject%3A%28ps2%29+-subject%3A%28ps1%29+-subject%3A%28sega+saturn%29+-subject%3A%28ps3%29&page=7&sort=-publicdate&and%5B%5D=subject%3A%22PC+Game%22&and%5B%5D=subject%3A%22PC-98%22&and%5B%5D=subject%3A%22IBM+PC%22&and%5B%5D=subject%3A%22macintosh%22&and%5B%5D=subject%3A%22IBM+PC+Compatible%22&and%5B%5D=subject%3A%22mac%22&and%5B%5D=subject%3A%22Doujin%22&and%5B%5D=subject%3A%22Doujin+Games%22&and%5B%5D=subject%3A%22doujin+soft%22&and%5B%5D=subject%3A%22Doujin+games%22&and%5B%5D=subject%3A%22doujin+games%22&and%5B%5D=subject%3A%22Doujin+Game%22&and%5B%5D=subject%3A%22Doujin+game%22&and%5B%5D=subject%3A%22doujin+game%22&and%5B%5D=subject%3A%22doujin%22&and%5B%5D=mediatype%3A%22software%22&and%5B%5D=language%3A%22Japanese%22

Details

I'm trying to grab all the URLs in this search result, but the extension only grabs some of the links despite me loading way more than that into the browser. The filter used is https://archive.org/details/. I scrolled down and loaded the subsequent pages that way, but it seems to only grab what is roughly on screen instead of the entire list (around 177 URLs). You can test this yourself by scrolling down and letting it load more pages, and then searching for a URL from the top of the search results (which won't appear in the extracted list).

Support Information

Link Extractor - 0.7.8
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
permissions.origins: []
options: {"linksDisplay":-1,"flags":"ig","lazyLoad":true,"removeDuplicates":true,"defaultFilter":true,"saveState":true,"linksTruncate":false,"linksNoWrap":false,"contextMenu":true,"showUpdate":false}
links-table: {"time":1732033215784,"start":0,"length":-1,"order":[[0,"asc"]],"search":{"caseInsensitive":true,"search":"","regex":true,"smart":true,"return":false,"_hungarianMap":{}},"columns":[{"visible":true,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}},{"visible":false,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}}],"childRows":[]}
domains-table: {"time":1732033215788,"start":0,"length":-1,"order":[[0,"asc"]],"search":{"caseInsensitive":true,"search":"","regex":true,"smart":true,"return":false,"_hungarianMap":{}},"columns":[{"visible":true,"search":{"caseInsensitive":true,"search":"","regex":false,"smart":true,"return":false}}],"childRows":[]}
@Tenome Tenome added the bug Something isn't working label Nov 19, 2024
@Tenome Tenome changed the title Is 177 links the maximum amount this will extract? [Archive.org] Not grabbing all URLs onscreen Nov 19, 2024
@smashedr
Member

There is no limit to the number of links that can be extracted. While there may be some upper limit that I have not tested, I have tested over 2,000 links when I added datatables.

On further review, it seems that archive.org is adding/removing nodes as you scroll, regardless of direction. You can watch in the network tab in developer tools (Ctrl+Shift+I). Nodes are being added when you scroll down or up. It seems the maximum number of links it shows at one time is 300.


That being said, I can see a possible feature request coming out of this. You could click Start Collecting and the extension would begin collecting links in the current tab; then, when you're done navigating in the current tab, you would click Stop Collecting, at which point it would open the results with all collected links.
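To make the Start Collecting / Stop Collecting idea concrete, here is a minimal sketch of the bookkeeping it would need. The `LinkCollector` class and its method names are hypothetical and not part of Link Extractor; the filter is treated as a plain substring (like the `https://archive.org/details/` filter from the report) rather than a regex.

```javascript
// Hypothetical helper, not taken from the Link Extractor source.
// Accumulates matching URLs between start() and stop(), dropping duplicates.
class LinkCollector {
  constructor(filter) {
    this.filter = filter // plain substring filter (assumption)
    this.links = new Set() // a Set keeps each URL only once
    this.active = false
  }

  // Begin a new collection session, discarding any earlier results.
  start() {
    this.links.clear()
    this.active = true
  }

  // Record a batch of URLs, e.g. whatever is in the DOM after a scroll.
  add(urls) {
    if (!this.active) return
    for (const url of urls) {
      if (url.includes(this.filter)) this.links.add(url)
    }
  }

  // End the session and return everything seen, in first-seen order.
  stop() {
    this.active = false
    return [...this.links]
  }
}
```

Because the results live in a `Set`, links that archive.org removes from the page while scrolling would still be in the final list, and links that reappear would not be counted twice.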

This would require a MutationObserver to listen for added elements, which is non-trivial, and I would also have to figure out how to get it to work on all shadowRoots (like those used by archive.org). This feature request may sit on the back burner for a while, but let me know your thoughts...
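As a rough illustration of what that could look like, the sketch below walks open shadow roots and attaches one `MutationObserver` per root. The function names `collectHrefs` and `observeDeep` are hypothetical; it assumes shadow roots are open (closed ones are not reachable through `element.shadowRoot`) and it would miss a shadow root attached to an element after that element was added.

```javascript
// Sketch only: these function names are hypothetical, not Link Extractor code.

// Collect href values at or under `root`, descending into open shadow roots.
function collectHrefs(root, found = new Set()) {
  const nodes = [root, ...root.querySelectorAll('*')]
  for (const el of nodes) {
    if (el.localName === 'a' && el.href) found.add(el.href)
    // querySelectorAll does not cross shadow boundaries, so recurse manually
    if (el.shadowRoot) collectHrefs(el.shadowRoot, found)
  }
  return found
}

// Watch `root` and every open shadow root below it for added nodes.
// Returns a function that disconnects all observers that were created.
function observeDeep(root, found) {
  const observers = []
  const watched = new WeakSet() // avoid observing the same root twice

  const watch = (target) => {
    if (watched.has(target)) return
    watched.add(target)
    const observer = new MutationObserver((mutations) => {
      for (const mutation of mutations) {
        for (const node of mutation.addedNodes) {
          if (node.nodeType !== 1) continue // elements only
          collectHrefs(node, found)
          attach(node) // newly added hosts need their own observer
        }
      }
    })
    observer.observe(target, { childList: true, subtree: true })
    observers.push(observer)
  }

  // Start watching any open shadow root at or below `node`.
  const attach = (node) => {
    for (const el of [node, ...node.querySelectorAll('*')]) {
      if (el.shadowRoot) watch(el.shadowRoot)
    }
  }

  watch(root)
  attach(root)
  return () => observers.forEach((o) => o.disconnect())
}
```

In a content script this might be started with `const found = collectHrefs(document); const stop = observeDeep(document, found)`, with `stop()` called when collecting ends. One observer per shadow root is needed because `subtree: true` does not extend across shadow boundaries.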

I need the MutationObserver for another feature I want to implement, Live Links, which would always show the number of links on the page in the toolbar icon and optionally extract to a sidebar that auto-updates with collected links. That one is way on the back burner, but the MutationObserver is a good start tho.

Tenome commented Nov 19, 2024

I see, thanks for the explanation. I figured something weird was up, since this is the only extension of this type that actually manages to grab any URLs at all (I suspect it's because of how IA truncates their URLs into h4 headers). Oh well, for now the alternate method I use is to load all the cover thumbnails and then replace the URLs in the network tab. So while that feature would definitely be useful, since I doubt I'm the only one who wants to do something like this, I'm not in any particular rush. Thanks.
