How is the list of URLs generated? #17

Open
grigri9 opened this issue Aug 14, 2019 · 2 comments

Comments

@grigri9

grigri9 commented Aug 14, 2019

First off, this is awesome and I just wanted to say thank you for keeping all this up to date!

Is there some kind of automated process for generating the list of URLs?

It looks like this is pulling from all http://resources.docs.salesforce.com/* URL paths.

I was thinking the https://www.salesforce.com/content/dam/web/en_us/www/documents/ URL path also has a good amount of useful content. There are sales PDFs in there, but also whitepapers, datasheets, and similar items that are very useful.

If this list of URLs is being generated by a Google Custom Search Engine or something similar, it may be worthwhile to add that path.

@richardvanhook
Owner

It's some basic shell scripting and crawling, but a significant part is still manual. :-(

Would love to expand it, but unfortunately I'm time-constrained at the moment with my current customer. Will leave this open as a future reminder.

@mattandneil
Contributor

mattandneil commented Jan 22, 2020

These steps can be made mechanical. Here's an example that yields about 150 PDF files:

  1. Run a web search using the site: operator, with omitted results included (filter=0)
    google.com/search?q=site:https://resources.docs.salesforce.com/sfdc/pdf&filter=0
    
  2. Log the hyperlinks to the console, for copying and pasting into a shell script
    // 'LC20lb' is the class on result titles here; it may change over time
    const h3s = document.getElementsByClassName('LC20lb');
    for (const h3 of h3s) {
      console.log(h3.parentNode.getAttribute('href'));
    }
    
  3. Go to the next page and repeat, solving any CAPTCHA that comes up
    (screenshot: search-result-pages)
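
The links logged in step 2 still need tidying before they go into a shell script, since the same file can show up on several result pages. Below is a minimal sketch of that cleanup, not the process actually used for the catalog. The function names `cleanPdfLinks` and `toShellScript` are hypothetical, and emitting one `curl` command per link is an assumption about what the shell script does.

```javascript
// Hypothetical helpers: the names and the curl output format are
// assumptions for illustration, not part of the original workflow.
const PDF_PREFIX = "https://resources.docs.salesforce.com/sfdc/pdf";

// Keep only links under the searched prefix, drop duplicates, and sort.
function cleanPdfLinks(hrefs, prefix = PDF_PREFIX) {
  const kept = new Set();
  for (const href of hrefs) {
    if (typeof href !== "string") continue; // skip null or missing hrefs
    const url = href.trim();
    if (url.startsWith(prefix)) kept.add(url);
  }
  return [...kept].sort();
}

// Turn the cleaned list into one download command per line.
function toShellScript(urls) {
  // Single-quote each URL so the shell does not interpret & or ?
  const quote = (s) => "'" + s.replace(/'/g, "'\\''") + "'";
  return urls.map((u) => "curl -sSLO " + quote(u)).join("\n");
}
```

One way to use it is to collect the hrefs from each result page into a single array, then run `console.log(toShellScript(cleanPdfLinks(allHrefs)))` once at the end, where `allHrefs` is whatever array you gathered them in.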

This finds PDF resources that have been linked on the public internet, e.g. from help files and articles. However, many files in the catalog have zero backlinks, and files tend to disappear as the docs change over time. An attempt is also made to preserve retired files at their final version by linking to the specific release number.
