Better URL analysis during -bake #26

Open
tdammers opened this issue Jan 8, 2019 · 2 comments
Comments

@tdammers
Owner

tdammers commented Jan 8, 2019

The -bake command needs to analyze link hrefs and the like to figure out what else to scrape. But the algorithm it uses is a bit too crude: anything not starting with http: or https: is considered a local link, which is wrong. Links can use other schemes, such as mailto:, javascript:, file:, data: etc., which the scraper shouldn't touch.

See #23 for example.
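
To make the intended distinction concrete, here is a minimal sketch in Python (not the project's own code). It classifies an href by its URL scheme rather than by a prefix check. The function name `classify_href`, the three labels, and the decision to treat every http(s) link as external are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# Schemes a scraper could actually fetch; all other schemes are left alone.
FETCHABLE_SCHEMES = {"http", "https"}


def classify_href(href: str) -> str:
    """Return 'local', 'external', or 'skip' for a link target (hypothetical labels)."""
    parts = urlsplit(href.strip())
    if parts.scheme:
        # An explicit scheme: mailto:, javascript:, file:, data: etc. are skipped.
        return "external" if parts.scheme in FETCHABLE_SCHEMES else "skip"
    if parts.netloc:
        # Protocol-relative URL such as //example.org/x points at another host.
        return "external"
    if not parts.path and not parts.query:
        # Empty href or a pure fragment (#top): nothing new to scrape.
        return "skip"
    # No scheme and no host: a path on the site being baked.
    return "local"


for href in ["docs/paper.pdf", "mailto:someone@example.org", "https://example.org/"]:
    print(href, "->", classify_href(href))
```

Only links classified as local would then be candidates for further scraping.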

@andrewufrank

I am very confused about uploading a baked version, and about the difference between the baked version served locally (where the simple server finds the files) and the uploaded version (which does not find the files). I tried both my own host and github.io, with the same result.

The baking process takes a static file, selectedEntries.html, and could simply copy it, but instead it produces a folder named selectedEntries.html containing an index.html file (which is the original file). Why is this necessary?

Surprisingly, the simple server resolves the relative URLs in the selectedEntries.html/index.html file (which are something like /doc/doca/filename.pdf) relative to the containing folder (not the file).

With the simple server I see

http://localhost:8000/staticData/docs/docs4/4756_Intersection_Nonconvex_Polygons_Using_Alternate_Hierarchical_Decomposition.pdf

(without the intervening folder), and when uploaded to GitHub:

https://andrewufrank.github.io/staticData/selectedEntries.html/docs/docs4/4756_Intersection_Nonconvex_Polygons_Using_Alternate_Hierarchical_Decomposition.pdf

which means the relative URL is resolved relative to the file and then relative to the folder, so the file is not found.
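
As an illustration of how the same relative href can produce both URLs above, here is a small Python sketch using standard URL resolution. It assumes the link in the page is written as a relative path (the shortened file name `paper.pdf` is hypothetical) and that the only difference between the two cases is whether the page's base URL ends in a slash. Whether that is what actually happened here is an assumption.

```python
from urllib.parse import urljoin


def resolve(base: str, href: str) -> str:
    """Resolve a relative href against the URL of the page that contains it."""
    return urljoin(base, href)


href = "docs/docs4/paper.pdf"  # hypothetical relative link inside the page

# Base without a trailing slash: the last segment is treated as a file name
# and replaced, so the link lands next to selectedEntries.html.
print(resolve("http://localhost:8000/staticData/selectedEntries.html", href))

# Base with a trailing slash: selectedEntries.html is treated as a folder,
# so the link lands inside it.
print(resolve("https://andrewufrank.github.io/staticData/selectedEntries.html/", href))
```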

If I remove the constructed folder and rename the index file to selectedEntries.html, then it works both with the simple server locally and when uploaded to GitHub.

I have constructed a small example as a model for an academic homepage and have it running when served. You can find it at andrewufrank.github.io. The issue occurs with the publication list: PDFs in "all publications" (the untouched version from bake) are not found, while the publications in the selected ones (doctored) are found.

The source is [email protected]:andrewufrank/smallAcademic.git. When you run it with sprinkles, be aware of a small change I have made in the source to separate the theme of a site from the data it serves: I introduced a folder theme (this requires two changes in the code, replacing templates with theme/templates). If you dislike this change or think that it has something to do with the current issue, I can quickly undo it (but not tonight).

It is very confusing, and I cannot understand why sprinkles does not just copy the static files when baking. Otherwise I like it a lot, and it is very reactive to changes (once one understands when one has to restart). Next I will try your suggestions for the classification of blogs.

@andrewufrank

Further investigation shows that the addition of the folder is not the cause. Sprinkles only misses the links one step further down the chain and does not include the related folder with the linked pages. This is easily corrected by copying the full static folder.
The confusion was caused by some other error which kept github.io from publishing changed versions while I was experimenting.
Sorry for the bother (and the long message above). What remains is that sprinkles does not follow all links during baking when they originate in static documents, and so does not include the documents further down from there. It is questionable whether this is a proper use for bake, and I may put the docs somewhere else.
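
For reference, following links "down the chain" from a static document could look roughly like the Python sketch below. This is not how sprinkles is implemented. The names `LinkCollector` and `reachable`, and the `fetch` callback that returns HTML text or `None`, are hypothetical and chosen only to show the idea of a transitive crawl over local links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def reachable(start, fetch):
    """Return every local path reachable from `start`.

    `fetch(path)` is a hypothetical callback: it returns the HTML text of
    a page, or None for anything that is not HTML (PDFs, images, missing).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        html = fetch(path)
        if html is None:
            continue  # keep the file, but there is nothing to parse in it
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            parts = urlsplit(href)
            if parts.scheme or parts.netloc:
                continue  # mailto:, https://other.host/ etc. are not local
            target = urlsplit(urljoin(path, href)).path
            if target and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Tiny in-memory site: a page links to a static list, which links to a PDF.
pages = {
    "/index.html": '<a href="staticData/list.html">all</a>',
    "/staticData/list.html": '<a href="docs/a.pdf">paper</a>',
}
print(sorted(reachable("/index.html", pages.get)))
```

In this sketch the PDF is found only because the static list page is itself parsed for links, which is the step described above as missing.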
