Better URL analysis during -bake #26

tdammers · 2019-01-08T09:12:47Z

The -bake command needs to analyze link hrefs and similar attributes to figure out what else to scrape. But the algorithm it uses is a bit too crude: anything not starting with http: or https: is considered a local link. This is false; links can use other schemes, such as mailto:, javascript:, file:, data: etc., which the scraper shouldn't touch.
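To make the intended rule concrete, here is a rough sketch in Python (not Sprinkles' actual Haskell code). The function name `classify_href` and the three result labels are made up for illustration. The idea is to parse the scheme instead of testing string prefixes, and to treat only scheme-less references as local.

```python
from urllib.parse import urlsplit

# Schemes that point at other web sites (assumption: only these count as remote).
REMOTE_SCHEMES = {"http", "https"}


def classify_href(href):
    """Return 'local', 'remote', or 'skip' for a link target (hypothetical helper)."""
    href = href.strip()
    if not href or href.startswith("#"):
        return "skip"  # empty link or same-page fragment: nothing to scrape
    parts = urlsplit(href)
    if parts.scheme:
        # Any explicit scheme other than http/https (mailto:, javascript:,
        # file:, data:, ...) is something the scraper should leave alone.
        return "remote" if parts.scheme in REMOTE_SCHEMES else "skip"
    if parts.netloc:
        return "remote"  # protocol-relative link such as //cdn.example.com/x.js
    return "local"  # no scheme and no host: a path inside this site
```

With this rule, `mailto:someone@example.com` and `javascript:void(0)` are skipped, while `/docs/a.pdf` and `docs/a.pdf` are still treated as local.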

See #23 for example.


andrewufrank · 2019-01-08T21:36:48Z

I am very much confused about uploading a baked version, and about the difference between the baked version (which finds the files with the simple server) and the uploaded version (which does not find the files). I tried with my own host and with github.io, with the same result.

The baking process takes a static file selectedEntries.html and could just copy it, but instead it produces a folder selectedEntries.html which contains an index.html file (which is the original file). Why is this necessary?

Surprisingly, the simpleserver resolves the relative URLs in the selectedEntries.html/index.html file (which are something like /doc/doca/filename.pdf) relative to the folder (not the file).

With simpleserver I see

http://localhost:8000/staticData/docs/docs4/4756_Intersection_Nonconvex_Polygons_Using_Alternate_Hierarchical_Decomposition.pdf

(without the intervening folder), and with the version uploaded to GitHub:

https://andrewufrank.github.io/staticData/selectedEntries.html/docs/docs4/4756_Intersection_Nonconvex_Polygons_Using_Alternate_Hierarchical_Decomposition.pdf

which means the relative URL is resolved relative to the file and then relative to the folder, and the file is not found.
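The two URLs above are what standard relative-reference resolution gives when the same link is resolved once against the page as a file and once against it as a folder (with a trailing slash). A small Python sketch of this, where the relative link `docs/docs4/paper.pdf` is a shortened, hypothetical stand-in for the real file name, and whether either server actually adds the trailing slash is an assumption:

```python
from urllib.parse import urljoin

# Hypothetical relative link as it might appear inside selectedEntries.html.
href = "docs/docs4/paper.pdf"

# Base URL treated as a file: the last path segment is dropped before joining.
as_file = urljoin("http://localhost:8000/staticData/selectedEntries.html", href)

# Base URL treated as a folder (trailing slash): the segment is kept.
as_folder = urljoin("https://andrewufrank.github.io/staticData/selectedEntries.html/", href)

print(as_file)    # http://localhost:8000/staticData/docs/docs4/paper.pdf
print(as_folder)  # https://andrewufrank.github.io/staticData/selectedEntries.html/docs/docs4/paper.pdf
```

This only illustrates how the two observed URLs can arise from one link; it is not a confirmed diagnosis of what either server does.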

If I remove the constructed folder and rename the index file to selectedEntries.html, then it works both with the simple server locally and uploaded on GitHub.

I have constructed a small example as a model for an academic homepage and have it running when served. You can find it at andrewufrank.github.io. The issue occurs with the publication list: PDFs in "all publications" are not found (untouched version from bake), while the publications in the selected ones (doctored) are found.

The source is [email protected]:andrewufrank/smallAcademic.git. When you run it with sprinkles, be aware of a small change I have made in the source to separate the theme of a site from the data it serves: I introduced a folder theme (this requires two changes in the code, replacing templates with theme/templates). If you dislike this change or think that it has something to do with the current issue, I can quickly undo it (but not tonight).

It is very confusing and I cannot understand why sprinkles does not just copy the static files when baking. Otherwise I like it a lot, and it is very reactive to changes (once one understands when one has to restart). Next I will try your suggestions for the classification of blogs.

andrewufrank · 2019-01-09T10:38:53Z

Further investigation shows that the addition of the folder is not the cause. Sprinkles only misses the links one step further down the chain and does not insert the related folder with the linked pages. This is easily corrected by copying the full static folder.
The confusion was caused by some other error which kept github.io from publishing the changed versions while I was experimenting.
Sorry for the bother (and the long message above). What remains is that sprinkles does not follow all links in baking when they originate in static documents, and then does not include the documents further down from there. It is questionable whether this is a proper use for bake, and I may put the docs somewhere else.
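For illustration, following links that originate in a static HTML document would need something like the sketch below. This is Python, not how Sprinkles does or should do it; the names `LinkCollector` and `local_links` are made up. It collects href and src values from the static file and keeps only those without a scheme or host.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def local_links(html):
    """Return the links in `html` that have neither a scheme nor a host."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    result = []
    for link in collector.links:
        parts = urlsplit(link)
        if not parts.scheme and not parts.netloc and not link.startswith("#"):
            result.append(link)
    return result
```

Each returned link would then have to be resolved against the location of the static document and fetched in turn, which is the step reported as missing here.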
