
Scraper error reporting #22

Open · georgegebbett opened this issue May 16, 2022 · 1 comment
Labels: enhancement (New feature or request)

@georgegebbett (Owner)

As I have discovered over the course of this project, recipe metadata is not embedded in pages in any uniform way. So far this has been addressed by users reporting failed page scrapes on GitHub/Reddit, after which the scraper logic is updated.

Ideally I would like unscrapable pages to be reported back automatically, so that trends can be identified and the scraper updated.

The way I see this working is that the user opts in, and that preference is stored on their user object in Mongo. When they encounter a scraping failure, the backend sends a request containing the URL of the failed page to some kind of central API. These URLs would then be stored somewhere for review, and the scraper could be updated accordingly.
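A minimal sketch of what that backend call might look like, assuming a hypothetical `errorReportingOptIn` flag on the user document and a placeholder central endpoint (neither exists yet):

```typescript
// Hypothetical sketch of the opt-in reporting flow described above.
// The errorReportingOptIn field and the endpoint URL are assumptions,
// not part of the existing codebase.

interface User {
  id: string;
  errorReportingOptIn: boolean; // stored on the user object in Mongo
}

const CENTRAL_REPORTING_API = "https://example.com/api/scrape-failures"; // placeholder

export async function reportScrapeFailure(user: User, failedUrl: string): Promise<void> {
  if (!user.errorReportingOptIn) return; // only report for users who opted in

  try {
    await fetch(CENTRAL_REPORTING_API, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: failedUrl, reportedAt: new Date().toISOString() }),
    });
  } catch {
    // Reporting is best-effort; a failure here should never affect the user.
  }
}
```

Reporting only the URL (plus a timestamp) keeps the payload free of personal data, which should make the opt-in easier to justify.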

georgegebbett added the enhancement label on May 16, 2022
@JohannesFleischer

I don't think altering the parser every time a popular site cannot be scraped is the best way forward. A better approach would be modular, site-dependent scraping, with the generic parser as a fallback option.

I think this is feasible, since a single site will almost certainly keep its schema consistent and is unlikely to change it. And every site presents the same kinds of content, so there is always a fairly straightforward way to write conversion logic. This mapping logic could then be written into configuration files which could be imported or managed via the UI.

This would enable anyone to easily create a mapping for a site, test it, and contribute it to the project (see the sketch below). It would reduce your workload and, in my opinion, make the whole process of adapting to new sites much quicker.
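As a rough illustration, a per-site mapping config could look something like this; the field names, selectors, and hostname are made up for the example, and the generic parser would remain the fallback when no mapping matches:

```typescript
// Hypothetical sketch of a per-site mapping with the generic parser as fallback.
// Selectors and hostnames here are illustrative only.

interface SiteMapping {
  hostname: string;
  selectors: {
    title: string;
    ingredients: string;
    steps: string;
  };
}

// Configs like this could live in contributed files or be managed via the UI.
const siteMappings: SiteMapping[] = [
  {
    hostname: "www.example-recipes.com",
    selectors: {
      title: "h1.recipe-title",
      ingredients: "ul.ingredients li",
      steps: "ol.instructions li",
    },
  },
];

export function findMapping(url: string): SiteMapping | undefined {
  const { hostname } = new URL(url);
  return siteMappings.find((m) => m.hostname === hostname);
}

// Usage idea: if findMapping(url) returns a config, scrape with its selectors;
// otherwise fall back to the existing generic metadata parser.
```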

What do you think?
