
Scraper error reporting #22

Open · georgegebbett opened this issue May 16, 2022 · 1 comment
Labels: enhancement (New feature or request)

@georgegebbett (Owner)

As I have discovered over the course of this project, recipe metadata is not embedded in pages in any uniform way. So far this has been addressed by users reporting failed page scrapes on GitHub/Reddit, after which the scraper logic is updated.

Ideally I would like unscrapable pages to be reported back automatically, so that trends can be identified and the scraper updated.

The way I see this working is that the user opts in, and that preference is stored on their user object in Mongo. When they encounter a scraping failure, the backend sends a request containing the URL of the failed page to some kind of central API. These URLs would then be stored somewhere for review, and the scraper could be updated accordingly.
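A minimal sketch of what that backend call might look like, assuming a hypothetical `errorReportingOptIn` flag on the user document and a placeholder central endpoint (neither exists yet):

```typescript
// Hypothetical sketch of the opt-in reporting flow described above.
// The errorReportingOptIn field and the endpoint URL are assumptions,
// not part of the existing codebase.

interface User {
  id: string;
  errorReportingOptIn: boolean; // stored on the user object in Mongo
}

const CENTRAL_REPORTING_API = "https://example.com/api/scrape-failures"; // placeholder

export async function reportScrapeFailure(user: User, failedUrl: string): Promise<void> {
  if (!user.errorReportingOptIn) return; // only report for users who opted in

  try {
    await fetch(CENTRAL_REPORTING_API, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: failedUrl, reportedAt: new Date().toISOString() }),
    });
  } catch {
    // Reporting is best-effort; a failure here should never affect the user.
  }
}
```

Reporting only the URL (plus a timestamp) keeps the payload free of personal data, which should make the opt-in easier to justify.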

georgegebbett added the enhancement label on May 16, 2022
@JohannesFleischer

I don't think altering the parser every time a popular site cannot be scraped is the best way forward. A better approach would be modular, site-dependent scraping, with the generic parser as a fallback option.

I think this is feasible, since a single site will almost certainly keep its schema consistent and is unlikely to change it. And every site presents the same kinds of content, so there is always a fairly straightforward way to write conversion logic. This mapping logic could then be written into configuration files which could be imported or managed via the UI.

This would enable anyone to easily create a mapping for a site, test it, and contribute it to the project (see the sketch below). It would reduce your workload and, in my opinion, make the whole process of adapting to new sites much quicker.
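As a rough illustration, a per-site mapping config could look something like this; the field names, selectors, and hostname are made up for the example, and the generic parser would remain the fallback when no mapping matches:

```typescript
// Hypothetical sketch of a per-site mapping with the generic parser as fallback.
// Selectors and hostnames here are illustrative only.

interface SiteMapping {
  hostname: string;
  selectors: {
    title: string;
    ingredients: string;
    steps: string;
  };
}

// Configs like this could live in contributed files or be managed via the UI.
const siteMappings: SiteMapping[] = [
  {
    hostname: "www.example-recipes.com",
    selectors: {
      title: "h1.recipe-title",
      ingredients: "ul.ingredients li",
      steps: "ol.instructions li",
    },
  },
];

export function findMapping(url: string): SiteMapping | undefined {
  const { hostname } = new URL(url);
  return siteMappings.find((m) => m.hostname === hostname);
}

// Usage idea: if findMapping(url) returns a config, scrape with its selectors;
// otherwise fall back to the existing generic metadata parser.
```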

What do you think?
