As I have discovered over the course of this project, Recipe metadata is not embedded in pages in any kind of uniform way. So far this has been handled by users reporting failed page scrapes on GitHub/Reddit, after which the scraper logic is updated.
Ideally I would like unscrapable pages to be reported back automatically, so that trends can be identified and the scraper updated.
The way I see this working is that the user opts in, and that preference is stored on their user object in Mongo. When they then encounter a scraping failure, the backend makes a request to some kind of central API with the URL of the failed page. These URLs are then stored for review, and the scraper can be updated to handle them.
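A minimal sketch of what the opt-in report could look like on the backend. The endpoint URL, the `report_failures` flag name, and the payload shape are all assumptions for illustration, not existing parts of the project:

```python
import requests

# Hypothetical central collection endpoint for failed scrapes.
REPORT_ENDPOINT = "https://example.com/api/scrape-failures"

def report_scrape_failure(user: dict, failed_url: str, error: Exception) -> None:
    """POST the failed page URL for later review, only if the user opted in."""
    if not user.get("report_failures"):  # opt-in flag stored on the Mongo user object
        return
    try:
        requests.post(
            REPORT_ENDPOINT,
            json={"url": failed_url, "error": str(error)},
            timeout=5,
        )
    except requests.RequestException:
        # Reporting is best-effort; a failed report should never break the user's request.
        pass
```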
I don't think altering the parser every time a popular site can't be scraped is the best approach. A better approach would be modular, site-specific scraping, with the generic parser as a fallback option.
I think this is feasible because a given site will almost certainly keep its own schema consistent and is unlikely to change it, and every site exposes the same kinds of content, so there is always a fairly straightforward way to write conversion logic. This mapping logic could then live in configuration files which could be imported or managed via the UI (see the sketch below).
This would let anybody easily create a mapping for a site, test it, and contribute it to the project. It would reduce your workload and, in my opinion, make the whole process of adapting to new sites much quicker.
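As a rough sketch of the idea, a per-site mapping could be a small set of selectors keyed by hostname, with the existing generic parser used when no mapping exists. The field names, selectors, `soup` (assumed to be a BeautifulSoup document), and the `generic_parse()` callback are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class SiteMapping:
    """CSS selectors that map one site's markup onto recipe fields."""
    title: str
    ingredients: str
    instructions: str

# Hypothetical per-site configs; in practice these could live in JSON/YAML
# files contributed to the project or managed via the UI.
SITE_MAPPINGS = {
    "www.example-recipes.com": SiteMapping(
        title="h1.recipe-title",
        ingredients="li.ingredient",
        instructions="div.method p",
    ),
}

def scrape(url, soup, generic_parse):
    """Use a site-specific mapping when one exists, else fall back to the generic parser."""
    mapping = SITE_MAPPINGS.get(urlparse(url).netloc)
    if mapping is None:
        return generic_parse(soup)  # fallback: the existing metadata-based parser
    return {
        "title": soup.select_one(mapping.title).get_text(strip=True),
        "ingredients": [el.get_text(strip=True) for el in soup.select(mapping.ingredients)],
        "instructions": [el.get_text(strip=True) for el in soup.select(mapping.instructions)],
    }
```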