Problem
I discovered a large-ish number of duplicate recipes in the search index today, and have been tracing back through the (re)crawling and (re)indexing logic to figure out how this could have happened.
I think that there are perhaps three logical inconsistencies in the code that cause problems:
1. Unlike the crawl_recipe worker method, the index_recipe worker method does not check for duplicate recipes before (re)indexing content -- and as a result, manual reindexing of recipes may insert duplicates if the operator is not careful when selecting a set of recipes.
2. We had what looked to me like a bug that meant that crawl_recipe may not have been matching recipes to their earliest-known crawled equivalent. This has been addressed by commit d1e0901 (and follow-up bugfix b2e0ef3, suggesting the need for some code refactoring here).
3. The URL handled by the crawler service (reciperadar/recipes.py) and the RecipeURL constructed by the backend recipe worker(s) do not necessarily agree. This doesn't seem great; we should be able to accept oldest-known, latest-known, or anywhere in-between and expect reliable behaviour.
Analysis
Deduplication
The Celery task workflow during recrawling is crawl_recipe -> index_recipe, compared to the reindexing workflow of simply index_recipe. Sometimes it's more convenient and performant to use the latter, especially because it does not involve any calls to the crawler service and (potentially) external HTTP requests, so it can generally be much faster.
However, it's difficult to strictly enforce the set of recipes that may be selected for reindexing by operators, and we should make the backend robust against mistakes there (because they will happen). This implies that we should move the de-duplication logic (which detects duplicates, and remaps the src and dst on the Recipe so that duplicates merge) into the index_recipe worker method; a rough sketch follows below.
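As a rough, hypothetical sketch of the shape this could take -- Recipe, db, search, hash_of and the url/src/dst fields below are illustrative assumptions, not the current reciperadar code:

```python
from celery import shared_task

# Recipe, db, search and hash_of are stand-ins for the application's own model,
# session, search client and hash(src) derivation -- assumptions for illustration.


@shared_task
def index_recipe(recipe_id):
    recipe = db.session.get(Recipe, recipe_id)
    if recipe is None:
        return

    # Perform the same duplicate check that crawl_recipe does today: look up
    # the earliest-known crawl for this recipe (per the refactoring proposed
    # below, this lookup would live on Recipe and use its resolved URL).
    earliest = recipe.find_earliest_crawl()
    if earliest and earliest.url != recipe.src:
        # Remap src/dst so the duplicate merges into the earliest-known recipe
        # instead of creating a second search document.
        recipe.dst = recipe.src
        recipe.src = earliest.url
        db.session.commit()

    # The search document key is hash(src), so after the remap the write lands
    # on the primary document for this recipe rather than on a duplicate.
    search.index(index="recipes", id=hash_of(recipe.src), body=recipe.to_doc())
```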
Refactoring
It may make sense to relocate the find_earliest_crawl and find_latest_crawl utility methods from RecipeURL to Recipe, and to always drive the latter based on the resolved URL of the recipe (see the sketch after the goals below).
The goals here are multiple, and roughly in this priority order:
To ensure that latest- and earliest-URL lookups for a recipe are handled correctly, robustly, and with the latest-known information at indexing time, so that recipe duplication should not occur.
To minimize the number of database queries required to perform the lookups (currently, for example, this requires querying/constructing a RecipeURL object, and the lookups themselves query the CrawlURL model).
To organize the code in a comprehensible way.
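A minimal sketch of what that relocation could look like, assuming SQLAlchemy-style models and assumed column names (resolved_url on Recipe, resolved_url and crawled_at on CrawlURL) -- the real schema may differ:

```python
from sqlalchemy import asc, desc


class Recipe(db.Model):  # db.Model stands in for whatever base class Recipe already uses
    # ... existing columns, including a resolved_url (name assumed) ...

    def find_earliest_crawl(self):
        # Drive the lookup from the recipe's resolved URL and query CrawlURL
        # directly, avoiding the intermediate RecipeURL construction.
        return (
            db.session.query(CrawlURL)
            .filter(CrawlURL.resolved_url == self.resolved_url)
            .order_by(asc(CrawlURL.crawled_at))
            .first()
        )

    def find_latest_crawl(self):
        return (
            db.session.query(CrawlURL)
            .filter(CrawlURL.resolved_url == self.resolved_url)
            .order_by(desc(CrawlURL.crawled_at))
            .first()
        )
```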
I'm coming to the conclusion that we're likely to continue to have some level of duplication within the Recipe model (database table) under the current schema.
As a result, I think what's important is to hide-and-redirect the duplicate recipes, while continuing to display a single, primary copy of the recipe.
The modelling problem here is that we use hash(src) of a recipe as the search engine key for the document -- but src tracks the earliest-known URL for a recipe, and may change over time. I don't think that invariant can be broken, and it seems like a good way to anchor recipes to their (best-as-we-can-tell) origins.
So I think the trick is that, as the best-known src changes, we ensure that only one of the n documents -- the one whose ID matches hash(src) -- is displayed at a time. All others should be hidden. See also #71.
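As a toy illustration of that visibility rule -- the sha256-based document_id and the hidden/redirect_to fields below are placeholders, not the actual index schema or hash(src) derivation:

```python
import hashlib


def document_id(url: str) -> str:
    # Stand-in for however hash(src) is actually derived for the search index.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def mark_visibility(documents, src):
    """Given every indexed document for a recipe and its current best-known
    src, keep exactly one document visible and hide-and-redirect the rest."""
    primary_id = document_id(src)
    for doc in documents:
        doc["hidden"] = doc["id"] != primary_id
        if doc["hidden"]:
            doc["redirect_to"] = primary_id
    return documents


# Example: two documents indexed for the same recipe under different src values;
# only the one matching the current best-known src stays visible.
docs = [
    {"id": document_id("https://example.org/old-url")},
    {"id": document_id("https://example.org/new-url")},
]
mark_visibility(docs, src="https://example.org/old-url")
```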