This repository has been archived by the owner on Sep 20, 2024. It is now read-only.
Meta issue where I collect some proposals for future work.
reduce the number of irrelevant results
Do not follow outlinks of pages that were classified as irrelevant.
Curate a blacklist of hosts that are irrelevant to the whole 'data' topic. (Provide a way to add to the list from the Crawl Metrics or Search Result UI?)
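As a rough illustration of the two proposals above, here is a minimal Python sketch of an outlink filter. Everything in it is an assumption for illustration: the function names, the `page_is_relevant` flag, and the placeholder hosts are hypothetical and are not taken from the project's code.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; these hosts are placeholders, not real project data.
BLACKLISTED_HOSTS = {"ads.example.com", "tracker.example.net"}


def is_blacklisted(url, blacklist=BLACKLISTED_HOSTS):
    """Return True if the URL's host is a blacklisted host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == entry or host.endswith("." + entry) for entry in blacklist)


def outlinks_to_follow(page_is_relevant, outlinks, blacklist=BLACKLISTED_HOSTS):
    """Select the outlinks of a page that should enter the crawl pipeline.

    Outlinks of a page classified as irrelevant are dropped entirely;
    outlinks of a relevant page are kept unless their host is blacklisted.
    """
    if not page_is_relevant:
        return []
    return [url for url in outlinks if not is_blacklisted(url, blacklist)]
```

Matching subdomains as well as exact hosts is a design choice of this sketch; a UI action that adds a host to the list would only need to insert into the `blacklist` set.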
improve result sorting
The current score cannot take the manual classification into account, because it is computed while indexing. Computing the score at query time (score-on-query) would allow for that.
Results that have no automatic classification (e.g. pages that are not in English) are sorted badly.
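A score-on-query approach could look roughly like the following Python sketch. The `Result` fields, the neutral fallback value, and the rule that a manual label overrides the automatic score are all assumptions made for illustration, not the project's actual data model or ranking logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    url: str
    auto_score: Optional[float]          # None if the page has no automatic classification
    manual_label: Optional[bool] = None  # True = relevant, False = irrelevant, None = not reviewed


# Assumed fallback for unclassified pages, so they land mid-list instead of at an extreme.
NEUTRAL_SCORE = 0.5


def query_time_score(result):
    """Compute the ranking score when the query runs, not when the page is indexed."""
    if result.manual_label is True:
        return 1.0
    if result.manual_label is False:
        return 0.0
    if result.auto_score is None:
        return NEUTRAL_SCORE
    return result.auto_score


def sort_results(results):
    """Sort results from most to least relevant."""
    return sorted(results, key=query_time_score, reverse=True)
```

Because the score is computed per query, a manual classification made after indexing takes effect on the next search without re-indexing the page.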
improve scheduling of concurrent crawls:
Currently, pages are fetched in the order in which they are discovered / inserted into the crawl pipeline.
For two concurrent crawl jobs, this means that pages from the second job are only crawled once the first job has completed its first crawls.
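One possible alternative to the single first-in-first-out order described above is to keep one queue per crawl job and serve the jobs in round-robin order. The Python sketch below assumes a hypothetical `RoundRobinFrontier` class and job identifiers; it does not reflect the project's actual pipeline.

```python
from collections import deque


class RoundRobinFrontier:
    """Hypothetical crawl frontier that alternates between concurrent crawl jobs."""

    def __init__(self):
        self._queues = {}         # job_id -> deque of pending URLs, FIFO within a job
        self._rotation = deque()  # job ids that currently have pending URLs

    def add(self, job_id, url):
        queue = self._queues.setdefault(job_id, deque())
        if not queue:
            # The job had nothing pending, so it joins the end of the rotation.
            self._rotation.append(job_id)
        queue.append(url)

    def next_url(self):
        """Return (job_id, url) for the next fetch, or None if nothing is pending."""
        if not self._rotation:
            return None
        job_id = self._rotation.popleft()
        queue = self._queues[job_id]
        url = queue.popleft()
        if queue:
            # The job still has work, so it goes to the back of the rotation.
            self._rotation.append(job_id)
        return job_id, url
```

With this scheme, a second job gets its first fetch after at most one fetch per already-running job, regardless of how many pages those jobs have queued.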
improve scoring & content extraction for non-English pages
train more models (??)
develop/enhance a scalable way of supporting arbitrary languages