enhancements #30

noerw · 2018-08-16T07:51:25Z

Meta issue where I collect some proposals for future work.

reduce amount of irrelevant results
- Do not follow outlinks of pages that were classified as irrelevant
- Curate a blacklist of hosts irrelevant to the whole 'data' topic. (Provide a way to add to the list from the Crawl Metrics or Search Result UI?)
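A minimal sketch of what the host blacklist check could look like, assuming the list is a plain set of host names and that a blacklisted domain should also cover its subdomains. The names `BLACKLISTED_HOSTS` and `is_blacklisted` and the example hosts are placeholders, not part of the project:

```python
from urllib.parse import urlsplit

# Hypothetical static blacklist; these hosts are placeholders only.
BLACKLISTED_HOSTS = {"example.com", "ads.example.org"}


def is_blacklisted(url, blacklist=BLACKLISTED_HOSTS):
    """Return True if the URL's host, or any parent domain of it, is listed."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "www.example.com", then "example.com", then "com".
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))
```

A URL filter in the crawl pipeline could drop every URL for which `is_blacklisted(url)` is true before it is scheduled for fetching.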
improve result sorting
- The current score cannot take the manual classification into account, because it is computed while indexing. A score-on-query approach would allow for that.
- Results that have no automatic classification (e.g. not in English) are sorted badly
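To illustrate the score-on-query idea, here is a rough sketch that combines a stored automatic score with an optional manual classification at query time. The field names (`score`, `manual`), the label values and the neutral default of 0.5 for results without an automatic classification are all assumptions made for the example:

```python
def query_score(auto_score, manual_label=None):
    """Combine an index-time score with an optional manual classification."""
    # A manual classification overrides the automatic score (assumed policy).
    if manual_label == "relevant":
        return 1.0
    if manual_label == "irrelevant":
        return 0.0
    # Results without an automatic classification get a neutral score
    # (assumption), so they are not pushed to the very end of the list.
    return auto_score if auto_score is not None else 0.5


def sort_results(results):
    """Sort result dicts by their query-time score, best first."""
    return sorted(
        results,
        key=lambda r: query_score(r.get("score"), r.get("manual")),
        reverse=True,
    )
```

Because the score is computed per query, a manual classification takes effect immediately, without reindexing the page.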
improve scheduling of concurrent crawls
Currently, pages are fetched in the order in which they are discovered and inserted into the crawl pipeline.
For two concurrent crawl jobs, this means that pages from the second job are only crawled once the first job has completed its first crawls.
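One possible way to avoid this starvation is to interleave the jobs round-robin instead of using a single first-in, first-out order. The sketch below only illustrates that idea under the assumption that each job has its own URL queue. `interleave` is a hypothetical helper and not how the crawl pipeline actually schedules fetches:

```python
from collections import deque


def interleave(queues):
    """Yield (job_id, url) pairs, taking one URL per job in turn."""
    pending = deque((job, deque(urls)) for job, urls in queues.items())
    while pending:
        job, urls = pending.popleft()
        if urls:
            yield job, urls.popleft()
            # Re-queue the job at the back so other jobs get their turn.
            pending.append((job, urls))
        # Jobs with no URLs left are simply dropped from the rotation.
```

With this order, the first page of a second job is fetched right after the first page of the first job, no matter how many pages the first job has already queued.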
improve scoring & content-extraction for non-English pages
- train more models (??)
- develop/enhance a scalable way of supporting arbitrary languages

matthesrieke · 2018-08-16T13:17:59Z

Some thoughts on the blacklist:

Blacklisting of certain domains
- static blacklist is straightforward
- dynamic blacklist requires update of Storm Crawler release + inclusion of blacklist management API + UI

noerw · 2018-08-31T09:34:52Z

A static blacklist is implemented.
Outlinks of pages classified as unrelated (with a confidence of 0.5 and up) are not followed.
Result sorting is now based on a score computed at query time, which takes the manual classification into account.
Concurrent crawls should be fetched in parallel now.
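
For reference, the outlink rule mentioned above boils down to a check like the following. This is only a sketch: the label value `"unrelated"` and the function name are assumptions, and only the threshold of 0.5 is taken from the comment itself:

```python
# Threshold from the comment above: 0.5 and up counts as unrelated.
UNRELATED_CONFIDENCE_THRESHOLD = 0.5


def should_follow_outlinks(label, confidence):
    """Return False for pages classified as unrelated with enough confidence."""
    is_unrelated = label == "unrelated"  # label name is an assumption
    return not (is_unrelated and confidence >= UNRELATED_CONFIDENCE_THRESHOLD)
```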

noerw added the enhancement and meta labels on Aug 16, 2018
