This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

enhancements #30

Open
4 of 7 tasks
noerw opened this issue Aug 16, 2018 · 2 comments
Labels
enhancement New feature or request meta

Comments

@noerw
Contributor

noerw commented Aug 16, 2018

Meta issue where I collect some proposals for future work.

  • reduce amount of irrelevant results

    • Do not follow outlinks of pages that were classified as irrelevant
    • Curate a blacklist of hosts irrelevant to the whole 'data' topic. (Provide a way to add to the list from the Crawl Metrics or Search Result UI?)
  • improve result sorting

    • The current score cannot take the manual classification into account, because it is computed while indexing. A score-on-query approach would allow for that.
    • Results that have no automatic classification (e.g. pages not in English) are ranked poorly
  • improve scheduling of concurrent crawls:
    Currently, pages are fetched in the order in which they are discovered and inserted into the crawl pipeline.
    With two concurrent crawl jobs, this means that pages from the second job are only crawled once the first job has completed its first crawls.

  • improve scoring & content-extraction for non-English pages

    • train more models (??)
    • develop/enhance a scalable way of supporting arbitrary languages
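
For the scheduling point above, one possible direction is to keep a separate queue per crawl job and serve the jobs in turn, instead of using a single first-in-first-out pipeline. The following is only a minimal sketch of that idea: `RoundRobinFrontier`, the job ids and the example URLs are hypothetical names chosen for illustration and are not part of the crawler's code or of StormCrawler.

```python
from collections import OrderedDict, deque


class RoundRobinFrontier:
    """Hypothetical fetch queue that interleaves URLs of concurrent crawl jobs."""

    def __init__(self):
        # One FIFO queue per crawl job, ordered by whose turn is next.
        self._queues = OrderedDict()

    def add(self, job_id, url):
        """Queue a discovered URL under the crawl job it belongs to."""
        self._queues.setdefault(job_id, deque()).append(url)

    def pop(self):
        """Return (job_id, url) for the next fetch, or None if nothing is pending."""
        if not self._queues:
            return None
        job_id, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        if queue:
            # Move the job to the back so the other jobs get their turn first.
            self._queues.move_to_end(job_id)
        else:
            # The job has no pending URLs left, so drop its queue.
            del self._queues[job_id]
        return job_id, url


frontier = RoundRobinFrontier()
for i in range(3):
    frontier.add("job-a", f"https://a.example/{i}")
frontier.add("job-b", "https://b.example/0")

order = []
while (item := frontier.pop()) is not None:
    order.append(item)
print(order)
```

With this ordering, the single page of `job-b` is fetched right after the first page of `job-a`, rather than after all three of them. Within a job, the discovery order is preserved.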
@noerw noerw added enhancement New feature or request meta labels Aug 16, 2018
@matthesrieke
Member

matthesrieke commented Aug 16, 2018

Some thoughts on the blacklist:

  • Blacklisting of certain domains
    • static blacklist is straightforward
    • a dynamic blacklist requires updating the StormCrawler release and adding a blacklist management API + UI
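
To make the static/dynamic distinction concrete, here is a rough sketch of a host blacklist combined with the "do not follow outlinks of irrelevant pages" rule from the first comment. It assumes that a blacklisted host also covers its subdomains. `HostBlacklist` and `outlinks_to_follow` are hypothetical names, and the code does not use StormCrawler's actual URL filter interfaces.

```python
from urllib.parse import urlsplit


class HostBlacklist:
    """Hypothetical set of hosts whose pages should never be crawled."""

    def __init__(self, hosts=()):
        # Static case: the list is fixed when the crawler starts.
        self._hosts = {h.lower() for h in hosts}

    def add(self, host):
        # Dynamic case: a management API or UI would call this at runtime.
        self._hosts.add(host.lower())

    def blocks(self, url):
        """True if the URL's host, or any parent domain of it, is blacklisted."""
        host = urlsplit(url).hostname or ""
        parts = host.split(".")
        return any(".".join(parts[i:]) in self._hosts for i in range(len(parts)))


def outlinks_to_follow(page_is_relevant, outlinks, blacklist):
    """Drop all outlinks of irrelevant pages, and blacklisted hosts otherwise."""
    if not page_is_relevant:
        return []
    return [url for url in outlinks if not blacklist.blocks(url)]


blacklist = HostBlacklist(["ads.example"])
links = ["https://data.example/set/1", "https://cdn.ads.example/banner"]

kept = outlinks_to_follow(True, links, blacklist)
skipped = outlinks_to_follow(False, links, blacklist)

# Simulates a host being added at runtime, as a dynamic blacklist would allow.
blacklist.add("data.example")
kept_after_update = outlinks_to_follow(True, links, blacklist)
```

The static variant only needs the constructor argument. The dynamic variant additionally needs some way to reach `add` while the crawl is running, which is where the API and UI work mentioned above would come in.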

@noerw
Contributor Author

noerw commented Aug 31, 2018
