This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

Architecture

Norwin edited this page Aug 29, 2018 · 15 revisions
  • microservices (but not going wild with it)
  • backend is stream-based and operates continuously, so results are available as they come in
  • abstraction of external APIs within controller via generic classes
  • all persistence in search index
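
The "abstraction of external APIs via generic classes" idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `Provider`, `ApiAbstraction`, `TranslationRequest`, `TranslationResult`, and the `echo` provider are hypothetical names invented here, not taken from the project's code.

```typescript
// Hypothetical sketch: none of these names come from the project itself.

/** One pluggable backend for a given kind of external API. */
interface Provider<Req, Res> {
  readonly name: string;
  call(request: Req): Promise<Res>;
}

/** Generic wrapper that hides which external provider serves a request. */
class ApiAbstraction<Req, Res> {
  private readonly providers = new Map<string, Provider<Req, Res>>();

  /** Registers a provider under its own name; returns this for chaining. */
  register(provider: Provider<Req, Res>): this {
    this.providers.set(provider.name, provider);
    return this;
  }

  /** Names of all registered providers, in registration order. */
  names(): string[] {
    return Array.from(this.providers.keys());
  }

  /** Forwards the request to the named provider, or rejects if it is unknown. */
  async call(providerName: string, request: Req): Promise<Res> {
    const provider = this.providers.get(providerName);
    if (provider === undefined) {
      throw new Error(`unknown provider: ${providerName}`);
    }
    return provider.call(request);
  }
}

// Assumed request/response shapes for a translation API.
interface TranslationRequest {
  text: string;
  targetLanguage: string;
}

interface TranslationResult {
  text: string;
}

// Stand-in provider that makes no network calls; a real one would wrap an HTTP client.
const echoTranslator: Provider<TranslationRequest, TranslationResult> = {
  name: "echo",
  call: async (req) => ({ text: `[${req.targetLanguage}] ${req.text}` }),
};

const translation = new ApiAbstraction<TranslationRequest, TranslationResult>()
  .register(echoTranslator);
```

The same generic class could be instantiated with different request and response types for the search, notification, and export abstractions, so the controller's routes would not need to know which external service is behind each one.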

[Diagram: Streaming]

  • Crawler + Content Extractor: Apache Storm

    • gets seed URLs by polling ES, then starts crawling them
    • extracts content as part of crawl topology
    • scores content as part of crawl topology
  • Controller: TypeScript + Express + Swagger

    • serves UI, proxies ES
    • translation API abstraction
    • search API abstraction
    • notification API abstraction
    • result export API abstraction
    • inserts seed URLs into ES to start a crawl
    • purges status indices to stop crawl once condition X is met
  • Elasticsearch: persistence

    • "results" index with fetched & scored URLs
    • "status" index for the recursive crawl: one per crawl job, so that jobs can be stopped independently
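
The start/stop mechanism described in the list above (the controller inserts seed URLs, the crawler polls for them, and purging a job's status index stops that job) can be sketched as follows. This is a minimal illustration, not the project's implementation: `InMemoryIndices` is a hypothetical in-memory stand-in for Elasticsearch, the `status-<jobId>` naming scheme is an assumption, and "purging" is assumed here to mean deleting the whole index.

```typescript
// Hypothetical sketch: in-memory stand-in for Elasticsearch; all names are illustrative.

interface SeedDoc {
  url: string;
}

class InMemoryIndices {
  private readonly indices = new Map<string, SeedDoc[]>();

  /** Adds a document, creating the index on first write. */
  index(indexName: string, doc: SeedDoc): void {
    const docs = this.indices.get(indexName) ?? [];
    docs.push(doc);
    this.indices.set(indexName, docs);
  }

  /** What a polling crawler would see; empty once the index is gone. */
  poll(indexName: string): SeedDoc[] {
    return [...(this.indices.get(indexName) ?? [])];
  }

  /** Removes the index; returns false if it did not exist. */
  deleteIndex(indexName: string): boolean {
    return this.indices.delete(indexName);
  }
}

// Assumed naming scheme: one status index per crawl job.
function statusIndexName(jobId: string): string {
  return `status-${jobId}`;
}

class CrawlController {
  private readonly store: InMemoryIndices;

  constructor(store: InMemoryIndices) {
    this.store = store;
  }

  /** Starts a crawl by writing the seed URLs into the job's own status index. */
  startCrawl(jobId: string, seedUrls: string[]): string {
    const indexName = statusIndexName(jobId);
    for (const url of seedUrls) {
      this.store.index(indexName, { url });
    }
    return indexName;
  }

  /** Stops one job by purging only its status index; other jobs are untouched. */
  stopCrawl(jobId: string): boolean {
    return this.store.deleteIndex(statusIndexName(jobId));
  }
}
```

Because each job writes to its own index, stopping one job leaves the others' pending URLs in place, which is the reason the notes give for using one status index per crawl job.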

Message Flow startCrawl()

[Diagram: message flow of startCrawl()]
