Architecture

microservices (but not going wild with it)
backend is stream-based and operates continuously -> results are available as they come in
abstraction of external APIs within controller via generic classes
all persistence in search index
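
A minimal sketch of what "abstraction of external APIs via generic classes" could look like in the controller. All names here (`ExternalApi`, `EchoTranslationApi`, `translateTitle`) are hypothetical and not taken from the actual code; the stand-in provider only tags the text instead of calling a real translation service.

```typescript
// Hypothetical sketch: every name below is an assumption, not taken from the codebase.

/** Generic base class: one request type in, one response type out. */
abstract class ExternalApi<Req, Res> {
  constructor(readonly provider: string) {}

  /** Provider-specific transport, implemented once per external service. */
  protected abstract send(request: Req): Promise<Res>;

  /** Uniform entry point, so callers see the same error shape for every provider. */
  async call(request: Req): Promise<Res> {
    try {
      return await this.send(request);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`${this.provider} request failed: ${reason}`);
    }
  }
}

interface TranslationRequest { text: string; targetLang: string; }
interface TranslationResult { text: string; }

/** Stand-in provider that tags the text instead of calling a real service. */
class EchoTranslationApi extends ExternalApi<TranslationRequest, TranslationResult> {
  constructor() { super("echo-translator"); }

  protected async send(request: TranslationRequest): Promise<TranslationResult> {
    if (request.text.length === 0) throw new Error("empty text");
    return { text: `[${request.targetLang}] ${request.text}` };
  }
}

/** Controller code depends only on the generic type, never on a concrete provider. */
async function translateTitle(
  api: ExternalApi<TranslationRequest, TranslationResult>,
  title: string,
): Promise<string> {
  const result = await api.call({ text: title, targetLang: "en" });
  return result.text;
}
```

Swapping the translation provider then means adding one subclass; search, notification and export could presumably follow the same pattern with their own request/response types.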

Diagram Streaming

Crawler + Content Extractor: Apache Storm
- gets seed URLs by polling ES, then starts crawling them
- extracts content as part of crawl topology
- scores content as part of crawl topology
Controller: TypeScript + Express + Swagger
- serves UI, proxies ES
- translation API abstraction
- search API abstraction
- notification API abstraction
- result export API abstraction
- inserts seed URLs into ES to start crawl
- purges status indices to stop crawl once condition X is met
Elasticsearch: persistence
- "results" index with fetched & scored URLs
- "status" index for recursive crawl; one per crawl job, so jobs can be stopped independently
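
The start/stop mechanics listed above can be sketched roughly as follows. This is an illustration under assumptions, not the real controller: the `status-<jobId>` naming scheme, the document shape, the `DISCOVERED`/`FETCHED` values and all identifiers are hypothetical, and an in-memory class stands in for Elasticsearch so the example runs without a cluster.

```typescript
// Hypothetical sketch: index naming, document shape and all identifiers are assumptions.

interface StatusDoc { url: string; status: "DISCOVERED" | "FETCHED"; }

/** Minimal slice of a search-index client; a real one would wrap Elasticsearch. */
interface IndexClient {
  index(indexName: string, id: string, doc: StatusDoc): Promise<void>;
  search(indexName: string, status: StatusDoc["status"]): Promise<StatusDoc[]>;
  deleteIndex(indexName: string): Promise<void>;
}

/** In-memory stand-in so the sketch is runnable on its own. */
class InMemoryIndexClient implements IndexClient {
  private readonly indices = new Map<string, Map<string, StatusDoc>>();

  async index(indexName: string, id: string, doc: StatusDoc): Promise<void> {
    let docs = this.indices.get(indexName);
    if (!docs) {
      docs = new Map<string, StatusDoc>();
      this.indices.set(indexName, docs);
    }
    docs.set(id, doc);
  }

  async search(indexName: string, status: StatusDoc["status"]): Promise<StatusDoc[]> {
    const docs = this.indices.get(indexName);
    return docs ? Array.from(docs.values()).filter((d) => d.status === status) : [];
  }

  async deleteIndex(indexName: string): Promise<void> {
    this.indices.delete(indexName);
  }
}

/** One status index per crawl job (assumed naming scheme). */
const statusIndexFor = (jobId: string): string => `status-${jobId}`;

class CrawlController {
  constructor(private readonly client: IndexClient) {}

  /** Start a crawl by writing the seed URLs into the job's own status index. */
  async startCrawl(jobId: string, seedUrls: string[]): Promise<void> {
    for (const url of seedUrls) {
      await this.client.index(statusIndexFor(jobId), url, { url, status: "DISCOVERED" });
    }
  }

  /** Stop a crawl by purging its status index; other jobs are left untouched. */
  async stopCrawl(jobId: string): Promise<void> {
    await this.client.deleteIndex(statusIndexFor(jobId));
  }
}

/** What the crawler's polling step would find for a given job. */
async function pollPending(client: IndexClient, jobId: string): Promise<string[]> {
  const docs = await client.search(statusIndexFor(jobId), "DISCOVERED");
  return docs.map((d) => d.url);
}
```

Because each job owns its index, purging one index leaves the poller with nothing to fetch for that job while other jobs keep running. When stopping is triggered (the "condition X" above) is deliberately left out of the sketch.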

Message Flow `startCrawl()`

(diagram: message flow of `startCrawl()`)