- Given a website URL, crawl the site for links and keep only those that point to PDF files.
- Download and parse the PDF files and extract their text content.
- Use the Stanford NER library to identify named entities in the extracted text.
- Save the results to the GIG API.
1. Set the category variable to match the source category (e.g. Tenders, Gazettes).
2. go run pdf_crawler.go "https://site.lk"