Given a website URL, crawl the site for all links and keep only the PDF URLs.
Download and parse the PDF files and extract their text content.
Use the Stanford NER library to identify named entities in the extracted text.
Save the results to the GIG API.
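
The link-filtering step could look roughly like the sketch below. It is not the code in pdf_crawler.go: the names extractHrefs and filterPDFLinks are hypothetical, and the regular expression is a simplification (a real crawler would more likely use an HTML parser). It only illustrates resolving each link against the page URL and keeping those whose path ends in .pdf.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// hrefPattern is a deliberately simple matcher for double-quoted href
// attributes. It is an assumption for this sketch, not the crawler's parser.
var hrefPattern = regexp.MustCompile(`(?i)href\s*=\s*"([^"]+)"`)

// extractHrefs (hypothetical helper) returns the raw href values in an HTML string.
func extractHrefs(html string) []string {
	var hrefs []string
	for _, m := range hrefPattern.FindAllStringSubmatch(html, -1) {
		hrefs = append(hrefs, m[1])
	}
	return hrefs
}

// filterPDFLinks (hypothetical helper) resolves each href against pageURL and
// keeps the unique absolute URLs whose path ends in ".pdf", ignoring case.
func filterPDFLinks(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	var pdfs []string
	seen := make(map[string]bool)
	for _, h := range hrefs {
		ref, err := url.Parse(strings.TrimSpace(h))
		if err != nil {
			continue // skip malformed links
		}
		abs := base.ResolveReference(ref)
		// Check the path only, so query strings such as "?v=2" do not matter.
		if !strings.HasSuffix(strings.ToLower(abs.Path), ".pdf") {
			continue
		}
		s := abs.String()
		if !seen[s] {
			seen[s] = true
			pdfs = append(pdfs, s)
		}
	}
	return pdfs, nil
}

func main() {
	page := `<a href="/docs/tender1.pdf">T1</a> <a href="about.html">About</a> <a href="files/Gazette.PDF?v=2">G</a>`
	links, err := filterPDFLinks("https://site.lk/notices/", extractHrefs(page))
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

Matching on the URL path rather than the full string keeps links that carry a query string, and resolving against the page URL turns relative links into downloadable absolute ones.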

How to Run:

1. Set the category variable according to the source category (e.g. Tenders, Gazettes).
2. Run the crawler with the site URL as its argument: go run pdf_crawler.go "https://site.lk"