RGI source archive user story: Full Text Search of Resources including PDFs #34

Open
anderspeders opened this issue May 10, 2017 · 7 comments

Comments

@anderspeders

Why

As a user, I want to be able to search for words in the document header so that I can identify all documents mentioning, for example, “sovereign wealth funds” or “environmental impact assessments”.

What

Notes

@mattfullerton
Contributor

By "document header", do you mean the content of the actual PDF in question?

@anderspeders
Author

Yes, that was the intention.

(I do not assume that search within structured PDFs is a feature included in CKAN)

@mattfullerton
Contributor

Basically the content (or partial content, or whatever) has to make it into the search index. This happens for tabular data files. For other types of file, there are a couple of approaches out there:

https://github.com/stadt-karlsruhe/ckanext-extractor
https://github.com/transparenzportalhamburg/ckanext-fulltext (and https://github.com/transparenzportalhamburg/ckanext-highlighting)
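To make the "content has to make it into the search index" point concrete, here is a minimal sketch, not taken from either extension above. It assumes the PDF text has already been extracted elsewhere and is available as a plain dict keyed by resource id. The function names, the `fulltext` field name and the `max_chars` limit are all hypothetical. In a real CKAN plugin this kind of step would typically sit in an indexing hook such as `before_index` on `IPackageController`, with Solr doing the actual searching; the `matches` helper below is only a stand-in for that.

```python
def add_fulltext_to_index_doc(pkg_dict, extracted_texts, max_chars=100_000):
    """Return a copy of the dataset dict with a combined text field added.

    pkg_dict:        dataset dict with a "resources" list (CKAN-like shape).
    extracted_texts: hypothetical mapping of resource id -> extracted text.
    max_chars:       hypothetical per-resource cap to keep the index small.
    """
    doc = dict(pkg_dict)  # shallow copy so the original dict gets no new key
    parts = []
    for resource in pkg_dict.get("resources", []):
        text = extracted_texts.get(resource.get("id"), "")
        if text:
            parts.append(text[:max_chars])
    # "fulltext" is an assumed field name, not a confirmed CKAN/Solr field.
    doc["fulltext"] = "\n".join(parts)
    return doc


def matches(doc, phrase):
    """Naive case-insensitive phrase match; a stand-in for a real Solr query."""
    return phrase.lower() in doc.get("fulltext", "").lower()


# Illustrative data only.
pkg = {
    "id": "ds-1",
    "title": "Example source document",
    "resources": [{"id": "res-1", "format": "PDF"}],
}
texts = {"res-1": "Report on sovereign wealth funds and environmental impact assessments."}

doc = add_fulltext_to_index_doc(pkg, texts)
print(matches(doc, "Sovereign Wealth Funds"))
```

The expensive part (file-to-text conversion, possibly with OCR) is deliberately left out of the sketch; it would run as a separate background step so that indexing itself stays fast.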

@anderspeders
Author

Thanks Matt - would you be able to pick the most suitable approach and deploy a test on the staging server?

@mattfullerton self-assigned this on May 23, 2017
@mattfullerton changed the title from "RGI source archive user story: Search" to "RGI source archive user story: Full Text Search of Resources including PDFs" on May 29, 2017
@mattfullerton
Contributor

@anderspeders @deirdrelee This is a big piece of work which I'm eager to do (I've been wanting to do this on a CKAN instance for years), but it can't be done quickly. It will have repercussions for server power, deployment, etc. We need to establish its priority and then set aside a solid block of time to test and implement it.

@mattfullerton
Contributor

As requested:
I estimate about 4 days of work to get the system up and running, including sending PDFs to a file-to-text service with OCR.

Relevant:
https://github.com/transparenzportalhamburg/ckanext-fulltext
https://github.com/stadt-karlsruhe/ckanext-extractor

Anything beyond that (i.e. trying to ensure it works well for all 9000+ PDFs, including foreign script types) is hard to estimate without having the system up and running first.

@anderspeders
Author

Thanks Matt,

Appreciate this quick assessment. I suggest that we park it for now. We might get back to this work later, so it is definitely helpful to have it costed.

I am curious whether there are any examples of CKAN platforms (city or country) which have developed a public-facing search functionality that leverages the OCR text?
