RGI source archive user story: Full Text Search of Resources including PDFs #34
Comments
By "document header", do you mean the actual PDF in question?
Yes, that was the intention. (I do not assume that search within structured PDFs is a feature included in CKAN.)
Basically the content (or partial content, or whatever) has to make it into the search index. This happens for tabular data files. For other types of file, there are a couple of approaches out there: https://github.com/stadt-karlsruhe/ckanext-extractor
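To make "the content has to make it into the search index" concrete, here is a minimal sketch of the idea in Python. It is not how CKAN or ckanext-extractor actually implements it: the function name, the `resource_fulltext` field, and the `max_chars` cap are all hypothetical, and the extracted text is assumed to have been produced already by some separate extraction step (OCR or a PDF text extractor).

```python
from typing import Any, Dict, Mapping

# Hypothetical cap so one huge PDF does not bloat the index document.
MAX_CHARS = 100_000


def add_fulltext_to_index_doc(
    index_doc: Mapping[str, Any],
    extracted_text: Mapping[str, str],
    max_chars: int = MAX_CHARS,
) -> Dict[str, Any]:
    """Return a copy of the dataset's index document with resource text added.

    index_doc:      dataset dict about to be sent to the search index
                    (assumed to carry a "resources" list with "id" keys).
    extracted_text: resource id -> text already extracted from that file.
    """
    doc = dict(index_doc)
    parts = []
    for resource in index_doc.get("resources", []):
        text = extracted_text.get(resource.get("id"), "")
        if text:
            # Partial content is enough to make the document findable.
            parts.append(text[:max_chars])
    if parts:
        # "resource_fulltext" is an assumed field name, not a CKAN built-in.
        doc["resource_fulltext"] = "\n".join(parts)
    return doc
```

In a real deployment, logic like this would presumably live wherever the platform builds its index documents, and the search schema would need a matching full-text field.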
Thanks Matt - would you be able to pick the most suitable approach and deploy a test on the staging server?
@anderspeders @deirdrelee This is a big piece of work which I'm eager to do (I've been wanting to do this on a CKAN for years), but it can't be done quickly. It will have repercussions for server power, deployment etc. We need to establish priority and then take a solid piece of time to test it and do it.
As requested: Relevant: Beyond that (i.e. trying to ensure it works well for all 9000+ PDFs including foreign script types) is hard to estimate without having the system up and running first.
Thanks Matt, I appreciate this quick assessment. I suggest that we park it for now. We might get back to this work later, so it is definitely helpful to have it costed. I am curious whether there are any examples of CKAN platforms (city or country) which have developed a public-facing search functionality that leverages the OCR text?
Why
As a User I want to be able to search for words in the document header so that I can identify all documents mentioning, for example, “sovereign wealth funds” or “environmental impact assessments”.
What
Notes