RGI source archive user story: Full Text Search of Resources including PDFs #34

anderspeders · 2017-05-10T22:39:24Z

Why

As a user, I want to be able to search for words in the document header so that I can identify all documents mentioning, for example, “sovereign wealth funds” or “environmental impact assessments”.

What

Notes

mattfullerton · 2017-05-15T19:53:19Z

By "document header", do you mean the content of the actual PDF in question?

anderspeders · 2017-05-15T20:36:21Z

Yes, that was the intention.

(I do not assume that search within structured PDFs is a feature included in CKAN.)

mattfullerton · 2017-05-16T07:25:55Z

Basically the content (or partial content, or whatever) has to make it into the search index. This happens for tabular data files. For other types of file, there are a couple of approaches out there:

https://github.com/stadt-karlsruhe/ckanext-extractor
https://github.com/transparenzportalhamburg/ckanext-fulltext (and https://github.com/transparenzportalhamburg/ckanext-highlighting)
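
To illustrate why the file content has to reach the search index, here is a minimal sketch in plain Python. It is not CKAN's or Solr's actual indexing code and it does not use either extension's API. The `ToyIndex` class, the resource id `res-1`, and the sample strings are all hypothetical, and the search matches documents containing every query word rather than an exact phrase.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


class ToyIndex:
    """A tiny in-memory inverted index: token -> set of resource ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, resource_id, *fields):
        """Index every token of every given text field under resource_id."""
        for field in fields:
            for token in tokenize(field):
                self.postings[token].add(resource_id)

    def search(self, query):
        """Return ids of resources containing every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


# Hypothetical resource: the metadata does not mention the search terms,
# but the text extracted from the PDF does.
title = "Annual report 2016"
extracted_pdf_text = "The sovereign wealth fund was audited this year."

metadata_only = ToyIndex()
metadata_only.add("res-1", title)

with_fulltext = ToyIndex()
with_fulltext.add("res-1", title, extracted_pdf_text)

print(metadata_only.search("sovereign wealth"))  # empty: terms are not indexed
print(with_fulltext.search("sovereign wealth"))  # contains "res-1"
```

With only the metadata indexed, the query finds nothing. Once the extracted text is indexed alongside it, the same query returns the resource. Both extensions linked above aim to get extracted file text into the index in this sense.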

anderspeders · 2017-05-19T03:51:14Z

Thanks Matt - would you be able to pick the most suitable approach and deploy a test on the staging server?

mattfullerton · 2017-09-01T08:40:11Z

@anderspeders @deirdrelee This is a big piece of work which I'm eager to do (I've been wanting to do this on a CKAN instance for years), but it can't be done quickly. It will have repercussions for server power, deployment, etc. We need to establish its priority and then set aside a solid block of time to test and implement it.

mattfullerton · 2017-10-09T11:34:10Z

As requested:
I estimate about 4 days of work to get the system up and running, including sending PDFs to a file-to-text service with OCR.

Relevant:
https://github.com/transparenzportalhamburg/ckanext-fulltext
https://github.com/stadt-karlsruhe/ckanext-extractor

Anything beyond that (i.e. trying to ensure it works well for all 9000+ PDFs, including foreign script types) is hard to estimate without having the system up and running first.

anderspeders · 2017-10-09T12:48:45Z

Thanks Matt,

Appreciate this quick assessment. I suggest that we park it for now. We might get back to this work later, so it is definitely helpful to have it costed.

I am curious whether there are any examples of CKAN platforms (city or country) that have developed a public-facing search functionality that leverages the OCR text.

anderspeders added the RGI source library label May 10, 2017

mattfullerton added the CKAN (data presentation) label May 11, 2017

mattfullerton mentioned this issue May 16, 2017

RGI explanation of "Resource Search" #51

Closed

mattfullerton self-assigned this May 23, 2017

mattfullerton changed the title ~~RGI source archive user story: Search~~ RGI source archive user story: Full Text Search of Resources including PDFs May 29, 2017

deirdrelee modified the milestone: M 17_09_08 NRGI Fortnightly Sprint Aug 9, 2017

mattfullerton removed this from the M 17_09_08 NRGI Fortnightly Sprint milestone Sep 1, 2017

mattfullerton added the Deployment label Sep 1, 2017

deirdrelee assigned EricSoroos and unassigned mattfullerton Jul 9, 2018
