Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial integration of NER into Spring batch #24

Open
mbelousov opened this issue Feb 17, 2017 · 6 comments
Open

Initial integration of NER into Spring batch #24

mbelousov opened this issue Feb 17, 2017 · 6 comments
Assignees
Labels

Comments

@mbelousov
Copy link

Hey guys,

Please have a look into feature/ner-integration branch. I have integrated NER (from feature/NamedEntityExtractor) into our development branch and implemented majority of stuff that we have discussed during a call.

  1. I've decided to keep separate data structure for annotated document (so renamed GATEDocument into AnnotatedDocument that has gate.Document as a property). Ideally it should extend gate.Document rather than have it as a property.

  2. It turned out that we need to have two files per patient (both .lst and .def), but if you know how to minimise the number of files please let me know.

  3. Currently it writes gateDoc XML into result txt files, so what we need to do is to implement getScrubbedContent properly for AnnotatedDocument.

  4. There are bunch of stuff that we could try to optimise in order to keep I/O operations to minimum, I have ignored this for now, but feel free to contribute.

Before we can merge it into development, let's test it and discuss all issues that are left.

@mbelousov
Copy link
Author

Regarding 3, it was already implemented (feature/ScrubberProcessor) by @dehghana, just need to be integrated.

@hkkenneth
Copy link

In NamedEntitiesWriter.write(), what's the purpose of Clear person gazetteers part? Do you want to make sure files generated in the previous run are erased?

This has problem in situation when one patient has many documents.
E.g. in the first call to write(), 2 documents for patient number 5 are passed in, those NEs will be written. Then in the second call to write(), 3 other documents for patient number 5 are passed in. The second 3 doc will overwrite the NEs found in the first 2 doc.

Also, Append to person gazetteers may input duplicate named entities in the .lst file - possibly not a breaking problem but may be inefficient.

A way to reduce I/O could be

  1. For list of Annotated files passed to NamedEntitiesWriter.write(), consolidate their named entities in a per patient fashion (e.g. this gives us Map<String, List> for Map<patient ID, NEs>)
  2. For each patient ID, read the existing ID.lst, and append to it the NEs which are not present in the file already.

The rest seems to work quite well.

@hkkenneth
Copy link

But I don't think DeidentifiedDocumentWriter.write() is the right place to "Remove all patients gazetteers".

First, it may delete .lst and .def prematurely (e.g. in the above example, when the first 2 doc for patient number 5 have been written out by DeidentifiedDocumentWriter, their .lst and .def got deleted and SecondPassItemProcessor.process() cannot find them when processing the second batch of 3 doc. (Worst thing is, even if the gazetteer files are not found, it is ignored silently by GATE).

Second, it may tamper with retry logic in Spring batch.

I think a better place is to do such clean up is in a listener after 2nd pass (inherently we need to keep track of how much doc per patient and count how many have been processed / written)?

@hkkenneth
Copy link

Thanks @mbelousov , I don't have further comments.

One last long term question:
Can writer be concurrent when we support multi-threading? Will two threads attempt to write to same patient's gazetteer file together?

@mbelousov
Copy link
Author

@hkkenneth, we have committed changes to fix the logic and for issue #9. Would be awesome if you could have a look into workflow and add job listener to clean output folders. Feel free to create a separate issue for that, but if you have any concerns regarding the logic of the current workflow let's discuss it here. Thanks!

@hkkenneth
Copy link

Not urgent - for the next version we may take a look whether duplicate entries in the .lst files may be bad for performance. Compare that with the approach of keeping a set of "written" NEs per patient throughout the entire firstPass until all gazetteer files are written?

@mbelousov mbelousov added the v1.0 label Mar 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants