Feature request: Support for English notes? #374

Open
ruiouyangVA opened this issue Feb 11, 2025 · 2 comments
Comments

@ruiouyangVA
ruiouyangVA commented Feb 11, 2025

Would it be difficult to adapt EDS-NLP for extracting custom named entities for clinical notes in English?
I would want a component along the lines of https://aphp.github.io/edsnlp/latest/pipes/ner/scores/charlson/.

(If this is not a recommended approach, let me know too; pointers in the right direction would be appreciated. Presumably I could build something on top of medspacy.)

What would need to be changed? I assume the tokenizers at

edsnlp/language.py
edsnlp/conjugator.py

And potentially the patterns at

/pipes/core/normalizer/pollution/patterns.py
/pipes/misc/consultation_dates/patterns.py
pipes/misc/dates/patterns/relative.py
/pipes/misc/dates/patterns/duration.py
/pipes/misc/dates/patterns/current.py
/pipes/misc/dates/patterns/absolute.py

/pipes/misc/quantities/patterns.py
/pipes/misc/reason/patterns.py
/pipes/misc/sections/patterns.py
/pipes/misc/tables/patterns.py 

/pipes/terminations.py

/pipes/qualifiers/negation/patterns.py
edsnlp/pipes/qualifiers/hypothesis/patterns.py ?
scripts/conjugate_verbs.py

As well as the resources at

edsnlp/resources/*(json|csv).gz

The code architecture is very clean, and many of the modifications (e.g., detecting sentence boundaries with newlines) make a lot of sense. Also, I am one person, and reinventing the wheel seems like a lot of work...

Thanks!

@percevalw
Member

Hi @ruiouyangVA, thanks for your interest in our library! Your approach makes sense.

While these components were originally designed with French in mind, many should work across most Latin languages, including English. Have you tried the eds.charlson matcher on your documents? Does it work out of the box? Indeed, looking at the patterns file, there’s nothing inherently language-specific about it.

Components like negation, parenthood, and hypothesis detection are adaptations of the NegEx and ConText algorithms, so translating their patterns should yield good results.
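To illustrate why translating the patterns is usually enough, here is a minimal sketch of the NegEx idea in plain Python. It is not the EDS-NLP implementation: the trigger and termination lists are hypothetical English examples, and the scope rule (a trigger negates everything after it until a termination term) is a deliberate simplification that only looks at the first occurrence of the entity.

```python
import re

# Hypothetical English trigger terms, for illustration only; a real
# translation of the French patterns would be much larger.
NEGATION_TRIGGERS = ["no", "not", "without", "denies", "negative for"]
# Terms that end the scope of a negation (also assumed for this sketch).
TERMINATIONS = ["but", "however", "although"]


def _alternation(terms):
    # Longest first so multi-word triggers win over shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"


TRIGGER_RE = re.compile(_alternation(NEGATION_TRIGGERS), re.IGNORECASE)
TERMINATION_RE = re.compile(_alternation(TERMINATIONS), re.IGNORECASE)


def is_negated(sentence: str, entity: str) -> bool:
    """Return True if the first occurrence of `entity` follows a negation
    trigger with no termination term in between."""
    match = re.search(re.escape(entity), sentence, re.IGNORECASE)
    if match is None:
        return False
    for trigger in TRIGGER_RE.finditer(sentence):
        if trigger.end() > match.start():
            continue  # the trigger does not precede the entity
        between = sentence[trigger.end():match.start()]
        if TERMINATION_RE.search(between) is None:
            return True
    return False
```

With this structure, moving to another language only means swapping the two term lists; the matching logic stays the same.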

If you make any pattern adjustments for your language, we’d be happy to integrate them! We don’t yet have a formal multilingual API, but we’re open to exploring solutions.

One problem I see is that we don't have access to clinical reports in English, so we wouldn't be able to check changes made to the non-French patterns. Regardless of the package you end up using, how do you plan to validate your extraction pipeline?

@ruiouyangVA
Author

Hello @percevalw, thanks for the sanity check and encouragement! I work with a subset of pathology data specific to prostate cancer, so I can't test the eds.charlson matcher. I will try to run a quick test today to see if I can add a pattern and extract some information. I wasn't sure whether NegEx and ConText play into the NER pipelines, so I'll also take a closer look at that while I'm at it.
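As a rough idea of what "add a pattern and extract some information" could look like, here is a minimal sketch using only Python's `re` module. It does not use the EDS-NLP API; the labels, the regular expressions, and the `extract_entities` helper are all hypothetical and would need to be written and checked against real notes.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    label: str
    start: int
    end: int
    text: str


# Hypothetical patterns for illustration; real pathology reports will
# need patterns written and checked against actual notes.
PATTERNS = {
    "gleason_score": r"gleason\s+(?:score\s+)?\d\s*\+\s*\d",
    "psa": r"\bpsa\b",
}

COMPILED = {
    label: re.compile(pattern, re.IGNORECASE)
    for label, pattern in PATTERNS.items()
}


def extract_entities(text: str) -> List[Entity]:
    """Return every pattern match as a labelled character span, in text order."""
    entities = [
        Entity(label, m.start(), m.end(), m.group())
        for label, regex in COMPILED.items()
        for m in regex.finditer(text)
    ]
    return sorted(entities, key=lambda e: e.start)
```

Keeping the patterns in a dictionary separate from the matching code is what makes a per-language pattern file practical.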

Thank you for the clear documentation btw, it's very handy!

We do have ground truth data labelled by clinical collaborators, which will form our validation set. If the EDS team would like English clinical reports to check any changes, I would need to ask about how that is usually handled.
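For the validation itself, a minimal sketch of exact-match scoring against the labelled set could look like the following. It assumes each annotation can be reduced to a hashable tuple such as `(doc_id, label, start, end)`; that representation and the `precision_recall_f1` helper are assumptions for illustration, not something provided by EDS-NLP.

```python
from typing import Set, Tuple

# Assumed annotation shape: (doc_id, label, start_char, end_char).
Annotation = Tuple[str, str, int, int]


def precision_recall_f1(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two annotation sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Exact span matching is strict; a relaxed variant that counts overlapping spans as matches may be more appropriate if annotators and patterns disagree on span boundaries.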

Lots of work to be done and not enough people as usual :') but hopefully I can build on the EDS work and make it faster in the future to add / debug / improve our extractors.
