refacto: fix paths after pipelines to pipes renaming
percevalw committed Dec 1, 2023
1 parent b2b242b commit e4b1eb0
Showing 164 changed files with 516 additions and 560 deletions.
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,14 @@ EDS-NLP is a collaborative NLP framework that aims primarily at extracting infor
At its core, it is a collection of components or pipes, either rule-based functions or
deep learning modules. These components are organized into a novel efficient and modular pipeline system, built for hybrid and multitask models. We use [spaCy](https://spacy.io) to represent documents and their annotations, and [Pytorch](https://pytorch.org/) as a deep-learning backend for trainable components.

EDS-NLP is versatile and can be used on any textual document. The rule-based components are fully compatible with spaCy's pipelines, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.
EDS-NLP is versatile and can be used on any textual document. The rule-based components are fully compatible with spaCy's components, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.

Check out our interactive [demo](https://aphp.github.io/edsnlp/demo/) !

## Features

- [Rule-based components](https://aphp.github.io/edsnlp/latest/pipelines/) for French clinical notes
- [Trainable components](https://aphp.github.io/edsnlp/latest/pipelines/trainable): NER, Span classification
- [Rule-based components](https://aphp.github.io/edsnlp/latest/pipes/) for French clinical notes
- [Trainable components](https://aphp.github.io/edsnlp/latest/pipes/trainable): NER, Span classification
- Support for trained multitask models with [weights sharing](https://aphp.github.io/edsnlp/latest/concepts/torch-component/#sharing-subcomponents)
- [Fast inference](https://aphp.github.io/edsnlp/latest/concepts/inference/), with multi-GPU support out of the box
- Easy to use, with a spaCy-like API
Expand Down
2 changes: 1 addition & 1 deletion changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@

### Changed

- Pipes (in edsnlp/pipelines) are now lazily loaded, which should improve the loading time of the library.
- Pipes (in edsnlp/pipes) are now lazily loaded, which should improve the loading time of the library.
- `to_disk` methods can now return a config to override the initial config of the pipeline (e.g., to load a transformer directly from the path storing its fine-tuned weights)
- The `eds.tokenizer` tokenizer has been added to entry points, making it accessible from the outside
- Deprecate old connectors (e.g. BratDataConnector) in favor of the new `edsnlp.data` API
Expand Down
14 changes: 7 additions & 7 deletions contributing.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,12 @@ We welcome contributions ! There are many ways to help. For example, you can:

1. Help us track bugs by filing issues
2. Suggest and help prioritise new functionalities
3. Develop a new pipeline ! Fork the project and propose a new functionality through a pull request
3. Develop a new pipe ! Fork the project and propose a new functionality through a pull request
4. Help us make the library as straightforward as possible, by simply asking questions on whatever does not seem clear to you.

## Development installation

To be able to run the test suite, run the example notebooks and develop your own pipeline, you should clone the repo and install it locally.
To be able to run the test suite, run the example notebooks and develop your own pipeline component, you should clone the repo and install it locally.

<div class="termy">

Expand Down Expand Up @@ -80,15 +80,15 @@ python -m pytest

Should your contribution propose a bug fix, we require the bug be thoroughly tested.

### Architecture of a pipeline
### Architecture of a pipeline component

Pipelines should follow the same pattern :
Pipes should follow the same pattern :

```
edsnlp/pipelines/<pipeline>
|-- <pipeline>.py # Defines the component logic
edsnlp/pipes/<pipe>
|-- <pipe>.py # Defines the component logic
|-- patterns.py # Defines matched patterns
|-- factory.py # Declares the pipeline to spaCy
|-- factory.py # Declares the component to spaCy
```

### Style Guide
Expand Down
11 changes: 5 additions & 6 deletions demo/app.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
{pipes}
# Qualifier pipelines
# Qualifier pipes
nlp.add_pipe("eds.negation")
nlp.add_pipe("eds.family")
nlp.add_pipe("eds.hypothesis")
Expand Down Expand Up @@ -109,7 +109,6 @@ def load_model(custom_regex: str, **enabled):
nlp.add_pipe("eds.sentences")

for title, name in PIPES.items():

if name == "drugs":
if enabled["drugs"]:
if enabled["fuzzy_drugs"]:
Expand All @@ -128,7 +127,7 @@ def load_model(custom_regex: str, **enabled):
pipes.append(f'nlp.add_pipe("eds.{name}")')

if pipes:
pipes.insert(0, "# Entity extraction pipelines")
pipes.insert(0, "# Entity extraction pipes")

if custom_regex:
nlp.add_pipe(
Expand Down Expand Up @@ -169,7 +168,7 @@ def load_model(custom_regex: str, **enabled):
"EDS-NLP is a contributive effort maintained by AP-HP's Data Science team. "
"Have a look at the "
"[documentation](https://aphp.github.io/edsnlp/) for "
"more information on the available pipelines."
"more information on the available components."
)

st.sidebar.header("Pipeline")
Expand Down Expand Up @@ -201,8 +200,8 @@ def load_model(custom_regex: str, **enabled):
continue
st_pipes[name] = st.sidebar.checkbox(title, value=True)
st.sidebar.markdown(
"These are just a few of the pipelines provided out-of-the-box by EDS-NLP. "
"See the [documentation](https://aphp.github.io/edsnlp/latest/pipelines/) "
"These are just a few of the components provided out-of-the-box by EDS-NLP. "
"See the [documentation](https://aphp.github.io/edsnlp/latest/pipes/) "
"for detail."
)

Expand Down
4 changes: 2 additions & 2 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ doc.ents[0]._.negation # (6)

1. 'eds' is the name of the language, which defines the [tokenizer](/tokenizers).
2. This example terminology provides a very simple, and by no means exhaustive, list of synonyms for COVID19.
3. In spaCy, pipelines are added via the [`nlp.add_pipe` method](https://spacy.io/api/language#add_pipe). EDS-NLP pipelines are automatically discovered by spaCy.
3. Similarly to spaCy, pipes are added via the [`nlp.add_pipe` method](https://spacy.io/api/language#add_pipe).
4. See the [matching tutorial](tutorials/matching-a-terminology.md) for more details.
5. spaCy stores extracted entities in the [`Doc.ents` attribute](https://spacy.io/api/doc#ents).
6. The `eds.negation` component adds a `negation` custom attribute.
Expand All @@ -71,7 +71,7 @@ To learn more about EDS-NLP, we have prepared a series of tutorials that should

## Available pipeline components

--8<-- "docs/pipelines/index.md:components"
--8<-- "docs/pipes/index.md:components"

## Disclaimer

Expand Down
18 changes: 9 additions & 9 deletions docs/pipes/architecture.md
Original file line number Diff line number Diff line change
@@ -1,40 +1,40 @@
# Basic Architecture

Most pipelines provided by EDS-NLP aim to qualify pre-extracted entities. To wit, the basic usage of the library:
Most pipes provided by EDS-NLP aim to qualify pre-extracted entities. To wit, the basic usage of the library:

1. Implement a normaliser (see `eds.normalizer`)
2. Add an entity recognition component (eg the simple but powerful `eds.matcher`)
3. Add zero or more entity qualification components, such as `eds.negation`, `eds.family` or `eds.hypothesis`. These qualifiers typically help detect false-positives.

## Scope

Since the basic usage of EDS-NLP components is to qualify entities, most pipelines can function in two modes:
Since the basic usage of EDS-NLP components is to qualify entities, most pipes can function in two modes:

1. Annotation of the extracted entities (this is the default). To increase throughput, only pre-extracted entities (found in `doc.ents`) are processed.
2. Full-text, token-wise annotation. This mode is activated by setting the `on_ents_only` parameter to `False`.

The possibility to do full-text annotation implies that one could use the pipelines the other way around, eg detecting all negations once and for all in an ETL phase, and reusing the results consequently. However, this is not the intended use of the library, which aims to help researchers downstream as a standalone application.
The possibility to do full-text annotation implies that one could use the pipes the other way around, eg detecting all negations once and for all in an ETL phase, and reusing the results consequently. However, this is not the intended use of the library, which aims to help researchers downstream as a standalone application.

## Result persistence

Depending on their purpose (entity extraction, qualification, etc), EDS-NLP pipelines write their results to `Doc.ents`, `Doc.spans` or in a custom attribute.
Depending on their purpose (entity extraction, qualification, etc), EDS-NLP pipes write their results to `Doc.ents`, `Doc.spans` or in a custom attribute.

### Extraction pipelines
### Extraction pipes

Extraction pipelines (matchers, the date detector or NER pipelines, for instance) keep their results to the `Doc.ents` attribute directly.
Extraction pipes (matchers, the date detector or NER pipes, for instance) keep their results to the `Doc.ents` attribute directly.

Note that spaCy prohibits overlapping entities within the `Doc.ents` attribute. To circumvent this limitation, we [filter spans][edsnlp.utils.filter.filter_spans], and keep all discarded entities within the `discarded` key of the `Doc.spans` attribute.

Some pipelines write their output to the `Doc.spans` dictionary. We enforce the following doctrine:
Some pipes write their output to the `Doc.spans` dictionary. We enforce the following doctrine:

- Should the pipe extract entities that are directly informative (typically the output of the `eds.matcher` component), said entities are stashed in the `Doc.ents` attribute.
- On the other hand, should the entity be useful to another pipe, but less so in itself (eg the output of the `eds.sections` or `eds.dates` component), it will be stashed in a specific key within the `Doc.spans` attribute.

### Entity tagging

Moreover, most pipelines declare [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes), on the `Doc`, `Span` and/or `Token` objects.
Moreover, most pipes declare [spaCy extensions](https://spacy.io/usage/processing-pipelines#custom-components-attributes), on the `Doc`, `Span` and/or `Token` objects.

These extensions are especially useful for qualifier pipelines, but can also be used by other pipelines to persist relevant information. For instance, the `eds.dates` pipeline:
These extensions are especially useful for qualifier pipes, but can also be used by other pipes to persist relevant information. For instance, the `eds.dates` pipeline component:

1. Populates `#!python Doc.spans["dates"]`
2. For each detected item, keeps the normalised date in `#!python Span._.date`
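
To illustrate the overlap handling described under "Extraction pipes" (kept entities versus entities stashed under a `discarded` key), here is a minimal pure-Python sketch. It is not EDS-NLP's `filter_spans` implementation: the function name `split_overlapping`, the `(start, end)` tuple representation and the "longest span wins" policy are all assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed layout)


def split_overlapping(spans: List[Span]) -> Tuple[List[Span], List[Span]]:
    """Hypothetical helper: keep non-overlapping spans, report the rest.

    Assumed policy: longer spans are preferred, ties go to the earlier span.
    """
    kept: List[Span] = []
    discarded: List[Span] = []
    taken = set()  # token indices already covered by a kept span

    # Sort by decreasing length (start - end is the negative length), then by start.
    for start, end in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if any(i in taken for i in range(start, end)):
            discarded.append((start, end))
        else:
            kept.append((start, end))
            taken.update(range(start, end))

    return sorted(kept), sorted(discarded)


kept, discarded = split_overlapping([(0, 2), (1, 4), (5, 6)])
```

In this sketch, `kept` plays the role of what would end up in `Doc.ents`, and `discarded` the role of the spans stored separately so that no extraction is lost.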
8 changes: 4 additions & 4 deletions docs/pipes/core/contextual-matcher.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@

# Contextual Matcher {: #edsnlp.pipelines.core.contextual_matcher.factory.create_component }
# Contextual Matcher {: #edsnlp.pipes.core.contextual_matcher.factory.create_component }

During feature extraction, it may be necessary to search for additional patterns in their neighborhood, namely:

Expand All @@ -13,7 +13,7 @@ The ContextualMatcher allows to perform this extraction in a clear and concise w

## The configuration file

The whole ContextualMatcher pipeline is basically defined as a list of **pattern dictionaries**.
The whole ContextualMatcher pipeline component is basically defined as a list of **pattern dictionaries**.
Let us see step by step how to build such a list using the example stated just above.

### a. Finding mentions of cancer
Expand Down Expand Up @@ -326,10 +326,10 @@ dict(
)
```

::: edsnlp.pipelines.core.contextual_matcher.factory.create_component
::: edsnlp.pipes.core.contextual_matcher.factory.create_component
options:
only_parameters: true

## Authors and citation

The `eds.matcher` pipeline was developed by AP-HP's Data Science team.
The `eds.matcher` pipeline component was developed by AP-HP's Data Science team.
4 changes: 2 additions & 2 deletions docs/pipes/core/endlines.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Endlines {: #edsnlp.pipelines.core.endlines.factory.create_component }
# Endlines {: #edsnlp.pipes.core.endlines.factory.create_component }

::: edsnlp.pipelines.core.endlines.factory.create_component
::: edsnlp.pipes.core.endlines.factory.create_component
options:
heading_level: 2
show_bases: false
Expand Down
4 changes: 2 additions & 2 deletions docs/pipes/core/matcher.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Matcher {: #edsnlp.pipelines.core.matcher.factory.create_component }
# Matcher {: #edsnlp.pipes.core.matcher.factory.create_component }

::: edsnlp.pipelines.core.matcher.factory.create_component
::: edsnlp.pipes.core.matcher.factory.create_component
options:
heading_level: 2
show_bases: false
Expand Down
26 changes: 13 additions & 13 deletions docs/pipes/core/normalizer.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Normalisation {: #edsnlp.pipelines.core.normalizer.factory.create_component }
# Normalisation {: #edsnlp.pipes.core.normalizer.factory.create_component }

The normalisation scheme used by EDS-NLP adheres to the non-destructive doctrine. In other words,

Expand All @@ -10,8 +10,8 @@ is always true.

To achieve this, the input text is never modified. Instead, our normalisation strategy focuses on two axes:

1. Only the `NORM` and `tag_` attributes are modified by the `normalizer` pipeline ;
2. Pipelines (eg the [`pollution`](#pollution) pipeline) can mark tokens as _excluded_ by setting the extension `Token.tag_` to `EXCLUDED` or as _space_ by setting the extension `Token.tag_` to `SPACE`.
1. Only the `NORM` and `tag_` attributes are modified by the `normalizer` pipeline component ;
2. Pipes (e.g., [`pollution`](#pollution)) can mark tokens as _excluded_ by setting the extension `Token.tag_` to `EXCLUDED` or as _space_ by setting the extension `Token.tag_` to `SPACE`.
It enables downstream matchers to skip excluded tokens.

The normaliser can act on the input text in five dimensions :
Expand All @@ -26,12 +26,12 @@ The normaliser can act on the input text in five dimensions :

We recommend you also **add an end-of-line classifier to remove excess new line characters** (introduced by the PDF layout).

We provide a `endlines` pipeline, which requires training an unsupervised model.
We provide a `endlines` pipeline component, which requires training an unsupervised model.
Refer to [the dedicated page for more information](./endlines.md).

## Usage

The normalisation is handled by the single `eds.normalizer` pipeline. The following code snippet is complete, and should run as is.
The normalisation is handled by the single `eds.normalizer` pipeline component. The following code snippet is complete, and should run as is.

```python
import edsnlp
Expand All @@ -57,19 +57,19 @@ Moreover, every span exposes a `normalized_variant` extension getter, which comp

## Configuration

The pipeline can be configured using the following parameters :
The pipeline component can be configured using the following parameters :

::: edsnlp.pipelines.core.normalizer.factory.create_component
::: edsnlp.pipes.core.normalizer.factory.create_component
options:
only_parameters: true

## Pipelines
## Pipes

Let's review each subcomponent.

### Lowercase

The `eds.lowercase` pipeline transforms every token to lowercase. It is not configurable.
The `eds.lowercase` pipeline component transforms every token to lowercase. It is not configurable.

Consider the following example :

Expand Down Expand Up @@ -98,7 +98,7 @@ get_text(doc, attr="NORM", ignore_excluded=False)

### Accents

The `eds.accents` pipeline removes accents. To avoid edge cases,
The `eds.accents` pipeline component removes accents. To avoid edge cases,
the component uses a specified list of accentuated characters and their unaccented representation,
making it more predictable than using a library such as `unidecode`.
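
As a rough illustration of the explicit-mapping approach described above, the sketch below removes accents with a fixed character table. The mapping shown is a small hypothetical subset, not the list shipped with `eds.accents`, and `strip_accents` is a name chosen for this example only.

```python
import unicodedata

# Hypothetical, deliberately small mapping; the real component ships its own list.
ACCENT_MAP = {
    "à": "a", "â": "a",
    "ç": "c",
    "é": "e", "è": "e", "ê": "e", "ë": "e",
    "î": "i", "ï": "i",
    "ô": "o",
    "ù": "u", "û": "u",
}
TABLE = str.maketrans(ACCENT_MAP)


def strip_accents(text: str) -> str:
    """Replace each listed accented character by its unaccented counterpart."""
    # Compose characters first so that "e" + combining accent matches the table.
    return unicodedata.normalize("NFC", text).translate(TABLE)


original = "Pas de fièvre à l'entrée"
normalized = strip_accents(original)
```

Because every entry maps one character to exactly one character, the normalised string keeps the same length as the composed input, which is consistent with the non-destructive doctrine: offsets computed on the normalised text still point at the right place in the original.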

Expand Down Expand Up @@ -189,7 +189,7 @@ doc = nlp("Phrase avec des espaces \n et un retour à la ligne")

### Pollution

The pollution pipeline uses a set of regular expressions to detect pollutions (irrelevant non-medical text that hinders text processing). Corresponding tokens are marked as excluded (by setting `Token._.excluded` to `True`), enabling the use of the phrase matcher.
The pollution pipeline component uses a set of regular expressions to detect pollutions (irrelevant non-medical text that hinders text processing). Corresponding tokens are marked as excluded (by setting `Token._.excluded` to `True`), enabling the use of the phrase matcher.

Consider the following example :

Expand Down Expand Up @@ -248,7 +248,7 @@ nlp.add_pipe(
|---------------|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|---------------------|
| `information` | Footnote present in a lot of notes, providing information to the patient about the use of its data | "L'AP-HP collecte vos données administratives à des fins ..." | `True` |
| `bars` | Barcodes wrongly parsed as text | "...NBNbWbWbNbWbNBNbNbWbW..." | `True` |
| `biology` | Parsed biology results table. It often contains disease names that often leads to *false positives* with NER pipelines. | "...¦UI/L ¦20 ¦ ¦ ¦20-70 Polyarthrite rhumatoïde Facteur rhumatoide ¦UI/mL ¦ ¦<10 ¦ ¦ ¦ ¦0-14..." | `False` |
| `biology` | Parsed biology results table. It often contains disease names that often leads to *false positives* with NER pipes. | "...¦UI/L ¦20 ¦ ¦ ¦20-70 Polyarthrite rhumatoïde Facteur rhumatoide ¦UI/mL ¦ ¦<10 ¦ ¦ ¦ ¦0-14..." | `False` |
| `doctors` | List of doctor names and specialities, often found in left-side note margins. Also source of potential *false positives*. | "... Dr ABC - Diabète/Endocrino ..." | `True` |
| `web` | Webpages URL and email addresses. Also source of potential *false positives*. | "... www.vascularites.fr ..." | `True` |
| `coding` | Subsection containing ICD-10 codes along with their description. Also source of potential *false positives*. | "... (2) E112 + Oeil (2) E113 + Neuro (2) E114 Démence (2) F03 MA (2) F001+G301 DCL G22+G301 Vasc (2) ..." | `False` |
Expand All @@ -275,4 +275,4 @@ nlp.add_pipe(

## Authors and citation

The `eds.normalizer` pipeline was developed by AP-HP's Data Science team.
The `eds.normalizer` pipeline component was developed by AP-HP's Data Science team.
4 changes: 2 additions & 2 deletions docs/pipes/core/sentences.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Sentences {: #edsnlp.pipelines.core.sentences.factory.create_component }
# Sentences {: #edsnlp.pipes.core.sentences.factory.create_component }

::: edsnlp.pipelines.core.sentences.factory.create_component
::: edsnlp.pipes.core.sentences.factory.create_component
options:
heading_level: 2
show_bases: false
Expand Down
4 changes: 2 additions & 2 deletions docs/pipes/core/terminology.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Terminology {: #edsnlp.pipelines.core.terminology.factory.create_component }
# Terminology {: #edsnlp.pipes.core.terminology.factory.create_component }

::: edsnlp.pipelines.core.terminology.factory.create_component
::: edsnlp.pipes.core.terminology.factory.create_component
options:
heading_level: 2
show_bases: false
Expand Down