docs: update for 0.10.0 release
percevalw committed Dec 1, 2023
1 parent e3fc882 commit 3c2e910
Showing 8 changed files with 52 additions and 16 deletions.
20 changes: 15 additions & 5 deletions README.md
@@ -9,28 +9,38 @@
EDS-NLP
=======

EDS-NLP is a collaborative NLP framework that aims at extracting information from French clinical notes.
EDS-NLP is a collaborative NLP framework that aims primarily at extracting information from French clinical notes.
At its core, it is a collection of components or pipes, either rule-based functions or
deep learning modules. These components are organized into a novel efficient and modular pipeline system, built for hybrid and multitask models. We use [spaCy](https://spacy.io) to represent documents and their annotations, and [Pytorch](https://pytorch.org/) as a deep-learning backend for trainable components.

EDS-NLP is versatile and can be used on any textual document. The rule-based components are fully compatible with spaCy's pipelines, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.
EDS-NLP is versatile and can be used on any textual document. The rule-based components are fully compatible with spaCy's components, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.

Check out our interactive [demo](https://aphp.github.io/edsnlp/demo/)!

## Features

- [Rule-based components](https://aphp.github.io/edsnlp/latest/pipes/) for French clinical notes
- [Trainable components](https://aphp.github.io/edsnlp/latest/pipes/trainable): NER, Span classification
- Support for trained multitask models with [weights sharing](https://aphp.github.io/edsnlp/latest/concepts/torch-component/#sharing-subcomponents)
- [Fast inference](https://aphp.github.io/edsnlp/latest/concepts/inference/), with multi-GPU support out of the box
- Easy to use, with a spaCy-like API
- Compatible with rule-based spaCy pipelines
- Support for various IO formats like [BRAT](https://aphp.github.io/edsnlp/latest/data/standoff/), [JSON](https://aphp.github.io/edsnlp/latest/data/json/), [Parquet](https://aphp.github.io/edsnlp/latest/data/parquet/), [Pandas](https://aphp.github.io/edsnlp/latest/data/pandas/) or [Spark](https://aphp.github.io/edsnlp/latest/data/spark/)

## Quick start

### Installation

You can install EDS-NLP via `pip`. We recommend pinning the library version in your projects, or using a strict package manager like [Poetry](https://python-poetry.org/).

```shell
pip install edsnlp==0.10.0beta1
pip install edsnlp
```

or, if you want to use the trainable components (powered by PyTorch):

```shell
pip install "edsnlp[ml]==0.10.0beta1"
pip install "edsnlp[ml]"
```

### A first pipeline
@@ -63,7 +73,7 @@ doc.ents[0]._.negation
# Out: True
```
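
The hunk above only shows the tail of the quick-start example, where the first extracted entity is checked for negation. As a toy, pure-Python sketch of what a matcher-plus-negation step computes (illustrative only; the cue list and function names below are assumptions, not edsnlp code):

```python
import re

# Assumed toy list of French negation cues (not edsnlp's actual resources)
NEG_CUES = {"pas", "aucun", "sans", "jamais"}

def detect_negated_entity(text, term="covid"):
    """Find `term` in `text`, then look for a negation cue before it."""
    match = re.search(term, text, flags=re.IGNORECASE)
    if match is None:
        return None
    window = text[: match.start()].lower()
    negated = any(cue in window.split() for cue in NEG_CUES)
    return {"entity": match.group(0), "negation": negated}

detect_negated_entity("Le patient n'est pas atteint de covid.")
# Out: {'entity': 'covid', 'negation': True}
```

Real pipelines use tokenization, context windows, and far richer cue lists; this merely illustrates the kind of attribute (`ent._.negation`) the pipeline attaches to each entity.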

## Documentation
## Documentation & Tutorials

Go to the [documentation](https://aphp.github.io/edsnlp) for more information.

7 changes: 4 additions & 3 deletions changelog.md
@@ -1,18 +1,19 @@
# Changelog

## v0.10.0beta2
## v0.10.0

### Added

- New unified `edsnlp.data` API (json, brat, spark, pandas) and `LazyCollection` object
to efficiently read / write data from / to different formats & sources.
- New unified processing API to select the execution execution backends via `docs.configure(...)`
- New unified processing API to select the execution backend via `data.set_processing(...)`
- The training scripts can now use data from multiple concatenated adapters
- Support quantized transformers (compatible with multiprocessing as well!)

### Changed

- Pipes (in edsnlp/pipelines) are now lazily loaded, which should improve the loading time of the library.
- `edsnlp.pipelines` has been renamed to `edsnlp.pipes`, but the old name is still available for backward compatibility
- Pipes (in `edsnlp/pipes`) are now lazily loaded, which should improve the loading time of the library.
- `to_disk` methods can now return a config to override the initial config of the pipeline (e.g., to load a transformer directly from the path storing its fine-tuned weights)
- The `eds.tokenizer` tokenizer has been added to entry points, making it accessible from the outside
- Deprecated old connectors (e.g. BratDataConnector) in favor of the new `edsnlp.data` API
29 changes: 28 additions & 1 deletion docs/concepts/inference.md
@@ -30,7 +30,7 @@ nlp.to("cuda") # same semantics as pytorch
doc = nlp(text)
```

To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edsnlp.processing.multiprocessing.execute_multiprocessing] description below.
To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edsnlp.processing.multiprocessing.execute_multiprocessing_backend] description below.

## Inference on multiple documents {: #edsnlp.core.lazy_collection.LazyCollection }

@@ -49,6 +49,33 @@ A lazy collection contains :

All methods (`.map`, `.map_pipeline`, `.set_processing`) of the lazy collection are chainable, meaning that they return a new object (no in-place modification).
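
The chainable, no-in-place-modification behaviour described above can be sketched with a small frozen dataclass (a generic illustration of the pattern, not edsnlp's actual `LazyCollection` implementation; `LazyOps` is a made-up name):

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class LazyOps:
    """Chainable, immutable recipe: each method returns a new object."""
    ops: tuple = ()
    config: dict = field(default_factory=dict)

    def map(self, fn):
        # A NEW object is returned; self is left untouched
        return replace(self, ops=self.ops + (fn,))

    def set_processing(self, **kwargs):
        return replace(self, config={**self.config, **kwargs})

    def execute(self, items):
        # Nothing runs until execution is actually triggered
        for item in items:
            for op in self.ops:
                item = op(item)
            yield item

base = LazyOps()
chained = base.map(str.strip).map(str.upper).set_processing(num_cpu_workers=4)
assert base.ops == ()  # the original collection is untouched
assert list(chained.execute([" covid "])) == ["COVID"]
```

Because each call returns a new object, intermediate configurations can be stored and branched safely without affecting one another.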

For instance, the following code loads a model, reads a folder of JSON files, applies the model to each document and writes the results to a Parquet folder, using 4 CPU workers and 2 GPU workers.

```{ .python .no-check }
import edsnlp
# Load or create a model
nlp = edsnlp.load("path/to/model")
# Read some data (this is lazy, no data will be read until the end of this snippet)
data = edsnlp.data.read_json("path/to/json_folder", converter="...")
# Apply each pipe of the model to our documents
data = data.map_pipeline(nlp)
# or equivalently: data = nlp.pipe(data)
# Configure the execution
data = data.set_processing(
    # 4 CPUs to parallelize rule-based pipes, IO and preprocessing
    num_cpu_workers=4,
    # 2 GPUs to accelerate deep-learning pipes
    num_gpu_workers=2,
)
# Write the result, this will execute the lazy collection
data.write_parquet("path/to/output_folder", converter="...", write_in_worker=True)
```

### Applying operations to a lazy collection

To apply an operation to a lazy collection, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.
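
A minimal pure-Python sketch of these `.map` semantics (illustrative only; `scale` and `lazy_map` are made-up names, not the edsnlp API):

```python
def scale(x, factor=1):
    """Toy callable; `factor` stands in for the optional keyword arguments."""
    return x * factor

def lazy_map(items, fn, kwargs=None):
    # The callable is applied to each element, with the optional kwargs;
    # the generator keeps the operation lazy.
    kwargs = kwargs or {}
    return (fn(item, **kwargs) for item in items)

assert list(lazy_map([1, 2, 3], scale, {"factor": 10})) == [10, 20, 30]
```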
2 changes: 1 addition & 1 deletion docs/concepts/pipeline.md
Expand Up @@ -58,7 +58,7 @@ arbitrarily chain static components or trained deep learning components. Static

<div style="text-align: center" markdown="1">

![Example of a hybrid pipeline](/assets/images/hybrid-pipeline-example.svg){: style="height:150px" }
![Example of a hybrid pipeline](/assets/images/hybrid-pipeline-example.png){: style="height:150px" }

</div>

4 changes: 2 additions & 2 deletions docs/index.md
@@ -15,13 +15,13 @@ Check out our interactive [demo](https://aphp.github.io/edsnlp/demo/) !
You can install EDS-NLP via `pip`. We recommend pinning the library version in your projects, or using a strict package manager like [Poetry](https://python-poetry.org/).

```{: data-md-color-scheme="slate" }
pip install edsnlp==0.10.0beta1
pip install edsnlp
```

or, if you want to use the trainable components (powered by PyTorch):

```{: data-md-color-scheme="slate" }
pip install "edsnlp[ml]==0.10.0beta1"
pip install "edsnlp[ml]"
```

### A first pipeline
2 changes: 1 addition & 1 deletion edsnlp/__init__.py
@@ -14,7 +14,7 @@
import edsnlp.data # noqa: F401
import edsnlp.pipes

__version__ = "0.10.0beta2"
__version__ = "0.10.0"

BASE_DIR = Path(__file__).parent

2 changes: 0 additions & 2 deletions edsnlp/data/converters.py
Expand Up @@ -2,8 +2,6 @@
Converters are used to convert documents between python dictionaries and Doc objects.
There are two types of converters: readers and writers. Readers convert dictionaries to
Doc objects, and writers convert Doc objects to dictionaries.
Why are these classes instead of functions?
"""
import contextlib
import inspect
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "edsnlp"
description = "A set of spaCy components to extract information from clinical notes written in French"
description = "Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes."
authors = [
{ name = "Data Science - DSN APHP", email = "[email protected]" }
]
