Refactor to use Pytorch for training models #202

percevalw · 2023-04-04T12:11:10Z

Description

This PR refactors EDS-NLP to allow training models and performing inference using PyTorch as the deep-learning backend. Rather than a mere wrapper around PyTorch using spaCy, this is a new framework for building hybrid multi-task models.

To achieve this, instead of patching spaCy's pipeline, a new pipeline was implemented in a similar fashion to aphp/edspdf#12. The new pipeline tries to preserve the existing API, especially for non-machine-learning uses such as rule-based components. This means that users can keep using the library the same way as before (spacy.blank('xx'), nlp.add_pipe(...)), while also having the option to train models using PyTorch. We still use spaCy data structures such as Doc and Span to represent texts and their annotations.
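As a rough illustration of what "a new pipeline that preserves the add_pipe API" can look like, here is a minimal, self-contained sketch. It is not the EDS-NLP implementation: `ToyDoc`, `ToyPipeline`, `register` and `make_keyword_matcher` are hypothetical names invented for this example, and `ToyDoc` merely stands in for spaCy's `Doc` so that the snippet runs without spaCy installed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class ToyDoc:
    """Stand-in for spaCy's Doc (assumption: real code would use spacy.tokens.Doc)."""

    text: str
    # Each entity is stored as (start_char, end_char, label).
    ents: List[Tuple[int, int, str]] = field(default_factory=list)


Component = Callable[[ToyDoc], ToyDoc]


class ToyPipeline:
    """Hypothetical pipeline exposing a spaCy-like add_pipe / __call__ API."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Component]] = {}
        self.pipeline: List[Tuple[str, Component]] = []

    def register(self, name: str, factory: Callable[..., Component]) -> None:
        # Map a factory name to a callable that builds a component.
        self._factories[name] = factory

    def add_pipe(self, name: str, config: Optional[dict] = None) -> Component:
        # Instantiate the component from its factory and append it to the pipeline.
        component = self._factories[name](**(config or {}))
        self.pipeline.append((name, component))
        return component

    def __call__(self, text: str) -> ToyDoc:
        # Run every component in order on the same document object.
        doc = ToyDoc(text)
        for _, component in self.pipeline:
            doc = component(doc)
        return doc


def make_keyword_matcher(keyword: str, label: str) -> Component:
    """Factory for a trivial rule-based component (first occurrence only)."""

    def component(doc: ToyDoc) -> ToyDoc:
        start = doc.text.find(keyword)
        if start >= 0:
            doc.ents.append((start, start + len(keyword), label))
        return doc

    return component


nlp = ToyPipeline()
nlp.register("toy.keyword", make_keyword_matcher)
nlp.add_pipe("toy.keyword", config={"keyword": "fever", "label": "symptom"})
doc = nlp("Patient has fever.")
print(doc.ents)
```

The point of the sketch is only that rule-based components need nothing beyond a callable taking and returning a document, so user-facing calls can stay the same while the pipeline object itself is replaced.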

This is a work in progress and will require further testing before it can be released. Should we first release it under an alpha version number? Once testing is complete, the new version will be released as stable.

Core changes / new features:

Use the confit package to instantiate components (soon to be published)
Language.factory -> edsnlp.registry.factory.register (confit registry)
Lazy loading of components from their entry points (this required patching spacy.Language.__init__), which avoids wrapping every import torch statement for purely rule-based use cases. Hence, torch is not a required dependency
Training script with PyTorch only (tests/training/)
Re-implemented the trainable NER component with the new system under eds.ner
New efficient implementation for eds.transformer (to be used in place of spacy-transformers)
New eds.text_cnn embedding contextualizer
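The lazy-loading idea in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the confit or EDS-NLP code: `LazyRegistry`, `register_entry_point` and the `toy.*` names are hypothetical, and entry points are modelled as plain "module:attribute" strings resolved with the standard `importlib.import_module`.

```python
import importlib
from typing import Any, Callable, Dict


class LazyRegistry:
    """Hypothetical registry that stores import targets and imports on first use."""

    def __init__(self) -> None:
        self._entry_points: Dict[str, str] = {}
        self._loaded: Dict[str, Callable[..., Any]] = {}

    def register_entry_point(self, name: str, target: str) -> None:
        # Only the "module:attribute" string is stored; nothing is imported here.
        self._entry_points[name] = target

    def get(self, name: str) -> Callable[..., Any]:
        # Import the module the first time the factory is requested, then cache it.
        if name not in self._loaded:
            module_name, attr = self._entry_points[name].split(":")
            module = importlib.import_module(module_name)
            self._loaded[name] = getattr(module, attr)
        return self._loaded[name]


registry = LazyRegistry()
# A target that exists in the standard library.
registry.register_entry_point("toy.fraction", "fractions:Fraction")
# A target whose module is not installed: registering it is harmless,
# an error only occurs if someone actually asks for this factory.
registry.register_entry_point("toy.missing", "not_an_installed_package_xyz:Thing")

make_fraction = registry.get("toy.fraction")
print(make_fraction(1, 2))
```

With this pattern, a factory backed by a heavy optional dependency (torch in the PR's case) costs nothing for users who never request it, which is how a purely rule-based install can avoid the dependency.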

Checklist

Publish confit
Add Span sourcing options to eds.ner (from_ents, from_span_groups)
Add a training recipe?
Re-implement the span qualifier from SpanQualifier trainable component #193
Update the documentation for NER
Add documentation for embedding components (eds.transformer, eds.text_cnn)
Add documentation for the new pipeline system
Add unit tests for the new pipeline
Update changelog

codecov · 2023-08-08T23:13:01Z

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (1b62d35) 94.76% compared to head (df2bf0a) 96.58%.

❗ Current head df2bf0a differs from pull request most recent head 3ec32ab. Consider uploading reports for the commit 3ec32ab to get more accurate results

Files	Patch %	Lines
edsnlp/optimization.py	91.89%	6 Missing ⚠️
edsnlp/core/pipeline.py	98.45%	5 Missing ⚠️
edsnlp/data/base.py	87.17%	5 Missing ⚠️
edsnlp/data/brat.py	0.00%	5 Missing ⚠️
edsnlp/core/torch_component.py	97.84%	4 Missing ⚠️
edsnlp/data/standoff.py	98.25%	3 Missing ⚠️
edsnlp/core/registry.py	98.34%	2 Missing ⚠️
edsnlp/data/json.py	97.77%	2 Missing ⚠️
edsnlp/pipes/ner/adicap/models.py	85.71%	2 Missing ⚠️
edsnlp/data/converters.py	99.47%	1 Missing ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #202      +/-   ##
==========================================
+ Coverage   94.76%   96.58%   +1.81%     
==========================================
  Files         233      254      +21     
  Lines        6099     8356    +2257     
==========================================
+ Hits         5780     8071    +2291     
+ Misses        319      285      -34


percevalw force-pushed the core-refacto branch from f576f57 to 0cffe36 Compare April 6, 2023 17:15

percevalw force-pushed the core-refacto branch 2 times, most recently from a5a3c48 to 1516017 Compare July 28, 2023 20:44

percevalw force-pushed the core-refacto branch 4 times, most recently from 08699ae to b66eea1 Compare August 8, 2023 22:55

percevalw force-pushed the core-refacto branch 2 times, most recently from 62c7fbc to d06dde6 Compare August 9, 2023 12:56

percevalw force-pushed the core-refacto branch 4 times, most recently from 440779e to a17230e Compare August 25, 2023 22:55

percevalw force-pushed the master branch from 9344042 to f8530c6 Compare September 15, 2023 11:47

percevalw force-pushed the core-refacto branch from a17230e to 4e5f2ed Compare September 29, 2023 08:56

percevalw marked this pull request as ready for review October 11, 2023 07:17

percevalw force-pushed the core-refacto branch 8 times, most recently from 8d5a3d7 to d17e677 Compare October 16, 2023 17:26

percevalw mentioned this pull request Oct 18, 2023

EDS-NLP refacto aphp/eds-pseudo#8

Merged

6 tasks

percevalw force-pushed the core-refacto branch 4 times, most recently from 7aa37ef to a6b7e0b Compare October 26, 2023 15:05

percevalw added 19 commits December 4, 2023 10:17

fix: verify files existence

74574af

fix: allow for empty initial project when packaging

d09a873

feat: allow to define multiple data sources in training script

a476d25

ci: enable pip cache

5364680

fix: allow empty docs in eds.transformer pipe

33abfc1

fix: serialize pipe meta dicts as well when transfering/pickling a pipeline

c152e11

fix: replace multiprocess with patched multiprocessing to support mp cuda & quantization + faster transfer pipeline via tmp

c4a064f

fix: support packaging in python 3.11

21dbb47

feat: support bitsandbytes quantization in transformer init

aa27cd0

chore: update changelog and bump version to 0.10.0beta2

348de67

feat: prepare finalizing writers (ie with batch accumulation to files) + smoother multiprocessing stopping

5e61022

feat: add edsnlp.data support for parquet files with parallel reading / writing

60e6f11

test: deduplicate docs tests (reference pages inserted in other pages)

77515ad

fix: detect torch pipes via the forward attribute instead of `batch_process`

8d9adb1

fix: switch SimpleQueues for Queues to avoid pipe buffer overflow induced locking

076ce99

test: improve coverage

45041e4

feat: allow overriding the config when loading a pipeline from the disk

f0c8e00

refacto: rename pipelines to pipes

839d93a

refacto: fix paths after pipelines to pipes renaming

8b68411

percevalw force-pushed the core-refacto branch from 4a37752 to df2bf0a Compare December 4, 2023 09:17

percevalw added 6 commits December 4, 2023 10:48

docs: update for 0.10.0 release

ae17b84

fix: more options for nlp.package() and better support for poetry include fields

50244af

chore: apply format rules

7203802

fix: update numpy build dependency marker to fix poetry installs

92d7b78

fix: use real hf tokenizer vocab size when adding new trainable token

c9c545c

feat: add edsnlp.data support for parquet files with parallel reading / writing

3ec32ab

percevalw force-pushed the core-refacto branch from df2bf0a to 3ec32ab Compare December 4, 2023 09:49

percevalw merged commit b9b496e into master Dec 4, 2023
8 checks passed

percevalw mentioned this pull request Dec 4, 2023

Refactor the parallelization utils #212

Closed

percevalw deleted the core-refacto branch November 14, 2024 20:26
