1 line of code to OCR images, PDFs and DOCX files, Text Generation with GPT2 and new T5 models, Sequence Classification with XlmRoBerta, RoBerta, Xlnet, Longformer and Albert, Transformer based medical NER with MedicalBertForTokenClassifier, 80 new models, 20+ new languages including various African and Scandinavian languages and much more in John Snow Labs NLU 3.4.0!
We are incredibly excited to announce that John Snow Labs NLU 3.4.0 has been released!
This release features 11 new annotator classes and 80 new models, including 3 OCR Transformers which enable you to extract text from various file types, support for GPT2 and new pretrained T5 models for Text Generation, and dozens more new transformer based models for Token and Sequence Classification.
This includes 8 new Sequence Classifier models which can be pretrained in Huggingface and imported into Spark NLP and NLU.
Finally, the NLU tutorial page with its 140+ notebooks has been updated.
New NLU OCR Features
3 new OCR based spells are supported, which enable extracting text from files of type JPEG, PNG, BMP, WBMP, GIF, JPG, TIFF, DOCX and PDF in just 1 line of code.
You need a Spark OCR license to use these, which is available for free here. Refer to the new OCR tutorial notebook to get started.
Find more details on the NLU OCR documentation page.
New NLU Healthcare Features
The healthcare side features a new MedicalBertForTokenClassifier annotator, which is a Bert based model for token classification problems like Named Entity Recognition, Parts of Speech and much more. Overall there are 28 new models, which include German De-Identification models, English NER models for extracting Drug Development Trials and Clinical Abbreviations and Acronyms, NER models for chemical compounds/drugs and genes/proteins, and updated MedicalBertForTokenClassifier NER models for the medical domains Adverse Drug Events, Anatomy, Chemicals, Genes, Proteins, Cellular/Molecular Biology, Drugs, Bacteria, De-Identification and general Medical and Clinical Named Entities.
For Entity Relation Extraction between entity pairs, there are new models for interactions between Drugs and Proteins.
For Entity Resolution, there are new models for resolving Clinical Abbreviations and Acronyms to their full length names, a model for resolving Drug Substance Entities to the categories Clinical Drug, Pharmacologic Substance, Antibiotic, and Hazardous or Poisonous Substance, and new resolvers for LOINC and SNOMED terminologies.
New NLU Open source Features
On the open source side we have new support for OpenAI's GPT2 for various text sequence to sequence problems. Additionally, the following new Transformer models are supported: RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, LongformerForSequenceClassification, AlbertForSequenceClassification, XlnetForSequenceClassification and Word2Vec, with various pre-trained weights for various problems!
There are new GPT2 models for generating text conditioned on some input and new T5 style transfer models for active to passive, formal to informal, informal to formal and passive to active sequence to sequence generation.
Additionally, a new T5 model for generating SQL code from natural language input is provided.
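The text generation models above are used like any other NLU model: load a spell, then call predict on some input. The snippet below is a minimal sketch of that pattern, not an official recipe. `run_spell` and its `loader` parameter are hypothetical names introduced here for illustration, and the concrete spell reference for a given GPT2 or T5 model is not stated in these notes, so it has to be looked up on the Models Hub.

```python
def run_spell(spell, text, loader=None):
    """Load an NLU spell and run it on `text`.

    `spell` is the model reference passed to `nlu.load` (look it up on the
    Models Hub; no specific name is assumed here).
    `loader` is a hypothetical hook that defaults to `nlu.load`; passing a
    stand-in lets you try the helper without Spark or NLU installed.
    """
    if loader is None:
        import nlu  # lazy import: only needed for real runs

        loader = nlu.load
    pipeline = loader(spell)
    # NLU pipelines expose `predict`, which returns the model output
    # (a pandas DataFrame by default).
    return pipeline.predict(text)
```

With NLU installed, a call would look like `run_spell("<spell reference from the Models Hub>", "Some input text")`, where the placeholder is replaced by the reference of the GPT2 or T5 model you want to use.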
On top of this, dozens of new Transformer based Sequence Classifiers and Token Classifiers have been released. For Token Classification this includes the following models:
Multi-Lingual general NER models for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Nigerian Pidgin, Swahili, Wolof and Yorùbá),
10 high resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese),
6 Scandinavian languages (Danish, Norwegian-Bokmål, Norwegian-Nynorsk, Swedish, Icelandic, Faroese),
Uni-Lingual NER models for general entities in the languages Chinese, Hindi, Icelandic and Indonesian,
and finally English NER models for extracting entities related to Stock Ticker Symbols, Restaurants and Time.
For Sequence Classification, there are new models for classifying Toxicity in Russian text and English models for Movie Reviews, News Categorization, Sentimental Tone and General Sentiment.
New NLU OCR Models
The following Transformers have been integrated from Spark OCR
NLU Spell | Transformer Class |
---|---|
nlu.load(img2text) | ImageToText |
nlu.load(pdf2text) | PdfToText |
nlu.load(doc2text) | DocToText |
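
As a rough sketch of how the three spells in the table above could be picked per file: the helper below maps a file extension to a spell name and then hands the file path to `nlu.load(...).predict(...)`. The assignment of the image extensions to `img2text` is an assumption inferred from the spell names, `spell_for_file` and `extract_text` are hypothetical helper names, and the Spark OCR license setup is left out.

```python
from pathlib import Path

# Spell names come from the table above. Which extensions belong to which
# spell is an assumption based on the spell names, not a documented mapping.
OCR_SPELLS = {
    "img2text": {".jpeg", ".jpg", ".png", ".bmp", ".wbmp", ".gif", ".tiff"},
    "pdf2text": {".pdf"},
    "doc2text": {".docx"},
}


def spell_for_file(path):
    """Return the OCR spell name matching the file extension of `path`."""
    suffix = Path(path).suffix.lower()
    for spell, extensions in OCR_SPELLS.items():
        if suffix in extensions:
            return spell
    raise ValueError(f"No OCR spell known for file type {suffix!r}")


def extract_text(path):
    """Run the matching OCR spell on one file.

    Needs NLU installed and a valid Spark OCR license; `predict` is assumed
    to accept a file path for the OCR spells.
    """
    import nlu  # lazy import so spell_for_file works without NLU installed

    return nlu.load(spell_for_file(path)).predict(path)
```

For example, `spell_for_file("scan.PNG")` selects `img2text`, while an unsupported type such as `notes.txt` raises a `ValueError` instead of silently loading the wrong spell.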
New Open Source Models
Integration for the 49 new models from the colossal Spark NLP 3.4.0 release
New Healthcare Models
Integration for the 28 new models from the amazing Spark NLP for healthcare 3.4.0 release
Additional NLU resources
- NLU OCR tutorial notebook
- 140+ NLU Tutorials
- NLU in Action
- Streamlit visualizations docs
- The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
- Spark NLP publications
- NLU documentation
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP and NLU!
1 line Install NLU on Google Colab
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
1 line Install NLU on Kaggle
!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash
Install via PIP
! pip install nlu pyspark streamlit==0.80.0