Release 1 line to OCR for images, PDFs and DOCX, Text Generation with GPT2 and new T5 models, Sequence Classification with XlmRoBerta, RoBerta, Xlnet, Longformer and Albert, Transformer based medical NER with MedicalBertForTokenClassifier, 80 new models, 20+ new languages including various African and Scandinavian languages, and much more in John Snow Labs NLU 3.4.0!

We are incredibly excited to announce John Snow Labs NLU 3.4.0 has been released!
This release features 11 new annotator classes and 80 new models, including 3 OCR Transformers which enable you to extract text
from various file types, support for GPT2 and new pretrained T5 models for Text Generation, and dozens of new transformer based models
for Token and Sequence Classification.
This includes 8 new Sequence Classifier models which can be pretrained in Huggingface and imported into Spark NLP and NLU.
Finally, the NLU tutorial page with 140+ notebooks has been updated.

New NLU OCR Features

3 new OCR based spells are supported, which enable extracting text from files of type
JPEG, PNG, BMP, WBMP, GIF, JPG, TIFF, DOCX and PDF in just 1 line of code.
Using them requires a Spark OCR license, which is available for free. Refer to the new
OCR tutorial notebook for a walkthrough.

Find more details on the NLU OCR documentation page.

New NLU Healthcare Features

The healthcare side features a new MedicalBertForTokenClassifier annotator, which is a Bert based model for token classification problems like Named Entity Recognition,
Parts of Speech and much more. Overall there are 28 new models, which include German De-Identification models, English NER models for extracting Drug Development Trials,
Clinical Abbreviations and Acronyms, NER models for chemical compounds/drugs and genes/proteins, and updated MedicalBertForTokenClassifier NER models for the medical domains Adverse Drug Events,
Anatomy, Chemicals, Genes, Proteins, Cellular/Molecular Biology, Drugs, Bacteria, De-Identification and general Medical and Clinical Named Entities.
For Entity Relation Extraction between entity pairs, there are new models for the interaction between Drugs and Proteins.
For Entity Resolution, there are new models for resolving Clinical Abbreviations and Acronyms to their full length names, a model for resolving Drug Substance Entities to the categories
Clinical Drug, Pharmacologic Substance, Antibiotic, and Hazardous or Poisonous Substance, and new resolvers for LOINC and SNOMED terminologies.

New NLU Open source Features

On the open source side we have new support for OpenAI's GPT2 for various text sequence to sequence problems, and
additionally the following new Transformer models are supported:
RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, LongformerForSequenceClassification,
AlbertForSequenceClassification and XlnetForSequenceClassification, as well as Word2Vec with various pre-trained weights for various problems!

There are new GPT2 models for generating text conditioned on some input and
new T5 style transfer models for active to passive, formal to informal, informal to formal and passive to active sequence to sequence generation.
Additionally, a new T5 model for generating SQL code from natural language input is provided.
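
As a rough sketch of how these generation models might be called from Python, the helper below maps a source and target style to the matching T5 reference from this release's model table and passes it to `nlu.load`. The names `style_transfer_reference`, `rewrite` and `generate` are hypothetical helpers introduced here for illustration. Calling `rewrite` or `generate` assumes NLU and PySpark are installed and that the models can be downloaded.

```python
# NLU references for the T5 style transfer models, as listed in this release.
T5_STYLE_TRANSFER = {
    ("active", "passive"): "en.t5.active_to_passive_styletransfer",
    ("passive", "active"): "en.t5.passive_to_active_styletransfer",
    ("formal", "informal"): "en.t5.formal_to_informal_styletransfer",
    ("informal", "formal"): "en.t5.informal_to_formal_styletransfer",
}


def style_transfer_reference(source: str, target: str) -> str:
    """Return the NLU reference for a (source style, target style) pair."""
    key = (source.lower(), target.lower())
    if key not in T5_STYLE_TRANSFER:
        raise ValueError(f"No style transfer model for {source} -> {target}")
    return T5_STYLE_TRANSFER[key]


def rewrite(text: str, source: str, target: str):
    """Rewrite text from one style to another (hypothetical helper)."""
    import nlu  # imported lazily so the lookup above works without NLU installed
    return nlu.load(style_transfer_reference(source, target)).predict(text)


def generate(prompt: str, reference: str = "en.gpt2"):
    """Generate text conditioned on a prompt with one of the GPT2 references."""
    import nlu
    return nlu.load(reference).predict(prompt)
```

For example, `rewrite("The dog chased the cat.", "active", "passive")` would load `en.t5.active_to_passive_styletransfer` and return NLU's prediction for that sentence.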

On top of this, dozens of new Transformer based Sequence Classifiers and Token Classifiers have been released. For Token Classification this includes the following models:
Multi-Lingual general NER models for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá),
10 high resourced languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese),
6 Scandinavian languages (Danish, Norwegian-Bokmål, Norwegian-Nynorsk, Swedish, Icelandic, Faroese),
Uni-Lingual NER models for general entities in the languages Chinese, Hindi, Icelandic and Indonesian,
and finally English NER models for extracting entities related to Stock Ticker Symbols, Restaurants and Time.

For Sequence Classification, there are new models for classifying Toxicity in Russian text and English models for
Movie Reviews, News Categorization, Sentimental Tone and General Sentiment.
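
To illustrate how the English Movie Review (IMDB) and News Categorization (AG News) classifiers might be selected, here is a minimal sketch. The lookup table copies the NLU references from this release's model table, whose naming order is not uniform across architectures. The helpers `architectures_for` and `classify` are hypothetical names used only for this example, and `classify` assumes NLU and PySpark are installed.

```python
# (dataset, architecture) -> NLU reference, copied from this release's model table.
SEQUENCE_CLASSIFIERS = {
    ("imdb", "xlm_roberta"): "en.classify.xlm_roberta.imdb",
    ("imdb", "roberta"): "en.classify.roberta.imdb",
    ("imdb", "albert"): "en.classify.albert.imdb",
    ("imdb", "longformer"): "en.classify.imdb.longformer",
    ("imdb", "xlnet"): "en.classify.imdb.xlnet",
    ("ag_news", "xlm_roberta"): "en.classify.xlm_roberta.ag_news",
    ("ag_news", "roberta"): "en.classify.roberta.ag_news",
    ("ag_news", "albert"): "en.classify.albert.ag_news",
    ("ag_news", "longformer"): "en.classify.ag_news.longformer",
}


def architectures_for(dataset: str) -> list:
    """List the architectures that have a classifier for the given dataset."""
    return sorted(arch for (ds, arch) in SEQUENCE_CLASSIFIERS if ds == dataset)


def classify(texts, dataset: str, architecture: str = "roberta"):
    """Classify one string or a list of strings (hypothetical helper)."""
    import nlu  # imported lazily so the lookup above works without NLU installed
    return nlu.load(SEQUENCE_CLASSIFIERS[(dataset, architecture)]).predict(texts)
```

A call such as `classify(["A wonderful film.", "Dull and far too long."], "imdb")` would load `en.classify.roberta.imdb` and return its predictions for both reviews.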

New NLU OCR Models

The following Transformers have been integrated from Spark OCR

NLU Spell	Transformer Class
nlu.load('img2text')	ImageToText
nlu.load('pdf2text')	PdfToText
nlu.load('doc2text')	DocToText
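
The spells in the table above could be wrapped as in the following sketch, which picks a spell from a file's extension and then runs it. Grouping the image extensions under `img2text` is an assumption based on the file types listed for this release, and `ocr_spell_for` and `extract_text` are hypothetical helper names. Running `extract_text` assumes NLU is installed and a Spark OCR license has already been configured.

```python
from pathlib import Path

# Assumption: every image type listed in this release is handled by img2text.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png", ".bmp", ".wbmp", ".gif", ".tiff"}


def ocr_spell_for(path: str) -> str:
    """Return the NLU OCR spell that matches a file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "img2text"
    if suffix == ".pdf":
        return "pdf2text"
    if suffix == ".docx":
        return "doc2text"
    raise ValueError(f"Unsupported file type: {suffix!r}")


def extract_text(path: str):
    """Run OCR on a single file (requires NLU and a Spark OCR license)."""
    import nlu  # imported lazily so ocr_spell_for works without NLU installed
    return nlu.load(ocr_spell_for(path)).predict(path)
```

With this in place, `extract_text("scan.png")` would be equivalent to `nlu.load('img2text').predict("scan.png")`.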

New Open Source Models

Integration for the 49 new models from the colossal Spark NLP 3.4.0 release

Language	NLU Reference	Spark NLP Reference	Task	Annotator Class
en	en.gpt2.distilled	gpt2_distilled	Text Generation	GPT2Transformer
en	en.gpt2	gpt2	Text Generation	GPT2Transformer
en	en.gpt2.medium	gpt2_medium	Text Generation	GPT2Transformer
en	en.gpt2.large	gpt2_large	Text Generation	GPT2Transformer
en	en.t5.active_to_passive_styletransfer	t5_active_to_passive_styletransfer	Text Generation	T5Transformer
en	en.t5.formal_to_informal_styletransfer	t5_formal_to_informal_styletransfer	Text Generation	T5Transformer
en	en.t5.grammar_error_corrector	t5_grammar_error_corrector	Text Generation	T5Transformer
en	en.t5.informal_to_formal_styletransfer	t5_informal_to_formal_styletransfer	Text Generation	T5Transformer
en	en.t5.passive_to_active_styletransfer	t5_passive_to_active_styletransfer	Text Generation	T5Transformer
en	en.t5.wikiSQL	t5_small_wikiSQL	Text Generation	T5Transformer
xx	xx.ner.masakhaner	xlm_roberta_large_token_classifier_masakhaner	Named Entity Recognition	XlmRoBertaForTokenClassification
xx	xx.ner.high_resourced_lang	xlm_roberta_large_token_classifier_hrl	Named Entity Recognition	XlmRoBertaForTokenClassification
xx	xx.ner.scandinavian	bert_token_classifier_scandi_ner	Named Entity Recognition	BertForTokenClassification
en	en.embed.electra.medical	electra_medal_acronym	Embeddings	BertEmbeddings
en	en.ner.restaurant	nerdl_restaurant_100d	Named Entity Recognition	NerDLModel
en	en.embed.word2vec.gigaword_wiki	word2vec_gigaword_wiki_300	Embeddings	Word2VecModel
en	en.embed.word2vec.gigaword	word2vec_gigaword_300	Embeddings	Word2VecModel
en	en.classify.xlm_roberta.imdb	xlm_roberta_base_sequence_classifier_imdb	Text Classification	XlmRoBertaForSequenceClassification
en	en.classify.xlm_roberta.ag_news	xlm_roberta_base_sequence_classifier_ag_news	Text Classification	XlmRoBertaForSequenceClassification
en	en.classify.roberta.imdb	roberta_base_sequence_classifier_imdb	Text Classification	RoBertaForSequenceClassification
en	en.classify.roberta.ag_news	roberta_base_sequence_classifier_ag_news	Text Classification	RoBertaForSequenceClassification
en	en.classify.albert.ag_news	albert_base_sequence_classifier_ag_news	Text Classification	AlbertForSequenceClassification
en	en.classify.albert.imdb	albert_base_sequence_classifier_imdb	Text Classification	AlbertForSequenceClassification
en	en.classify.ag_news.longformer	longformer_base_sequence_classifier_ag_news	Text Classification	LongformerForSequenceClassification
en	en.classify.imdb.xlnet	xlnet_base_sequence_classifier_imdb	Text Classification	XlnetForSequenceClassification
en	en.classify.finance_sentiment	bert_sequence_classifier_finbert_tone	Sentiment Analysis	BertForSequenceClassification
en	en.classify.imdb.longformer	longformer_base_sequence_classifier_imdb	Text Classification	LongformerForSequenceClassification
en	en.ner.time	roberta_token_classifier_timex_semeval	Named Entity Recognition	RoBertaForTokenClassification
en	en.ner.stocks_ticker	roberta_token_classifier_ticker	Named Entity Recognition	RoBertaForTokenClassification
ru	ru.classify.toxic	bert_sequence_classifier_toxicity	Text Classification	BertForSequenceClassification
it	it.classify.sentiment	bert_sequence_classifier_sentiment	Sentiment Analysis	BertForSequenceClassification
es	es.ner	wikiner_6B_100	Named Entity Recognition	NerDLModel
is	is.ner	roberta_token_classifier_icelandic_ner	Named Entity Recognition	RoBertaForTokenClassification
id	id.pos	roberta_token_classifier_pos_tagger	Part of Speech Tagging	RoBertaForTokenClassification
tr	tr.ner	turkish_ner_840B_300	Named Entity Recognition	NerDLModel
id	id.ner	xlm_roberta_large_token_classification_ner	Named Entity Recognition	XlmRoBertaForTokenClassification
de	de.ner	xlm_roberta_large_token_classifier_conll03	Named Entity Recognition	XlmRoBertaForTokenClassification
hi	hi.ner	bert_token_classifier_hi_en_ner	Named Entity Recognition	BertForTokenClassification
nl	nl.ner	wikiner_6B_100	Named Entity Recognition	NerDLModel
zh	zh.ner	bert_token_classifier_chinese_ner	Named Entity Recognition	BertForTokenClassification
fr	fr.classify.xlm_roberta.allocine	xlm_roberta_base_sequence_classifier_allocine	Text Classification	XlmRoBertaForSequenceClassification
ur	ur.classify.fakenews	classifierdl_urduvec_fakenews	Text Classification	ClassifierDLModel
ur	ur.classify.news	classifierdl_bert_news	Text Classification	ClassifierDLModel
fi	fi.embed_sentence.bert.uncased	bert_base_finnish_uncased	Embeddings	BertSentenceEmbeddings
fi	fi.embed_sentence.bert	bert_base_finnish_uncased	Embeddings	BertSentenceEmbeddings
fi	fi.embed_sentence.bert.cased	bert_base_finnish_cased	Embeddings	BertSentenceEmbeddings
te	te.embed.distilbert	distilbert_uncased	Embeddings	DistilBertEmbeddings
sw	sw.embed.xlm_roberta	xlm_roberta_base_finetuned_swahili	Embeddings	XlmRoBertaEmbeddings

New Healthcare Models

Integration for the 28 new models from the amazing Spark NLP for healthcare 3.4.0 release

Language	NLU Reference	Spark NLP Reference	Task	Annotator Class
en	en.med_ner.chemprot.bert	bert_token_classifier_ner_chemprot	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_bacteria	bert_token_classifier_ner_bacteria	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_anatomy	bert_token_classifier_ner_anatomy	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_drugs	bert_token_classifier_ner_drugs	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_jsl_slim	bert_token_classifier_ner_jsl_slim	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_ade	bert_token_classifier_ner_ade	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_deid	bert_token_classifier_ner_deid	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_clinical	bert_token_classifier_ner_clinical	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_jsl	bert_token_classifier_ner_jsl	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.ner_chemical	bert_token_classifier_ner_chemicals	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.bionlp	bert_token_classifier_ner_bionlp	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.classify.token_bert.cellular	bert_token_classifier_ner_cellular	Named Entity Recognition	MedicalBertForTokenClassifier
en	en.med_ner.abbreviation_clinical	ner_abbreviation_clinical	Named Entity Recognition	MedicalNerModel
en	en.med_ner.drugprot_clinical	ner_drugprot_clinical	Named Entity Recognition	MedicalNerModel
en	en.ner.drug_development_trials	bert_token_classifier_drug_development_trials	Named Entity Recognition	BertForTokenClassification
en	en.med_ner.chemprot	ner_chemprot_biobert	Named Entity Recognition	MedicalNerModel
en	en.relation.drugprot	redl_drugprot_biobert	Relation Extraction	RelationExtractionDLModel
en	en.relation.drugprot.clinical	re_drugprot_clinical	Relation Extraction	RelationExtractionModel
en	en.resolve.clinical_abbreviation_acronym	sbiobertresolve_clinical_abbreviation_acronym	Entity Resolution	SentenceEntityResolverModel
en	en.resolve.umls_drug_substance	sbiobertresolve_umls_drug_substance	Entity Resolution	SentenceEntityResolverModel
en	en.resolve.loinc_cased	sbiobertresolve_loinc_cased	Entity Resolution	SentenceEntityResolverModel
en	en.resolve.loinc_uncased	sbluebertresolve_loinc_uncased	Entity Resolution	SentenceEntityResolverModel
en	en.embed_sentence.biobert.rxnorm	sbiobert_jsl_rxnorm_cased	Embeddings	BertSentenceEmbeddings
en	en.embed_sentence.bert_uncased.rxnorm	sbert_jsl_medium_rxnorm_uncased	Embeddings	BertSentenceEmbeddings
en	en.resolve.snomed_drug	sbiobertresolve_snomed_drug	Entity Resolution	SentenceEntityResolverModel
de	de.med_ner.deid_subentity	ner_deid_subentity	Named Entity Recognition	MedicalNerModel
de	de.med_ner.deid_generic	ner_deid_generic	Named Entity Recognition	MedicalNerModel
de	de.embed.w2v	w2v_cc_300d	Embeddings	WordEmbeddingsModel

Additional NLU resources

NLU OCR tutorial notebook
140+ NLU Tutorials
NLU in Action
Streamlit visualizations docs
The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
Spark NLP publications
NLU documentation
Discussions: engage with other community members, share ideas, and show off how you use Spark NLP and NLU!

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark streamlit==0.80.0
