
Releases: JohnSnowLabs/nlu

Multilingual DeBERTa Transformer Embeddings for 100+ Languages, Spanish Deidentification and NER for Randomized Clinical Trials - John Snow Labs NLU 3.4.2

23 Mar 15:58
f9f8bb9


We are very excited to announce that NLU 3.4.2 has been released.
On the open-source side there are 5 new DeBERTa Transformer models: 4 for English and a multilingual model covering 100+ languages.
DeBERTa improves over BERT and RoBERTa by introducing two novel techniques: disentangled attention and an enhanced mask decoder.

On the healthcare side, there are new models for randomized clinical trials (RCT) which can detect sections of type
BACKGROUND, CONCLUSIONS, METHODS, OBJECTIVE and RESULTS in clinical text.
Additionally, there are new Spanish de-identification NER models for entities like STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL and many more.

New Open Source Models

Integrates models from Spark NLP 3.4.2 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.embed.deberta_v3_xsmall | deberta_v3_xsmall | Embeddings | DeBertaEmbeddings |
| en | en.embed.deberta_v3_small | deberta_v3_small | Embeddings | DeBertaEmbeddings |
| en | en.embed.deberta_v3_base | deberta_v3_base | Embeddings | DeBertaEmbeddings |
| en | en.embed.deberta_v3_large | deberta_v3_large | Embeddings | DeBertaEmbeddings |
| xx | xx.embed.mdeberta_v3_base | mdeberta_v3_base | Embeddings | DeBertaEmbeddings |
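
For reference, a minimal sketch of loading one of these DeBERTa spells with the standard nlu.load()/predict() pattern; the exact output column names depend on your NLU version.

import nlu

# Load the English DeBERTa v3 base embeddings in 1 line
pipe = nlu.load('en.embed.deberta_v3_base')

# predict() accepts strings, lists of strings or Pandas DataFrames
df = pipe.predict('John Snow Labs ships new DeBERTa embeddings.')
print(df.columns)  # token and embedding columns
print(df.head())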

New Healthcare Models

Integrates models from Spark NLP For Healthcare 3.4.2 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.med_ner.clinical_trials | bert_sequence_classifier_rct_biobert | Text Classification | MedicalBertForSequenceClassification |
| es | es.med_ner.deid.generic.roberta | ner_deid_generic_roberta_augmented | De-identification | MedicalNerModel |
| es | es.med_ner.deid.subentity.roberta | ner_deid_subentity_roberta_augmented | De-identification | MedicalNerModel |
| en | en.med_ner.deid.generic_augmented | ner_deid_generic_augmented | Named Entity Recognition, De-identification | MedicalNerModel |
| en | en.med_ner.deid.subentity_augmented | ner_deid_subentity_augmented | Named Entity Recognition, De-identification | MedicalNerModel |
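
As a hedged sketch of one of the new Spanish de-identification spells (these are licensed models, so a Spark NLP for Healthcare license must be configured first; the sample sentence and output columns are illustrative only):

import nlu

# Spanish de-identification NER (licensed healthcare model)
deid_pipe = nlu.load('es.med_ner.deid.generic.roberta')

sample = 'Antonio Perez ingreso en el Hospital Clinic el 03/03/2022.'
print(deid_pipe.predict(sample))  # detected entities such as PATIENT, HOSPITAL or DATE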

Additional NLU resources

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark streamlit==0.80.0

22 New models for 23 languages including various African and Indian languages, Medical Spanish models and more in NLU 3.4.1

23 Feb 01:44

We are very excited to announce the release of NLU 3.4.1,
which features 22 new models for 23 languages.
The open-source side covers new embeddings for Vietnamese and the English clinical domain, and multilingual embeddings for 12 Indian and 9 African languages.
Additionally, there are new sequence classifiers for multilingual NER in 9 African languages,
German sentiment classifiers, and English emotion and typo classifiers.
The healthcare side covers medical Spanish models, classifiers for drugs, gender and the PICO framework, and relation extractors for adverse drug events and temporality.
Finally, Spark 3.2.x is now supported and bugs related to Databricks environments have been fixed.

General NLU Improvements

  • Support for Spark 3.2.x

New Open Source Models

Based on the amazing Spark NLP 3.4.1 release, this integrates
new multilingual embeddings for 12 major Indian languages and
embeddings for Vietnamese, French, and the English clinical domain.
Additionally, there is a new multilingual NER model for 9 African languages, an English 6-class emotion classifier and a typo detector.

New Embeddings

  • Multilingual ALBERT - IndicBert model pretrained exclusively on 12 major Indian languages, smaller in size and with performance on par with or better than competing models. Languages covered are Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
    Available with xx.embed.albert.indic
  • Fine tuned Vietnamese DistilBERT Base cased embeddings. Available with vi.embed.distilbert.cased
  • Clinical Longformer Embeddings which consistently outperform ClinicalBERT on various downstream
    tasks and datasets. Available with en.embed.longformer.clinical
  • Fine tuned Static French Word2Vec Embeddings in 3 sizes, 200d, 300d and 100d. Available with fr.embed.word2vec_wiki_1000, fr.embed.word2vec_wac_200 and fr.embed.w2v_cc_300d

New Transformer based Token and Sequence Classifiers

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| xx | xx.embed.albert.indic | albert_indic | Embeddings | AlbertEmbeddings |
| xx | xx.ner.masakhaner.distilbert | xlm_roberta_large_token_classifier_masakhaner | Named Entity Recognition | DistilBertForTokenClassification |
| en | en.embed.longformer.clinical | clinical_longformer | Embeddings | LongformerEmbeddings |
| en | en.classify.emotion.bert | bert_sequence_classifier_emotion | Text Classification | BertForSequenceClassification |
| de | de.classify.news_sentiment.bert | bert_sequence_classifier_news_sentiment | Sentiment Analysis | BertForSequenceClassification |
| en | en.classify.typos.distilbert | distilbert_token_classifier_typo_detector | Named Entity Recognition | DistilBertForTokenClassification |
| fr | fr.embed.word2vec_wiki_1000 | word2vec_wiki_1000 | Embeddings | WordEmbeddingsModel |
| fr | fr.embed.word2vec_wac_200 | word2vec_wac_200 | Embeddings | WordEmbeddingsModel |
| fr | fr.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| vi | vi.embed.distilbert.cased | distilbert_base_cased | Embeddings | DistilBertEmbeddings |
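
For illustration, a minimal sketch of the new English emotion classifier spell from the table above (the predicted-class column name may vary between NLU versions):

import nlu

emotion_pipe = nlu.load('en.classify.emotion.bert')
preds = emotion_pipe.predict(['I am so happy about this release!',
                              'This bug makes me really angry.'])
print(preds)  # one predicted emotion class per sentence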

New Healthcare Models

Integrated from the amazing Spark NLP for Healthcare 3.4.1 release,
which makes 2 new annotator classes available, MedicalBertForSequenceClassification and MedicalDistilBertForSequenceClassification,
along with various medical Spanish models, RxNorm resolvers,
transformer-based sequence classifiers for Drugs, Gender and the PICO framework,
and relation extractors for Temporality and Causality of Drugs and Adverse Events.

New Medical Spanish Models

New Resolvers

New Transformer based Sequence Classifiers

New Relation Extractors

Read more

1 line to OCR for images, PDFs and DOCX, Text Generation with GPT2 and new T5 models, Sequence Classification with XlmRoBerta, RoBerta, Xlnet, Longformer and Albert, Transformer based medical NER with MedicalBertForTokenClassifier, 80 new models, 20+ new languages including various African and Scandinavian languages and much more in John Snow Labs NLU 3.4.0!

22 Jan 18:43
c0dfd53

We are incredibly excited to announce John Snow Labs NLU 3.4.0 has been released!
This release features 11 new annotator classes and 80 new models, including 3 OCR Transformers which enable you to extract text
from various file types, support for GPT2 and new pretrained T5 models for Text Generation, and dozens more new transformer-based models
for Token and Sequence Classification.
This includes 8 new Sequence Classifier models which can be pretrained in Hugging Face and imported into Spark NLP and NLU.
Finally, the NLU tutorial page with its 140+ notebooks has been updated.

New NLU OCR Features

3 new OCR-based spells are supported, which enable extracting text from files of type
JPEG, PNG, BMP, WBMP, GIF, JPG, TIFF, DOCX and PDF in just 1 line of code.
You need a Spark OCR license to use these, which is available for free, and should refer to the new
OCR tutorial notebook.
Find more details on the NLU OCR documentation page.

New NLU Healthcare Features

The healthcare side features a new MedicalBertForTokenClassifier annotator, a BERT-based model for token classification problems like Named Entity Recognition,
Part of Speech tagging and much more. Overall there are 28 new models, which include German De-Identification models, English NER models for extracting Drug Development Trials,
Clinical Abbreviations and Acronyms, NER models for chemical compounds/drugs and genes/proteins, and updated MedicalBertForTokenClassifier NER models for the medical domains Adverse Drug Events,
Anatomy, Chemicals, Genes/Proteins, Cellular/Molecular Biology, Drugs, Bacteria, De-Identification and general Medical and Clinical Named Entities.
For Entity Relation Extraction between entity pairs, there are new models for interactions between Drugs and Proteins.
For Entity Resolution, there are new models for resolving Clinical Abbreviations and Acronyms to their full-length names, a model for resolving Drug Substance Entities to the categories
Clinical Drug, Pharmacologic Substance, Antibiotic, Hazardous or Poisonous Substance, and new resolvers for LOINC and SNOMED terminologies.

New NLU Open source Features

On the open-source side we have new support for OpenAI's GPT2 for various text sequence-to-sequence problems, and
additionally the following new Transformer models are supported:
RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, LongformerForSequenceClassification,
AlbertForSequenceClassification, XlnetForSequenceClassification, and Word2Vec with various pre-trained weights for various problems!

There are new GPT2 models for generating text conditioned on some input, and
new T5 style-transfer models for active-to-passive, formal-to-informal, informal-to-formal and passive-to-active sequence-to-sequence generation.
Additionally, a new T5 model for generating SQL code from natural language input is provided.
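
As an illustrative sketch of these sequence-to-sequence spells (the generated text and output columns depend on the model and NLU version):

import nlu

# Open-ended text generation with GPT2
gpt2_pipe = nlu.load('en.gpt2')
print(gpt2_pipe.predict('The future of NLP is'))

# Natural language to SQL with the T5 wikiSQL model
sql_pipe = nlu.load('en.t5.wikiSQL')
print(sql_pipe.predict('How many customers live in New York?'))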

On top of this, dozens of new Transformer-based Sequence Classifiers and Token Classifiers have been released. For Token Classification this includes the following models:
Multi-Lingual general NER models for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Nigerian Pidgin, Swahili, Wolof, and Yorùbá),
10 high-resource languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese),
6 Scandinavian languages (Danish, Norwegian-Bokmål, Norwegian-Nynorsk, Swedish, Icelandic, Faroese),
uni-lingual NER models for general entities in Chinese, Hindi, Icelandic and Indonesian,
and finally English NER models for extracting entities related to Stock Ticker Symbols, Restaurants and Time.

For Sequence Classification, there are new models for classifying Toxicity in Russian text and English models for
Movie Reviews, News Categorization, Sentimental Tone and General Sentiment.

New NLU OCR Models

The following Transformers have been integrated from Spark OCR

| NLU Spell | Transformer Class |
|---|---|
| nlu.load(img2text) | ImageToText |
| nlu.load(pdf2text) | PdfToText |
| nlu.load(doc2text) | DocToText |
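
A hedged sketch of the img2text spell; a Spark OCR license must be configured first, and predict() is assumed here to accept a path to an image file (the path below is hypothetical):

import nlu

# Requires a valid Spark OCR license to be set up beforehand
ocr_pipe = nlu.load('img2text')

# Hypothetical input path; pass the image file(s) you want to extract text from
df = ocr_pipe.predict('path/to/scanned_page.png')
print(df)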

New Open Source Models

Integration for the 49 new models from the colossal Spark NLP 3.4.0 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.gpt2.distilled | gpt2_distilled | Text Generation | GPT2Transformer |
| en | en.gpt2 | gpt2 | Text Generation | GPT2Transformer |
| en | en.gpt2.medium | gpt2_medium | Text Generation | GPT2Transformer |
| en | en.gpt2.large | gpt_large | Text Generation | GPT2Transformer |
| en | en.t5.active_to_passive_styletransfer | t5_active_to_passive_styletransfer | Text Generation | T5Transformer |
| en | en.t5.formal_to_informal_styletransfer | t5_formal_to_informal_styletransfer | Text Generation | T5Transformer |
| en | en.t5.grammar_error_corrector | t5_grammar_error_corrector | Text Generation | T5Transformer |
| en | en.t5.informal_to_formal_styletransfer | t5_informal_to_formal_styletransfer | Text Generation | T5Transformer |
| en | en.t5.passive_to_active_styletransfer | t5_passive_to_active_styletransfer | Text Generation | T5Transformer |
| en | en.t5.wikiSQL | t5_small_wikiSQL | Text Generation | ... |
Read more

48 new Transformer based models in 9 new languages, including NER for Finance, Industry, Political Policies, COVID and Chemical Trials, various clinical and medical domains in Spanish and English and much more in NLU 3.3.1

06 Dec 15:42
fd7e73b

We are incredibly excited to announce NLU 3.3.1 has been released with 48 new models in 9 languages!

It comes with 2 new types of state-of-the-art models, DistilBERT and BERT for sequence classification with various pre-trained weights, and
state-of-the-art BERT-based classifiers for problems in the domains of Finance, Sentiment Classification, Industry, News, and much more.

On the healthcare side, NLU features 22 new models for English and Spanish, with
Entity Resolver models for LOINC, MeSH, NDC, SNOMED and UMLS Diseases,
NER models for Biomarkers, NIHSS Guidelines, COVID Trials and Chemical Trials,
BERT-based Token Classifier models for biological, genetic, cancer and cellular terms,
BERT for Sequence Classification models for clinical question vs. statement classification,
and finally Spanish Clinical NER and Resolver models.

Once again, we would like to thank our community for making another amazing release possible!

New Open Source Models and Features

Integrates the amazing Spark NLP 3.3.3 and 3.3.2 releases, featuring:

  • New state-of-the-art fine-tuned BERT models for Sequence Classification in English, French, German, Spanish, Japanese, Turkish, Russian, and multilingual languages.
  • DistilBertForSequenceClassification models in English, French and Urdu
  • Word2Vec models.
  • classify.distilbert_sequence.banking77 : Banking intent classification model trained on the BANKING77 dataset, which provides a very fine-grained set of intents in the banking domain. It comprises 13,083 customer service queries labeled with 77 intents and focuses on fine-grained single-domain intent detection. Can predict intent classes like activate_my_card, age_limit, apple_pay_or_google_pay, atm_support, automatic_top_up, balance_not_updated_after_bank_transfer, balance_not_updated_after_cheque_or_cash_deposit, beneficiary_not_allowed, cancel_transfer, card_about_to_expire, card_acceptance, card_arrival, card_delivery_estimate, card_linking, card_not_working, card_payment_fee_charged, card_payment_not_recognised, card_payment_wrong_exchange_rate, card_swallowed, cash_withdrawal_charge, cash_withdrawal_not_recognised, change_pin, compromised_card, contactless_not_working, country_support, declined_card_payment, declined_cash_withdrawal, declined_transfer, direct_debit_payment_not_recognised, disposable_card_limits, edit_personal_details, exchange_charge, exchange_rate, exchange_via_app, extra_charge_on_statement, failed_transfer, fiat_currency_support, get_disposable_virtual_card, get_physical_card, getting_spare_card, getting_virtual_card, lost_or_stolen_card, lost_or_stolen_phone, order_physical_card, passcode_forgotten, pending_card_payment, pending_cash_withdrawal, pending_top_up, pending_transfer, pin_blocked, receiving_money,
  • classify.distilbert_sequence.industry : Industry classification model which can predict industry classes like Advertising, Aerospace & Defense, Apparel Retail, Apparel, Accessories & Luxury Goods, Application Software, Asset Management & Custody Banks, Auto Parts & Equipment, Biotechnology, Building Products, Casinos & Gaming, Commodity Chemicals, Communications Equipment, Construction & Engineering, Construction Machinery & Heavy Trucks, Consumer Finance, Data Processing & Outsourced Services, Diversified Metals & Mining, Diversified Support Services, Electric Utilities, Electrical Components & Equipment, Electronic Equipment & Instruments, Environmental & Facilities Services, Gold, Health Care Equipment, Health Care Facilities, Health Care Services.
  • xx.classify.bert_sequence.sentiment : Multi-Lingual Sentiment Classifier. This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5). This model is intended for direct use as a sentiment analysis model for product reviews in any of the six languages above, or for further finetuning on related sentiment analysis tasks (see the sketch after this list).
  • distilbert_sequence.policy : Policy Classifier. This model was trained on 129,669 manually annotated sentences to classify text into one of seven political categories: ‘Economy’, ‘External Relations’, ‘Fabric of Society’, ‘Freedom and Democracy’, ‘Political System’, ‘Welfare and Quality of Life’ or ‘Social Groups’.
  • classify.bert_sequence.dehatebert_mono : Hate Speech Classifier for detecting hate speech in text.
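
For example, a minimal sketch of the multilingual sentiment spell described above, which predicts a 1-5 star rating for product reviews (the output column name may differ by NLU version):

import nlu

sentiment_pipe = nlu.load('xx.classify.bert_sequence.sentiment')
reviews = ['Das Produkt ist hervorragend!',           # German
           'Ce telephone est une grosse deception.']  # French
print(sentiment_pipe.predict(reviews))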

Complete List of Open Source Models :

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| en | en.classify.bert_sequence.imdb_large | bert_large_sequence_classifier_imdb | Text Classification |
| en | en.classify.bert_sequence.imdb | bert_base_sequence_classifier_imdb | Text Classification |
| en | en.classify.bert_sequence.ag_news | bert_base_sequence_classifier_ag_news | Text Classification |
| en | en.classify.bert_sequence.dbpedia_14 | bert_base_sequence_classifier_dbpedia_14 | Text Classification |
| en | en.classify.bert_sequence.finbert | bert_sequence_classifier_finbert | Text Classification |
| en | en.classify.bert_sequence.dehatebert_mono | bert_sequence_classifier_dehatebert_mono | Text Classification |
| tr | tr.classify.bert_sequence.sentiment | bert_sequence_classifier_turkish_sentiment | Text Classification |
| de | de.classify.bert_sequence.sentiment | bert_sequence_classifier_sentiment | Text Classification |
| ru | ru.classify.bert_sequence.sentiment | bert_sequence_classifier_rubert_sentiment | Text Classification |
| ja | ja.classify.bert_sequence.sentiment | bert_sequence_classifier_japanese_sentiment | Text Classification |
| es | es.classify.bert_sequence.sentiment | bert_sequence_classifier_beto_sentiment_analysis | Text Classification |
| es | es.classify.bert_sequence.emotion | bert_sequence_classifier_beto_emotion_analysis | Text Classification |
| xx | [xx.classify.bert_sequence.sentiment](https://nlp.johnsnowlabs.com/2021/11/03/bert_seq... | | |
Read more

2000%+ Speedup on small data, 63 new models for 100+ Languages with 6 new supported Transformer classes including BERT, XLM-RoBERTa, alBERT, Longformer, XLnet based models, 48 NER profiling healthcare pipelines and much more in John Snow Labs NLU 3.3.0

30 Oct 07:29
e612a9c

We are incredibly excited to announce NLU 3.3.0 has been released!
It comes with an up to 2000%+ speedup on small datasets and 6 new types of deep learning transformer models, including
RoBertaForTokenClassification, XlmRoBertaForTokenClassification, AlbertForTokenClassification, LongformerForTokenClassification, XlnetForTokenClassification and XlmRoBertaSentenceEmbeddings.
In total there are 63 new NLP models and 6 newly supported languages (Igbo, Ganda, Dholuo, Naija, Wolof and Kinyarwanda, with their corresponding ISO codes ig, lg, lou, pcm, wo and rw),
with new SOTA XLM-RoBERTa models for the Luganda, Kinyarwanda, Igbo, Hausa and Amharic languages, and 2 new Multilingual Embeddings with 100+ supported languages via XLM-RoBERTa.

On the healthcare NLP side we are glad to announce 18 new Spark NLP for Healthcare models, including:

  • NER Profiling pretrained pipelines to run 48 different Clinical NER and 21 different BioBERT models at once over the input text
  • A new BERT-based De-identification NER model
  • Sentence Entity Resolver models for the German language
  • A new Spell Checker model for drugs
  • 3 new Sentence Entity Resolver models (3-char ICD10CM, RxNorm_NDC, HCPCS)
  • 5 new Clinical NER models (trained with the BertForTokenClassification approach)
  • A Radiology NER model trained on the cheXpert dataset
  • New UMLS Sentence Entity Resolver models

Additionally, 2 new tutorials are available: the NLU & Streamlit Crashcourse and the NLU for Healthcare Crashcourse, covering each of the 50+ healthcare domains and 200+ healthcare models.

New Features and Improvements

2000%+ Speedup prediction for small datasets

NLU pipelines now predict up to 2000% faster on small datasets by optimizing the integration with Spark NLP's light pipelines.
NLU configures this automatically, but it can also be turned off via multithread=False, as sketched below.
NLU 3.3.0 Benchmark
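
A hedged sketch of overriding this behaviour, assuming multithread is passed to predict() as the note above suggests:

import nlu

pipe = nlu.load('sentiment.imdb')

# Default: NLU decides automatically and uses Spark NLP light pipelines on small inputs
fast_preds = pipe.predict('NLU is fast on small datasets')

# Explicitly disable the light-pipeline optimization
regular_preds = pipe.predict('NLU is fast on small datasets', multithread=False)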

50x faster saving of NLU Pipelines

Up to 50x faster saving Spark NLP/ NLU models and pipelines! We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the xlm_roberta_base model before Spark NLP 3.3.0, and now it only takes up to 15 seconds!

New Annotator Classes Integrated

The following new transformer classes are available with various pretrained weights in 1 line of code :

New Transformer Models

The following models are available from the amazing Spark NLP
3.3.0 and
3.3.1 releases,
which include NLP models for
Yiddish, Ukrainian, Telugu, Tamil, Somali, Sindhi, Russian, Punjabi, Nepali, Marathi, Malayalam, Kannada, Indonesian, Gujarati, Bosnian, Igbo, Ganda, Dholuo, Naija, Wolof and Kinyarwanda.

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| ig | ig.embed.xlm_roberta | xlm_roberta_base_finetuned_igbo | Embeddings |
| ig | ig.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_igbo | Embeddings |
| lg | lg.embed.xlm_roberta | xlm_roberta_base_finetuned_luganda | Embeddings |
| lg | lg.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_luganda | Embeddings |
| wo | wo.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_wolof | Embeddings |
| wo | wo.embed.xlm_roberta | xlm_roberta_base_finetuned_wolof | Embeddings |
| rw | rw.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_kinyarwanda | Embeddings |
| rw | rw.embed.xlm_roberta | xlm_roberta_base_finetuned_kinyarwanda | Embeddings |
| sw | sw.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_swahili | Embeddings |
| sw | sw.embed.xlm_roberta | xlm_roberta_base_finetuned_swahili | Embeddings |
| ha | ha.embed.xlm_roberta | xlm_roberta_base_finetuned_hausa | Embeddings |
| ha | ha.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_hausa | Embeddings |
| am | am.embed.xlm_roberta | xlm_roberta_base_finetuned_amharic | Embeddings |
| am | am.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_amharic | Embeddings |
| yo | yo.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_yoruba | Embeddings |
| yo | yo.embed.xlm_roberta | [xlm_roberta_base_finetuned_yoruba](https://nlp.johnsnowl... | |
Read more

27 new models in 7 Languages, including Japanese NER, resolution models for SNOMED, ICDO, CPT and RxNorm codes and much more in NLU 3.2.1

18 Sep 12:42
41e5a30

We are very excited to announce NLU 3.2.1!
This release comes with 27 new transformer-based models for 7 languages.
There are new NER classifiers, BertSentenceEmbeddings, BertEmbeddings and BertForTokenClassification models
for Japanese, German, Dutch, Swedish, Spanish, French and English.
For healthcare there are new Entity Resolvers and MedicalNerModels
for SNOMED Conditions, CPT Measurements, ICDO, RxNorm Dispositions, Posology and De-identification.
Finally, a new tutorial notebook and a webinar are available, which showcase almost every feature of NLU
for the over 50 domains in Healthcare/Clinical/Biomedical/etc.

New Transformer Models

Models in Japanese, German, Dutch, Swedish, Spanish, French and English from the great Spark NLP 3.2.3 release

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.embed.bert.base_uncased_legal | bert_base_uncased_legal | BertEmbeddings | en |
| en.embed_sentence.bert.base_uncased_legal | sent_bert_base_uncased_legal | BertSentenceEmbeddings | en |
| en.embed.token_bert.classifier_ner_btc | bert_token_classifier_ner_btc | BertForTokenClassification | en |
| es.embed.bert.base_uncased | bert_base_uncased | BertEmbeddings | es |
| es.embed.bert.base_cased | bert_base_cased | BertEmbeddings | es |
| es.embed_sentence.bert.base_uncased | sent_bert_base_uncased | BertSentenceEmbeddings | es |
| es.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | es |
| el.embed.bert.base_uncased | bert_base_uncased | BertEmbeddings | el |
| el.embed_sentence.bert.base_uncased | sent_bert_base_uncased | BertSentenceEmbeddings | el |
| sv.embed.bert.base_cased | bert_base_cased | BertEmbeddings | sv |
| sv.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | sv |
| nl.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | nl |
| nl.embed.bert.base_cased | bert_base_cased | BertEmbeddings | nl |
| fr.classify.sentiment.bert | classifierdl_bert_sentiment | ClassifierDLModel | fr |
| ja.embed.glove.cc_300d | japanese_cc_300d | WordEmbeddingsModel | ja |
| ja.ner.ud_gsd_cc_300d | ner_ud_gsd_cc_300d | NerDLModel | ja |
| ja.ner.ud_gsd_xlm_roberta_base | ner_ud_gsd_xlm_roberta_base | NerDLModel | ja |
| ja.embed.token_bert.classifier_ner_ud_gsd | bert_token_classifier_ner_ud_gsd | BertForTokenClassification | ja |
| de.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | de |
| de.classify.sentiment.bert | classifierdl_bert_sentiment | ClassifierDLModel | de |

New Healthcare Transformer Models

Models for SNOMED Conditions, CPT Measurements, ICDO, RxNorm Dispositions, Posology and De-identification from the amazing Spark NLP for Healthcare 3.2.2 release

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.resolve.snomed_conditions | sbertresolve_snomed_conditions | SentenceEntityResolverModel | en |
| en.resolve.cpt.procedures_measurements | sbiobertresolve_cpt_procedures_measurements_augmented | SentenceEntityResolverModel | en |
| en.resolve.icdo.base | sbiobertresolve_icdo_base | SentenceEntityResolverModel | en |
| en.resolve.rxnorm.disposition.sbert | sbertresolve_rxnorm_disposition | SentenceEntityResolverModel | en |
| en.resolve.rxnorm_disposition.sbert | sbertresolve_rxnorm_disposition | SentenceEntityResolverModel | en |
| en.med_ner.posology.experimental | ner_posology_experimental | MedicalNerModel | en |
| en.med_ner.deid.subentity_augmented | ner_deid_subentity_augmented | MedicalNerModel | en |

New Notebooks

Enhancements

  • Columns of the Pandas DataFrame returned by NLU will now be sorted alphabetically

Bugfixes

  • Fixed a bug that caused output levels to not be inferred properly
  • Fixed a bug that caused SentenceResolver visualizations not to appear.

100+ Transformers Models in 40+ languages, 3-D Streamlit Entity-Embedding-Manifold visualizations, Multi-Lingual NER, Longformers, TokenDistilBERT, Trainable Sentence Resolvers, 7% less memory usage and much more in NLU 3.2.0

11 Sep 09:58
2e12d75

We are extremely excited to announce the release of NLU 3.2.0
which marks the 1-year anniversary of the birth of this magical library.
This release packs features and improvements into every part of NLU:
89 new NLP models, including new Longformer, TokenBert and TokenDistilBert models and Multi-Lingual NER for 40+ languages, and
12 new Healthcare models, with trainable sentence resolvers and models for Adverse Drug Relations, Clinical Token Bert models, NER models for Radiology, Drugs, Posology and Administration Cycles, RxNorm resolvers, and new Medical Assertion models.
New Streamlit visualizations enable you to see entities in 3-D, 2-D, and 1-D manifolds, applicable to entities and their embeddings as detected by Named Entity Recognizer models.
Finally, there is a ~7% decrease in memory consumption in NLU's core, achieved by leveraging PyArrow, which benefits every computation.
We are incredibly thankful to our community, which helped us come this far, and are looking forward to another magical year of NLU!

Streamlit Entity Manifold visualization

function pipe.viz_streamlit_entity_embed_manifold

Visualize recognized entities by NER models via their Entity Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 10+ Supported methods from Manifold Algorithms
and Matrix Decomposition Algorithms.
You can pick additional NER models and compare them via the GUI dropdown on the left.

  • Reduces Dimensionality of high dimensional Entity Embeddings to 1-D, 2-D, or 3-D and plot the resulting data in an interactive Plotly plot
  • Applicable with any of the 330+ Named Entity Recognizer models
  • Generates NUM-DIMENSIONS * NUM-NER-MODELS * NUM-DIMENSION-REDUCTION-ALGOS plots
nlu.load('ner').viz_streamlit_entity_embed_manifold(['Hello From John Snow Labs', 'Peter loves to visit New York'])

or just run

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/09_entity_embedding_manifolds.py

function parameters pipe.viz_streamlit_entity_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to visit New York", "Angela Merkel likes to visit Berlin!", 'Peter hates visiting Paris') | List of strings to apply classifiers, embeddings, and manifolds to. |
| title | str | 'NLU ❤️ Streamlit - Prototype your NLP startup in 0 lines of code🚀' | Title of the Streamlit app |
| sub_title | Optional[str] | "Apply any of the 10+ Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Entity Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA' or 'KernelPCA' |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure |
| n_jobs | Optional[int] | 3 | False |

Sentence Entity Resolver Training

Sentence Entity Resolver Training Tutorial Notebook
Named Entities are sub-pieces of textual data which are labeled with classes.
These classes and strings are still ambiguous though, and it is not possible to group semantically identical entities without any definition of terminology.
With the Sentence Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a Sentence Resolver on a dataset with columns named y, _y and text, where y is a label, _y is an extra identifier label and text is the raw text.

import pandas as pd 
import nlu
dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],

})

trainable_pipe = nlu.load('train.resolve_sentence')
fitted_pipe  = trainable_pipe.fit(dataset)
res = fitted_pipe.predict(dataset)
fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])
| | sentence_resolution_resolve_sentence_confidence | sentence_resolution_resolve_sentence_code | sentence_resolution_resolve_sentence | sentence |
|---|---|---|---|---|
| 0 | '1.0000' | '23' | 'TESLA' | 'The Tesla company is good to invest is' |
| 1 | '1.0000' | '23' | 'TESLA' | 'TSLA is good to invest' |
| 2 | '1.0000' | '23' | 'TESLA' | 'TESLA INC. we should buy' |
| 3 | '1.0000' | '23' | 'TESLA' | 'PUT ALL MONEY IN TSLA inc!!' |

Alternatively you can also use non-default healthcare embeddings.

trainable_pipe = nlu.load('en.embed.glove.biovec train.resolve_sentence')

Transformer Models

New models from the spectacular Spark NLP 3.2.0+ releases are integrated.
89 new models in total, with new LongFormer, TokenBert, TokenDistilBert and Multi-Lingual NER for 40+ languages.
The supported languages with their ISO 639-1 code are : af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.embed.longformer | longformer_base_4096 | LongformerEmbeddings | en |
| en.embed.longformer.large | longformer_large_4096 | LongformerEmbeddings | en |
| en.ner.ontonotes_roberta_base | ner_ontonotes_roberta_base | NerDLModel | en |
| en.ner.ontonotes_roberta_large | ner_ontonotes_roberta_large | NerDLModel | en |
| en.ner.ontonotes_distilbert_base_cased | [ner_ontonotes_distilbert_base_cased](https://nlp.john... | | |
Read more

Sentence Embedding Visualizations, 20+ New Models, 2 New Trainable Models, Drug Normalizer and more in John Snow Labs NLU 3.1.1

23 Jul 03:54
63b1d22

We are very excited to announce NLU 3.1.1 has been released!
It features a new Sentence Embedding visualization component for Streamlit which supports all 10+ previous dimension
reduction techniques. Additionally, all embedding visualizations now support Latent Dirichlet Allocation for dimension reduction.
Finally, 2 new trainable models for NER and chunk resolution are supported, a new drug normalizer algorithm has been added,
and there are 20+ new pre-trained models, including Multi-Lingual, German
and various healthcare models, as well as improved NER defaults when using licensed models that have NER dependencies.

Streamlit Sentence Embedding visualization via Manifold and Matrix Decomposition algorithms

function pipe.viz_streamlit_sentence_embed_manifold

Visualize Sentence Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 12 Supported methods from Manifold Algorithms
and Matrix Decomposition Algorithms.
Additionally, you can color the lower dimensional points with a label that has been previously assigned to the text by specifying a list of nlu references in the additional_classifiers_for_coloring parameter.
You can also select additional classifiers via the GUI.

  • Reduces Dimensionality of high dimensional Sentence Embeddings to 1-D, 2-D, or 3-D and plot the resulting data in an interactive Plotly plot
  • Applicable with any of the 100+ Sentence Embedding models
  • Color points by classifying with any of the 100+ Document Classifiers
  • Generates NUM-DIMENSIONS * NUM-EMBEDDINGS * NUM-DIMENSION-REDUCTION-ALGOS plots
text= """You can visualize any of the 100 + Sentence Embeddings
with 10+ dimension reduction algorithms
and view the results in 3D, 2D, and 1D  
which can be colored by various classifier labels!
"""
nlu.load('embed_sentence.bert').viz_streamlit_sentence_embed_manifold(text)

function parameters pipe.viz_streamlit_sentence_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(') | List of strings to apply classifiers, embeddings, and manifolds to. |
| text | Optional[str] | 'Billy likes to swim' | Text to predict classes for. |
| sub_title | Optional[str] | "Apply any of the 11 Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Sentence Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA' or 'KernelPCA' |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms |
| show_embed_select | bool | True | Show selector for embedding selection |
| show_color_select | bool | True | Show selector for coloring plots |
| display_embed_information | bool | True | Show additional embedding information like dimension, nlu_reference, spark_nlp_reference, storage_reference, modelhub link and more. |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| additional_classifiers_for_coloring | List[str] | ['sentiment.imdb'] | List of additional NLU references to load for generating hue colors |
| show_model_select | bool | True | Show a model selection dropdown that makes any of the 1000+ models available in 1 click |
| model_select_position | str | 'side' | Whether to output the positions of predictions or not, see pipe.predict(positions=true) for more info |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure |
| n_jobs | Optional[int] | 3 | False |

General Streamlit enhancements

Support for Latent Dirichlet Allocation

The Latent Dirichlet Allocation algorithm is now supported
for the Word Embedding Visualizations and the Sentence Embedding Visualizations

Normalization of Vectors before calculating sentence similarity.

WordEmbedding vectors will now be normalized before calculating similarity scores, which bounds each similarity between 0 and 1

Control order of plots

You can now control the order in which visualizations appear in the main GUI

Sentence Embedding Visualization

Chunk Entity Resolver Training

Chunk Entity Resolver Training Tutorial Notebook
Named Entities are sub-pieces of textual data which are labeled with classes.
These classes and strings are still ambiguous though, and it is not possible to group semantically identical entities without any definition of terminology.
With the Chunk Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a Chunk Resolver on a dataset with columns named y, _y and text, where y is a label, _y is an extra identifier label and text is the raw text.

import pandas as pd
import nlu

dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],
})


trainable_pipe = nlu.load('train.resolve_chunks')
fitted_pipe  = trainable_pipe.fit(dataset)
res = fitted_pipe.predict(dataset)
fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])
| entity_resolution_confidence | entity_resolution_code | entity_resolution | document |
|---|---|---|---|
| '1.0000' | '23' | 'TESLA' | Peter told me to buy Tesla |
| '1.0000' | '23' | 'TESLA' | I have money to loose, is TSLA a good option? |

Train with default glove embeddings

untrained_chunk_resolver = nlu.load('train.resolve_chunks')
trained_chunk_resolver  =  untrained_chunk_resolver.fit(df)
trained_chunk_resolver.predict(df)

Train with custom embeddings

# Use BioVec GloVe embeddings
untrained_chunk_resolver = nlu.load('en.embed.glove.biovec train.resolve_chunks')
trained_chunk_resolver  =  untrained_chunk_resolver.fit(df)
trained_chunk_resolver.predict(df)

Rule based NER with Context Matcher

Rule based NER with context matching tutorial notebook
Define a rule-based NER algorithm by providing Regex Patterns and resolution mappings.
The confidence value is computed using a heuristic approach based on how many matches it has.
A dictionary can be pro...

Read more

2600+ New Models for 200+ Languages and 10+ Dimension Reduction Algorithms for Streamlit Word-Embedding visualizations in 3-D

28 Jun 09:10
48d34e1

We are extremely excited to announce the release of NLU 3.1 !
This is our biggest release so far and it comes with over 2600+ new models in 200+ languages, including DistilBERT, RoBERTa, and XLM-RoBERTa and Huggingface based Embeddings from the incredible Spark-NLP 3.1.0 release,
new Streamlit Visualizations for visualizing Word Embeddings in 3-D, 2-D, and 1-D,
New Healthcare pipelines for healthcare code mappings
and finally confidence extraction for open source NER models.
Additionally, the NLU Namespace has been renamed to the NLU Spellbook, to reflect the magic of the 1-liners it represents!

Streamlit Word Embedding visualization via Manifold and Matrix Decomposition algorithms

function pipe.viz_streamlit_word_embed_manifold

Visualize Word Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 11 Supported methods from Manifold Algorithms
and Matrix Decomposition Algorithms.
Additionally, you can color the lower dimensional points with a label that has been previously assigned to the text by specifying a list of nlu references in the additional_classifiers_for_coloring parameter.

nlu.load('bert',verbose=True).viz_streamlit_word_embed_manifold(default_texts=THE_MATRIX_ARCHITECT_SCRIPT.split('\n'),default_algos_to_apply=['TSNE'],MAX_DISPLAY_NUM=5)

function parameters pipe.viz_streamlit_word_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(') | List of strings to apply classifiers, embeddings, and manifolds to. |
| text | Optional[str] | 'Billy likes to swim' | Text to predict classes for. |
| sub_title | Optional[str] | "Apply any of the 11 Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Word Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA' or 'KernelPCA' |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms |
| show_embed_select | bool | True | Show selector for embedding selection |
| show_color_select | bool | True | Show selector for coloring plots |
| MAX_DISPLAY_NUM | int | 100 | Cap maximum number of tokens displayed |
| display_embed_information | bool | True | Show additional embedding information like dimension, nlu_reference, spark_nlp_reference, storage_reference, modelhub link and more. |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| additional_classifiers_for_coloring | List[str] | ['pos', 'sentiment.imdb'] | List of additional NLU references to load for generating hue colors |
| show_model_select | bool | True | Show a model selection dropdown that makes any of the 1000+ models available in 1 click |
| model_select_position | str | 'side' | Whether to output the positions of predictions or not, see pipe.predict(positions=true) for more info |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure |
| n_jobs | Optional[int] | 3 | False |

Larger Example showcasing more dimension reduction techniques on a larger corpus :

Supported Manifold Algorithms

Supported Matrix Decomposition Algorithms

New Healthcare Pipelines

Five new healthcare code mapping pipelines:

  • nlu.load(en.resolve.icd10cm.umls): This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. You’ll just feed white space-delimited ICD10CM codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping.

{'icd10cm': ['M89.50', 'R82.2', 'R09.01'],'umls': ['C4721411', 'C0159076', 'C0004044']}

  • nlu.load(en.resolve.mesh.umls): This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. You’ll just feed white space-delimited MeSH codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping.

{'mesh': ['C028491', 'D019326', 'C579867'],'umls': ['C0970275', 'C0886627', 'C3696376']}

  • nlu.load(en.resolve.rxnorm.umls): This pretrained pipeline maps RxNorm codes to UMLS codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping.

{'rxnorm': ['1161611', '315677', '343663'],'umls': ['C3215948', 'C0984912', 'C1146501']}

  • nlu.load(en.resolve.rxnorm.mesh): This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding MeSH codes as a list. If there is no mapping, the original code is returned with no mapping.

{'rxnorm': ['1191', '6809', '47613'],'mesh': ['D001241', 'D008687', 'D019355']}

  • nlu.load(en.resolve.snomed.umls): This pretrained pipeline maps SNOMED codes to UMLS codes without using any text...
Read more

Streamlit visualizations, improved T5, models for Farsi, Hebrew, Korean, Turkish and UMLS, LOINC, HPO, Resolvers in NLU 3.0.2

03 Jun 02:50
a7b1c0b

This release contains examples and tutorials on how to visualize the 1000+ state-of-the-art NLP models provided by NLU in just 1 line of code in streamlit.
It includes simple 1-liners you can sprinkle into your Streamlit app for features like Dependency Trees, Named Entities (NER), text classification results, semantic similarity,
embedding visualizations via ELMO, BERT, ALBERT, XLNET and much more.
Additionally, improvements for T5 and various resolvers have been added, as well as models for Farsi, Hebrew, Korean and Turkish.

This is the ultimate NLP research tool. You can visualize and compare the results of hundreds of context-aware deep learning embeddings, compare them with classical vanilla embeddings like GloVe,
and see with your own eyes how context is encoded by transformer models like BERT or XLNET and many more!
Besides that, you can also compare the results of the 200+ NER models John Snow Labs provides and see how performance changes with varying embeddings, like Contextual, Static and Domain Specific Embeddings.

Install

For detailed instructions refer to the NLU install documentation here
You need Open JDK 8 installed and the following python packages

pip install nlu streamlit pyspark==3.0.1 sklearn plotly 

Problems? Connect with us on Slack!

Impatient and want some action?

Just run this Streamlit app; you can use it to generate Python code for each NLU-Streamlit building block

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py

Quick Starter cheat sheet - All you need to know in 1 picture for NLU + Streamlit

For NLU models to load, see the NLU Namespace or the John Snow Labs Modelshub or go straight to the source.
NLU Streamlit Cheatsheet

Examples

Just try out any of these.
You can use the first example to generate python-code snippets which you can
recycle as building blocks in your streamlit apps!

Example: 01_dashboard

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py

Example: 02_NER

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/02_NER.py

Example: 03_text_similarity_matrix

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/03_text_similarity_matrix.py

Example: 04_dependency_tree

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/04_dependency_tree.py

Example: 05_classifiers

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/05_classifiers.py

Example: 06_token_features

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/06_token_features.py

How to use NLU?

All you need to know about NLU is that there is the nlu.load() method, which returns an NLUPipeline object
whose .predict() method works on the most common data types in the PyData stack, like Pandas DataFrames.
On top of that, an NLUPipeline provides various visualization methods, such as viz(), that easily integrate into Streamlit as re-usable components.
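
A minimal sketch of that pattern; any spell shown on this page can be substituted for 'ner':

import nlu
import pandas as pd

pipe = nlu.load('ner')  # returns an NLUPipeline

# predict() works on strings, lists of strings and Pandas DataFrames alike
print(pipe.predict('Angela Merkel visited New York'))
print(pipe.predict(pd.DataFrame({'text': ['I love NLU and Streamlit!']})))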

Overview of NLU + Streamlit building blocks

| Method | Description |
|---|---|
| nlu.load('<Model>').predict(data) | Load any of the 1000+ models by providing the model name and predict on most Pythonic data structures like Pandas, strings, arrays of strings and more |
| nlu.load('<Model>').viz_streamlit(data) | Display a full NLU exploration dashboard that showcases every feature available, with dropdown selectors for 1000+ models |
| nlu.load('<Model>').viz_streamlit_similarity([string1, string2]) | Display similarity matrix and scalar similarity for every word embedding loaded and 2 strings. |
| nlu.load('<Model>').viz_streamlit_ner(data) | Visualize predicted NER tags from Named Entity Recognizer model |
| nlu.load('<Model>').viz_streamlit_dep_tree(data) | Visualize Dependency Tree together with Part of Speech labels |
| nlu.load('<Model>').viz_streamlit_classes(data) | Display all extracted class features and confidences for every classifier loaded in the pipeline |
| nlu.load('<Model>').viz_streamlit_token(data) | Display all detected token features and information in Streamlit |
| nlu.load('<Model>').viz(data, write_to_streamlit=True) | Display the raw visualization without any UI elements. See viz docs for more info. By default all applicable NLU model references will be shown. |
| nlu.enable_streamlit_caching() | Enable caching of the nlu.load() call. Once enabled, nlu.load() calls will automatically be cached. It is recommended to run this first, for large performance gains |
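
Putting a few of these building blocks together, a hedged sketch of a tiny Streamlit app (the file name my_nlu_app.py is hypothetical; run it with streamlit run my_nlu_app.py):

import nlu
import streamlit as st

# Cache nlu.load() calls so models are not reloaded on every Streamlit rerun
nlu.enable_streamlit_caching()

st.title('NLU + Streamlit demo')
text = st.text_area('Enter some text', 'I love NLU and Streamlit!')

# Render NER predictions as a re-usable Streamlit component
nlu.load('ner').viz_streamlit_ner(text)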

Detailed visualizer information and API docs

function pipe.viz_streamlit

Display a highly configurable UI that showcases almost every feature available for Streamlit visualization with model selection dropdowns in your applications.
This includes:

  • Similarity Matrix & Scalars & Embedding Information for any of the 100+ Word Embedding Models
  • NER visualizations for any of the 200+ Named Entity Recognizers
  • Labeled & Unlabeled Dependency Tree visualizations with Part of Speech tags for any of the 100+ Part of Speech Models
  • Token information predicted by any of the 1000+ models
  • Classification results predicted by any of the 100+ classification models
  • Pipeline Configuration & Model Information & Link to John Snow Labs Modelshub for all loaded pipelines
  • Auto-generated Python code that can be copy-pasted to re-create the individual Streamlit visualization blocks.
    NLU takes the first model specified in nlu.load() for the first visualization run.
    Once the Streamlit app is running, additional models can easily be added via the UI.
    It is recommended to run this first, since you can generate Python code snippets to recreate individual Streamlit visualization blocks
nlu.load('ner').viz_streamlit(['I love NLU and Streamlit!','I hate buggy software'])

NLU Streamlit UI Overview

function parameters pipe.viz_streamlit

| Argument | Type | Default | Description |
|---|---|---|---|
| text | Union[str, List[str], pd.DataFrame, pd.Series] | 'NLU and Streamlit go together like peanutbutter and jelly' | Default text for the Classification, Named Entity Recognizer, Token Information and Dependency Tree visualizations |
| similarity_texts | Union[List[str], Tuple[str,str]] | ('Donald Trump Likes to part', 'Angela Merkel likes to party') | Default texts for the text similarity visualization. Should contain exactly 2 strings, which will be compared token-embedding-wise. For each active embedding, a token-wise similarity matrix and a similarity scalar are shown. |
| model_selection | List[str] | [] | List of nlu references to display in the model selector, see [the NLU Namespace](https://nlu.johnsno... |
Read more