
Releases: JohnSnowLabs/nlu

Multilingual DeBERTa Transformer Embeddings for 100+ Languages, Spanish Deidentification and NER for Randomized Clinical Trials - John Snow Labs NLU 3.4.2

23 Mar 15:58
f9f8bb9


We are very excited to announce that NLU 3.4.2 has been released.
On the open-source side there are 5 new DeBERTa Transformer models: 4 for English and a multilingual model covering 100+ languages.
DeBERTa improves over BERT and RoBERTa by introducing two novel techniques: disentangled attention and an enhanced mask decoder.

On the healthcare side, there are new models for randomized clinical trials (RCT) which can detect sections of type
BACKGROUND, CONCLUSIONS, METHODS, OBJECTIVE and RESULTS in clinical text.
Additionally, there are new Spanish de-identification NER models for entities like STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL and many more.

New Open Source Models

Integrates models from Spark NLP 3.4.2 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.embed.deberta_v3_xsmall | deberta_v3_xsmall | Embeddings | DeBertaEmbeddings |
| en | en.embed.deberta_v3_small | deberta_v3_small | Embeddings | DeBertaEmbeddings |
| en | en.embed.deberta_v3_base | deberta_v3_base | Embeddings | DeBertaEmbeddings |
| en | en.embed.deberta_v3_large | deberta_v3_large | Embeddings | DeBertaEmbeddings |
| xx | xx.embed.mdeberta_v3_base | mdeberta_v3_base | Embeddings | DeBertaEmbeddings |
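
For reference, a minimal sketch of loading one of these DeBERTa spells with the standard nlu.load()/predict() pattern; the exact output column names depend on your NLU version.

import nlu

# Load the English DeBERTa v3 base embeddings in 1 line
pipe = nlu.load('en.embed.deberta_v3_base')

# predict() accepts strings, lists of strings or Pandas DataFrames
df = pipe.predict('John Snow Labs ships new DeBERTa embeddings.')
print(df.columns)  # token and embedding columns
print(df.head())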

New Healthcare Models

Integrates models from Spark NLP For Healthcare 3.4.2 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.med_ner.clinical_trials | bert_sequence_classifier_rct_biobert | Text Classification | MedicalBertForSequenceClassification |
| es | es.med_ner.deid.generic.roberta | ner_deid_generic_roberta_augmented | De-identification | MedicalNerModel |
| es | es.med_ner.deid.subentity.roberta | ner_deid_subentity_roberta_augmented | De-identification | MedicalNerModel |
| en | en.med_ner.deid.generic_augmented | ner_deid_generic_augmented | Named Entity Recognition, De-identification | MedicalNerModel |
| en | en.med_ner.deid.subentity_augmented | ner_deid_subentity_augmented | Named Entity Recognition, De-identification | MedicalNerModel |
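
As a hedged sketch of one of the new Spanish de-identification spells (these are licensed models, so a Spark NLP for Healthcare license must be configured first; the sample sentence and output columns are illustrative only):

import nlu

# Spanish de-identification NER (licensed healthcare model)
deid_pipe = nlu.load('es.med_ner.deid.generic.roberta')

sample = 'Antonio Perez ingreso en el Hospital Clinic el 03/03/2022.'
print(deid_pipe.predict(sample))  # detected entities such as PATIENT, HOSPITAL or DATE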

Additional NLU resources

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark streamlit==0.80.0

22 New models for 23 languages including various African and Indian languages, Medical Spanish models and more in NLU 3.4.1

23 Feb 01:44

We are very excited to announce the release of NLU 3.4.1,
which features 22 new models for 23 languages.
The open-source side covers new embeddings for Vietnamese and the English clinical domain, and multilingual embeddings for 12 Indian and 9 African languages.
Additionally, there are new sequence classifiers for multilingual NER in 9 African languages,
German sentiment classifiers, and English emotion and typo classifiers.
The healthcare side covers medical Spanish models, classifiers for drugs, gender and the PICO framework, and relation extractors for adverse drug events and temporality.
Finally, Spark 3.2.x is now supported and bugs related to Databricks environments have been fixed.

General NLU Improvements

  • Support for Spark 3.2.x

New Open Source Models

Based on the amazing Spark NLP 3.4.1 release, this integrates
new multilingual embeddings for 12 major Indian languages and
embeddings for Vietnamese, French, and the English clinical domain.
Additionally, there is a new multilingual NER model for 9 African languages, an English 6-class emotion classifier and a typo detector.

New Embeddings

  • Multilingual ALBERT - IndicBert model pretrained exclusively on 12 major Indian languages, smaller in size and with performance on par with or better than competing models. Languages covered are Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
    Available with xx.embed.albert.indic
  • Fine tuned Vietnamese DistilBERT Base cased embeddings. Available with vi.embed.distilbert.cased
  • Clinical Longformer Embeddings which consistently outperform ClinicalBERT on various downstream
    tasks and datasets. Available with en.embed.longformer.clinical
  • Fine tuned Static French Word2Vec Embeddings in 3 sizes, 200d, 300d and 100d. Available with fr.embed.word2vec_wiki_1000, fr.embed.word2vec_wac_200 and fr.embed.w2v_cc_300d

New Transformer based Token and Sequence Classifiers

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| xx | xx.embed.albert.indic | albert_indic | Embeddings | AlbertEmbeddings |
| xx | xx.ner.masakhaner.distilbert | xlm_roberta_large_token_classifier_masakhaner | Named Entity Recognition | DistilBertForTokenClassification |
| en | en.embed.longformer.clinical | clinical_longformer | Embeddings | LongformerEmbeddings |
| en | en.classify.emotion.bert | bert_sequence_classifier_emotion | Text Classification | BertForSequenceClassification |
| de | de.classify.news_sentiment.bert | bert_sequence_classifier_news_sentiment | Sentiment Analysis | BertForSequenceClassification |
| en | en.classify.typos.distilbert | distilbert_token_classifier_typo_detector | Named Entity Recognition | DistilBertForTokenClassification |
| fr | fr.embed.word2vec_wiki_1000 | word2vec_wiki_1000 | Embeddings | WordEmbeddingsModel |
| fr | fr.embed.word2vec_wac_200 | word2vec_wac_200 | Embeddings | WordEmbeddingsModel |
| fr | fr.embed.w2v_cc_300d | w2v_cc_300d | Embeddings | WordEmbeddingsModel |
| vi | vi.embed.distilbert.cased | distilbert_base_cased | Embeddings | DistilBertEmbeddings |
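
For illustration, a minimal sketch of the new English emotion classifier spell from the table above (the predicted-class column name may vary between NLU versions):

import nlu

emotion_pipe = nlu.load('en.classify.emotion.bert')
preds = emotion_pipe.predict(['I am so happy about this release!',
                              'This bug makes me really angry.'])
print(preds)  # one predicted emotion class per sentence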

New Healthcare Models

Integrated from the amazing Spark NLP for Healthcare 3.4.1 release,
which makes 2 new annotator classes available, MedicalBertForSequenceClassification and MedicalDistilBertForSequenceClassification,
along with various medical Spanish models, RxNorm resolvers,
transformer-based sequence classifiers for Drugs, Gender and the PICO framework,
and relation extractors for Temporality and Causality of Drugs and Adverse Events.

New Medical Spanish Models

New Resolvers

New Transformer based Sequence Classifiers

New Relation Extractors

Read more

1 line to OCR for images, PDFs and DOCX, Text Generation with GPT2 and new T5 models, Sequence Classification with XlmRoBerta, RoBerta, Xlnet, Longformer and Albert, Transformer based medical NER with MedicalBertForTokenClassifier, 80 new models, 20+ new languages including various African and Scandinavian languages and much more in John Snow Labs NLU 3.4.0!

22 Jan 18:43
c0dfd53

We are incredibly excited to announce John Snow Labs NLU 3.4.0 has been released!
This release features 11 new annotator classes and 80 new models, including 3 OCR Transformers which enable you to extract text
from various file types, support for GPT2 and new pretrained T5 models for Text Generation, and dozens more new transformer-based models
for Token and Sequence Classification.
This includes 8 new Sequence Classifier models which can be pretrained in Hugging Face and imported into Spark NLP and NLU.
Finally, the NLU tutorial page with its 140+ notebooks has been updated.

New NLU OCR Features

3 new OCR-based spells are supported, which enable extracting text from files of type
JPEG, PNG, BMP, WBMP, GIF, JPG, TIFF, DOCX and PDF in just 1 line of code.
You need a Spark OCR license to use these, which is available for free, and should refer to the new
OCR tutorial notebook.
Find more details on the NLU OCR documentation page.

New NLU Healthcare Features

The healthcare side features a new MedicalBertForTokenClassifier annotator, a BERT-based model for token classification problems like Named Entity Recognition,
Part of Speech tagging and much more. Overall there are 28 new models, which include German De-Identification models, English NER models for extracting Drug Development Trials,
Clinical Abbreviations and Acronyms, NER models for chemical compounds/drugs and genes/proteins, and updated MedicalBertForTokenClassifier NER models for the medical domains Adverse Drug Events,
Anatomy, Chemicals, Genes/Proteins, Cellular/Molecular Biology, Drugs, Bacteria, De-Identification and general Medical and Clinical Named Entities.
For Entity Relation Extraction between entity pairs, there are new models for interactions between Drugs and Proteins.
For Entity Resolution, there are new models for resolving Clinical Abbreviations and Acronyms to their full-length names, a model for resolving Drug Substance Entities to the categories
Clinical Drug, Pharmacologic Substance, Antibiotic, Hazardous or Poisonous Substance, and new resolvers for LOINC and SNOMED terminologies.

New NLU Open source Features

On the open-source side we have new support for OpenAI's GPT2 for various text sequence-to-sequence problems, and
additionally the following new Transformer models are supported:
RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, LongformerForSequenceClassification,
AlbertForSequenceClassification, XlnetForSequenceClassification, and Word2Vec with various pre-trained weights for various problems!

There are new GPT2 models for generating text conditioned on some input, and
new T5 style-transfer models for active-to-passive, formal-to-informal, informal-to-formal and passive-to-active sequence-to-sequence generation.
Additionally, a new T5 model for generating SQL code from natural language input is provided.
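
As an illustrative sketch of these sequence-to-sequence spells (the generated text and output columns depend on the model and NLU version):

import nlu

# Open-ended text generation with GPT2
gpt2_pipe = nlu.load('en.gpt2')
print(gpt2_pipe.predict('The future of NLP is'))

# Natural language to SQL with the T5 wikiSQL model
sql_pipe = nlu.load('en.t5.wikiSQL')
print(sql_pipe.predict('How many customers live in New York?'))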

On top of this, dozens of new Transformer-based Sequence Classifiers and Token Classifiers have been released. For Token Classification this includes the following models:
Multi-Lingual general NER models for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Nigerian Pidgin, Swahili, Wolof, and Yorùbá),
10 high-resource languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese),
6 Scandinavian languages (Danish, Norwegian-Bokmål, Norwegian-Nynorsk, Swedish, Icelandic, Faroese),
uni-lingual NER models for general entities in Chinese, Hindi, Icelandic and Indonesian,
and finally English NER models for extracting entities related to Stock Ticker Symbols, Restaurants and Time.

For Sequence Classification, there are new models for classifying Toxicity in Russian text and English models for
Movie Reviews, News Categorization, Sentimental Tone and General Sentiment.

New NLU OCR Models

The following Transformers have been integrated from Spark OCR

| NLU Spell | Transformer Class |
|---|---|
| nlu.load(img2text) | ImageToText |
| nlu.load(pdf2text) | PdfToText |
| nlu.load(doc2text) | DocToText |
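
A hedged sketch of the img2text spell; a Spark OCR license must be configured first, and predict() is assumed here to accept a path to an image file (the path below is hypothetical):

import nlu

# Requires a valid Spark OCR license to be set up beforehand
ocr_pipe = nlu.load('img2text')

# Hypothetical input path; pass the image file(s) you want to extract text from
df = ocr_pipe.predict('path/to/scanned_page.png')
print(df)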

New Open Source Models

Integration for the 49 new models from the colossal Spark NLP 3.4.0 release

| Language | NLU Reference | Spark NLP Reference | Task | Annotator Class |
|---|---|---|---|---|
| en | en.gpt2.distilled | gpt2_distilled | Text Generation | GPT2Transformer |
| en | en.gpt2 | gpt2 | Text Generation | GPT2Transformer |
| en | en.gpt2.medium | gpt2_medium | Text Generation | GPT2Transformer |
| en | en.gpt2.large | gpt_large | Text Generation | GPT2Transformer |
| en | en.t5.active_to_passive_styletransfer | t5_active_to_passive_styletransfer | Text Generation | T5Transformer |
| en | en.t5.formal_to_informal_styletransfer | t5_formal_to_informal_styletransfer | Text Generation | T5Transformer |
| en | en.t5.grammar_error_corrector | t5_grammar_error_corrector | Text Generation | T5Transformer |
| en | en.t5.informal_to_formal_styletransfer | t5_informal_to_formal_styletransfer | Text Generation | T5Transformer |
| en | en.t5.passive_to_active_styletransfer | t5_passive_to_active_styletransfer | Text Generation | T5Transformer |
| en | en.t5.wikiSQL | t5_small_wikiSQL | Text Generation | ... |
Read more

48 new Transformer based models in 9 new languages, including NER for Finance, Industry, Political Policies, COVID and Chemical Trials, various clinical and medical domains in Spanish and English and much more in NLU 3.3.1

06 Dec 15:42
fd7e73b

We are incredibly excited to announce NLU 3.3.1 has been released with 48 new models in 9 languages!

It comes with 2 new types of state-of-the-art models, DistilBERT and BERT for sequence classification with various pre-trained weights, and
state-of-the-art BERT-based classifiers for problems in the domains of Finance, Sentiment Classification, Industry, News, and much more.

On the healthcare side, NLU features 22 new models for English and Spanish, with
Entity Resolver models for LOINC, MeSH, NDC, SNOMED and UMLS Diseases,
NER models for Biomarkers, NIHSS Guidelines, COVID Trials and Chemical Trials,
BERT-based Token Classifier models for biological, genetic, cancer and cellular terms,
BERT for Sequence Classification models for clinical question vs. statement classification,
and finally Spanish Clinical NER and Resolver models.

Once again, we would like to thank our community for making another amazing release possible!

New Open Source Models and Features

Integrates the amazing Spark NLP 3.3.3 and 3.3.2 releases, featuring:

  • New state-of-the-art fine-tuned BERT models for Sequence Classification in English, French, German, Spanish, Japanese, Turkish, Russian, and multilingual languages.
  • DistilBertForSequenceClassification models in English, French and Urdu
  • Word2Vec models.
  • classify.distilbert_sequence.banking77 : Banking intent classification model trained on the BANKING77 dataset, which provides a very fine-grained set of intents in the banking domain. It comprises 13,083 customer service queries labeled with 77 intents and focuses on fine-grained single-domain intent detection. Can predict intent classes like activate_my_card, age_limit, apple_pay_or_google_pay, atm_support, automatic_top_up, balance_not_updated_after_bank_transfer, balance_not_updated_after_cheque_or_cash_deposit, beneficiary_not_allowed, cancel_transfer, card_about_to_expire, card_acceptance, card_arrival, card_delivery_estimate, card_linking, card_not_working, card_payment_fee_charged, card_payment_not_recognised, card_payment_wrong_exchange_rate, card_swallowed, cash_withdrawal_charge, cash_withdrawal_not_recognised, change_pin, compromised_card, contactless_not_working, country_support, declined_card_payment, declined_cash_withdrawal, declined_transfer, direct_debit_payment_not_recognised, disposable_card_limits, edit_personal_details, exchange_charge, exchange_rate, exchange_via_app, extra_charge_on_statement, failed_transfer, fiat_currency_support, get_disposable_virtual_card, get_physical_card, getting_spare_card, getting_virtual_card, lost_or_stolen_card, lost_or_stolen_phone, order_physical_card, passcode_forgotten, pending_card_payment, pending_cash_withdrawal, pending_top_up, pending_transfer, pin_blocked, receiving_money,
  • classify.distilbert_sequence.industry : Industry classification model which can predict industry classes like Advertising, Aerospace & Defense, Apparel Retail, Apparel, Accessories & Luxury Goods, Application Software, Asset Management & Custody Banks, Auto Parts & Equipment, Biotechnology, Building Products, Casinos & Gaming, Commodity Chemicals, Communications Equipment, Construction & Engineering, Construction Machinery & Heavy Trucks, Consumer Finance, Data Processing & Outsourced Services, Diversified Metals & Mining, Diversified Support Services, Electric Utilities, Electrical Components & Equipment, Electronic Equipment & Instruments, Environmental & Facilities Services, Gold, Health Care Equipment, Health Care Facilities, Health Care Services.
  • xx.classify.bert_sequence.sentiment : Multi-Lingual Sentiment Classifier. This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5). This model is intended for direct use as a sentiment analysis model for product reviews in any of the six languages above, or for further finetuning on related sentiment analysis tasks (see the sketch after this list).
  • distilbert_sequence.policy : Policy Classifier. This model was trained on 129,669 manually annotated sentences to classify text into one of seven political categories: ‘Economy’, ‘External Relations’, ‘Fabric of Society’, ‘Freedom and Democracy’, ‘Political System’, ‘Welfare and Quality of Life’ or ‘Social Groups’.
  • classify.bert_sequence.dehatebert_mono : Hate Speech Classifier for detecting hate speech in text.
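
For example, a minimal sketch of the multilingual sentiment spell described above, which predicts a 1-5 star rating for product reviews (the output column name may differ by NLU version):

import nlu

sentiment_pipe = nlu.load('xx.classify.bert_sequence.sentiment')
reviews = ['Das Produkt ist hervorragend!',           # German
           'Ce telephone est une grosse deception.']  # French
print(sentiment_pipe.predict(reviews))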

Complete List of Open Source Models :

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| en | en.classify.bert_sequence.imdb_large | bert_large_sequence_classifier_imdb | Text Classification |
| en | en.classify.bert_sequence.imdb | bert_base_sequence_classifier_imdb | Text Classification |
| en | en.classify.bert_sequence.ag_news | bert_base_sequence_classifier_ag_news | Text Classification |
| en | en.classify.bert_sequence.dbpedia_14 | bert_base_sequence_classifier_dbpedia_14 | Text Classification |
| en | en.classify.bert_sequence.finbert | bert_sequence_classifier_finbert | Text Classification |
| en | en.classify.bert_sequence.dehatebert_mono | bert_sequence_classifier_dehatebert_mono | Text Classification |
| tr | tr.classify.bert_sequence.sentiment | bert_sequence_classifier_turkish_sentiment | Text Classification |
| de | de.classify.bert_sequence.sentiment | bert_sequence_classifier_sentiment | Text Classification |
| ru | ru.classify.bert_sequence.sentiment | bert_sequence_classifier_rubert_sentiment | Text Classification |
| ja | ja.classify.bert_sequence.sentiment | bert_sequence_classifier_japanese_sentiment | Text Classification |
| es | es.classify.bert_sequence.sentiment | bert_sequence_classifier_beto_sentiment_analysis | Text Classification |
| es | es.classify.bert_sequence.emotion | bert_sequence_classifier_beto_emotion_analysis | Text Classification |
| xx | [xx.classify.bert_sequence.sentiment](https://nlp.johnsnowlabs.com/2021/11/03/bert_seq... | | |
Read more

2000%+ Speedup on small data, 63 new models for 100+ Languages with 6 new supported Transformer classes including BERT, XLM-RoBERTa, alBERT, Longformer, XLnet based models, 48 NER profiling healthcare pipelines and much more in John Snow Labs NLU 3.3.0

30 Oct 07:29
e612a9c

We are incredibly excited to announce NLU 3.3.0 has been released!
It comes with an up to 2000%+ speedup on small datasets and 6 new types of deep learning transformer models, including
RoBertaForTokenClassification, XlmRoBertaForTokenClassification, AlbertForTokenClassification, LongformerForTokenClassification, XlnetForTokenClassification and XlmRoBertaSentenceEmbeddings.
In total there are 63 new NLP models and 6 newly supported languages (Igbo, Ganda, Dholuo, Naija, Wolof and Kinyarwanda, with their corresponding ISO codes ig, lg, lou, pcm, wo and rw),
with new SOTA XLM-RoBERTa models for the Luganda, Kinyarwanda, Igbo, Hausa and Amharic languages, and 2 new Multilingual Embeddings with 100+ supported languages via XLM-RoBERTa.

On the healthcare NLP side we are glad to announce 18 new Spark NLP for Healthcare models, including:

  • NER Profiling pretrained pipelines to run 48 different Clinical NER and 21 different BioBERT models at once over the input text
  • A new BERT-based De-identification NER model
  • Sentence Entity Resolver models for the German language
  • A new Spell Checker model for drugs
  • 3 new Sentence Entity Resolver models (3-char ICD10CM, RxNorm_NDC, HCPCS)
  • 5 new Clinical NER models (trained with the BertForTokenClassification approach)
  • A Radiology NER model trained on the cheXpert dataset
  • New UMLS Sentence Entity Resolver models

Additionally, 2 new tutorials are available: the NLU & Streamlit Crashcourse and the NLU for Healthcare Crashcourse, covering each of the 50+ healthcare domains and 200+ healthcare models.

New Features and Improvements

2000%+ Speedup prediction for small datasets

NLU pipelines now predict up to 2000% faster on small datasets by optimizing the integration with Spark NLP's light pipelines.
NLU configures this automatically, but it can also be turned off via multithread=False, as sketched below.
NLU 3.3.0 Benchmark
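
A hedged sketch of overriding this behaviour, assuming multithread is passed to predict() as the note above suggests:

import nlu

pipe = nlu.load('sentiment.imdb')

# Default: NLU decides automatically and uses Spark NLP light pipelines on small inputs
fast_preds = pipe.predict('NLU is fast on small datasets')

# Explicitly disable the light-pipeline optimization
regular_preds = pipe.predict('NLU is fast on small datasets', multithread=False)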

50x faster saving of NLU Pipelines

Up to 50x faster saving Spark NLP/ NLU models and pipelines! We have improved the way we package TensorFlow SavedModel while saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the xlm_roberta_base model before Spark NLP 3.3.0, and now it only takes up to 15 seconds!

New Annotator Classes Integrated

The following new transformer classes are available with various pretrained weights in 1 line of code :

New Transformer Models

The following models are available from the amazing Spark NLP
3.3.0 and
3.3.1 releases,
which include NLP models for
Yiddish, Ukrainian, Telugu, Tamil, Somali, Sindhi, Russian, Punjabi, Nepali, Marathi, Malayalam, Kannada, Indonesian, Gujarati, Bosnian, Igbo, Ganda, Dholuo, Naija, Wolof and Kinyarwanda.

| Language | NLU Reference | Spark NLP Reference | Task |
|---|---|---|---|
| ig | ig.embed.xlm_roberta | xlm_roberta_base_finetuned_igbo | Embeddings |
| ig | ig.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_igbo | Embeddings |
| lg | lg.embed.xlm_roberta | xlm_roberta_base_finetuned_luganda | Embeddings |
| lg | lg.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_luganda | Embeddings |
| wo | wo.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_wolof | Embeddings |
| wo | wo.embed.xlm_roberta | xlm_roberta_base_finetuned_wolof | Embeddings |
| rw | rw.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_kinyarwanda | Embeddings |
| rw | rw.embed.xlm_roberta | xlm_roberta_base_finetuned_kinyarwanda | Embeddings |
| sw | sw.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_swahili | Embeddings |
| sw | sw.embed.xlm_roberta | xlm_roberta_base_finetuned_swahili | Embeddings |
| ha | ha.embed.xlm_roberta | xlm_roberta_base_finetuned_hausa | Embeddings |
| ha | ha.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_hausa | Embeddings |
| am | am.embed.xlm_roberta | xlm_roberta_base_finetuned_amharic | Embeddings |
| am | am.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_amharic | Embeddings |
| yo | yo.embed_sentence.xlm_roberta | sent_xlm_roberta_base_finetuned_yoruba | Embeddings |
| yo | yo.embed.xlm_roberta | [xlm_roberta_base_finetuned_yoruba](https://nlp.johnsnowl... | |
Read more

27 new models in 7 Languages, including Japanese NER, resolution models for SNOMED, ICDO, CPT and RxNorm codes and much more in NLU 3.2.1

18 Sep 12:42
41e5a30

We are very excited to announce NLU 3.2.1!
This release comes with 27 new transformer-based models for 7 languages.
There are new NER classifiers, BertSentenceEmbeddings, BertEmbeddings and BertForTokenClassification models
for Japanese, German, Dutch, Swedish, Spanish, French and English.
For healthcare there are new Entity Resolvers and MedicalNerModels
for SNOMED Conditions, CPT Measurements, ICDO, RxNorm Dispositions, Posology and De-identification.
Finally, a new tutorial notebook and a webinar are available, which showcase almost every feature of NLU
for the over 50 domains in Healthcare/Clinical/Biomedical/etc.

New Transformer Models

Models in Japanese, German, Dutch, Swedish, Spanish, French and English from the great Spark NLP 3.2.3 release

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.embed.bert.base_uncased_legal | bert_base_uncased_legal | BertEmbeddings | en |
| en.embed_sentence.bert.base_uncased_legal | sent_bert_base_uncased_legal | BertSentenceEmbeddings | en |
| en.embed.token_bert.classifier_ner_btc | bert_token_classifier_ner_btc | BertForTokenClassification | en |
| es.embed.bert.base_uncased | bert_base_uncased | BertEmbeddings | es |
| es.embed.bert.base_cased | bert_base_cased | BertEmbeddings | es |
| es.embed_sentence.bert.base_uncased | sent_bert_base_uncased | BertSentenceEmbeddings | es |
| es.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | es |
| el.embed.bert.base_uncased | bert_base_uncased | BertEmbeddings | el |
| el.embed_sentence.bert.base_uncased | sent_bert_base_uncased | BertSentenceEmbeddings | el |
| sv.embed.bert.base_cased | bert_base_cased | BertEmbeddings | sv |
| sv.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | sv |
| nl.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | nl |
| nl.embed.bert.base_cased | bert_base_cased | BertEmbeddings | nl |
| fr.classify.sentiment.bert | classifierdl_bert_sentiment | ClassifierDLModel | fr |
| ja.embed.glove.cc_300d | japanese_cc_300d | WordEmbeddingsModel | ja |
| ja.ner.ud_gsd_cc_300d | ner_ud_gsd_cc_300d | NerDLModel | ja |
| ja.ner.ud_gsd_xlm_roberta_base | ner_ud_gsd_xlm_roberta_base | NerDLModel | ja |
| ja.embed.token_bert.classifier_ner_ud_gsd | bert_token_classifier_ner_ud_gsd | BertForTokenClassification | ja |
| de.embed_sentence.bert.base_cased | sent_bert_base_cased | BertSentenceEmbeddings | de |
| de.classify.sentiment.bert | classifierdl_bert_sentiment | ClassifierDLModel | de |

New Healthcare Transformer Models

Models for SNOMED Conditions, CPT Measurements, ICDO, RxNorm Dispositions, Posology and De-identification from the amazing Spark NLP for Healthcare 3.2.2 release

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.resolve.snomed_conditions | sbertresolve_snomed_conditions | SentenceEntityResolverModel | en |
| en.resolve.cpt.procedures_measurements | sbiobertresolve_cpt_procedures_measurements_augmented | SentenceEntityResolverModel | en |
| en.resolve.icdo.base | sbiobertresolve_icdo_base | SentenceEntityResolverModel | en |
| en.resolve.rxnorm.disposition.sbert | sbertresolve_rxnorm_disposition | SentenceEntityResolverModel | en |
| en.resolve.rxnorm_disposition.sbert | sbertresolve_rxnorm_disposition | SentenceEntityResolverModel | en |
| en.med_ner.posology.experimental | ner_posology_experimental | MedicalNerModel | en |
| en.med_ner.deid.subentity_augmented | ner_deid_subentity_augmented | MedicalNerModel | en |

New Notebooks

Enhancements

  • Columns of the Pandas DataFrame returned by NLU will now be sorted alphabetically

Bugfixes

  • Fixed a bug that caused output levels to not be inferred properly
  • Fixed a bug that caused SentenceResolver visualizations not to appear.

100+ Transformers Models in 40+ languages, 3-D Streamlit Entity-Embedding-Manifold visualizations, Multi-Lingual NER, Longformers, TokenDistilBERT, Trainable Sentence Resolvers, 7% less memory usage and much more in NLU 3.2.0

11 Sep 09:58
2e12d75

We are extremely excited to announce the release of NLU 3.2.0
which marks the 1-year anniversary of the birth of this magical library.
This release packs features and improvements into every part of NLU:
89 new NLP models, including new Longformer, TokenBert and TokenDistilBert models and Multi-Lingual NER for 40+ languages, and
12 new Healthcare models, with trainable sentence resolvers and models for Adverse Drug Relations, Clinical Token Bert models, NER models for Radiology, Drugs, Posology and Administration Cycles, RxNorm resolvers, and new Medical Assertion models.
New Streamlit visualizations enable you to see entities in 3-D, 2-D, and 1-D manifolds, applicable to entities and their embeddings as detected by Named Entity Recognizer models.
Finally, there is a ~7% decrease in memory consumption in NLU's core, achieved by leveraging PyArrow, which benefits every computation.
We are incredibly thankful to our community, which helped us come this far, and are looking forward to another magical year of NLU!

Streamlit Entity Manifold visualization

function pipe.viz_streamlit_entity_embed_manifold

Visualize recognized entities by NER models via their Entity Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 10+ Supported methods from Manifold Algorithms
and Matrix Decomposition Algorithms.
You can pick additional NER models and compare them via the GUI dropdown on the left.

  • Reduces Dimensionality of high dimensional Entity Embeddings to 1-D, 2-D, or 3-D and plot the resulting data in an interactive Plotly plot
  • Applicable with any of the 330+ Named Entity Recognizer models
  • Generates NUM-DIMENSIONS * NUM-NER-MODELS * NUM-DIMENSION-REDUCTION-ALGOS plots
nlu.load('ner').viz_streamlit_entity_embed_manifold(['Hello From John Snow Labs', 'Peter loves to visit New York'])

or just run

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/09_entity_embedding_manifolds.py

function parameters pipe.viz_streamlit_entity_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to visit New York", "Angela Merkel likes to visit Berlin!", 'Peter hates visiting Paris') | List of strings to apply classifiers, embeddings, and manifolds to. |
| title | str | 'NLU ❤️ Streamlit - Prototype your NLP startup in 0 lines of code🚀' | Title of the Streamlit app |
| sub_title | Optional[str] | "Apply any of the 10+ Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Entity Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA' or 'KernelPCA' |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure |
| n_jobs | Optional[int] | 3 | False |

Sentence Entity Resolver Training

Sentence Entity Resolver Training Tutorial Notebook
Named Entities are sub-pieces of textual data which are labeled with classes.
These classes and strings are still ambiguous though, and it is not possible to group semantically identical entities without any definition of terminology.
With the Sentence Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a Sentence Resolver on a dataset with columns named y, _y and text, where y is a label, _y is an extra identifier label and text is the raw text.

import pandas as pd 
import nlu
dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],

})

trainable_pipe = nlu.load('train.resolve_sentence')
fitted_pipe  = trainable_pipe.fit(dataset)
res = fitted_pipe.predict(dataset)
fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])
| | sentence_resolution_resolve_sentence_confidence | sentence_resolution_resolve_sentence_code | sentence_resolution_resolve_sentence | sentence |
|---|---|---|---|---|
| 0 | '1.0000' | '23' | 'TESLA' | 'The Tesla company is good to invest is' |
| 1 | '1.0000' | '23' | 'TESLA' | 'TSLA is good to invest' |
| 2 | '1.0000' | '23' | 'TESLA' | 'TESLA INC. we should buy' |
| 3 | '1.0000' | '23' | 'TESLA' | 'PUT ALL MONEY IN TSLA inc!!' |

Alternatively you can also use non-default healthcare embeddings.

trainable_pipe = nlu.load('en.embed.glove.biovec train.resolve_sentence')

Transformer Models

New models from the spectacular Spark NLP 3.2.0+ releases are integrated.
89 new models in total, with new LongFormer, TokenBert, TokenDistilBert and Multi-Lingual NER for 40+ languages.
The supported languages with their ISO 639-1 code are : af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.embed.longformer | longformer_base_4096 | LongformerEmbeddings | en |
| en.embed.longformer.large | longformer_large_4096 | LongformerEmbeddings | en |
| en.ner.ontonotes_roberta_base | ner_ontonotes_roberta_base | NerDLModel | en |
| en.ner.ontonotes_roberta_large | ner_ontonotes_roberta_large | NerDLModel | en |
| en.ner.ontonotes_distilbert_base_cased | [ner_ontonotes_distilbert_base_cased](https://nlp.john... | | |
Read more

Sentence Embedding Visualizations, 20+ New Models, 2 New Trainable Models, Drug Normalizer and more in John Snow Labs NLU 3.1.1

23 Jul 03:54
63b1d22

We are very excited to announce NLU 3.1.1 has been released!
It features a new Sentence Embedding visualization component for Streamlit which supports all 10+ previous dimension
reduction techniques. Additionally, all embedding visualizations now support Latent Dirichlet Allocation for dimension reduction.
Finally, 2 new trainable models for NER and chunk resolution are supported, a new drug normalizer algorithm has been added,
and there are 20+ new pre-trained models, including Multi-Lingual, German
and various healthcare models, as well as improved NER defaults when using licensed models that have NER dependencies.

Streamlit Sentence Embedding visualization via Manifold and Matrix Decomposition algorithms

function pipe.viz_streamlit_sentence_embed_manifold

Visualize Sentence Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 12 Supported methods from Manifold Algorithms
and Matrix Decomposition Algorithms.
Additionally, you can color the lower dimensional points with a label that has been previously assigned to the text by specifying a list of nlu references in the additional_classifiers_for_coloring parameter.
You can also select additional classifiers via the GUI.

  • Reduces Dimensionality of high dimensional Sentence Embeddings to 1-D, 2-D, or 3-D and plot the resulting data in an interactive Plotly plot
  • Applicable with any of the 100+ Sentence Embedding models
  • Color points by classifying with any of the 100+ Document Classifiers
  • Generates NUM-DIMENSIONS * NUM-EMBEDDINGS * NUM-DIMENSION-REDUCTION-ALGOS plots
text= """You can visualize any of the 100 + Sentence Embeddings
with 10+ dimension reduction algorithms
and view the results in 3D, 2D, and 1D  
which can be colored by various classifier labels!
"""
nlu.load('embed_sentence.bert').viz_streamlit_sentence_embed_manifold(text)

function parameters pipe.viz_streamlit_sentence_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(') | List of strings to apply classifiers, embeddings, and manifolds to. |
| text | Optional[str] | 'Billy likes to swim' | Text to predict classes for. |
| sub_title | Optional[str] | "Apply any of the 11 Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Sentence Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA' or 'KernelPCA' |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms |
| show_embed_select | bool | True | Show selector for embedding selection |
| show_color_select | bool | True | Show selector for coloring plots |
| display_embed_information | bool | True | Show additional embedding information like dimension, nlu_reference, spark_nlp_reference, storage_reference, modelhub link and more. |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| additional_classifiers_for_coloring | List[str] | ['sentiment.imdb'] | List of additional NLU references to load for generating hue colors |
| show_model_select | bool | True | Show a model selection dropdown that makes any of the 1000+ models available in 1 click |
| model_select_position | str | 'side' | Whether to output the positions of predictions or not, see pipe.predict(positions=true) for more info |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure |
| n_jobs | Optional[int] | 3 | False |

General Streamlit enhancements

Support for Latent Dirichlet Allocation

The Latent Dirichlet Allocation algorithm is now supported
for the Word Embedding Visualizations and the Sentence Embedding Visualizations

Normalization of Vectors before calculating sentence similarity.

WordEmbedding vectors will now be normalized before calculating similarity scores, which bounds each similarity between 0 and 1

Control order of plots

You can now control the order in which visualizations appear in the main GUI

Sentence Embedding Visualization

Chunk Entity Resolver Training

Chunk Entity Resolver Training Tutorial Notebook
Named Entities are sub-pieces of textual data which are labeled with classes.
These classes and strings are still ambiguous though, and it is not possible to group semantically identical entities without any definition of terminology.
With the Chunk Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a Chunk Resolver on a dataset with columns named y, _y and text, where y is a label, _y is an extra identifier label and text is the raw text.

import pandas as pd
import nlu

dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],
})


trainable_pipe = nlu.load('train.resolve_chunks')
fitted_pipe  = trainable_pipe.fit(dataset)
res = fitted_pipe.predict(dataset)
fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])
| entity_resolution_confidence | entity_resolution_code | entity_resolution | document |
|---|---|---|---|
| '1.0000' | '23' | 'TESLA' | Peter told me to buy Tesla |
| '1.0000' | '23' | 'TESLA' | I have money to loose, is TSLA a good option? |

Train with default glove embeddings

untrained_chunk_resolver = nlu.load('train.resolve_chunks')
trained_chunk_resolver  =  untrained_chunk_resolver.fit(df)
trained_chunk_resolver.predict(df)

Train with custom embeddings

# Use BioVec GloVe embeddings
untrained_chunk_resolver = nlu.load('en.embed.glove.biovec train.resolve_chunks')
trained_chunk_resolver  =  untrained_chunk_resolver.fit(df)
trained_chunk_resolver.predict(df)

Rule based NER with Context Matcher

Rule based NER with context matching tutorial notebook
Define a rule-based NER algorithm by providing Regex Patterns and resolution mappings.
The confidence value is computed using a heuristic approach based on how many matches it has.
A dictionary can be pro...

Read more

2600+ New Models for 200+ Languages and 10+ Dimension Reduction Algorithms for Streamlit Word-Embedding visualizations in 3-D

28 Jun 09:10
48d34e1

We are extremely excited to announce the release of NLU 3.1 !
This is our biggest release so far and it comes with over 2600+ new models in 200+ languages, including DistilBERT, RoBERTa, and XLM-RoBERTa and Huggingface based Embeddings from the incredible Spark-NLP 3.1.0 release,
new Streamlit Visualizations for visualizing Word Embeddings in 3-D, 2-D, and 1-D,
New Healthcare pipelines for healthcare code mappings
and finally confidence extraction for open source NER models.
Additionally, the NLU Namespace has been renamed to the NLU Spellbook, to reflect the magic of the 1-liners it represents!

Streamlit Word Embedding visualization via Manifold and Matrix Decomposition algorithms

function pipe.viz_streamlit_word_embed_manifold

Visualize Word Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 11 Supported methods from Manifold Algorithms
and Matrix Decomposition Algorithms.
Additionally, you can color the lower dimensional points with a label that has been previously assigned to the text by specifying a list of nlu references in the additional_classifiers_for_coloring parameter.

nlu.load('bert',verbose=True).viz_streamlit_word_embed_manifold(default_texts=THE_MATRIX_ARCHITECT_SCRIPT.split('\n'),default_algos_to_apply=['TSNE'],MAX_DISPLAY_NUM=5)

function parameters pipe.viz_streamlit_word_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(') | List of strings to apply classifiers, embeddings, and manifolds to. |
| text | Optional[str] | 'Billy likes to swim' | Text to predict classes for. |
| sub_title | Optional[str] | "Apply any of the 11 Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Word Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA' or 'KernelPCA' |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms |
| show_embed_select | bool | True | Show selector for embedding selection |
| show_color_select | bool | True | Show selector for coloring plots |
| MAX_DISPLAY_NUM | int | 100 | Cap maximum number of tokens displayed |
| display_embed_information | bool | True | Show additional embedding information like dimension, nlu_reference, spark_nlp_reference, storage_reference, modelhub link and more. |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| additional_classifiers_for_coloring | List[str] | ['pos', 'sentiment.imdb'] | List of additional NLU references to load for generating hue colors |
| show_model_select | bool | True | Show a model selection dropdown that makes any of the 1000+ models available in 1 click |
| model_select_position | str | 'side' | Whether to output the positions of predictions or not, see pipe.predict(positions=true) for more info |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure |
| n_jobs | Optional[int] | 3 | False |

Larger Example showcasing more dimension reduction techniques on a larger corpus :

Supported Manifold Algorithms

Supported Matrix Decomposition Algorithms

New Healthcare Pipelines

Five new healthcare code mapping pipelines:

  • nlu.load(en.resolve.icd10cm.umls): This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. You’ll just feed white space-delimited ICD10CM codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping.

{'icd10cm': ['M89.50', 'R82.2', 'R09.01'],'umls': ['C4721411', 'C0159076', 'C0004044']}

  • nlu.load(en.resolve.mesh.umls): This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. You’ll just feed white space-delimited MeSH codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping.

{'mesh': ['C028491', 'D019326', 'C579867'],'umls': ['C0970275', 'C0886627', 'C3696376']}

  • nlu.load(en.resolve.rxnorm.umls): This pretrained pipeline maps RxNorm codes to UMLS codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned with no mapping.

{'rxnorm': ['1161611', '315677', '343663'],'umls': ['C3215948', 'C0984912', 'C1146501']}

  • nlu.load(en.resolve.rxnorm.mesh): This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding MeSH codes as a list. If there is no mapping, the original code is returned with no mapping.

{'rxnorm': ['1191', '6809', '47613'],'mesh': ['D001241', 'D008687', 'D019355']}

  • nlu.load(en.resolve.snomed.umls): This pretrained pipeline maps SNOMED codes to UMLS codes without using any text...
Read more

Streamlit visualizations, improved T5, models for Farsi, Hebrew, Korean, Turkish and UMLS, LOINC, HPO, Resolvers in NLU 3.0.2

03 Jun 02:50
a7b1c0b

This release contains examples and tutorials on how to visualize the 1000+ state-of-the-art NLP models provided by NLU in just 1 line of code in streamlit.
It includes simple 1-liners you can sprinkle into your Streamlit app for features like Dependency Trees, Named Entities (NER), text classification results, semantic similarity,
embedding visualizations via ELMO, BERT, ALBERT, XLNET and much more.
Additionally, improvements for T5 and various resolvers have been added, as well as models for Farsi, Hebrew, Korean and Turkish.

This is the ultimate NLP research tool. You can visualize and compare the results of hundreds of context-aware deep learning embeddings, compare them with classical vanilla embeddings like GloVe,
and see with your own eyes how context is encoded by transformer models like BERT or XLNET and many more!
Besides that, you can also compare the results of the 200+ NER models John Snow Labs provides and see how performance changes with varying embeddings, like Contextual, Static and Domain Specific Embeddings.

Install

For detailed instructions refer to the NLU install documentation here
You need Open JDK 8 installed and the following python packages

pip install nlu streamlit pyspark==3.0.1 sklearn plotly 

Problems? Connect with us on Slack!

Impatient and want some action?

Just run this Streamlit app; you can use it to generate Python code for each NLU-Streamlit building block

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py

Quick Starter cheat sheet - All you need to know in 1 picture for NLU + Streamlit

For NLU models to load, see the NLU Namespace or the John Snow Labs Modelshub or go straight to the source.
NLU Streamlit Cheatsheet

Examples

Just try out any of these.
You can use the first example to generate python-code snippets which you can
recycle as building blocks in your streamlit apps!

Example: 01_dashboard

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/01_dashboard.py

Example: 02_NER

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/02_NER.py

Example: 03_text_similarity_matrix

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/03_text_similarity_matrix.py

Example: 04_dependency_tree

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/04_dependency_tree.py

Example: 05_classifiers

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/05_classifiers.py

Example: 06_token_features

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/06_token_features.py

How to use NLU?

All you need to know about NLU is that there is the nlu.load() method, which returns an NLUPipeline object
whose .predict() method works on the most common data types in the PyData stack, like Pandas DataFrames.
On top of that, an NLUPipeline provides various visualization methods, such as viz(), that easily integrate into Streamlit as re-usable components.
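
A minimal sketch of that pattern; any spell shown on this page can be substituted for 'ner':

import nlu
import pandas as pd

pipe = nlu.load('ner')  # returns an NLUPipeline

# predict() works on strings, lists of strings and Pandas DataFrames alike
print(pipe.predict('Angela Merkel visited New York'))
print(pipe.predict(pd.DataFrame({'text': ['I love NLU and Streamlit!']})))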

Overview of NLU + Streamlit building blocks

| Method | Description |
|---|---|
| nlu.load('<Model>').predict(data) | Load any of the 1000+ models by providing the model name and predict on most Pythonic data structures like Pandas, strings, arrays of strings and more |
| nlu.load('<Model>').viz_streamlit(data) | Display a full NLU exploration dashboard that showcases every feature available, with dropdown selectors for 1000+ models |
| nlu.load('<Model>').viz_streamlit_similarity([string1, string2]) | Display similarity matrix and scalar similarity for every word embedding loaded and 2 strings. |
| nlu.load('<Model>').viz_streamlit_ner(data) | Visualize predicted NER tags from Named Entity Recognizer model |
| nlu.load('<Model>').viz_streamlit_dep_tree(data) | Visualize Dependency Tree together with Part of Speech labels |
| nlu.load('<Model>').viz_streamlit_classes(data) | Display all extracted class features and confidences for every classifier loaded in the pipeline |
| nlu.load('<Model>').viz_streamlit_token(data) | Display all detected token features and information in Streamlit |
| nlu.load('<Model>').viz(data, write_to_streamlit=True) | Display the raw visualization without any UI elements. See viz docs for more info. By default all applicable NLU model references will be shown. |
| nlu.enable_streamlit_caching() | Enable caching of the nlu.load() call. Once enabled, nlu.load() calls will automatically be cached. It is recommended to run this first, for large performance gains |
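
Putting a few of these building blocks together, a hedged sketch of a tiny Streamlit app (the file name my_nlu_app.py is hypothetical; run it with streamlit run my_nlu_app.py):

import nlu
import streamlit as st

# Cache nlu.load() calls so models are not reloaded on every Streamlit rerun
nlu.enable_streamlit_caching()

st.title('NLU + Streamlit demo')
text = st.text_area('Enter some text', 'I love NLU and Streamlit!')

# Render NER predictions as a re-usable Streamlit component
nlu.load('ner').viz_streamlit_ner(text)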

Detailed visualizer information and API docs

function pipe.viz_streamlit

Display a highly configurable UI that showcases almost every feature available for Streamlit visualization with model selection dropdowns in your applications.
This includes:

  • Similarity Matrix & Scalars & Embedding Information for any of the 100+ Word Embedding Models
  • NER visualizations for any of the 200+ Named Entity Recognizers
  • Labeled & Unlabeled Dependency Tree visualizations with Part of Speech tags for any of the 100+ Part of Speech Models
  • Token information predicted by any of the 1000+ models
  • Classification results predicted by any of the 100+ classification models
  • Pipeline Configuration & Model Information & Link to John Snow Labs Modelshub for all loaded pipelines
  • Auto-generated Python code that can be copy-pasted to re-create the individual Streamlit visualization blocks.
    NLU takes the first model specified in nlu.load() for the first visualization run.
    Once the Streamlit app is running, additional models can easily be added via the UI.
    It is recommended to run this first, since you can generate Python code snippets to recreate individual Streamlit visualization blocks
nlu.load('ner').viz_streamlit(['I love NLU and Streamlit!','I hate buggy software'])

NLU Streamlit UI Overview

function parameters pipe.viz_streamlit

| Argument | Type | Default | Description |
|---|---|---|---|
| text | Union[str, List[str], pd.DataFrame, pd.Series] | 'NLU and Streamlit go together like peanutbutter and jelly' | Default text for the Classification, Named Entity Recognizer, Token Information and Dependency Tree visualizations |
| similarity_texts | Union[List[str], Tuple[str,str]] | ('Donald Trump Likes to part', 'Angela Merkel likes to party') | Default texts for the text similarity visualization. Should contain exactly 2 strings, which will be compared token-embedding-wise. For each active embedding, a token-wise similarity matrix and a similarity scalar are shown. |
| model_selection | List[str] | [] | List of nlu references to display in the model selector, see [the NLU Namespace](https://nlu.johnsno... |
Read more