We are very excited to announce that NLU 3.1.1 has been released!
It features a new Sentence Embedding visualization component for Streamlit which supports all 10+ previous dimension
reduction techniques. Additionally, all embedding visualizations now support Latent Dirichlet Allocation for dimension reduction.
Finally, this release adds 2 new trainable models for NER and chunk resolution, a new drug normalizer algorithm,
20+ new pre-trained models, including Multi-Lingual, German, and various healthcare models,
and improved NER defaults when using licensed models that have NER dependencies.

Streamlit Sentence Embedding visualization via Manifold and Matrix Decomposition algorithms

`function` `pipe.viz_streamlit_sentence_embed_manifold`

Visualize Sentence Embeddings in 1-D, 2-D, or 3-D by reducing their dimensionality with any of the 12 supported Manifold
and Matrix Decomposition algorithms.
Additionally, you can color the lower-dimensional points with a label that has been previously assigned to the text by specifying a list of NLU references in the `additional_classifiers_for_coloring` parameter.
You can also select additional classifiers via the GUI.

Reduces the dimensionality of high-dimensional Sentence Embeddings to 1-D, 2-D, or 3-D and plots the resulting data in an interactive Plotly plot
Applicable with any of the 100+ Sentence Embedding models
Color points by classifying with any of the 100+ Document Classifiers
Generates NUM-DIMENSIONS * NUM-EMBEDDINGS * NUM-DIMENSION-REDUCTION-ALGOS plots

import nlu
text = """You can visualize any of the 100 + Sentence Embeddings
with 10+ dimension reduction algorithms
and view the results in 3D, 2D, and 1D  
which can be colored by various classifier labels!
"""
nlu.load('embed_sentence.bert').viz_streamlit_sentence_embed_manifold(text)
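
The plot count stated above is simply the product of the three selections. A minimal sketch of that arithmetic in plain Python, using the documented defaults for target dimensions and algorithms; the embedding list is an assumption for illustration, and `plot_combinations` is a hypothetical helper that is not part of NLU:

```python
from itertools import product

def plot_combinations(dimensions, embeddings, algos):
    """Enumerate every (dimension, embedding, algorithm) triple, one per plot."""
    return list(product(dimensions, embeddings, algos))

# Documented defaults for dimensions and algorithms; the embedding list is illustrative.
dimensions = (1, 2, 3)
embeddings = ["embed_sentence.bert"]
algos = ["TSNE", "PCA"]

combos = plot_combinations(dimensions, embeddings, algos)
print(len(combos))  # 3 dimensions * 1 embedding * 2 algorithms = 6 plots
```

Selecting more embeddings or algorithms in the GUI multiplies the number of plots accordingly, so rendering time can grow quickly.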

`function parameters` `pipe.viz_streamlit_sentence_embed_manifold`

Argument	Type	Default	Description
`default_texts`	`List[str]`	("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(')	List of strings to apply classifiers, embeddings, and manifolds to.
`text`	`Optional[str]`	`'Billy likes to swim'`	Text to predict classes for.
`sub_title`	`Optional[str]`	"Apply any of the 11 `Manifold` or `Matrix Decomposition` algorithms to reduce the dimensionality of `Sentence Embeddings` to `1-D`, `2-D` and `3-D` "	Sub title of the Streamlit app
`default_algos_to_apply`	`List[str]`	`["TSNE", "PCA"]`	A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of `'TSNE'`, `'ISOMAP'`, `'LLE'`, `'Spectral Embedding'`, `'MDS'`, `'PCA'`, `'SVD aka LSA'`, `'DictionaryLearning'`, `'FactorAnalysis'`, `'FastICA'` or `'KernelPCA'`.
`target_dimensions`	`List[int]`	`(1,2,3)`	Defines the target dimension embeddings will be reduced to
`show_algo_select`	`bool`	`True`	Show selector for Manifold and Matrix Decomposition Algorithms
`show_embed_select`	`bool`	`True`	Show selector for Embedding Selection
`show_color_select`	`bool`	`True`	Show selector for coloring plots
`display_embed_information`	`bool`	`True`	Show additional embedding information like `dimension`, `nlu_reference`, `spark_nlp_reference`, `storage_reference`, `modelhub link` and more.
`set_wide_layout_CSS`	`bool`	`True`	Whether to inject custom CSS or not.
`num_cols`	`int`	`2`	How many columns to use for the layout in Streamlit when rendering the plots.
`key`	`str`	`"NLU_streamlit"`	Key for the Streamlit elements drawn
`additional_classifiers_for_coloring`	`List[str]`	`['sentiment.imdb']`	List of additional NLU references to load for generating hue colors
`show_model_select`	`bool`	`True`	Show a model selection dropdown that makes any of the 1000+ models available in 1 click
`model_select_position`	`str`	`'side'`	Where to show the model selection dropdown, either `'side'` or `'main'`
`show_logo`	`bool`	`True`	Show logo
`display_infos`	`bool`	`False`	Display additional information about ISO codes and the NLU namespace structure.
`n_jobs`	`Optional[int]`	`3`	How many cores to use for parallelizing when using Sklearn dimension reduction algorithms.
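
As a rough illustration of how the parameters above fit together, the sketch below collects a few of them into keyword arguments and rejects algorithm names that are not in the table. `build_viz_kwargs` is a hypothetical helper, not part of NLU, and the algorithm set only mirrors the names listed for `default_algos_to_apply` above, so it may not cover every algorithm the library accepts:

```python
# Algorithm names as listed for `default_algos_to_apply` in the table above.
SUPPORTED_ALGOS = {
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS", "PCA",
    "SVD aka LSA", "DictionaryLearning", "FactorAnalysis", "FastICA", "KernelPCA",
}

def build_viz_kwargs(algos, dimensions=(1, 2, 3), num_cols=2):
    """Hypothetical helper: validate the selections and return keyword arguments."""
    unknown = [a for a in algos if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithms: {unknown}")
    bad_dims = [d for d in dimensions if d not in (1, 2, 3)]
    if bad_dims:
        raise ValueError(f"Target dimensions must be 1, 2 or 3, got: {bad_dims}")
    return {
        "default_algos_to_apply": list(algos),
        "target_dimensions": list(dimensions),
        "num_cols": num_cols,
    }

kwargs = build_viz_kwargs(["PCA", "ISOMAP"], dimensions=(2, 3))
# Assumed usage inside a Streamlit app, with `pipe` loaded via nlu.load(...):
# pipe.viz_streamlit_sentence_embed_manifold(default_texts=["some text"], **kwargs)
```

Validating names up front gives a clearer error than a failure deep inside the app, which is the only reason for the extra helper.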

General Streamlit enhancements

Support for Latent Dirichlet Allocation

The Latent Dirichlet Allocation algorithm is now supported
for the Word Embedding Visualizations and the Sentence Embedding Visualizations.

Normalization of Vectors before calculating sentence similarity.

Word Embedding vectors will now be normalized before calculating similarity scores, which bounds each similarity between 0 and 1.
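
To illustrate the normalization step, here is a minimal pure-Python sketch. It is not NLU's implementation: it only shows the general idea that each vector is scaled to unit length, after which a plain dot product gives the cosine similarity. Both function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def similarity(a, b):
    """Dot product of the normalized vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(similarity([3.0, 4.0], [6.0, 8.0]))  # parallel vectors, same direction
print(similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because the vectors are scaled first, the score no longer depends on vector length, only on direction.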

Control order of plots

You can now control the order in which visualizations appear in the main GUI.

Sentence Embedding Visualization

Chunk Entity Resolver Training

Chunk Entity Resolver Training Tutorial Notebook
Named Entities are sub-pieces of textual data which are labeled with classes.
These classes and strings are still ambiguous though, and it is not possible to group semantically identical entities without a definition of the terminology.
With the Chunk Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a chunk resolver on a dataset with columns named `y`, `_y` and `text`. `y` is a label, `_y` is an extra identifier label, and `text` is the raw text.

import nlu
import pandas as pd
dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],
})


trainable_pipe = nlu.load('train.resolve_chunks')
fitted_pipe  = trainable_pipe.fit(dataset)
res = fitted_pipe.predict(dataset)
fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])

entity_resolution_confidence	entity_resolution_code	entity_resolution	document
'1.0000'	'23'	'TESLA'	Peter told me to buy Tesla
'1.0000'	'23'	'TESLA'	I have money to loose, is TSLA a good option?

Train with default glove embeddings

# df is a pandas DataFrame with the columns text, y and _y, like `dataset` above
untrained_chunk_resolver = nlu.load('train.resolve_chunks')
trained_chunk_resolver  =  untrained_chunk_resolver.fit(df)
trained_chunk_resolver.predict(df)

Train with custom embeddings

# Use Bio GloVe embeddings
untrained_chunk_resolver = nlu.load('en.embed.glove.biovec train.resolve_chunks')
trained_chunk_resolver  =  untrained_chunk_resolver.fit(df)
trained_chunk_resolver.predict(df)

Rule based NER with Context Matcher

Rule based NER with context matching tutorial notebook
Define a rule-based NER algorithm by providing Regex Patterns and resolution mappings.
The confidence value is computed using a heuristic approach based on how many matches it has.
A dictionary can be provided with `setDictionary` to map extracted entities to a unified representation. The first column of the dictionary file should be the representation, and the following columns should be the possible matches.

import nlu
import json
# Define helper functions to write NER rules to file
def dump_dict_to_json_file(rules, path):
    """Generate a JSON file with the rule dict at the target path."""
    with open(path, 'w') as f:
        json.dump(rules, f)

def dump_file_to_csv(data, path):
    """Dump raw text to a file."""
    with open(path, 'w') as f:
        f.write(data)

sample_text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting. Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia . The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L . β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours . Twenty days ago. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . At birth the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about seven months, and then the girl grows faster until four years. From then until adolescence no differences in velocity can be detected. 21-02-2020 21/04/2020 """

# Define Gender NER matching rules
gender_rules = {
    "entity": "Gender",
    "ruleScope": "sentence",
    "completeMatchRegex": "true"    }

# Define dict data in csv format
gender_data = '''male,man,male,boy,gentleman,he,him
female,woman,female,girl,lady,old-lady,she,her
neutral,neutral'''

# Dump configs to file 
dump_file_to_csv(gender_data, 'gender.csv')
dump_dict_to_json_file(gender_rules, 'gender.json')
gender_NER_pipe = nlu.load('match.context')
gender_NER_pipe.print_info()
gender_NER_pipe['context_matcher'].setJsonPath('gender.json')
gender_NER_pipe['context_matcher'].setDictionary('gender.csv', options={"delimiter":","})
gender_NER_pipe.predict(sample_text)

context_match	context_match_confidence
female	0.13
she	0.13
she	0.13
she	0.13
she	0.13
boy	0.13
girl	0.13
girl	0.13

Context Matcher Parameters

You can define the following parameters in your rules.json file to define the entities to be matched

Parameter	Type	Description
entity	`str`	The name of this rule
regex	`Optional[str]`	Regex Pattern to extract candidates
contextLength	`Optional[int]`	Defines the maximum distance that prefix and suffix words can be away from the word to match, whereas context words must be immediately before or after the word to match
prefix	`Optional[List[str]]`	Words preceding the regex match that are at most `contextLength` characters away
regexPrefix	`Optional[str]`	Regex pattern of words preceding the regex match that are at most `contextLength` characters away
suffix	`Optional[List[str]]`	Words following the regex match that are at most `contextLength` characters away
regexSuffix	`Optional[str]`	Regex pattern of words following the regex match that are at most `contextLength` characters away
context	`Optional[List[str]]`	List of words that must be immediately before/after a match
contextException	`Optional[List[str]]`	List of words that may not be immediately before/after a match
exceptionDistance	`Optional[int]`	Distance exceptions must be away from a match
regexContextException	`Optional[str]`	Regex pattern of exceptions that may not be within `exceptionDistance` range of the match
matchScope	`Optional[str]`	Either `token`, or `sub-token` to match on a character basis
completeMatchRegex	`Optional[str]`	Whether to use complete or partial matching, either `"true"` or `"false"`
ruleScope	`str`	Currently only `sentence` is supported
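
As a sketch of how a regex-based rule could be written with these parameters, the snippet below builds a rule dict for date-like strings and serializes it to JSON. The key names come from the table above; the entity name and regex value are assumptions chosen for illustration, and the pattern is checked here with Python's `re` module only, which may behave differently from the matcher's own regex engine.

```python
import json
import re

# Illustrative rule: key names from the parameter table, values are assumptions.
date_rules = {
    "entity": "Date",
    "ruleScope": "sentence",
    "regex": r"\d{1,2}[-/]\d{1,2}[-/]\d{4}",
    "matchScope": "token",
    "completeMatchRegex": "true",
}

# Serialize the rule as it would be written to a rules JSON file.
serialized = json.dumps(date_rules)
print(serialized)

# Sanity-check the pattern against the two dates at the end of the sample text.
print(re.findall(date_rules["regex"], "21-02-2020 21/04/2020"))
```

Optional keys such as `prefix`, `suffix` or `contextLength` can be added to the same dict in the same way.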

Drug Normalizer

Drug Normalizer tutorial notebook

Normalize raw text from clinical documents, e.g. scraped web pages or XML documents. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters according to a specific policy. Can apply lower case normalization.

Parameters are

lowercase: whether to convert strings to lowercase. Default is False.
policy: rule to remove patterns from text. Valid policy values are: all, abbreviations, dosages.
Default is all. The abbreviations policy is used to expand common drug abbreviations; the dosages policy is used to convert drug dosages and values to the standard form (see the examples below).

data = ["Agnogenic one half cup","adalimumab 54.5 + 43.2 gm","aspirin 10 meq/ 5 ml oral sol","interferon alfa-2b 10 million unit ( 1 ml ) injec","Sodium Chloride/Potassium Chloride 13bag"]
nlu.load('norm_drugs').predict(data)

drug_norm	text
Agnogenic 0.5 oral solution	Agnogenic one half cup
adalimumab 97700 mg	adalimumab 54.5 + 43.2 gm
aspirin 2 meq/ml oral solution	aspirin 10 meq/ 5 ml oral sol
interferon alfa - 2b 10000000 unt ( 1 ml ) injection	interferon alfa-2b 10 million unit ( 1 ml ) injec
Sodium Chloride / Potassium Chloride 13 bag	Sodium Chloride/Potassium Chloride 13bag
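
Two of the dosage conversions in the table above can be verified by hand. The snippet below only reproduces that arithmetic in plain Python; it says nothing about how the normalizer itself computes the values.

```python
from decimal import Decimal

# "54.5 + 43.2 gm": add the amounts, then convert grams to milligrams.
total_mg = (Decimal("54.5") + Decimal("43.2")) * 1000
print(int(total_mg))  # 97700

# "10 meq/ 5 ml": divide to get the concentration per millilitre.
meq_per_ml = Decimal("10") / Decimal("5")
print(meq_per_ml)  # 2
```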

New NLU Spells

These new magical 1-liners load the following new models

Open Source NLU Spells

NLU Spell	Spark NLP Model
nlu.load('de.ner.wikiner.6B_100')	wikiner_6B_100
nlu.load('xx.embed.glove.glove_6B_100')	glove_6B_100

Healthcare NLU spells

NLU Spell	Spark NLP Model
nlu.load('en.resolve.snomed_body_structure_med')	sbertresolve_snomed_bodyStructure_med
nlu.load('en.resolve.snomed_body_structure')	sbiobertresolve_snomed_bodyStructure
nlu.load('en.resolve.icdo_augmented')	sbiobertresolve_icdo_augmented
nlu.load('en.embed_sentence.biobert.jsl_cased')	sbiobert_jsl_cased
nlu.load('en.embed_sentence.biobert.jsl_umls_cased')	sbiobert_jsl_umls_cased
nlu.load('en.embed_sentence.bert.jsl_medium_uncased')	sbert_jsl_medium_uncased
nlu.load('en.embed_sentence.bert.jsl_medium_umls_uncased')	sbert_jsl_medium_umls_uncased
nlu.load('en.embed_sentence.bert.jsl_mini_uncased')	sbert_jsl_mini_uncased
nlu.load('en.embed_sentence.bert.jsl_mini_umlsuncased')	sbert_jsl_mini_umls_uncased
nlu.load('en.embed_sentence.bert.jsl_tiny_uncased')	sbert_jsl_tiny_uncased
nlu.load('en.embed_sentence.bert.jsl_tiny_umls_uncased')	sbert_jsl_tiny_umls_uncased
nlu.load('en.resolve.icd10cm.slim_billable_hcc')	sbiobertresolve_icd10cm_slim_billable_hcc
nlu.load('en.resolve.icd10cm.slim_billable_hcc_med')	sbertresolve_icd10cm_slim_billable_hcc_med
nlu.load('med_ner.deid.generic_augmented')	ner_deid_generic_augmented
nlu.load('med_ner.deid.subentity_augmented')	ner_deid_subentity_augmented
nlu.load('en.assert.radiology')	assertion_dl_radiology
nlu.load('en.relation.test_result_date')	re_test_result_date
nlu.load('en.med_ner.admission_events')	ner_events_admission_clinical
nlu.load('en.classify.ade.clinicalbert')	classifierdl_ade_clinicalbert
nlu.load('en.recognize_entities.posology')	recognize_entities_posology
nlu.load('en.embed_sentence.bluebert_cased_mli')	spark_name

Improved NER defaults

When loading licensed models that require an NER feature, like Assertion, Relation, or Resolution,
NLU will now use the en.med_ner model, which maps to the Spark NLP model jsl_ner_wip_clinical, as default.
See https://nlp.johnsnowlabs.com/2021/03/31/jsl_ner_wip_clinical_en.html for more information on this model.

New Notebooks

Additional NLU resources

140+ NLU Tutorials
Streamlit visualizations docs
The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
Spark NLP publications
NLU in Action
NLU documentation
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP and NLU!

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark==3.0.3

Sentence Embedding Visualizations, 20+ New Models, 2 New Trainable Models, Drug Normalizer and more in John Snow Labs NLU 3.1.1
