John Snow Labs NLU 1.1.1 : New multilingual models, Spark 2.3 support, new tutorials and more!
NLU 1.1.1 Release Notes
We are very excited to release NLU 1.1.1!
This release features 3 new tutorial notebooks covering open- and closed-book question answering with Google's T5, intent classification, and aspect-based NER.
In addition, NLU 1.1.1 comes with 25+ pre-trained models and pipelines in Amharic, Bengali, Bhojpuri, Japanese, and Korean from the amazing Spark NLP 2.7.2 release.
Finally, NLU now supports running on Spark 2.3 clusters.
NLU 1.1.1 New Non-English Models
Language | nlu.load() reference | Spark NLP Model reference | Type |
Arabic | ar.ner | arabic_w2v_cc_300d | Named Entity Recognizer |
Arabic | ar.embed.aner | aner_cc_300d | Word Embedding |
Arabic | ar.embed.aner.300d | aner_cc_300d | Word Embedding (Alias) |
Bengali | bn.stopwords | stopwords_bn | Stopwords Cleaner |
Bengali | bn.pos | pos_msri | Part of Speech |
Thai | th.segment_words | wordseg_best | Word Segmenter |
Thai | th.pos | pos_lst20 | Part of Speech |
Thai | th.sentiment | sentiment_jager_use | Sentiment Classifier |
Thai | th.classify.sentiment | sentiment_jager_use | Sentiment Classifier (Alias) |
Chinese | zh.pos.ud_gsd_trad | pos_ud_gsd_trad | Part of Speech |
Chinese | zh.segment_words.gsd | wordseg_gsd_ud_trad | Word Segmenter |
Bihari | bh.pos | pos_ud_bhtb | Part of Speech |
Amharic | am.pos | pos_ud_att | Part of Speech |
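The references in the table above can be collected into a small lookup, which makes it easy to see which `nlu.load()` names are listed for a given language. This is only a minimal sketch: the dictionary `NEW_NON_ENGLISH_REFERENCES` and the helper `references_for` are hypothetical names introduced here for illustration and are not part of NLU. Only the reference strings themselves come from the table.

```python
# Hypothetical lookup built from the table above; not part of the NLU API.
NEW_NON_ENGLISH_REFERENCES = {
    "Arabic": ["ar.ner", "ar.embed.aner", "ar.embed.aner.300d"],
    "Bengali": ["bn.stopwords", "bn.pos"],
    "Thai": ["th.segment_words", "th.pos", "th.sentiment", "th.classify.sentiment"],
    "Chinese": ["zh.pos.ud_gsd_trad", "zh.segment_words.gsd"],
    "Bihari": ["bh.pos"],
    "Amharic": ["am.pos"],
}


def references_for(language):
    """Return the nlu.load() references listed for a language (empty list if none)."""
    return list(NEW_NON_ENGLISH_REFERENCES.get(language, []))


if __name__ == "__main__":
    for ref in references_for("Thai"):
        # Each of these strings is a reference that can be passed to nlu.load().
        print(ref)
```

Any string returned by the helper can be used in the same way as the 1-liner examples in these notes, for example `nlu.load("th.pos")`.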
NLU 1.1.1 New English Models and Pipelines
New Easy NLU 1-liner Examples:
Extract aspects and entities from airline questions (ATIS dataset)
nlu.load("en.ner.atis").predict("i want to fly from baltimore to dallas round trip")
output: ["baltimore", "dallas", "round trip"]
Intent Classification for Airline Traffic Information System queries (ATIS dataset)
nlu.load("en.classify.questions.atis").predict("what is the price of flight from newyork to washington")
output: "atis_airfare"
Recognize Entities OntoNotes - ELECTRA Large
nlu.load("en.ner.onto.large").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London.")
output: ["Johnson", "first", "2001", "eight years", "London"]
Question classification of open-domain and fact-based questions Pipeline - TREC50
nlu.load("en.classify.trec50.pipe").predict("When did the construction of stone circles begin in the UK? ")
output: LOC_other
Traditional Chinese Word Segmentation
# 'However, this treatment also creates some problems' in Chinese
output: ["然而",",","這樣","的","處理","也","衍生","了","一些","問題","。"]
Part of Speech for Traditional Chinese
# 'However, this treatment also creates some problems' in Chinese
Token | POS |
然而 | ADV |
, | PUNCT |
這樣 | PRON |
的 | PART |
處理 | NOUN |
也 | ADV |
衍生 | VERB |
了 | PART |
一些 | ADJ |
問題 | NOUN |
。 | PUNCT |
Thai Word Segment Recognition
# 'Mona Lisa is a 16th-century oil painting created by Leonardo held at the Louvre in Paris' in Thai
nlu.load("th.segment_words").predict("Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส")
token |
M |
o |
n |
a |
Lisa |
เป็น |
ภาพ |
ว |
า |
ด |
สีน้ำ |
มัน |
ใน |
ศตวรรษ |
ที่ |
16 |
ที่ |
สร้าง |
โ |
ด |
ย |
L |
e |
o |
n |
a |
r |
d |
o |
จัด |
ขึ้น |
ที่ |
พิพิธภัณฑ์ |
ลูฟร์ |
ใน |
ปารีส |
Part of Speech for Bengali (POS)
# 'The village is also called 'Mod' in Tora language' in Bengali
nlu.load("bn.pos").predict("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷")
token | pos |
বাসস্থান-ঘরগৃহস্থালি | NN |
তোড়া | NNP |
ভাষায় | NN |
গ্রামকেও | NN |
বলে | VM |
` | SYM |
মোদ | NN |
' | SYM |
৷ | SYM |
Stop Words Cleaner for Bengali
# 'This language is not enough' in Bengali
df = nlu.load("bn.stopwords").predict("এই ভাষা যথেষ্ট নয়")
cleanTokens | token |
ভাষা | এই |
যথেষ্ট | ভাষা |
নয় | যথেষ্ট |
None | নয় |
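Conceptually, a stop words cleaner drops every token that appears in a stop word list. The sketch below illustrates that idea in plain Python. It is not how Spark NLP implements `stopwords_bn`, and the one-word stop list is an assumption inferred from the output above, where only এই is removed. The names `ASSUMED_STOPWORDS` and `remove_stopwords` are hypothetical.

```python
# Plain-Python illustration of stop word removal; not the Spark NLP implementation.
# ASSUMPTION: the stop list holds only the single word removed in the output above.
ASSUMED_STOPWORDS = {"এই"}


def remove_stopwords(tokens, stopwords=ASSUMED_STOPWORDS):
    """Keep the tokens that are not in the stop word set, preserving their order."""
    return [token for token in tokens if token not in stopwords]


# Whitespace splitting stands in for the tokenizer in this sketch.
tokens = "এই ভাষা যথেষ্ট নয়".split()
clean_tokens = remove_stopwords(tokens)
print(clean_tokens)
```

Because one token is removed, the cleaned list is one item shorter than the token list, which is presumably why the cleanTokens column in the table above ends with None.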
Part of Speech for Bhojpuri (Bihari)
# 'The people of Ohu know that the foundation of Bhojpuri was shaken' in Bhojpuri
nlu.load('bh.pos').predict("ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई")
pos | token |
DET | ओहु |
NOUN | लोग |
ADP | के |
NOUN | मालूम |
VERB | बा |
SCONJ | कि |
ADJ | श्लील |
VERB | होखते |
PROPN | भोजपुरी |
ADP | के |
NOUN | नींव |
VERB | हिल |
AUX | जाई |
Amharic Part of Speech (POS)
# ' "Son, finish the job," he said.' in Amharic
nlu.load('am.pos').predict('ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"')
pos | token |
NOUN | ልጅ |
DET | ኡ |
PART | ን |
NOUN | ሥራ |
DET | ው |
PART | ን |
VERB | አስጨርስ |
PRON | ኧው |
AUX | ኣል |
PRON | ኧሁ |
PUNCT | ። |
NOUN | " |
Thai Sentiment Classification
# 'I love peanut butter and jelly!' in Thai
sentiment | sentiment_confidence |
positive | 0.999998 |
Arabic Named Entity Recognition (NER)
# 'In 1918, the forces of the Arab Revolt liberated Damascus with the help of the British' in Arabic
nlu.load('ar.ner').predict('في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز',output_level='chunk')[['entities_confidence','ner_confidence','entities']]
entity_class | ner_confidence | entities |
ORG | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | قوات الثورة العربية |
LOC | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | دمشق |
PER | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | الإنكليز |
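In the output above, `ner_confidence` holds one value per token of the sentence (11 values for 11 tokens) rather than one value per entity, so every row repeats the same list. A minimal sketch of summarizing such a list in plain Python follows. The helper name `summarize_confidence` is hypothetical, and the values are copied from the table.

```python
# Values copied from the ner_confidence column above (one per token).
ner_confidence = [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063,
                  0.9987999796867371, 0.9990000128746033, 0.9998999834060669,
                  0.9998999834060669, 0.9993000030517578, 0.9998999834060669]


def summarize_confidence(scores):
    """Return (lowest, mean) confidence for a list of per-token scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return min(scores), sum(scores) / len(scores)


lowest, mean = summarize_confidence(ner_confidence)
print(f"lowest={lowest:.4f} mean={mean:.4f}")
```

The lowest value is a quick way to spot the token the model was least sure about before relying on the extracted entities.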
NLU 1.1.1 Enhancements:
- Spark 2.3 compatibility
New NLU Notebooks and Tutorials
- Open and Closed book question Answering
- Aspect based NER for Airline ATIS
- Intent Classification for Airline messages ATIS
# Install NLU from PyPI
!pip install nlu pyspark==2.4.7
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu
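After installing, it can be useful to confirm that both packages are visible to the Python interpreter before loading any model. The following is a minimal sketch using only the standard library. The helper name `installed_packages` is hypothetical and is not part of NLU.

```python
import importlib.util


def installed_packages(names=("nlu", "pyspark")):
    """Map each package name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Prints a dict of package name to True/False for the current environment.
print(installed_packages())
```

A `False` entry suggests that the corresponding install command above should be run again in the active environment.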