Merge pull request #24 from JohnSnowLabs/1.0.6rc1
1.0.6rc1
C-K-Loan authored Jan 2, 2021
2 parents 73cc744 + 23ba790 commit f533eaa
Showing 61 changed files with 448 additions and 220 deletions.
7 changes: 2 additions & 5 deletions .github/workflows/nlu_test_flow.yaml
@@ -1,14 +1,11 @@
name: NLU Tests

on: [push]


jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        python-version: [3.7]
+        python-version: [3.6]
    steps:
    - uses: actions/setup-java@v1
      with:
@@ -23,7 +20,7 @@ jobs:
      run: |
        python -m pip install --upgrade pip
-        pip install pypandoc sklearn
-        pip install pypandoc wheel nlu pytest modin[ray]
+        pip install wheel nlu pytest modin[ray] pyspark==2.4.7
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: NLU Basic Component tests
      if: always()
2 changes: 1 addition & 1 deletion docs/en/examples.md
@@ -473,7 +473,7 @@ nlu.load('en.classify.toxic').predict('You are to stupid')
{:.table-model-big.mb0}
| toxic_confidence | toxic | sentence_embeddings| document|
|-------------------|---------|------------------------|------------|
-| 0.978273 | [toxic,insult] | [[-0.03398505970835686, 0.0007853527786210179,...,] You are to stupid|
+| 0.978273 | [toxic,insult] | [[-0.03398505970835686, 0.0007853527786210179,...,] | You are to stupid|

</div></div></div><div class="h3-box" markdown="1">

64 changes: 62 additions & 2 deletions docs/en/release_notes.md
@@ -13,6 +13,66 @@ modify_date: "2020-06-12"

<div class="h3-box" markdown="1">

## NLU 1.0.6 Release Notes
### Trainable Multi Label Classifiers, predict Stackoverflow Tags and much more in 1 Line of code with NLU 1.0.6
We are glad to announce NLU 1.0.6 has been released!
NLU 1.0.6 comes with the Multi Label Classifier, which can learn to map strings to multiple labels.
The Multi Label Classifier uses a Bidirectional GRU and CNNs inside TensorFlow and supports up to 100 classes.

### NLU 1.0.6 New Features
- Multi Label Classifier
  - The Multi Label Classifier learns a one-to-many mapping between text and labels. This means it can predict multiple labels at the same time for a given input string. This is very helpful for tasks like content tag prediction (hashtags, Reddit tags, YouTube tags, toxic comment categories, E2E slots, etc.)
  - Supports up to 100 classes
  - Pre-trained Multi Label Classifiers are already available as [Toxic](https://nlu.johnsnowlabs.com/docs/en/examples#toxic-classifier) and [E2E](https://nlu.johnsnowlabs.com/docs/en/examples#e2e-classifier) classifiers

#### Multi Label Classifier
- [Train Multi Label Classifier on E2E dataset Demo](https://colab.research.google.com/drive/15ZqfNUqliRKP4UgaFcRg5KOSTkqrtDXy?usp=sharing)
- [Train Multi Label Classifier on Stack Overflow Question Tags dataset Demo](https://colab.research.google.com/drive/1Y0pYdUMKSs1ZP0NDcKgVECqkKD9ShIdc?usp=sharing)
This model can predict multiple labels for one sentence.
To train the Multi Label text classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
The ```y``` label must be a string column where the labels are separated by a separator character.
By default, ```,``` is assumed as the label separator.
If your dataset uses a different label separator, you must configure the ```label_seperator``` parameter when calling the ```fit()``` method.
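For illustration, a minimal ```train_df``` could be built like this (the column names follow the convention above; the rows are invented):

```python
import pandas as pd

# Each row may carry several labels in one string, separated by ',' (the default separator).
train_df = pd.DataFrame({
    'text': ['You can deploy NLU pipelines to a Spark cluster',
             'How do I parse JSON in Python?'],
    'y':    ['spark,deployment', 'python,json']
})
```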

By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings for training.

```python
fitted_pipe = nlu.load('train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.
```python
# Train on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

Configure a custom label separator
```python
# Use ';' as the label separator
fitted_pipe = nlu.load('embed_sentence.electra train.multi_classifier').fit(train_df, label_seperator=';')
preds = fitted_pipe.predict(train_df)
```


### NLU 1.0.6 Enhancements
- Improved outputs for the Toxic and E2E classifiers.
  - By default, all predicted classes and their confidences above the threshold are returned in a list inside the Pandas dataframe.
  - By configuring ```meta=True```, the confidences for all classes are returned; see the sketch below.
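
A minimal sketch of both output modes, using the ```meta``` flag as named in these notes (the pretrained Toxic reference comes from the examples above):

```python
import nlu

pipe = nlu.load('en.classify.toxic')

# Default output: only the classes above the threshold, with their confidences, in a list column.
preds = pipe.predict('You are to stupid')

# With meta=True, the confidences for all classes are returned.
preds_full = pipe.predict('You are to stupid', meta=True)
```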


### NLU 1.0.6 New Notebooks and Tutorials

- [Train Multi Label Classifier on E2E dataset](https://colab.research.google.com/drive/15ZqfNUqliRKP4UgaFcRg5KOSTkqrtDXy?usp=sharing)
- [Train Multi Label Classifier on Stack Overflow Question Tags dataset](https://drive.google.com/file/d/1Nmrncn-y559od3AKJglwfJ0VmZKjtMAF/view?usp=sharing)

### NLU 1.0.6 Bug-fixes
- Fixed a bug that caused ```en.ner.dl.bert``` to be inaccessible
- Fixed a bug that caused ```pt.ner.large``` to be inaccessible
- Fixed a bug that caused USE embeddings not to be properly configured for document-level output when using multiple embeddings at the same time


## NLU 1.0.5 Release Notes

### Trainable Part of Speech Tagger (POS), Sentiment Classifier with BERT/USE/ELECTRA sentence embeddings in 1 Line of code! Latest NLU Release 1.0.5
@@ -45,13 +105,13 @@ preds = fitted_pipe.predict(train_df)
If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.

```python
-#Train NER on BERT sentence embeddings
+# Train Classifier on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

```python
-#Train NER on ELECTRA sentence embeddings
+# Train Classifier on ELECTRA sentence embeddings
fitted_pipe = nlu.load('embed_sentence.electra train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```
100 changes: 85 additions & 15 deletions docs/en/training.md
@@ -11,29 +11,35 @@ modify_date: "2020-05-08"

<div class="h3-box" markdown="1">

-You can fit load a trainable NLU pipeline via nlu.load('train.<model>') you can
+You can load a trainable NLU pipeline via ```nlu.load('train.<model>')```.

-# Named Entity Recognizer Training. Training
-[NER training demo](https://colab.research.google.com/drive/1_GwhdXULq45GZkw3157fAOx4Wqo-fmFV?usp=sharing)
-You can train your own custom NER model with an [CoNLL 20003 IOB](https://www.aclweb.org/anthology/W03-0419.pdf) formatted dataset.
-By default *Glove 100d Token Embeddings* are used as features for the classifier.
# Binary Text Classifier Training
[Sentiment classification training demo](https://colab.research.google.com/drive/1f-EORjO3IpvwRAktuL4EvZPqPr2IZ_g8?usp=sharing)
To train the Sentiment classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
It uses a deep neural network built in TensorFlow.
By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings.
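For illustration, a minimal ```train_df``` might be built like this (column names as described above; the rows are invented):

```python
import pandas as pd

# Binary sentiment training data: one label per row in the 'y' column.
train_df = pd.DataFrame({
    'text': ['I love this movie', 'This was a waste of time'],
    'y':    ['positive', 'negative']
})
```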

```python
-train_path = '/content/eng.train'
-fitted_pipe = nlu.load('train.ner').fit(dataset_path=train_path)
+fitted_pipe = nlu.load('train.sentiment').fit(train_df)
preds = fitted_pipe.predict(train_df)
```
If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.

-If a NLU reference to a Token Embeddings model is added before the train reference, that Token Embedding will be used when training the NER model.
```python
# Train Classifier on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

```python
-# Train on BERT embeddigns
-train_path = '/content/eng.train'
-fitted_pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)
+# Train Classifier on ELECTRA sentence embeddings
+fitted_pipe = nlu.load('embed_sentence.electra train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

# Multi Class Text Classifier Training
[Multi Class Text Classifier Training Demo](https://colab.research.google.com/drive/12FA2TVvvRWw4pRhxDnK32WAzl9dbF6Qw?usp=sharing)
-To train the Multi Class text classifier model, you must pass a dataframe with a 'text' column and a 'y' column for the label.
+To train the Multi Class text classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings.

```python
@@ -49,9 +55,73 @@ fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

# Multi Label Classifier Training
[Train Multi Label Classifier on E2E dataset](https://colab.research.google.com/drive/15ZqfNUqliRKP4UgaFcRg5KOSTkqrtDXy?usp=sharing)
[Train Multi Label Classifier on Stack Overflow Question Tags dataset](https://drive.google.com/file/d/1Nmrncn-y559od3AKJglwfJ0VmZKjtMAF/view?usp=sharing)
This model can predict multiple labels for one sentence.
It uses a Bidirectional GRU with Convolution model that we have built inside TensorFlow and supports up to 100 classes.
To train the Multi Label text classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
The ```y``` label must be a string column where the labels are separated by a separator character.
By default, ```,``` is assumed as the label separator.
If your dataset uses a different label separator, you must configure the ```label_seperator``` parameter when calling the ```fit()``` method.

By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings for training.

```python
fitted_pipe = nlu.load('train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.
```python
# Train on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

Configure a custom label separator
```python
# Use ';' as the label separator
fitted_pipe = nlu.load('embed_sentence.electra train.multi_classifier').fit(train_df, label_seperator=';')
preds = fitted_pipe.predict(train_df)
```



# Part of Speech (POS) Training
Your dataset must be in the [Universal Dependencies](https://universaldependencies.org/) format.
You must configure the ```dataset_path``` in the ```fit()``` method to point to the Universal Dependencies dataset you wish to train on.
You can configure the delimiter via the ```label_seperator``` parameter.
[POS training demo](https://colab.research.google.com/drive/1CZqHQmrxkDf7y3rQHVjO-97tCnpUXu_3?usp=sharing)

```python
train_path = '/content/pos_corpus.txt'  # placeholder path to your POS training file
fitted_pipe = nlu.load('train.pos').fit(dataset_path=train_path, label_seperator='_')
preds = fitted_pipe.predict(train_df)
```
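
Assuming the token_tag layout implied by ```label_seperator='_'```, a training file might look like this (invented excerpt):

```
A_DT few_JJ weeks_NNS ago_RB we_PRP released_VBD NLU_NNP ._.
Training_NN a_DT tagger_NN is_VBZ easy_JJ ._.
```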



# Named Entity Recognizer (NER) Training
[NER training demo](https://colab.research.google.com/drive/1_GwhdXULq45GZkw3157fAOx4Wqo-fmFV?usp=sharing)
You can train your own custom NER model with a [CoNLL 2003 IOB](https://www.aclweb.org/anthology/W03-0419.pdf) formatted dataset.
By default *Glove 100d Token Embeddings* are used as features for the classifier.
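
For reference, a CoNLL 2003 file begins like this: one token per line with its POS tag, chunk tag, and IOB entity tag, and blank lines separating sentences:

```
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-PP O
boycott VB B-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
```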

```python
train_path = '/content/eng.train'
fitted_pipe = nlu.load('train.ner').fit(dataset_path=train_path)
```

If an NLU reference to a Token Embeddings model is added before the train reference, that Token Embedding will be used when training the NER model.

```python
# Train on BERT embeddings
train_path = '/content/eng.train'
fitted_pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)
```



-## Saving a NLU pipelien to disk
+# Saving an NLU pipeline to disk

```python
train_path = '/content/eng.train'
@@ -61,7 +131,7 @@ fitted_pipe.save(stored_model_path)

```

-## Loading a NLU pipeline from disk
+# Loading an NLU pipeline from disk

```python
train_path = '/content/eng.train'
@@ -73,7 +143,7 @@ hdd_pipe = nlu.load(path=stored_model_path)



-## Loading a NLU pipeline as pyspark.ml.PipelineModel
+# Loading an NLU pipeline as pyspark.ml.PipelineModel
```python
import pyspark
# load the NLU pipeline as pyspark pipeline

Large diffs are not rendered by default.

