Merge pull request #24 from JohnSnowLabs/1.0.6rc1
1.0.6rc1
C-K-Loan authored Jan 2, 2021
2 parents 73cc744 + 23ba790 commit f533eaa
Showing 61 changed files with 448 additions and 220 deletions.
7 changes: 2 additions & 5 deletions .github/workflows/nlu_test_flow.yaml
@@ -1,14 +1,11 @@
name: NLU Tests

on: [push]


jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        python-version: [3.7]
+        python-version: [3.6]
    steps:
    - uses: actions/setup-java@v1
      with:
@@ -23,7 +20,7 @@ jobs:
      run: |
        python -m pip install --upgrade pip
-        pip install pypandoc sklearn
-        pip install pypandoc wheel nlu pytest modin[ray]
+        pip install wheel nlu pytest modin[ray] pyspark==2.4.7
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: NLU Basic Component tests
      if: always()
2 changes: 1 addition & 1 deletion docs/en/examples.md
@@ -473,7 +473,7 @@ nlu.load('en.classify.toxic').predict('You are to stupid')
{:.table-model-big.mb0}
| toxic_confidence | toxic | sentence_embeddings| document|
|-------------------|---------|------------------------|------------|
-| 0.978273 | [toxic,insult] | [[-0.03398505970835686, 0.0007853527786210179,...,] You are to stupid|
+| 0.978273 | [toxic,insult] | [[-0.03398505970835686, 0.0007853527786210179,...,] | You are to stupid|

</div></div></div><div class="h3-box" markdown="1">

64 changes: 62 additions & 2 deletions docs/en/release_notes.md
@@ -13,6 +13,66 @@ modify_date: "2020-06-12"

<div class="h3-box" markdown="1">

## NLU 1.0.6 Release Notes
### Trainable Multi Label Classifiers, predict Stackoverflow Tags and much more in 1 Line of code with NLU 1.0.6
We are glad to announce NLU 1.0.6 has been released!
NLU 1.0.6 comes with the Multi Label Classifier, which can learn to map strings to multiple labels.
The Multi Label Classifier uses a Bidirectional GRU and CNNs inside TensorFlow and supports up to 100 classes.

### NLU 1.0.6 New Features
- Multi Label Classifier
  - The Multi Label Classifier learns a one-to-many mapping between text and labels. This means it can predict multiple labels at the same time for a given input string. This is very helpful for tasks like content tag prediction (hashtags, Reddit tags, YouTube tags, toxic comment categories, E2E slots, etc.)
  - Supports up to 100 classes
  - Pre-trained Multi Label Classifiers are already available as [Toxic](https://nlu.johnsnowlabs.com/docs/en/examples#toxic-classifier) and [E2E](https://nlu.johnsnowlabs.com/docs/en/examples#e2e-classifier) classifiers

#### Multi Label Classifier
- [Train Multi Label Classifier on E2E dataset Demo](https://colab.research.google.com/drive/15ZqfNUqliRKP4UgaFcRg5KOSTkqrtDXy?usp=sharing)
- [Train Multi Label Classifier on Stack Overflow Question Tags dataset Demo](https://colab.research.google.com/drive/1Y0pYdUMKSs1ZP0NDcKgVECqkKD9ShIdc?usp=sharing)
This model can predict multiple labels for one sentence.
To train the Multi Label text classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
The ```y``` label must be a string column where the labels are separated by a separator character.
By default, ```,``` is assumed as the label separator.
If your dataset uses a different label separator, you must configure the ```label_seperator``` parameter when calling the ```fit()``` method.
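For illustration, a minimal ```train_df``` could be built like this (the column names follow the convention above; the rows are invented):

```python
import pandas as pd

# Each row may carry several labels in one string, separated by ',' (the default separator).
train_df = pd.DataFrame({
    'text': ['You can deploy NLU pipelines to a Spark cluster',
             'How do I parse JSON in Python?'],
    'y':    ['spark,deployment', 'python,json']
})
```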

By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings for training.

```python
fitted_pipe = nlu.load('train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.
```python
# Train on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

Configure a custom label separator
```python
# Use ';' as the label separator
fitted_pipe = nlu.load('embed_sentence.electra train.multi_classifier').fit(train_df, label_seperator=';')
preds = fitted_pipe.predict(train_df)
```


### NLU 1.0.6 Enhancements
- Improved outputs for the Toxic and E2E classifiers.
  - By default, all predicted classes and their confidences above the threshold are returned in a list inside the Pandas dataframe.
  - By configuring ```meta=True```, the confidences for all classes are returned; see the sketch below.
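
A minimal sketch of both output modes, using the ```meta``` flag as named in these notes (the pretrained Toxic reference comes from the examples above):

```python
import nlu

pipe = nlu.load('en.classify.toxic')

# Default output: only the classes above the threshold, with their confidences, in a list column.
preds = pipe.predict('You are to stupid')

# With meta=True, the confidences for all classes are returned.
preds_full = pipe.predict('You are to stupid', meta=True)
```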


### NLU 1.0.6 New Notebooks and Tutorials

- [Train Multi Label Classifier on E2E dataset](https://colab.research.google.com/drive/15ZqfNUqliRKP4UgaFcRg5KOSTkqrtDXy?usp=sharing)
- [Train Multi Label Classifier on Stack Overflow Question Tags dataset](https://drive.google.com/file/d/1Nmrncn-y559od3AKJglwfJ0VmZKjtMAF/view?usp=sharing)

### NLU 1.0.6 Bug-fixes
- Fixed a bug that caused ```en.ner.dl.bert``` to be inaccessible
- Fixed a bug that caused ```pt.ner.large``` to be inaccessible
- Fixed a bug that caused USE embeddings not to be properly configured for document-level output when using multiple embeddings at the same time


## NLU 1.0.5 Release Notes

### Trainable Part of Speech Tagger (POS), Sentiment Classifier with BERT/USE/ELECTRA sentence embeddings in 1 Line of code! Latest NLU Release 1.0.5
@@ -45,13 +105,13 @@ preds = fitted_pipe.predict(train_df)
If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.

```python
-#Train NER on BERT sentence embeddings
+# Train Classifier on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

```python
-#Train NER on ELECTRA sentence embeddings
+# Train Classifier on ELECTRA sentence embeddings
fitted_pipe = nlu.load('embed_sentence.electra train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```
100 changes: 85 additions & 15 deletions docs/en/training.md
@@ -11,29 +11,35 @@ modify_date: "2020-05-08"

<div class="h3-box" markdown="1">

-You can fit load a trainable NLU pipeline via nlu.load('train.<model>') you can
+You can load a trainable NLU pipeline via ```nlu.load('train.<model>')```.

-# Named Entity Recognizer Training. Training
-[NER training demo](https://colab.research.google.com/drive/1_GwhdXULq45GZkw3157fAOx4Wqo-fmFV?usp=sharing)
-You can train your own custom NER model with an [CoNLL 20003 IOB](https://www.aclweb.org/anthology/W03-0419.pdf) formatted dataset.
-By default *Glove 100d Token Embeddings* are used as features for the classifier.
# Binary Text Classifier Training
[Sentiment classification training demo](https://colab.research.google.com/drive/1f-EORjO3IpvwRAktuL4EvZPqPr2IZ_g8?usp=sharing)
To train the Sentiment classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
It uses a deep neural network built in TensorFlow.
By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings.
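For illustration, a minimal ```train_df``` might be built like this (column names as described above; the rows are invented):

```python
import pandas as pd

# Binary sentiment training data: one label per row in the 'y' column.
train_df = pd.DataFrame({
    'text': ['I love this movie', 'This was a waste of time'],
    'y':    ['positive', 'negative']
})
```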

```python
-train_path = '/content/eng.train'
-fitted_pipe = nlu.load('train.ner').fit(dataset_path=train_path)
+fitted_pipe = nlu.load('train.sentiment').fit(train_df)
preds = fitted_pipe.predict(train_df)
```
If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.

-If a NLU reference to a Token Embeddings model is added before the train reference, that Token Embedding will be used when training the NER model.
```python
# Train Classifier on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

```python
-# Train on BERT embeddigns
-train_path = '/content/eng.train'
-fitted_pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)
+# Train Classifier on ELECTRA sentence embeddings
+fitted_pipe = nlu.load('embed_sentence.electra train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

# Multi Class Text Classifier Training
[Multi Class Text Classifier Training Demo](https://colab.research.google.com/drive/12FA2TVvvRWw4pRhxDnK32WAzl9dbF6Qw?usp=sharing)
-To train the Multi Class text classifier model, you must pass a dataframe with a 'text' column and a 'y' column for the label.
+To train the Multi Class text classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings.

```python
@@ -49,9 +55,73 @@ fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

# Multi Label Classifier Training
[Train Multi Label Classifier on E2E dataset](https://colab.research.google.com/drive/15ZqfNUqliRKP4UgaFcRg5KOSTkqrtDXy?usp=sharing)
[Train Multi Label Classifier on Stack Overflow Question Tags dataset](https://drive.google.com/file/d/1Nmrncn-y559od3AKJglwfJ0VmZKjtMAF/view?usp=sharing)
This model can predict multiple labels for one sentence.
It uses a Bidirectional GRU with Convolution model that we have built inside TensorFlow and supports up to 100 classes.
To train the Multi Label text classifier model, you must pass a dataframe with a ```text``` column and a ```y``` column for the label.
The ```y``` label must be a string column where the labels are separated by a separator character.
By default, ```,``` is assumed as the label separator.
If your dataset uses a different label separator, you must configure the ```label_seperator``` parameter when calling the ```fit()``` method.

By default *Universal Sentence Encoder Embeddings (USE)* are used as sentence embeddings for training.

```python
fitted_pipe = nlu.load('train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

If you add an NLU sentence embeddings reference before the train reference, NLU will use those sentence embeddings instead of the default USE.
```python
# Train on BERT sentence embeddings
fitted_pipe = nlu.load('embed_sentence.bert train.multi_classifier').fit(train_df)
preds = fitted_pipe.predict(train_df)
```

Configure a custom label separator
```python
# Use ';' as the label separator
fitted_pipe = nlu.load('embed_sentence.electra train.multi_classifier').fit(train_df, label_seperator=';')
preds = fitted_pipe.predict(train_df)
```



# Part of Speech (POS) Training
Your dataset must be in the [Universal Dependencies](https://universaldependencies.org/) format.
You must configure the ```dataset_path``` in the ```fit()``` method to point to the Universal Dependencies dataset you wish to train on.
You can configure the delimiter via the ```label_seperator``` parameter.
[POS training demo](https://colab.research.google.com/drive/1CZqHQmrxkDf7y3rQHVjO-97tCnpUXu_3?usp=sharing)

```python
train_path = '/content/pos_corpus.txt'  # placeholder path to your POS training file
fitted_pipe = nlu.load('train.pos').fit(dataset_path=train_path, label_seperator='_')
preds = fitted_pipe.predict(train_df)
```
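
Assuming the token_tag layout implied by ```label_seperator='_'```, a training file might look like this (invented excerpt):

```
A_DT few_JJ weeks_NNS ago_RB we_PRP released_VBD NLU_NNP ._.
Training_NN a_DT tagger_NN is_VBZ easy_JJ ._.
```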



# Named Entity Recognizer (NER) Training
[NER training demo](https://colab.research.google.com/drive/1_GwhdXULq45GZkw3157fAOx4Wqo-fmFV?usp=sharing)
You can train your own custom NER model with a [CoNLL 2003 IOB](https://www.aclweb.org/anthology/W03-0419.pdf) formatted dataset.
By default *Glove 100d Token Embeddings* are used as features for the classifier.
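
For reference, a CoNLL 2003 file begins like this: one token per line with its POS tag, chunk tag, and IOB entity tag, and blank lines separating sentences:

```
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-PP O
boycott VB B-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
```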

```python
train_path = '/content/eng.train'
fitted_pipe = nlu.load('train.ner').fit(dataset_path=train_path)
```

If an NLU reference to a Token Embeddings model is added before the train reference, that Token Embedding will be used when training the NER model.

```python
# Train on BERT embeddings
train_path = '/content/eng.train'
fitted_pipe = nlu.load('bert train.ner').fit(dataset_path=train_path)
```



-## Saving a NLU pipelien to disk
+# Saving an NLU pipeline to disk

```python
train_path = '/content/eng.train'
@@ -61,7 +131,7 @@ fitted_pipe.save(stored_model_path)

```

-## Loading a NLU pipeline from disk
+# Loading an NLU pipeline from disk

```python
train_path = '/content/eng.train'
@@ -73,7 +143,7 @@ hdd_pipe = nlu.load(path=stored_model_path)



-## Loading a NLU pipeline as pyspark.ml.PipelineModel
+# Loading an NLU pipeline as pyspark.ml.PipelineModel
```python
import pyspark
# load the NLU pipeline as pyspark pipeline

Large diffs are not rendered by default.

