MiniCat is short for Mini Text Categorizer.
The goals of this tool are to:
- Serve as a simple, interactive interface to categorize text documents into up to 10 custom categories using Google's Natural Language API and Google Cloud Machine Learning Engine.
- Demonstrate how Cloud ML Engine and the Natural Language API can improve performance/accuracy and provide an end-to-end solution for your ML needs.
- Serve as a template for your own end-to-end text classification workflows using Google Cloud Platform APIs.
It is recommended, but not required, to use a virtual environment. Installing the dependencies in a new virtual environment allows you to run the sample without changing global Python packages on your system.
There are two options for the virtual environments:

- Virtualenv
  - Install virtualenv
  - Create the virtual environment: `virtualenv MiniCat-env`
  - Activate env: `source MiniCat-env/bin/activate`
- Miniconda
  - Install Miniconda
  - Create the conda environment: `conda create --name MiniCat-env python=2.7`
  - Activate env: `source activate MiniCat-env`
Python 2.7 is required. Install the dependencies:

```
pip install -r requirements.txt
```
Set up a Google Cloud project and enable the following APIs:
- Google Natural Language API
- Google Cloud Machine Learning Engine
Then create a Google Cloud Storage bucket. This is where all your model and training-related data will be stored. For more information, check out the tutorials in the documentation pages.
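If you prefer to script the bucket creation instead of using the Cloud Console or gsutil, a minimal sketch using the `google-cloud-storage` Python client might look like this (the project ID and bucket name are placeholders):

```python
from google.cloud import storage

# Assumes application-default credentials are already set up.
client = storage.Client(project='your-project-id')  # placeholder project ID

# Create the bucket that will hold the models and training data.
bucket = client.create_bucket('your-minicat-bucket')  # placeholder bucket name
print('Created bucket: {}'.format(bucket.name))
```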
A simple terminal-based tool that allows document labeling for training, as well as label curation.
```
python main.py label --data_csv_file <filename.csv> \
                     --local_working_dir <MiniCat/data>
```
- `data_csv_file`: path to your CSV, which should contain these 3 column headers:
  - `file_path`: full file path of where the text is to be read from
  - `text`: text for the data point (only one of `file_path` or `text` is required)
  - `labels`: the class which the text belongs to (can be empty)
- `local_working_dir`: where all the different CSV versions of your data and the prediction results will be located.
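For illustration, here is a minimal sketch that writes a valid `data_csv_file` with the three required headers; the file names and labels are hypothetical:

```python
import csv

# Hypothetical rows: one reads its text from a file, one inlines the text.
rows = [
    {'file_path': '/home/user/docs/doc1.txt', 'text': '', 'labels': 'CategoryA'},
    {'file_path': '', 'text': 'Some short document text.', 'labels': ''},
]

# 'wb' because the Python 2 csv module expects binary mode.
with open('data.csv', 'wb') as f:
    writer = csv.DictWriter(f, fieldnames=['file_path', 'text', 'labels'])
    writer.writeheader()
    writer.writerows(rows)
```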
Use the NL API and ML Engine to train a classifier using the text and labels prepared by the labeler.
```
python main.py train --local_working_dir <MiniCat/data> \
                     --version <version_number> \
                     --gcs_working_dir <gs://bucket_name/file_path> \
                     --vocab_size <number> \
                     --region <us-central1> \
                     --scale_tier
```
- `local_working_dir`: directory where all the CSV version files are located.
- `version`: version number of the CSV to be used for training.
- `gcs_working_dir`: path to your Google Cloud Storage directory to use for training and for storing the models and dataset (of the form `gs://bucket_name/some_path`).
- `vocab_size`: size of the vocabulary to use for training. (Default: 20000)
- `region`: region where training should occur. Ideally, set this to the same region where your Google Cloud Storage bucket is located. (Default: `us-central1`)
- `scale_tier`: pass this flag to train with GPUs. The scale tier will be set to `BASIC_GPU`.
This tool can be used to classify different types of text data, such as emails, support tickets, movie reviews, news topics, etc.
Let's consider the case of emails.
Create a working directory `emails` in your home directory.
As an example, export your emails from Gmail into a mailbox file, then post-process it into the following CSV format (a conversion sketch follows the tables below).
Create a spreadsheet similar to:

| . | file_path | text | labels |
|---|---|---|---|
| 1 | ~/emails/file1.txt | | Important |
| 2 | ~/emails/file2.txt | | Unimportant |
| 3 | ~/emails/file3.txt | | |
| 4 | ~/emails/file4.txt | | Important |
In this example each email's text is in a file. There are some seed labels that can be used to partially label the set of emails.
The spreadsheet can also be in this format:

| . | file_path | text | labels |
|---|---|---|---|
| 1 | | You just won a prize for $5000 ... | Unimportant |
| 2 | | Your friend Alice tagged you in ... | Important |
| 3 | | Call #0000 and get a free iPhone ... | |
| 4 | | Signup today for holiday packages... | Important |
Note: You could also use a mix of both `text` and `file_path` in the spreadsheet.
Create the spreadsheet according to your requirements and save it in the working directory `emails` under the name `emails.csv`.
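As a rough illustration of the mailbox post-processing step mentioned above, here is a sketch using Python's standard `mailbox` module. The mbox path, the output location, and the decision to skip multipart messages are assumptions for this example, not part of MiniCat:

```python
import csv
import mailbox

# Placeholder path to a Gmail Takeout-style mbox export.
mbox = mailbox.mbox('/home/user/Downloads/inbox.mbox')

# 'wb' because the Python 2 csv module expects binary mode.
with open('/home/user/emails/emails.csv', 'wb') as f:
    writer = csv.DictWriter(f, fieldnames=['file_path', 'text', 'labels'])
    writer.writeheader()
    for msg in mbox:
        if msg.is_multipart():
            continue  # skip multipart messages in this sketch
        body = msg.get_payload(decode=True) or ''
        text = ' '.join(body.split())  # flatten newlines and extra spaces
        if text:
            writer.writerow({'file_path': '', 'text': text, 'labels': ''})
```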
Make sure Python 2.7 is installed. Follow the commands in the Virtual Environments Setup section. Fork the git repository and, from inside the directory, run `pip install -r requirements.txt`.
Create a Google Cloud Platform project and set up billing and credentials. For info on how to do that, see steps 1, 2, 4, 5 and 6 on this page.
Set up the APIs by following the setup mentioned above.
Create a Google Cloud Storage bucket `emails` and then create a directory under it called `working_dir`.
From the git-repo directory, run the following command:

```
python main.py label --data_csv_file ~/emails/emails.csv \
                     --local_working_dir ~/emails/
```
First, the tool will ask you to select a set of target labels:
```
Automatically detected labels :
Important
Unimportant
Enter a new label or enter 'd' for done :
```
Then the tool will allow you to label the text:
```
Id Label
0  Important
1  Unimportant

Call #0000 and get a free Pixel today. Select between all google phones........

Enter the Label id ('d' for done, 's' to skip) : 1
```
The labeling workflow will continue until you have labeled all the unlabeled text or you type 'd'.
The tool should exit at the end saying a new version 1 was created.
From the git-repo directory, run the following command:

```
python main.py train --local_working_dir ~/emails/ \
                     --version 1 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier
```
Note: Don't use the `scale_tier` flag if you do not want to use a GPU while training.
This will start training on the version 1 labels file which was created using the labeler tool. The tool will output a URL which can be used to view the job's progress. Wait for the job to finish and the results to be displayed.
There should be a file in `~/emails/v1/predictions.csv` that will contain the predicted labels and prediction confidence for all your data points.
At this point, if the results are unsatisfactory, label some more examples. The predictions in `~/emails/v1/predictions.csv` can be used to help in labeling the new version of labels.
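One practical way to pick which examples to relabel is to review the low-confidence predictions first. A minimal sketch, assuming `predictions.csv` has columns named `predicted_label`, `confidence`, and `text` (the actual column names may differ):

```python
import csv

# 'rb' because the Python 2 csv module expects binary mode.
with open('/home/user/emails/v1/predictions.csv', 'rb') as f:
    reader = csv.DictReader(f)
    # 'confidence' is a hypothetical column name; 0.6 is an arbitrary cutoff.
    uncertain = [row for row in reader if float(row['confidence']) < 0.6]

# Review the least certain examples first when preparing version 2 labels.
for row in uncertain[:10]:
    print('{}\t{}'.format(row['predicted_label'], row.get('text', '')[:60]))
```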
Run the command below to start labeling again:
```
python main.py label --data_csv_file ~/emails/v1/predictions.csv \
                     --local_working_dir ~/emails/
```
Note: We run the labeler on the `predictions.csv` file from version 1.
This will lead to the same labeling process. After labeling some more examples, call the trainer module:
```
python main.py train --local_working_dir ~/emails/ \
                     --version 2 \
                     --gcs_working_dir gs://emails/working_dir \
                     --scale_tier
```
Repeat the same process if the results are still unsatisfactory.
If you are still not satisfied with the training results, here are some things you could do:

- Run the model for more epochs by changing the `num_epochs` value in `params.json` (see the sketch after this list).
- If you have a lot of training data (say > 20000 examples), you could increase the number of hyper-parameters in `params.json`.
- Provide more training examples for the labels that are performing badly.
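For example, a minimal sketch of bumping `num_epochs` programmatically; the value 20 is an arbitrary example, and `num_epochs` is the only key here taken from this README:

```python
import json

# Load the existing training parameters.
with open('params.json') as f:
    params = json.load(f)

# 20 is a hypothetical value; tune it for your data.
params['num_epochs'] = 20

with open('params.json', 'w') as f:
    json.dump(params, f, indent=2)
```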
A few errors that might commonly occur, and their possible solutions:
- `google.cloud.exceptions.TooManyRequests`: This error is due to the tool making too many requests too quickly. Add some sort of throttling, like `time.sleep(0.1)`, before making the NL API requests in `trainer.py` (see the sketch after this list).
- `The provided GCS paths [] cannot be read by service account $srvacct`: This error occurs when `$srvacct` doesn't have write permissions to the GCS bucket. Run the following command to set the ACL permissions:
```
gsutil defacl ch -u $SVCACCT:O gs://$BUCKET/
```
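For the throttling fix above, a minimal sketch of a wrapper you could apply inside `trainer.py`; the wrapper name and the place it is applied are assumptions:

```python
import time

def throttled(fn, delay=0.1):
    """Return a version of fn that sleeps briefly before each call."""
    def wrapper(*args, **kwargs):
        time.sleep(delay)  # space out requests to stay under the API quota
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical usage in trainer.py, where nl_request makes one NL API call:
# nl_request = throttled(nl_request)
```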
This is not an official Google product.