This tool was created by Chreston Miller, a researcher at Virginia Tech, to identify words with flavor-descriptive meanings in English-language food reviews and other texts. It is based on the architecture described by Intuition Engineering in the article "Deep learning for specific information extraction from unstructured texts". The results on whiskey reviews have been published in "Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning". Significant changes to streamline and standardize preprocessing have been made by Leah Hamilton, a fellow author and the current maintainer of this repository. Email Chreston Miller if you want code that identically reproduces the results of the Foods paper.
Programmed in Python using the deep learning backend `keras`, this package includes: a set of whiskey reviews from Whiskycast, WhiskyAdvocate, The Whiskey Jug, and Breaking Bourbon with manually-annotated examples of flavor-descriptive and flavor-nondescriptive adjectives and nouns; a trained neural network that predicts the descriptive or nondescriptive status of a token in context; and code that uses the trained network to tag every token in the example dataset as descriptive or nondescriptive.
When you download this package using `pip`, it should automatically download the dependencies:

```
keras>=2.3
numpy
pandas
tensorflow>2
```
If you'd like to enable GPU acceleration (recommended for (re)training), you are encouraged to download your preferred versions of tensorflow (via `pip`) and CUDA (via the NVIDIA website) before installing this package.
```
pip install flavorfetcher
```
If you are downloading this package from the source (e.g., acquired via GitHub), you'll have to replace `flavorfetcher` with a pointer to the directory this `README.md` is in, which will be specific to you. One possible example:

```
pip install C:/Users/YOURNAME/Documents/flavor-fetcher
```
Once you have installed this package via `pip`, you will be able to run the code examples below in any Python interpreter or IDE by importing the relevant functions.
Additionally, to run this code, you will need to manually download the 300-dimension GloVe embeddings from the Stanford NLP website (`glove.42B.300d.txt`). If you're new to programming, we recommend saving this large file in the same place as your data files so you'll know where to find it.
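If you're comfortable with the command line, one way to fetch and unpack the embeddings is shown below; the URL reflects where Stanford NLP hosted the file at the time of writing and may change, so check their website if it fails:

```
curl -LO https://nlp.stanford.edu/data/glove.42B.300d.zip
unzip glove.42B.300d.zip
```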
This code is intended to replicate the results from "Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning" or to allow users to identify flavor descriptors in their own datasets using the pre-trained model.
This neural network was trained using text that had been tokenized by the R package cleanNLP, a "tidy" implementation of SpaCy, specifically using the language model `en_core_web_sm`.
If you have your data in the format created by cleanNLP's implementation of SpaCy, it should already be formatted appropriately, i.e., with the following columns (a few sample rows are shown after this list):

- `token` - The token as a string, lowercase and with white space removed but otherwise as it appears in the text.
- `upos` - The part of speech, based on the Universal Part of Speech bank.
- `doc_id` - A unique number corresponding to the document that the token was in.
- `sid` - A number indicating which sentence the token was in within its document.
- `tid` - A number indicating a token's position within its sentence.
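For illustration, here are a few made-up rows in the expected format:

```
token,upos,doc_id,sid,tid
the,DET,1,1,1
nose,NOUN,1,1,2
is,AUX,1,1,3
smoky,ADJ,1,1,4
```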
You can use any English-language tokenizer & part-of-speech tagger to prepare your data (although results may vary based on tokenization differences) so long as you then format your data into a `.csv` file with these columns. If your data has these columns but not these exact names, you can provide this tool with a special column specification, as shown below.
You can make predictions in Python by importing the package and calling the function `file_to_file_workflow()` from the module `flavorfetcher.makepredictions`. The simplest example (using the provided example data, model, etc.) is:
```python
from flavorfetcher.makepredictions import file_to_file_workflow as do_predict
do_predict(glove_file = "glove.42B.300d.txt")
```
Because of its size, the GloVe embedding file is not included with this package and you must always specify its location. All other files are optional.
By default, this package will pull the model, settings, and data files from the ones included in the package, and will save the output as `tokenwise_lstm_predictions_YYYYMMDD-HHMMSS.csv` in the current working directory. If you're not sure where that is, you can always specify the output location yourself. For most use-cases, you will also have to specify your own data, so a more common usage may look like:
```python
from flavorfetcher.makepredictions import file_to_file_workflow as do_predict
do_predict(reviews_file = "example_data.csv", glove_file = "glove.42B.300d.txt", output_file = "predicted_descriptors.csv")
```
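After the run finishes, you can inspect the output with pandas. The column names below (`prediction` and `token`) are assumptions for the sake of illustration; check the header of your own output file for the exact names:

```python
import pandas as pd

# Load the predictions written by file_to_file_workflow().
preds = pd.read_csv("predicted_descriptors.csv")

# "prediction" and "token" are assumed column names -- check your
# output file's header for the exact names.
descriptors = preds.loc[preds["prediction"] == 1, "token"].unique()
print(sorted(descriptors))
```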
In practice, you will likely need to change the above file names to reflect your specific use-case and to point to where the files are located in your filesystem, unless they're all in the current working directory.
The script, by default, looks for the exact column names mentioned above. All files need to have these columns, but if your `.csv` file has different column names, you can specify them as follows:

```python
mycolnames = {"token" : "customtoken", "upos" : "customuposid", "doc_id" : "customdocid", "sid" : "customsentenceid", "tid" : "customtokenid"}
file_to_file_workflow("example_data.csv", "glove.42B.300d.txt", "model.h5", "model.json", "predicted_descriptors.csv", mycolnames)
```
You can also change only a few of the default column names by importing `flavorfetcher.datamunging.default_col_names`:

```python
from flavorfetcher.datamunging import default_col_names as mycolnames
mycolnames['tid'] = "token_id"
mycolnames['sid'] = "sentence_id"
file_to_file_workflow("example_data.csv", "glove.42B.300d.txt", "model.h5", "model.json", "predicted_descriptors.csv", mycolnames)
```
Coming soon
R and RStudio users should be able to use the R package `reticulate` to call this package's functions from within an R script, but this functionality has not been tested and should be considered a recommendation for advanced users only. A full tutorial for R users is planned for the future.
The `descriptor_model` module contains a definition of the `DescriptorNN` class, whose methods will allow for limited training, but anyone looking to change the architecture or fine-tune the model should look into implementing a model of the same/similar structure in `keras` itself.
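If you just want a feel for what that would involve, here is a minimal sketch of a network with a broadly similar shape (a single LSTM over GloVe-embedded context windows). The layer width and window size are placeholder assumptions, not the values used for the published model:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Placeholder values -- read the real ones from the model's .json settings file.
CONTEXT_WINDOW_SIZE = 100
GLOVE_EMBEDDING_DIMENSIONS = 300

# Each sample is a window of tokens, each token represented by its GloVe vector.
model = Sequential([
    LSTM(64, input_shape=(CONTEXT_WINDOW_SIZE, GLOVE_EMBEDDING_DIMENSIONS)),
    Dense(1, activation="sigmoid"),  # probability that the token is flavor-descriptive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```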
If you use this tool, the best ways to give back are to cite our paper and to let us know how it's doing! We are actively looking to improve this tool.
To let us know how it's doing, take around 20 sentences from your dataset (or more), tokenize them, and put them into a tidy format. Add a column with the "ground truth": `0` for most of the tokens and `1` for any token you think is a word describing a flavor. Run this little dataset through our tool and send the results to Leah Hamilton with a short explanation of your data/project. We may contact you about your data if the tool is performing poorly on it, to help make it better for everyone. Even if this isn't possible, we appreciate you sending a little report one way or another!
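If you'd like to see for yourself how well the tool agrees with your labels, a quick check might look like the sketch below. The column names `ground_truth` and `prediction` are hypothetical, and it assumes both files keep the `doc_id`/`sid`/`tid` columns:

```python
import pandas as pd

# "ground_truth" and "prediction" are hypothetical column names --
# adjust them to match your annotated file and the tool's output.
truth = pd.read_csv("my_annotated_sample.csv")
preds = pd.read_csv("predicted_descriptors.csv")

# Align tokens by position, assuming both files keep doc_id/sid/tid.
merged = truth.merge(preds, on=["doc_id", "sid", "tid"])
agreement = (merged["ground_truth"] == merged["prediction"]).mean()
print(f"Agreement with your labels: {agreement:.1%}")
```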
This repository only works on English text. If you are interested in identifying descriptors in another language and speak that language, please reach out to Leah Hamilton about a potential collaboration.
This package requires python 3.7 or newer. If you don't know how python works, how to install it, or how to run python code, you should start by reading and following the installation tutorial for Windows or Mac computers and the tutorial on using python interactively. We advise against installing python as part of `anaconda`, since it will download much more than you actually need.
If you are having issues installing this package, or if you get any error messages related to `keras`, `tensorflow`, `pip`, or `setuptools` (especially if you're using a Mac), you may find it easier to install and run this package inside of a virtual environment. Note that some `tensorflow` "errors" are more like warnings: if the code continues to run, it's probably fine!
A python virtual environment is a quarantined installation of python that gives you more control over exactly which versions of which packages are installed, without interfering with or trying to use other python installations on your system.
First, decide where you'll save your virtual environment. This could be in your documents folder, or wherever you keep your code. Whatever is convenient and memorable for you. Navigate there in your command line of choice.
To create the virtual environment here and install the packages needed, run the following (Windows syntax shown):

```
python -m venv flavor-fetcher-test
flavor-fetcher-test\Scripts\activate
python -m pip install flavorfetcher
```
This only needs to be run once per system, when you first set up your virtual environment. You can change the name of the virtual environment (`flavor-fetcher-test`) if you want; you will just have to make the corresponding change in the line that activates the virtual environment as well.
From here, once the virtual environment is activated (you'll know because the command line will show the name of the virtual environment in the prompt wherever you're typing), you can start a python session using `python` and then follow the instructions above to actually run the package.
Any time you want to use the virtual environment, run `flavor-fetcher-test\Scripts\activate` again, and when you are done, deactivate it by typing `deactivate`.
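Putting the pieces together, a typical Windows session (with placeholder file names) might look like:

```
cd C:\Users\YOURNAME\Documents
flavor-fetcher-test\Scripts\activate
python
>>> from flavorfetcher.makepredictions import file_to_file_workflow as do_predict
>>> do_predict(reviews_file = "example_data.csv", glove_file = "glove.42B.300d.txt")
>>> exit()
deactivate
```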
On a middle-of-the-road computer, expect about 5 minutes for initial setup and then 20-50 ms/token to make the predictions.
If you have a particularly large dataset, you might consider splitting it into multiple files, as the program does not save until it's finished and you will lose all prediction progress on an interruption.
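One way to do the splitting, as a sketch (file names are placeholders), is to split on `doc_id` with pandas so that documents stay whole:

```python
import pandas as pd

# File names are placeholders. Splitting on doc_id keeps each document
# whole, so no token loses its surrounding context.
data = pd.read_csv("big_dataset.csv")
doc_ids = data["doc_id"].unique()
docs_per_file = 50  # tune to taste

for i in range(0, len(doc_ids), docs_per_file):
    chunk = data[data["doc_id"].isin(doc_ids[i:i + docs_per_file])]
    chunk.to_csv(f"big_dataset_part{i // docs_per_file}.csv", index=False)
```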
If you have a computer with an NVIDIA GPU, you can also set up `tensorflow` to use GPU acceleration. To do this, you should install an appropriate version of CUDA and a GPU-compatible version of `tensorflow` yourself using `pip`, as described in the tensorflow documentation, rather than letting this package install it as a dependency. If you are using a laptop and are not sure whether you have an NVIDIA GPU, you probably don't.
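Once set up, a quick check to confirm that `tensorflow` can actually see your GPU:

```python
import tensorflow as tf

# An empty list means tensorflow cannot see a GPU and will fall back to the CPU.
print(tf.config.list_physical_devices("GPU"))
```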
- `README.md` - This file.
- `LICENSE` - The GPL v3.0 license this software is provided under.
- `MANIFEST.in` & `pyproject.toml` - Files required for `pip` or another builder to make an installable version of this code. Do not move or change these.
- `src/makepredictions.py` - A python module with the function `file_to_file_workflow()`, which can be given the file locations of a `.csv` reviews file, a `.h5` model file with the appropriate architecture, the 300-dimension GloVe embeddings (`glove.42B.300d.txt`), a settings file (a `.json` with the `CONTEXT_WINDOW_SIZE` and `GLOVE_EMBEDDING_DIMENSIONS` used to train the model), and an output file (a `.csv` that should not currently exist). If the column names are not identical to those in the example file `src/data/Example Whiskey Reviews.csv`, they will have to be specified as a dictionary using the last argument, as shown in the usage examples. This file can also be run as a script from the command line, which will by default reproduce the results of "Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning" but, with the appropriate arguments, can use a pre-trained model to predict descriptors in any tokenized and POS-tagged dataset.
- `src/datamunging.py` - A python module with many helper functions used to remove punctuation, reindex tokens for `keras`, convert tokens to GloVe embeddings, and set each token's context window. Primarily called by `src/makepredictions.py`.
- `src/descriptor_model.py` - A python script defining the `DescriptorNN` class, which holds a (trained) descriptor-predictor model and is used to make predictions.
- `src/data/100_2021-04-20 10_38_24.203316_3_epochs_re-run_model.h5` - The exact network trained and used in the paper "Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning". Necessary to make predictions using `src/makepredictions.py`.
- `src/data/100_2021-04-20 10_38_24.203316_3_epochs_re-run_model.json` - A settings file that should be saved and distributed alongside the trained model in order to specify the `CONTEXT_WINDOW_SIZE` and `GLOVE_EMBEDDING_DIMENSIONS` used to train the model, so predictions can be made on appropriately-shaped data.
- `src/data/Example Whiskey Reviews.csv` - A data file containing one row per token in 100 selected reviews from the Whiskycast and WhiskyAdvocate websites, used for prediction. Tokenization and annotation were done using SpaCy through cleanNLP. Only the fields `upos`, `token`, `doc_id`, `sid`, and `tid` are used. It also contains the predictions as calculated by the provided LSTM on our machine (`expected_prediction`), so you can confirm the package is working as intended.
If you use this package in any published work, please cite the associated publication:
Miller, C.; Hamilton, L.; Lahne, J. 2021. Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning. Foods 10(7), 1633. https://doi.org/10.3390/foods10071633