by: Pratyush Singh
Create bi/tri-grams out of column data. The bi/trigram are created for each row within the data. This application is different than other use-cases where bi/trigrams are created for the entire text as a whole. In this case, a bi/tri-gram is created for each row within the data after pre-processesing. The list of bi/tri-grams returned are useful for visualization of your data through WordClouds or other word frequency visualizations.
- python 3.*
- pip3
pip3 install -r requirements.txt
!pip install -r https://raw.githubusercontent.com/pratyushsingh97/Bigrams/master/requirements.txt
!curl -O https://raw.githubusercontent.com/pratyushsingh97/Bigrams/master/bigrams.py
Simply pass the column of text within your DataFrame to the runner()
function. The runner()
function returns a list of n-grams.
bigrams.py
already filters out for common stopwords; however, based on your data you may want to add additional filtering. You can pass a list of words that you would like to filter, and bigrams.py
adds these words to the list of stopwords.
import pandas as pd
dummy_data = pd.DataFrame({"input_text": ["I like pie",
"I like to eat",
"I like to watch movies on the weekends"]})
bigrams = runner(dummy_data['input_text'], stopwords=["like"]) # add "like" to the list of stopwords
print(bigrams) # prints a list of three bigrams (one for each row)
Make sure to follow the installation steps for Jupyter Notebooks.
import bigrams
import pandas as pd
dummy_data = pd.DataFrame({"input_text": ["I like pie",
"I like to eat",
"I like to watch movies on the weekends"]})
bigrams = bigrams.runner(dummy_data['input_text'], stopwords=["like"]) # add "like" to the list of stopwords
print(bigrams) # prints a list of three bigrams (one for each row)
bigrams.py
utilizes the nltk library to score each bi/tri-gram created for each input text. The highest rated bi/tri-gram is returned. If no bi/tr-grams exist within the data, then the original text is returned.bigrams.py
lemmatizes the words in the input text, so similar phrases will lead to the same bigram. For example "I am eating pie" and "I eat pie" result in the same bigram "eat_pie".- This function only works on
pandas.core.series.Series
objects right now. In most cases, this is equivalent to the column within your pandas DataFrame you wish to analyze (i.e.dummy_data['input_text']
).