Semantic Textual Similarity Pipeline

A sentence-pair semantic similarity pipeline written in Python. It combines natural language processing (spaCy and pywsd), machine and deep learning (scikit-learn and TensorFlow Hub models; Cer et al., 2018), and heuristic approaches (Pawar and Mago, 2018) into an ensemble model that scores the semantic similarity of two English sentences on an integer scale from 1 (not very similar) to 5 (highly similar).

Input is a tab-delimited corpus of unscored sentence pairs; output is a tab-delimited table of scores. See the dev-set for a sample input and the dev-set predictions for a sample output.
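The input and output shapes described above can be sketched in a few lines. This is an illustration only, not the pipeline's own reader: the column names `id`, `sentence1`, `sentence2` and `score`, the presence of a header row, and the `score_corpus` helper are all assumptions made for the example, so check the dev-set for the real layout.

```python
import csv
import io


def score_corpus(in_handle, out_handle, scorer):
    """Read tab-delimited sentence pairs and write tab-delimited scores.

    Column names are hypothetical; `scorer` is any callable that takes two
    sentences and returns a number.
    """
    reader = csv.DictReader(in_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "score"])
    for row in reader:
        raw = scorer(row["sentence1"], row["sentence2"])
        # Round and clamp to the integer 1-5 scale used by the pipeline.
        writer.writerow([row["id"], max(1, min(5, int(round(raw))))])


# Usage with an in-memory corpus and a placeholder scorer.
sample = io.StringIO("id\tsentence1\tsentence2\ns_1\tA man eats.\tA man is eating.\n")
result = io.StringIO()
score_corpus(sample, result, scorer=lambda a, b: 4.2)
print(result.getvalue())
```

The placeholder lambda stands in for the ensemble model; any real scorer would be plugged in the same way.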

For base and ensemble model metrics and error analysis, see the corresponding notebook.

Pipeline Architecture

Prerequisites

Python 3.X
pip

Getting Started

Ensure you are using the pip associated with the desired Python env:

which pip
/Path/to/desired_venv/bin/pip

The printed path should point to the desired env.

Alternatively, invoke the desired env directly:

/Path/to/desired_venv/bin/python -m pip <commands>
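The same check can be made from inside Python. The sketch below compares the directory of a pip executable with the directory of the running interpreter; `same_env` is a hypothetical helper written for this example, and it compares paths as plain text without resolving symlinks.

```python
import sys
from pathlib import Path


def same_env(pip_path, python_path=sys.executable):
    """Return True if pip and the interpreter sit in the same bin directory."""
    return Path(pip_path).parent == Path(python_path).parent


# Show which interpreter is running, then compare it with a pip path
# (for example, the one printed by `which pip`).
print(sys.executable)
print(same_env("/Path/to/desired_venv/bin/pip", "/Path/to/desired_venv/bin/python"))
```

If the two directories differ, invoking pip through the interpreter with `python -m pip` avoids the mismatch.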

Within the desired env, install project requirements with:

pip install -U -r requirements.txt

Install the following NLTK data, which includes WordNet (again making sure to use the desired env):

python -m nltk.downloader wordnet
python -m nltk.downloader omw
python -m nltk.downloader popular

Before running any programs / notebooks dependent on Pywsd, patch Pywsd 1.2.4:

Programs dependent on Pywsd:

ensembleModels.py
pawarModel.py

Notebooks dependent on Pywsd:

ensembleModels-Dev-Train.ipynb
ensembleModels-Test.ipynb
pawarModel-Dev-Train.ipynb
pawarModel-Test.ipynb

As of Pywsd 1.2.4, a bug causes word sense disambiguation with pywsd.max_similarity to fail with an IndexError (future releases may make this patch obsolete). To patch it, locate the pywsd package in the site-packages directory of the env in which the STS-Pipeline will run:

Example location:

/Path/to/desired_venv/lib/python3.8/site-packages/pywsd/

Within the function max_similarity in .../site-packages/pywsd/similarity.py, add the following before the return statement:

    if not len(result):
        return None

max_similarity should now look like:

def max_similarity(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous words (see https://ibin.co/4gG9zUlejUUA.png):
    {argmax}_{synset(a)}(\sum_{i}^{n}{{max}_{synset(i)}(sim(i,a))}

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns a dict.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # If ambiguous word not in WordNet return None
    if not wn.synsets(ambiguous_word):
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in wn.synsets(ambiguous_word, pos=pos):
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i,k,option))
            result[i] += max(_result)

    if option in ["res","resnik"]: # lower score = more similar
        result = sorted([(v,k) for k,v in result.items()])
    else: # higher score = more similar
        result = sorted([(v,k) for k,v in result.items()],reverse=True)
    
    if not len(result):
        return None
    
    return result[0][1] if best else result

Save the edited file. All STS-Pipeline programs / notebooks dependent on Pywsd should now run without issue.
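To see why the guard matters, the ranking step can be reproduced without pywsd. The sketch below is a stand-in, not the library's code: `best_sense` is a hypothetical function and the synset names are plain strings used for illustration. In the listing above, the score dict appears to end up empty when `wn.synsets(ambiguous_word, pos=pos)` returns no synsets for the requested part of speech, and indexing the empty sorted list is what raises the IndexError.

```python
def best_sense(scores, best=True):
    """Rank candidate senses by score, highest first.

    `scores` maps a sense name to its accumulated similarity. An empty
    mapping yields None instead of raising IndexError on ranked[0].
    """
    ranked = sorted(((value, key) for key, value in scores.items()), reverse=True)
    if not ranked:
        # Same guard as the patch: nothing to rank, so there is no best sense.
        return None
    return ranked[0][1] if best else ranked


print(best_sense({"bank.n.01": 1.5, "bank.n.02": 0.7}))
print(best_sense({}))
```

Callers of the patched function should be prepared for a `None` result in addition to a synset.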

Before running programs / notebooks, ensure environment variables are correct:

Verify that the PYTHONPATH environment variable includes the sts_wrldom directory. For example:

echo $PYTHONPATH
> /Path/to/Projects/STS-Pipeline/sts_wrldom

Ensure that the working directory of the terminal the program will run in is the STS-Pipeline root directory. For example:

pwd
> /Path/to/Projects/STS-Pipeline
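The PYTHONPATH check can also be scripted. This is a minimal sketch: `pythonpath_includes` is a hypothetical helper, and it only checks that some PYTHONPATH entry ends in the given directory name, not that the directory exists.

```python
import os
from pathlib import Path


def pythonpath_includes(dirname, pythonpath=None):
    """Return True if any PYTHONPATH entry ends in `dirname`."""
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    # Entries are separated by os.pathsep (":" on POSIX, ";" on Windows).
    entries = [entry for entry in pythonpath.split(os.pathsep) if entry]
    return any(Path(entry).name == dirname for entry in entries)


print(pythonpath_includes("sts_wrldom"))
print(Path.cwd())
```

The second line printed is the current working directory, which should be the STS-Pipeline root.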

Usage

Standalone programs are:

corpusReader
enrichPipe
depTFIDFModel
pawarModel
ensembleModels

All programs run without command-line options, but each also accepts options. Use:

python <program>.py -h

for available options.

Warning:

STS-Pipeline/notebooks/embedModel-Dev-Train-Test.ipynb was built and run directly in Google Colab. Avoid running it locally, as it contains pip3 installs that may clutter the local environment.

References:

Pawar, Atish, and Vijay Mago. "Calculating the similarity between words and sentences using a lexical database and corpus statistics." arXiv preprint arXiv:1802.05667 (2018).

Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant et al. "Universal sentence encoder." arXiv preprint arXiv:1803.11175 (2018).
