BigBossAnwer/STS-Pipeline


Semantic Textual Similarity Pipeline

A text-based, sentence-pair Semantic Similarity pipeline written in Python. It combines Natural Language Processing via spaCy & pywsd, Machine and Deep Learning via scikit-learn & TensorFlow Hub models (Cer et al., 2018), and heuristic approaches (Pawar & Mago, 2018) into an ensemble model that scores the semantic similarity of two English sentences on an integer scale from 1 (not very similar) to 5 (highly similar).
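The 1–5 output scale can be illustrated with a minimal sketch. The plain averaging and the base-model scores below are assumptions for illustration only, not the pipeline's actual ensemble weighting:

```python
def to_sts_scale(score: float) -> int:
    """Map a continuous similarity in [0, 1] to the integer 1-5 STS scale."""
    score = min(max(score, 0.0), 1.0)  # clamp defensively
    return 1 + round(score * 4)

def ensemble_score(base_scores: list[float]) -> int:
    """Average base-model similarities, then map to the 1-5 scale.
    (Plain averaging is a placeholder; the real ensemble may weight models.)"""
    return to_sts_scale(sum(base_scores) / len(base_scores))

# e.g. three hypothetical base models agree a pair is quite similar:
print(ensemble_score([0.82, 0.75, 0.9]))  # -> 4
```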

Input is given as a tab-delimited corpus of unscored sentence pairs, and output is produced as a tab-delimited table of scores. See the dev-set for a sample input and the dev-set predictions for a sample output.
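A minimal sketch of that I/O contract using the standard csv module. The exact column layout is an assumption here; consult the dev-set for the actual format:

```python
import csv

def read_pairs(path):
    """Read a tab-delimited corpus: one sentence pair per row (first two columns)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [tuple(row[:2]) for row in csv.reader(f, delimiter="\t")]

def write_scores(path, scores):
    """Write one integer score (1-5) per row, tab-delimited."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for score in scores:
            writer.writerow([score])
```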

For metrics and error analysis of both the base and ensemble models, see the corresponding notebook.

Pipeline Architecture

STS Pipeline Architecture

Prerequisites

  • Python 3.X
  • pip

Getting Started

Ensure you are using the pip associated with the desired Python env:

which pip
/Path/to/desired_venv/bin/pip

The output should point to the desired env.

Alternatively, invoke the desired env directly:

/Path/To/desired_venv/bin/python -m pip <commands>

Within the desired env, install project requirements with:

pip install -U -r requirements.txt

Install the following NLTK WordNet data (again making sure to use the desired env):

python -m nltk.downloader wordnet
python -m nltk.downloader omw
python -m nltk.downloader popular

Before running any programs / notebooks dependent on Pywsd, patch Pywsd 1.2.4:

Programs dependent on Pywsd:

  • ensembleModels.py
  • pawarModel.py

Notebooks dependent on Pywsd:

  • ensembleModels-Dev-Train.ipynb
  • ensembleModels-Test.ipynb
  • pawarModel-Dev-Train.ipynb
  • pawarModel-Test.ipynb

As of Pywsd 1.2.4 (future releases may render this patch obsolete), a bug in Pywsd causes word sense disambiguation via pywsd.max_similarity to fail with an IndexError. To patch it, find the Pywsd module in the site-packages of the env it was installed in / the env the STS-Pipeline will run within:
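The failure mode is ordinary empty-sequence indexing: when no synset scores are collected, `result` ends up empty and `result[0][1]` raises. A minimal illustration, independent of Pywsd:

```python
# Indexing into an empty, sorted result list raises IndexError;
# the patch guards against this case by returning None instead.
result = []
try:
    best = result[0][1]
except IndexError:
    best = None
print(best)  # -> None
```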

Example location:

/Path/to/desired_venv/lib/python3.8/site-packages/pywsd/

Within the method max_similarity in .../site-packages/pywsd/similarity.py, add the following before the return statement:

    if not len(result):
        return None

max_similarity should now look like:

def max_similarity(context_sentence: str, ambiguous_word: str, option="path",
                   lemma=True, context_is_lemmatized=False, pos=None, best=True) -> "wn.Synset":
    """
    Perform WSD by maximizing the sum of maximum similarity between possible
    synsets of all words in the context sentence and the possible synsets of the
    ambiguous words (see https://ibin.co/4gG9zUlejUUA.png):
    {argmax}_{synset(a)}(\sum_{i}^{n}{{max}_{synset(i)}(sim(i,a))}

    :param context_sentence: String, a sentence.
    :param ambiguous_word: String, a single word.
    :return: If best, returns only the best Synset, else returns a dict.
    """
    ambiguous_word = lemmatize(ambiguous_word)
    # If ambiguous word not in WordNet return None
    if not wn.synsets(ambiguous_word):
        return None
    if context_is_lemmatized:
        context_sentence = word_tokenize(context_sentence)
    else:
        context_sentence = [lemmatize(w) for w in word_tokenize(context_sentence)]
    result = {}
    for i in wn.synsets(ambiguous_word, pos=pos):
        result[i] = 0
        for j in context_sentence:
            _result = [0]
            for k in wn.synsets(j):
                _result.append(sim(i,k,option))
            result[i] += max(_result)

    if option in ["res","resnik"]: # lower score = more similar
        result = sorted([(v,k) for k,v in result.items()])
    else: # higher score = more similar
        result = sorted([(v,k) for k,v in result.items()],reverse=True)
    
    if not len(result):
        return None
    
    return result[0][1] if best else result

Save the edited file. All STS-Pipeline programs / notebooks dependent on Pywsd should now run without issue.

Before running programs / notebooks, ensure environment variables are correct:

Verify that the PYTHONPATH environment variable includes the sts_wrldom directory. For example:

echo $PYTHONPATH
> /Path/to/Projects/STS-Pipeline/sts_wrldom

Ensure that the working directory of the terminal the program will run in is the STS-Pipeline root directory. For example:

pwd
> /Path/to/Projects/STS-Pipeline
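Both checks can be scripted. A minimal sketch, where env_looks_ok is a hypothetical helper and not part of the pipeline:

```python
import os
import sys

def env_looks_ok(project_root: str) -> bool:
    """Sanity-check the runtime environment before launching a program:
    the sts_wrldom directory must be on the import path, and the current
    working directory should be the STS-Pipeline root."""
    sts_dir = os.path.realpath(os.path.join(project_root, "sts_wrldom"))
    on_path = any(os.path.realpath(p) == sts_dir for p in sys.path)
    in_root = os.path.realpath(os.getcwd()) == os.path.realpath(project_root)
    return on_path and in_root
```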

Usage

Standalone programs are:

  • corpusReader
  • enrichPipe
  • depTFIDFModel
  • pawarModel
  • ensembleModels

All programs run without any command-line options, but each also offers options. Use:

python <program>.py -h

to list the available options.
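As a sketch of the pattern, an argparse setup for a program of this kind might look like the following. The flag names and defaults here are hypothetical; run `python <program>.py -h` for each program's real options:

```python
import argparse

def build_parser():
    # Hypothetical flags for illustration only; consult each program's -h output.
    parser = argparse.ArgumentParser(description="STS pipeline program")
    parser.add_argument("--input", default="data/dev-set.txt",
                        help="tab-delimited corpus of sentence pairs")
    parser.add_argument("--output", default="predictions.txt",
                        help="tab-delimited table of 1-5 scores")
    return parser

# Passing an empty list uses the defaults, mirroring a no-option run:
args = build_parser().parse_args([])
print(args.input)  # -> data/dev-set.txt
```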

Warning:

STS-Pipeline/notebooks/embedModel-Dev-Train-Test.ipynb was built and run directly inside Google Colab. Avoid running it in a local environment, as it performs some pip3 installs that might clutter that environment.

References:

Pawar, Atish, and Vijay Mago. "Calculating the similarity between words and sentences using a lexical database and corpus statistics." arXiv preprint arXiv:1802.05667 (2018).

Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant et al. "Universal sentence encoder." arXiv preprint arXiv:1803.11175 (2018).
