sumn2u/string-comparisons

String Comparisons

This library provides functions for calculating text similarity, so you can measure how alike two pieces of text are in your application. It implements well-established similarity and distance metrics and currently supports the following algorithms:

  • Cosine Similarity
  • Jaccard Similarity
  • Jaro Similarity
  • Jaro-Winkler Similarity
  • Damerau-Levenshtein Distance
  • Hamming Distance
  • Levenshtein Distance
  • Smith-Waterman Alignment
  • Sørensen-Dice Coefficient
  • Jaccard Similarity based on Trigrams
  • Szymkiewicz Simpson Overlap
  • N-Gram
  • Q-Gram
  • Optimal String Alignment

Installation

Assuming you have Node.js and npm/yarn/pnpm installed, install the library using:

# Install the 'string-comparisons' package using npm
npm install string-comparisons

# Alternatively, install the 'string-comparisons' package using yarn
yarn add string-comparisons

# Or, install the 'string-comparisons' package using pnpm
pnpm add string-comparisons

Docs

For details on each algorithm, see the class documentation of its implementation.

String Similarity Algorithm Comparison

Algorithm                    | Normalized | Metric                                | Similarity | Distance | Space Complexity
-----------------------------|------------|---------------------------------------|------------|----------|-----------------
cosine.js                    | Yes        | Vector Space Model                    |            |          | O(n)
jaro.js                      | No         | Edit Distance                         |            |          | O(min(n, m))
jaccard.js                   | No         | Set Theory                            |            |          | O(min(n, m))
damerauLevenshtein.js        | No         | Edit Distance                         |            |          | O(max(n, m)²)
hammingDistance.js           | No         | Bitwise Operations                    |            |          | O(1)
jaroWinkler.js               | No         | Edit Distance                         |            |          | O(min(n, m))
levenshtein.js               | No         | Edit Distance                         |            |          | O(max(n, m)²)
smithWaterman.js             | No         | Dynamic Programming (Local Alignment) |            |          | O(n * m)
sorensenDice.js              | No         | Set Theory                            |            |          | O(min(n, m))
trigram.js                   | No         | N-gram Overlap                        |            |          | O(n²)
szymkiewiczSimpsonOverlap.js | Yes        | Overlap Coefficient                   |            |          | O(min(n, m))
nGram.js                     | Yes        | Jaccard similarity coefficient        |            |          | O(m * n)
qGram.js                     | Yes        | Jaccard similarity coefficient        |            |          | O(n + m)
optimalStringAlignment.js    | No         | Edit Distance                         |            |          | O(max(n, m)²)

Explanation of Columns:

  • Normalized: Indicates whether the algorithm produces a score between 0 and 1 (normalized).
  • Metric: The underlying mathematical concept used for comparison.
  • Similarity: Whether the algorithm outputs a higher score for more similar strings.
  • Distance: Whether the algorithm outputs a lower score for more similar strings. Similarity and distance express the same information in opposite directions.
  • Space Complexity: The amount of extra memory the algorithm needs to run the comparison.
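To make the Distance and Space Complexity columns concrete, here is a minimal standalone sketch of a Hamming distance. It is an illustration only, not the code in hammingDistance.js; the function name `hammingDistance` and the choice to throw on unequal lengths are assumptions made for this example. The result is a distance (0 means identical), and the only extra memory used is a single counter, which is what O(1) space means.

```javascript
// Hypothetical helper for illustration; not the library's implementation.
function hammingDistance(a, b) {
  // Hamming distance is only defined for strings of equal length.
  if (a.length !== b.length) {
    throw new RangeError('Hamming distance requires strings of equal length');
  }
  let distance = 0; // the only extra state: one counter, hence O(1) space
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++; // count positions where the characters differ
  }
  return distance;
}

console.log(hammingDistance('karolin', 'kathrin')); // 3 (positions 2, 3 and 4 differ)
console.log(hammingDistance('same', 'same'));       // 0 (identical strings)
```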

Notes:

  • ✓ indicates the algorithm applies to that category.
  • Some algorithms can be used for both similarity and distance calculations depending on the interpretation of the score.
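As a rough illustration of how one score can be read either way, the sketch below computes a Levenshtein distance and then rescales it into a similarity between 0 and 1. Both function names are hypothetical and this is not the library's levenshtein.js; in particular, this variant keeps only two rows of the dynamic-programming table, and dividing by the longer string's length is just one common normalization.

```javascript
// Hypothetical helpers for illustration; not the library's implementation.
function levenshteinDistance(a, b) {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
function normalizedSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshteinDistance(a, b) / longest;
}

console.log(levenshteinDistance('programming', 'programmer'));  // 3
console.log(normalizedSimilarity('programming', 'programmer')); // 1 - 3/11, roughly 0.727
```

A lower distance and a higher normalized similarity describe the same pair of strings; which form is more convenient depends on whether you are ranking matches or applying a threshold.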

Example Usage

import StringComparisons from 'string-comparisons';

const { Cosine, Jaccard, Jaro, DamerauLevenshtein, HammingDistance, JaroWrinker, Levenshtein, SmithWaterman, SorensenDice, Trigram } = StringComparisons;

const string1 = 'programming';
const string2 = 'programmer';


console.log('Jaro-Winkler similarity:', JaroWrinker.similarity(string1, string2)); // Output: ~0.9054545454545454
console.log('Levenshtein distance:', Levenshtein.similarity(string1, string2)); // Output: 3
console.log('Smith-Waterman similarity:', SmithWaterman.similarity(string1, string2)); // Output: 16

const set1 = new Set([1, 2, 3]);
const set2 = new Set([2, 3, 4]);

console.log('Sørensen-Dice similarity:', SorensenDice.similarity(set1, set2)); // Output: ~0.6667

const trigram1 = 'hello';
const trigram2 = 'world';

console.log('Trigram Jaccard similarity:', Trigram.similarity(trigram1, trigram2)); // Output: 0 (no shared trigrams)

// ...and so on for the other algorithms
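To see where the last two results in the example come from, the following standalone sketch recomputes them by hand. The helper names (`trigrams`, `jaccard`, `sorensenDice`) are hypothetical, and the trigram extraction assumes plain three-character windows with no padding or case folding; the library's own classes may differ in those details.

```javascript
// Hypothetical helpers for illustration; not the library's implementation.

// Collect every 3-character window of the text into a set (assumes no padding).
function trigrams(text) {
  const grams = new Set();
  for (let i = 0; i + 3 <= text.length; i++) {
    grams.add(text.slice(i, i + 3));
  }
  return grams;
}

// Number of items present in both sets.
function intersectionSize(a, b) {
  let count = 0;
  for (const item of a) {
    if (b.has(item)) count++;
  }
  return count;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  const shared = intersectionSize(a, b);
  const unionSize = a.size + b.size - shared;
  return unionSize === 0 ? 1 : shared / unionSize;
}

// Sørensen-Dice coefficient: 2 * |A ∩ B| / (|A| + |B|).
function sorensenDice(a, b) {
  const total = a.size + b.size;
  return total === 0 ? 1 : (2 * intersectionSize(a, b)) / total;
}

// 'hello' -> {hel, ell, llo}, 'world' -> {wor, orl, rld}: nothing in common.
console.log(jaccard(trigrams('hello'), trigrams('world'))); // 0

// {1, 2, 3} and {2, 3, 4} share two items: 2 * 2 / (3 + 3), roughly 0.667.
console.log(sorensenDice(new Set([1, 2, 3]), new Set([2, 3, 4])));
```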

Contributing

We encourage contributions to this library! Feel free to fork the repository, make your changes, and submit pull requests.

Support the Project

If you feel awesome and want to support us in a small way, please consider starring and sharing the repo! This helps us gain visibility and allows the community to grow. 🙏

Contact Us

If you have any questions or feedback, please don't hesitate to contact us at [email protected], or reach out to Suman directly. We hope you find this resource helpful 💜.

License Information

This project is licensed under the MIT License, which means that you are free to use, modify, and distribute the code as long as you comply with the terms of the license.