Added Snowball stemmer to class Normalization.java #2

Open
wants to merge 53 commits into
base: master

lorikdumani
Member

  • Also added a test case

* the similarity threshold is now only defined in the
  so-posthistory-extractor project
* select "rightmost" minimal hash value in window
* remove duplicate hash values (considering position)
* update methods fingerprintList and completeFingerprintList
* update test case
* robust winnowing as described in Schleimer03
* window size is now always twice the nGram size (as in Schleimer03)
* update test case
* if input is too short, throw InputTooShortException
* now checking size of input strings for ngram and shingle metrics
* implement default metrics and variants using longestCommonSubsequence
  and optimalAlignment
* made required edit-based metrics public
* re-added metrics combining winnowing with edit-based metrics
* add file encoding
* update some versions
* remove unnecessary dependencies
* add comments
* InputTooShortException is now thrown there and not in the metrics
  using those methods
* update test cases
* add new test case checking if InputTooShortException is thrown
* won't implement this
* moved length check to nGramList and shingleList
* refactorings
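
The winnowing changes listed above (rightmost minimal hash per window, position-aware removal of duplicates, window size twice the n-gram size) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class and method names are hypothetical, `String.hashCode` is an assumed stand-in for whatever hash function the project uses, and `IllegalArgumentException` stands in for `InputTooShortException`. It implements only the plain rightmost-minimum selection; the extra tie-breaking rule of robust winnowing is not covered.

```java
import java.util.ArrayList;
import java.util.List;

class WinnowingSketch {

    // Hash every character n-gram of the text (String.hashCode is an assumption).
    static int[] nGramHashes(String text, int n) {
        if (text.length() < n) {
            throw new IllegalArgumentException("input shorter than n-gram size");
        }
        int[] hashes = new int[text.length() - n + 1];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = text.substring(i, i + n).hashCode();
        }
        return hashes;
    }

    // Select the rightmost minimal hash in each window of `window` hashes.
    // Each result entry is {position, hash}; a hash is recorded once per
    // position, so equal values at different positions are both kept.
    static List<int[]> winnow(int[] hashes, int window) {
        if (window <= 0 || hashes.length < window) {
            throw new IllegalArgumentException("fewer hashes than window size");
        }
        List<int[]> selected = new ArrayList<>();
        int lastPos = -1;
        for (int start = 0; start + window <= hashes.length; start++) {
            int minPos = start;
            for (int i = start + 1; i < start + window; i++) {
                // "<=" moves to the right on ties, giving the rightmost minimum
                if (hashes[i] <= hashes[minPos]) {
                    minPos = i;
                }
            }
            if (minPos != lastPos) { // same position already recorded: skip
                selected.add(new int[] {minPos, hashes[minPos]});
                lastPos = minPos;
            }
        }
        return selected;
    }

    // Convenience wrapper: window size is twice the n-gram size.
    static List<int[]> fingerprint(String text, int nGramSize) {
        return winnow(nGramHashes(text, nGramSize), 2 * nGramSize);
    }

    public static void main(String[] args) {
        int[] hashes = {77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98};
        for (int[] fp : winnow(hashes, 4)) {
            System.out.println("position " + fp[0] + ", hash " + fp[1]);
        }
    }
}
```

With the hash sequence in `main` and a window of 4, the selected positions are 3, 6, 8, 11 and 15 (hashes 17, 17, 8, 39, 17): the value 17 is kept at positions 3 and 6 because duplicates are removed by position, not by value.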
Reason: Values may be larger than 1.0:

str1 = "abc"
str2 = "abc"

2-Grams: {"ab", "bc"}
Intersection: {"ab", "bc"}

Dice variant: 2*2 / 2 = 2
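
The arithmetic above can be reproduced with a small sketch. It assumes the removed variant divided by the size of the union of the two n-gram sets (the commit only shows `2*2 / 2`, so the exact denominator is a guess) and compares it with the standard Dice coefficient, which divides by the sum of the two set sizes. Class and method names are hypothetical and not taken from the project.

```java
import java.util.HashSet;
import java.util.Set;

class DiceVariantDemo {

    // Character n-grams of a string as a set, e.g. ("abc", 2) -> {"ab", "bc"}.
    static Set<String> nGramSet(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return inter.size();
    }

    // Standard Dice: 2*|A ∩ B| / (|A| + |B|). Stays within [0, 1].
    // Assumes at least one set is non-empty.
    static double dice(Set<String> a, Set<String> b) {
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    // Assumed variant: 2*|A ∩ B| / |A ∪ B|. Can exceed 1.0.
    // Assumes at least one set is non-empty.
    static double diceVariant(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 2.0 * intersectionSize(a, b) / union.size();
    }

    public static void main(String[] args) {
        Set<String> a = nGramSet("abc", 2);
        Set<String> b = nGramSet("abc", 2);
        System.out.println("dice        = " + dice(a, b));        // 1.0
        System.out.println("diceVariant = " + diceVariant(a, b)); // 2.0
    }
}
```

For identical inputs the standard coefficient yields 1.0, while the variant yields 2.0, which is outside the expected similarity range.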
sbaltes and others added 23 commits October 26, 2017 16:32
* refactored test cases
* removed dice variant based metrics
* issue when one string was empty
* now correctly ordering strings
* add test case
* fixed bug in Levenshtein-based metrics
* Kondrak05-based metrics now also throw InputTooShortException if input
  length is shorter than nGram size
* fix bug in optimal alignment
* revise Kondrak05-based metrics
* revise test cases
* rename method Base.equals to Base.equal to avoid confusion with
  Object.equals
* move equality-based metrics to package "equal"
* rename method Base.equals to Base.equal to avoid confusion with
  Object.equals
* move util methods to util project
* update default values after evaluation
* now calling listToSet instead of nGramSet (copy&paste error)
* fix bug in token-based similarity
* Implement a method to normalize text blocks by a stemmer
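
Several commits above move the input-length check into the list-building helpers, so that metrics built on them no longer check lengths themselves. A minimal sketch of that idea is shown here. The exception name comes from the commit messages, but its superclass and constructor are assumptions, and the `nGramList` signature is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NGramLengthCheckSketch {

    // Stand-in for the project's exception; making it unchecked is an assumption.
    static class InputTooShortException extends RuntimeException {
        InputTooShortException(String message) {
            super(message);
        }
    }

    // Builds the list of character n-grams and rejects inputs shorter than
    // the n-gram size, so callers do not need their own length check.
    static List<String> nGramList(String input, int nGramSize) {
        if (nGramSize <= 0) {
            throw new IllegalArgumentException("nGramSize must be positive");
        }
        if (input.length() < nGramSize) {
            throw new InputTooShortException(
                "input length " + input.length() + " is shorter than n-gram size " + nGramSize);
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + nGramSize <= input.length(); i++) {
            grams.add(input.substring(i, i + nGramSize));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(nGramList("abc", 2));
        try {
            nGramList("a", 2);
        } catch (InputTooShortException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Keeping the check in one place means a metric that calls the helper simply lets the exception propagate instead of repeating the comparison.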

2 participants