[QUESTION] Short sentences in en #24

loretoparisi · 2019-11-27T14:16:36Z

Hello, I have found this case that seems strange for the input string "I am the begt spell cherken!":

int maxEd = 2;
suggestionItems = symSpellCheck.lookupCompound("I am the begt spell cherken!", maxEd);
for (SuggestionItem elem : suggestionItems) {
  System.out.println("compound : " + elem.getTerm().trim());
}

The output I get is: compound : a am the best spell cher ken

My setup is the default one:

    SpellCheckSettings spellCheckSettings = SpellCheckSettings.builder()
        .countThreshold(1)
        .deletionWeight(1f).insertionWeight(1f).replaceWeight(1f).transpositionWeight(1f)
        .maxEditDistance(2).topK(5).prefixLength(10)
        .verbosity(Verbosity.ALL)
        .build();

    dataHolder = new InMemoryDataHolder(spellCheckSettings, new Murmur3HashFunction());

    // weighted Damerau-Levenshtein
    weightedDamerauLevenshteinDistance = new WeightedDamerauLevenshteinDistance(
        spellCheckSettings.getDeletionWeight(), spellCheckSettings.getInsertionWeight(),
        spellCheckSettings.getReplaceWeight(), spellCheckSettings.getTranspositionWeight(), null);

    symSpellCheck = new SymSpellCheck(dataHolder, weightedDamerauLevenshteinDistance, spellCheckSettings);


MighTguY · 2019-12-01T16:36:07Z

Thanks, @loretoparisi, for pointing out the issue.

The conversion of "I" to "a" happens because of a casing issue, which I have fixed in the PR.
The split of "cherken" into "cher ken" comes from the lookup compound algorithm.
Currently, the spell correction relies mainly on the term frequency of words and does not take the preceding word into account as context. The top suggestion SymSpell returns for this term at lookup time is not "checker" but "chicken", which is why you get this output.
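
To illustrate why a ranking based only on edit distance and term frequency can prefer "chicken" over "checker", here is a minimal, self-contained sketch. It does not use the library's classes: `FrequencyOnlyRanking`, `editDistance`, and `best` are hypothetical names, the distance is plain Levenshtein rather than the weighted Damerau-Levenshtein configured above, and the frequency counts are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FrequencyOnlyRanking {

    // Plain Levenshtein distance: insert, delete and replace each cost 1.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Smallest distance wins; ties are broken by the higher term frequency.
    // No neighbouring word is consulted at any point.
    static String best(String typo, Map<String, Long> frequencies) {
        String bestWord = null;
        int bestDistance = Integer.MAX_VALUE;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : frequencies.entrySet()) {
            int dist = editDistance(typo, e.getKey());
            if (dist < bestDistance || (dist == bestDistance && e.getValue() > bestCount)) {
                bestWord = e.getKey();
                bestDistance = dist;
                bestCount = e.getValue();
            }
        }
        return bestWord;
    }

    public static void main(String[] args) {
        Map<String, Long> frequencies = new LinkedHashMap<>();
        frequencies.put("checker", 1_000L);   // hypothetical count
        frequencies.put("chicken", 50_000L);  // hypothetical count

        System.out.println(editDistance("cherken", "checker"));
        System.out.println(editDistance("cherken", "chicken"));
        System.out.println(best("cherken", frequencies));
    }
}
```

Both candidates are two substitutions away from "cherken", so in this sketch the tie is decided purely by the (assumed) frequency, and the more frequent "chicken" is returned even though the previous word is "spell".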

I am currently working on adding a noisy channel model to the current spell-checking algorithm so that it can take more meaningful context into account.
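
As a rough idea of what a noisy channel with context could look like, the sketch below scores each candidate as log P(candidate | previous word) + log P(typo | candidate). This is only an assumption about the general technique, not the planned implementation: the class `NoisyChannelSketch`, the add-one smoothed bigram model, the fixed per-edit penalty of 0.01, and all counts are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class NoisyChannelSketch {

    // Hypothetical counts, not taken from any real corpus.
    static final Map<String, Integer> UNIGRAMS = new HashMap<>();
    static final Map<String, Integer> BIGRAMS = new HashMap<>();
    static {
        UNIGRAMS.put("spell", 1_000);
        UNIGRAMS.put("checker", 1_000);
        UNIGRAMS.put("chicken", 50_000);
        BIGRAMS.put("spell checker", 400);
    }

    // Language model: add-one smoothed log P(word | previous).
    static double logLanguageModel(String previous, String word) {
        int bigram = BIGRAMS.getOrDefault(previous + " " + word, 0);
        int previousCount = UNIGRAMS.getOrDefault(previous, 0);
        return Math.log((bigram + 1.0) / (previousCount + UNIGRAMS.size()));
    }

    // Channel model: every edit multiplies the probability by 0.01 (assumed value).
    static double logChannel(int editDistance) {
        return editDistance * Math.log(0.01);
    }

    // Returns the candidate with the highest combined score.
    // The map holds each candidate and its edit distance from the typo.
    static String best(String previous, Map<String, Integer> candidates) {
        String bestWord = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            double score = logLanguageModel(previous, e.getKey()) + logChannel(e.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestWord = e.getKey();
            }
        }
        return bestWord;
    }

    public static void main(String[] args) {
        Map<String, Integer> candidates = new HashMap<>();
        candidates.put("checker", 2); // edit distance from "cherken"
        candidates.put("chicken", 2); // edit distance from "cherken"
        System.out.println(best("spell", candidates));
    }
}
```

With these made-up counts the channel term is the same for both candidates, so the bigram "spell checker" decides the outcome in favour of "checker", regardless of how frequent "chicken" is on its own.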

loretoparisi · 2019-12-01T22:12:36Z

@MighTguY thank you. I will try to build the latest PR.
Regarding context-based spelling, I discussed it with @wolfgarbe in wolfgarbe/SymSpell#61

where he suggests this word2vec approach: https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26?gi=5df44b8031b6

The Jupyter code of this example with common mistakes is here.
Other interesting n-gram-based approaches can be found in:

LanguageTool, Java, see Finding words in Context
JamSpell, C++

MighTguY · 2020-01-22T07:31:16Z

Hi @loretoparisi, the fix for the lowercase issue is done. I have also added Solr plugin code that uses the symspell lib.

MighTguY closed this as completed Jan 22, 2020
