[QUESTION] Short sentences in en #24

Closed
loretoparisi opened this issue Nov 27, 2019 · 3 comments

Comments

loretoparisi commented Nov 27, 2019

Hello, I found a case that seems strange. For the input string "I am the begt spell cherken!":

int maxEd = 2;
suggestionItems = symSpellCheck.lookupCompound("I am the begt spell cherken!", maxEd);
for (SuggestionItem elem : suggestionItems) {
  System.out.println("compound : " + elem.getTerm().trim());
}

I'm getting compound : a am the best spell cher ken

My setup is the default one:

SpellCheckSettings spellCheckSettings = SpellCheckSettings.builder()
    .countThreshold(1)
    .deletionWeight(1f)
    .insertionWeight(1f)
    .replaceWeight(1f)
    .maxEditDistance(2)
    .transpositionWeight(1f)
    .topK(5)
    .prefixLength(10)
    .verbosity(Verbosity.ALL)
    .build();

dataHolder = new InMemoryDataHolder(spellCheckSettings, new Murmur3HashFunction());

// weighted Damerau-Levenshtein
weightedDamerauLevenshteinDistance = new WeightedDamerauLevenshteinDistance(
    spellCheckSettings.getDeletionWeight(),
    spellCheckSettings.getInsertionWeight(),
    spellCheckSettings.getReplaceWeight(),
    spellCheckSettings.getTranspositionWeight(),
    null);

symSpellCheck = new SymSpellCheck(dataHolder, weightedDamerauLevenshteinDistance, spellCheckSettings);

MighTguY (Owner) commented Dec 1, 2019

Thanks, @loretoparisi, for pointing out the issue.

The "I" to "a" conversion is caused by a casing issue, which I have fixed in the PR.
The split of "cherken" into "cher ken" comes from the lookup compound algorithm.
Currently, the spell correction relies mainly on the term frequency of the words and does not take the context of the preceding words into account. The top suggestion returned by SymSpell at lookup time is itself not "checker" but "chicken", which is why you get this output.

I am currently working on adding a noisy channel model to the current algorithm so that the spell checker can use more meaningful context.
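
To illustrate the point about term frequency versus context, here is a minimal sketch. It is not the library's code: the class name, the method names and every count in it are made up for illustration. Both "checker" and "chicken" are two substitutions away from "cherken", so with maxEd = 2 a frequency-only ranking simply picks the more common word, while a ranking that looks at the previous word ("spell") can prefer "checker".

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in the symspell library.
final class ContextRerankSketch {
    // Invented unigram counts: "chicken" is assumed to be far more frequent overall.
    static final Map<String, Long> UNIGRAM = Map.of("checker", 500L, "chicken", 9000L);
    // Invented bigram counts: "spell checker" is assumed to be the common phrase.
    static final Map<String, Long> BIGRAM = Map.of("spell checker", 400L, "spell chicken", 1L);

    // Frequency-only ranking: the candidate with the highest unigram count wins.
    static String bestByFrequency(List<String> candidates) {
        String best = null;
        long bestCount = -1;
        for (String candidate : candidates) {
            long count = UNIGRAM.getOrDefault(candidate, 0L);
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }

    // Context-aware ranking: score each candidate by how often it follows the previous word.
    static String bestByContext(String previousWord, List<String> candidates) {
        String best = null;
        long bestCount = -1;
        for (String candidate : candidates) {
            long count = BIGRAM.getOrDefault(previousWord + " " + candidate, 0L);
            if (count > bestCount) {
                bestCount = count;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("checker", "chicken");
        System.out.println("frequency only: " + bestByFrequency(candidates));
        System.out.println("with context  : " + bestByContext("spell", candidates));
    }
}
```

A full noisy channel model would also weight each candidate by how likely the observed typo is given that candidate; the sketch leaves that term out because both candidates sit at the same edit distance here.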

loretoparisi (Author) commented Dec 1, 2019

@MighTguY thank you. I will try building the latest PR.
Regarding context-based spelling correction, I discussed it with @wolfgarbe in wolfgarbe/SymSpell#61,

where he suggests this word2vec approach: https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26?gi=5df44b8031b6

The Jupyter code of this example with common mistakes is here.
Another interesting n-gram based approach can be found in

MighTguY (Owner) commented

Hi @loretoparisi, the lowercase fix is done. I have also added the Solr plugin code to use the symspell lib.
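
For anyone reading later, here is a rough sketch of why casing can matter for a token like "I". It assumes the dictionary keys are stored in lower case and that the input token was not normalized before lookup; that is my assumption about the cause, not a description of the actual fix in the PR. The class, the tiny dictionary and the method names are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not the code from the PR.
final class CaseNormalizationSketch {
    // Invented tiny dictionary, assumed to hold lower-case terms only.
    static final Set<String> DICTIONARY = Set.of("i", "am", "the", "best");

    // Case-sensitive lookup: "I" is not found, so a spell checker would try to "correct" it.
    static boolean isKnownCaseSensitive(String token) {
        return DICTIONARY.contains(token);
    }

    // Normalized lookup: lower-case the token first, using a fixed locale for stable results.
    static boolean isKnownNormalized(String token) {
        return DICTIONARY.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("case-sensitive lookup of I: " + isKnownCaseSensitive("I"));
        System.out.println("normalized lookup of I    : " + isKnownNormalized("I"));
    }
}
```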
