Ngrams #13

Open
murtuzamdahod opened this issue Sep 30, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@murtuzamdahod

Does this model take care of n-grams such as "hot dog" = "hotdog" and "ice cream" = "icecream"?

I have these ngrams in my training data

Also, what if I want to remove words that were not corrected and are not in my vocabulary either?
For example:

IN : "Cheese hot dog abcd"
OUT: "Cheese hotdog"

chiragjn added the "enhancement" (New feature or request) label on Oct 1, 2020
@chiragjn
Contributor

chiragjn commented Oct 1, 2020

Unfortunately, no. In its current state, spello operates on unigrams and can't perform any n-gram normalisation.
We would gladly accept such an enhancement.

The latter part should be somewhat easy: with some modifications you can get the internal vocabulary of the model and then exclude words with a simple filter. spello does not do this by default because it might end up deleting important context (domain-specific jargon, etc.).
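
The "simple filter" could look roughly like the sketch below. It assumes the vocabulary has already been extracted into a plain Python set; how to pull it out of a trained spello model is not shown in this thread, so `vocabulary` and `drop_out_of_vocabulary` are hypothetical names used only for illustration.

```python
def drop_out_of_vocabulary(text, vocabulary):
    """Keep only whitespace-separated tokens found in `vocabulary` (case-insensitive)."""
    kept = [token for token in text.split() if token.lower() in vocabulary]
    return " ".join(kept)


# Hypothetical vocabulary; in practice it would come from the trained model.
vocabulary = {"cheese", "hotdog", "icecream"}

# Imagined output of an earlier correction step, still containing an unknown word.
corrected = "Cheese hotdog abcd"
print(drop_out_of_vocabulary(corrected, vocabulary))  # Cheese hotdog
```

As noted above, a filter like this also discards legitimate words the model never saw, so it is probably best kept as an optional post-processing step.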

@murtuzamdahod
Author

Thank you for your response. Then maybe I can build an ngram model on top of it.
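
One simple way to layer n-gram handling on top of a unigram corrector is a post-processing pass that joins two adjacent tokens whenever their concatenation is a known word. The sketch below is not part of spello; `merge_known_compounds` and the sample vocabulary are hypothetical, and it only covers the bigram case.

```python
def merge_known_compounds(text, vocabulary):
    """Join adjacent token pairs whose concatenation is in `vocabulary`."""
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            if joined.lower() in vocabulary:
                # "hot" + "dog" -> "hotdog": consume both tokens at once.
                merged.append(joined)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return " ".join(merged)


# Hypothetical vocabulary built from the training data.
vocabulary = {"cheese", "hotdog", "icecream"}
print(merge_known_compounds("Cheese hot dog", vocabulary))  # Cheese hotdog
```

This runs on already-corrected text, so a misspelt half ("hot dgo") would need to be fixed by the unigram step first.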

The major issue is that words like "panii puri" should be corrected to "pani puri" / "panipuri" as per the context. But it gives me "panini puri". I have trained spello on my dataset of around 20 lakh rows (3 lakh unique).

@chiragjn
Contributor

chiragjn commented Oct 1, 2020

Interesting,
Two questions:

  • Does panii occur in your training set? If yes, what is its count?
  • Does pani puri occur at least once in your training set?

I ask because, at the moment, spello does not attempt to correct a word if it is already in the vocabulary of the trained model. That is also one of the enhancements that would be welcome.
https://github.com/hellohaptik/spello#future-scope--limitations
Fixing grammatical mistakes and replacing legit words with contextually sensible words would definitely require more intelligence.

If panii does not occur in your training set, then that is surely a bug and we would like to fix it.
If possible, could you share only the sentences that contain panii, pani, or puri so that we can try to reproduce it?
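
A quick way to answer both questions is to count unigrams and adjacent word pairs directly in the training sentences. This is a minimal sketch that assumes plain lowercase whitespace tokenisation, which may differ from what spello does internally; the sample sentences are made up.

```python
from collections import Counter


def count_unigrams_and_bigrams(sentences):
    """Count single words and adjacent word pairs across all sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Made-up sample; replace with the real training rows.
training_sentences = ["pani puri", "cheese panini", "Pani Puri with curd"]
unigrams, bigrams = count_unigrams_and_bigrams(training_sentences)

print(unigrams["panii"])           # 0 -> not in this sample
print(bigrams[("pani", "puri")])   # 2
```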

@murtuzamdahod
Author

"panii" should not be in my vocabulary, because only then would it get corrected to "panini". But I am sure I don't have "panini puri" in my dataset :P
So, going by the context, "panii puri" should be "pani puri".

Earlier, I was just using regular expressions to manually map the n-grams that I need. Now I will need to look into building an n-gram model and see if it works well.
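
A very small version of such an n-gram model is a bigram re-ranker: given several candidate corrections for a word, prefer the one seen most often before the following word in the training data. The sketch below assumes the candidates come from some other step (candidate generation is not shown, and nothing here is spello's API); `pick_by_context` and the counts are hypothetical.

```python
from collections import Counter


def pick_by_context(candidates, next_word, bigram_counts):
    """Return the candidate seen most often before `next_word`.

    Ties, including the case where no candidate was ever seen before
    `next_word`, resolve to the earliest candidate in the list.
    """
    return max(candidates, key=lambda cand: bigram_counts[(cand, next_word)])


# Made-up bigram counts; in practice these would be built from the training rows.
bigram_counts = Counter({("pani", "puri"): 5, ("panini", "sandwich"): 3})

# "panii puri": both "panini" and "pani" are plausible edits of "panii".
print(pick_by_context(["panini", "pani"], "puri", bigram_counts))  # pani
```

Because "panini puri" never occurs in the counts, "pani" wins even though "panini" is listed first.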

2 participants