raise StringLengthException if vectoriser is applied to strings that … #67
…are not all greater in length than ngram_size
77ed699 to f9f316f
Thanks @gw00207. Looks good. Eventually @Bergvca (the maintainer) will need to approve and pull this. One question though: what happens in those cases in which there are string values with length greater than or equal to One such string could be
@ParticularMiner I have updated the test and code, but adding this regex replacement obviously has a computational cost. Interested in what you think.
Good point. If indeed this validation is computationally costly, then may I suggest you attempt to tackle the problem at its source instead of performing your own validation. What I mean is, you could instead reexamine the traceback log of the original Is this alternative acceptable to you?
@ParticularMiner How is this? I can squash commits if required.
I've tested it myself and it's perfect! Though I was surprised by the fact that all strings need to be problematic before a @Bergvca (the maintainer) will let you know whether you need to squash your commits after he approves your PR. Thanks for this!
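The observation that every string has to be too short before anything fails can be illustrated with a small sketch. It assumes a simplified character n-gram helper; `char_ngrams` and `build_vocabulary` are hypothetical names used only for illustration, not the library's own code:

```python
def char_ngrams(string, ngram_size=3):
    """Return all contiguous character n-grams of `string` (hypothetical helper)."""
    # For strings shorter than ngram_size the range is empty, so the result is [].
    return [string[i:i + ngram_size] for i in range(len(string) - ngram_size + 1)]


def build_vocabulary(strings, ngram_size=3):
    """Collect the set of n-grams over all strings.

    Raises ValueError only when no string at all yields an n-gram.
    """
    vocabulary = set()
    for s in strings:
        vocabulary.update(char_ngrams(s, ngram_size))
    if not vocabulary:
        raise ValueError("empty vocabulary")
    return vocabulary
```

In this sketch a single sufficiently long string is enough to populate the vocabulary, so shorter strings alongside it pass through silently and simply contribute no n-grams.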
Hi @gw00207, thanks for your interest in this repo, and for taking the time to help improve it! I understand the need for the PR, and the reasoning of @ParticularMiner to get to this implementation. However, the try/except clause now catches all value errors that the TfIdfVectorizer's "fit" function might throw. It could be that the string length (after stopword/punctuation removal) is the only one; in that case I can approve this PR. Could you check if there are other reasons that could cause a ValueError in the TfIdfVectorizer.fit() function? Another question: why do we need the separate "Error" subclass? Thanks!
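One way to address the concern about catching every ValueError is to translate only the failure of interest and let anything else propagate. The following is a rough sketch, not the PR's actual code: `fit_with_narrow_handler` is a hypothetical wrapper, `fit` stands in for whatever callable performs the fitting, and matching on the text "empty vocabulary" is an assumption about the error message that should be checked against the installed library version.

```python
class StringLengthException(Exception):
    """Raised when no input string is long enough to yield an n-gram."""


def fit_with_narrow_handler(fit, strings):
    """Call `fit(strings)`, translating only the empty-vocabulary ValueError.

    `fit` is any callable taking the list of strings (hypothetical stand-in).
    """
    try:
        return fit(strings)
    except ValueError as err:
        # Assumption: the failure of interest mentions "empty vocabulary".
        if "empty vocabulary" in str(err):
            raise StringLengthException(
                "None of the input strings is long enough for ngram_size"
            ) from err
        raise  # any other ValueError propagates unchanged
```

Matching on message text is brittle, so an alternative under the same assumptions would be to keep the plain ValueError and only improve its message, which would also remove the need for a separate exception subclass.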
…are not greater than or equal in length to ngram_size