Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raise StringLengthException if vectoriser is applied to strings that … #67

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

gw00207
Copy link

@gw00207 gw00207 commented Sep 1, 2021

…are not greater than or equal in length to ngram_size

…are not all greater in length than ngram_size
@gw00207 gw00207 force-pushed the feature/raise_exception_if_strings_not_long branch from 77ed699 to f9f316f Compare September 1, 2021 13:36
@ParticularMiner
Copy link
Contributor

Thanks @gw00207

Looks good. Eventually @Bergvca (the maintainer) will need to approve and pull this.

One question though: what happens in those cases in which there are string values with length greater than or equal to ngram_size to begin with but are later trimmed to a length less than ngram_size after cleanup (such as punctuation removal, for example)?

One such string could be 'se,', which for ngram_size=3 and regex=',' would pass your validation test as it is now but would still raise the ValueError exception.

@gw00207
Copy link
Author

gw00207 commented Sep 1, 2021

@ParticularMiner i have updated the test and code but obviously adding this regex replacement has a computational cost, interested what you think

@ParticularMiner
Copy link
Contributor

@gw00207

obviously adding this regex replacement has a computational cost, interested what you think

Good point. If indeed this validation is computationally costly, then may I suggest you attempt to tackle the problem at its source instead of performing your own validation. What I mean is, you could instead reexamine the traceback log of the original ValueError you obtained to determine which line in string_grouper's code is producing the error, and then place your own error handler there which would, in turn, raise your preferred exception.

Is this alternative acceptable to you?

@gw00207
Copy link
Author

gw00207 commented Sep 3, 2021

@ParticularMiner how is this? I can squash commits if required

@ParticularMiner
Copy link
Contributor

@gw00207

I've tested it myself and it's perfect!

Though I was surprised by the fact that all strings need to be problematic before a ValueError occurs. Originally, I had thought that just one problematic string would raise an exception. Thanks for bringing that to my attention.

@Bergvca (the maintainer) will let you know whether you need to squash your commits after he approves your PR.

Thanks for this!

@Bergvca
Copy link
Owner

Bergvca commented Sep 20, 2021

Hi @gw00207 ,

Thanks for your interest in this repo, and taking the time to help improve it!. I understand the need for the PR, and the reasoning of @ParticularMiner to get to this implementation.

However, the try/except clause now catches all value errors that the TfIdfVectorizer's "fit" function might throw. It could be that the string length (after stopword/punctuation removal) is the only one, in that case I can approve this PR. Could you check if there are other reasons that could cause a ValueError in the TfIdfVectorizer.fit() function?

Another question - what do we need the separate "Error" subclass?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants