raise StringLengthException if vectoriser is applied to strings that … #67

gw00207 · 2021-09-01T13:30:20Z

…are not greater than or equal in length to ngram_size

…are not all greater in length than ngram_size

ParticularMiner · 2021-09-01T14:05:23Z

Looks good. Eventually @Bergvca (the maintainer) will need to approve and pull this.

One question though: what happens in those cases in which there are string values with length greater than or equal to ngram_size to begin with but are later trimmed to a length less than ngram_size after cleanup (such as punctuation removal, for example)?

One such string could be 'se,', which for ngram_size=3 and regex=',' would pass your validation test as it is now but would still raise the ValueError exception.

gw00207 · 2021-09-01T15:22:38Z

@ParticularMiner i have updated the test and code but obviously adding this regex replacement has a computational cost, interested what you think

ParticularMiner · 2021-09-01T16:15:21Z

@gw00207

obviously adding this regex replacement has a computational cost, interested what you think

Good point. If indeed this validation is computationally costly, then may I suggest you attempt to tackle the problem at its source instead of performing your own validation. What I mean is, you could instead reexamine the traceback log of the original ValueError you obtained to determine which line in string_grouper's code is producing the error, and then place your own error handler there which would, in turn, raise your preferred exception.

Is this alternative acceptable to you?

gw00207 · 2021-09-03T15:10:13Z

@ParticularMiner how is this? I can squash commits if required

ParticularMiner · 2021-09-03T16:19:11Z

@gw00207

I've tested it myself and it's perfect!

Though I was surprised by the fact that all strings need to be problematic before a ValueError occurs. Originally, I had thought that just one problematic string would raise an exception. Thanks for bringing that to my attention.

@Bergvca (the maintainer) will let you know whether you need to squash your commits after he approves your PR.

Thanks for this!

Bergvca · 2021-09-20T13:01:22Z

Hi @gw00207 ,

Thanks for your interest in this repo, and taking the time to help improve it!. I understand the need for the PR, and the reasoning of @ParticularMiner to get to this implementation.

However, the try/except clause now catches all value errors that the TfIdfVectorizer's "fit" function might throw. It could be that the string length (after stopword/punctuation removal) is the only one, in that case I can approve this PR. Could you check if there are other reasons that could cause a ValueError in the TfIdfVectorizer.fit() function?

Another question - what do we need the separate "Error" subclass?

Thanks!

gw00207 mentioned this pull request Sep 1, 2021

how to handle 'ValueError: empty vocabulary; perhaps the documents only contain stop words' in group_similar_strings #66

Open

raise StringLengthException if vectoriser is applied to strings that …

f9f316f

…are not all greater in length than ngram_size

gw00207 force-pushed the feature/raise_exception_if_strings_not_long branch from 77ed699 to f9f316f Compare September 1, 2021 13:36

remove regex pattern characters before checking length

cb56fbe

inherit from Error class

ff0b041

gw00207 added 2 commits September 3, 2021 16:09

raise StringLengthException instead of ValueError when vectorizer is fit

ce518bf

remove string length check method

eb78a01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

raise StringLengthException if vectoriser is applied to strings that … #67

raise StringLengthException if vectoriser is applied to strings that … #67

gw00207 commented Sep 1, 2021 •

edited

Loading

ParticularMiner commented Sep 1, 2021

gw00207 commented Sep 1, 2021

ParticularMiner commented Sep 1, 2021

gw00207 commented Sep 3, 2021

ParticularMiner commented Sep 3, 2021

Bergvca commented Sep 20, 2021

raise StringLengthException if vectoriser is applied to strings that … #67

Are you sure you want to change the base?

raise StringLengthException if vectoriser is applied to strings that … #67

Conversation

gw00207 commented Sep 1, 2021 • edited Loading

ParticularMiner commented Sep 1, 2021

gw00207 commented Sep 1, 2021

ParticularMiner commented Sep 1, 2021

gw00207 commented Sep 3, 2021

ParticularMiner commented Sep 3, 2021

Bergvca commented Sep 20, 2021

gw00207 commented Sep 1, 2021 •

edited

Loading