
Suggestion: Refine Word Matching Regex #91

Closed
davidlday opened this issue Feb 11, 2018 · 10 comments · Fixed by #95

Comments

@davidlday
Collaborator

davidlday commented Feb 11, 2018

The current word-matching regex is a pretty liberal expression:

/\S+/g

I propose a more constrained expression that matches only word characters and apostrophes:

/[\w’']+/g

I believe this more closely represents what an editor / publisher means by word count. It may also resolve #88 (excluding spaces from the character count) without adding a config item, as well as #55 (counting Markdown syntax as words). And maybe this is what was meant by #2 as well?

I'd be happy to submit a PR, but I wanted to run this by you first.
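To illustrate the difference (sample text and counts are mine, not from the thread):

```javascript
// Compare the current liberal pattern with the proposed one.
const liberal = /\S+/g;
const proposed = /[\w’']+/g;

const sample = "**bold** some-thing don't stop";

// /\S+/g splits purely on whitespace, so Markdown markers and
// punctuation are swept into the "words".
console.log(sample.match(liberal));
// → ["**bold**", "some-thing", "don't", "stop"]

// The proposed pattern keeps only word characters and apostrophes,
// so "don't" stays one word but "some-thing" becomes two.
console.log(sample.match(proposed));
// → ["bold", "some", "thing", "don't", "stop"]
```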

@OleMchls
Owner

OleMchls commented Feb 19, 2018

I like the idea, and yes, this is what was meant with #2. Could you give me an example of how the apostrophes make this regex more accurate than simply /\w+/g?

@davidlday
Collaborator Author

@OleMchls - Cool, and sure! The apostrophes account for contractions (at least in English). The \w expression expands to [A-Za-z0-9_], which will still count words like don't as two words. In some academic cases, contractions count as two words. IIRC, a couple of NLP tokenizers I've worked with in the past behave this way, but I believe the intent here is more along the lines of how word processors behave. Contractions count as one word, not two. What do you think?
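A quick example of the two behaviors (sample strings mine):

```javascript
// /\w+/g breaks on the apostrophe, so a contraction counts as two words.
console.log("don't stop".match(/\w+/g));      // → ["don", "t", "stop"]

// Adding straight and curly apostrophes to the class keeps it whole.
console.log("don't stop".match(/[\w’']+/g));  // → ["don't", "stop"]
```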

@keelanfh

keelanfh commented Mar 23, 2018

If \w does just expand to [A-Za-z0-9_], then surely this would cause problems for accented characters, etc.

For instance, the French word à would not be counted at all, and fête would be counted as two words.

I'm not sure that's what \w actually expands to; it would be good to test. I just tried it out in Atom's find-and-replace interface, and it suggests that is what it does.
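A quick check confirms this: in JavaScript, \w matches only [A-Za-z0-9_], so the proposed pattern mishandles accented letters (examples mine):

```javascript
const proposed = /[\w’']+/g;

// "fête" splits at the accented letter and is counted as two words.
console.log("fête".match(proposed));  // → ["f", "te"]

// "à" contains no ASCII word character at all, so it is never counted.
console.log("à".match(proposed));     // → null
```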

@davidlday
Collaborator Author

@keelanfh - Excellent point. I found a couple of references to patterns that might work better:

Maybe this will work:

/[\w’'\u00C0-\u017F]+/g

I haven't tested yet, but will try to do so this weekend.

Thoughts?
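A quick test of the extended range (examples mine; \u00C0-\u017F covers the Latin-1 Supplement letters and Latin Extended-A):

```javascript
const extended = /[\w’'\u00C0-\u017F]+/g;

// Accented Latin words now survive intact.
console.log("fête à la maison".match(extended));
// → ["fête", "à", "la", "maison"]

// Caveat: the range also contains × (U+00D7) and ÷ (U+00F7),
// so an expression like "2×3" counts as a single "word".
console.log("2×3".match(extended));  // → ["2×3"]
```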

@keelanfh

Then that would run into the problem of other non-Latin scripts (e.g. Arabic) not being counted.

I'm just not sure exactly what problem we're trying to solve. If the issue is with Markdown syntax, maybe it should just be fixed with something like

/[^\s#]+/g

?
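For what it's worth, that pattern does handle heading markers but has its own edge cases (examples mine):

```javascript
const re = /[^\s#]+/g;

// Heading markers disappear, as intended.
console.log("# Heading text".match(re));   // → ["Heading", "text"]

// But any legitimate "#" is stripped too.
console.log("C# and #hashtag".match(re));  // → ["C", "and", "hashtag"]
```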

@davidlday
Collaborator Author

The problems are listed above: #88 and #55. The goal is to increase accuracy without adding any new settings.

If we can assume Markdown only, then we could use something like the Remove Markdown package and use the original regex to count what's left. I kind of like that idea, but I don't think the intent was to be Markdown-specific either.

What counts as a word isn't always so simple, apparently. How accurate does this need to be? Maybe it's accurate enough as is and if a user needs something more accurate then they'll have to do something else.
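A rough sketch of the strip-then-count idea, using a few hand-rolled substitutions in place of a real package like Remove Markdown (the substitutions here are illustrative, not exhaustive):

```javascript
// Strip a handful of common Markdown constructs, then count what's left
// with the original liberal /\S+/g pattern.
function countWords(markdown) {
  const plain = markdown
    .replace(/^#{1,6}\s+/gm, "")              // ATX heading markers
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // links → link text
    .replace(/[*_~`]+/g, "");                 // emphasis / code markers
  const words = plain.match(/\S+/g);
  return words ? words.length : 0;
}

console.log(countWords("# Title\n\nSome *bold* [link](https://example.com) text."));
// → 5  ("Title", "Some", "bold", "link", "text.")
```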

@keelanfh

Yeah, I think it’s quite complicated. In order to make it more accurate we’d need to decide what we actually want the word count to look like.

@davidlday
Collaborator Author

Maybe a better approach would be to implement filters for the various file types (markdown, html, etc.) and then run the results through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count

This would relieve this package from the responsibility of defining what a word is and developing tests to validate. Thoughts?

@OleMchls
Owner

OleMchls commented Mar 26, 2018

@davidlday @keelanfh first of all, thanks for your involvement <3 And a special apology to @davidlday for forgetting to follow up on your February comment.

Maybe a better approach would be to implement filters for the various file types (markdown, html, etc.) and then run the results through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count

This would relieve this package from the responsibility of defining what a word is and developing tests to validate. Thoughts?

Do you have a specific one you would recommend? I was scanning the list, but none of them really stood out to me. But generally, I do like the idea, given how complex the realm of word counting actually is.

Maybe in the meantime we go with a more refined regex, as you suggested.

For #55 there is another idea discussed in #65 which I also like: having different count functions per language extension.

@davidlday
Collaborator Author

@OleMchls no worries! wordcount caught my eye because it supports English, CJK, and Cyrillic. Digging down through its dependencies to word-regex, the pattern it uses is:

/[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g

So maybe the place to start is by leveraging word-regex?
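A quick sanity check of that pattern (sample strings mine). Note that a CJK run comes back as a single match at the regex level, so any per-character CJK counting would have to happen in a later step, and with no apostrophe in any alternative, contractions still split:

```javascript
const wordRegex = /[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g;

// Latin, Cyrillic, and CJK text all produce matches.
console.log("hello привет 你好".match(wordRegex));
// → ["hello", "привет", "你好"]

// Contractions split, as discussed earlier in this thread.
console.log("don't".match(wordRegex));  // → ["don", "t"]
```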
