Suggestion: Refine Word Matching Regex #91
Comments
I like the idea, and yes, this is what was meant with #2. Could you give me an example how the apostrophes make this regex more accurate than simply |
@OleMchls - Cool, and sure! The apostrophes account for contractions (at least in English). The |
If For instance, the French word Not sure if that's what |
@keelanfh - Excellent point. I found a couple of references to patterns that might work better:
Maybe this will work:
I haven't tested yet, but will try to do so this weekend. Thoughts?
Then that would run into the problem of other non-Latin scripts (e.g. Arabic) not being counted. I'm just not sure exactly what the problem is that we're trying to solve. If there's an issue with Markdown syntax maybe it should just be fixed with something like
? |
The problems are listed above: #88 and #55. The goal is to increase accuracy without adding any new settings. If we can assume Markdown only, then we could use something like the Remove Markdown package and use the original regex to count what's left. I kind of like that idea, but I don't think the intent was to be Markdown-specific either. What counts as a word isn't always so simple, apparently. How accurate does this need to be? Maybe it's accurate enough as is and if a user needs something more accurate then they'll have to do something else.
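To make the "strip the Markdown, then count what's left" idea concrete, here is a minimal sketch. It does not use the Remove Markdown package's API. `stripMarkdown` is a hypothetical, deliberately naive stand-in, and since the original regex is not reproduced in this thread, `countPlainWords` simply assumes a word is any run of non-whitespace characters.

```typescript
// Hypothetical, naive Markdown stripper; a stand-in for a real library.
function stripMarkdown(source: string): string {
  return source
    // Links and images: keep only the label text.
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1")
    // Leading heading, blockquote, and list markers.
    .replace(/^[ \t]{0,3}(?:#{1,6}|>|[-*+]|\d+\.)[ \t]+/gm, "")
    // Emphasis, strikethrough, and inline-code markers.
    .replace(/[*_~`]/g, "");
}

// Assumption: a "word" is any run of non-whitespace characters.
function countPlainWords(text: string): number {
  const matches = text.match(/\S+/g);
  return matches === null ? 0 : matches.length;
}

function countMarkdownWords(source: string): number {
  return countPlainWords(stripMarkdown(source));
}
```

With this sketch, bare syntax tokens such as a lone `#` or `-` no longer count as words, and a link contributes only its label. It is still Markdown-specific, which is the limitation raised above.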
Yeah, I think it’s quite complicated. In order to make it more accurate we’d need to decide what we actually want the word count to look like.
Maybe a better approach would be to implement filters for the various types of files (Markdown, HTML, etc.) and then run the filtered text through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count This would relieve this package of the responsibility of defining what a word is and of developing tests to validate that definition. Thoughts?
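A rough sketch of that split, assuming filters are chosen by file extension and the counter is passed in from outside. Every name here (`filterFor`, `countFileWords`, `whitespaceCounter`) is hypothetical, the filters are deliberately crude, and the whitespace counter only stands in for whichever npm package would actually be adopted.

```typescript
type TextFilter = (source: string) => string;
type WordCounter = (text: string) => number;

// Hypothetical, deliberately crude filters chosen by file extension.
function filterFor(extension: string): TextFilter {
  switch (extension.toLowerCase()) {
    case "md":
    case "markdown":
      // Blank out common Markdown punctuation.
      return (s) => s.replace(/[#*_`>]/g, " ");
    case "html":
    case "htm":
      // Blank out anything that looks like a tag.
      return (s) => s.replace(/<[^>]*>/g, " ");
    default:
      // Unknown file types pass through unchanged.
      return (s) => s;
  }
}

// The counter is injected so an external package could be swapped in.
function countFileWords(extension: string, source: string, counter: WordCounter): number {
  return counter(filterFor(extension)(source));
}

// Stand-in counter: splits on runs of whitespace.
const whitespaceCounter: WordCounter = (text) => {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
};
```

Keeping the counter as a parameter means this package would own only the per-format filters, while the definition of a word lives in the dependency.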
@davidlday @keelanfh first of all, thanks for your involvement <3 And a special sorry to @davidlday for forgetting to follow up on your Feb. comment.
Do you have a specific one you would recommend? I was scanning the list, but none of them really stood out to me. But generally, I do like the idea, given how complex the realm of word counting actually is. Maybe in the meantime go with a more refined regex as you suggested. For #55 there is another idea discussed in #65 which I also like: having different count functions per language extension.
@OleMchls no worries! wordcount caught my eye because it supports English, CJK, and Cyrillic. Digging down through its dependencies to word-regex, the pattern it uses is:
So maybe the place to start is by leveraging word-regex?
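For comparison, a script-agnostic pattern can be sketched with Unicode property escapes. This is not the word-regex pattern, which is not reproduced in this thread. It is an assumption-laden illustration that treats a word as a run of letters, digits, and combining marks, optionally joined by apostrophes. `UNICODE_WORD` and `countUnicodeWords` are made-up names.

```typescript
// Hypothetical pattern: letters (\p{L}), digits (\p{N}), and combining
// marks (\p{M}) from any script, with optional inner apostrophes.
// Built with the RegExp constructor; the "u" flag enables \p{...}.
const UNICODE_WORD = new RegExp(
  "[\\p{L}\\p{N}\\p{M}]+(?:['’][\\p{L}\\p{N}\\p{M}]+)*",
  "gu"
);

function countUnicodeWords(text: string): number {
  const matches = text.match(UNICODE_WORD);
  return matches === null ? 0 : matches.length;
}
```

A pattern like this keeps accented Latin, Cyrillic, and Arabic words whole, but it counts an unbroken run of CJK characters as a single word, so scripts written without spaces would still need separate handling.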
The current regex uses a pretty liberal expression:
I propose a more constrained expression that only matches on word characters + apostrophes:
I believe this more closely represents what an editor / publisher means by word count and may also resolve #88 on excluding spaces from character count without adding a config item, as well as #55 on counting markdown syntax as words. And maybe this is what was meant by #2 as well?
I'd be happy to submit a PR, but wanted to run this by you first.
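As an illustration of the "word characters + apostrophes" idea, here is a minimal sketch. The exact pattern proposed in this issue is not reproduced in the thread, so `WORD_WITH_APOSTROPHES` is a hypothetical pattern written for this example, and `countApostropheWords` is a made-up helper name.

```typescript
// Hypothetical pattern, not the one proposed in this issue: runs of word
// characters, optionally joined by straight or curly apostrophes.
const WORD_WITH_APOSTROPHES = /\w+(?:['’]\w+)*/g;

function countApostropheWords(text: string): number {
  const matches = text.match(WORD_WITH_APOSTROPHES);
  return matches === null ? 0 : matches.length;
}
```

Contractions such as "Don't" count as one word, and bare Markdown tokens such as `##` or `-` are not counted at all. Note that `\w` in JavaScript covers only ASCII letters, digits, and the underscore, so a pattern of this shape breaks words at accented or non-Latin letters.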