
Suggestion: Refine Word Matching Regex #91

Closed
davidlday opened this issue Feb 11, 2018 · 10 comments · Fixed by #95

Comments

@davidlday
Collaborator

davidlday commented Feb 11, 2018

The current word-matching regex is a pretty liberal expression:

/\S+/g

I propose a more constrained expression that matches only word characters and apostrophes:

/[\w’']+/g

I believe this more closely represents what an editor / publisher means by word count. It may also resolve #88 (excluding spaces from the character count) without adding a config item, as well as #55 (counting Markdown syntax as words). And maybe this is what was meant by #2 as well?

I'd be happy to submit a PR, but I wanted to run this by you first.
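To illustrate the difference (sample text and counts are mine, not from the thread):

```javascript
// Compare the current liberal pattern with the proposed one.
const liberal = /\S+/g;
const proposed = /[\w’']+/g;

const sample = "**bold** some-thing don't stop";

// /\S+/g splits purely on whitespace, so Markdown markers and
// punctuation are swept into the "words".
console.log(sample.match(liberal));
// → ["**bold**", "some-thing", "don't", "stop"]

// The proposed pattern keeps only word characters and apostrophes,
// so "don't" stays one word but "some-thing" becomes two.
console.log(sample.match(proposed));
// → ["bold", "some", "thing", "don't", "stop"]
```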

@OleMchls
Owner

OleMchls commented Feb 19, 2018

I like the idea, and yes, this is what was meant with #2. Could you give me an example of how the apostrophes make this regex more accurate than simply /\w+/g?

@davidlday
Collaborator Author

@OleMchls - Cool, and sure! The apostrophes account for contractions (at least in English). The \w expression expands to [A-Za-z0-9_], which will still count words like don't as two words. In some academic cases, contractions count as two words. IIRC, a couple of NLP tokenizers I've worked with in the past behave this way, but I believe the intent here is more along the lines of how word processors behave. Contractions count as one word, not two. What do you think?
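A quick example of the two behaviors (sample strings mine):

```javascript
// /\w+/g breaks on the apostrophe, so a contraction counts as two words.
console.log("don't stop".match(/\w+/g));      // → ["don", "t", "stop"]

// Adding straight and curly apostrophes to the class keeps it whole.
console.log("don't stop".match(/[\w’']+/g));  // → ["don't", "stop"]
```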

@keelanfh

keelanfh commented Mar 23, 2018

If \w does just expand to [A-Za-z0-9_], then surely this would cause problems for accented characters, etc.

For instance, the French word à would not be counted at all, and fête would be counted as two words.

I'm not sure that's what \w actually expands to; it would be good to test. I just tried it out in Atom's find-and-replace interface, and it suggests that is what it does.
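A quick check confirms this: in JavaScript, \w matches only [A-Za-z0-9_], so the proposed pattern mishandles accented letters (examples mine):

```javascript
const proposed = /[\w’']+/g;

// "fête" splits at the accented letter and is counted as two words.
console.log("fête".match(proposed));  // → ["f", "te"]

// "à" contains no ASCII word character at all, so it is never counted.
console.log("à".match(proposed));     // → null
```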

@davidlday
Collaborator Author

@keelanfh - Excellent point. I found a couple of references to patterns that might work better:

Maybe this will work:

/[\w’'\u00C0-\u017F]+/g

I haven't tested yet, but will try to do so this weekend.

Thoughts?
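A quick test of the extended range (examples mine; \u00C0-\u017F covers the Latin-1 Supplement letters and Latin Extended-A):

```javascript
const extended = /[\w’'\u00C0-\u017F]+/g;

// Accented Latin words now survive intact.
console.log("fête à la maison".match(extended));
// → ["fête", "à", "la", "maison"]

// Caveat: the range also contains × (U+00D7) and ÷ (U+00F7),
// so an expression like "2×3" counts as a single "word".
console.log("2×3".match(extended));  // → ["2×3"]
```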

@keelanfh

Then that would run into the problem of other non-Latin scripts (e.g. Arabic) not being counted.

I'm just not sure exactly what problem we're trying to solve. If the issue is with Markdown syntax, maybe it should just be fixed with something like

/[^\s#]+/g

?
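For what it's worth, that pattern does handle heading markers but has its own edge cases (examples mine):

```javascript
const re = /[^\s#]+/g;

// Heading markers disappear, as intended.
console.log("# Heading text".match(re));   // → ["Heading", "text"]

// But any legitimate "#" is stripped too.
console.log("C# and #hashtag".match(re));  // → ["C", "and", "hashtag"]
```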

@davidlday
Collaborator Author

The problems are listed above: #88 and #55. The goal is to increase accuracy without adding any new settings.

If we can assume Markdown only, then we could use something like the Remove Markdown package and use the original regex to count what's left. I kind of like that idea, but I don't think the intent was to be Markdown-specific either.

What counts as a word isn't always so simple, apparently. How accurate does this need to be? Maybe it's accurate enough as is and if a user needs something more accurate then they'll have to do something else.
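A rough sketch of the strip-then-count idea, using a few hand-rolled substitutions in place of a real package like Remove Markdown (the substitutions here are illustrative, not exhaustive):

```javascript
// Strip a handful of common Markdown constructs, then count what's left
// with the original liberal /\S+/g pattern.
function countWords(markdown) {
  const plain = markdown
    .replace(/^#{1,6}\s+/gm, "")              // ATX heading markers
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // links → link text
    .replace(/[*_~`]+/g, "");                 // emphasis / code markers
  const words = plain.match(/\S+/g);
  return words ? words.length : 0;
}

console.log(countWords("# Title\n\nSome *bold* [link](https://example.com) text."));
// → 5  ("Title", "Some", "bold", "link", "text.")
```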

@keelanfh

Yeah, I think it’s quite complicated. In order to make it more accurate we’d need to decide what we actually want the word count to look like.

@davidlday
Collaborator Author

Maybe a better approach would be to implement filters for the various file types (markdown, html, etc.) and then run the results through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count

This would relieve this package from the responsibility of defining what a word is and developing tests to validate. Thoughts?

@OleMchls
Owner

OleMchls commented Mar 26, 2018

@davidlday @keelanfh first of all, thanks for your involvement <3 And a special apology to @davidlday for forgetting to follow up on your February comment.

Maybe a better approach would be to implement filters for the various file types (markdown, html, etc.) and then run the results through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count

This would relieve this package from the responsibility of defining what a word is and developing tests to validate. Thoughts?

Do you have a specific one you would recommend? I was scanning the list, but none of them really stood out to me. But generally, I do like the idea, given how complex the realm of word counting actually is.

Maybe in the meantime we go with a more refined regex, as you suggested.

For #55 there is another idea discussed in #65 which I also like: having different count functions per language extension.

@davidlday
Collaborator Author

@OleMchls no worries! wordcount caught my eye because it supports English, CJK, and Cyrillic. Digging down through its dependencies to word-regex, the pattern it uses is:

/[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g

So maybe the place to start is by leveraging word-regex?
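A quick sanity check of that pattern (sample strings mine). Note that a CJK run comes back as a single match at the regex level, so any per-character CJK counting would have to happen in a later step, and with no apostrophe in any alternative, contractions still split:

```javascript
const wordRegex = /[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g;

// Latin, Cyrillic, and CJK text all produce matches.
console.log("hello привет 你好".match(wordRegex));
// → ["hello", "привет", "你好"]

// Contractions split, as discussed earlier in this thread.
console.log("don't".match(wordRegex));  // → ["don", "t"]
```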
