Text with numbers doesn't segment as expected #20

sgokhales · 2018-11-30T04:42:15Z

Raising an issue that I faced while using this package.

Code for Reproducing the issue:

import wordsegment as ws
ws.load()
text = "increased $55 million or 23.8% for"
ws.segment(text)

Actual Output:

['increased', '55millionor238', 'for']

Expected Output:

// If special symbols are permitted in the final output
['increased', '$55', 'million', 'or', '23.8%', 'for']

// If special symbols such as $ and % are not permitted in the final output
['increased', 'dollar', '55', 'million', 'or', '23.8', 'percent', 'for']

Tested on Python versions:

3.6
3.5
2.7

wordsegment version:

1.3.1

StackOverflow Question Link:

https://stackoverflow.com/q/53549446/202375


grantjenks · 2018-12-27T21:27:06Z

Short answer: the design of wordsegment is based on a trillion-word corpus and unfortunately that corpus includes no numbers. But I think it's a reasonable feature to support. Pull request welcome.

Longer answer: the wordsegment module was originally intended for input like "thisisanexample". That's where it excels. If your input carries extra information, as in "this, is an: example", wordsegment does not use that punctuation, and it may be better to pre-process the input. A simple regular expression like re.finditer(r'[a-zA-Z0-9%$]+', text) can split the input into tokens first and so take advantage of the added information in the input.
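A minimal sketch of that pre-processing idea, not part of wordsegment itself: the helper names pretokenize and segment_with_numbers are hypothetical, and the choice to pass tokens containing digits or symbols through untouched is an assumption about what the reporter wants. The segmenter is taken as a parameter so the sketch does not depend on how wordsegment is installed or loaded.

```python
import re

# Token pattern taken from the comment above: runs of letters, digits, '%' and '$'.
TOKEN_RE = re.compile(r'[a-zA-Z0-9%$]+')


def pretokenize(text):
    """Split text into tokens with the suggested regular expression."""
    return [match.group() for match in TOKEN_RE.finditer(text)]


def segment_with_numbers(text, segment):
    """Segment purely alphabetic tokens and keep every other token as-is.

    `segment` is any callable mapping a string to a list of words
    (assumption: wordsegment.segment after wordsegment.load() fits here).
    """
    words = []
    for token in pretokenize(text):
        if token.isalpha():
            # Letters only: hand the token to the word segmenter.
            words.extend(segment(token))
        else:
            # Contains a digit, '$' or '%': keep it out of the segmenter.
            words.append(token)
    return words
```

With wordsegment this would be called as segment_with_numbers(text, ws.segment) after ws.load(). Note that the pattern as written has no '.' in its character class, so "23.8%" comes out as the two tokens "23" and "8%"; adding '.' to the class would keep decimals together, at the cost of also attaching sentence-ending periods to the preceding token.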

If you would like to contract me to fix the issue, then I am open to that as well.

iwanggp · 2019-10-07T10:24:20Z

How can I keep uppercase letters from being converted to lowercase? Thanks.

grantjenks · 2019-10-07T14:13:25Z

Maintaining uppercase letters is not a supported feature.

davidpaulmcintyre mentioned this issue Apr 26, 2019

allow substrings to be ignored if they have digits #23

Open
