Text with numbers doesn't segment as expected #20

Open
sgokhales opened this issue Nov 30, 2018 · 3 comments

@sgokhales

Raising an issue that I faced while using this package.

Code to reproduce the issue:

import wordsegment as ws
ws.load()
text = "increased $55 million or 23.8% for"
ws.segment(text)

Actual Output:

['increased', '55millionor238', 'for']

Expected Output:

# If special symbols are permitted in the final output
['increased', '$55', 'million', 'or', '23.8%', 'for']

# If special symbols such as $ and % are not permitted in the final output
['increased', 'dollar', '55', 'million', 'or', '23.8', 'percent', 'for']

Tested on Python versions:

  • 3.6
  • 3.5
  • 2.7

wordsegment version:

  • 1.3.1

StackOverflow Question Link:

@grantjenks
Owner

Short answer: the design of wordsegment is based on a trillion-word corpus and unfortunately that corpus includes no numbers. But I think it's a reasonable feature to support. Pull request welcome.

Longer answer: the wordsegment module was originally intended for input like "thisisanexample". That's where it excels. If your input carries extra information, as in "this, is an: example", wordsegment does not use that punctuation, so it may be better to pre-process the input. A simple regular expression like re.finditer(r'[a-zA-Z0-9%$]+', text) can split the input into tokens first and so take advantage of the added information.
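A minimal sketch of that pre-processing idea, not a tested fix for the package. The helper name `segment_preserving_numbers` and its `segment` parameter are hypothetical, and adding `.` to the character class is an assumption made here so that "23.8%" stays in one piece (the pattern above would split it at the dot). Only purely alphabetic tokens are handed to the segmenter; anything containing digits or symbols is kept as-is.

```python
import re

# Letters, digits, '%', '$', plus '.' so decimals like "23.8%" are not split.
# The '.' is an assumption added on top of the pattern suggested above.
TOKEN_RE = re.compile(r'[a-zA-Z0-9%$.]+')


def segment_preserving_numbers(text, segment):
    """Tokenize `text`, then segment only the purely alphabetic tokens.

    `segment` is any callable that maps a string to a list of words,
    for example `wordsegment.segment` after `wordsegment.load()`.
    """
    result = []
    for match in TOKEN_RE.finditer(text):
        # Drop leading/trailing dots so a sentence-final "for." becomes "for".
        token = match.group().strip('.')
        if not token:
            continue
        if token.isalpha():
            # Plain words (possibly run together) go through the segmenter.
            result.extend(segment(token))
        else:
            # Tokens with digits or symbols are passed through untouched.
            result.append(token)
    return result
```

With wordsegment installed, this would be called as `segment_preserving_numbers(text, ws.segment)` after `ws.load()`. Passing the segmenter in as an argument keeps the sketch independent of the package and easy to try with a stand-in function.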

If you would like to contract me to fix the issue, then I am open to that as well.

@iwanggp

iwanggp commented Oct 7, 2019

How can I keep uppercase letters from being converted to lowercase? Thanks.

@grantjenks
Owner

Maintaining uppercase letters is not a supported feature.
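One possible workaround outside the package, as a sketch only: if the input contains nothing but letters and digits, the lengths of the returned pieces can be used to slice the original string and recover its casing. The helper name `segment_keep_case` is hypothetical, and the sketch assumes the segmenter returns lowercase pieces whose concatenation equals the lowercased input; it raises an error when that does not hold.

```python
def segment_keep_case(text, segment):
    """Segment `text`, then restore the original casing by slicing.

    `segment` is any callable that maps a string to a list of lowercase
    words, for example `wordsegment.segment` after `wordsegment.load()`.
    """
    pieces = segment(text)
    # Assumption: the pieces cover the lowercased input exactly. This fails
    # if the segmenter dropped characters such as spaces or punctuation.
    if ''.join(pieces) != text.lower():
        raise ValueError('segments do not line up with the input text')
    result = []
    start = 0
    for piece in pieces:
        end = start + len(piece)
        # Take the same span from the original, case-preserving string.
        result.append(text[start:end])
        start = end
    return result
```

For inputs with spaces or punctuation, the text would need to be split into alphanumeric tokens first and each token passed through this helper separately.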
