feature_request(mode): preserve all punctuation marks #31

Open
Kristinita opened this issue Jan 25, 2021 · 2 comments

Kristinita commented Jan 25, 2021

1. Summary

It would be nice if WordSegment, at least in CLI mode, had an option to preserve all punctuation marks: periods, commas, and so on.

2. Problem

  1. Scientific article example
  2. Scientific book example

Try copying and pasting text from this article and this book.

  1. The article:

    Sharing-economyfirmsdifferfromold-powerfirmsbecausetheformertypicallyareexponentialnew-powerorganisationscharac-terisedbyPorter’scompetitiveforces.Althoughsomenew-powerfirmsmaychoosenottoembraceastakeholderfocus,stakehold-ersandothernew-powerfirmswillpunishsuchchoices.Inotherwords,counterargumentstothesharingeconomy’sstakeholderpo-tentialbasedonthequestionableactionsofsomenew-powerfirmsareovershadowedbyothernew-powerfirmsandtheirstakehold-ers’actions.

  2. The book:

    Accordingto DavidAllen,authorof the bestsellerGettingThingsDone(2001),informationprofessionalshavea hardtimeaccomplishingtasksbecauseour workis inherentlyambiguous,we takeon too manycommit-ments,andwe cannotprioritizethe bestthingto do fromthe manychoicesbeforeus. J. WesleyCochran(1992),JudithSiess(2002),SamanthaHines(2010),andotherauthorsof timemanagementtreatisesfor librar-iansconcurthatlibrarieshavebeendifficultplacesto workfor years,especiallygivenour complexworkprocessesandoftenintangibleprod-ucts.Nevertheless,we havethe abilityas individualsto adoptbetterstrategiesto managethe everydaychaos.

Ideally, of course, the PDF would have a proper text layer, but I am not the one producing these articles and books. In my experience, a text layer without spaces like this is a common problem, and separating the words by hand is time-consuming routine work.

3. Behavior

3.1. Current

CLI usage:

sharing economy firms differ from old power firms because the former typically are exponential new power organisations characterised by porters competitive forces although some new power firms may choose not to embrace a stakeholder focus stakeholders and other new power firms will punish such choices in other words counterarguments to the sharing economy s stakeholder potential based on the questionable actions of some new power firms are overshadowed by other new power firms and their stakeholders actions

according to david allen author of the best seller getting things done 2001 information professionals have a hard time accomplishing tasks because our work is inherently ambiguous we take on too many commitments and we can not prioritize the best thing to do from the many choices before us j wesley cochran 1992judithsiess2002 samantha hines2010 and other authors of time management treatises for librarians concur that libraries have been difficult places to work for years especially given our complex work processes and often intangible products nevertheless we have the ability as individuals to adopt better strategies to manage the everyday chaos

Punctuation marks are stripped. Users have to do a lot of routine work to get them back.

3.2. Expected behavior

Ordinary English text:

Sharing economy firms differ from old power firms because the former typically are exponential new power organisations characterised by Porter’s competitive forces. Although some new power firms may choose not to embrace a stakeholder focus, stakeholders and other new power firms will punish such choices. In other words, counterarguments to the sharing economy’s stakeholder potential based on the questionable actions of some new power firms are overshadowed by other new power firms and their stakeholders’ actions.

According to David Allen, author of the bestseller Getting Things Done(2001), information professionals have a hard time accomplishing tasks because our work is inherently ambiguous, we take on too many commitments, and we can not prioritize the best thing to do from the many choices before us. J. Wesley Cochran(1992), Judith Siess(2002), Samantha Hines(2010) and other authors of time management treatises for librarians concur that libraries have been difficult places to work for years, especially given our complex work processes and often intangible products. Nevertheless, we have the ability as individuals to adopt better strategies to manage the everyday chaos.

Thanks.

@grantjenks
Owner

Use a regex to break the input into chunks separated by punctuation, then segment each chunk and rejoin the results with the original punctuation. Punctuation adds meaningful segmentation hints, so stripping it out reduces quality. Segmentation works best on smaller phrases anyway.
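
A minimal sketch of that suggestion, not code from the library: split on runs of non-alphanumeric characters, segment each chunk, and put the delimiters back unchanged. The names `DELIMITERS` and `segment_preserving_punctuation` are hypothetical, and the segmenter is passed in as `segment_fn`; with the library installed, calling `wordsegment.load()` and passing `wordsegment.segment` should work, while the demo uses a canned stand-in so that it runs on its own.

```python
import re

# Runs of characters that are neither ASCII letters nor digits are treated
# as delimiters (an assumption; adjust the class for your text).
DELIMITERS = re.compile(r"([^A-Za-z0-9]+)")


def segment_preserving_punctuation(text, segment_fn):
    """Segment each alphanumeric chunk and keep the delimiters verbatim."""
    pieces = []
    # The pattern has a capturing group, so re.split also returns the
    # delimiters: chunks sit at even indexes, delimiters at odd indexes.
    for index, part in enumerate(DELIMITERS.split(text)):
        if index % 2 == 1:
            pieces.append(part)  # punctuation or whitespace, unchanged
        elif part:
            pieces.append(" ".join(segment_fn(part)))
    return "".join(pieces)


# Canned stand-in for a real segmenter, for illustration only.
canned = {
    "Inotherwords": ["in", "other", "words"],
    "thesharingeconomy": ["the", "sharing", "economy"],
}
print(segment_preserving_punctuation(
    "Inotherwords,thesharingeconomy.", canned.__getitem__))
# in other words,the sharing economy.
```

Delimiters are kept exactly as found, so a missing space after a comma or period stays missing, and a word hyphenated across a line break (such as "charac-terised") is segmented as two separate chunks. Both would need an extra post-processing step.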

@grantjenks
Owner

The strategy also applies to capitalization.
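
One possible reading of this, sketched under assumptions: treat a lowercase-to-uppercase transition as an extra chunk boundary, segment each piece, then recover the original casing by slicing the original piece with the lengths of the returned words. This relies on the segmenter returning lowercased words whose concatenation equals the lowercased input, which is assumed here rather than guaranteed. `CASE_BOUNDARY`, `restore_case`, and `segment_preserving_case` are hypothetical names, and the segmenter is again passed in as `segment_fn`.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase
# letter, e.g. "DavidAllen" -> "David" | "Allen". Splitting on an empty
# match requires Python 3.7 or later.
CASE_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def restore_case(original, words):
    """Re-slice `original` using the lengths of the lowercased `words`."""
    if "".join(words) != original.lower():
        return list(words)  # lengths do not line up; leave output as-is
    restored, start = [], 0
    for word in words:
        restored.append(original[start:start + len(word)])
        start += len(word)
    return restored


def segment_preserving_case(chunk, segment_fn):
    """Segment an ASCII alphanumeric chunk, using capitals as boundaries."""
    words = []
    for piece in CASE_BOUNDARY.split(chunk):
        if piece:
            words.extend(restore_case(piece, segment_fn(piece)))
    return " ".join(words)


# Canned stand-in for a real segmenter, for illustration only.
canned = {
    "Accordingto": ["according", "to"],
    "David": ["david"],
    "Allen": ["allen"],
}
print(segment_preserving_case("AccordingtoDavidAllen", canned.__getitem__))
# According to David Allen
```

Runs of capitals such as acronyms are not split by this pattern, so a piece like "PDFfile" is handed to the segmenter whole.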
