Use Unicode Line Breaking Algorithm wrapping with po2md #153
Comments
For scriptio continua, I suspect the text simply needs easier handling that breaks at whatever point is convenient. It might still need some additional detection, though: if a line contains spaces or other characters such as numeric values, it could be sensible to break on that space in some cases. The CJK encoding thingy might be helpful for Chinese, Japanese, and Korean scriptio continua. |
This has been fixed in v0.3.60. |
Thank you! Although it solves the problem, I think this solution, which stops -w from taking effect, is suboptimal. I fiddled with the code over the last two days and came up with a similar solution: #154. It enables wrapping of long words when scriptio continua appears. I'm not sure what the side effects of turning it on are. I'm not good at coding, so my confidence that my solution works correctly is low. At least it works for the line which previously threw an error: |
Please, could you share a reproducible example? Where is
Should have been fixed in v0.3.61. |
While using #154, I noticed that --width is not well enforced for text whose characters have different actual widths (numbers, Chinese characters, periods, and more are affected by kerning). The "fixed-width" assumption isn't optimal for Markdown text in my opinion, since with different rendering methods and fonts some characters take more than one space. Below is an example: the code block uses a monospace fixed-width font and the normal Markdown rendering doesn't.
A reproducible example is the following input
input po
output md (note that it is a single line)
My modification results in the following output
Note that the first line above is more than 80 columns wide. It is significantly longer because Chinese characters are double width. |
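To illustrate the width mismatch described in this comment, here is a minimal sketch that compares len() with an approximate column count. It assumes that East Asian wide and fullwidth characters occupy two columns in a monospace rendering; the display_width helper is a hypothetical name for illustration and is not part of po2md or textwrap.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate monospace column count (hypothetical helper).

    Assumption: East Asian wide ("W") and fullwidth ("F") characters
    take two columns, everything else takes one.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks do not occupy a column of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


sample = "温度传感器 AD595"
# len() counts code points, while display_width() counts columns.
print(len(sample), display_width(sample))
```

A wrapper that measures lines with len() treats this sample as shorter than it appears on screen, which would be consistent with the overlong first line reported above.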
It's fixed on all of my markdown files too, thank you! |
I found another test case where -w isn't applied; this one is probably a little more universal. I tested it with some English characters as well, for example, Additionally, I don't think my
|
It seems that textwrap considers that string a long word because characters like
>>> import textwrap
>>> text = "支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。"
>>> textwrap.wrap(text, break_long_words=False)
['支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。']
>>> textwrap.wrap(text.replace('、', ' '), break_long_words=False)
['支持常见的温度传感器(例如,常见的热敏电阻 AD595 AD597 AD849x PT100 PT1000 MAX6675 MAX31855', 'MAX31856 MAX31865 BME280 HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。']
>>> textwrap.wrap(text, break_long_words=True)
['支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100、PT1000、MAX6675、MAX31855', '、MAX31856、MAX31865、BME280、HTU21D和LM75)。还可以配置自定义热敏电阻和自定义模拟温度传感器。']
The problem is simple, but I'm afraid a solution will not arrive soon. It seems that Python is missing a reliable Unicode Line Breaking Algorithm implementation, so higher-level developers opt for the ASCII-oriented |
Why would it be applied? It's a word, so it can't be split. Remember: a soft break |
Basically, for languages that use scriptio continua, words have no spaces between them, so a sentence looks like one extremely long word. I don't think there is an existing way to solve this problem; I guess I'll see if there is anything I could do using the image drawing thingy. |
You're referring mainly to Chinese and Japanese, right? The Unicode Line Breaking Algorithm can handle those cases; see section 6, Line Breaking Algorithm, which defines non-mandatory line break opportunities. |
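The idea of non-mandatory break opportunities can be sketched roughly as follows. This is a heavily simplified illustration and not an implementation of the Unicode Line Breaking Algorithm: it assumes a break is allowed after a space or next to any East Asian wide character, and it uses two tiny hand-picked punctuation sets. All names here (wrap_cjk, break_opportunities, NO_LINE_START, NO_LINE_END) are hypothetical and do not come from po2md or py-unicode-linebreak.

```python
import unicodedata

# Hand-picked subsets for illustration only; the real algorithm uses
# per-character line breaking classes instead of lists like these.
NO_LINE_START = set("、。,)!?:;」』")  # should not begin a line
NO_LINE_END = set("(「『")  # should not end a line


def is_wide(ch):
    return unicodedata.east_asian_width(ch) in ("W", "F")


def display_width(text):
    # Assumption: wide and fullwidth characters take two columns.
    return sum(2 if is_wide(ch) else 1 for ch in text)


def break_opportunities(text):
    """Yield each index i where a line may end just before text[i]."""
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur in NO_LINE_START or prev in NO_LINE_END:
            continue
        if prev == " " or is_wide(prev) or is_wide(cur):
            yield i


def wrap_cjk(text, width=80):
    """Greedily fill lines with unbreakable chunks, measured in columns."""
    cuts = [0, *break_opportunities(text), len(text)]
    chunks = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    lines, current = [], ""
    for chunk in chunks:
        if current and display_width((current + chunk).rstrip()) > width:
            lines.append(current.rstrip())
            current = chunk.lstrip()
        else:
            current += chunk
    if current.strip():
        lines.append(current.rstrip())
    return lines


text = "支持常见的温度传感器(例如,常见的热敏电阻、AD595、AD597、AD849x、PT100)。还可以配置自定义热敏电阻和自定义模拟温度传感器。"
for line in wrap_cjk(text, width=40):
    print(line)
```

A chunk that has no break opportunity and is wider than the limit (a long URL, for example) is still emitted on its own overlong line, which mirrors break_long_words=False in textwrap.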
Thank you! I'll look into it! |
This problem can be solved with https://github.com/mondeja/py-unicode-linebreak, I'll try to do it ASAP. |
Hi @dingyifei, the problem seems to be solved in v1.1.1. I used your example to implement it, as I'm not used to reading these languages, so I don't really know whether it is fixed correctly. If you find problems, just let me know. |
Thank you for the update! There may be some room for improvement:
sections of test1.md:
You can see that it does not apply any wrapping for several rows. A better way to do it might be approximately:
A possible test case:
Output:
Where the |
In some parts, it seems to do a better job:
I think "starts with an English word" triggers it; an example of the bug would be
|
I realigned the example a bit; note that the commas and periods |
Thanks for the detailed report, @dingyifei. This problem is fixed in v1.1.2. It is probably not perfect; if you find more inconsistencies, please share reproducible examples. |
Two issues are described below:
An error is thrown when the following Gettext section is processed.
The section of Gettext:
Error
I tried to fiddle with the code a bit, and it seems that setting break_long_words=True under if self._inside_liblock or self._inside_quoteblock: can solve this error (although not perfectly). Since I also found a second issue related to text wrapping, I'm wondering whether they could be fixed together.
textwrap attempts to wrap links when len(text) > width for a line. Neither the GitHub nor the VSC Markdown engine could render links with a line break in the middle. Setting break_on_hyphens=False resolves this issue, but it definitely causes textwrap to produce less optimal wraps.
I'm not sure how to fix these two problems, since these fixes can bring drawbacks. Maybe adding additional parameters for TextWrapper is an intermediate solution, but wrapping according to the string length including the link length might still cause the index out-of-range error when a very long link without - is in a string.
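The trade-off between the two textwrap options mentioned in this report can be seen in a small standalone sketch. It uses only the standard library textwrap module; the sample sentence and the example.com URL are made up for illustration, and this is not how po2md configures its wrapper internally.

```python
import textwrap

# Hypothetical Markdown line with a long link whose URL contains hyphens.
url = "https://example.com/docs/getting-started/install-from-source"
text = f"See the [installation guide]({url}) before running the command."

# Default settings: long words are broken and hyphens are break points,
# so the URL ends up spread over several lines.
default_lines = textwrap.wrap(text, width=40)

# Both options disabled: the URL stays on one line, at the cost of that
# line exceeding the requested width.
wrapper = textwrap.TextWrapper(
    width=40,
    break_long_words=False,
    break_on_hyphens=False,
)
intact_lines = wrapper.wrap(text)

for line in intact_lines:
    print(len(line), line)
```

With both options disabled the link survives, but the width limit becomes a soft one, which is the "less optimal wraps" drawback described above.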