-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicitly encode output as UTF-8 #237
base: master
Are you sure you want to change the base?
Conversation
These will only change the debug output. Python3 should already do this but won't break with it. LGTM. @zuphilip ? |
|
4eddbff
to
47d86c4
Compare
Thanks @QuLogic, I don't write a lot of Python, so hadn't spotted the mixing of str + bytes. I just updated the pull request, with a version that ensures nothing but str is in the print() arguments. And to force UTF-8, I switched to using codecs.getwriter('utf8') for stdout and stderr for all cases where they could output it. I also improved the commit message, so it's clearer what the problem is and why it's worth fixing. |
I think I've covered all the cases we need to, but there could be more. This is needed as if a locale isn't set to UTF-8, an error will result of this form: UnicodeEncodeError: 'ascii' codec can't encode character u'\u0113' in position 60: ordinal not in range(128) While the ideal solution is for the user to set their locale to UTF-8, it is better that we print debug output which may not be displayed correctly than that we output a fatal (and non-obvious) error, potentially some time into processing. This also fixes some cases of implicitly combining str and *obj together when printing debug output, which fails with some Python versions, by explicitly using str.join(obj).
Just as an additional justification / use-case for this being necessary, I'm currently using Ocropus on a system (that I don't manage) which uses slurm, and no matter what I try I cannot persuade the slurm job manager on this system to not strip locale information. And again, to reiterate, UTF-8 is always the correct thing to output here, regardless of locale, and no fatal errors should be emitted just because of a weird locale. |
Does setting |
@QuLogic, yes, that should do the job as well, I had not found that environment variable before, thanks for the suggestion. However, I think ocropy should automatically and always output UTF-8, as that's what it's dealing with for input and output, and to do otherwise risks unnecessary runtime crashes. Given the diverse environments Ocropus can run in I think it's best to just enforce this in the program - after all, we have got bug reports from users (#10, #197), and frankly it took me ages to figure out where the issue was too, with the weird HFS system I need to use. |
I accidentally missed these from the original commit (c4ae4b).
As laid out in #197 there are a few options to achieve this. Revisiting the UTF8/debug output issues, I'm not keen on explicitly adding boilerplate to every file with print statements. Changing stdout/stderr encoding in a single place (like common.py) probably achieves the same goal but is error-prone. I would prefer to replace print/print_info/print_error with a dedicated logging mechanism, such as python's bundled |
I think I've covered all the cases we need to, but it's possible there could be more.
This should address issue #197 and #10.