not save as utf8, UnicodeDecodeError('utf-8', #9

Open
Yensan opened this issue Jan 31, 2018 · 10 comments

@Yensan

Yensan commented Jan 31, 2018

If a notebook contains non-ASCII characters, the output is not saved as UTF-8.
For example, with Chinese text in a *.ipynb file, the result is actually saved as GBK, which causes UnicodeDecodeError('utf-8',

@jbn
Owner

jbn commented Feb 7, 2018

Hello, @Yensan. Can you post a gist to an example notebook along with the command you used to reproduce it?

Thanks!

@Yensan
Author

Yensan commented Feb 8, 2018

@jbn
I ran it just as the README says, nbmerge file_1.ipynb file_2.ipynb file_3.ipynb > merged.ipynb, but the files I edited contain some non-ASCII characters.
It is very simple to reproduce: create a new *.ipynb; paste in some Chinese text; then run nbmerge file_1.ipynb file_2.ipynb file_3.ipynb > merged.ipynb
If I use VS Code (the editor) to reset the encoding of the merged file, everything is OK.
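For readers without VS Code, the same re-encoding step can be sketched in Python. This is only an illustration: the helper name `reencode_to_utf8` is hypothetical, and it assumes the redirected file really is GBK (cp936), as reported above. A different shell could produce a different encoding.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(path, source_encoding="gbk"):
    """Rewrite the file at `path` as UTF-8, assuming it is currently `source_encoding`."""
    raw = Path(path).read_bytes()
    # May raise UnicodeDecodeError if the guessed source encoding is wrong.
    text = raw.decode(source_encoding)
    Path(path).write_bytes(text.encode("utf-8"))
    return text


# Demo on a throwaway file standing in for a GBK-encoded merged.ipynb.
with tempfile.TemporaryDirectory() as tmp:
    merged = Path(tmp) / "merged.ipynb"
    merged.write_bytes('{"source": "中文"}'.encode("gbk"))
    recovered = reencode_to_utf8(merged)
    roundtrip = merged.read_bytes().decode("utf-8")
```

Working on bytes rather than text-mode files keeps the original line endings untouched.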

@jbn
Owner

jbn commented Mar 6, 2018

Sorry for the delay, @Yensan!

I was unable to replicate this. Are you on Windows? I think the default encoding for the command line on Windows is not Unicode, so piping the output is going to cause a problem. Try running

nbmerge file_1.ipynb file_2.ipynb file_3.ipynb -o _merged.ipynb

instead to skip piping. If not, let me know and I'll go back to debugging.
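A minimal sketch of the suspected mechanism, assuming the redirected output ends up encoded as cp936 (the Windows code page for GBK) while the notebook is later read back as UTF-8. The sample string is made up for illustration.

```python
# The same text produces different bytes under the two encodings.
text = '{"source": "中文"}'
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("cp936")

# Reading the cp936 bytes back as UTF-8 fails, because the GBK byte
# pairs are not valid UTF-8 sequences.
error = None
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(type(error).__name__, error.encoding)
```

Writing the file directly with -o avoids depending on whatever encoding the console or pipe uses.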

@Yensan
Author

Yensan commented Mar 17, 2018

@jbn
No need to apologize. Thank you for this tool and for the reply.
Yes, you are right: I was using a company computer running Win7.
I use macOS myself, and I resigned one week ago, so it will take me a while to replicate this.

@jbn
Owner

jbn commented Apr 18, 2018

Hi @Yensan.

I read up a bit on the problem and would like to fix it. Any chance I could get you to run this script:

https://gist.github.com/jbn/6b87f180cff5dae4b6554ef58ba26c6f

in the directory with your notebooks, replacing "./YOUR_NOTEBOOK_FILE.ipynb" with your notebook name. If you copy and paste the output, it should be a relatively easy fix.

Thanks if you can :)

@Yensan
Author

Yensan commented Apr 19, 2018

(⊙o⊙) Oh! Sorry, I can't open https://gist.github.com/ on my network, because of the 'Great Firewall' issue 😄
You can just paste it here. I am at a new company now, so this is not the same environment, but I will use Chinese or other non-ASCII text to test it.
Recently I ran into some ctypes trouble; if you know how to solve it, please post your answer.
https://stackoverflow.com/questions/49913956/ctypes-use-pointer-and-cfunctype

@jbn
Owner

jbn commented Apr 20, 2018

import sys, locale


exprs = """
locale.getpreferredencoding()
type(fp)
fp.encoding
sys.stdout.isatty()
sys.stdout.encoding
sys.stdin.isatty()
sys.stdin.encoding
sys.stderr.isatty()
sys.stderr.encoding
sys.getdefaultencoding()
sys.getfilesystemencoding()
"""

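# Opened without an explicit encoding= so that fp.encoding reports the platform default.
with open("./YOUR_NOTEBOOK_FILE.ipynb", "r") as fp:
    # Each line of exprs is one expression; print it next to its value.
    for expr in exprs.strip().split():
        print(expr.rjust(30), eval(expr))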

Can't help with the ctypes issue. Never really use that code.

@Yensan
Author

Yensan commented Jan 10, 2019

I am so sorry to reply so late; my career has been rather tortuous. (I would be grateful for any remote job.)

This .ipynb file was edited on both Windows and Mac. I then ran your script on Windows 10 Pro (Simplified Chinese). The Win10 machine is a virtual machine, but that does not matter; the result is the same.

Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32

Windows:

C:\Users\aC>systeminfo
Host Name:                C53
OS Name:                  Microsoft Windows 10 Pro
OS Version:               10.0.17763 N/A Build 17763
OS Manufacturer:          Microsoft Corporation
OS Configuration:         Standalone Workstation
OS Build Type:            Multiprocessor Free
Original Install Date:    2019/1/6, 14:03:29
System Boot Time:         2019/1/11, 0:28:07
System Type:              x64-based PC
Processor(s):             1 Processor(s) Installed.
                          [01]: Intel64 Family 6 Model 61 Stepping 4 GenuineIntel ~1600 Mhz
BIOS Version:             Parallels Software International Inc. 14.0.1 (45154), 2018/9/7
System Locale:            zh-cn;Chinese (China)
Input Locale:             en-us;English (United States)

Your script output:

 locale.getpreferredencoding() cp936
                      type(fp) <class '_io.TextIOWrapper'>
                   fp.encoding cp936
           sys.stdout.isatty() True
           sys.stdout.encoding cp936
            sys.stdin.isatty() True
            sys.stdin.encoding cp936
           sys.stderr.isatty() True
           sys.stderr.encoding cp936
      sys.getdefaultencoding() utf-8
   sys.getfilesystemencoding() mbcs
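The fp.encoding line above shows that open() without an encoding argument falls back to cp936 on this machine. A hedged sketch of reading a notebook with an explicit encoding instead: the helper name `load_notebook` is hypothetical, and it assumes the notebook is stored on disk as UTF-8 JSON.

```python
import json
import tempfile
from pathlib import Path


def load_notebook(path):
    """Load notebook JSON, always decoding as UTF-8 regardless of the locale."""
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)


# Demo on a throwaway UTF-8 file containing non-ASCII text.
with tempfile.TemporaryDirectory() as tmp:
    nb_path = Path(tmp) / "example.ipynb"
    nb_path.write_bytes('{"cells": [], "note": "中文"}'.encode("utf-8"))
    nb = load_notebook(nb_path)
```

Passing encoding="utf-8" makes the result independent of locale.getpreferredencoding(), which is what differs between the Windows and Mac environments described here.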

@shpj123

shpj123 commented Dec 8, 2019

Hello @jbn,
I'm also having this problem when merging three notebooks containing Chinese characters.
Here's the output of your script, and I've also attached the three files to be merged:
Desktop.zip

!nbmerge 1.ipynb 2.ipynb 3.ipynb > merged.ipynb

Thx a lot!!

Best,
PJ

 locale.getpreferredencoding() cp936
                      type(fp) <class '_io.TextIOWrapper'>
                   fp.encoding cp936
           sys.stdout.isatty() False
           sys.stdout.encoding UTF-8
            sys.stdin.isatty() False
            sys.stdin.encoding cp936
           sys.stderr.isatty() False
           sys.stderr.encoding UTF-8
      sys.getdefaultencoding() utf-8
   sys.getfilesystemencoding() utf-8 
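Since sys.stdout.encoding differs between environments (cp936 in the console output earlier in this thread, UTF-8 here), one way to make redirected output predictable is to write UTF-8 bytes to a binary stream. This is only a sketch: `dump_notebook_utf8` is a hypothetical helper, not part of nbmerge, and the notebook is a plain dict.

```python
import io
import json


def dump_notebook_utf8(nb, binary_stream):
    """Serialize `nb` (a plain dict here) as UTF-8 JSON onto a binary stream."""
    wrapper = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")
    json.dump(nb, wrapper, ensure_ascii=False, indent=1)
    wrapper.write("\n")
    wrapper.flush()
    wrapper.detach()  # leave the underlying stream open for the caller


# Demo with an in-memory buffer standing in for sys.stdout.buffer.
buf = io.BytesIO()
dump_notebook_utf8({"cells": [], "note": "中文"}, buf)
written = buf.getvalue()
```

In a command-line tool the same function could be given sys.stdout.buffer. Setting the PYTHONIOENCODING=utf-8 environment variable before running the command is another option that needs no code change.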

@krinsman

krinsman commented Feb 2, 2020

To clarify, is this issue only on Windows, and not Unix (Linux or Mac OS)?

EDIT: I just ran this on Ubuntu Bionic (copy-pasted Chinese characters into two notebooks), e.g.

nbmerge unicode1.ipynb unicode2.ipynb > new.ipynb

and ran into no issues whatsoever.

So I think it could be helpful to label this issue as being specific to Windows only (to avoid unnecessarily freaking out/turning off people who aren't running this with Windows).

This is a great package by the way! Elegant solution to a recurring problem.
