Organizing the library fails with a Wide character error #4

Open
Sgearjhz opened this issue Aug 25, 2018 · 3 comments

Comments

@Sgearjhz
Contributor

Sgearjhz commented Aug 25, 2018

When importing certain resources into the library, the terminal reports the following error:
----- move to /Users/XXXXX/voice1808251341
----- unfoldDLSiteFile
----- grapDLCount
Wide character at /Library/Perl/5.18/darwin-thread-multi-2level/Encode.pm line 296.
----- buildDLSite

err.log shows:
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at ./buildDLSite line 41.
It turned out that nothing had been written to new_works.json.
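The "character offset 0 (before "(end of string)")" part of that message is what a JSON decoder reports when it is handed an empty string, which matches the empty new_works.json. A minimal sketch of a guard that detects this case before decoding follows. It assumes JSON::PP, since the thread does not say which JSON module buildDLSite uses, and `load_json_or_undef` is a hypothetical helper, not something from the project.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper (not part of buildDLSite): returns the decoded data,
# or undef when the file is empty or contains only whitespace.
sub load_json_or_undef {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $content = <$fh>;
    close $fh;

    # An empty file is what produces "malformed JSON string ... offset 0",
    # so report it as "no data" instead of letting the decoder die.
    return undef unless defined $content && $content =~ /\S/;
    return JSON::PP::decode_json($content);
}
```

A caller could then print a clearer message, such as "new_works.json is empty, the previous step probably failed", instead of surfacing the decoder error.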

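I traced this to the grapDLCount file and tried commenting out line 71:

$result{'text'} = Encode::decode_utf8( $result{'text'});

With that line commented out it works, so the cause is that the work_text scraped from dlsite is not being decoded successfully.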

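Comparing the resources that failed with those that did not, I found that the cause is a character in work_text: the HTML ellipsis character, …
It can be fixed by adding the following line before the decode call:
$result{'text'} =~ s/…/.../;
which replaces the ellipsis with three periods.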
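Note that without the /g modifier the substitution above replaces only the first ellipsis in the string. A more general workaround can be sketched as follows, under the assumption (not confirmed in this thread) that the text is mostly raw UTF-8 bytes with a few characters above 0xFF, such as U+2026, that were already decoded. `decode_mixed_utf8` is a hypothetical helper, not part of grapDLCount.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of grapDLCount).
# Assumption: $text holds raw UTF-8 bytes, plus possibly some characters
# above 0xFF (e.g. U+2026) that have already been decoded.
sub decode_mixed_utf8 {
    my ($text) = @_;
    return undef unless defined $text;

    # Turn each already-decoded wide character back into its UTF-8 bytes,
    # so that every code point in the string is <= 0xFF again.
    $text =~ s/([^\x00-\xFF])/Encode::encode_utf8($1)/ge;

    # The string is byte-valued now, so it can be decoded as a whole.
    return Encode::decode_utf8($text);
}

my $bytes = "\xE3\x81\x82";           # UTF-8 bytes for U+3042
my $mixed = $bytes . "\x{2026}";      # bytes plus one decoded ellipsis
my $fixed = decode_mixed_utf8($mixed);
```

This keeps the ellipsis instead of replacing it with periods. It would mangle text that is already fully decoded and contains characters in the U+0080 to U+00FF range, so it only fits the mixed case described above.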

@bandiaozimu
Owner

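Thanks! Please also send a pull request and leave your email, and I will send you the script I promised.
I am almost helpless against Wide character errors. To get rid of this one, I changed lines 93-94 of grapDLCount from
#Encode::_utf8_on( $out );
to
$out = Encode::decode_utf8( $out);
That made the wide character error go away for now, but about 40% of the text turned into garbled characters. If you can, please take a look at that too.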

@bandiaozimu bandiaozimu reopened this Aug 25, 2018
@Sgearjhz
Contributor Author

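@bandiaozimu Thanks. My email: [email protected]
I am still at the stage of reading the code while trying to organize my resources. If I run into garbled text or other problems again, I will look into them.
By the way, which archive utility is your system default? I use The Unarchiver here, and on every download it asks me to manually confirm the extraction path and encoding, which is inconvenient.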

@bandiaozimu
Owner

If you launch The Unarchiver on its own, you can adjust its preferences. Select the following options:

Extraction
Extract archive to: Same folder as the archive
Create a new folder for the extracted files: Only if there is more than one top-level item
After successfully extracting an archive: Move the archive to the trash

Advanced
Filename Encoding: Detect Automatically with Threshold 80% confidence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This option lets The Unarchiver choose the encoding automatically.
