The charset is wrongly detected #4

wanyancan · 2018-11-23T04:43:20Z

Hi,

Is there any way to manually set the charset for opened files?
If not, how can I change the auto-detection source code to allow manual selection?

Thank you!

HouQiming · 2018-11-23T06:20:38Z

I haven't implemented encoding selection. Right now, the JS function `EDLoader_Open` calls the JC function `DetectEncoding` to detect the encoding when reading the first chunk of a file. There is no easy data path from the UI to that place... but you can always add a hard-coded rule based on the file name.

A likely cause of the mis-detection is `MAX_ENCODING_DETECTION_LENGTH` in `encoding.jc`. Right now qpad only checks the first 8KB of a file for its encoding and will assume UTF-8 if that part is all ASCII. Maybe you can try increasing that?

Also, can you share some details about your file's content? Maybe I could improve the model.
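To illustrate the two points above (a detection window that only sees the start of the file, and a hard-coded rule keyed on the file name), here is a minimal Python sketch. It is not qpad's code: `guess_encoding`, `FORCED_ENCODINGS`, and the `.pos.txt` suffix are hypothetical names chosen for the example, and the real `DetectEncoding` scores candidate encodings rather than returning `"unknown"`.

```python
MAX_DETECTION_LENGTH = 8 * 1024  # mirrors the 8KB window described above

# Hypothetical hard-coded rules: file-name suffix -> forced encoding.
FORCED_ENCODINGS = {".pos.txt": "cp936"}


def guess_encoding(file_name: str, data: bytes) -> str:
    """Return a forced encoding for known file names, else a naive guess."""
    lowered = file_name.lower()
    for suffix, encoding in FORCED_ENCODINGS.items():
        if lowered.endswith(suffix):
            return encoding
    head = data[:MAX_DETECTION_LENGTH]
    if head.isascii():
        # Bytes beyond the window are never inspected, so a file whose
        # first non-ASCII byte comes later is still reported as UTF-8.
        return "utf-8"
    # Placeholder: a real detector would score candidate encodings here.
    return "unknown"


late_symbol = b"a" * 9000 + b"\xa6\xb8"  # non-ASCII bytes after the window
```

With this sketch, `guess_encoding("a.txt", late_symbol)` reports UTF-8 even though the file is not valid UTF-8, which is the failure mode a larger window would avoid.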

wanyancan · 2018-11-23T07:00:47Z

The file contains some engineering symbols in CP936.

| Designator | Footprint | Mid X | Mid Y | Ref X | Ref Y | Pad X | Pad Y | TB | Rotation | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1 | 0402_R | 817.716mil | -5537.402mil | 817.716mil | -5537.401mil | 829.548mil | -5525.57mil | T | 225.00 | 10KΩ (1002) ±1% |

Ω (A6 B8) and ± (A1 C0) are each decoded as two separate characters: ｦ (A6) ｸ (B8) and ｡ (A1) ﾀ (C0).

In CP932, every byte from A1 to DF is a single-byte character, but in CP936 two such bytes can combine into one double-byte character.
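The ambiguity can be reproduced with Python's built-in `cp936` and `cp932` codecs. This is only a demonstration of the byte values quoted above, not of how qpad decodes files; the sample bytes are a shortened version of the Comment field.

```python
# "10KΩ ±1%" as it would be stored in a CP936 file.
raw = b"10K\xa6\xb8 \xa1\xc01%"

as_cp936 = raw.decode("cp936")  # each A1..DF pair becomes one character
as_cp932 = raw.decode("cp932")  # each A1..DF byte becomes one half-width kana

print(as_cp936)  # 10KΩ ±1%
print(as_cp932)  # 10Kｦｸ ｡ﾀ1%
```

Both decodes succeed without error, so a decoder cannot reject either reading on validity alone.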

I'm not sure how the model could be updated. Maybe use a two-token (byte-pair) score with a penalty on consecutive bytes in the range A1 to DF?

I believe manual selection from a menu would be the most convenient. Can I call `ConvertToUTF8(encoding, s)` directly?
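For reference, the effect of a manual conversion with a user-chosen encoding can be sketched in Python as follows. The `convert_to_utf8` function below is a hypothetical stand-in written for this example; it only borrows the argument order from the call quoted above and says nothing about how qpad's own function is implemented.

```python
def convert_to_utf8(encoding: str, data: bytes) -> bytes:
    """Decode bytes with a user-chosen encoding and re-encode as UTF-8."""
    return data.decode(encoding).encode("utf-8")


# Manually choosing CP936 for the bytes that were mis-detected.
fixed = convert_to_utf8("cp936", b"10K\xa6\xb8")
print(fixed.decode("utf-8"))  # 10KΩ
```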

HouQiming · 2018-11-23T07:11:38Z

I see. It's indeed possible to call `ConvertToUTF` directly. Here I can't really improve the model... since I also need to detect half-width katakana documents, which have the same structure but replace Ω with things like ｵﾒｶﾞ. In any case, I recommend UTF-8, and I'll add manual encoding selection to the to-do list.

Qiming
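The point about half-width katakana can be checked with Python's built-in codecs. This is a small illustration only, assuming Python's `cp932` and `cp936` codecs as stand-ins for the two code pages; it shows that a genuine CP932 katakana word consists entirely of bytes in the A1 to DF range and also decodes cleanly as CP936.

```python
katakana = "ｵﾒｶﾞ"                  # "omega" in half-width katakana
raw = katakana.encode("cp932")     # one byte per character

# Every byte falls in the ambiguous A1..DF range.
all_in_range = all(0xA1 <= b <= 0xDF for b in raw)

# The same bytes are also valid CP936: two double-byte characters.
misread = raw.decode("cp936")

print(raw.hex(" "), all_in_range, len(misread))
```

Since both the CP936 symbols and the CP932 katakana produce runs of A1 to DF bytes, a penalty on such runs would fix one kind of file at the expense of the other.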

wanyancan mentioned this issue Dec 2, 2018

Greate work. But... #5

Open
