The charset is wrongly detected #4
I haven't implemented encoding selection.
Right now, the JS function `EDLoader_Open` calls the JC function
`DetectEncoding` to detect the encoding when reading the first chunk of a
file. There is no easy data path from the UI to that point, but you can
always add a hard-coded rule based on the file name.
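A hard-coded rule of that kind might look like the sketch below. It is written in Python purely for illustration, not in qpad's JS/JC; `FORCED_ENCODINGS`, `choose_encoding` and the file-name patterns are hypothetical, and `detected` stands in for whatever the detector returns.

```python
import fnmatch
import os

# Hypothetical override table: file-name pattern -> encoding to force.
# The patterns are examples only, not rules that qpad ships with.
FORCED_ENCODINGS = [
    ("*.pos", "cp936"),
    ("readme_jp*.txt", "cp932"),
]


def choose_encoding(path, detected):
    """Return a forced encoding if the file name matches a rule, else `detected`."""
    name = os.path.basename(path).lower()
    for pattern, encoding in FORCED_ENCODINGS:
        if fnmatch.fnmatchcase(name, pattern):
            return encoding
    return detected
```

With these example rules, `choose_encoding("/tmp/Board.POS", "cp932")` gives `"cp936"`, while a file that matches no rule keeps the detected encoding.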
A likely cause of mis-detection is `MAX_ENCODING_DETECTION_LENGTH` in
`encoding.jc`. Right now qpad only inspects the first 8 KB of a file when
detecting the encoding, and assumes UTF-8 if that chunk is all ASCII. Maybe
you can try increasing that?
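The effect of the detection window can be illustrated with a small Python sketch. It is not qpad's detector: `prefix_is_ascii` is a hypothetical helper, and the 8 KB value is taken from the description above rather than from `encoding.jc`.

```python
# Assumed value: the thread says qpad checks the first 8 KB; the real constant
# lives in encoding.jc and may be defined differently.
MAX_ENCODING_DETECTION_LENGTH = 8 * 1024


def prefix_is_ascii(data, limit=MAX_ENCODING_DETECTION_LENGTH):
    """True if every byte in the first `limit` bytes is 7-bit ASCII."""
    return all(b < 0x80 for b in data[:limit])


# 8 KB of ASCII followed by "10KΩ ±1%" encoded as CP936.
sample = b"x" * 8192 + "10KΩ ±1%".encode("cp936")

print(prefix_is_ascii(sample))               # True: the window never reaches the CP936 bytes
print(prefix_is_ascii(sample, limit=16384))  # False: a larger window sees them
```

A file like `sample` would be assumed to be UTF-8 under the all-ASCII rule, even though its tail is not valid UTF-8.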
Also, can you share some detail about your file's content? Maybe I could
improve the model.
…On Fri, Nov 23, 2018 at 12:43 PM wanyancan ***@***.***> wrote:
Hi,
Is there any way to manually set the charset for opened files?
If not, how can I change the auto-detection source code to use manual
selection?
Thank you!
The file contains some engineering symbols in CP936. Ω (A6 B8) and ± (A1 C0) are treated as the separate characters 。 (A1) タ (C0) and ヲ (A6) ク (B8). In CP932, the bytes from A1 to DF are all single-byte characters, but in CP936 two of them can combine into one character.
I'm not sure how the model can be updated. Maybe use a two-token score with a penalty on consecutive bytes in the range A1 to DF?
I believe a manual selection in the menu is the most convenient. Can I call `ConvertToUTF8(encoding, s)` directly?
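The ambiguity described above can be reproduced with a short sketch. It uses Python's built-in `cp936` and `cp932` codecs as a stand-in for qpad's own converter, so it only illustrates the byte-level overlap, not qpad's behaviour.

```python
raw = bytes([0xA6, 0xB8, 0xA1, 0xC0])  # "Ω±" as stored in the CP936 file

as_cp936 = raw.decode("cp936")  # read as two double-byte characters
as_cp932 = raw.decode("cp932")  # read as four single-byte half-width characters

print(as_cp936, len(as_cp936))
print(as_cp932, len(as_cp932))
```

Both decodes succeed without error, which is why the detector has to choose between them statistically.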
I see. It's indeed possible to call `ConvertToUTF` directly.
I can't really improve the model here, though, since I also need to detect
half-width katakana documents, which have the same byte structure but
replace Ω with things like オメガ.
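The katakana case can be checked the same way. In this sketch (Python's built-in codecs again, standing in for qpad's model), オメガ written in half-width katakana uses only bytes in the A1 to DF range, and those bytes also form a valid CP936 sequence, so a byte-range rule alone cannot tell the two kinds of document apart.

```python
import unicodedata

# "オメガ" in half-width katakana (CP932 single-byte range): ｵ ﾒ ｶ ﾞ
katakana = bytes([0xB5, 0xD2, 0xB6, 0xDE])
omega = bytes([0xA6, 0xB8])  # "Ω" in CP936

for raw in (katakana, omega):
    in_range = all(0xA1 <= b <= 0xDF for b in raw)
    # Both byte strings decode without error under both code pages.
    print(raw.hex(), in_range, repr(raw.decode("cp932")), repr(raw.decode("cp936")))

# NFKC folds half-width katakana into the usual full-width form.
folded = unicodedata.normalize("NFKC", katakana.decode("cp932"))
print(folded)
```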
In any case, I recommend UTF-8 and I'll add manual encoding selection to
the to-do list.
Qiming
…On Fri, Nov 23, 2018 at 3:00 PM wanyancan ***@***.***> wrote:
The file contains some engineering symbols in CP936.
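| Designator | Footprint | Mid X | Mid Y | Ref X | Ref Y | Pad X | Pad Y | TB | Rotation | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R1 | 0402_R | 817.716mil | -5537.402mil | 817.716mil | -5537.401mil | 829.548mil | -5525.57mil | T | 225.00 | 10KΩ (1002) ±1% |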
Ω (A6 B8) and ± (A1 C0) are treated as the separate characters 。 (A1) タ (C0)
and ヲ (A6) ク (B8).
In CP932, the bytes from A1 to DF are all single-byte characters, but in
CP936 two of them can combine into one character.
I'm not sure how the model can be updated. Maybe use a two-token score with
a penalty on consecutive bytes in the range A1 to DF?
I believe a manual selection in the menu is the most convenient. Can I call
`ConvertToUTF8(encoding, s)` directly?
Hi,
Is there any way to manually set the charset for opened files?
If not, how can I change the auto-detection source code to use manual selection?
Thank you!