Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix [Utils] Replace Text Encoder #4093

Closed
wants to merge 1 commit into from
Closed

Conversation

Rolaka
Copy link

@Rolaka Rolaka commented Feb 21, 2024

[+] fix(utils.ts): replace text encoder

Use a better way of encoding text, the current code is wrong.

new Uint8Array([...'新的对话'].map((c) => c.charCodeAt(0)))
// => Uint8Array(4) [176, 132, 249, 221]

Replacing old code with TextEncoder.

const encoder = new TextEncoder()
encoder.encode('新的对话')
// => Uint8Array(12) [230, 150, 176, 231, 154, 132, 229, 175, 185, 232, 175, 157]

Copy link

vercel bot commented Feb 21, 2024

@Rolaka is attempting to deploy a commit to the NextChat Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Contributor

Your build has completed!

Preview deployment

@H0llyW00dzZ
Copy link
Contributor

that's one might break other language

@Rolaka
Copy link
Author

Rolaka commented Feb 21, 2024

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.

for example, Japanese hiragana: あ

あ => U+3042 => 0011 0000 0100 0010

Uint8 => 0011 0000 0100 0010 => 66 => 0x42, It's wrong.
image

TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

@H0llyW00dzZ
Copy link
Contributor

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.

for example, Japanese hiragana: あ

あ => U+3042 => 0011 0000 0100 0010

Uint8 => 0011 0000 0100 0010 => 66 => 0x42, It's wrong. image

TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

test other language not only that language
for example, russia,arab

@Rolaka
Copy link
Author

Rolaka commented Feb 21, 2024

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.
for example, Japanese hiragana: あ

あ => U+3042 => 0011 0000 0100 0010

Uint8 => 0011 0000 0100 0010 => 66 => 0x42, It's wrong. image
TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

test other language not only that language for example, russia,arab

We don't need to discuss this issue from a particular language point of view, but only from the point of view of how to encode characters correctly.

First, there are two errors in the old code:

  1. charCodeAt should not be used to try to encode the character that needs to be output, because it does not return the correct encoding. Source

  2. Even if the conversion is correct in some encoding ranges, the way Uint8 is stored corrupts any character that exceeds the unicode number by more than 0xFF

Secondly, if you want to discuss the range of corrupted characters, see this Link , from the third row of the table onwards it's all in the wrong range.

Finally, why TextEncoder? Because it is a function in the Specification. Details: https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

We can find many examples for file output on Github

@Rolaka
Copy link
Author

Rolaka commented Feb 22, 2024

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.
for example, Japanese hiragana: あ

あ => U+3042 => 0011 0000 0100 0010

Uint8 => 0011 0000 0100 0010 => 66 => 0x42, It's wrong. image
TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

test other language not only that language for example, russia,arab

Do you have any other thoughts on this? If you still have questions about the utf-8, utf-16 encoded files, I can provide more information and test scripts to assist you in understanding this matter, and I'd be happy to do this to push for a resolution.

This bug has been introduced more than 5 months ago and is severely affecting all users whose exports contain characters outside of the ASCII character set.

@Rolaka
Copy link
Author

Rolaka commented Feb 23, 2024

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.
for example, Japanese hiragana: あ

あ => U+3042 => 0011 0000 0100 0010

Uint8 => 0011 0000 0100 0010 => 66 => 0x42, It's wrong. image
TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

test other language not only that language for example, russia,arab

Maybe you don't realize the seriousness of this problem, I would like to show you more relevant information to show where the old code is wrong.

Let's refer to the description of ecma256.

https://262.ecma-international.org/14.0/#sec-terms-and-definitions-string-value

A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.

https://262.ecma-international.org/14.0/#sec-ecmascript-language-types-string-type

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 253 - 1 elements.

As you can see above, in JavaScript, each character is stored using Uint16, not Uint8, unlike other, more primitive languages. So the old code uses Uint8 in the conversion process, resulting in the loss of two bytes of data.

So it's simple for us to verify this with a snippet of code, although it seems pointless.

const toUint8 = (number) => { return new Uint8Array([number])[0]; };
const getCharCode = (char) => { return char.charCodeAt(); };
// U+00FF 
toUint8(getCharCode('ÿ')) === getCharCode('ÿ') // => true
// U+0100 
toUint8(getCharCode('Ā')) === getCharCode('Ā') // => false

This question is only about coding, not about language, I hope I explained it clearly enough.

2.10.3 changed the package name so that tauri was unable to find data stored in localstorage in webkit from versions before to 2.10.2, and the "Export" supply was broken for 5 months due to a bug in the old code. Of course, that's just my suspicion.

"Export" is the most convenient method of data transfer, and fixing it as soon as possible is necessary.

@Rolaka
Copy link
Author

Rolaka commented Mar 7, 2024

@fred-bf I'm sorry to bother you, but this serious low-level issue hasn't been resolved for 5 months. Could you consider fixing this bug as soon as possible? Thank you for your time.

@Rolaka
Copy link
Author

Rolaka commented Mar 13, 2024

Many thanks to the maintainers for finally fixing this bug, the export feature is a very important thing for many people. This pr is obsolete and could be closed.

@Rolaka Rolaka closed this Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants