Fix [Utils] Replace Text Encoder #4093

Rolaka · 2024-02-21T03:29:21Z

[+] fix(utils.ts): replace text encoder

Use a better way of encoding text, the current code is wrong.

new Uint8Array([...'新的对话'].map((c) => c.charCodeAt(0)))
// => Uint8Array(4) [176, 132, 249, 221]

Replacing old code with TextEncoder.

const encoder = new TextEncoder()
encoder.encode('新的对话')
// => Uint8Array(12) [230, 150, 176, 231, 154, 132, 229, 175, 185, 232, 175, 157]

vercel · 2024-02-21T03:29:24Z

@Rolaka is attempting to deploy a commit to the NextChat Team on Vercel.

A member of the Team first needs to authorize it.

github-actions · 2024-02-21T03:32:20Z

Your build has completed!

Preview deployment

H0llyW00dzZ · 2024-02-21T04:49:23Z

that's one might break other language

Rolaka · 2024-02-21T05:38:32Z

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.

for example, Japanese hiragana: あ

あ => U+3042 => 0011 0000 0100 0010

Uint8 => ~~0011 0000~~ 0100 0010 => 66 => 0x42, It's wrong.

TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

H0llyW00dzZ · 2024-02-21T07:03:15Z

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.

for example, Japanese hiragana: あ
あ => U+3042 => 0011 0000 0100 0010
Uint8 => ~~0011 0000~~ 0100 0010 => 66 => 0x42, It's wrong.

TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.

test other language not only that language
for example, russia,arab

Rolaka · 2024-02-21T08:22:34Z

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.
for example, Japanese hiragana: あ
あ => U+3042 => 0011 0000 0100 0010
Uint8 => ~~0011 0000~~ 0100 0010 => 66 => 0x42, It's wrong.
TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.
test other language not only that language for example, russia,arab

We don't need to discuss this issue from a particular language point of view, but only from the point of view of how to encode characters correctly.

First, there are two errors in the old code:

charCodeAt should not be used to try to encode the character that needs to be output, because it does not return the correct encoding. Source
Even if the conversion is correct in some encoding ranges, the way Uint8 is stored corrupts any character that exceeds the unicode number by more than 0xFF

Secondly, if you want to discuss the range of corrupted characters, see this Link , from the third row of the table onwards it's all in the wrong range.

Finally, why TextEncoder? Because it is a function in the Specification. Details: https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

We can find many examples for file output on Github

Rolaka · 2024-02-22T05:45:31Z

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.
for example, Japanese hiragana: あ
あ => U+3042 => 0011 0000 0100 0010
Uint8 => ~~0011 0000~~ 0100 0010 => 66 => 0x42, It's wrong.
TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.
test other language not only that language for example, russia,arab

Do you have any other thoughts on this? If you still have questions about the utf-8, utf-16 encoded files, I can provide more information and test scripts to assist you in understanding this matter, and I'd be happy to do this to push for a resolution.

This bug has been introduced more than 5 months ago and is severely affecting all users whose exports contain characters outside of the ASCII character set.

Rolaka · 2024-02-23T07:37:22Z

that's one might break other language

this change was made to ensure that any output of utf-8 characters would have a correct conversion, not just for a particular language. the original code's conversion method used Uint8 to store the converted value, a method that broke all encodings of characters longer than one byte, including all emoji.
for example, Japanese hiragana: あ
あ => U+3042 => 0011 0000 0100 0010
Uint8 => ~~0011 0000~~ 0100 0010 => 66 => 0x42, It's wrong.
TextEncoder is a standardized method with no problems, the function is designed to convert the text into a uint8array.
test other language not only that language for example, russia,arab

Maybe you don't realize the seriousness of this problem, I would like to show you more relevant information to show where the old code is wrong.

Let's refer to the description of ecma256.

https://262.ecma-international.org/14.0/#sec-terms-and-definitions-string-value

A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.

https://262.ecma-international.org/14.0/#sec-ecmascript-language-types-string-type

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 253 - 1 elements.

As you can see above, in JavaScript, each character is stored using Uint16, not Uint8, unlike other, more primitive languages. So the old code uses Uint8 in the conversion process, resulting in the loss of two bytes of data.

So it's simple for us to verify this with a snippet of code, although it seems pointless.

const toUint8 = (number) => { return new Uint8Array([number])[0]; };
const getCharCode = (char) => { return char.charCodeAt(); };
// U+00FF 
toUint8(getCharCode('ÿ')) === getCharCode('ÿ') // => true
// U+0100 
toUint8(getCharCode('Ā')) === getCharCode('Ā') // => false

This question is only about coding, not about language, I hope I explained it clearly enough.

2.10.3 changed the package name so that tauri was unable to find data stored in localstorage in webkit from versions before to 2.10.2, and the "Export" supply was broken for 5 months due to a bug in the old code. Of course, that's just my suspicion.

"Export" is the most convenient method of data transfer, and fixing it as soon as possible is necessary.

Rolaka · 2024-03-07T05:22:18Z

@fred-bf I'm sorry to bother you, but this serious low-level issue hasn't been resolved for 5 months. Could you consider fixing this bug as soon as possible? Thank you for your time.

Rolaka · 2024-03-13T05:38:01Z

Many thanks to the maintainers for finally fixing this bug, the export feature is a very important thing for many people. This pr is obsolete and could be closed.

Fix [Utils] Replace Text Encoder

564f73f

Rolaka closed this Mar 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix [Utils] Replace Text Encoder #4093

Fix [Utils] Replace Text Encoder #4093

Rolaka commented Feb 21, 2024

vercel bot commented Feb 21, 2024

github-actions bot commented Feb 21, 2024

H0llyW00dzZ commented Feb 21, 2024

Rolaka commented Feb 21, 2024 •

edited

Loading

H0llyW00dzZ commented Feb 21, 2024

Rolaka commented Feb 21, 2024

Rolaka commented Feb 22, 2024

Rolaka commented Feb 23, 2024 •

edited

Loading

Rolaka commented Mar 7, 2024

Rolaka commented Mar 13, 2024

Fix [Utils] Replace Text Encoder #4093

Fix [Utils] Replace Text Encoder #4093

Conversation

Rolaka commented Feb 21, 2024

vercel bot commented Feb 21, 2024

github-actions bot commented Feb 21, 2024

H0llyW00dzZ commented Feb 21, 2024

Rolaka commented Feb 21, 2024 • edited Loading

H0llyW00dzZ commented Feb 21, 2024

Rolaka commented Feb 21, 2024

Rolaka commented Feb 22, 2024

Rolaka commented Feb 23, 2024 • edited Loading

Rolaka commented Mar 7, 2024

Rolaka commented Mar 13, 2024

Rolaka commented Feb 21, 2024 •

edited

Loading

Rolaka commented Feb 23, 2024 •

edited

Loading