-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Use UTF-32 Aware String in JavaScript #13
base: master
Are you sure you want to change the base?
Conversation
…s to use functions instead of operators that coerce arguments to primitive types. Checks or converts to utf32_string at the API boundary. Changes unit test suite to be run twice, one with standard string and once with forced utf32_string.
# Conflicts: # javascript/diff_match_patch_uncompressed.js # javascript/tests/diff_match_patch_test.js
|
||
|
||
diff_match_patch.utf32_string.prototype.valueOf = function() { | ||
throw new Error("warning: implicit conversion attempted on utf32_string; use toString instead"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
console.error instead of thrown error?
@NeilFraser Initial thoughts on this? |
@NeilFraser Hello? |
Hi! thank you for providing the patch. var dmp = new diff_match_patch();
var patches = dmp.patch_make('👏test', '👏');
var patchText = dmp.patch_toText(patches);
var result = dmp.patch_fromText(patchText);
var obj = dmp.patch_apply(result, '👏test'); I get this exception:
If you know any workaround around this issue, please could you share it here. |
Probably what's happening is if (text.codePoints) {
// Argument is another utf32_string; do not re-convert
return text;
} |
Thank you very much, The exception is gone 😄 but I still have some Unicode strings that give errors. var dmp = new diff_match_patch();
var str = '👏👏t👏'; var patches = dmp.patch_make(str, 'test👏');
var patchText = dmp.patch_toText(patches);
var _patches = dmp.patch_fromText(patchText);
var obj = dmp.patch_apply(_patches, str);
console.log(obj[0]); it will output this "mojibake"
|
Hi, thanks for finding these issues. I don't plan to spend any more time on my PR since it hasn't gotten any attention in 6 months. I encourage you to copy my branch into your own fork (pull it from my fork and push it to your fork) and then add your own commits where you fix these errors. You can then open a new PR and hopefully Neil will find some time one of these days to review it. |
Ok, no worries 😢 I will report back if I manage to fix the issue. Thank you again for the patch. |
Any updates? |
@ndvbd you can check out the branch in #80. it doesn't have fixes for all languages but it does for JavaScript. It's also only patched for We've been using it for more than a year in Simplenote and it's working fine. Previously we had some corruption issues that popped up on account of the issue with surrogate pairs. |
@dmsnell any idea why it is not merged into main? |
Looks like there are now at least 3 independent PRs with various approaches to fixing the issue: this one, #69, and #80. @NeilFraser Can you take a look at these and perhaps merge one? |
@ndvbd I would wager that @NeilFraser is busy with other things. @sffc this approach is fairly heavy-handed - did you run it against the performance tests? what were the costs? How do you feel about converting to UTF-32 vs. the other approaches? Per the comments in #69 I think that approach sounds better than it is in practice. The solution in #80 only addresses the delta generation (which is "fine" because this library doesn't specify that it produces Unicode byte streams/diffs) and it only imparts a marginal performance penalty and that only when surrogate halves are split. The only thing I see it lacking is the boundary cleanup as seen in the dissimilar because that normalizes all diff outputs. I haven't done that simply because it's working stably enough in #80 and I moved on to other problems. Your feedback on those issues is welcome. Regardless of what takes place in |
I'm happy if #69 is enough to solve the issue; it's certainly a lot smaller of a patch than this one. There are a lot of code paths; I think the key thing with any of these is a thorough test suite. |
I've tried all 3 solutions to this problem and yours is the only one that solved the encodingURI exception for me. I created a minimal repo which reproduces the issue: https://github.com/gamedevsam/diff-match-patch-emoji-issue Hope it's helpful to the developers of the other 2 alternate solutions. Thank you for your contributions! |
As noted in #10, standard JavaScript strings use UTF-16 encoding, which leads to false arithmetic for a tool like diff_match_patch. The only correct fixes to the problem would be to either rewrite the whole library to operate in offsets instead of char indices or to push the character encoding into an abstraction layer. This PR attempts to do the second option.
At every API boundary, the input string is checked to see if it contains supplemental code points, and if it does, the inputs are converted to utf32_string, an abstraction layer that has a String-like API but does all math in terms of code points instead of code units.
The utf32_string implementation is slower than the standard string implementation, but not prohibitively slower, and it is used only when required (when the string contains supplemental code points).
I tried my best to be consistent with the JavaScript style of the existing project, including pre-ES6 syntax. I did not add Closure type annotations.