Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grapheme handling issue with \r\n #58

Closed
alexrutar opened this issue Nov 17, 2024 · 5 comments
Closed

Grapheme handling issue with \r\n #58

alexrutar opened this issue Nov 17, 2024 · 5 comments

Comments

@alexrutar
Copy link
Contributor

There seems to be a grapheme handling ambiguity for strings containing the "windows-style newline" \r\n?

Since \r\n is treated as a single grapheme by the Unicode segmentation crate, the highlight indices corresponding to a Utf32String constructed from a string slice consisting only of ASCII characters and \r\n will be offset relative to the grapheme indices of that string.

I am curious if this is intended behaviour? If very fast reading of the implementation in helix is correct, this is also a "bug" in helix.

It is reasonable to me that helix does not support multi-line objects at all, so in practice this never turns up. I noticed this behaviour while implementing support for multi-line matches in nucleo-picker; here is the downstream PR: autobib/nucleo-picker#13

@alexrutar
Copy link
Contributor Author

More generally I feel like there are a few rough edges to the Utf32String API?

  1. My feeling somehow is that Utf32String itself being mutable is in principle incorrect, because you could easily construct a Utf32String which is entirely invalid by taking a large grapheme, iterating over chars, and pushing each char individually to the Utf32String. Of course the reasonable answer to this type of issue is 'stop hitting yourself in the head' (since pushing is totally valid if you just push complete graphemes), but in my opinion the API concerns are still an issue.
  2. I mentioned this in Feature request: Better conversions between Utf32String and Utf32Str #50 but it is also a bit strange to me that Utf32String implements Display. For instance the conversion String -> Utf32String -> String using the From and ToString implementations is lossy.

@pascalkuthe
Copy link
Member

That one is a bit ugly. I didn't really consider windows newlines because indeed nucleo is intended for single-line matches.

I guess you would want to normalize \r\n to \n instead of \r (and do that in the ascik case too).

Not sure about the API. Having a Display/From/Into implementations is useful. You can shoot yourself in the foot with it but it would be unfortunate to restrict the API. Maybe some naming changes to make things more explicit could be sufficient (or improving the docs). For example I think there can be usecases where you would want to use push() manually so just removing the API isn't great. Maybe call it push_grapheme or something isntead

@alexrutar
Copy link
Contributor Author

alexrutar commented Nov 17, 2024

Edit: See the comment in the closed PR instead: #59 (comment)

But leaving the old version below for posterity.

R.e. the \r\n issue, just to clarify, are you happy for this change to be implemented in nucleo, or do you think that I should perform normalization downstream? I suppose the relevant changes would be:

  1. Replace the value.is_ascii() check with an extra check that the string does not contain \r\n:
    value.is_ascii() && memmem::find(value.as_bytes, b"\r\n").is_none()
    
    (or a an epsilon cheaper version only looking for \r; not sure how much extra overhead this has).
  2. In the chars::graphemes function perform the extra check grapheme == "\r\n" and map to \n.

Also easy to implement downstream by just replacing \r\n with \n in any case.

I guess the newline support in general is an edge case in more scenarios, e.g. one might expect that foo$ would match foo\nbar. As you say nucleo as currently implemented isn't really designed for matching multiple lines.

Before putting this issue here I spent a bit of time searching for similar issues in helix and I only spotted one PR where there was some sort of multiline behaviour helix-editor/helix#11184

@alexrutar
Copy link
Contributor Author

alexrutar commented Nov 17, 2024

Edit: See the comment in the closed PR instead: #59 (comment)

But leaving the old version below for posterity.

Quick micro-benchmark shows about 10% performance hit on short ASCII strings, which I guess overall is quite mild since this almost certainly won't be a bottleneck in any reasonable situation (200k short string conversions takes about 10ms on my machine). On long Unicode strings it's essentially 0 since the grapheme iteration is so much slower anyway.

The performance hit is unsurprisingly much larger if the string actually contains a \r\n, since it then runs down the much slower Unicode path (about 6x slower). Replacing the unicode path in this situation with a custom \r\n -> \n substitution instead is about twice as slow.

@alexrutar
Copy link
Contributor Author

Closed by #61

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants