Support truncating from the left, width padding/truncation incorrect with some unicode input #283

Open
moh-eulith opened this issue Oct 13, 2022 · 12 comments · May be fixed by #285

@moh-eulith

Left truncation is the default in the inspiration for this crate (Log4j). It usually produces more relevant results, e.g. for filenames, logger names, etc.

Both styles of truncation should be supported, with right truncation (current implementation) as the default for backward compatibility.

Log4j uses a minus sign on the length (negative length) to flip the truncation. So it might look like this for log4rs: {f:>-15.15} (left truncate or left pad the filename field to exactly 15 chars)
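
To make the request concrete, here is a rough sketch (not log4rs code, counting chars for simplicity) of the two behaviours on a path-like field:

fn truncate_right(s: &str, max: usize) -> &str {
    // Current behaviour: keep the first `max` chars.
    match s.char_indices().nth(max) {
        Some((idx, _)) => &s[..idx],
        None => s,
    }
}

fn truncate_left(s: &str, max: usize) -> &str {
    // Requested behaviour: keep the last `max` chars.
    let len = s.chars().count();
    if len <= max {
        return s;
    }
    let (idx, _) = s.char_indices().nth(len - max).unwrap();
    &s[idx..]
}

fn main() {
    let file = "src/encode/pattern/mod.rs";
    assert_eq!(truncate_right(file, 15), "src/encode/patt");
    assert_eq!(truncate_left(file, 15), "/pattern/mod.rs");
}

For a filename or logger name the tail is usually the informative part, so left truncation keeps the useful portion.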

@estk
Owner

estk commented Oct 13, 2022

Sounds like a worthy enhancement, happy to mentor if you'd like to take a stab at it.

@moh-eulith
Author

I had a look at this. Unicode encoding makes this diabolically difficult. The existing code is already running into this issue, so maybe fixing the Unicode issues is the first step. Here is a failing test case, inspired by this Stack Overflow question:

    #[test]
    #[cfg(feature = "simple_writer")]
    fn right_align_formatter_hard_unicode() {
        let pw = PatternEncoder::new("{({l} {m}):>15}");

        let mut buf = vec![];
        pw.encode(
            &mut SimpleWriter(&mut buf),
            &Record::builder()
                .level(Level::Info)
                .args(format_args!("\u{01f5}\u{0067}\u{0301}"))
                .build(),
        )
        .unwrap();
        // Expected: 8 spaces of padding + "INFO " (5 bytes) + the 5-byte message = 18 bytes.
        assert_eq!(buf.len(), 18);
    }

A simple println!("\u{01f5}\u{0067}\u{0301}"); produces ǵǵ (two visible characters; not sure the pasting is working right), but the padding thinks it's 3 characters.
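
For reference, here is how the different counts come apart on that message (an illustration using the unicode-segmentation crate, not the encoder code):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let msg = "\u{01f5}\u{0067}\u{0301}"; // precomposed ǵ, then g + combining acute accent
    assert_eq!(msg.len(), 5);                   // UTF-8 bytes
    assert_eq!(msg.chars().count(), 3);         // codepoints -- what the padding currently counts
    assert_eq!(msg.graphemes(true).count(), 2); // visible characters
}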

@estk
Owner

estk commented Oct 17, 2022

@moh-eulith can you open a draft PR with the above failing test?

@moh-eulith moh-eulith linked a pull request Oct 17, 2022 that will close this issue
@estk estk changed the title Support truncating from the left Support truncating from the left, width padding/truncation incorrect with some unicode input Oct 17, 2022
@estk estk added the bug label Oct 17, 2022
@estk
Owner

estk commented Oct 17, 2022

Ok, so I've kinda gotten to the bottom of this. There are in fact 3 Unicode "characters", i.e. codepoints, here, but they result in only 2 graphemes, and therefore the message has a Unicode width of 2. I have a fix for the test thanks to the unicode_segmentation crate, but I think I need to think on this for a minute to decide how to proceed.

What do you think?

use unicode_segmentation::UnicodeSegmentation;

fn char_starts(buf: &[u8]) -> usize {
    if let Ok(s) = std::str::from_utf8(buf) {
        // Valid UTF-8: count extended grapheme clusters.
        s.graphemes(true).count()
    } else {
        // Fall back to the old behaviour: count bytes that are not UTF-8 continuation bytes.
        buf.iter().filter(|&&b| is_char_boundary(b)).count()
    }
}
....
    #[test]
    fn char_starts_gs() {
        let buf = "\u{01f5}\u{0067}\u{0301}";
        let starts = char_starts(buf.as_bytes());
        assert_eq!(2, starts);
    }

@moh-eulith
Author

I'm not clear on what the buf contents are that are passed to the write function. Is the content guaranteed to be split on byte, Unicode code point, or grapheme boundaries?

@estk
Owner

estk commented Oct 17, 2022

Negative, an io::Write just takes bytes and writes them; there is no guarantee of a particular segmentation, which is why this is really only a best-effort check.
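
For example (illustration only), a single grapheme's bytes can arrive split across calls, even inside one codepoint's UTF-8 encoding:

fn main() {
    // One grapheme ("e" + combining acute), three UTF-8 bytes: 0x65 0xCC 0x81.
    let bytes = "e\u{0301}".as_bytes();
    let (a, b) = bytes.split_at(2);
    assert!(std::str::from_utf8(a).is_err()); // [0x65, 0xCC]: truncated sequence
    assert!(std::str::from_utf8(b).is_err()); // [0x81]: lone continuation byte
}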

@moh-eulith
Author

So a full fix would require some manner of buffering. The max encoding length for UTF-8 is 4 bytes, plus another 2 for a combining mark, so that's a 6-byte buffer. I'm by no means a Unicode expert, but it appears combining is not limited to a single mark per grapheme, so theoretically we have to look ahead past all combining codepoints.

Additionally, codepoints U+035C–U+0362 are double diacritics, i.e. diacritic signs placed across two letters.

A test case for this insanity might look like: println!("a\u{0308}\u{032C}\u{0362}e\u{0301}\u{0331}") which prints: ͏ä̬͢é̱

Honestly, I don't know how far this should go. I think buffering for 1 codepoint + 1 combining mark is practical enough for real text.
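
As a rough sketch of the look-ahead idea (hypothetical code, not a proposed patch; it assumes chunks arrive as valid UTF-8 split on char boundaries, which a real writer would not get for free): hold back the most recently seen grapheme, since bytes from a later write may still attach combining marks to it.

use unicode_segmentation::UnicodeSegmentation;

struct GraphemeCounter {
    pending: String, // tail that may still be extended by combining marks
    count: usize,    // graphemes known to be complete
}

impl GraphemeCounter {
    fn new() -> Self {
        GraphemeCounter { pending: String::new(), count: 0 }
    }

    fn push(&mut self, chunk: &str) {
        self.pending.push_str(chunk);
        // Everything before the start of the last grapheme is final;
        // the last grapheme stays buffered.
        let last_start = self
            .pending
            .grapheme_indices(true)
            .last()
            .map(|(i, _)| i)
            .unwrap_or(0);
        self.count += self.pending[..last_start].graphemes(true).count();
        self.pending.drain(..last_start);
    }

    fn finish(self) -> usize {
        self.count + self.pending.graphemes(true).count()
    }
}

fn main() {
    // "e" and its combining mark arrive in separate pushes but count as one grapheme.
    let mut c = GraphemeCounter::new();
    c.push("e");
    c.push("\u{0331}");
    assert_eq!(c.finish(), 1);
}

A real fix would also need to buffer incomplete UTF-8 sequences before this step and flush whatever is pending when the stream ends.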

@estk
Owner

estk commented Oct 17, 2022

Counting graphemes seems to be enough.

    #[test]
    fn char_starts_gs() {
        let buf = "\u{01f5}\u{0067}\u{0301}";
        let starts = char_starts(buf.as_bytes());
        assert_eq!(2, starts);
        let buf = "a\u{0308}\u{032C}\u{0362}e\u{0301}\u{0331}";
        let starts = char_starts(buf.as_bytes());
        assert_eq!(2, starts);
    }

@moh-eulith
Author

Agreed that grapheme counting is the right way to go. However, because we don't have a guarantee about how buf is segmented, it leads to failures:

    #[test]
    fn char_starts_gs() {
        let buf = "\u{01f5}\u{0067}\u{0301}";
        test_all_splits(buf, 2);
        let buf = "a\u{0308}\u{032C}\u{0362}e\u{0301}\u{0331}";
        test_all_splits(buf, 2);
    }

    fn test_all_splits(buf: &str, len: usize) {
        let b = buf.as_bytes();
        // Count each half separately, as two successive writes would be counted,
        // and check that the total is still the true grapheme count.
        for i in 1..b.len() {
            assert_eq!(len, char_starts(&b[..i]) + char_starts(&b[i..]));
        }
    }

@moh-eulith
Author

Sorry, I should've made the example simpler. I think the minimal test case that fails test_all_splits is:

        let buf = "e\u{0331}";
        // Fails: splitting after the "e" leaves the combining mark on its own,
        // where it is counted as a grapheme (or a char start) in its own right.
        test_all_splits(buf, 1);

@estk
Owner

estk commented Oct 17, 2022

No worries. Yep, I agree there will be miscalculations when a write spans a Unicode sequence; we just don't have enough information to do the right thing.

@estk estk added the open pr label (An open PR exists) and removed the help wanted label Oct 30, 2022
@DavidZemon

Is it rude to just ignore Unicode and put a note in the docs saying "Unicode support is undefined and may result in unexpected trimmings"? 😅

@bconn98 bconn98 linked a pull request Apr 4, 2024 that will close this issue