
Reading a valid Unicode String with a limit #430

Open
OptimumCode opened this issue Jan 25, 2025 · 15 comments

Comments

@OptimumCode

OptimumCode commented Jan 25, 2025

Hello,

Could you please advise what could be used in a KMP project to read a valid String from a Source, given an approximate limit in bytes for that String?

The use-case is the following:

I have a Source that is used for parsing data from a file (the file might contain non-ASCII characters). I need to read a portion of the content, parse it, and if more data is needed, read another portion from the Source, etc.

Right now there is a method Source.readString(byteCount: Long) that accepts a limit in bytes, but if the last byte is only part of a code point's byte sequence, it will be substituted with the replacement code point. And I won't get that last code point on a second read attempt either.

I wonder if there is a way to solve my use-case without reimplementing UTF-16 decoding on my side (logic from here). For example, in Java I could use java.io.Reader#read(char[]) method and if the last char is a high-surrogate I could try to read another char to check whether the string is ill-formed or not (real example is a StreamReader from SnakeYAML)
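For reference, the java.io.Reader approach described above might look roughly like this on the JVM. This is only a sketch; `readChunk` and `limitChars` are hypothetical names, and the error handling for an unpaired high surrogate is a simplification:

```kotlin
import java.io.Reader
import java.io.StringReader

// Read up to limitChars chars; if the chunk ends on a high surrogate,
// try to read one more char so a surrogate pair is never split.
fun readChunk(reader: Reader, limitChars: Int): String {
    val buf = CharArray(limitChars)
    val read = reader.read(buf)
    if (read <= 0) return ""
    val sb = StringBuilder().append(buf, 0, read)
    if (buf[read - 1].isHighSurrogate()) {
        val next = reader.read()
        if (next == -1 || !next.toChar().isLowSurrogate()) {
            error("ill-formed input: unpaired high surrogate")
        }
        sb.append(next.toChar())
    }
    return sb.toString()
}
```

For example, reading "a\uD834\uDD1Eb" (where U+1D11E takes two chars) with a limit of 2 would return "a" plus the full surrogate pair rather than a lone high surrogate.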

Would really appreciate your thoughts and suggestions. Thank you!

@OptimumCode
Author

One additional thought: if each byte of an ill-formed UTF-8 sequence was mapped to a replacement code point individually (as mentioned in #301), it could be possible to adjust the length in bytes based on the number of code point replacements, I think.

@fzhinkin
Collaborator

@OptimumCode thank you for opening the issue. It would indeed be nice to supplement the existing readString API with a "read-N-bytes-until-they-are-valid" variant.

Right now, something like that could be emulated using a combination of a peek source and readCodePointValue, but it would be anything but easy and convenient (disclaimer: I didn't test the code below):

public fun Source.readValidString(): String {
    val peek = peek()
    var bytesRead = 0L
    val result = buildString {
        while (true) {
            val cp = try {
                peek.readCodePointValue()
            } catch (e: EOFException) {
                break
            }
            when (cp) {
                // readCodePointValue substitutes an ill-formed sequence with
                // U+FFFD, so stop there (a genuine U+FFFD in the input will
                // also stop the read)
                0xFFFD -> break

                in 0..<0x80 -> {
                    append(cp.toChar())
                    bytesRead += 1
                }

                in 0x80..<0x800 -> {
                    append(cp.toChar())
                    bytesRead += 2
                }

                in 0x800..<0x10000 -> {
                    append(cp.toChar())
                    bytesRead += 3
                }

                else -> {
                    val highSurrogate = (0xD800 + (cp - 0x10000).ushr(10)).toChar()
                    val lowSurrogate = (0xDC00 + (cp - 0x10000).and(0x3FF)).toChar()
                    append(highSurrogate)
                    append(lowSurrogate)
                    bytesRead += 4
                }
            }
        }
    }
    // Consume the validated bytes from the source exactly once, after the
    // peek is done; skipping inside the loop would skip cumulative counts
    // repeatedly and invalidate the peek source.
    skip(bytesRead)
    return result
}

@OptimumCode
Author

Thank you for your reply @fzhinkin. I tried something similar, but the performance was much worse than I could accept.

Disclaimer: benchmarks are provided just as a reference. They show a single case on my local machine and do not demonstrate the overall performance of the kotlinx-io library

I have written a benchmark to measure the difference. Here is the benchmark code - if you see any mistakes that could cause incorrect measurements please let me know.

The execution command:

./gradlew :kotlinx-io-benchmarks:jvmBenchJar

java -jar benchmarks/build/benchmarks/jvm/jars/kotlinx-io-benchmarks-jvm-jmh-0.6.1-SNAPSHOT-JMH.jar ReadCodepointBenchmarks -f 1 -wi 5 -i 5 -tu us -w 1 -r 1

The most interesting benchmarks are ReadCodepointBenchmarks.readCodepointsFromBuffer and ReadCodepointBenchmarks.readStringFromBuffer.

ReadCodepointBenchmarks.readCodepointsFromBuffer - measures reading the whole buffer using readCodePointValue method
ReadCodepointBenchmarks.readStringFromBuffer - measures reading a string using Source.readString(byteCount: Long) method.

Using code points is 2x to 10x slower than using readString on my machine.

The benchmarks results were the following:

Benchmark                                         (size)   Mode  Cnt   Score   Error   Units
ReadCodepointBenchmarks.readCodepointsFromBuffer      10  thrpt    5  12.684 ± 1.109  ops/us
ReadCodepointBenchmarks.readCodepointsFromBuffer     100  thrpt    5   1.824 ± 0.024  ops/us
ReadCodepointBenchmarks.readCodepointsFromBuffer    1000  thrpt    5   0.196 ± 0.006  ops/us
ReadCodepointBenchmarks.readCodepointsFromBuffer   10000  thrpt    5   0.018 ± 0.001  ops/us
ReadCodepointBenchmarks.readStringFromBuffer          10  thrpt    5  22.245 ± 2.753  ops/us
ReadCodepointBenchmarks.readStringFromBuffer         100  thrpt    5   9.819 ± 0.984  ops/us
ReadCodepointBenchmarks.readStringFromBuffer        1000  thrpt    5   1.737 ± 0.101  ops/us
ReadCodepointBenchmarks.readStringFromBuffer       10000  thrpt    5   0.191 ± 0.004  ops/us

# just for reference
ReadCodepointBenchmarks.readBytesFromBuffer           10  thrpt    5  32.188 ± 1.883  ops/us
ReadCodepointBenchmarks.readBytesFromBuffer          100  thrpt    5  30.787 ± 2.871  ops/us
ReadCodepointBenchmarks.readBytesFromBuffer         1000  thrpt    5  23.363 ± 1.495  ops/us
ReadCodepointBenchmarks.readBytesFromBuffer        10000  thrpt    5   6.098 ± 0.053  ops/us
ReadCodepointBenchmarks.readCharsFromReader           10  thrpt    5  30.364 ± 1.686  ops/us
ReadCodepointBenchmarks.readCharsFromReader          100  thrpt    5  31.232 ± 1.075  ops/us
ReadCodepointBenchmarks.readCharsFromReader         1000  thrpt    5  30.415 ± 1.402  ops/us
ReadCodepointBenchmarks.readCharsFromReader        10000  thrpt    5  31.385 ± 1.859  ops/us

@JakeWharton
Contributor

I think you could request the bytes you want, then walk backwards in the Buffer(ed bytes) until you find a valid UTF-8 termination byte (i.e., a byte with the highest bit set to 0b0, or the highest two bits set to 0b10). Then you can call readString with the index of that termination byte + 1 to read a fully-buffered, valid UTF-8 string. Any remaining UTF-8 leading bytes will stay in the buffer, and you can repeat this operation until you hit EOF or reach whatever termination condition you choose.

@OptimumCode
Author

Thank you for the suggestion @JakeWharton. I will try that approach and let you know the results

@JakeWharton
Contributor

Sorry, I forgot that all continuation bytes of UTF-8 start with 0b10.

I think you'll have to search backwards for a starting byte. You can then use its index as the number of bytes to read (which would NOT include that codepoint in the resulting string). You could also check if there were enough bytes available for the starting byte you landed on if you wanted to optionally include it when it was fully available.
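The corrected backward search can be sketched as follows. These are hypothetical helpers, written against a plain ByteArray for clarity; in practice the scan would run over the buffered bytes of a Source:

```kotlin
// Walk backwards past UTF-8 continuation bytes (0b10xxxxxx) and return
// the index of the last starting byte, or -1 if there is none.
fun lastUtf8StartIndex(bytes: ByteArray): Int {
    var i = bytes.size - 1
    while (i >= 0 && (bytes[i].toInt() and 0xC0) == 0x80) i--
    return i
}

// Expected length of the sequence introduced by a starting byte.
fun utf8SequenceLength(b: Byte): Int = when {
    (b.toInt() and 0x80) == 0x00 -> 1 // ASCII
    (b.toInt() and 0xE0) == 0xC0 -> 2
    (b.toInt() and 0xF0) == 0xE0 -> 3
    (b.toInt() and 0xF8) == 0xF0 -> 4
    else -> 1 // invalid starting byte; let the decoder replace it
}

// Number of leading bytes that can be read without splitting a
// multi-byte sequence: include the trailing sequence only if all of
// its bytes are available, as suggested above.
fun safeReadLength(bytes: ByteArray): Int {
    val start = lastUtf8StartIndex(bytes)
    if (start < 0) return bytes.size
    val needed = utf8SequenceLength(bytes[start])
    return if (start + needed <= bytes.size) bytes.size else start
}
```

For example, for the bytes of "aé" (0x61 0xC3 0xA9) the whole array is safe to read, while for the truncated 0x61 0xC3 only the first byte is.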

@OptimumCode
Author

Hi, I ended up with a method like this. If you have time @JakeWharton would you be able to take a look at it? Did I get your idea correctly?

Based on a brief benchmark, the overall performance looks almost the same as reading a string with a byte limit.

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
ReadCodepointBenchmarks.readStringFromBuffer      10  thrpt    5  22.457 ± 1.633  ops/us
ReadCodepointBenchmarks.readStringFromBuffer     100  thrpt    5  10.693 ± 0.498  ops/us
ReadCodepointBenchmarks.readStringFromBuffer    1000  thrpt    5   1.821 ± 0.059  ops/us
ReadCodepointBenchmarks.readStringFromBuffer   10000  thrpt    5   0.195 ± 0.007  ops/us
ReadCodepointBenchmarks.readStringWithLimit       10  thrpt    5  20.302 ± 2.891  ops/us
ReadCodepointBenchmarks.readStringWithLimit      100  thrpt    5   9.688 ± 0.613  ops/us
ReadCodepointBenchmarks.readStringWithLimit     1000  thrpt    5   1.686 ± 0.301  ops/us
ReadCodepointBenchmarks.readStringWithLimit    10000  thrpt    5   0.190 ± 0.009  ops/us

And also I have added some tests for a new method to check if it works correctly.

@fzhinkin
Collaborator

@OptimumCode, it seems like the main potential issue with the suggested implementation is that only the last few bytes are checked for being a valid UTF-8 byte sequence. If a source contains strings interleaved with some binary data, readStringWithLimit may read some gibberish when invoked with a large enough limit.

For instance, for a source containing something like hello binary data\0xff\0xff\0xff\0xffand now test data again, readStringWithLimit(30) will consume the four bytes between "data" and "and", even though the valid UTF-8 string ends right after the word "data". It might not be an issue, but I got the impression that the whole point is to stop at the first invalid byte.

@OptimumCode
Author

Thank you @fzhinkin for taking a look at the method. The original idea was a bit different: make sure we don't split a valid UTF-8 sequence in half (that is what this method does). But yes, you are right about this case. There is actually a test that makes sure the method is not affected by some gibberish in the middle and that we still get those bytes in the result string (replaced by the library with replacement code points).

To be honest, I don't know what the best approach would be here... If we want to make sure we read a valid UTF-8 sequence, we need to check all bytes. There is no other way to ensure that, I think.

@JakeWharton
Contributor

Yeah my approach was just validation that you don't split a codepoint, and your code looks to be basically what I envisioned.

Unfortunately, there's not really a way to combine full UTF-8 validation and decoding with creation of the actual String without incurring multiple passes or data copies (at least on the JVM/Android; not sure about JS. Native probably can do it).

@OptimumCode
Author

Thank you @JakeWharton and @fzhinkin. What do you think should be done to make it a part of the library (different naming, documentation, etc.)? If you would like to see a method with such behavior in the library, of course. I would be happy to open a PR and make the required changes.

@JakeWharton
Contributor

JakeWharton commented Jan 30, 2025

So I would probably coordinate this a little with whatever the stdlib is going to do for greater code point support (https://youtrack.jetbrains.com/issue/KT-71298).

I don't think your specific use case is general enough for a helper. I would perhaps look towards something more general-purpose, such as a mechanism for UTF-8 codepoint iteration + validation. This would basically allow your operation to become a composite of three other high-level operations:

  • request some chunk of data
  • iterate+validate codepoints within the buffered data
  • readString with the final valid codepoint index

@OptimumCode
Author

I probably didn't correctly understand the iterate+validate codepoints part. Did you mean something like this:

fun Source.forEachCodePoint(limitBytes: Long, action: (Int) -> Unit): Long {}

where the function returns the index of the end byte associated with the last full code point.
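To make the idea concrete, here is a rough sketch of such an iterate+validate primitive, written against a ByteArray for illustration (the real API would presumably be an extension on Source or Buffer, and a complete decoder would also reject overlong encodings and surrogate code points, which this sketch skips):

```kotlin
// Invoke action for each complete, well-formed UTF-8 code point and
// return the number of leading bytes they cover, stopping at the first
// invalid or truncated sequence.
fun forEachCodePoint(bytes: ByteArray, action: (Int) -> Unit): Int {
    var i = 0
    while (i < bytes.size) {
        val b0 = bytes[i].toInt() and 0xFF
        val (len, initial) = when {
            b0 < 0x80 -> 1 to b0
            (b0 and 0xE0) == 0xC0 -> 2 to (b0 and 0x1F)
            (b0 and 0xF0) == 0xE0 -> 3 to (b0 and 0x0F)
            (b0 and 0xF8) == 0xF0 -> 4 to (b0 and 0x07)
            else -> return i // invalid starting byte
        }
        if (i + len > bytes.size) return i // truncated sequence
        var cp = initial
        for (j in 1 until len) {
            val b = bytes[i + j].toInt() and 0xFF
            if ((b and 0xC0) != 0x80) return i // invalid continuation byte
            cp = (cp shl 6) or (b and 0x3F)
        }
        action(cp)
        i += len
    }
    return i
}
```

For the bytes of "aé€" this visits U+0061, U+00E9, U+20AC and returns 6; for a truncated trailing sequence it returns the index where the incomplete sequence starts.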

@JakeWharton
Copy link
Contributor

Yeah something like that.

I wouldn't take a limit. You can either do request + a function on the buffer, or do peek().limit(bytes) (although I'm not sure this library has a limit-style function yet; I'm going to add it to Okio one of these years...).

I'd also be tempted to return a boolean so you could stop iterating when you find a condition that warrants it, like a codepoint outside a certain range.

@OptimumCode
Author

In this case, I would create something similar to ByteBuf.forEachByte. I think this is the closest example of what you described.
