Reading a valid Unicode String with a limit #430
Comments
One additional thought: if each code point of an ill-formed UTF-8 sequence were mapped to a replacement code point individually (as mentioned in #301), it could be possible to adjust the length in bytes based on the number of code point replacements, I think.
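Under that assumption, a sketch of the adjustment could look like the following. This is a hypothetical helper, not thread code: it assumes every ill-formed byte decodes to exactly one U+FFFD, which kotlinx-io does not currently guarantee, and a legitimate U+FFFD already present in the input would be miscounted.

```kotlin
// Recomputes how many source bytes a decoded string covered, assuming each
// ill-formed byte decoded to exactly one U+FFFD (an assumption, see above).
fun coveredByteLength(s: String): Long {
    var bytes = 0L
    var i = 0
    while (i < s.length) {
        val c = s[i]
        bytes += when {
            c == '\uFFFD' -> 1                // one replacement per ill-formed byte (assumed)
            c.code < 0x80 -> 1                // ASCII
            c.code < 0x800 -> 2
            c.isHighSurrogate() -> { i++; 4 } // surrogate pair encodes a 4-byte code point
            else -> 3
        }
        i++
    }
    return bytes
}
```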
@OptimumCode thank you for opening the issue. It would be nice to supplement the existing API. Right now, something like that could be emulated using a combination of a peek source and `readCodePointValue`:

```kotlin
import kotlinx.io.EOFException
import kotlinx.io.Source
import kotlinx.io.readCodePointValue

public fun Source.readValidString(): String {
    val peek = peek()
    var bytesRead = 0L
    try {
        return buildString {
            while (true) {
                try {
                    when (val cp = peek.readCodePointValue()) {
                        // U+FFFF used as a stop sentinel.
                        0xffff -> break
                        in 0..<0x80 -> {
                            append(cp.toChar())
                            bytesRead += 1
                        }
                        in 0x80..<0x800 -> {
                            append(cp.toChar())
                            bytesRead += 2
                        }
                        in 0x800..<0x10000 -> {
                            append(cp.toChar())
                            bytesRead += 3
                        }
                        else -> {
                            // Code points above U+FFFF are emitted as a surrogate pair.
                            val highSurrogate = (0xD800 + (cp - 0x10000).ushr(10)).toChar()
                            val lowSurrogate = (0xDC00 + (cp - 0x10000).and(0x3FF)).toChar()
                            append(highSurrogate)
                            append(lowSurrogate)
                            bytesRead += 4
                        }
                    }
                } catch (e: EOFException) {
                    // A trailing partial sequence stays unconsumed in the source.
                    break
                }
            }
        }
    } finally {
        // Consume from the real source once, after decoding from the peek;
        // skipping inside the loop would repeatedly skip the cumulative count
        // and invalidate the peek.
        skip(bytesRead)
    }
}
```
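For reference, an assumed usage example of the helper above (the data is arbitrary, chosen to mix 1- and 2-byte UTF-8 sequences):

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.writeString

fun main() {
    val buffer = Buffer().apply { writeString("hello, мир") }
    println(buffer.readValidString()) // hello, мир
    println(buffer.size)              // 0: everything was fully decoded and consumed
}
```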
Thank you for your reply @fzhinkin. I tried something similar, but the performance was much worse than I could accept.

Disclaimer: benchmarks are provided just as a reference. They show a single case on my local machine and do not demonstrate the overall performance of the kotlinx-io library.

I have written a benchmark to measure the difference. Here is the benchmark code; if you see any mistakes that could cause incorrect measurements, please let me know. The execution command:

```shell
./gradlew :kotlinx-io-benchmarks:jvmBenchJar
java -jar benchmarks/build/benchmarks/jvm/jars/kotlinx-io-benchmarks-jvm-jmh-0.6.1-SNAPSHOT-JMH.jar ReadCodepointBenchmarks -f 1 -wi 5 -i 5 -tu us -w 1 -r 1
```

The most interesting benchmark is `ReadCodepointBenchmarks.readCodepointsFromBuffer`, which measures reading the whole buffer using `readCodePointValue`. Using codepoints is x2-x10 slower than using `readString`. The benchmark results were the following:
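For context, a minimal sketch of what such a kotlinx-benchmark class might look like. The class name and `readCodepointsFromBuffer` come from the comment above; the setup data and the comparison method name are assumptions, not the actual benchmark code:

```kotlin
import kotlinx.benchmark.Benchmark
import kotlinx.benchmark.Scope
import kotlinx.benchmark.Setup
import kotlinx.benchmark.State
import kotlinx.io.Buffer
import kotlinx.io.readCodePointValue
import kotlinx.io.readString
import kotlinx.io.writeString

@State(Scope.Benchmark)
open class ReadCodepointBenchmarks {
    private val data = Buffer()

    @Setup
    fun setup() {
        // Mix of 1-, 2-, and 3-byte UTF-8 sequences (assumed test data).
        repeat(1_000) { data.writeString("abc, привет, 你好") }
    }

    @Benchmark
    fun readCodepointsFromBuffer(): Int {
        val copy = data.copy()
        var acc = 0
        while (!copy.exhausted()) {
            acc += copy.readCodePointValue()
        }
        return acc
    }

    @Benchmark
    fun readStringFromBuffer(): Int {
        val copy = data.copy()
        return copy.readString().length
    }
}
```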
I think you could …
Thank you for the suggestion @JakeWharton. I will try that approach and let you know the results.
Sorry, I forgot all continuation bytes of UTF-8 start with 0b10. I think you'll have to search backwards for a starting byte. You can then use its index as the number of bytes to read (which would NOT include that codepoint in the resulting string). You could also check if there were enough bytes available for the starting byte you landed on, if you wanted to optionally include it when it was fully available.
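A minimal sketch of that backward scan might look like the following. The function name and the peek-then-copy approach are assumptions for illustration, not code from the thread:

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.Source
import kotlinx.io.readByteArray
import kotlinx.io.readString

// Hypothetical helper: reads at most byteCount bytes, but stops before a
// trailing UTF-8 sequence that is not fully available yet.
fun Source.readStringCodePointAligned(byteCount: Long): String {
    // Copy up to byteCount bytes without consuming the source. A single
    // readAtMostTo call may return fewer bytes than are actually available;
    // a production version would loop here.
    val copy = Buffer()
    val available = peek().readAtMostTo(copy, byteCount)
    if (available <= 0L) return ""
    val bytes = copy.readByteArray()

    // Walk backwards past continuation bytes (0b10xxxxxx) to a lead byte.
    var i = bytes.size - 1
    while (i >= 0 && (bytes[i].toInt() and 0xC0) == 0x80) i--

    var end = bytes.size
    if (i >= 0) {
        val lead = bytes[i].toInt() and 0xFF
        val width = when {
            lead < 0x80 -> 1
            lead and 0xE0 == 0xC0 -> 2
            lead and 0xF0 == 0xE0 -> 3
            lead and 0xF8 == 0xF0 -> 4
            else -> 1 // invalid lead byte; let readString replace it
        }
        // If the last sequence is cut off, leave it for the next read.
        if (i + width > bytes.size) end = i
    }
    return readString(end.toLong())
}
```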
Hi, I ended up with a method like this. If you have time, @JakeWharton, would you be able to take a look at it? Did I get your idea correctly? Based on a brief benchmark, the overall performance looks almost the same as reading a string with a byte limit. I have also added some tests for the new method to check that it works correctly.
@OptimumCode, it seems like the main potential issue with the suggested implementation is that only the last few bytes are checked for being a valid UTF-8 byte sequence. If a source contains strings interleaved with some binary data, the binary bytes in the middle will not be detected. For instance, for a source containing something like …
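An assumed illustration of that scenario (not the example from the original comment):

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.writeString

val source = Buffer().apply {
    writeString("abc")
    writeByte(0xFF.toByte()) // 0xFF can never appear in well-formed UTF-8
    writeString("def")
}
// A check that only inspects the trailing bytes sees the valid "def" and
// accepts all 7 bytes, even though the middle byte is not valid UTF-8;
// readString would silently decode 0xFF to U+FFFD.
```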
Thank you @fzhinkin for taking a look at the method. The original idea was a bit different: make sure we don't split a valid UTF-8 sequence in half (that is what this method does). But yes, you are right about this case. There is actually a test that makes sure the method is not affected by some gibberish in the middle, and that we still get those bytes in the result string (they are replaced by the library with replacement code points). To be honest, I don't know what the best approach would be here... If we want to make sure we read a valid UTF-8 sequence, we need to check all bytes. There is no other way to ensure that, I think.
Yeah, my approach was just validation that you don't split a codepoint, and your code looks to be basically what I envisioned. Unfortunately, there's not really a way to combine full UTF-8 validation and decoding with creation of the actual `String`.
Thank you @JakeWharton and @fzhinkin. What do you think should be done to make it a part of the library (different naming, documentation, etc.)? If you would like to see a method with such behavior in the library, of course. I would be happy to open a PR and make the required changes.
So I would probably coordinate this a little with whatever the stdlib is going to do for greater codepoint support (https://youtrack.jetbrains.com/issue/KT-71298). I don't think your specific use case is general enough for a helper. I would perhaps look towards something more general-purpose, such as a mechanism for UTF-8 codepoint iteration + validation. This would basically allow your operation to become a composite of three other high-level operations: …
I probably didn't get that part correctly. Maybe something like

```kotlin
fun Source.forEachCodePoint(limitBytes: Long, action: (Int) -> Unit): Long
```

where the function returns the index of the end byte associated with the last full code point?
Yeah, something like that. I wouldn't take a limit. You can either do … I'd also be tempted to return a boolean so you could stop iterating when you find a condition that warrants it, like a codepoint outside a certain range.
In this case, I would create something similar to ByteBuf.forEachByte. I think this is the closest example of what you described.
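A rough sketch of such a primitive, combining the signature proposed above with the no-limit, boolean-returning variant. This is hypothetical code, not a kotlinx-io API:

```kotlin
import kotlinx.io.EOFException
import kotlinx.io.Source
import kotlinx.io.readCodePointValue

// Iterates code points decoded from the source; action returns false to stop
// early. Returns the number of bytes consumed by fully decoded code points.
fun Source.forEachCodePoint(action: (codePoint: Int) -> Boolean): Long {
    val peek = peek()
    var consumed = 0L
    while (true) {
        val cp = try {
            peek.readCodePointValue()
        } catch (e: EOFException) {
            break // end of data, or a trailing partial sequence
        }
        if (!action(cp)) break
        // Caveat: for ill-formed input readCodePointValue returns U+FFFD, so
        // the width derived from the code point may not match the bytes that
        // were actually consumed; real validation would track raw bytes.
        consumed += when {
            cp < 0x80 -> 1L
            cp < 0x800 -> 2L
            cp < 0x10000 -> 3L
            else -> 4L
        }
    }
    skip(consumed)
    return consumed
}
```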
Hello,

Could you please advise what could be used in a KMP project to read a valid `String` from the `Source`, providing the approximate limit in bytes for that `String`?

The use-case is the following: I have a `Source` that is used for parsing data from a file (the file might contain non-ASCII characters). I need to read a portion of the content, parse it, and if more data is needed, read another portion from the `Source`, etc.

Right now there is a method `Source.readString(byteCount: Long)` that accepts a limit in bytes, but if the last byte is just a part of the actual codepoint, it will be substituted with the replacement codepoint. And I won't get that last codepoint on a second read attempt either.

I wonder if there is a way to solve my use-case without reimplementing UTF-16 decoding on my side (logic from here). For example, in Java I could use the `java.io.Reader#read(char[])` method, and if the last char is a high surrogate, I could try to read another `char` to check whether the string is ill-formed or not (a real example is the `StreamReader` from SnakeYAML).

Would really appreciate your thoughts and suggestions. Thank you!
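To make the described behavior concrete, here is a small assumed demonstration of a byte limit splitting a code point:

```kotlin
import kotlinx.io.Buffer
import kotlinx.io.readString
import kotlinx.io.writeString

fun main() {
    val buffer = Buffer().apply { writeString("é") } // encoded as 0xC3 0xA9
    val first = buffer.readString(1)  // "\uFFFD": the lead byte alone is ill-formed
    val second = buffer.readString(1) // "\uFFFD": the orphaned continuation byte too
    println("$first $second")         // the original "é" is unrecoverable
}
```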