perf(parser): lexer match byte not char #2025
CodSpeed Performance Report: Merging #2025 will not alter performance.
+3% speed-up on the parser benchmarks.

I don't entirely know why this works either. But std's
```rust
let size = c as usize;
let byte = remaining.as_bytes()[0];
let kind = if byte < 128 {
    BYTE_HANDLERS[byte as usize](self)
```
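For context, here is a minimal self-contained sketch of the pattern this diff uses, under the assumption of simplified stand-in types (`Lexer`, `Kind`, and the handler names below are illustrative, not oxc's actual definitions): dispatch ASCII bytes through a 128-entry table, and fall back to a `#[cold]` function for non-ASCII first bytes.

```rust
struct Lexer<'a> {
    remaining: &'a str,
}

#[derive(Debug, PartialEq)]
enum Kind {
    Ident,
    // ... one variant per token kind in the real lexer
}

type ByteHandler = fn(&mut Lexer) -> Kind;

// One handler per ASCII byte value. (The real table has distinct
// handlers per byte; a single placeholder keeps the sketch short.)
static BYTE_HANDLERS: [ByteHandler; 128] = [handle_ident; 128];

fn handle_ident(_lexer: &mut Lexer) -> Kind {
    Kind::Ident
}

// Rare path: only decode a full `char` once we know it's not ASCII.
#[cold]
fn match_unicode_char(lexer: &mut Lexer) -> Kind {
    let _c = lexer.remaining.chars().next().unwrap();
    Kind::Ident // placeholder: real code classifies the decoded char
}

fn read_next_token(lexer: &mut Lexer) -> Kind {
    // Assumes `remaining` is non-empty; the real lexer handles EOF first.
    let byte = lexer.remaining.as_bytes()[0];
    if byte < 128 {
        // ASCII: one byte fully identifies the character.
        BYTE_HANDLERS[byte as usize](lexer)
    } else {
        match_unicode_char(lexer)
    }
}

fn main() {
    let mut lexer = Lexer { remaining: "abc" };
    assert_eq!(read_next_token(&mut lexer), Kind::Ident);
}
```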
Why not fill `BYTE_HANDLERS[128..256]` with `self.match_unicode_char`?
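Sketched out, the suggestion might look like this, reusing the `ByteHandler`, `Lexer`, and handler definitions from the sketch above (again illustrative, not oxc's code): grow the table to 256 entries and point every non-ASCII slot at the unicode handler, so dispatch needs no range check at all.

```rust
// 256 entries: 0..=127 keep their ASCII handlers, while 128..=255
// all point at the unicode handler.
static BYTE_HANDLERS_256: [ByteHandler; 256] = {
    let mut table: [ByteHandler; 256] = [handle_ident; 256];
    let mut i = 128;
    while i < 256 {
        table[i] = match_unicode_char;
        i += 1;
    }
    table
};

fn read_next_token_branchless(lexer: &mut Lexer) -> Kind {
    // Every possible first-byte value has a handler, so the
    // `byte < 128` branch disappears.
    let byte = lexer.remaining.as_bytes()[0];
    BYTE_HANDLERS_256[byte as usize](lexer)
}
```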
My assumption is that it's ideal for `BYTE_HANDLERS` to be as small as possible, so that as much of it as possible stays in cache. It's already 1 KiB (128 × 8 bytes), and I imagine loading chunks of it into cache is a slow-down.

Therefore I imagine the cost of a branch is probably less than the cost of loading a chunk of `BYTE_HANDLERS` into cache for values 128-255. Especially as the CPU likely won't be able to speculatively execute beyond the `BYTE_HANDLERS[byte as usize](self)` call, as it's too unpredictable where it'll jump.

If anything, I wonder whether taking other steps to reduce the size of `BYTE_HANDLERS` might be worthwhile.

However, I am pretty new to all this stuff about speculative execution and branch prediction, so this is total guesswork. If you think my assumptions may be incorrect, please say.
Oh wait a second! I just realised where I know you from. I saw your video about perfect hash tables. You definitely know more than me. So please do give your opinion on my assumptions.
> I imagine the cost of a branch is probably less than the cost of loading a chunk of `BYTE_HANDLERS` into cache for values 128-255.

Perhaps. This is worth measuring.

Another option: Map 128..256 to index 0 and teach `BYTE_HANDLERS[0]` to handle non-ASCII.

Another option: Map 128..256 to index 127 and teach `BYTE_HANDLERS[127]` to handle non-ASCII.

Another option: Map 128..256 to index 128 and grow `BYTE_HANDLERS` by one entry.

Of these three new options, reusing index 0 seems to produce the cleanest code on x86_64: https://godbolt.org/z/E1Esn77nf
```asm
xor   eax, eax                                 ; 0
test  dil, dil                                 ; check whether the character is in range
cmovg eax, edi                                 ; choose the index (either the character or 0)
movzx eax, al                                  ; clear top bits of the index
mov   rcx, qword ptr [rip + HANDLERS@GOTPCREL]
mov   rax, qword ptr [rcx + 8*rax]
```
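Translated back into Rust, the index computation behind that codegen might look something like the following (a sketch, not actual source; note that the signed compare also sends byte 0 to index 0, so the index-0 handler would have to cover NUL as well as non-ASCII):

```rust
fn handler_index(byte: u8) -> usize {
    // Reinterpreting the byte as signed makes 128..=255 negative, so
    // `> 0` is false for them (and for 0). The compiler can lower
    // this to `test` + `cmovg` rather than a branch.
    if (byte as i8) > 0 {
        byte as usize
    } else {
        0
    }
}
```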
Thanks for coming back, and for the suggestions.

After further thought, perhaps your original suggestion was the best after all. I guess it's not the size of `BYTE_HANDLERS` which is the problem in itself, but the cost of reading it. So in the common case where the source code is all ASCII, the second half of `BYTE_HANDLERS` for 128-255 would never be accessed, so it shouldn't incur any cost. There'd only be a performance penalty when it is accessed, which is the same deal as now, where the `#[cold]` branch incurs the cost of a branch misprediction. And the fast path for ASCII characters should benefit from removing the branch.

Anyway, I'll test out these various options and see what the results are.
One last question: why is this the "best codegen":

```asm
xor   eax, eax
test  dil, dil
cmovg eax, edi
```

rather than the other option:

```asm
test   dil, dil
mov    eax, 128
cmovns eax, edi
```

Is it because `xor` is faster than `mov`?
> Is it because `xor` is faster than `mov`?

The `xor` is smaller than the `mov`: `xor eax, eax` encodes in 2 bytes, where `mov eax, 128` takes 5. Fewer bytes = less icache pressure. That's my thinking anyway. I could be totally wrong.
It's possible that the cmov is uop-fused with the mov, making mov faster than xor. 🤷♀️ I am not an optimization guru. Measure measure measure.
Thanks very much for your comments. #2039 implements your first suggestion, and it does produce a +1% performance gain. I'm glad you questioned my assumptions, as they turned out to be wrong!
As suggested by @strager in #2025 (review), this PR adds `BYTE_HANDLERS` entries for the first bytes of unicode characters. This removes a branch from `read_next_token()` and produces a +1% speed-up on parser benchmarks.

2 related changes to the lexer's `read_next_token()`:

1. Hint to the branch predictor that unicode identifiers and non-standard whitespace are rare by marking that branch `#[cold]`.
2. The branch tests whether the next character is ASCII or not. This check only requires reading 1 byte, as ASCII characters are always a single byte in UTF-8. So the work of getting a `char` is done only in the cold path, once it's established that the character is not ASCII and that work is required.
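As a quick illustration of the 1-byte point (this is standard UTF-8 behaviour, not oxc code): ASCII characters occupy exactly one byte, and every non-ASCII character's first byte is >= 0x80, so reading a single byte is enough to pick the branch.

```rust
fn main() {
    // 'a' is ASCII: one byte, value < 0x80.
    assert_eq!("a".len(), 1);
    assert!("a".as_bytes()[0] < 0x80);

    // 'é' (U+00E9) is non-ASCII: two bytes in UTF-8 (0xC3, 0xA9),
    // and its first byte is >= 0x80.
    assert_eq!("é".len(), 2);
    assert!("é".as_bytes()[0] >= 0x80);
}
```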