perf(parser): lexer match byte not char #2025
CodSpeed Performance Report: Merging #2025 will not alter performance.
+3% speed-up on the parser benchmarks.

I don't entirely know why this works either. But std's
```rust
let size = c as usize;
let byte = remaining.as_bytes()[0];
let kind = if byte < 128 {
    BYTE_HANDLERS[byte as usize](self)
```
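For context, here is a minimal self-contained sketch of the pattern this diff uses, under the assumption of simplified stand-in types (`Lexer`, `Kind`, and the handler names below are illustrative, not oxc's actual definitions): dispatch ASCII bytes through a 128-entry table, and fall back to a `#[cold]` function for non-ASCII first bytes.

```rust
struct Lexer<'a> {
    remaining: &'a str,
}

#[derive(Debug, PartialEq)]
enum Kind {
    Ident,
    // ... one variant per token kind in the real lexer
}

type ByteHandler = fn(&mut Lexer) -> Kind;

// One handler per ASCII byte value. (The real table has distinct
// handlers per byte; a single placeholder keeps the sketch short.)
static BYTE_HANDLERS: [ByteHandler; 128] = [handle_ident; 128];

fn handle_ident(_lexer: &mut Lexer) -> Kind {
    Kind::Ident
}

// Rare path: only decode a full `char` once we know it's not ASCII.
#[cold]
fn match_unicode_char(lexer: &mut Lexer) -> Kind {
    let _c = lexer.remaining.chars().next().unwrap();
    Kind::Ident // placeholder: real code classifies the decoded char
}

fn read_next_token(lexer: &mut Lexer) -> Kind {
    // Assumes `remaining` is non-empty; the real lexer handles EOF first.
    let byte = lexer.remaining.as_bytes()[0];
    if byte < 128 {
        // ASCII: one byte fully identifies the character.
        BYTE_HANDLERS[byte as usize](lexer)
    } else {
        match_unicode_char(lexer)
    }
}

fn main() {
    let mut lexer = Lexer { remaining: "abc" };
    assert_eq!(read_next_token(&mut lexer), Kind::Ident);
}
```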
Why not fill `BYTE_HANDLERS[128..256]` with `self.match_unicode_char`?
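Sketched out, the suggestion might look like this, reusing the `ByteHandler`, `Lexer`, and handler definitions from the sketch above (again illustrative, not oxc's code): grow the table to 256 entries and point every non-ASCII slot at the unicode handler, so dispatch needs no range check at all.

```rust
// 256 entries: 0..=127 keep their ASCII handlers, while 128..=255
// all point at the unicode handler.
static BYTE_HANDLERS_256: [ByteHandler; 256] = {
    let mut table: [ByteHandler; 256] = [handle_ident; 256];
    let mut i = 128;
    while i < 256 {
        table[i] = match_unicode_char;
        i += 1;
    }
    table
};

fn read_next_token_branchless(lexer: &mut Lexer) -> Kind {
    // Every possible first-byte value has a handler, so the
    // `byte < 128` branch disappears.
    let byte = lexer.remaining.as_bytes()[0];
    BYTE_HANDLERS_256[byte as usize](lexer)
}
```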
My assumption is that it's ideal for `BYTE_HANDLERS` to be as small as possible, so that as much of it as possible stays in cache. It's already 1 KiB (128 × 8 bytes), and I imagine loading chunks of it into cache is a slow-down.

Therefore I imagine the cost of a branch is probably less than the cost of loading a chunk of `BYTE_HANDLERS` into cache for values 128-255. Especially as the CPU likely won't be able to speculatively execute beyond the `BYTE_HANDLERS[byte as usize](self)` call, as it's too unpredictable where it'll jump.

If anything, I wonder whether taking other steps to reduce the size of `BYTE_HANDLERS` might be worthwhile.

However, I am pretty new to all this stuff about speculative execution and branch prediction, so this is total guesswork. If you think my assumptions may be incorrect, please say.
Oh wait a second! I just realised where I know you from. I saw your video about perfect hash tables. You definitely know more than me. So please do give your opinion on my assumptions.
> I imagine the cost of a branch is probably less than the cost of loading a chunk of `BYTE_HANDLERS` into cache for values 128-255.

Perhaps. This is worth measuring.

Another option: Map 128..256 to index 0 and teach `BYTE_HANDLERS[0]` to handle non-ASCII.

Another option: Map 128..256 to index 127 and teach `BYTE_HANDLERS[127]` to handle non-ASCII.

Another option: Map 128..256 to index 128 and grow `BYTE_HANDLERS` by one entry.

Of these three new options, reusing index 0 seems to produce the cleanest code on x86_64: https://godbolt.org/z/E1Esn77nf
```asm
xor   eax, eax                                 ; 0
test  dil, dil                                 ; check whether the character is in range
cmovg eax, edi                                 ; choose the index (either the character or 0)
movzx eax, al                                  ; clear top bits of the index
mov   rcx, qword ptr [rip + HANDLERS@GOTPCREL]
mov   rax, qword ptr [rcx + 8*rax]
```
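Translated back into Rust, the index computation behind that codegen might look something like the following (a sketch, not actual source; note that the signed compare also sends byte 0 to index 0, so the index-0 handler would have to cover NUL as well as non-ASCII):

```rust
fn handler_index(byte: u8) -> usize {
    // Reinterpreting the byte as signed makes 128..=255 negative, so
    // `> 0` is false for them (and for 0). The compiler can lower
    // this to `test` + `cmovg` rather than a branch.
    if (byte as i8) > 0 {
        byte as usize
    } else {
        0
    }
}
```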
Thanks for coming back, and for the suggestions.

After further thought, perhaps your original suggestion was the best after all. I guess it's not the size of `BYTE_HANDLERS` which is the problem in itself, but the cost of reading it. So in the common case where the source code is all ASCII, the second half of `BYTE_HANDLERS` for 128-255 would never be accessed, so it shouldn't incur any cost. There'd only be a performance penalty when it is accessed, which is the same deal as now, where the `#[cold]` branch incurs the cost of a branch misprediction. And the fast path for ASCII characters should benefit from removing the branch.

Anyway, I'll test out these various options and see what the results are.
One last question: why is this the "best codegen":

```asm
xor   eax, eax
test  dil, dil
cmovg eax, edi
```

rather than the other option:

```asm
test   dil, dil
mov    eax, 128
cmovns eax, edi
```

Is it because `xor` is faster than `mov`?
> Is it because `xor` is faster than `mov`?

The `xor` is smaller than the `mov`: `xor eax, eax` encodes in 2 bytes, where `mov eax, 128` takes 5. Fewer bytes = less icache pressure. That's my thinking anyway. I could be totally wrong.
It's possible that the cmov is uop-fused with the mov, making mov faster than xor. 🤷♀️ I am not an optimization guru. Measure measure measure.
Thanks very much for your comments. #2039 implements your first suggestion, and it does produce a +1% performance gain. I'm glad you questioned my assumptions, as they turned out to be wrong!
As suggested by @strager in #2025 (review), this PR adds `BYTE_HANDLERS` entries for the first bytes of unicode characters. This removes a branch from `read_next_token()` and produces a +1% speed-up on parser benchmarks.

2 related changes to the lexer's `read_next_token()`:

1. Hint to the branch predictor that unicode identifiers and non-standard whitespace are rare by marking that branch `#[cold]`.
2. The branch tests whether the next character is ASCII or not. This check only requires reading 1 byte, as ASCII characters are always a single byte in UTF-8. So the work of getting a `char` is done only in the cold path, once it's established that the character is not ASCII and that work is required.
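As a quick illustration of the 1-byte point (this is standard UTF-8 behaviour, not oxc code): ASCII characters occupy exactly one byte, and every non-ASCII character's first byte is >= 0x80, so reading a single byte is enough to pick the branch.

```rust
fn main() {
    // 'a' is ASCII: one byte, value < 0x80.
    assert_eq!("a".len(), 1);
    assert!("a".as_bytes()[0] < 0x80);

    // 'é' (U+00E9) is non-ASCII: two bytes in UTF-8 (0xC3, 0xA9),
    // and its first byte is >= 0x80.
    assert_eq!("é".len(), 2);
    assert!("é".as_bytes()[0] >= 0x80);
}
```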