
perf(parser): lexer match byte not char #2025

Merged
merged 1 commit into from
Jan 14, 2024

Conversation

overlookmotel
Contributor

2 related changes to the lexer's `read_next_token()`:

  1. Hint to the branch predictor that unicode identifiers and non-standard whitespace are rare by marking that branch `#[cold]`.

  2. The branch is on whether the next character is ASCII or not. This check only requires reading 1 byte, as ASCII characters are always a single byte in UTF-8. So only do the work of decoding a `char` in the cold path, once it's established that the character is not ASCII and this work is required.
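The two changes above can be sketched roughly like this (a simplified illustration, not the actual oxc lexer code; the function names and return values here are hypothetical):

```rust
// Cold path: only reached for non-ASCII input, so mark it #[cold] to
// hint the branch predictor and keep it out of the hot code path.
#[cold]
fn handle_non_ascii(c: char) -> &'static str {
    if c.is_alphabetic() { "unicode identifier" } else { "other" }
}

fn read_next_token(remaining: &str) -> &'static str {
    // Reading 1 byte is enough to decide: ASCII is always a single byte
    // in UTF-8, so bytes < 128 can never start a multi-byte character.
    let byte = remaining.as_bytes()[0];
    if byte < 128 {
        // ASCII fast path: no char decoding needed.
        "ascii"
    } else {
        // Only decode a full (possibly multi-byte) char once we know
        // the byte wasn't ASCII.
        let c = remaining.chars().next().unwrap();
        handle_non_ascii(c)
    }
}
```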

@github-actions github-actions bot added the A-parser Area - Parser label Jan 14, 2024

codspeed-hq bot commented Jan 14, 2024

CodSpeed Performance Report

Merging #2025 will not alter performance

Comparing overlookmotel:lexer-bytes (a79b52d) with main (a356918)

Summary

✅ 14 untouched benchmarks

@overlookmotel
Contributor Author

+3% speed-up on the parser benchmarks.

@Boshen
Member

Boshen commented Jan 14, 2024

[image]

Wat.

My guess is that this also improves the CPU instruction cache?

I don't understand computers 🤷

@Boshen Boshen merged commit 60a927d into oxc-project:main Jan 14, 2024
17 checks passed
@overlookmotel overlookmotel deleted the lexer-bytes branch January 14, 2024 13:25
@overlookmotel
Contributor Author

overlookmotel commented Jan 14, 2024

> I don't understand computers 🤷

I don't entirely know why this works either. But std's `Vec` uses the same trick to hint that, when you push to the end of a `Vec`, the case where sufficient capacity is not already available is rare. I got the idea from there.

https://doc.rust-lang.org/src/alloc/raw_vec.rs.html#278-295
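The pattern from `raw_vec` can be sketched as a toy illustration of the technique (not the real `Vec` internals): split the rare "need more capacity" path into a `#[cold]`, `#[inline(never)]` function so the in-capacity push is treated as the hot path.

```rust
// Toy stand-in for a growable buffer; `len`/`cap` are just counters here.
struct TinyBuf {
    len: usize,
    cap: usize,
}

impl TinyBuf {
    // Rare path: out of capacity. #[cold] hints the branch predictor;
    // #[inline(never)] keeps the slow path out of the caller's code.
    #[cold]
    #[inline(never)]
    fn grow(&mut self) {
        self.cap = (self.cap * 2).max(4);
    }

    fn push(&mut self) {
        if self.len == self.cap {
            self.grow();
        }
        // Hot path: capacity already available.
        self.len += 1;
    }
}
```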

```rust
let size = c as usize;
let byte = remaining.as_bytes()[0];
let kind = if byte < 128 {
    BYTE_HANDLERS[byte as usize](self)
```
Contributor

Why not fill `BYTE_HANDLERS[128..256]` with `self.match_unicode_char`?
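The suggestion could look roughly like this (a hypothetical sketch with toy handler signatures; the real table holds handlers taking the lexer): extend the table to 256 entries so every non-ASCII first byte dispatches to one unicode handler, with no `byte < 128` branch at all.

```rust
type Handler = fn(u8) -> &'static str;

fn ascii_handler(_byte: u8) -> &'static str {
    "ascii"
}

fn unicode_handler(_byte: u8) -> &'static str {
    "unicode"
}

// Indices 0..128 keep their ASCII handlers; every index in 128..256
// points at the single non-ASCII handler. Built at compile time.
static BYTE_HANDLERS: [Handler; 256] = {
    let mut table: [Handler; 256] = [ascii_handler; 256];
    let mut i = 128;
    while i < 256 {
        table[i] = unicode_handler;
        i += 1;
    }
    table
};

// Branchless dispatch on the first byte.
fn dispatch(byte: u8) -> &'static str {
    BYTE_HANDLERS[byte as usize](byte)
}
```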

Contributor Author

My assumption is that it's ideal for `BYTE_HANDLERS` to be as small as possible, so that as much of it stays in cache as possible. It's already 1 KiB (128 x 8 bytes) and I imagine loading chunks of it into cache is a slow-down.

Therefore I imagine the cost of a branch is probably less than the cost of loading chunks of `BYTE_HANDLERS` into cache for values 128-255. Especially as the CPU likely won't be able to speculatively execute beyond the `BYTE_HANDLERS[byte as usize](self)` call, as it's too unpredictable which way it'll jump.

If anything, I wonder if taking other steps to reduce the size of BYTE_HANDLERS might be worthwhile.

However, I am pretty new to all this stuff about speculative branching and suchlike, so this is total guesswork. If you think my assumptions may be incorrect, please say.

Contributor Author

@overlookmotel overlookmotel Jan 14, 2024

Oh wait a second! I just realised where I know you from. I saw your video about perfect hash tables. You definitely know more than me. So please do give your opinion on my assumptions.

Contributor

@strager strager Jan 14, 2024

> I imagine the cost of a branch is probably less than the cost of loading a chunk of BYTE_HANDLERS into memory for values 128-255.

Perhaps. This is worth measuring.

Another option: Map 128..256 to index 0 and teach BYTE_HANDLERS[0] to handle non-ASCII.

Another option: Map 128..256 to index 127 and teach BYTE_HANDLERS[127] to handle non-ASCII.

Another option: Map 128..256 to index 128 and grow BYTE_HANDLERS by one entry.

Of these three new options, reusing index 0 seems to produce the cleanest code on x86_64: https://godbolt.org/z/E1Esn77nf

```asm
xor     eax, eax    ; 0
test    dil, dil    ; check whether the character is in range
cmovg   eax, edi    ; choose the index (either the character or 0)
movzx   eax, al     ; clear top bits of the index
mov     rcx, qword ptr [rip + HANDLERS@GOTPCREL]
mov     rax, qword ptr [rcx + 8*rax]
```
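In Rust, index selection that compiles down to a `cmov` sequence like the one above can be sketched as follows (a guess at the shape, not the actual patch):

```rust
// Branchless mapping of bytes 128..=255 down to index 0, whose handler
// would deal with non-ASCII input. `byte as i8` is negative for bytes
// >= 128, so a signed "greater than zero" test selects between the byte
// itself and 0 -- which compilers typically lower to a cmov rather than
// a branch. (Byte 0 maps to index 0 either way.)
fn handler_index(byte: u8) -> usize {
    if (byte as i8) > 0 { byte as usize } else { 0 }
}
```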

Contributor Author

@overlookmotel overlookmotel Jan 15, 2024

Thanks for coming back, and for the suggestions.

After further thought, perhaps your original suggestion was the best after all. I guess it's not the size of BYTE_HANDLERS which is the problem in itself, but the cost of reading it. So in the common case where source code is all ASCII, the 2nd half of BYTE_HANDLERS for 128-255 would not be accessed, so it shouldn't incur any cost. There'd only be a performance penalty when it is accessed, which is the same deal as now with the #[cold] branch incurring the cost of a branch misprediction.

The fast path for ASCII characters should benefit from removing the branch.

Anyway, I'll test out these various options, and see what the result is.

One last question: Why is this the "best codegen":

```asm
xor     eax, eax
test    dil, dil
cmovg   eax, edi
```

rather than the other option:

```asm
test    dil, dil
mov     eax, 128
cmovns  eax, edi
```

`xor` is faster than `mov`?

Contributor

> xor is faster than mov?

The `xor` is smaller than the `mov`. Fewer bytes = less icache pressure. That's my thinking anyway. I could be totally wrong.

It's possible that the cmov is uop-fused with the mov, making mov faster than xor. 🤷‍♀️ I am not an optimization guru. Measure measure measure.

Contributor Author

@overlookmotel overlookmotel Jan 15, 2024

Thanks very much for your comments. #2039 implements your first suggestion, and does produce a +1% performance gain. I'm glad you questioned my assumptions, as they turned out to be wrong!

Boshen pushed a commit that referenced this pull request Jan 16, 2024
As suggested by @strager in
#2025 (review),
this PR adds `BYTE_HANDLERS` for first bytes of unicode characters.

This removes a branch from `read_next_token()` and produces a +1%
speed-up on parser benchmarks.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024