perf(parser): lexer replace `Chars` with `Source` #2288
CodSpeed Performance Report: merging #2288 will improve performance by 9.41%.
The semantic performance change is flaky, we can safely ignore it.
Don't worry about the CodSpeed report, feel free to merge once you're satisfied with the code since this is still marked as a draft.
I'm a bit dubious about CodSpeed's flame graph diffs. I think the
I'm still a bit concerned it may possibly indicate UB. Miri run completed last night but the output was unreadable because it found 1150 memory leaks! (the ones we already know about) Am running it again with memory leak detection disabled. Will report back. Can I ask: Have you ever seen flaky results with such a large margin (4%) before? Also, now I've pushed the extra commit which added a comment, it's got much worse! (though that may be in part due to unrelated changes on
I rebased and the report is correct now. CodSpeed sometimes gets flaky when there is a large memory allocation, don't worry about it.
Don't forget to look at the absolute numbers. Remember, it's only a regression when all of the benchmarks come back with a bad number.
Force-pushed from dfcfbec to 77cd9b8.
Have rebased on top of #2301 to see if the lexer flame graph with named functions sheds any more light on the weird semantic benchmark. Am also running Miri again (4 hours and counting!). If it doesn't find anything, I think we're all good. But I am nervous about merging this until Miri gives the all clear. I'm reassured to hear that the semantic benchmarks are known to be flaky, but still there's so much unsafe raw pointer action in this PR, it'd be easy for me to have made a mistake which causes UB. So I think it's best to err on the cautious side.
I'll run all the fuzzers when I wake up.
Force-pushed from 77cd9b8 to 5d1402c.
Here's what you can try:
random bytes fuzzer: the parser timed out on
AST fuzzer ran for 30k iterations and found no issues.
I think we are good to go after the merge conflict is resolved.
Force-pushed from 5d1402c to fb1ee12.
Force-pushed from 1c62665 to 51da40b.
Thanks very much for running the fuzzer. It feels like we can be pretty confident now that there's no problem with the changes introduced in this PR. Have rebased on main, and now ready for merge once CI has passed. Weirdly
CI passes. Have merged. Hooray! Thanks loads for all your help on this @Boshen.
Make `Source::set_position` a safe function. This addresses a shortcoming of #2288. Instead of requiring caller of `Source::set_position` to guarantee that the `SourcePosition` is created from this `Source`, the preceding PRs enforce this guarantee at the type level. `Source::set_position` is going to be a central API for transitioning the lexer to processing the source as bytes, rather than `char`s (and the anticipated speed-ups that will produce). So making this method safe will remove the need for a *lot* of unsafe code blocks, and boilerplate comments promising "SAFETY: There's only one `Source`", when to the developer, this is blindingly obvious anyway. So, while splitting the parser into `Parser` and `ParserImpl` (#2339) is an annoying change to have to make, I believe the benefit of this PR justifies it.
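One possible shape for such a type-level guarantee is sketched below. This is an illustration only, not code from the PRs mentioned: `UniqueToken`, `count_chars_twice` and the simplified `Source` are hypothetical names. The idea is that the "there's only one `Source`" promise is made once, where the `Source` is constructed, so `set_position` itself no longer needs to be `unsafe`.

```rust
use std::marker::PhantomData;

/// Hypothetical zero-sized token. Holding one is a promise that positions
/// from any other `Source` can never reach the `Source` built with it.
pub struct UniqueToken(());

impl UniqueToken {
    /// # Safety
    /// The caller must uphold the promise described on the type.
    pub unsafe fn new() -> Self {
        Self(())
    }
}

/// Simplified stand-in for `SourcePosition`: a pointer into the text.
#[derive(Clone, Copy)]
pub struct SourcePosition<'a> {
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

/// Simplified stand-in for `Source`.
/// Invariant: `ptr` is inside `text` and on a UTF-8 char boundary.
pub struct Source<'a> {
    text: &'a str,
    ptr: *const u8,
}

impl<'a> Source<'a> {
    /// Construction requires a token, so every call site has made the promise.
    pub fn new(text: &'a str, _token: UniqueToken) -> Self {
        Self { text, ptr: text.as_ptr() }
    }

    pub fn next_char(&mut self) -> Option<char> {
        let offset = self.ptr as usize - self.text.as_ptr() as usize;
        let c = self.text[offset..].chars().next()?;
        // SAFETY: `c` occupies the next `len_utf8()` bytes of `text`, so the
        // new pointer is in bounds and on a char boundary.
        self.ptr = unsafe { self.ptr.add(c.len_utf8()) };
        Some(c)
    }

    pub fn position(&self) -> SourcePosition<'a> {
        SourcePosition { ptr: self.ptr, _marker: PhantomData }
    }

    /// Safe in this sketch: under the token promise, `pos` can only have
    /// come from `self`, so it already satisfies the invariant.
    pub fn set_position(&mut self, pos: SourcePosition<'a>) {
        self.ptr = pos.ptr;
    }
}

/// Hypothetical driver: the single place where the promise is made.
pub fn count_chars_twice(text: &str) -> (usize, usize) {
    // SAFETY: the `Source` below never leaves this function, and nothing in
    // this function creates a second one.
    let token = unsafe { UniqueToken::new() };
    let mut source = Source::new(text, token);
    let start = source.position();

    let mut first = 0;
    while source.next_char().is_some() {
        first += 1;
    }

    source.set_position(start); // no `unsafe` block needed here

    let mut second = 0;
    while source.next_char().is_some() {
        second += 1;
    }
    (first, second)
}
```

In this sketch the one remaining `unsafe` block sits in the driver, which is the role a wrapper that fully owns the lexer could play.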
This PR replaces the `Chars` iterator in the lexer with a new structure `Source`.

## What it does

`Source` holds the source text, and allows:

* Iterating through source text char-by-char (same as `Chars` did).
* Iterating byte-by-byte.
* Getting a `SourcePosition` for current position, which can be used later to rewind to that position, without having to clone the entire `Source` struct.

`Source` has the same invariants as `Chars` - cursor must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character).

However, unsafe APIs are provided to allow a caller to temporarily break that invariant, as long as they satisfy it again before they pass control back to safe code. This will be useful for processing batches of bytes.

## Why

I envisage most of the Lexer migrating to byte-by-byte iteration, and I believe it'll make a significant impact on performance.

It will allow efficiently processing batches of bytes (e.g. consuming identifiers or whitespace) without the overhead of calculating code points for every character. It should also make all the many `peek()`, `next_char()` and `next_eq()` calls faster.

`Source` is also more performant than `Chars` in itself. This wasn't my intent, but seems to be a pleasant side-effect of it being less opaque to the compiler than `Chars`, so it can apply more optimizations.

In addition, because checkpoints don't need to store the entire `Source` struct, but only a `SourcePosition` (8 bytes), was able to reduce the size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both `Copy`.

## Notes on implementation

`Source` is heavily based on Rust's `std::str::Chars` and `std::slice::Iter` iterators and I've copied the code/concepts from them as much as possible.

As it's a low-level primitive, it uses raw pointers and contains a *lot* of unsafe code.
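To make the description above concrete, here is a heavily simplified sketch of what a `Source`-like type could look like. It is not the code from this PR: the field layout and the `offset`, `remaining`, `peek_byte` and `skip_ascii_blanks` methods are assumptions made for illustration. It shows char-by-char and byte-by-byte consumption over raw pointers, and a pointer-sized `SourcePosition` that can be copied and used to rewind.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for `SourcePosition`: just a pointer into the text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SourcePosition<'a> {
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

/// Hypothetical stand-in for `Source`.
/// Invariant: `start <= ptr <= end`, and `ptr` is on a UTF-8 char boundary.
pub struct Source<'a> {
    start: *const u8,
    end: *const u8,
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

impl<'a> Source<'a> {
    pub fn new(text: &'a str) -> Self {
        let range = text.as_bytes().as_ptr_range();
        Self { start: range.start, end: range.end, ptr: range.start, _marker: PhantomData }
    }

    /// Byte offset of the cursor from the start of the text.
    pub fn offset(&self) -> usize {
        self.ptr as usize - self.start as usize
    }

    /// The text that has not been consumed yet.
    pub fn remaining(&self) -> &'a str {
        let len = self.end as usize - self.ptr as usize;
        // SAFETY: `ptr..end` lies inside the original `&'a str` and, by the
        // invariant, starts on a char boundary, so it is valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.ptr, len)) }
    }

    /// Look at the next byte without consuming it.
    pub fn peek_byte(&self) -> Option<u8> {
        if self.ptr == self.end {
            None
        } else {
            // SAFETY: `ptr < end`, so it points at a readable byte.
            Some(unsafe { *self.ptr })
        }
    }

    /// Consume one `char`, decoding UTF-8 (the job `Chars` used to do).
    pub fn next_char(&mut self) -> Option<char> {
        let c = self.remaining().chars().next()?;
        // SAFETY: `c` occupies the next `len_utf8()` bytes, so the new `ptr`
        // is in bounds and on a char boundary.
        self.ptr = unsafe { self.ptr.add(c.len_utf8()) };
        Some(c)
    }

    /// Consume a run of ASCII spaces and tabs byte-by-byte, with no decoding.
    pub fn skip_ascii_blanks(&mut self) {
        while let Some(b' ' | b'\t') = self.peek_byte() {
            // SAFETY: not at the end, and an ASCII byte is a whole char, so
            // `ptr + 1` is in bounds and on a char boundary.
            self.ptr = unsafe { self.ptr.add(1) };
        }
    }

    /// Cheap checkpoint: copies one pointer, not the whole `Source`.
    pub fn position(&self) -> SourcePosition<'a> {
        SourcePosition { ptr: self.ptr, _marker: PhantomData }
    }

    /// # Safety
    /// `pos` must have been created from this `Source`.
    pub unsafe fn set_position(&mut self, pos: SourcePosition<'a>) {
        self.ptr = pos.ptr;
    }
}
```

A caller would take `source.position()` before speculative lexing and pass it back to `set_position` inside an `unsafe` block to rewind. Because the position is a single pointer, a checkpoint built around it can be `Copy`.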
I *think* I've crossed the T's and dotted the I's, and I've commented the code extensively, but I'd appreciate a close review if anyone has time.

I've split it into 2 commits.

* First commit is all the substantive changes.
* 2nd commit just does away with `lexer.current` which is no longer needed, and replaces `lexer.current.token` with `lexer.token` everywhere.

Hopefully looking just at the 1st commit will reduce the noise and make it easier to review.

### `SourcePosition`

There is one annoyance with the API which I haven't been able to solve:

`SourcePosition` is a wrapper around a pointer, which can only be created from the current position of `Source`. Due to the invariant mentioned above, `SourcePosition` is therefore always in bounds of the source text, and points to a UTF-8 character boundary. So `Source` can be rewound to a `SourcePosition` cheaply, without any checks. I had originally envisaged `Source::set_position` being a safe function, as `SourcePosition` enforces the necessary invariants itself.

The fly in the ointment is that a `SourcePosition` could theoretically have been created from *another* `Source`. If that were the case, it would be out of bounds, and it would be instant UB. Consequently, `Source::set_position` has to be an unsafe function.

This feels rather ridiculous. *Of course* the parser won't create 2 Lexers at the same time. But still it's *possible*, so I think it's better to take the strict approach and make it unsafe until I can find a way to statically prove the safety by some other means. Any ideas?

## Oddity in the benchmarks

There's something really odd going on with the semantic benchmark for `pdf.mjs`.

While I was developing this, small and seemingly irrelevant changes would flip that benchmark from +0.5% or so to -4%, and then another small change would flip it back.
What I don't understand is that parsing happens outside of the measurement loop in the semantic benchmark, so the parser shouldn't have *any* effect either way on semantic's benchmarks.

If CodSpeed's flame graph is to be believed, most of the negative effect appears to be a large Vec reallocation happening somewhere in semantic.

I've ruled out a few things: The AST produced by the parser for `pdf.mjs` after this PR is identical to what it was before. And semantic's `nodes` and `scopes` Vecs are the same length as they were before. Nothing seems to have changed!

I really am at a loss to explain it. Have you seen anything like this before?

One possibility is a fault in my unsafe code which is manifesting only with `pdf.mjs`, and it's triggering UB, which I guess could explain the weird effects. I'm running the parser on `pdf.mjs` in Miri now and will see if it finds anything (Miri doesn't find any problem running the tests). It's been running for over an hour now. Hopefully it'll be done by morning!

I feel like this shouldn't be merged until that question is resolved, so marking this as draft in the meantime.
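Returning to the `SourcePosition` caveat in the implementation notes: the sketch below shows why the compiler cannot rule out a position from another `Source`, and what a checked alternative could look like. This is not what the PR does (it keeps the rewind check-free and marks it `unsafe` instead). `try_set_position` and the simplified types are hypothetical, and the runtime check is exactly the cost the PR is trying to avoid.

```rust
use std::marker::PhantomData;

/// Simplified stand-in for `SourcePosition`: a pointer into some source text.
#[derive(Clone, Copy)]
pub struct SourcePosition<'a> {
    ptr: *const u8,
    _marker: PhantomData<&'a str>,
}

/// Simplified stand-in for `Source`.
/// Invariant: `ptr` is inside `text` and on a UTF-8 char boundary.
pub struct Source<'a> {
    text: &'a str,
    ptr: *const u8,
}

impl<'a> Source<'a> {
    pub fn new(text: &'a str) -> Self {
        Self { text, ptr: text.as_ptr() }
    }

    pub fn offset(&self) -> usize {
        self.ptr as usize - self.text.as_ptr() as usize
    }

    pub fn next_char(&mut self) -> Option<char> {
        let c = self.text[self.offset()..].chars().next()?;
        // SAFETY: `c` occupies the next `len_utf8()` bytes of `text`.
        self.ptr = unsafe { self.ptr.add(c.len_utf8()) };
        Some(c)
    }

    pub fn position(&self) -> SourcePosition<'a> {
        SourcePosition { ptr: self.ptr, _marker: PhantomData }
    }

    /// Hypothetical checked rewind: returns `false` and leaves the cursor
    /// alone if `pos` does not land on a char boundary inside *this* text.
    pub fn try_set_position(&mut self, pos: SourcePosition<'a>) -> bool {
        let start = self.text.as_ptr() as usize;
        // `checked_sub` rejects pointers before the start of the text;
        // `is_char_boundary` rejects offsets past the end or mid-character.
        match (pos.ptr as usize).checked_sub(start) {
            Some(offset) if self.text.is_char_boundary(offset) => {
                // Rebuild the pointer from our own base rather than trusting
                // `pos.ptr`. SAFETY: `offset <= text.len()`.
                self.ptr = unsafe { self.text.as_ptr().add(offset) };
                true
            }
            _ => false,
        }
    }
}

/// Mixing positions from two sources type-checks, because the lifetimes of
/// two live strings unify. Only the runtime check notices the mix-up.
pub fn mix_up_is_rejected() -> bool {
    let text_a = String::from("let x = 1;");
    let text_b = String::from("const y = 2;");
    let mut a = Source::new(&text_a);
    let mut b = Source::new(&text_b);
    b.next_char();
    let pos_b = b.position();
    !a.try_set_position(pos_b)
}
```

With an unchecked `set_position`, the call in `mix_up_is_rejected` would leave `a` pointing into `text_b`, which is the out-of-bounds case described above.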