refactor(parser): lexer replace Chars with Source (#2288)

This PR replaces the `Chars` iterator in the lexer with a new structure `Source`. ## What it does `Source` holds the source text, and allows: * Iterating through source text char-by-char (same as `Chars` did). * Iterating byte-by-byte. * Getting a `SourcePosition` for current position, which can be used later to rewind to that position, without having to clone the entire `Source` struct. `Source` has the same invariants as `Chars` - cursor must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode character). However, unsafe APIs are provided to allow a caller to temporarily break that invariant, as long as they satisfy it again before they pass control back to safe code. This will be useful for processing batches of bytes. ## Why I envisage most of the Lexer migrating to byte-by-byte iteration, and I believe it'll make a significant impact on performance. It will allow efficiently processing batches of bytes (e.g. consuming identifiers or whitespace) without the overhead of calculating code points for every character. It should also make all the many `peek()`, `next_char()` and `next_eq()` calls faster. `Source` is also more performant than `Chars` in itself. This wasn't my intent, but seems to be a pleasant side-effect of it being less opaque to the compiler than `Chars`, so it can apply more optimizations. In addition, because checkpoints don't need to store the entire `Source` struct, but only a `SourcePosition` (8 bytes), was able to reduce the size of `LexerCheckpoint` and `ParserCheckpoint`, and make them both `Copy`. ## Notes on implementation `Source` is heavily based on Rust's `std::str::Chars` and `std::slice::Iter` iterators and I've copied the code/concepts from them as much as possible. As it's a low-level primitive, it uses raw pointers and contains a *lot* of unsafe code. I *think* I've crossed the T's and dotted the I's, and I've commented the code extensively, but I'd appreciate a close review if anyone has time. I've split it into 2 commits. * First commit is all the substantive changes. * 2nd commit just does away with `lexer.current` which is no longer needed, and replaces `lexer.current.token` with `lexer.token` everywhere. Hopefully looking just at the 1st commit will reduce the noise and make it easier to review. ### `SourcePosition` There is one annoyance with the API which I haven't been able solve: `SourcePosition` is a wrapper around a pointer, which can only be created from the current position of `Source`. Due to the invariant mentioned above, therefore `SourcePosition` is always in bounds of the source text, and points to a UTF-8 character boundary. So `Source` can be rewound to a `SourcePosition` cheaply, without any checks. I had originally envisaged `Source::set_position` being a safe function, as `SourcePosition` enforces the necessary invariants itself. The fly in the ointment is that a `SourcePosition` could theoretically have been created from *another* `Source`. If that was the case, it would be out of bounds, and it would be instant UB. Consequently, `Source::set_position` has to be an unsafe function. This feels rather ridiculous. *Of course* the parser won't create 2 Lexers at the same time. But still it's *possible*, so I think better to take the strict approach and make it unsafe until can find a way to statically prove the safety by some other means. Any ideas? ## Oddity in the benchmarks There's something really odd going on with the semantic benchmark for `pdf.mjs`. While I was developing this, small and seemingly irrelevant changes would flip that benchmark from +0.5% or so to -4%, and then another small change would flip it back. What I don't understand is that parsing happens outside of the measurement loop in the semantic benchmark, so the parser shouldn't have *any* effect either way on semantic's benchmarks. If CodSpeed's flame graph is to be believed, most of the negative effect appears to be a large Vec reallocation happening somewhere in semantic. I've ruled out a few things: The AST produced by the parser for `pdf.mjs` after this PR is identical to what it was before. And semantic's `nodes` and `scopes` Vecs are same length as they were before. Nothing seems to have changed! I really am at a loss to explain it. Have you seen anything like this before? One possibility is a fault in my unsafe code which is manifesting only with `pdf.mjs`, and it's triggering UB, which I guess could explain the weird effects. I'm running the parser on `pdf.mjs` in Miri now and will see if it finds anything (Miri doesn't find any problem running the tests). It's been running for over an hour now. Hopefully it'll be done by morning! I feel like this shouldn't merged until that question is resolved, so marking this as draft in the meantime.
oxc-project · Feb 5, 2024 · cdef41d · cdef41d
1 parent d9bea52
commit cdef41d
Show file tree

Hide file tree

Showing 13 changed files with 525 additions and 109 deletions.
diff --git a/crates/oxc_parser/src/cursor.rs b/crates/oxc_parser/src/cursor.rs
@@ -10,6 +10,7 @@ use crate::{
     Context, Parser,
 };
 
+#[derive(Clone, Copy)]
 pub struct ParserCheckpoint<'a> {
     lexer: LexerCheckpoint<'a>,
     cur_token: Token,
@@ -254,7 +255,9 @@ impl<'a> Parser<'a> {
         let ParserCheckpoint { lexer, cur_token, prev_span_end, errors_pos: errors_lens } =
             checkpoint;
 
-        self.lexer.rewind(lexer);
+        // SAFETY: Parser only ever creates a single `Lexer`,
+        // therefore all checkpoints must be created from it.
+        unsafe { self.lexer.rewind(lexer) };
         self.token = cur_token;
         self.prev_token_end = prev_span_end;
         self.errors.truncate(errors_lens);

diff --git a/crates/oxc_parser/src/lexer/byte_handlers.rs b/crates/oxc_parser/src/lexer/byte_handlers.rs
@@ -6,9 +6,8 @@ use crate::diagnostics;
 ///
 /// SAFETY:
 /// * Lexer must not be at end of file.
-/// * `byte` must be next byte of source code, corresponding to current position
-///   of `lexer.current.chars`.
-/// * Only `BYTE_HANDLERS` for ASCII characters may use the `ascii_byte_handler!` macro.
+/// * `byte` must be next byte of source code, corresponding to current position of `lexer.source`.
+/// * Only `BYTE_HANDLERS` for ASCII characters may use the `ascii_byte_handler!()` macro.
 pub(super) unsafe fn handle_byte(byte: u8, lexer: &mut Lexer) -> Kind {
     BYTE_HANDLERS[byte as usize](lexer)
 }
@@ -82,7 +81,7 @@ macro_rules! byte_handler {
 ///
 /// These assertions produce no runtime code, but hint to the compiler that it can assume that
 /// next char is ASCII, and it uses that information to optimize the rest of the handler.
-/// e.g. `lexer.current.chars.next()` becomes just a single assembler instruction.
+/// e.g. `lexer.consume_char()` becomes just a single assembler instruction.
 /// Without the assertions, the compiler is unable to deduce the next char is ASCII, due to
 /// the indirection of the `BYTE_HANDLERS` jump table.
 ///
@@ -108,8 +107,8 @@ macro_rules! byte_handler {
 ///     unsafe {
 ///       use assert_unchecked::assert_unchecked;
 ///       let s = lexer.current.chars.as_str();
-///       assert_unchecked!(!s.is_empty());
-///       assert_unchecked!(s.as_bytes()[0] < 128);
+///       assert_unchecked!(!lexer.source.is_eof());
+///       assert_unchecked!(lexer.source.peek_byte_unchecked() < 128);
 ///     }
 ///     {
 ///       lexer.consume_char();
@@ -125,9 +124,8 @@ macro_rules! ascii_byte_handler {
             // SAFETY: This macro is only used for ASCII characters
             unsafe {
                 use assert_unchecked::assert_unchecked;
-                let s = $lex.current.chars.as_str();
-                assert_unchecked!(!s.is_empty());
-                assert_unchecked!(s.as_bytes()[0] < 128);
+                assert_unchecked!(!$lex.source.is_eof());
+                assert_unchecked!($lex.source.peek_byte_unchecked() < 128);
             }
             $body
         });
@@ -150,14 +148,14 @@ ascii_byte_handler!(SPS(lexer) {
 // <VT> <FF> Irregular Whitespace
 ascii_byte_handler!(ISP(lexer) {
     lexer.consume_char();
-    lexer.trivia_builder.add_irregular_whitespace(lexer.current.token.start, lexer.offset());
+    lexer.trivia_builder.add_irregular_whitespace(lexer.token.start, lexer.offset());
     Kind::Skip
 });
 
 // '\r' '\n'
 ascii_byte_handler!(LIN(lexer) {
     lexer.consume_char();
-    lexer.current.token.is_on_new_line = true;
+    lexer.token.is_on_new_line = true;
     Kind::Skip
 });
 
@@ -190,7 +188,7 @@ ascii_byte_handler!(HAS(lexer) {
     lexer.consume_char();
     // HashbangComment ::
     //     `#!` SingleLineCommentChars?
-    if lexer.current.token.start == 0 && lexer.next_eq('!') {
+    if lexer.token.start == 0 && lexer.next_eq('!') {
         lexer.read_hashbang_comment()
     } else {
         lexer.private_identifier()

diff --git a/crates/oxc_parser/src/lexer/comment.rs b/crates/oxc_parser/src/lexer/comment.rs
@@ -7,10 +7,10 @@ impl<'a> Lexer<'a> {
     /// Section 12.4 Single Line Comment
     #[allow(clippy::cast_possible_truncation)]
     pub(super) fn skip_single_line_comment(&mut self) -> Kind {
-        let start = self.current.token.start;
+        let start = self.token.start;
         while let Some(c) = self.next_char() {
             if is_line_terminator(c) {
-                self.current.token.is_on_new_line = true;
+                self.token.is_on_new_line = true;
                 self.trivia_builder
                     .add_single_line_comment(start, self.offset() - c.len_utf8() as u32);
                 return Kind::Skip;
@@ -25,11 +25,11 @@ impl<'a> Lexer<'a> {
     pub(super) fn skip_multi_line_comment(&mut self) -> Kind {
         while let Some(c) = self.next_char() {
             if c == '*' && self.next_eq('/') {
-                self.trivia_builder.add_multi_line_comment(self.current.token.start, self.offset());
+                self.trivia_builder.add_multi_line_comment(self.token.start, self.offset());
                 return Kind::Skip;
             }
             if is_line_terminator(c) {
-                self.current.token.is_on_new_line = true;
+                self.token.is_on_new_line = true;
             }
         }
         self.error(diagnostics::UnterminatedMultiLineComment(self.unterminated_range()));
@@ -43,7 +43,7 @@ impl<'a> Lexer<'a> {
                 break;
             }
         }
-        self.current.token.is_on_new_line = true;
+        self.token.is_on_new_line = true;
         Kind::HashbangComment
     }
 }
diff --git a/crates/oxc_parser/src/lexer/jsx.rs b/crates/oxc_parser/src/lexer/jsx.rs
@@ -37,7 +37,7 @@ impl<'a> Lexer<'a> {
     }
 
     pub(crate) fn next_jsx_child(&mut self) -> Token {
-        self.current.token.start = self.offset();
+        self.token.start = self.offset();
         let kind = self.read_jsx_child();
         self.finish_next(kind)
     }