feat(linter): regex parser #1164
Comments
Let's do this incrementally with proper testing. We can cherry-pick some of the code from Maneren's branch. My requirements:
|
Roadmap (Draft) (For Contributors)
|
@ubugeeei Nice work! I hope you are learning a lot. |
ref: #1164 I have initialized a crate for handling JavaScript Regexp and defined the AST. I implemented the AST while referring to [eslint-community/regexpp](https://github.com/eslint-community/regexpp/blob/2e8f1af992fb12eae46a446253e8fa3f6cede92a/src/ast.ts).
Hello, I saw it has been 3 weeks since your last pr, are you still interested in this feature, or can I continue your work? |
@ubugeeei is too busy with other stuff (vapor mode I guess?). Feel free to assign this to yourself @IWANABETHATGUY |
Ah, sorry, I completely missed this comment. (My time bottleneck is my main job... Actually, Vapor often has discussions that are put on hold, so it's not that busy yet..) |
This seems like the kind of self-inflicted torture that I'd enjoy. May I have a go at moving it forward? |
@IWANABETHATGUY have you started working on this? |
WIP |
@maurice it seems like @IWANABETHATGUY is working on it, I'll let him coordinate the tasks. |
I have quickly investigated the current status of this issue and the rules related to
|
Linting regexp may be doable without a parser, but it'll definitely be easier to lint by visiting an AST. I still want a regex parser for oxc for maximum performance gains 😅 We should only parse the regex once and visit the AST once for linting. For completeness, there is also https://github.com/ota-meshi/eslint-plugin-regexp from the same author. |
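To illustrate the "parse once, visit once" idea, here is a minimal sketch in which several lint rules share a single traversal of an already-parsed tree. The `Node` enum, the `Rule` trait, `walk`, and the two counting rules are all hypothetical names made up for this example; they are not the oxc AST or linter API.

```rust
// Hypothetical miniature AST, far smaller than any real regex AST.
pub enum Node {
    Char(char),
    Group(Vec<Node>),
    Alternation(Vec<Node>),
}

/// A lint rule sees every node exactly once during the shared traversal.
pub trait Rule {
    fn check(&mut self, node: &Node);
}

/// Walks the tree once, handing each node to every rule before descending.
pub fn walk(node: &Node, rules: &mut [&mut dyn Rule]) {
    for rule in rules.iter_mut() {
        rule.check(node);
    }
    match node {
        Node::Char(_) => {}
        Node::Group(children) | Node::Alternation(children) => {
            for child in children {
                walk(child, rules);
            }
        }
    }
}

/// Example rule (illustrative only): counts groups.
pub struct CountGroups(pub usize);

impl Rule for CountGroups {
    fn check(&mut self, node: &Node) {
        if matches!(node, Node::Group(_)) {
            self.0 += 1;
        }
    }
}

/// Example rule (illustrative only): counts literal characters.
pub struct CountChars(pub usize);

impl Rule for CountChars {
    fn check(&mut self, node: &Node) {
        if matches!(node, Node::Char(_)) {
            self.0 += 1;
        }
    }
}
```

With this shape, adding another rule does not add another parse or another traversal; it only adds one more `check` call per node.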
Ah~, I see, that makes sense! I've updated my comment. I didn't know this... And this looks also... challenging! 😂 |
@leaysgur The current blocking task is the parser, we can always distribute the linter rules to future contributors. You don't need to implement the linter rules like you did with the jsdoc plugins 😅 |
Yes, I understand. I have finally grasped the situation now. 👍🏻
🙈 Now, I will tackle the implementation of the parser. I'm going to read the |
I've been going through the code and have written a bit myself, and I'd like to confirm a few things.

Is a Lexer necessary?
In the original implementation of What do you think?

Support for strict mode (annexB) and various ecmaVersions?
These options seem to significantly affect behavior and implementation. What should we do about them? |
Probably not, since a regex has no whitespace or semicolons to skip, unlike JavaScript. We can try one without a lexer.
We always support the latest ecma version, just like our parser. For strict mode, you may leave them out from your first version. |
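As a rough illustration of what "without a lexer" can look like, the sketch below is a recursive-descent parser that reads characters directly, with no token stream in between. It covers only a tiny, made-up subset (literal characters, `.`, groups, and `|`), and `Term`, `Parser`, and the error strings are hypothetical names for this example, not the actual oxc implementation.

```rust
// Hypothetical, heavily reduced AST: not the real AST shapes.
#[derive(Debug, PartialEq)]
pub enum Term {
    Char(char),
    Dot,
    /// A parenthesized Disjunction: a list of alternatives.
    Group(Vec<Vec<Term>>),
}

pub struct Parser {
    chars: Vec<char>,
    pos: usize,
}

impl Parser {
    pub fn new(source: &str) -> Self {
        Self { chars: source.chars().collect(), pos: 0 }
    }

    fn peek(&self) -> Option<char> {
        self.chars.get(self.pos).copied()
    }

    /// Consumes `c` if it is the next character.
    fn eat(&mut self, c: char) -> bool {
        if self.peek() == Some(c) {
            self.pos += 1;
            true
        } else {
            false
        }
    }

    /// Pattern :: Disjunction, and the whole input must be consumed.
    pub fn parse(mut self) -> Result<Vec<Vec<Term>>, String> {
        let disjunction = self.parse_disjunction()?;
        if self.pos != self.chars.len() {
            return Err(format!("unexpected character at offset {}", self.pos));
        }
        Ok(disjunction)
    }

    /// Disjunction :: Alternative, then zero or more `|` Alternative.
    fn parse_disjunction(&mut self) -> Result<Vec<Vec<Term>>, String> {
        let mut alternatives = vec![self.parse_alternative()?];
        while self.eat('|') {
            alternatives.push(self.parse_alternative()?);
        }
        Ok(alternatives)
    }

    /// Alternative :: zero or more Terms, stopping at `|`, `)`, or end of input.
    fn parse_alternative(&mut self) -> Result<Vec<Term>, String> {
        let mut terms = Vec::new();
        while let Some(c) = self.peek() {
            match c {
                '|' | ')' => break,
                '(' => {
                    self.pos += 1;
                    let inner = self.parse_disjunction()?;
                    if !self.eat(')') {
                        return Err("unterminated group".to_string());
                    }
                    terms.push(Term::Group(inner));
                }
                '.' => {
                    self.pos += 1;
                    terms.push(Term::Dot);
                }
                _ => {
                    self.pos += 1;
                    terms.push(Term::Char(c));
                }
            }
        }
        Ok(terms)
    }
}
```

Because every character is significant in a pattern, each grammar production can decide what to do from a one-character lookahead, which is why a separate tokenizing pass adds little here.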
OK, thanks! 🙏 |
I pushed the current progress to #3824; of course, it is WIP... 😴
Maybe yes. But I will refer to or pick up on them as necessary. Currently, they are in
That is very helpful! I was just thinking about what to do. 😅 |
Progress update: Implementation is ongoing. 🚧 #3824 But there's still some work to complete (early errors, test262 tests, integration into the linter, etc.). While working on this, I went through the specification and the
Also, for ease of rewriting in Rust, it cannot be implemented as a complete 1-to-1 copy (since it relies heavily on the dynamic nature of JavaScript). For these reasons, I'm thinking of abandoning the original goal of re-implementing Fortunately or unfortunately, there is no de facto standard like ESTree, so I don't think there will be any problems, but just in case, I'll look at the specific use cases of the lint rules that depend on
All usecase were just as expected. |
👍 always bet on the spec! |
Progress update:

Implementation part
Finally completed all ES2024 specs! 🎉 However, we noticed that while it supports the literal pattern In JavaScript, Now I'm thinking about how to cope with this. Perhaps a pre-treatment, such as a lexer for the RegExp parser, is required?

Integration part
The RegExp parser is now integrated in But how to use the parsed result in userland, such as the linter, is still under consideration.

@Boshen Could you please take a look at these and give your feedback if you have time? 🙏🏻 |
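One possible shape for such a pre-treatment, sketched under assumptions: when the pattern comes from a string literal (as in `new RegExp("\\d+")`), the source text still contains JavaScript string escapes, so it has to be unescaped before it is valid regex pattern text. The function name `unescape_js_string` is hypothetical, and it deliberately handles only a handful of escapes; anything else is rejected rather than guessed.

```rust
/// Hypothetical pre-treatment: converts the raw source text of a JS string
/// literal body (without the surrounding quotes) into the pattern text.
/// Only `\\`, `\"`, `\'`, `\n`, `\t`, and `\r` are supported in this sketch;
/// any other escape, or a trailing lone backslash, yields `None`.
pub fn unescape_js_string(raw: &str) -> Option<String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // `?` returns `None` when the backslash is the last character.
        match chars.next()? {
            'n' => out.push('\n'),
            't' => out.push('\t'),
            'r' => out.push('\r'),
            escaped @ ('\\' | '"' | '\'') => out.push(escaped),
            _ => return None, // not handled by this sketch
        }
    }
    Some(out)
}
```

One consequence of any step like this is that offsets in the unescaped pattern no longer line up with offsets in the original source, so diagnostics would need some way to map positions back.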
Thank you for the tremendous effort! I'll take a look at the problems soon. |
Part of #1164

## Progress updates 🗞️

Waiting for the review and advice, while thinking how to handle escaped string when `new RegExp(pat)`.

## TODOs

- [x] `RegExp(Literal = Body + Flags)#parse()` structure
- [x] Base `Reader` impl to handle both unicode(u32) and utf-16(u16) units
- [x] Global `Span` and local offset conversion
- [x] Design AST shapes
- [x] Keep `enum` size small by `Box<'a, T>`
- [x] Rework AST shapes
- [x] Split body and flags w/ validating literal
- [x] Parse `RegExpFlags`
- [x] Parse `RegExpBody` = `Pattern`
- [x] Parse `Pattern` > `Disjunction`
- [x] Parse `Disjunction` > `Alternative`
- [x] Parse `Alternative` > `Term`
- [x] Parse `Term` > `Assertion`
- [x] Parse `BoundaryAssertion`
- [x] Parse `LookaroundAssertion`
- [x] Parse `Term` > `Quantifier`
- [x] Parse `Term` > `Atom`
- [x] Parse `Atom` > `PatternCharacter`
- [x] Parse `Atom` > `.`
- [x] Parse `Atom` > `\AtomEscape`
- [x] Parse `\AtomEscape` > `DecimalEscape`
- [x] Parse `\AtomEscape` > `CharacterClassEscape`
- [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W`
- [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}`
- [x] Parse `\AtomEscape` > `CharacterEscape`
- [x] Parse `CharacterEscape` > `ControlEscape`
- [x] Parse `CharacterEscape` > `c AsciiLetter`
- [x] Parse `CharacterEscape` > `0`
- [x] Parse `CharacterEscape` > `HexEscapeSequence`
- [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence`
- [x] Parse `CharacterEscape` > `IdentityEscape`
- [x] Parse `\AtomEscape` > `kGroupName`
- [x] Parse `Atom` > `[CharacterClass]`
- [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges`
- [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression`
- [x] Parse `ClassSetExpression` > `ClassUnion`
- [x] Parse `ClassSetExpression` > `ClassIntersection`
- [x] Parse `ClassSetExpression` > `ClassSubtraction`
- [x] Parse `ClassSetExpression` > `ClassSetOperand`
- [x] Parse `ClassSetExpression` > `ClassSetRange`
- [x] Parse `ClassSetExpression` > `ClassSetCharacter`
- [x] Parse `Atom` > `(GroupSpecifier)`
- [x] Parse `Atom` > `(?:Disjunction)`
- [x] Annex B
  - [x] Parse `QuantifiableAssertion`
  - [x] Parse `ExtendedAtom`
  - [x] Parse `ExtendedAtom` > `\ [lookahead = c]`
  - [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier`
  - [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter`
  - [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence`
- [x] Early errors
  - [x] Pattern :: Disjunction(1/2)
  - [x] Pattern :: Disjunction(2/2)
  - [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits }
  - [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B)
  - [x] AtomEscape :: k GroupName
  - [x] AtomEscape :: DecimalEscape
  - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2)
  - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2)
  - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B)
  - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2)
  - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2)
  - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B)
  - [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence
  - [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
  - [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence
  - [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate
  - [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2)
  - [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2)
  - [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2)
  - [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2)
  - [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression }
  - [x] CharacterClass :: [^ ClassContents ]
  - [x] NestedClass :: [^ ClassContents ]
  - [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter
- [x] Add `Span` to `Err(OxcDiagnostic::error())` calls
- [x] Perf improvement
  - [x] `Reader#peek()` should avoid `iter.next()` equivalent
  - [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~
  - [x] ~~Try 1(+1) loop parsing for capturing groups?~~

## Follow up

- [x] @Boshen Test suite > #4242
- [x] Investigate CI errors...
- Next...
  - Support ES2025 Duplicate named capturing groups?
  - Support ES20XX Stage3 Modifiers?
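The `Reader` items in the TODO list (handling both unicode and utf-16 units, and a `peek()` that avoids an `iter.next()` equivalent) could be approached along these lines. This is only a sketch under assumptions: the struct layout, the `unicode_mode` parameter, and the method names are hypothetical and not taken from the PR. It stores every unit as a `u32` up front so that lookahead is a plain index lookup.

```rust
/// Hypothetical reader; names and layout are illustrative only.
pub struct Reader {
    /// Code points in unicode mode, UTF-16 code units otherwise.
    units: Vec<u32>,
    index: usize,
}

impl Reader {
    pub fn new(source: &str, unicode_mode: bool) -> Self {
        let units: Vec<u32> = if unicode_mode {
            // One unit per Unicode code point.
            source.chars().map(u32::from).collect()
        } else {
            // One unit per UTF-16 code unit, so astral characters
            // become two units (a surrogate pair).
            source.encode_utf16().map(u32::from).collect()
        };
        Self { units, index: 0 }
    }

    /// Index-based lookahead: nothing is cloned or advanced.
    pub fn peek(&self) -> Option<u32> {
        self.units.get(self.index).copied()
    }

    /// Returns the current unit and moves past it.
    pub fn advance(&mut self) -> Option<u32> {
        let unit = self.peek()?;
        self.index += 1;
        Some(unit)
    }

    /// Total number of units in the chosen mode.
    pub fn unit_count(&self) -> usize {
        self.units.len()
    }
}
```

The trade-off in this sketch is an upfront allocation for the unit buffer in exchange for constant-time `peek()` in either mode.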
Progress update:
However, to officially use the parsed AST in the linter, there is a little more to do. As for the remaining tasks, we can address them in a separate issue. But for now, can't we close this long-lived issue? 😃 |
Thank you again @leaysgur, and also everyone who participated in this. Thank you all! |
Implement a regular expression parser equivalent to @eslint-community/regexpp.
Todo:
Ref: #611