feat(linter): regex parser #1164

ubugeeei · 2023-11-06T03:13:46Z

Implement a regular expression parser equivalent to @eslint-community/regexpp.

Todo:

Consider how to handle Maneren's branch
- Check if any part can be integrated
- See if the remaining issues can be shared in this repository
Document the rules requiring the implementation of the parser as pending

Ref: #611

Boshen · 2023-11-06T06:08:13Z

Let's do this incrementally with proper testing, we can cherry-pick some of the code from Maneren's branch

My requirements:

100% test coverage
Good documentation
use the same infra as oxc_parser, i.e. allocate the AST on oxc_allocator for maximum performance, have lexer / parser phase for maximum code control

ubugeeei · 2023-11-18T04:35:38Z

Boshen · 2023-11-18T06:27:53Z

@ubugeeei Nice work! I hope your are learning a lot.

ref: #1164 I have initialized a crate for handling JavaScript Regexp and defined the AST. I implemented the AST while referring to [eslint-community/regexpp](https://github.com/eslint-community/regexpp/blob/2e8f1af992fb12eae46a446253e8fa3f6cede92a/src/ast.ts).

IWANABETHATGUY · 2023-12-11T09:20:31Z

Hello, I saw it has been 3 weeks since your last pr, are you still interested in this feature, or can I continue your work?

Boshen · 2024-01-06T04:58:53Z

@ubugeeei is too busy with other stuff (vapor mode I guess?). Feel free to assign this to yourself @IWANABETHATGUY

ubugeeei · 2024-01-06T06:29:44Z

@IWANABETHATGUY

Hello, I saw it has been 3 weeks since your last pr, are you still interested in this feature, or can I continue your work?

Ah, sorry, I completely missed this comment.
I was planning to work on it when I had time, if it was still pending, but I haven't been able to do anything yet, so please go ahead, there's no problem at all.

(My time bottleneck is my main job... Actually, Vapor often has discussions that are put on hold, so it's not that busy yet..)

maurice · 2024-01-12T14:25:53Z

This seems like the kind of self-inflicted torture that I'd enjoy. May I have a go at moving it forward?

Boshen · 2024-01-12T14:32:20Z

@IWANABETHATGUY have you started working on this?

IWANABETHATGUY · 2024-01-12T14:33:00Z

@IWANABETHATGUY have you started working on this?

WIP

Boshen · 2024-01-12T14:38:35Z

@maurice it seems like @IWANABETHATGUY is working on it, I'll let him coordinate the tasks.

leaysgur · 2024-06-18T05:25:25Z

I have quick investigated the current status of this issue and the rules related to @eslint-community/regexpp, so I am leaving a NOTE...

Rules related to `regexpp`

ESLint uses an ECMAScript RegExp parser implementation called @eslint-community/regexpp.

There are currently 10 rules implemented that depend on it.

eslint/no-misleading-character-class
- Enforce u flag, \q{} with v flag
- parsePattern(), onCharacterClassEnter()
eslint/no-invalid-regexp ( [$50 Opire Bounty] feat(linter): eslint/no-invalid-regexp #611 )
- Prevent syntax error at runtime
- validatePattern(), validateFlags()
eslint/no-useless-backreference
- Report backrefs refer to another alternatives, appers later, etc
- parsePattern(), onBackreferenceEnter()
- ⚠️ Need to resolve ref > groups
eslint/prefer-named-capture-group
- Prefer named over normal(indexed) capture group
- parsePattern(), onCapturingGroupEnter()
- ⚠️ Only for fixer
eslint/prefer-regex-literals
- Prefer /abc/i over new RegExp("abc", "i")
- validatePattern(), validateFlags(), onCharacterEnter()
eslint/require-unicode-regexp
- Enforce using u or v flags
- validatePattern()
- ⚠️ Only for fixer
eslint/no-control-regex
- Disallow control characters and some escape characters
- onCharacter()
eslint/no-empty-character-class
- Disallow empty character class like /a[]/
- onCharacterClassEnter()
eslint/no-useless-escape
- Disallow unnecessary escape characters
- parsePattern(), onCharacterClass(Enter|Leave)(), onExpressionCharacterClass(Enter|Leave)(), onCharacterEnter()
eslint/no-regex-spaces
- Disallow multiple consecutive spaces
- parsePattern(), onCharacterClassEnter()

https://github.com/search?q=repo%3Aeslint%2Feslint+%28%22%40eslint-community%2Fregexpp%22+OR+%22utils%2Fregular-expressions%22%29&type=code

And there is a related plugin ota-meshi/eslint-plugin-regexp too. / #3263

Implementation status of `oxlint`

It seems that 4/10 rules are already implemented:
- eslint/no-control-regex
  - Already implemented with regex crate
- eslint/no-empty-character-class
  - Already implemented with regex crate
- eslint/no-useless-escape
  - Already implemented with manual char checking
- eslint/no-regex-spaces
  - Already implemented with manual char checking

But it should be re-implemented with regexpp for better performance.

What to do

Complete regexpp port for Rust feat(regex_parser): Implement RegExp parser #3824
Integrate it into AST visiting for oxlint
Re-implement already implemented 4 rules
Implement the remaining 6 rules
Implement eslint-plugin-regex...?

Boshen · 2024-06-18T06:18:05Z

Linting regexp may be doable without a parser, but it'll definitely be easier to lint by visiting an AST.

I still want to regex parser for oxc for maximum performance gains 😅 We should only parse the regex once and visit the AST once for linting.

There completeness, there is also https://github.com/ota-meshi/eslint-plugin-regexp from the same author.

leaysgur · 2024-06-18T07:10:04Z

We should only parse the regex once and visit the AST once

Ah~, I see, that makes sense! I've updated my comment.

ota-meshi/eslint-plugin-regexp

I didn't know this... And this looks also... challenging! 😂

Boshen · 2024-06-18T08:15:07Z

@leaysgur The current blocking task is the parser, we can always distribute the linter rules to future contributors. You don't need to implement the linter rules like you did with the jsdoc plugins 😅

leaysgur · 2024-06-18T08:38:58Z

Yes, I understand. I have finally grasped the situation now. 👍🏻

you did with the jsdoc plugins

🙈

Now, I will tackle the implementation of the parser.
(Since it seems like many people are actively developing the playground. 👀 )

I'm going to read the regexpp implementation and to spend a few days for the oxc_parser architecture and #2030.

leaysgur · 2024-06-21T13:51:15Z

@Boshen

I'm going through the code and written a bit myself, then I'd like to confirm a few things.

Is a Lexer necessary?

In the original implementation of regexpp, it seems to handle characters directly without using Tokens.
To create something that works for now, I'm thinking of following the original approach and not implementing a Lexer.

What do you think?

Support for strict mode (annexB) and various ecmaVersions?

These options seem to significantly affect behavior and implementation. What should we do about them?
One option might be to always support only the latest version... 🤔

Boshen · 2024-06-21T14:16:20Z

Is a Lexer necessary?

Probably not, since there is no whitespace or semi colons like JavaScript.

We can try one without a lexer.

Support for strict mode (annexB) and various ecmaVersions?

We always support the latest ecma version, just like our parser.

For strict mode, you may leave them out from your first version.

leaysgur · 2024-06-21T14:56:40Z

OK, thanks! 🙏
I'll try it minimally for the first step.

Boshen · 2024-06-21T15:01:16Z

@leaysgur Do you intend to start from scratch? You can start from a new branch by coping over some of the code from #2030

Once you have something, I'll help you with the test infrastructure - running the regex parser on all regexes from test262, babel and typescript.

leaysgur · 2024-06-22T11:50:15Z

I pushed current progress #3824 , of course it is WIP... 😴

Do you intend to start from scratch?

Maybe yes. But I will refer to or pick up on them as necessary.

Currently, they are in /_oxc_js_regex dir, just for reference.

I'll help you with the test infrastructure - running the regex parser on all regexes from test262, babel and typescript.

That is very helpful! I was just thinking about what to do. 😅

leaysgur · 2024-07-12T04:32:11Z

Progress update:

Implementation is ongoing. 🚧 #3824
Most of the syntax is implemented, except for [character_class] and v flag related stuff.

But there's still some work to complete.(Early errors, test262 tests and integration to Linter, etc...)

While working on this, I went through the specification and the regexpp implementation several times.
As a result, I realized that the regexpp implementation may not be fully compliant with the specification.

There are some patterns where the implementation is not complete
Some parts simply aren't up to date

Also, for ease of rewriting in Rust, it cannot be implemented as a complete 1-to-1 copy. (since it relies heavily on the dynamic nature of JavaScript)

For these reasons, I'm thinking of abandoning original goal of re-implementing regexpp in Rust and focusing on implementing it as an original specification-compliant parser.
(It may not have cared about compatibility in the first place, though 🤔 )

Fortunately or unfortunately, there are no de-facts like ESTree, and I think there are any problems, but just in case, I'll look at the specific usecases of the Lint rules that depend on regexpp first.

I'll look at the specific usecases of the Lint rules that depend on regexpp first.

All usecase were just as expected.
I updated my comment above. ⬆️

Boshen · 2024-07-12T04:35:07Z

and focusing on implementing it as an original specification-compliant parser

👍 always bet on the spec!

leaysgur · 2024-08-19T06:39:48Z

Progress update:

Implementation part

#3824

Finally completed all ES2024 specs! 🎉

However, we noticed that while it supports the literal pattern /(.)\1/, it misbehaves the pattern for new RegExp("(.)\\1") whose strings are escaped...

In JavaScript, \\ is treated as \. But in Rust, \\ is \\.

Now I'm thinking how to cope with this. Perhaps a pre-treatment, such as a lexer for RegExp parser is required?

Integration part

#4242

RegExp parser is now integrated in oxc_parser itself. It emits error when invalid RegExp literal is found.

But how to use parsed result in user land like linter is still considering.

@Boshen Could you please take a look these and give your feedback if you have time? 🙏🏻

Boshen · 2024-08-19T07:08:29Z

Progress update:

Implementation part

#3824

Finally completed all ES2024 specs! 🎉

However, we noticed that while it supports the literal pattern /(.)\1/, it misbehaves the pattern for new RegExp("(.)\\1") whose strings are escaped...

In JavaScript, \\ is treated as \. But in Rust, \\ is \\.

Now I'm thinking how to cope with this. Perhaps a pre-treatment, such as a lexer for RegExp parser is required?

Integration part

#4242

RegExp parser is now integrated in oxc_parser itself. It emits error when invalid RegExp literal is found.

But how to use parsed result in user land like linter is still considering.

@Boshen Could you please take a look these and give your feedback if you have time? 🙏🏻

Thank you for the tremendous effort! I'll take a look at the problems soon.

@Boshen

Part of #1164 ## Progress updates 🗞️ Waiting for the review and advice, while thinking how to handle escaped string when `new RegExp(pat)`. ## TODOs - [x] `RegExp(Literal = Body + Flags)#parse()` structure - [x] Base `Reader` impl to handle both unicode(u32) and utf-16(u16) units - [x] Global `Span` and local offset conversion - [x] Design AST shapes - [x] Keep `enum` size small by `Box<'a, T>` - [x] Rework AST shapes - [x] Split body and flags w/ validating literal - [x] Parse `RegExpFlags` - [x] Parse `RegExpBody` = `Pattern` - [x] Parse `Pattern` > `Disjunction` - [x] Parse `Disjunction` > `Alternative` - [x] Parse `Alternative` > `Term` - [x] Parse `Term` > `Assertion` - [x] Parse `BoundaryAssertion` - [x] Parse `LookaroundAssertion` - [x] Parse `Term` > `Quantifier` - [x] Parse `Term` > `Atom` - [x] Parse `Atom` > `PatternCharacter` - [x] Parse `Atom` > `.` - [x] Parse `Atom` > `\AtomEscape` - [x] Parse `\AtomEscape` > `DecimalEscape` - [x] Parse `\AtomEscape` > `CharacterClassEscape` - [x] Parse `CharacterClassEscape` > `\d, \D, \s, \S, \w, \W` - [x] Parse `CharacterClassEscape` > `\p{UnicodePropertyValueExpression}, \P{UnicodePropertyValueExpression}` - [x] Parse `\AtomEscape` > `CharacterEscape` - [x] Parse `CharacterEscape` > `ControlEscape` - [x] Parse `CharacterEscape` > `c AsciiLetter` - [x] Parse `CharacterEscape` > `0` - [x] Parse `CharacterEscape` > `HexEscapeSequence` - [x] Parse `CharacterEscape` > `RegExpUnicodeEscapeSequence` - [x] Parse `CharacterEscape` > `IdentityEscape` - [x] Parse `\AtomEscape` > `kGroupName` - [x] Parse `Atom` > `[CharacterClass]` - [x] Parse `[CharacterClass]` > `ClassContents` > `[~UnicodeSetsMode] NonemptyClassRanges` - [x] Parse `[CharacterClass]` > `ClassContents` > `[+UnicodeSetsMode] ClassSetExpression` - [x] Parse `ClassSetExpression` > `ClassUnion` - [x] Parse `ClassSetExpression` > `ClassIntersection` - [x] Parse `ClassSetExpression` > `ClassSubtraction` - [x] Parse `ClassSetExpression` > `ClassSetOperand` - [x] Parse `ClassSetExpression` > `ClassSetRange` - [x] Parse `ClassSetExpression` > `ClassSetCharacter` - [x] Parse `Atom` > `(GroupSpecifier)` - [x] Parse `Atom` > `(?:Disjunction)` - [x] Annex B - [x] Parse `QuantifiableAssertion` - [x] Parse `ExtendedAtom` - [x] Parse `ExtendedAtom` > `\ [lookahead = c]` - [x] Parse `ExtendedAtom` > `InvalidBracedQuantifier` - [x] Parse `ExtendedAtom` > `ExtendedPatternCharacter` - [x] Parse `ExtendedAtom` > `\AtomEscape` > `CharacterEscape` > `LegacyOctalEscapeSequence` - [x] Early errors - [x] Pattern :: Disjunction(1/2) - [x] Pattern :: Disjunction(2/2) - [x] QuantifierPrefix :: { DecimalDigits , DecimalDigits } - [x] ExtendedAtom :: InvalidBracedQuantifier (Annex B) - [x] AtomEscape :: k GroupName - [x] AtomEscape :: DecimalEscape - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(1/2) - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(2/2) - [x] NonemptyClassRanges :: ClassAtom - ClassAtom ClassContents(Annex B) - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(1/2) - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(2/2) - [x] NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassContents(Annex B) - [x] RegExpIdentifierStart :: \ RegExpUnicodeEscapeSequence - [x] RegExpIdentifierStart :: UnicodeLeadSurrogate UnicodeTrailSurrogate - [x] RegExpIdentifierPart :: \ RegExpUnicodeEscapeSequence - [x] RegExpIdentifierPart :: UnicodeLeadSurrogate UnicodeTrailSurrogate - [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(1/2) - [x] UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue(2/2) - [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(1/2) - [x] UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue(2/2) - [x] CharacterClassEscape :: P{ UnicodePropertyValueExpression } - [x] CharacterClass :: [^ ClassContents ] - [x] NestedClass :: [^ ClassContents ] - [x] ClassSetRange :: ClassSetCharacter - ClassSetCharacter - [x] Add `Span` to `Err(OxcDiagnostic::error())` calls - [x] Perf improvement - [x] `Reader#peek()` should avoid `iter.next()` equivalent - [x] ~~Use `char` everywhere and split and push 2 surrogates(pair) for `Character`?~~ - [x] ~~Try 1(+1) loop parsing for capturing groups?~~ ## Follow up - [x] @Boshen Test suite > #4242 - [x] Investigate CI errors... - Next... - Support ES2025 Duplicate named capturing groups? - Support ES20XX Stage3 Modifiers?

leaysgur · 2024-08-22T08:53:40Z

Progress update:

oxc_regular_expression crate is finally merged into main branch ✨

Now oxc_parser can emit syntax error if invalid regex literals are found

This is optional

oxc/crates/oxc_parser/src/lib.rs

Lines 109 to 112 in f73d486

    
               /// Whether to parse regular expressions or not. 
        
               /// 
        
               /// Default: false 
        
               pub parse_regular_expression: bool,

Only /literals/v is checked, RegExp("string", "v") is not checked

However, to officially use parsed AST in linter, there is a little more to do.
e.g. #5060 , how to use it with RegExp("string")?, etc.

As for the remaining tasks, we can address them in a separate issue. But for now, can't we close this long-lived issue? 😃

Boshen · 2024-08-22T08:59:56Z

Thank you again @leaysgur, and also everyone who participated in this. Thank you all!

ubugeeei mentioned this issue Nov 6, 2023

[$50 Opire Bounty] feat(linter): eslint/no-invalid-regexp #611

Closed

ubugeeei mentioned this issue Nov 22, 2023

Initialize JS Regex crates and def AST. #1500

Merged

Boshen added the E-Help Wanted Experience level - For the experienced collaborators label Jan 6, 2024

Boshen assigned IWANABETHATGUY Jan 12, 2024

maurice mentioned this issue Jan 12, 2024

feat(linter): eslint/no-misleading-character-class #614

Open

leaysgur mentioned this issue Jun 22, 2024

feat(regex_parser): Implement RegExp parser #3824

Merged

82 tasks

leaysgur linked a pull request Jun 24, 2024 that will close this issue

feat(regex_parser): Implement RegExp parser #3824

Merged

82 tasks

Boshen assigned Boshen and unassigned IWANABETHATGUY Aug 19, 2024

Boshen added P-high Priority - High and removed E-Help Wanted Experience level - For the experienced collaborators labels Aug 19, 2024

Boshen closed this as completed Aug 22, 2024

Boshen removed the P-high Priority - High label Aug 22, 2024

camchenry mentioned this issue Sep 22, 2024

regular_expression: implement visitor pattern for regex AST #5977

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(linter): regex parser #1164

feat(linter): regex parser #1164

ubugeeei commented Nov 6, 2023

Boshen commented Nov 6, 2023 •

edited

Loading

ubugeeei commented Nov 18, 2023 •

edited

Loading

Boshen commented Nov 18, 2023 •

edited

Loading

IWANABETHATGUY commented Dec 11, 2023

Boshen commented Jan 6, 2024

ubugeeei commented Jan 6, 2024

maurice commented Jan 12, 2024

Boshen commented Jan 12, 2024

IWANABETHATGUY commented Jan 12, 2024

Boshen commented Jan 12, 2024

leaysgur commented Jun 18, 2024 •

edited

Loading

Boshen commented Jun 18, 2024 •

edited

Loading

leaysgur commented Jun 18, 2024

Boshen commented Jun 18, 2024 •

edited

Loading

leaysgur commented Jun 18, 2024

leaysgur commented Jun 21, 2024

Boshen commented Jun 21, 2024

leaysgur commented Jun 21, 2024

Boshen commented Jun 21, 2024

leaysgur commented Jun 22, 2024

leaysgur commented Jul 12, 2024 •

edited

Loading

Boshen commented Jul 12, 2024

leaysgur commented Aug 19, 2024

Boshen commented Aug 19, 2024

Implementation part

Integration part

leaysgur commented Aug 22, 2024

Boshen commented Aug 22, 2024

feat(linter): regex parser #1164

feat(linter): regex parser #1164

Comments

ubugeeei commented Nov 6, 2023

Boshen commented Nov 6, 2023 • edited Loading

ubugeeei commented Nov 18, 2023 • edited Loading

Roadmap (Draft) (For Contributors)

Boshen commented Nov 18, 2023 • edited Loading

IWANABETHATGUY commented Dec 11, 2023

Boshen commented Jan 6, 2024

ubugeeei commented Jan 6, 2024

maurice commented Jan 12, 2024

Boshen commented Jan 12, 2024

IWANABETHATGUY commented Jan 12, 2024

Boshen commented Jan 12, 2024

leaysgur commented Jun 18, 2024 • edited Loading

Rules related to regexpp

Implementation status of oxlint

What to do

Boshen commented Jun 18, 2024 • edited Loading

leaysgur commented Jun 18, 2024

Boshen commented Jun 18, 2024 • edited Loading

leaysgur commented Jun 18, 2024

leaysgur commented Jun 21, 2024

Is a Lexer necessary?

Support for strict mode (annexB) and various ecmaVersions?

Boshen commented Jun 21, 2024

leaysgur commented Jun 21, 2024

Boshen commented Jun 21, 2024

leaysgur commented Jun 22, 2024

leaysgur commented Jul 12, 2024 • edited Loading

Boshen commented Jul 12, 2024

leaysgur commented Aug 19, 2024

Implementation part

Integration part

Boshen commented Aug 19, 2024

Implementation part

Integration part

leaysgur commented Aug 22, 2024

Boshen commented Aug 22, 2024

Boshen commented Nov 6, 2023 •

edited

Loading

ubugeeei commented Nov 18, 2023 •

edited

Loading

Boshen commented Nov 18, 2023 •

edited

Loading

leaysgur commented Jun 18, 2024 •

edited

Loading

Rules related to `regexpp`

Implementation status of `oxlint`

Boshen commented Jun 18, 2024 •

edited

Loading

Boshen commented Jun 18, 2024 •

edited

Loading

leaysgur commented Jul 12, 2024 •

edited

Loading