
Feat/mini parse 2 alpha #62

Draft
wants to merge 7 commits into feat/mini-parse-rework

Conversation

@stefnotch (Contributor) commented Feb 3, 2025

tl;dr: I finally got to try out all my mini-parse ideas. 🎉 I am now wondering which of these are worth keeping, and which ones are not.

I tried writing a mini-parse library which

  • keeps track of the input stream type
    • This gives us a typed variation of token(kind: Kind, value: string)
    • In winnow-land, this design also lets them generically operate on strings or binary streams. This isn't useful for us.
    • I am using this for the span combinator, but I'm not convinced that design is good.
    • tl;dr: It's nifty, but I don't care too much about this.
  • keeps track of whether a parser can backtrack
    • This is used to almost always guarantee "no backtracking". E.g. seq2(tryToken("keyword", "import"), token("symbol", ";")) has the semantics "only the first parser in the sequence can backtrack", so only that part gets an if (result == null) return null; check (see the sketch after this list).
    • or parsers assert that their children must be capable of backtracking; otherwise those children would be useless.
    • This made me realize that the imports grammar, as written, actually requires a two-token lookahead.
    • However, it makes writing parsers a lot more verbose. See ImportGrammar.ts
    • It should, in theory, make debugging parsing failures easier.
    • tl;dr: I think it's super neat that this is possible. I am, however, not convinced that it's what we want for our implementation.
  • exposes the _run method publicly as parseNext. The intent is that users can write parsers in an alternative, hand-written style (see Parser2.test.ts), which lets us hand-write parsing logic for hot paths of the code; the sketch after this list shows both styles.
  • has way less overhead
    • Yeah, that is useful. I wonder why its overhead is so much lower.
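
To make the backtracking bookkeeping concrete, here is a minimal TypeScript sketch. The names (tryToken, token, seq2, parseNext) follow the ones used above, but the Lexer interface and all types and signatures are simplified stand-ins, not the actual mini-parse API:

```ts
type Span = [number, number];
interface Token {
  kind: string;
  value: string;
  span: Span;
}

interface Lexer {
  /** Current position, usable as a plain source offset. */
  checkpoint(): number;
  reset(checkpoint: number): void;
  /** Consume and return the next token, or null at the end of input. */
  next(): Token | null;
}

/** A parser that may fail: it rewinds the lexer and returns null on mismatch. */
interface BacktrackingParser<T> {
  parseNext(lexer: Lexer): T | null;
}
/** A parser that must succeed once this branch is committed: no null case. */
interface CommittedParser<T> {
  parseNext(lexer: Lexer): T;
}

function tryToken(kind: string, value: string): BacktrackingParser<Token> {
  return {
    parseNext(lexer) {
      const cp = lexer.checkpoint();
      const tok = lexer.next();
      if (tok !== null && tok.kind === kind && tok.value === value) return tok;
      lexer.reset(cp); // mismatch: rewind so the caller can try an alternative
      return null;
    },
  };
}

function token(kind: string, value: string): CommittedParser<Token> {
  return {
    parseNext(lexer) {
      const tok = lexer.next();
      if (tok === null || tok.kind !== kind || tok.value !== value) {
        // We are committed, so a mismatch is a hard parse error, not a null.
        throw new Error(`expected ${kind} "${value}"`);
      }
      return tok;
    },
  };
}

/** Only the first child may backtrack, so only its result gets a null check. */
function seq2<A, B>(
  first: BacktrackingParser<A>,
  second: CommittedParser<B>,
): BacktrackingParser<[A, B]> {
  return {
    parseNext(lexer) {
      const a = first.parseNext(lexer);
      if (a === null) return null; // the single backtracking point
      return [a, second.parseNext(lexer)];
    },
  };
}

// Combinator style:
const importStmt = seq2(tryToken("keyword", "import"), token("symbol", ";"));

// The same logic hand-written against parseNext, the style intended for
// hot paths:
const importStmtByHand: BacktrackingParser<[Token, Token]> = {
  parseNext(lexer) {
    const kw = tryToken("keyword", "import").parseNext(lexer);
    if (kw === null) return null;
    return [kw, token("symbol", ";").parseNext(lexer)];
  },
};
```

The payoff is that the null checks are enforced by the two parser types instead of by convention: forgetting the check on a backtracking child no longer type-checks.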

To try it out, I wrote a parser-combinator layer that calls the new implementation, and then rewrote the imports grammar on top of it.
Benchmarks are on Discord, but the rough result is that performance could go from wgsl-linker LOC/sec: 33.229 to wgsl-linker LOC/sec: 123.075.

The unit tests are failing, and that's fine.

@stefnotch (Contributor, Author) commented Feb 3, 2025

For tracking spans, I can think of a few different options:

  • Add previousToken to checkpoints (shape sketched after this list) => track the first token with a custom lexer; the last token is what the checkpoint says
  • Add the rule that parseNext always leaves you at the end of a token. => track first token with a custom lexer, end is checkpoint
  • Add the rule that you are always at the beginning of a token, and add previousToken to checkpoints => [checkpoint().token.span[1], checkpoint()]. Could also allow for another optimisation
  • Or not having a span() combinator, and manually building it up from the info that is already present in the tokens.
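
For the two options that put previousToken into checkpoints, a checkpoint would carry the token alongside the position, so reset() can never leave a stale token behind. A hypothetical shape, reusing the Token type from the sketch in the first comment:

```ts
// Hypothetical checkpoint shape for the previousToken-in-checkpoints options.
// reset() restores both fields together, so the cached token stays valid.
interface CheckpointWithToken {
  position: number;
  token: Token | null; // the previously consumed token; null at input start
}
```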

What does not work

  • Only using checkpoints, because of whitespace: a checkpoint taken before a parser runs still sits before any skipped whitespace, so it would not mark the true start of the first token.
  • Storing a "previous token", because peek + reset would invalidate it (see the sketch below).
  • Splitting parseNext into skipIgnored and parseNext, where parseNext would always leave you at the end of a token and skipIgnored would always bring you to the start of one => checkpoints are spans. However, I wouldn't know whether a child parser already called skipIgnored, so the checkpoint might not be reliable.
  • Just adding "prevTokenEnd" and "nextTokenStart" to the API, because .reset() would invalidate the cached next token and force us to inefficiently recompute it. There are more efficient variations above.
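
To spell out the second point, here is a sketch (using the hypothetical Lexer interface from the first code block) of how a plain cached previous token goes stale:

```ts
// Suppose the lexer kept an internal previousToken, updated on every next().
function staleCacheDemo(lexer: Lexer) {
  const before = lexer.checkpoint();
  lexer.next();        // peeking: previousToken now points at the peeked token
  lexer.reset(before); // position is restored, but previousToken is not,
                       // so the cache is stale; storing the token inside the
                       // checkpoint (as in the options above) avoids this
}
```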

@stefnotch (Contributor, Author) commented Feb 4, 2025

I'm picking the option where parseNext always leaves you at the end of a token, and the span combinator does a peek() (peeking is done as const before = lexer.checkpoint(); const s = lexer.peek().span[0]; lexer.reset(before);). A sketch of the resulting combinator is below.
Then I'll make sure that peeking is optimized (it'll get used a lot), and more importantly: it's a zero-cost abstraction!
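
A sketch of what that span combinator could look like, reusing the hypothetical interfaces from the first code block (checkpoints are assumed to be plain source offsets, and the peek is spelled out as checkpoint/next/reset):

```ts
function span<T>(
  child: BacktrackingParser<T>,
): BacktrackingParser<{ value: T; span: Span }> {
  return {
    parseNext(lexer) {
      // Peek: read the next token's start offset without consuming anything.
      const before = lexer.checkpoint();
      const next = lexer.next();
      lexer.reset(before);
      if (next === null) return null;

      const value = child.parseNext(lexer);
      if (value === null) return null;

      // parseNext always leaves the lexer at the end of a token, so the
      // current checkpoint is the end of the span.
      return { value, span: [next.span[0], lexer.checkpoint()] };
    },
  };
}
```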

If it still ends up being slow, I can try this option:

  • Add the rule that you are always at the beginning of a token, and add previousToken to checkpoints => [checkpoint().token.span[1], checkpoint()]. Could also allow for another optimisation
