In How we are parsing? 24th oct version (v3) I discussed how we are going to write a handwritten parser. I have been working on it for a while and decided to slightly tweak the approach to parsing.
Logos Not Really Needed
The first tweak is that I am getting rid of the logos based tokeniser and working directly on `Vec<char>`.
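To make that concrete, here is a minimal sketch of what a hand-rolled scanner over `Vec<char>` could look like. The type and method names (`Scanner`, `peek`, `pop`, `take_while`) are my placeholders, not necessarily the actual API:

```rust
// A hand-rolled scanner over Vec<char>: an index into the character
// buffer, plus a few primitives that parsers can be built on.
pub struct Scanner {
    chars: Vec<char>,
    index: usize,
}

impl Scanner {
    pub fn new(source: &str) -> Scanner {
        Scanner { chars: source.chars().collect(), index: 0 }
    }

    // Look at the current character without consuming it.
    pub fn peek(&self) -> Option<char> {
        self.chars.get(self.index).copied()
    }

    // Consume and return the current character.
    pub fn pop(&mut self) -> Option<char> {
        let c = self.peek()?;
        self.index += 1;
        Some(c)
    }

    // Consume characters while the predicate holds, returning them.
    pub fn take_while(&mut self, f: impl Fn(char) -> bool) -> String {
        let mut s = String::new();
        while let Some(c) = self.peek() {
            if !f(c) {
                break;
            }
            s.push(c);
            self.index += 1;
        }
        s
    }
}
```

No regex, no token disambiguation rules: whatever a parser consumes is exactly what its code says it consumes.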
To me the straw that broke the camel's back was this "fix". I have studied the Token disambiguation section of the Logos Handbook, and it's definitely not a bug in logos; I was using regex wrong. Logos uses greedy matching, and its rule says "longer beats shorter." I felt that `"foo:"` should match as `foo` and `Colon`, but I was wrong: the longer-beats-shorter rule is only applied at the start of the token, and the regex matches as much as it can, so I had to explicitly exclude the colon: `#[regex(r"[^\s^:]+")]`.
At this rate I would have to exclude all the special characters, as `[^\s]+` is too greedy.
The thing is, regex is hard, and logos came up with its own intricate behaviour built on top of regex, which makes it harder still. I was thinking "no big deal", I will just write tests for my tokens. But then I realised I am not really winning anything by using that library. It was just one really heavy library to keep in my head when trying to reason things out, when the benefit was really dubious.
Baby Steps
Another mistake I made in v3 was trying to parse top down. We are going to parse top down eventually, but I started writing the code top down as well. It was really hard to unit test a top down parser until I had parsed some meaningful chunk of the grammar, and in that attempt I wrote many functions without thorough testing. Eventually, when I encountered a meaningful test case, `-- foo:`, I had difficulty debugging the issue.
So this time I want to go bottom up, starting with the lowest level thing (in p1), which is an identifier. Writing lots of unit tests, and going one step up only after gaining confidence in what we have at hand, seems to be a better approach.
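As a sketch of the bottom-up starting point, a standalone identifier parser might look like the following. The identifier rules here (alphabetic or `_` start, then alphanumeric, `_`, or `-`) are my assumption, not necessarily p1's actual grammar:

```rust
// A strict, bottom-up identifier parser over a character buffer.
// It consumes no leading whitespace and returns None (without
// advancing) if no identifier starts at the current position.
fn identifier(chars: &[char], index: &mut usize) -> Option<String> {
    let first = *chars.get(*index)?;
    if !(first.is_alphabetic() || first == '_') {
        return None;
    }
    let start = *index;
    while let Some(&c) = chars.get(*index) {
        if c.is_alphanumeric() || c == '_' || c == '-' {
            *index += 1;
        } else {
            break;
        }
    }
    Some(chars[start..*index].iter().collect())
}
```

A function this small is trivial to cover with unit tests (empty input, leading digit, trailing colon, and so on) before anything is built on top of it.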
Error Handling / Recovery
In my v3 I was greedy (and the issue in regex was because regex is greedy!), and I was trying to also implement error discovery and error recovery while parsing, all in one go. Turns out I am not as smart as I stupidly think, so this time I am ignoring error recovery etc., and just writing the happy case.
Having said that, this approach does not necessarily rule out adding these things afterwards. Currently every parser is returning an `Option<T>`, but in future we can make them return `PResult<T>`.
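The actual definition of `PResult<T>` isn't settled yet, but something along these lines (the variant names are my guess) would keep it structurally close to `Option<T>`:

```rust
// One possible shape for PResult<T>: Option<T> plus room for
// recoverable errors. Variant names and the error payload are
// placeholders, not a settled design.
#[derive(Debug, PartialEq)]
pub enum PResult<T> {
    Found(T),
    NotFound,
    Error(String), // later: a proper error type with span info
}

impl<T> From<Option<T>> for PResult<T> {
    // Existing Option-returning parsers can be lifted mechanically,
    // which is what should keep the refactor straightforward.
    fn from(o: Option<T>) -> PResult<T> {
        match o {
            Some(t) => PResult::Found(t),
            None => PResult::NotFound,
        }
    }
}
```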
And let's hope that refactor would be straightforward, which I feel it would be, as `PResult<T>` is quite similar to `Option<T>`.
Combinators
We have avoided using parser combinators, relying instead on our scanner abstraction. We have the low level items now; soon we will need higher level items.
A lower level parser, like the `identifier` parser, is supposed to be a strict parser: it will not eat up any preceding whitespace, for example. A low level parser will eat up whitespace only if it is part of the thing being parsed, e.g. `list< string>` (there is an extra space before `string`, which will be eaten).
When combining a bunch of such low level parsers, we need some higher level parser, e.g. `<any-number-of-space>--<any-number-of-space><kinded-name><any-number-of-space>:` is our section-init (e.g. `-- string foo:`). This can be procedurally parsed by inserting `scanner.skip_spaces()` calls for every `<any-number-of-space>`, but maybe we can also create `scanner.space_separated((p1, p2)) -> (FResult<P1>, FResult<P2>)`, basically a generic over tuple version, which takes a tuple of parsers and returns a corresponding tuple of `FResult`s.
That would be a very fun exercise, and probably this is why we will not do it, and will just implement `scanner.space_separated_2()` and `space_separated_3()` versions instead.
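A sketch of the non-generic two-parser version, assuming parsers are plain functions over a `(&[char], &mut usize)` cursor. The `dashdash` and `word` parsers below are hypothetical stand-ins for real low level parsers, and I return `Option` instead of `FResult` to keep the sketch self-contained:

```rust
// Skip any run of spaces at the cursor.
fn skip_spaces(chars: &[char], index: &mut usize) {
    while chars.get(*index) == Some(&' ') {
        *index += 1;
    }
}

// Run two parsers, skipping spaces before each one.
fn space_separated_2<A, B>(
    chars: &[char],
    index: &mut usize,
    p1: impl Fn(&[char], &mut usize) -> Option<A>,
    p2: impl Fn(&[char], &mut usize) -> Option<B>,
) -> (Option<A>, Option<B>) {
    skip_spaces(chars, index);
    let a = p1(chars, index);
    skip_spaces(chars, index);
    let b = p2(chars, index);
    (a, b)
}

// Hypothetical low level parsers for the usage example below.
fn dashdash(chars: &[char], index: &mut usize) -> Option<()> {
    if chars.get(*index) == Some(&'-') && chars.get(*index + 1) == Some(&'-') {
        *index += 2;
        Some(())
    } else {
        None
    }
}

fn word(chars: &[char], index: &mut usize) -> Option<String> {
    let start = *index;
    while chars.get(*index).map_or(false, |c| c.is_alphanumeric()) {
        *index += 1;
    }
    (*index > start).then(|| chars[start..*index].iter().collect())
}
```

With this, `space_separated_2(&chars, &mut i, dashdash, word)` on `"  -- foo"` yields both parts; the `_3` version is the same shape with one more parser argument, which is exactly the repetition a tuple-generic version would eliminate.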