In How we are parsing? 24th oct version (v3) I discussed how we are going to write a handwritten parser. I have been working on it for a while and decided to slightly tweak the approach to parsing.
Logos Not Really Needed
The first tweak is that I am getting rid of the logos based tokeniser and working directly on `Vec<char>`.
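To make that concrete, here is a minimal sketch of what a hand-rolled scanner over `Vec<char>` could look like. The type and method names (`Scanner`, `peek`, `pop`, `take_while`) are my placeholders, not necessarily the actual API:

```rust
// A hand-rolled scanner over Vec<char>: an index into the character
// buffer, plus a few primitives that parsers can be built on.
pub struct Scanner {
    chars: Vec<char>,
    index: usize,
}

impl Scanner {
    pub fn new(source: &str) -> Scanner {
        Scanner { chars: source.chars().collect(), index: 0 }
    }

    // Look at the current character without consuming it.
    pub fn peek(&self) -> Option<char> {
        self.chars.get(self.index).copied()
    }

    // Consume and return the current character.
    pub fn pop(&mut self) -> Option<char> {
        let c = self.peek()?;
        self.index += 1;
        Some(c)
    }

    // Consume characters while the predicate holds, returning them.
    pub fn take_while(&mut self, f: impl Fn(char) -> bool) -> String {
        let mut s = String::new();
        while let Some(c) = self.peek() {
            if !f(c) {
                break;
            }
            s.push(c);
            self.index += 1;
        }
        s
    }
}
```

No regex, no token disambiguation rules: whatever a parser consumes is exactly what its code says it consumes.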
To me the straw that broke the camel's back was this "fix". I have studied the Token disambiguation section of the Logos Handbook, and it's definitely not a bug in logos; I was using regex wrong. Logos uses greedy matching, and its rule says "longer beats shorter." I felt that `"foo:"` should match as `foo` and `Colon`, but I was wrong: the longer-beats-shorter rule is only applied at the start of the token, and the regex matches as much as it can, so I had to explicitly exclude the colon: `#[regex(r"[^\s^:]+")]`.
At this rate I would have to exclude all the special characters, as `[^\s]+` is too greedy.
The thing is, regex is hard, and logos came up with its own intricate behaviour built on top of regex, which makes it harder still. I was thinking "no big deal", I will just write tests for my tokens. But then I realised I am not really winning anything by using that library. It was just one really heavy library to keep in my head when trying to reason things out, when the benefit was really dubious.
Baby Steps
Another mistake I made in v3 was trying to parse top down. We are going to parse top down eventually, but I started writing the code top down as well. It was really hard to unit test a top down parser until I had parsed some meaningful chunk of the grammar, and in that attempt I wrote many functions without thorough testing. Eventually, when I encountered a meaningful test case, `-- foo:`, I had difficulty debugging the issue.
So this time I want to go bottom up, starting with the lowest level thing (in p1), which is an identifier. Writing lots of unit tests, and going one step up only after gaining confidence in what we have at hand, seems to be a better approach.
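As a sketch of the bottom-up starting point, a standalone identifier parser might look like the following. The identifier rules here (alphabetic or `_` start, then alphanumeric, `_`, or `-`) are my assumption, not necessarily p1's actual grammar:

```rust
// A strict, bottom-up identifier parser over a character buffer.
// It consumes no leading whitespace and returns None (without
// advancing) if no identifier starts at the current position.
fn identifier(chars: &[char], index: &mut usize) -> Option<String> {
    let first = *chars.get(*index)?;
    if !(first.is_alphabetic() || first == '_') {
        return None;
    }
    let start = *index;
    while let Some(&c) = chars.get(*index) {
        if c.is_alphanumeric() || c == '_' || c == '-' {
            *index += 1;
        } else {
            break;
        }
    }
    Some(chars[start..*index].iter().collect())
}
```

A function this small is trivial to cover with unit tests (empty input, leading digit, trailing colon, and so on) before anything is built on top of it.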
Error Handling / Recovery
In my v3 I was greedy (and the issue in regex was because regex is greedy!), and I was trying to also implement error discovery and error recovery while parsing, all in one go. Turns out I am not as smart as I stupidly think, so this time I am ignoring error recovery etc., and just writing the happy case.
Having said that, this approach does not necessarily rule out adding these things afterwards. Currently every parser is returning an `Option<T>`, but in future we can make them return `PResult<T>`.
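The actual definition of `PResult<T>` isn't settled yet, but something along these lines (the variant names are my guess) would keep it structurally close to `Option<T>`:

```rust
// One possible shape for PResult<T>: Option<T> plus room for
// recoverable errors. Variant names and the error payload are
// placeholders, not a settled design.
#[derive(Debug, PartialEq)]
pub enum PResult<T> {
    Found(T),
    NotFound,
    Error(String), // later: a proper error type with span info
}

impl<T> From<Option<T>> for PResult<T> {
    // Existing Option-returning parsers can be lifted mechanically,
    // which is what should keep the refactor straightforward.
    fn from(o: Option<T>) -> PResult<T> {
        match o {
            Some(t) => PResult::Found(t),
            None => PResult::NotFound,
        }
    }
}
```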
And let's hope that refactor would be straightforward, which I feel it would be, as `PResult<T>` is quite similar to `Option<T>`.
Combinators
We have avoided using parser combinators, relying instead on our scanner abstraction. We have the low level items now; soon we will need higher level items.
A lower level parser, like the `identifier` parser, is supposed to be a strict parser: it will not eat up any preceding whitespace, for example. A low level parser will eat up whitespace only if it is part of the thing being parsed, e.g. `list< string>` (there is an extra space before `string`, which will be eaten).
When combining a bunch of such low level parsers, we need some higher level parser, e.g. `<any-number-of-space>--<any-number-of-space><kinded-name><any-number-of-space>:` is our section-init (e.g. `-- string foo:`). This can be procedurally parsed by inserting `scanner.skip_spaces()` calls for every `<any-number-of-space>`, but maybe we can also create `scanner.space_separated((p1, p2)) -> (FResult<P1>, FResult<P2>)`, basically a generic over tuple version, which takes a tuple of parsers and returns a corresponding tuple of `FResult`s.
That would be a very fun exercise, and probably this is why we will not do it, and will just implement `scanner.space_separated_2()` and `space_separated_3()` versions instead.
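A sketch of the non-generic two-parser version, assuming parsers are plain functions over a `(&[char], &mut usize)` cursor. The `dashdash` and `word` parsers below are hypothetical stand-ins for real low level parsers, and I return `Option` instead of `FResult` to keep the sketch self-contained:

```rust
// Skip any run of spaces at the cursor.
fn skip_spaces(chars: &[char], index: &mut usize) {
    while chars.get(*index) == Some(&' ') {
        *index += 1;
    }
}

// Run two parsers, skipping spaces before each one.
fn space_separated_2<A, B>(
    chars: &[char],
    index: &mut usize,
    p1: impl Fn(&[char], &mut usize) -> Option<A>,
    p2: impl Fn(&[char], &mut usize) -> Option<B>,
) -> (Option<A>, Option<B>) {
    skip_spaces(chars, index);
    let a = p1(chars, index);
    skip_spaces(chars, index);
    let b = p2(chars, index);
    (a, b)
}

// Hypothetical low level parsers for the usage example below.
fn dashdash(chars: &[char], index: &mut usize) -> Option<()> {
    if chars.get(*index) == Some(&'-') && chars.get(*index + 1) == Some(&'-') {
        *index += 2;
        Some(())
    } else {
        None
    }
}

fn word(chars: &[char], index: &mut usize) -> Option<String> {
    let start = *index;
    while chars.get(*index).map_or(false, |c| c.is_alphanumeric()) {
        *index += 1;
    }
    (*index > start).then(|| chars[start..*index].iter().collect())
}
```

With this, `space_separated_2(&chars, &mut i, dashdash, word)` on `"  -- foo"` yields both parts; the `_3` version is the same shape with one more parser argument, which is exactly the repetition a tuple-generic version would eliminate.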