Implement a typed recursive ascent-descent backend #174
Conversation
This commit implements code generation for strongly-typed, continuation-based, directly-executable parsers, as described by Hinze and Paterson 2005. In addition, generated code can either be in LALR or RAD (Recursive Ascent-Descent, Horspool 1991) form. Recursive Ascent-Descent significantly reduces the number of states while maintaining the power of LR/LALR parsers.
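For readers unfamiliar with the style, here is a drastically simplified, hypothetical sketch of the continuation-based idea (toy grammar and names invented here; this is not the Hinze/Paterson construction itself, which derives one well-typed function per LR state):

```haskell
-- Toy grammar:  S -> 'a' S | 'b'
-- The semantic value (the count of 'a's) is delivered through a typed
-- continuation instead of an untyped value stack, so the parser is
-- directly executable and strongly typed.
parseS :: String -> (Int -> String -> r) -> r -> r
parseS ('a' : ts) k err = parseS ts (\n rest -> k (n + 1) rest) err  -- shift 'a'
parseS ('b' : ts) k _   = k 0 ts                                     -- reduce S -> b
parseS _          _ err = err                                        -- parse error

-- Run the parser, demanding that all input is consumed.
runS :: String -> Maybe Int
runS ts = parseS ts (\n rest -> if null rest then Just n else Nothing) Nothing
```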
Use • for rules and · for items (instead of . and .). Also use _ instead of |- for the lhs of an artificial item.
When specifying --optims in addition to --cb-rad, some optimizations are applied to the produced code:
- All rule functions are marked with INLINE
- All applications of goto-functions and k-functions are eta-expanded
… as in the thesis
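A hypothetical illustration of the two optimizations (function names invented, not taken from the generated code):

```haskell
-- Without the optimization, a rule/goto function might be bound point-free,
-- e.g.  rule3 = \k -> k . succ  -- arity hidden behind a closure.
-- Eta-expanding makes both arguments explicit, exposing the arity to GHC
-- and avoiding a partial-application closure at each call site; the INLINE
-- pragma additionally asks GHC to inline the rule function at its uses.
{-# INLINE rule3 #-}
rule3 :: (Int -> r) -> Int -> r
rule3 k x = k (x + 1)   -- eta-expanded: continuation k and value x are explicit
```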
A bit of review feedback to get the discussion rolling. Generally quite nice, but we need to try harder to blend in with and reuse code of the existing implementation.
* Add two new backends:
  * Continuation-based LALR(1) (recursive ascent)
  * Continuation-based RAD(1) (recursive ascent-descent)
* RAD generally produces smaller compiled code using less states
... than a recursive ascent parser, which is an irrelevant comparison if all you know is the table-based LALR code generation scheme. Which is in fact smaller, so this is a bit misleading.
@@ -161,12 +161,14 @@ executable happy
    build-depends: base < 5,
                   array,
                   containers >= 0.4.2,
                   dom-lt >= 0.2.2,
NB for other reviewers: A new non-boot dependency (not sure if that's a big deal)
It is some kind of a deal
- https://matrix.hackage.haskell.org/#/package/happy
- https://matrix.hackage.haskell.org/#/package/dom-lt
Currently `happy` is buildable with virtually any GHC you can get your hands on. With `dom-lt` that won't be true anymore.
Hmm. I wonder if we could remove the lower bound. Then we'd be fine, I think. AFAICT, the lower bound is because of AndreasPK/dom-lt#2, but if 0.2.0 and 0.2.1 were marked broken, the solver wouldn't pick these.
That issue says that `dom-lt-0.1.3` is broken as well, so that won't work.

It might be easier to say that `happy-next` works only with GHC 8.0+. I think that is fine: not a small issue, but not a huge one either.
Why exactly is `dom-lt-0.2.2` not compatible with GHC 7? Maybe this can be dealt with in `dom-lt`?
> Why exactly is dom-lt-0.2.2 not compatible with GHC-7

Maybe the code would work. I honestly just haven't tried it with GHC 7, as I haven't used that old a GHC in years by now.

If someone puts in the effort to make it work and test it with GHC 7, I have no issue lowering the bounds.
@@ -0,0 +1,73 @@
module Follow where
It's pretty uncommon in the Haskell code I've seen so far to indent declarations after the `module` declaration with two spaces. I'd prefer we stick to the formatting conventions that are already present in `happy`.

Also, I'd like to see an explicit export list, which makes it simpler to understand what the API is.
@@ -186,6 +188,11 @@ executable happy
    AttrGrammarParser
    ParamRules
    PrettyGrammar
    RADCodeGen
    RADCodeGen_LALR
Having thought about it a bit, I'd prefer it if you named the LALR backend "RA" for recursive ascent instead (so perhaps `RACodeGen`), because that's exactly what it is.

I'm also unsure whether it's actually worth integrating the vanilla RA backend. I don't see a reason to pick it over the RAD backend.

Relatedly: the other backends seem to be named according to the convention `Produce*Code`. Stick to that, so `ProduceRADCode`, etc.
True. There seems to be no advantage of the recursive ascent backend over the recursive ascent-descent backend.
>     Option [] ["cb-rad"] (NoArg OptCB_RAD)
>       "create a continuation-based Recursive Ascent-Descent parser. Not compatible with most other options",
>     Option [] ["cb-rad-tuple"] (NoArg OptCB_RAD_TupleBased)
>       "same as cb-rad, but uses tuples instead of continuations inside rule functions",
I don't think we want to integrate the `RAD-Tuple` backend. There's no advantage to it; it was merely an experiment. Kill it.
>     options = LALR.GenOptions {
>       LALR.ptype = ptype,
>       LALR.wrapperType = if parserType == "Parser" then "HappyP" else "Parser",
Well, this can't be right, can it? The info you seek is encoded in the `Grammar`'s `monad` field.
No, this wrapperType is a wrapper around the user-supplied monad type (see 4.2.1 in here: the user-defined monad is called `P`, and we define a wrapper and call it `Parser`). This wrapper is always called "Parser", except when the `Grammar`'s `monad` field is also called "Parser"; then we call the wrapper type "HappyP".
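That naming rule could be sketched as follows (hypothetical helper, not the PR's actual code):

```haskell
-- Pick a name for the generated wrapper around the user's monad:
-- "Parser" by default, "HappyP" when the user's monad is itself
-- already named "Parser", avoiding a name clash.
wrapperName :: String -> String
wrapperName userMonad
  | userMonad == "Parser" = "HappyP"
  | otherwise             = "Parser"
```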
    ptype :: ParserType,

    wrapperType :: String,    -- e.g. "Parser"
    errorTokenType :: String, -- e.g. "ErrorToken"
Is this something that needs configuring? What do the other Backends do?
No, this doesn't really need to be configured; we just need to choose some string for the dummy type. We could also fully remove the ErrorToken dummy type; this would change the number of arguments of affected continuations and semantic actions (so we would have to be more careful at other places), but should still work.

But I do prefer using the dummy ErrorToken type. Just as in the other happy backends, it cannot be directly accessed by the user in semantic actions (in fact it can, but has no sensible value).
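A dummy error-token type along these lines might look like this (a sketch with an invented definition, not the generated code):

```haskell
-- Placeholder type filling the error token's slot in continuations and
-- semantic actions: users can mention it in patterns, but it carries no
-- sensible value.
data ErrorToken = ErrorToken deriving (Eq, Show)
```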
data GenOptions = GenOptions {
    ptype :: ParserType,

    wrapperType :: String, -- e.g. "Parser"
This is hardly enough, I think. See my comments above
import GHC.Arr ((!), indices)

data ParserType = Normal | Monad | MonadLexer deriving (Eq, Show)
You probably need more modes than that
The `ParserType` enumeration only reflects the operating modes introduced via the presence of `%monad` and `%lexer` directives. These affect whether wrapper types are required (because of lexerWrapper / repeatTok etc.) and some other places of code generation.

Therefore, we would need at most four of these values. Currently, only `Normal` and `MonadLexer` are really supported by my implementation, and the other options (`Monad` and a possible `Lexer`) still need to be implemented. I could add this to the TBD list if requested. But I haven't found any happy grammars using either of `%monad` or `%lexer` without also using the other. Therefore I decided, for the moment, to just support `Normal` and `MonadLexer`.
genCode opts x states action goto = do
  return $ newlines 3 [languageFeatures, header', entryPoints', definitions', rules', parseNTs', parseTerminals', states', actions', footer'] where
  languageFeatures
    | rank2Types opts = newline $ map extension ["RankNTypes", "ScopedTypeVariables"]
I think that the user should activate `-XRankNTypes` if needed, although I'm not so sure about `ScopedTypeVariables`...

Maybe you should grep for a line starting with `{-#` and containing `RankNTypes` to insert `ScopedTypeVariables` and to activate the annotation of type signatures in general.
This PR extends `happy` by a Recursive Ascent-Descent backend and therefore implements #167. In particular, the following options are added:

- `--cb-lalr`: use `happy`'s LALR states to create code in well-typed, continuation-based style. This style is due to Hinze and Paterson as described in Derivation of a Typed Functional LR Parser.
- `--cb-rad`: produce a recursive ascent-descent (RAD) parser, also using well-typed, continuation-based style. The RAD states are generated based on `happy`'s LALR states.
- `--cb-rad --optims`: optimize the generated code for speed by employing some optimizations.
- `--types`, `--comments`: annotate all functions with their explicit type (if possible) / with a comment. Sometimes, explicit types are required in order for GHC to build the generated parser.

In the following, I'll summarize the intention behind using recursive ascent-descent parsing, how it interacts with `happy`, and what advantages it brings.

TL;DR:

- Replacing the current, `happy`-generated Haskell parser of the Glasgow Haskell Compiler with a recursive ascent-descent variant speeds up parsing of Haskell files by GHC by around 10%.
- In smaller grammars (e.g. JSON), recursive ascent-descent parsers beat `happy`-generated ones by a factor of 3x to 4x.

The remainder is structured as follows:

- What is Recursive Ascent-Descent?
- Implementation
- Compatibility with existing `happy`-features
- Speed And Size Comparison
- TBD

What is Recursive Ascent-Descent?
A recursive descent parser contains one parsing routine per nonterminal. This routine decides, based on the lookahead token, which production to follow. Then, all symbols of this production's right-hand side are parsed consecutively.
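To make the "one routine per nonterminal, production chosen by lookahead" idea concrete, here is a minimal recursive-descent routine for an invented toy grammar (not from the PR):

```haskell
-- Toy grammar:  Expr -> 'x' | '(' Expr ')'
-- The routine inspects the lookahead character to pick a production,
-- then parses that production's right-hand side consecutively.
parseExpr :: String -> Maybe (String, String)  -- (parsed text, remaining input)
parseExpr ('x' : rest) = Just ("x", rest)
parseExpr ('(' : rest) = do
  (e, rest') <- parseExpr rest               -- parse the body top-down
  case rest' of
    ')' : rest'' -> Just ("(" ++ e ++ ")", rest'')
    _            -> Nothing                  -- missing closing paren
parseExpr _ = Nothing                        -- lookahead matches no production
```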
In contrast, LALR parsers can pursue multiple productions at once. An LALR state can have multiple core items, all of which describe a possible position inside a production in which the parser could currently be.
LALR parsers are very often found in table-based form. There also exists recursive ascent – similar to recursive descent, a recursive ascent parser has one function per state. Shift and goto actions are executed by calling a different (or the same) state function, while reduce actions pop the function call stack after calling a semantic action.
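A hand-written toy in the spirit of recursive ascent (invented example; for this right-recursive grammar the state function happens to coincide with a descent routine, but the shift-as-call / reduce-as-return shape is visible):

```haskell
-- Toy grammar:  S -> 'a' S | 'b'
-- Each LR state becomes a function; a shift is a call to the successor
-- state's function, and a reduce returns to the caller carrying a
-- semantic value (here: the number of 'a's shifted).
state0 :: String -> Maybe (Int, String)
state0 ('a' : rest) = do
  (n, rest') <- state0 rest   -- shift 'a', goto state0
  Just (n + 1, rest')         -- reduce S -> a S
state0 ('b' : rest) = Just (0, rest)   -- shift 'b', reduce S -> b
state0 _ = Nothing                     -- no action on this lookahead
```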
Because of its direct, function-based nature, a recursive ascent parser is often faster than a corresponding table-based parser. This comes with a burden: the compiled code is much larger. As an example, the table-based, `happy`-generated LALR parser of GHC is 2.5 MB; a recursive-ascent form is a little faster, but is 33 MB in size.

Now comes recursive ascent-descent: by merging and reusing similarities of multiple LALR states, the remaining states can be made smaller or partly unnecessary. This works by splitting each production at its recognition point: up to the recognition point, parsing proceeds bottom-up using the RAD states, similar to LALR parsing. Once the recognition point of a rule is reached, it can be decided – based on the lookahead token – whether the rule is active or not. Remember: in an SLL grammar, this can be decided at the very beginning of each rule. After deciding that a rule is active, its remaining symbols are parsed consecutively, in a top-down fashion.
Bottom-up is only used while required, and the switch to top-down happens as early as possible.
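A hypothetical sketch of splitting a production at its recognition point (toy grammar, invented names, and an assumed recognition point; not the generated code):

```haskell
-- Toy production:  Pair -> '(' Item ',' Item ')'
-- Assume the recognition point sits directly after '(': once '(' has been
-- shifted bottom-up, the lookahead alone tells us the Pair rule is active,
-- so the tail  Item ',' Item ')'  is parsed top-down.
parseItem :: String -> Maybe (Char, String)
parseItem (c : rest) | c `elem` "xyz" = Just (c, rest)
parseItem _ = Nothing

-- Top-down tail of the production, entered after the recognition point.
pairTail :: String -> Maybe ((Char, Char), String)
pairTail s0 = do
  (a, s1)    <- parseItem s0
  (',' : s2) <- Just s1
  (b, s3)    <- parseItem s2
  (')' : s4) <- Just s3
  Just ((a, b), s4)

-- Bottom-up entry: shift '(', then hand off to the top-down tail.
parsePair :: String -> Maybe ((Char, Char), String)
parsePair ('(' : rest) = pairTail rest
parsePair _ = Nothing
```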
Consider a nonterminal `Value` which appears in the right-hand side of multiple rules. Every time it appears after the recognition point, this `Value` parse proceeds identically. In an LALR parser, each of these parsing situations would require separate state sequences, whereas a RAD parser extracts them into a single procedure – `parseValue`. This procedure begins by entering into an entry state – the entry state of `Value`. Then, bottom-up parsing proceeds until the `Value` has fully been parsed and control is passed back to the calling rule procedure.

What is this good for? Because of the great savings of states and actions, a recursive ascent-descent parser is much smaller than a corresponding recursive ascent parser. Take another look at GHC's Haskell parser:
The RAD parser is, of course, still larger than the status-quo, table-based, happy-generated parser, but not by a factor of 14 but only of 3. This goes along with a speedup of around 10%, which comes through the continuation-based form that we have chosen for code generation.
For more detailed explanations of what RAD is, how it works, how RAD states are generated from LALR states, how recognition points are found, how continuation-based code is generated and used in conjunction with happy, and for performance comparisons with LALR parsers, I refer to my bachelor's thesis: A Typed Recursive Ascent-Descent Backend for Happy.
Originally, recursive ascent-descent parsing was described by Horspool and builds upon Generalized Left Corner parsing.
Implementation
We briefly describe how we implemented the new options.
- `--cb-lalr`: uses `happy`'s LALR states to create code in well-typed, continuation-based style. This code is often faster than code generated by happy, but much larger.
  - `happy` is executed normally, until conflict reporting and info file creation.
  - The code is then generated in `RADCodeGen_LALR.hs`.
- `--cb-rad`: creates a RAD parser in well-typed, continuation-based style. This code is often faster than code generated by happy, and not much larger.
  - `happy` is executed normally, until conflict reporting and info file creation.
  - `happy`'s LALR states are used to determine the recognition points of all productions.
  - The RAD states are then generated from `happy`'s LALR states (in `RADStateGen.hs`). This process respects features like `happy`'s conflict resolution and the error token. It is described in detail here.
  - Code is generated similarly to the `--cb-lalr` case (`RADCodeGen.hs`). Code generation is described in detail here.
- `--cb-rad --optims`: we employ further optimizations to further speed up the generated parser. Currently, these include rule function inlining and eta-expansion of continuation functions.
: we employ further optimizations to further speed up the generated parser. Currently, these include rule function inlining and eta-expansion of continuation functions.Compatibility with existing
happy
-featuresOur backend supports most of
happy
's features and directives. It supports all options that are used in GHC's Haskell grammar and is therefore powerful enough to produce a Haskell parser.Especially, because
happy
's LALR states (i.e. action and goto tables) are used for RAD state generation, conflict resolution happens just as before, and directives like%left
or%expect
work as desired.Monadic parsers and lexers are supported – the generated code is adapted to the monadic style. Partial parsers are also supported, just like the error token.
Features that are not (yet) supported are:

- `%monad` without `%lexer`, or vice versa. It doesn't seem anyone is doing this, therefore we didn't really care about this case.
- `{%^ ... }` and `{%% ... }` semantic actions. If these are desired, they could be implemented. Of course, normal monadic actions (`{% ... }`) are supported.
- `happy`-specific options like `-a`, `-c`, and `-g` have no equivalent in continuation-based code. We are however currently working on a `--strict` implementation.

Speed And Size Comparison
The detailed results can be seen here. We just briefly look at two results now: small grammars, and GHC.

First we look at two small grammars – an expression grammar and a JSON grammar. We compare the result of the fastest `happy`-generated parser (which could be `happy`, `happy -acg` or `happy -acg --strict`) with our `--cb-rad` and `--cb-rad --optims` parsers. The following table shows the average plain parsing times (using a large input file), without lexing.

[Table: parsing times for the `happy`, `--cb-rad`, and `--cb-rad --optims` parsers; the numbers are not recoverable from this extract.]
Our parser beats the `happy`-generated ones by a factor of 3-4.

Now we consider GHC. We built GHC 8.10.1 with its normal `happy`-generated parser, and with a Parser.hs in RAD-form. Then we used `-ddump-timings` to obtain the parsing times when parsing Haskell files. We parsed four large files. Each time we compared both the stage-1 and the stage-2 parser.

[Table: parsing times for the `happy -acg --strict` and `--cb-rad --optims` parsers; the numbers are not recoverable from this extract.]
We believe that we can expect even higher speed-ups by employing further optimizations. Especially, there are some performance regressions from stage-1 to stage-2 using our RAD parsers (file 1 and file 4) that do not happen for the `happy`-generated parsers. Getting rid of these regressions would mean higher speedups.

Finally, here is the above table again, comparing the Parser.o sizes:

[Table: Parser.o sizes for the `--cb-lalr` and `--cb-rad --optims` parsers; the numbers are not recoverable from this extract.]
TBD

Some time, before or after merging this PR, some things must still be done:

- `-i`. Currently, no RAD-specific information is given to the user (except when they use the `--comments` flag and look at the generated code).
- `--strict` flag
- repeatTok/lexerWrapper call pairs