Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider writing scanners by hand #167

Open
jgm opened this issue Dec 2, 2016 · 9 comments
Open

Consider writing scanners by hand #167

jgm opened this issue Dec 2, 2016 · 9 comments

Comments

@jgm
Copy link
Member

jgm commented Dec 2, 2016

Currently we generate a number of scanners from regexes using re2c.

This has two advantages:

  1. It is easy to see at a glance what the scanners do, and easy to modify them.
  2. The scanners generated are DFAs and should perform well. Since re2c is well tested, this also lowers testing burden.

Disadvantage: Either we require re2c as a build dependency, or we have to ship a gigantic scanners.c file.

Should we simply hand-write the scanners and dispense with re2c and scanners.re?

@Knagis
Copy link
Contributor

Knagis commented Dec 2, 2016

If you decide to rewrite them by hand, perhaps my code could be useful:

https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/Scanner.cs
https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/ScannerHtmlTag.cs

I did rewrite them by hand, it wasn't the hardest part when porting the parser, as for the maintenance, I haven't had any problems with the few changes that were required so far.

@tin-pot
Copy link

tin-pot commented Dec 3, 2016

Would flex provide a better trade-off between convenience and amount of generated source code? (IIRC flex uses tables to represent DFAs, not thousands of if () goto lines.)

But generally hand-coded scanners are IMO preferrable, for size and configurability reasons (ie, such scanning is easier to influence by command-line options to the cmark executable etc.). But this can still be done when the syntax is stable and "frozen".

@kainjow
Copy link
Contributor

kainjow commented Dec 3, 2016

I say go for it. Doesn't need to be all or nothing. Could just do a few functions at a time. May be some possible speed improvements as well.

@MathieuDuponchelle
Copy link
Contributor

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

@jgm
Copy link
Member Author

jgm commented Dec 5, 2016 via email

@MathieuDuponchelle
Copy link
Contributor

@jgm, unless there's an actual technical reason to do so, I really don't see why we should do that to be honest. If it took multiple gigs of memory to compile this file for example, that could be a compelling reason, but I don't see the point in "compactness" for the sake of it, I'd much rather have easy-to-read sources to be honest.

@tin-pot
Copy link

tin-pot commented Dec 5, 2016

While I find the ~30000 lines (~400 KB) sized scanners.c a bit hefty, the tools do handle it easily. I once looked into a linker map file and found that the scanners.c contributed around 50 KB to the executable size, IIRC.

Re-generating this file (and comitting it) has never bothered me.

So all in all, I don't loose sleep over it, at least not for the time being ;-)

@nwellnhof
Copy link
Contributor

There's also issue #121 but this is caused by a GCC bug. The biggest problem I see is that re2c generates needlessly repetitive code for quantifiers like {1,31} which is the main cause of the large object code size.

@jgm
Copy link
Member Author

jgm commented Dec 5, 2016 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants