Consider writing scanners by hand #167

jgm · 2016-12-02T10:53:38Z

Currently we generate a number of scanners from regexes using re2c.

This has two advantages:

It is easy to see at a glance what the scanners do, and easy to modify them.
The scanners generated are DFAs and should perform well. Since re2c is well tested, this also lowers testing burden.

Disadvantage: Either we require re2c as a build dependency, or we have to ship a gigantic scanners.c file.

Should we simply hand-write the scanners and dispense with re2c and scanners.re?

Knagis · 2016-12-02T16:49:55Z

If you decide to rewrite them by hand, perhaps my code could be useful:

https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/Scanner.cs
https://github.com/Knagis/CommonMark.NET/blob/master/CommonMark/Parser/ScannerHtmlTag.cs

I did rewrite them by hand, it wasn't the hardest part when porting the parser, as for the maintenance, I haven't had any problems with the few changes that were required so far.

tin-pot · 2016-12-03T10:51:57Z

Would flex provide a better trade-off between convenience and amount of generated source code? (IIRC flex uses tables to represent DFAs, not thousands of if () goto lines.)

But generally hand-coded scanners are IMO preferrable, for size and configurability reasons (ie, such scanning is easier to influence by command-line options to the cmark executable etc.). But this can still be done when the syntax is stable and "frozen".

kainjow · 2016-12-03T19:57:15Z

I say go for it. Doesn't need to be all or nothing. Could just do a few functions at a time. May be some possible speed improvements as well.

MathieuDuponchelle · 2016-12-05T12:41:24Z

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

jgm · 2016-12-05T13:14:24Z

+++ Mathieu Duponchelle [Dec 05 16 04:41 ]:

What is your exact concern with shipping this scanners.c file ? My only concern about this would be version control, but I'm pretty sure there are ways around it.

It's not a big problem. But it's a very big source file. Some people object to that, it seems (see the md4c announcement). (There's also the problem that it's a generated file in version control, but that hasn't been a big problem so far.)

MathieuDuponchelle · 2016-12-05T13:20:08Z

@jgm, unless there's an actual technical reason to do so, I really don't see why we should do that to be honest. If it took multiple gigs of memory to compile this file for example, that could be a compelling reason, but I don't see the point in "compactness" for the sake of it, I'd much rather have easy-to-read sources to be honest.

tin-pot · 2016-12-05T13:25:52Z

While I find the ~30000 lines (~400 KB) sized scanners.c a bit hefty, the tools do handle it easily. I once looked into a linker map file and found that the scanners.c contributed around 50 KB to the executable size, IIRC.

Re-generating this file (and comitting it) has never bothered me.

So all in all, I don't loose sleep over it, at least not for the time being ;-)

nwellnhof · 2016-12-05T19:12:39Z

There's also issue #121 but this is caused by a GCC bug. The biggest problem I see is that re2c generates needlessly repetitive code for quantifiers like {1,31} which is the main cause of the large object code size.

jgm · 2016-12-05T20:40:05Z

I regenerated scanners.c using re2c 0.16, which includes some dfa minimization. That cut off 65K+ from the generated source (though it's still pretty big).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider writing scanners by hand #167

Consider writing scanners by hand #167

jgm commented Dec 2, 2016

Knagis commented Dec 2, 2016

tin-pot commented Dec 3, 2016

kainjow commented Dec 3, 2016

MathieuDuponchelle commented Dec 5, 2016

jgm commented Dec 5, 2016 via email

MathieuDuponchelle commented Dec 5, 2016

tin-pot commented Dec 5, 2016

nwellnhof commented Dec 5, 2016

jgm commented Dec 5, 2016 via email

Consider writing scanners by hand #167

Consider writing scanners by hand #167

Comments

jgm commented Dec 2, 2016

Knagis commented Dec 2, 2016

tin-pot commented Dec 3, 2016

kainjow commented Dec 3, 2016

MathieuDuponchelle commented Dec 5, 2016

jgm commented Dec 5, 2016 via email

MathieuDuponchelle commented Dec 5, 2016

tin-pot commented Dec 5, 2016

nwellnhof commented Dec 5, 2016

jgm commented Dec 5, 2016 via email