[ fixed #119 ] latin1 encoding: each byte counts as 1 char #156

andreasabel · 2020-01-26T20:12:51Z

The computation of the length component of AlexToken was tailored to
the utf8 encoding, and didn't work correctly for latin1.

This is fixed by having a new flag ALEX_LATIN1 in
templates/GenericTemplate.hs that turns on code that increases the
length by 1 for each byte, while for utf8 something more sophisticated
is done.
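The difference between the two modes can be sketched as follows. This is a minimal illustration, not the code in templates/GenericTemplate.hs: the names `latin1Step`, `utf8Step` and `tokenLength` are hypothetical, and the utf8 rule shown (count only bytes that are not continuation bytes) is an assumption about what "something more sophisticated" amounts to.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Hypothetical helper: under latin1 every byte is exactly one character.
latin1Step :: Word8 -> Int
latin1Step _ = 1

-- Hypothetical helper: under utf8 a continuation byte (bit pattern 10xxxxxx)
-- belongs to a character that was already counted, so it contributes 0.
utf8Step :: Word8 -> Int
utf8Step b
  | b .&. 0xC0 == 0x80 = 0
  | otherwise          = 1

-- Character length of a token, given its bytes and a per-byte step function.
tokenLength :: (Word8 -> Int) -> [Word8] -> Int
tokenLength step = sum . map step
```

For the two bytes 0xC3 0xA9 (the utf8 encoding of "é"), `tokenLength utf8Step` gives 1, while `tokenLength latin1Step` gives 2, since in latin1 the same bytes are two separate characters.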

The fix requires more template instances to be generated. To streamline
the instance generation, now all 2^4 = 16 template instances are
generated for the 4 flags

- ghc
- latin1
- nopred
- debug

To ensure that both programs refer to the same template instance, the
function

  templateFileName

which is duplicated in src/Main and gen-alex-sdist/Main, needs to be
kept in sync should more dimensions be added to the template.
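As a rough sketch of the idea, the four flags can be enumerated to produce all 2^4 = 16 instance names. The `Flags` record, the `AlexTemplate` base name and the suffix scheme below are assumptions made for illustration. The actual signature and naming of templateFileName in src/Main and gen-alex-sdist/Main may differ.

```haskell
import Control.Monad (replicateM)
import Data.List (intercalate)

-- Hypothetical record holding the four template dimensions.
data Flags = Flags { ghc, latin1, nopred, debug :: Bool }

-- Hypothetical naming scheme: base name plus one suffix per enabled flag,
-- always in the same order so that every caller derives the same file name.
templateFileName :: Flags -> FilePath
templateFileName f =
  intercalate "-" ("AlexTemplate" : [ name | (True, name) <- suffixes ])
  where
    suffixes =
      [ (ghc f, "ghc"), (latin1 f, "latin1")
      , (nopred f, "nopred"), (debug f, "debug") ]

-- All 2^4 = 16 combinations of the four flags.
allFlags :: [Flags]
allFlags = [ Flags g l n d | [g, l, n, d] <- replicateM 4 [False, True] ]

-- The file names of all template instances to generate.
allTemplateNames :: [FilePath]
allTemplateNames = map templateFileName allFlags
```

Adding a fifth dimension would mean extending both the record and the suffix list, which is why the two copies of the function have to change together.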

(Putting this function into a separate file that is included by both
modules could be an option, but seemed not enough in the spirit of
cabal-organized projects.)


simonmar · 2020-01-27T08:15:21Z

Nice. Thanks!

mtolly · 2021-01-10T03:48:54Z

Hi, it looks like this (and some other merges) were not included in the recent Alex 3.2.6 release. Understandable since it was a stopgap for a GHC release.

This fix to the Latin-1 mode would be helpful in order to fix a language-c (and thus c2hs) issue: visq/language-c#72

Any info on when a new release can happen with some of these PRs that have been merged since 3.2.5?

Ericson2314 · 2021-03-18T02:15:49Z

Yes, I suppose I should release another now that GHC is finally using 3.2.5. I did want to finish #174 first, I guess I should get on that.

andreasabel requested a review from simonmar January 26, 2020 20:13

andreasabel mentioned this pull request Jan 26, 2020

[ #71 ] warn about nullable regexs in the absence of start codes #155

Merged

simonmar merged commit 574ec8c into haskell:master Jan 27, 2020

andreasabel deleted the issue119 branch January 31, 2020 18:22

mtolly mentioned this pull request Jan 10, 2021

GCC preprocessor output generated in non-ASCII locales cannot be processed visq/language-c#72

Open
