GCC preprocessor output generated in non-ASCII locales cannot be processed #72

Open
arsdragonfly opened this issue May 5, 2020 · 2 comments

Comments

@arsdragonfly

see this issue

@expipiplus1
Collaborator

Hopefully this should just be a simple change in the lexer. PRs welcome!

@mtolly

mtolly commented Jan 10, 2021

So, I looked into this and I think I found the fix, but Alex might need to release a bug fix first.

I saved the sample from the linked issue as a UTF-8 file:

# 1 "test.c"
# 1 "<built-in>"
# 1 "<命令行>"
# 31 "<命令行>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<命令行>" 2
# 1 "test.c"
int main()
{
 return 0;
}

And sure enough, I got Prelude.head: empty list. The error comes from the second use of head at this location, and is caused by the first non-ASCII line, # 1 "<命令行>".

Basically the problem is that Alex is assuming the input bytestring is UTF-8, but the InputStream is a byte-by-byte abstraction (effectively Latin-1). In these lines:

\#$space*@digits$space*(\"($infname|@charesc)*\"$space*)?(@int$space*)*\r?$eol
  { \pos len str -> setPos (adjustLineDirective len (takeChars len str) pos) >> lexToken' False }

Alex is passing 12 for len, which is the correct Unicode code point length of # 1 "<命令行>" plus a newline at the end. But takeChars then takes 12 bytes off the bytestring, and in UTF-8 that line is 18 bytes long, because each of the three CJK characters takes 3 bytes. So adjustLineDirective receives a broken string that is cut off after 令 and does not include the double quote at the end.
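
To make the mismatch concrete, here is a minimal standalone sketch that needs only base. It is not the language-c or Alex code: utf8Encode, utf8Bytes, latin1View and takeBytes are hypothetical helpers, with takeBytes standing in for a takeChars that consumes one byte per "character". It assumes the lexer counts code points while the stream hands out bytes.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: encode a single code point as UTF-8 bytes.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical helper: the UTF-8 byte sequence of a whole string.
utf8Bytes :: String -> [Word8]
utf8Bytes = concatMap utf8Encode

-- Byte-per-character (Latin-1) view of a byte stream.
latin1View :: [Word8] -> String
latin1View = map (chr . fromIntegral)

-- Simplified stand-in (an assumption, not the real takeChars):
-- take n "characters" from a stream that is really a list of bytes.
takeBytes :: Int -> [Word8] -> [Word8]
takeBytes = take

-- The offending line directive: # 1 "<命令行>" followed by a newline.
sample :: String
sample = "# 1 \"<\x547D\x4EE4\x884C>\"\n"

main :: IO ()
main = do
  let bytes     = utf8Bytes sample
      len       = length sample        -- code points, as counted in UTF-8 mode
      truncated = takeBytes len bytes  -- what a byte-wise consumer receives
  print len                                     -- 12
  print (length bytes)                          -- 18
  print (latin1View truncated)                  -- ends in the middle of the file name
  print (length (filter (== 0x22) truncated))   -- 1: only the opening quote survives
  print (length (latin1View bytes))             -- 18: a byte-wise count covers the line
```

The last line is the point of the Latin-1 approach: if the lexer counts bytes instead of code points, the length it reports matches what a byte-oriented stream actually consumes.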

The correct fix is to put Alex back into Latin-1 mode (my impression is that this was the default previously, but was then switched in Alex 3.0). This is done with the %encoding "latin1" directive (added in Alex 3.1.7). However, that alone still doesn't work, because a remaining bug in Alex's character counting makes it keep passing the too-short length. That bug was fixed in haskell/alex#156, but even though the fix was merged a year ago, it appears not to have made it into the recent Alex 3.2.6. So, I'll ping that PR to see when it can be released.
