As the documentation says, Alex works over a stream of UTF-8 encoded bytes, retrieved one at a time by `alexGetByte`:

> Lexer specifications are written in terms of Unicode characters, but Alex works internally on a UTF-8 encoded byte sequence.
>
> Depending on how you use Alex, the fact that Alex uses UTF-8 encoding internally may or may not affect you. If you use one of the wrappers (below) that takes input from a Haskell String, then the UTF-8 encoding is handled automatically. However, if you take input from a ByteString, then it is your responsibility to ensure that the input is properly UTF-8 encoded.
From an external viewpoint as a consumer (I am not familiar with how Alex is implemented), this seems like a strange design decision to me. If my source is already a sequence of characters, like `String`, or already UTF-8 encoded, like `Data.Text(.Lazy).Text` (with the new text 2.0 release), it seems that in order to run Alex on it, I (or a wrapper) would have to write logic that re-encodes the content, character by character, into individual UTF-8 bytes, just so that Alex can walk over the same UTF-8 sequence internally.
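For concreteness, here is roughly the shape of the logic I mean (just a sketch; the `AlexInput` record and helper names are my own illustration, and only `alexGetByte`'s role comes from the documented byte-based interface):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.Text as T

-- Illustrative input state: the remaining Text plus any UTF-8 bytes
-- already produced for the most recently consumed Char.
data AlexInput = AlexInput
  { pendingBytes  :: [Word8]  -- bytes of the current Char not yet handed to Alex
  , remainingText :: T.Text
  }

-- Encode a single Char as its UTF-8 byte sequence.
utf8Encode :: Char -> [Word8]
utf8Encode c = map fromIntegral (go (ord c))
  where
    go n
      | n <= 0x7F   = [n]
      | n <= 0x7FF  = [0xC0 .|. (n `shiftR` 6), 0x80 .|. (n .&. 0x3F)]
      | n <= 0xFFFF = [ 0xE0 .|. (n `shiftR` 12)
                      , 0x80 .|. ((n `shiftR` 6) .&. 0x3F)
                      , 0x80 .|. (n .&. 0x3F) ]
      | otherwise   = [ 0xF0 .|. (n `shiftR` 18)
                      , 0x80 .|. ((n `shiftR` 12) .&. 0x3F)
                      , 0x80 .|. ((n `shiftR` 6) .&. 0x3F)
                      , 0x80 .|. (n .&. 0x3F) ]

-- The byte-at-a-time view Alex drives: pull a Char off the Text,
-- re-encode it to UTF-8, and hand the bytes out one by one.
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (AlexInput (b:bs) t) = Just (b, AlexInput bs t)
alexGetByte (AlexInput []     t) = do
  (c, t') <- T.uncons t
  case utf8Encode c of
    (b:bs) -> Just (b, AlexInput bs t')
    []     -> Nothing  -- unreachable: every Char encodes to at least one byte
```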
So I'm wondering if there's a reason that Alex needs to work over bytes and not `Char`s, or if that was perhaps done to support lexing `ByteString`s directly, without having to first unpack the data into `String`s or decode it into UTF-16 `Text`s (with text <2.0), which would be unnecessary overhead either way.
If Alex has to work over bytes for internal reasons, I think it would be a good idea to implement new wrappers for the UTF-8 `Text` types, since I'd imagine that would be a pretty common use case.
Otherwise, would it be possible to expose an `alexGetChar`-based interface that simply skips the byte-level UTF-8 handling in Alex's internal logic, which would be more ergonomic and efficient for `Char`- and UTF-8-based input types?
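Purely as an illustration of the shape I'm imagining (this is hypothetical, not an existing Alex entry point):

```haskell
import qualified Data.Text as T

-- Hypothetical: if Alex could consume Chars directly, a Text-backed
-- input state would just be the remaining Text, with no byte buffering.
type AlexInput = T.Text

alexGetChar :: AlexInput -> Maybe (Char, AlexInput)
alexGetChar = T.uncons  -- no manual UTF-8 encode/decode step in user code
```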
@pnotequalnp: I haven't looked in detail, but I think Alex generates arrays indexed by byte values (256 entries) to make automaton transitions fast. That wouldn't work with Unicode characters, because of the sheer size such arrays would have to be.
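Back-of-the-envelope (my own numbers, assuming a dense per-state transition row of machine-word entries, not anything read out of Alex's source): a byte-indexed row is 256 × 8 B = 2 KiB, whereas a row indexed by every Unicode code point would be 1,114,112 × 8 B ≈ 8.5 MiB per DFA state.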