ISO-8859-1 Encoding: Problems and Validation #3
-
@thoni56, I wanted to discuss the various problems relating to ensuring that ALAN files (sources, transcripts, etc.) are properly encoded in ISO-8859-1. This discussion could have been opened elsewhere (e.g. on the Alan Docs repo), but I chose this project because it contains multi-language ALAN files, so it's a better real-case test bed for the issue at hand.

Common ISO Problems

From my experience, working with ISO-8859-1 files today is quite problematic. Modern editors offer poor support for ISO encodings, and even when they do allow associating a file extension with the encoding, they usually break it (reverting to UTF-8) during cut-and-paste operations, since they rarely convert the clipboard contents. EClint does a very poor job at validating ISO-8859-1; in fact I had to disable it in this project because it would report perfectly encoded files as invalid — basically, any non-English (non-ASCII) file is seen as broken. Git is not particularly good at handling ISO encodings either, and doesn't offer any specific settings or features for them. It has been my experience, with various repos using legacy ISO files, that contributors tend to break the original encoding quite often without realizing it. So we do need a way to validate ISO files via some script that can be run locally (before committing) as well as on GitHub (via Travis, etc.) to protect the repository contents from malformed ALAN files.

Validating ISO-8859-1?

From what I understand so far, there is no bulletproof way to determine whether a file is actually encoded in ISO-8859-1. Tools like iconv do a better job at validating UTF-8 files; I couldn't find any robust tool for ISO validation. But I guess that in our context we should change the question to: what should a valid ALAN ISO-8859-1 file look like? Maybe we could come up with our own custom validator, based on expectations of which characters should be found in an ALAN file and which shouldn't be there (e.g. most control characters).

ISO vs Mac

I'm assuming all ALAN sources in this project would be in ISO-8859-1, but maybe some new localization of the library might rely on other ISO-8859 encodings. I still haven't understood what ALAN means by the … Would ALAN work with any ISO encoding, i.e. as long as it's a single-byte encoding it's fine with it? Would that work only on the terminal, or also in the graphic interpreter? Any ideas and clarifications on this topic?
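A quick demonstration of why the question is hard: every byte value from 0x00 to 0xFF maps to some ISO-8859-1 code point, so a Latin-1 decoder will accept literally anything, while UTF-8 has structural rules that can actually be violated. A minimal sketch in Python (used here purely for illustration, it's not part of any Alan tooling):

```python
# Every possible byte decodes as Latin-1, so "validation" can never fail;
# UTF-8, by contrast, rejects malformed sequences.
all_bytes = bytes(range(256))

all_bytes.decode("iso8859-1")      # never raises: any byte sequence "is" ISO-8859-1

try:
    all_bytes.decode("utf-8")      # raises: bytes >= 0x80 must follow UTF-8's rules
except UnicodeDecodeError as err:
    print("UTF-8 rejected the input:", err)
```

This is why the only practical check seems to be the one suggested above: define which characters we expect in an ALAN file and flag everything else.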
-
Having started on the Swedish i18n I'm starting to feel the pain ;-)

Well, I have thought about this for a long time, and I think we already have an issue on that in the alan project (alan-if/alan#12), of which I'm sure you are aware. I've always considered this a huge compatibility problem, because of the multiple layers of conversions. In the early days of Alan there were a number of different character encoding "principles" for non-ASCII characters, as well as the irritating line-ending conventions.

Since ISO8859-1 handles many western languages, including Swedish, that was the choice for the "native" coding in the Acode. But both the compiler and the interpreter were made to also handle "native" MS-DOS files (yes, before Windows), and files created on early versions of Apple's OS (called "Mac OS" ;-). They all had different line endings and encoded non-ASCII characters differently. So the Alan compiler has three different scanners/lexers to correctly recognize what is a line ending and what are legal characters in words, identifiers etc. The switches … And, no, the …

The big problem is that, since Acode files are supposed to be completely portable, the Alan compiler must know of, and be able to convert, any characters coming in to ISO8859-1. It also does its best to convert between upper and lower case to normalize how strings are stored. Then the reverse happens in the interpreter(s), so it too needs to understand how to convert from ISO8859-1 to its own locale. Since the same source code is used for all interpreters, that means the code is there for all combinations, too.

I suppose that means that if you compile a game in a locale that has some other flavour of ISO, using a particular code point to encode a different character (glyph), that "glyph" will be interpreted as its ISO8859-1 counterpart, and if that counterpart was acceptable as a character in a player word, so would the "glyph" be. This might be correct, or completely wrong. If the ISO8859-1 counterpart was considered an upper case character, it would also be converted into the "glyph" represented by the code point of the corresponding ISO8859-1 lower case character... Yeah, I guess you get it... And finally, if that conversion is reversible (can it not be?), then the interpreter would convert all that back, probably showing the same "glyph" that was initially intended. But all this is from memory, and without any verification.

What is very clear is that Alan needs to move away from ISO8859-1. It would make everything much easier and we could clean out a lot of garbage from the source code. And if the choice were made today, UTF-8 would be the natural one. I don't want to add a compatibility check all over the place for this; it would be a lot of work, and something that should only be there for a time. So that means it would be a breaking change, and not a small one, rendering most (or only many?) games unplayable with newer versions. (But that is probably a discussion for alan-if/alan#12...)

I think there are two separate issues here. One is the actual encoding; for that we can make precise trials and experiments to understand exactly what works. The other one is harder, and that is "maintaining files in their intended encoding". Here there is only one answer: we shouldn't need to.

I hope I cleared something up, or at least gave what information I have in my head.
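To make the reinterpretation scenario above concrete, here is a small Python sketch (Python is used only for illustration, it is not part of the Alan toolchain, and the compiler/interpreter behaviour it mimics is the unverified assumption being discussed): the same byte is a different glyph under different ISO-8859 flavours, and Latin-1 case conversion then operates on the Latin-1 interpretation.

```python
# The byte 0xE5 is 'å' in ISO-8859-1 but 'ĺ' in ISO-8859-2 (Latin-2).
raw = bytes([0xE5])
print(raw.decode("iso8859-1"))   # å  -- what the compiler assumes
print(raw.decode("iso8859-2"))   # ĺ  -- what a Latin-2 author actually meant

# If the compiler case-converts under the ISO-8859-1 assumption...
upper = raw.decode("iso8859-1").upper().encode("iso8859-1")   # b'\xc5' ('Å')

# ...an interpreter running in a Latin-2 locale would then display:
print(upper.decode("iso8859-2"))                              # Ĺ
```

So as long as every layer applies the same (wrong but consistent) mapping, the round trip may well come out looking right; it breaks down as soon as one layer interprets the bytes differently.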
-
Idea for a Basic ISO Validator

For a while now I've been thinking of creating a simple tool that would scan ALAN sources in search of unexpected characters that shouldn't be there. I have a feeling that the heinous whitespace bug might be due to some corrupted or out-of-place char(s) in the sources (either of the StdLib or the test adventure), because it was also present in the Italian StdLib but then disappeared while I was editing the sources. So, probably during my translation work I simply deleted a whole code block and rewrote it, and the problem went away (and never came back). When I first ported the StdLib sources to Git I faced a lot of ISO-related problems, and had to do quite some manual fixing work to pass the basic ISO validation test, but some dodgy characters might still be left floating around as a result of the various conversions to and from different encodings. The bug could be due to the presence of a single character out of place — e.g. …

In any case, what I had in mind for this tool was to carry out a series of checks, scanning the files byte by byte (as binary data, rather than as strings, one line at a time):
In theory, this tool could eventually be made context-aware based on the file extension — i.e. … From what I've read around, since there is no real way to distinguish between the different single-byte encodings, which share many overlapping chars, and there's no way to know what the special chars are intended to represent, a custom tool would need this sort of approach, based on expectations of how the specific encoding is generally used. It might be a shot in the dark, but if this could help me solve the whitespace bug I'd be quite happy with just that, since it's a bug that has been haunting me for a number of years, interfering with the test-suite automation.
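As a rough sketch of the byte-level scan described above (a minimal illustration only: the allowed-byte set, the script layout and the reporting format are all assumptions, not a spec), something along these lines could flag control characters and other unexpected bytes, and its exit status could be used both in a pre-commit hook and in CI (Travis, etc.):

```python
#!/usr/bin/env python3
"""Minimal sketch of a byte-level ISO-8859-1 sanity check for ALAN sources.

Assumed "expected" bytes: printable ASCII, tab, CR/LF, and the printable
Latin-1 range 0xA0-0xFF.  Anything else (e.g. stray C0/C1 control bytes)
is reported with its line number and byte offset.
"""
import sys

ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D} | set(range(0xA0, 0x100))

def check_file(path):
    problems = []
    data = open(path, "rb").read()      # scan raw bytes, not decoded strings
    line = 1
    for offset, byte in enumerate(data):
        if byte not in ALLOWED:
            problems.append((line, offset, byte))
        if byte == 0x0A:                # count lines only for reporting
            line += 1
    return problems

if __name__ == "__main__":
    status = 0
    for path in sys.argv[1:]:
        for line, offset, byte in check_file(path):
            print(f"{path}: line {line}, offset {offset}: unexpected byte 0x{byte:02X}")
            status = 1
    sys.exit(status)
```

Additional checks (e.g. treating the 0x80-0x9F range specially, or warning about characters that are legal Latin-1 but unlikely in an ALAN source) could then be layered on top, and the allowed set could be tweaked per file extension to get the context-aware behaviour mentioned above.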
-
Non-Obvious UTF-8 Benefits

@thoni56, I haven't mentioned this so far, but as a result of the new UTF-8 feature we are now able to use many punctuation characters that we couldn't before, e.g. curly quotes. Luckily, we can now inject all these special chars into UTF-8 sources. I don't know why the UTF-8 conversion works, but I guess it's handled by the OS layer rather than the editor. In any case, being able to use smart quotes and dashes in adventures is going to make them look nicer. The only caveat is that some of these chars might not be correctly represented in the terminal/CMD, since few monospace fonts cover them (surely not the default OS fonts). But you can find some good FOSS monospace fonts that cover lots of Unicode glyphs, including Russian, math symbols, etc.
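For reference, the punctuation gain is easy to verify: curly quotes, en/em dashes and the ellipsis character have no ISO-8859-1 code point, so they could never appear in a Latin-1 source. A quick Python check (illustrative only, not part of any Alan tooling):

```python
# Characters outside ISO-8859-1 fail to encode as Latin-1; accented
# Western-European letters such as é or ñ encode fine.
samples = ["\u2018", "\u2019", "\u201C", "\u201D",   # curly quotes
           "\u2013", "\u2014", "\u2026",             # en/em dash, ellipsis
           "\u00E9", "\u00F1"]                       # é, ñ (in Latin-1)
for ch in samples:
    try:
        ch.encode("iso8859-1")
        print(f"U+{ord(ch):04X} {ch}: available in ISO-8859-1")
    except UnicodeEncodeError:
        print(f"U+{ord(ch):04X} {ch}: UTF-8 sources only")
```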