ISO-8859-1 Encoding: Problems and Validation #3
-
@thoni56, I wanted to discuss the various problems relating to ensuring that ALAN files (sources, transcripts, etc.) are properly encoded in ISO-8859-1. This discussion could have been opened elsewhere (e.g. on the Alan Docs repo), but I chose this project because it contains multi-language ALAN files, so it's a better real-case test bed for the issue at hand.

Common ISO Problems

From my experience, working with ISO-8859-1 files today is quite problematic. Modern editors offer poor support for ISO encodings, and even when they do allow associating a file extension with the encoding, they usually break it (reverting to UTF-8) during cut-and-paste operations, since they rarely convert the clipboard contents. EClint does a very poor job at validating ISO-8859-1; in fact I had to disable it in this project because it would report perfectly encoded files as invalid — basically, any non-English (non-ASCII) file is seen as broken. Git is not particularly good at handling ISO encodings either, and doesn't offer any specific settings or features for them. It has been my experience, with various repos using legacy ISO files, that contributors tend to break the original encoding quite often without realizing it. So we do need a way to validate ISO files via some script that can be run locally (before committing) as well as on GitHub (via Travis, etc.) to protect the repository contents from malformed ALAN files.

Validating ISO-8859-1?

From what I understand so far, there is no bulletproof way to determine whether a file is actually encoded in ISO-8859-1. Tools like iconv do a better job at validating UTF-8 files; I couldn't find any robust tool for ISO validation. But I guess that in our context we should change the question to: what should a valid ALAN ISO-8859-1 file look like? Maybe we could come up with our own custom validator, based on expectations of which characters should be found in an ALAN file and which shouldn't be there (e.g. most control characters).

ISO vs Mac

I'm assuming all ALAN sources in this project would be in ISO-8859-1, but maybe some new localization of the library might rely on other ISO-8859 encodings. I still haven't understood what ALAN means by the … Would ALAN work with any ISO encoding, i.e. as long as it's a single-byte encoding it's fine with it? Would that work only on the terminal, or also in the graphic interpreter? Any ideas and clarifications on this topic?
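A quick demonstration of why the question is hard: every byte value from 0x00 to 0xFF maps to some ISO-8859-1 code point, so a Latin-1 decoder will accept literally anything, while UTF-8 has structural rules that can actually be violated. A minimal sketch in Python (used here purely for illustration, it's not part of any Alan tooling):

```python
# Every possible byte decodes as Latin-1, so "validation" can never fail;
# UTF-8, by contrast, rejects malformed sequences.
all_bytes = bytes(range(256))

all_bytes.decode("iso8859-1")      # never raises: any byte sequence "is" ISO-8859-1

try:
    all_bytes.decode("utf-8")      # raises: bytes >= 0x80 must follow UTF-8's rules
except UnicodeDecodeError as err:
    print("UTF-8 rejected the input:", err)
```

This is why the only practical check seems to be the one suggested above: define which characters we expect in an ALAN file and flag everything else.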
-
Having started on the Swedish i18n I'm starting to feel the pain ;-)

Well, I have thought about this for a long time, and I think we already have an issue on that in the alan project (alan-if/alan#12), of which I'm sure you are aware. I've always considered this a huge compatibility problem, because of the multiple layers of conversions. In the early days of Alan there were a number of different character encoding "principles" for non-ASCII characters, as well as the irritating line-ending conventions.

Since ISO8859-1 handles many western languages, including Swedish, that was the choice for the "native" coding in the Acode. But both the compiler and the interpreter were made to also handle "native" MS-DOS files (yes, before Windows), and files created on early versions of Apple's OS (called "Mac OS" ;-). They all had different line endings and encoded non-ASCII characters differently. So the Alan compiler has three different scanners/lexers to correctly recognize what is a line ending and what are legal characters in words, identifiers etc. The switches … And, no, the …

The big problem is that, since Acode files are supposed to be completely portable, the Alan compiler must know of, and be able to convert, any characters coming in to ISO8859-1. It also does its best to convert between upper and lower case to normalize how strings are stored. Then the reverse happens in the interpreter(s), so it too needs to understand how to convert from ISO8859-1 to its own locale. Since the same source code is used for all interpreters, that means the code is there for all combinations, too.

I suppose that means that if you compile a game in a locale that has some other flavour of ISO, using a particular code point to encode a different character (glyph), that "glyph" will be interpreted as its ISO8859-1 counterpart, and if that counterpart was acceptable as a character in a player word, so would the "glyph" be. This might be correct, or completely wrong. If the ISO8859-1 counterpart was considered an upper case character, it would also be converted into the "glyph" represented by the code point of the corresponding ISO8859-1 lower case character... Yeah, I guess you get it... And finally, if that conversion is reversible (can it not be?), then the interpreter would convert all that back, probably showing the same "glyph" that was initially intended. But all this is from memory, and without any verification.

What is very clear is that Alan needs to move away from ISO8859-1. It would make everything much easier and we could clean out a lot of garbage from the source code. And if the choice were made today, UTF-8 would be the natural one. I don't want to add a compatibility check all over the place for this; it would be a lot of work, and something that should only be there for a time. So that means it would be a breaking change, and not a small one, rendering most (or only many?) games unplayable with newer versions. (But that is probably a discussion for alan-if/alan#12...)

I think there are two separate issues here. One is the actual encoding; for that we can make precise trials and experiments to understand exactly what works. The other one is harder, and that is "maintaining files in their intended encoding". Here there is only one answer: we shouldn't need to.

I hope I cleared something up, or at least gave what information I have in my head.
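To make the reinterpretation scenario above concrete, here is a small Python sketch (Python is used only for illustration, it is not part of the Alan toolchain, and the compiler/interpreter behaviour it mimics is the unverified assumption being discussed): the same byte is a different glyph under different ISO-8859 flavours, and Latin-1 case conversion then operates on the Latin-1 interpretation.

```python
# The byte 0xE5 is 'å' in ISO-8859-1 but 'ĺ' in ISO-8859-2 (Latin-2).
raw = bytes([0xE5])
print(raw.decode("iso8859-1"))   # å  -- what the compiler assumes
print(raw.decode("iso8859-2"))   # ĺ  -- what a Latin-2 author actually meant

# If the compiler case-converts under the ISO-8859-1 assumption...
upper = raw.decode("iso8859-1").upper().encode("iso8859-1")   # b'\xc5' ('Å')

# ...an interpreter running in a Latin-2 locale would then display:
print(upper.decode("iso8859-2"))                              # Ĺ
```

So as long as every layer applies the same (wrong but consistent) mapping, the round trip may well come out looking right; it breaks down as soon as one layer interprets the bytes differently.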
-
Idea for a Basic ISO Validator

For a while now I've been thinking of creating a simple tool that would scan ALAN sources in search of unexpected characters that shouldn't be there. I have a feeling that the heinous whitespace bug might be due to some corrupted or out-of-place char(s) in the sources (either of the StdLib or the test adventure), because it was also present in the Italian StdLib but then disappeared while I was editing the sources. So, probably during my translation work I simply deleted a whole code block and rewrote it, and the problem went away (and never came back). When I first ported the StdLib sources to Git I faced a lot of ISO-related problems, and had to do quite some manual fixing work to pass the basic ISO validation test, but some dodgy characters might still be left floating around as a result of the various conversions to and from different encodings. The bug could be due to the presence of a single character out of place — e.g. …

In any case, what I had in mind for this tool was to carry out a series of checks, scanning the files byte by byte (as binary data, rather than as strings, one line at a time):
In theory, this tool could eventually be made context-aware based on the file extension — i.e. … From what I've read around, since there is no real way to distinguish between the different single-byte encodings, which share many overlapping chars, and there's no way to know what the special chars are intended to represent, a custom tool would need this sort of approach, based on expectations of how the specific encoding is generally used. It might be a shot in the dark, but if this could help me solve the whitespace bug I'd be quite happy with just that, since it's a bug that has been haunting me for a number of years, interfering with the test-suite automation.
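As a rough sketch of the byte-level scan described above (a minimal illustration only: the allowed-byte set, the script layout and the reporting format are all assumptions, not a spec), something along these lines could flag control characters and other unexpected bytes, and its exit status could be used both in a pre-commit hook and in CI (Travis, etc.):

```python
#!/usr/bin/env python3
"""Minimal sketch of a byte-level ISO-8859-1 sanity check for ALAN sources.

Assumed "expected" bytes: printable ASCII, tab, CR/LF, and the printable
Latin-1 range 0xA0-0xFF.  Anything else (e.g. stray C0/C1 control bytes)
is reported with its line number and byte offset.
"""
import sys

ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D} | set(range(0xA0, 0x100))

def check_file(path):
    problems = []
    data = open(path, "rb").read()      # scan raw bytes, not decoded strings
    line = 1
    for offset, byte in enumerate(data):
        if byte not in ALLOWED:
            problems.append((line, offset, byte))
        if byte == 0x0A:                # count lines only for reporting
            line += 1
    return problems

if __name__ == "__main__":
    status = 0
    for path in sys.argv[1:]:
        for line, offset, byte in check_file(path):
            print(f"{path}: line {line}, offset {offset}: unexpected byte 0x{byte:02X}")
            status = 1
    sys.exit(status)
```

Additional checks (e.g. treating the 0x80-0x9F range specially, or warning about characters that are legal Latin-1 but unlikely in an ALAN source) could then be layered on top, and the allowed set could be tweaked per file extension to get the context-aware behaviour mentioned above.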
-
Non-Obvious UTF-8 Benefits

@thoni56, I haven't mentioned this so far, but as a result of the new UTF-8 feature we are now able to use many punctuation characters that we couldn't before, e.g. curly quotes. Luckily, we can now inject all these special chars into UTF-8 sources. I don't know why the UTF-8 conversion works, but I guess it's handled by the OS layer rather than the editor. In any case, being able to use smart quotes and dashes in adventures is going to make them look nicer. The only caveat is that some of these chars might not be correctly represented in the terminal/CMD, since few monospace fonts cover them (surely not the default OS fonts). But you can find some good FOSS monospace fonts that cover lots of Unicode glyphs, including Russian, math symbols, etc.
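For reference, the punctuation gain is easy to verify: curly quotes, en/em dashes and the ellipsis character have no ISO-8859-1 code point, so they could never appear in a Latin-1 source. A quick Python check (illustrative only, not part of any Alan tooling):

```python
# Characters outside ISO-8859-1 fail to encode as Latin-1; accented
# Western-European letters such as é or ñ encode fine.
samples = ["\u2018", "\u2019", "\u201C", "\u201D",   # curly quotes
           "\u2013", "\u2014", "\u2026",             # en/em dash, ellipsis
           "\u00E9", "\u00F1"]                       # é, ñ (in Latin-1)
for ch in samples:
    try:
        ch.encode("iso8859-1")
        print(f"U+{ord(ch):04X} {ch}: available in ISO-8859-1")
    except UnicodeEncodeError:
        print(f"U+{ord(ch):04X} {ch}: UTF-8 sources only")
```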