Switching to UTF-8 #29

tajmone · 2021-09-10T01:30:02Z

tajmone
Sep 10, 2021
Maintainer

@thoni56, I've finally managed to create a Ruby method that can invoke ARun and generate a transcript with the same filename as the solutions file (instead of the storyfile) by redirecting ARun's output to a file.

The method strips the BOM from the solution file before feeding it to ARun, so the previous problem of the BOM leaking into the generated transcript is solved, and without extra file operations. Basically, Ruby abhors the BOM, so it automatically strips any BOM at read time (it doesn't even natively support writing a BOM to file), so I didn't really have to do much: just read the solution and pipe it to ARun. The rest is done with the invocation parameters and the redirection.
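The method itself isn't shown in this thread, so the following is only a minimal sketch of the approach described above: read the solution file with any BOM stripped, pipe it to the interpreter's standard input, and save the captured output under the solution's base name. The name `run_with_transcript`, the `.a3t` transcript extension and the explicit `bom|utf-8` read mode are assumptions made for illustration. The actual ARun switches are not listed here, so the whole command is passed in as a parameter.

```ruby
require 'open3'

# Hypothetical helper (not the project's actual method): feed a solution
# file to an interpreter command and save whatever the command prints as
# "<solution basename><transcript_ext>" next to the solution file.
def run_with_transcript(command, solution_path, transcript_ext: '.a3t')
  # 'bom|utf-8' asks Ruby to drop a leading UTF-8 BOM, if present, on read.
  player_input = File.read(solution_path, encoding: 'bom|utf-8')

  # Pipe the commands to the child's stdin and capture its stdout.
  output, status = Open3.capture2(*command, stdin_data: player_input)

  # Name the transcript after the solution file, not after the storyfile.
  transcript_path = File.join(
    File.dirname(solution_path),
    File.basename(solution_path, '.*') + transcript_ext
  )
  File.write(transcript_path, output)

  [transcript_path, status.success?]
end
```

A call such as `run_with_transcript(['arun', 'game.a3c'], 'walkthrough.a3s')` would then be expected to produce `walkthrough.a3t`. The file names are hypothetical, and any ARun options would go in the command array.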

Now, I've updated the project Rakefile locally to use this new method, which makes it better and even slimmer. The problem is that I had to implement the ISO version of the method, since this project is still using ISO-encoded files.

I was thinking that we should migrate all ALAN files to UTF-8, so the project will be ready for the upcoming ALAN release, which supports UTF-8 — surely ALAN Beta8 will be released before any of these libraries reaches v1.0.0, and switching to UTF-8 will allow us to start using the same Rake modules in other repositories too (starting with the StdLib repo, now that I finally solved the transcripts problem).

I've also moved the various ALAN and AsciiDoc helpers to external Ruby files, so they can be shared with other repos, independently of their project-specific Rakefile (updating these modules will only require copying them to each repo, which is not much work really).

Are you OK with moving on to using UTF-8 here? If you give me the OK, I'll update the repository configurations and re-encode all the ALAN sources and solutions in the repository, but then we'll have to do the same in our dev branches for the Italian and Swedish libraries too (since after rebasing on main they would stop working, most likely).

thoni56 · 2021-09-10T09:01:33Z

thoni56
Sep 10, 2021
Maintainer

I think moving to UTF-8 for this repo is a good move already now. Beta8 is not far away.

1 reply

tajmone Sep 10, 2021
Maintainer Author

Beta8 is not far away.

Great!!

I'm so satisfied with the Rake/Ruby solution to the BOM injection problem when redirecting ARun output to a custom-named transcript (see alan-if/alan#32): it's a simple and elegant solution that doesn't add any overhead to the operation.

Rake has turned out to be a real lifesaver for dealing with the complexities of our ALAN repositories and the need to keep all build systems cross-platform. The fact that we can freely mix Ruby code in Rakefiles is a very powerful feature of Rake, which (IMO) makes it one of the best build automation tools.

Although I'm still mostly improvising when it comes to coding in Ruby, I'm managing to obtain all the desired outcomes so far.

Up to a few months ago, Ruby code looked like hieroglyphs to me, due to all its odd use of punctuation symbols (=>, symbol: and :symbol, method? and method!) and the fact that when not required most punctuation is omitted, which makes Ruby sources look inconsistent to someone who doesn't know the language. As it turns out, these are the very traits that make Ruby so friendly and un-verbose, but it does take some adjustment time when learning the lang.
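As a small illustration of the punctuation mentioned above, here is a sketch with made-up names and values (`punctuation_tour` is hypothetical, not code from the project):

```ruby
# Illustrative only: the method name and all values are made up.
def punctuation_tour
  # "=>" (hash rocket) and "key:" are two spellings of the same Hash
  # literal; both use the Symbols :encoding and :bom as keys.
  rocket    = { :encoding => 'UTF-8', :bom => true }
  shorthand = { encoding: 'UTF-8', bom: true }

  name = 'alan'.dup           # dup gives a String that is safe to modify
  was_empty = name.empty?     # trailing "?": a predicate returning true/false
  upper = name.upcase         # no "!": returns a new String
  name.capitalize!            # trailing "!": modifies name in place

  # Parentheses can be dropped when a call is unambiguous.
  joined = [upper, name].join ' '

  { same_hash: rocket == shorthand, was_empty: was_empty, joined: joined }
end
```

The "?" and "!" suffixes are naming conventions rather than syntax, which is part of why they look odd at first.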

I'm striving to learn Ruby the best I can; unfortunately with Ruby v3 having been released fairly recently, most online tutorials and commercial books still refer to Ruby v2 (or even v1), which can be misleading since they miss out on some of the new cool features of v3, plus some changes to Ruby's core and Std-lib. So, lacking tutorials specifically written for Ruby 3, I often end up having to sift through the official Ruby API documentation, which is not quite as easy as a step by step tutorial at this stage.

But definitely, both Rake and Ruby are precious tools for everyday tasks; and since we already depend on Ruby so much for Asciidoctor, it makes sense to dig deeper into the language. Ruby was designed to be easy to learn, which still holds true; it's just that its core and Standard Library have grown quite a lot over time, so learning all its native features can take a while. Ruby 3 introduces some really amazing new features (albeit some still experimental), especially for concurrency (easy and safe).

tajmone · 2021-09-10T16:58:04Z

tajmone
Sep 10, 2021
Maintainer Author

UTF-8 Conversion: Done!

@thoni56, the conversion worked out really nicely. Now both the English and Spanish libraries (and any other ALAN files) are in UTF-8.

For the ISO to UTF-8-BOM conversion I used PowerGREP 5, a commercial GUI tool which I purchased some time ago and which, as I just discovered, can also handle encoding conversions and validations.

I couldn't find any free tool that can add a BOM when converting (iconv doesn't support BOM in conversions to UTF). If you want me to tweak your Swedish branch, I can convert all its ALAN files to UTF-8-BOM in a breeze, since the tool has already been set up for ALAN-specific operations. In that case, I'd be adding a new commit to your dev branch (and could also check if I can manage to delete the alan_sv/ folder from main branch, without any files being lost in your dev branch when you rebase).
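For reference, the same kind of conversion could probably also be scripted in Ruby. Ruby has no write mode that emits a BOM, but the three BOM bytes can be prepended by hand. This is only a minimal sketch, assuming the sources really are ISO-8859-1; `iso_to_utf8_bom` is a hypothetical name and not part of the project's Rake modules.

```ruby
# The UTF-8 byte order mark as raw bytes.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: re-encode one ISO-8859-1 file as UTF-8 with a BOM.
# Returns false (and writes nothing) if the file already starts with a BOM.
def iso_to_utf8_bom(src_path, dst_path = src_path)
  raw = File.binread(src_path)               # raw bytes, line endings untouched
  return false if raw.start_with?(UTF8_BOM)  # looks converted already

  # Label the bytes as ISO-8859-1, then transcode them to UTF-8.
  utf8 = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

  # Write BOM + text as bytes so nothing is translated on the way out.
  File.binwrite(dst_path, UTF8_BOM + utf8.b)
  true
end
```

Working on raw bytes keeps CRLF or LF line endings exactly as they were, and the BOM check makes running it twice on the same file harmless.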

Let me know.

PS: This change from ISO to UTF-8 sources is a monumental step in ALAN history, and it's so nice to take part in it.

5 replies

thoni56 Sep 12, 2021
Maintainer

I think all Alan files in alan_sv are already UTF-8. At least, I think I did that when I imported them.

I had forgotten to rename the directory, but that's done now. I've also rebased on main, so dev_alan-sv is on par with main except for the directory rename.

I'd be adding a new commit to your dev branch (and could also check if I can manage to delete the alan_sv/ folder from main branch, without any files being lost in your dev branch when you rebase).

I'm not sure I follow. Did we decide to only keep the localization work on separate branches? I still have alan_es and alan_en as well as alan_sv in main. Or do we consider es and en "pre-released" so that's why we have them on main?

I noticed that there's no alan_it but a dev_alan-it branch. Is this what you mean? Then what are our criteria for merging them to main?

So, I'm just saying I don't follow your thinking here. Please explain. I think you mentioned something about this here, so maybe we should be a bit more specific:

Until dev work has reached the level of the main EN implementation it is not to be shared on main

Maybe? But then, as you are also progressing with the dev-en, EN will be a moving target, and any non-"completed", non-EN library will prevent EN from moving to the next level. (Just making clear that I'm not fully clear about how exactly to do this right now, and perhaps we should just wing it until we have a more stable situation.)

tajmone Sep 12, 2021
Maintainer Author

Or do we consider es and en "pre-released" so that's why we have them on main?

Yes, that was the idea, since these are based on pre-existing libraries.

I noticed that there's no alan_it but a dev_alan-it branch. Is this what you mean? Then what are our criteria for merging them to main?

No, there actually is a dev_alan-it branch, but it probably wasn't there when you looked this morning, because I was renaming it locally and pushing the renamed version to the repo. After I deleted the old branch, Windows became unresponsive, and after half an hour I had to forcefully shut down the PC, which then took ages to reboot. But now the dev_alan-it branch is back up.

I just thought that it would make more sense to develop the Italian and Swedish libraries on dev branches, until we have a working draft, then we can merge them in main. This for two reasons:

To avoid cluttering the main branch with a lot of commits that don't result in a functioning library version.
Because I'm planning to invoke Rake on Travis CI, so we need to ensure that main always passes the build. In the Italian dev branch I already have some test files, but they don't always pass the build, e.g. if I'm carrying out big changes, and at this stage I'd rather avoid having to worry about any changes having to pass a build.

Until dev work has reached the level of the main EN implementation it is not to be shared on main

Yes, that was the basic idea — i.e. that new translation should at least meet the target of the original Lib 0.6, but not necessarily all the new changes. I.e. the original Lib 0.6 should be the base reference line — at least, these are the files that I've started to work on, and on which I'm currently focusing.

But then, as you are also progressing with the dev-en, EN will be a moving target, and any non-"completed", non-EN library will prevent EN from moving to the next level. (Just making clear that I'm not fully clear about how exactly to do this right now, and perhaps we should just wing it until we have a more stable situation.)

How's that? The English library is the reference library. E.g. I've just updated the Spanish library to mirror the META VERB changes from the English library.

But then, the Spanish library is a different case from the Italian and Swedish ones, since that's a pre-existing library.

In any case, feel free to add the Swedish library to main if you prefer to — I was just assuming you moved its development to dev_alan-sv, since I didn't see any commits on main for the Swedish files.

As for the Italian library, I don't think it's ready to go into main yet: it needs a lot of polishing, there are still many untranslated messages, and I don't want to start writing its README and INDEX files right now, since it's all very unstable at the moment and I haven't really made up my mind yet regarding some naming choices (keep in mind that I'm also copying and pasting a lot of code from the StdLib Italian, which is worth recycling but needs to be revised, since it comes from a different library altogether).

thoni56 Sep 12, 2021
Maintainer

Or do we consider es and en "pre-released" so that's why we have them on main?

Yes, that was the idea, since these are based on pre-existing libraries.

Fair enough, seems reasonable.

I noticed that there's no alan_it but a dev_alan-it branch. Is this what you mean? Then what are our criteria for merging them to main?

No, there actually is a dev_alan-it branch, but it probably wasn't there when you looked this morning, because I was renaming it locally and pushing the renamed version to the repo

I meant "no alan_it directory, but there is a dev_alan-it branch" so, yeah, I think that is in line with the thinking you have (and with which I concur).

Because I'm planning to invoke Rake on Travis CI, so we need to ensure that main always passes the build. In the Italian dev branch I already have some test files, but they don't always pass the build, e.g. if I'm carrying out big changes, and at this stage I'd rather avoid having to worry about any changes having to pass a build.

That's a good reason.

Until dev work has reached the level of the main EN implementation it is not to be shared on main

Yes, that was the basic idea — i.e. that new translation should at least meet the target of the original Lib 0.6, but not necessarily all the new changes. I.e. the original Lib 0.6 should be the base reference line — at least, these are the files that I've started to work on, and on which I'm currently focusing.

But then, as you are also progressing with the dev-en, EN will be a moving target, and any non-"completed", non-EN library will prevent EN from moving to the next level. (Just making clear that I'm not fully clear about how exactly to do this right now, and perhaps we should just wing it until we have a more stable situation.)

How's that? The English library is the reference library. E.g. I've just updated the Spanish library to mirror the META VERB changes from the English library.

Well, you mention, like below, re-using things from StdLib, and I thought that was for the English library?

So, my thinking was this:

if the rule is that a translation of the library is not allowed on to main until it is up to par with the currently "published" English library
if there is such a translation
and we want to upgrade the English library to use some new feature, it will probably have to be published as a new version
then the translation library that was not up to par has to stay off of main until it has implemented that new feature too

And that didn't feel like a good situation. As always, I think that if we have something usable and valuable to someone, it should be available. But I think we can do that with clear versioning and dependencies. Also, I think it will be rather seldom that a user of a Hebrew library would actually look at the corresponding English version (if the docs for the Hebrew version were good enough). Then it would be out of curiosity rather than necessity.

Again, this is a situation that has not occurred yet, so let's just navigate as we progress with the translations.

But then, the Spanish library is a different case from the Italian and Swedish ones, since that's a pre-existing library.

In any case, feel free to add the Swedish library to main if you prefer to — I was just assuming you moved its development to dev_alan-sv, since I didn't see any commits on main for the Swedish files.

That was not the point of my question. I agree with you on the strategy here, so I'll keep the Swedish work on the branch. I just think we should avoid unnecessary work, especially if it might be a risk. I don't think anyone will be confused about an existing alan_sv directory. Except us. And maybe that is a reason in itself to remove that directory on main.

As for the Italian library, I don't think it's ready to go into main yet: it needs a lot of polishing, there are still many untranslated messages, and I don't want to start writing its README and INDEX files right now, since it's all very unstable at the moment and I haven't really made up my mind yet regarding some naming choices (keep in mind that I'm also copying and pasting a lot of code from the StdLib Italian, which is worth recycling but needs to be revised, since it comes from a different library altogether).

So, for the Italian translation, will the English library be a reference library in functionality only, with a completely different implementation? I'm thinking about how I should proceed with the Swedish translation. My first (partial) attempt is to just "translate" the corresponding files and features from the English. (This should probably be a completely different discussion, though. If you do answer, please move this part to a new discussion. It might be valuable in itself.)

tajmone Sep 12, 2021
Maintainer Author

If you do answer, please move this part to a new discussion. It might be valuable in itself.

Discussion moved to #31 (with a list summarizing the points discussed).

thoni56 Sep 12, 2021
Maintainer

Actually, I meant "an answer to proceeding with a mix of old and new and borrowed code", but that write-up was perfect and just what we needed ;-)
