Move abnormal code to be after running manually
hackerb9 committed Jul 29, 2024
1 parent 324cf0b commit 7a2a2b2
Showing 1 changed file with 72 additions and 69 deletions.
141 changes: 72 additions & 69 deletions README.md
@@ -265,8 +265,7 @@ m100-decomment --> m100-tokenize
| m100-sanity<br/>m100-jumps<br/>m100-decomment<br/>m100-crunch<br/>m100-tokenize | Saves even more RAM, removing whitespace | tokenize -c |
| m100-tokenize | Abnormal code is kept as is | |

<details><summary>Click to see more details about running these
programs manually</summary><p>
<details><summary>Click to see more details about running these programs manually</summary><p><ul>

### m100-tokenize synopsis

@@ -372,74 +371,9 @@ If you find this to be a problem, please file an issue as it is
potentially correctable using `open_memstream()`, but hackerb9 does
not see the need.

</details> <!-- Running manually -->
</ul></details> <!-- Running manually -->


## Machine compatibility

Across the eight Kyotronic-85 sisters, there are actually only two
different tokenized formats: "M100 BASIC" and "N82 BASIC". This
program (currently) works only for the former, not the latter.

The three Radio-Shack portables (Models 100, 102 and 200), the Kyocera
Kyotronic-85, and the Olivetti M10 all share the same tokenized BASIC.
That means a single tokenized BASIC file _might_ work for any of
those, presuming the program does not use CALL, PEEK, or POKE.
However, the NEC family of portables -- the PC-8201, PC-8201A, and
PC-8300 -- run N82 BASIC, which has a different tokenization format. A
tokenized N82 BASIC file cannot run on an M100 computer and vice
versa, even for programs which share the same ASCII BASIC source code.

### Checksum differences are not a compatibility problem

The .BA files generated by `tokenize` aim to be exactly the same, byte
for byte, as the output from tokenizing on a Model 100 using `LOAD`
and `SAVE`. There are some bytes, however, which can change and should
be ignored when testing if two tokenized programs are identical.

<details><summary>Click to read details on line number pointers...</summary><ul>

A peculiar artifact of the [`.BA` file format][fileformat] is that it
contains pointer locations offset by where the program happened to be
in memory when it was saved. The pointers in the file are _never_ used
as they are recalculated when the program is loaded into RAM.

To account for this variance when testing, the output of this program
is intended to be byte-for-byte identical to:

1. A Model 100
2. that has been freshly reset
3. with no other BASIC programs on it
4. running `LOAD "COM:88N1"` and `SAVE "FOO"` while a host computer sends the ASCII BASIC program over the serial port.

While the Tandy 102, Kyotronic-85, and M10 also appear to output files
identical to the Model 100, the Tandy 200 does not. The 200 has more
ROM than the other Model T computers, so it stores the first BASIC
program at a slightly different RAM location (0xA000 instead of
0x8000). This has no effect on compatibility between machines, but it
does change the pointer offset.

Since two `.BA` files can be the identical program despite having
different checksums, this project includes the `bacmp` program,
described below.

[fileformat]: http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file "Reverse engineered file format documentation"

</ul></details> <!-- Line number pointers -->

## Why Lex?

This program is written in
[Flex](https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/handouts/050%20Flex%20In%20A%20Nutshell.pdf),
a lexical analyzer, because it made implementation trivial. The
tokenizer itself, m100-tokenize, is mostly just a table of keywords
and the corresponding byte they should emit. Flex handles special
cases, like quoted strings and REMarks, easily.

The downside is that one must have flex installed to _modify_ the
tokenizer. Flex is _not_ necessary to compile on a machine as flex
generates portable C code. See the tokenize-cfiles.tar.gz in the
github release or run `make cfiles`.
<details><summary>Click for details on creating abnormal .BA files.</summary><ul>

## Abnormal code

@@ -507,6 +441,75 @@ To run this on a Model 100, download
[GOTO10.BA](https://github.com/hackerb9/tokenize/raw/main/degenerate/GOTO10.BA)
which was created using m100-tokenize.

</ul></details>


## Machine compatibility

Across the eight Kyotronic-85 sisters, there are actually only two
different tokenized formats: "M100 BASIC" and "N82 BASIC". This
program (currently) works only for the former, not the latter.

The three Radio-Shack portables (Models 100, 102 and 200), the Kyocera
Kyotronic-85, and the Olivetti M10 all share the same tokenized BASIC.
That means a single tokenized BASIC file _might_ work for any of
those, presuming the program does not use CALL, PEEK, or POKE.
However, the NEC family of portables -- the PC-8201, PC-8201A, and
PC-8300 -- run N82 BASIC. A tokenized N82 BASIC file cannot run on an
M100 computer and vice versa, even for programs which share the same
ASCII BASIC source code.

### Checksum differences are not a compatibility problem

The .BA files generated by `tokenize` aim to be exactly the same, byte
for byte, as the output from tokenizing on a Model 100 using `LOAD`
and `SAVE`. There are some bytes, however, which can change and should
be ignored when testing if two tokenized programs are identical.

<details><summary>Click to read details on line number pointers...</summary><ul>

A peculiar artifact of the [`.BA` file format][fileformat] is that it
contains pointer locations offset by where the program happened to be
in memory when it was saved. The pointers in the file are _never_ used
as they are recalculated when the program is loaded into RAM.

To account for this variance when testing, the output of this program
is intended to be byte-for-byte identical to:

1. A Model 100
2. that has been freshly reset
3. with no other BASIC programs on it
4. running `LOAD "COM:88N1"` and `SAVE "FOO"` while a host computer sends the ASCII BASIC program over the serial port.

While the Tandy 102, Kyotronic-85, and M10 also appear to output files
identical to the Model 100, the Tandy 200 does not. The 200 has more
ROM than the other Model T computers, so it stores the first BASIC
program at a slightly different RAM location (0xA000 instead of
0x8000). This has no effect on compatibility between machines, but it
does change the pointer offset.
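
To illustrate why the load address changes the bytes on disk, here is a minimal Python sketch that rewrites the next-line pointers of a tokenized file for a given starting address. It is not part of this project: the function name `rebase` is hypothetical, and the layout it assumes (a 2-byte little-endian pointer, a 2-byte line number, the tokenized text, then a NUL terminator, with no NUL bytes inside the tokenized text) is an assumption based on the file format described above. The exact starting addresses used in the usage example are likewise only illustrative.

```python
def rebase(data: bytes, base: int) -> bytes:
    """Recompute every next-line pointer as if the program were stored at `base`.

    Assumed line layout: pointer (2 bytes, little-endian), line number
    (2 bytes), tokenized text, NUL terminator.
    """
    out = bytearray()
    i = 0
    addr = base
    while i + 4 <= len(data):
        # Search for the terminator only after the 4-byte header,
        # because the line number itself may contain a zero byte.
        end = data.find(b"\x00", i + 4)
        if end == -1:
            break  # malformed tail: copied through unchanged below
        body = data[i + 2:end]          # line number plus tokenized text
        addr += 2 + len(body) + 1       # pointer + body + terminator
        out += (addr & 0xFFFF).to_bytes(2, "little") + body + b"\x00"
        i = end + 1
    out += data[i:]                     # keep any trailing bytes as they were
    return bytes(out)
```

For example, `rebase(ba, 0x8000)` and `rebase(ba, 0xA000)` differ only in the pointer bytes, while the line numbers and tokenized text stay the same.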

Since two `.BA` files can be the identical program despite having
different checksums, this project includes the `bacmp` program,
described below.

[fileformat]: http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file "Reverse engineered file format documentation"

</ul></details> <!-- Line number pointers -->
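
The idea of comparing two `.BA` files while ignoring the line pointers can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of `bacmp`: the names `ba_lines` and `same_program` are hypothetical, and the sketch assumes each line is a 2-byte pointer, a 2-byte little-endian line number, the tokenized text, and a NUL terminator, with no NUL bytes inside the tokenized text.

```python
def ba_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Split a tokenized file into (line_number, tokenized_text) pairs.

    The 2-byte pointer at the start of each line is skipped, since it
    depends on where the program sat in RAM when it was saved.
    """
    lines = []
    i = 0
    while i + 4 <= len(data):
        lineno = int.from_bytes(data[i + 2:i + 4], "little")
        # Look for the terminator after the header only; the line
        # number bytes may legitimately contain 0x00.
        end = data.find(b"\x00", i + 4)
        if end == -1:
            end = len(data)
        lines.append((lineno, data[i + 4:end]))
        i = end + 1
    return lines


def same_program(a: bytes, b: bytes) -> bool:
    """True if two tokenized files differ at most in their pointers."""
    return ba_lines(a) == ba_lines(b)
```

With this, a file saved on a Model 100 and the same program saved on a Tandy 200 would compare as equal even though their checksums differ.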

## Why Lex?

This program is written in
[Flex](https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/handouts/050%20Flex%20In%20A%20Nutshell.pdf),
a lexical analyzer generator, because it made implementation trivial. The
tokenizer itself, m100-tokenize, is mostly just a table of keywords
and the corresponding byte each one should emit. Flex handles special
cases, like quoted strings and REMarks, easily.
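
As a rough picture of the "table of keywords" approach, here is a small Python sketch of the same idea. It is not the project's Flex source: `KEYWORDS` and `tokenize_body` are hypothetical names, the table holds only three entries, and the byte values are illustrative placeholders rather than a verified copy of the real token table. Flex also applies longest-match rules that this sketch does not reproduce.

```python
# Illustrative subset only; the byte values are placeholders for this sketch.
KEYWORDS = {"PRINT": 0xA3, "GOTO": 0x88, "REM": 0x8E}


def tokenize_body(text: str) -> bytes:
    """Tokenize the text of one BASIC line (without its line number)."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == '"':
            # Quoted strings are copied verbatim, keywords inside included.
            end = text.find('"', i + 1)
            end = len(text) if end == -1 else end + 1
            out += text[i:end].encode("ascii")
            i = end
            continue
        for word, byte in KEYWORDS.items():
            if text[i:i + len(word)].upper() == word:
                out.append(byte)
                i += len(word)
                if word == "REM":
                    # The rest of a remark is copied without tokenizing.
                    out += text[i:].encode("ascii")
                    i = len(text)
                break
        else:
            # Not a keyword: emit the character unchanged.
            out += text[i].encode("ascii")
            i += 1
    return bytes(out)
```

The two special cases are the only places where the table lookup is bypassed, which is why a lexer generator keeps the real tokenizer short.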

The downside is that one must have flex installed to _modify_ the
tokenizer. Flex is _not_ necessary to compile the tokenizer, because
flex generates portable C code. See the tokenize-cfiles.tar.gz in the
GitHub release or run `make cfiles`.


## Miscellaneous notes
