Skip to content

Appendix D: Leximited Format

Dave Ackley edited this page Apr 5, 2017 · 7 revisions

Introduction

Leximited notation -- short for lexicographically delimited -- is designed primarily to format typically small positive numbers so they have two key properties:

  1. A number represented in leximited format is 'self-delimiting'. No trailing space or other non-numeric character is required. Two or more leximited numbers can be placed in immediate sequence without confusion. And,

  2. If numbers represented in leximited are sorted, the resulting order will be the same whether they are sorted numerically or alphabetically ('lexicographically').

For all numbers from 0 to 99999999, the corresponding leximited form is produced by prefixing the decimal representation of the number by the length of that representation.

Simple examples

So for example, the decimal representation of the number zero is "0", and that representation is one character long, so the leximited representation of zero is "10".

Analogously, the leximited representation of 99999999 is "899999999".

Counting up from zero in leximited format looks like this: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 210, 211, ...

OK, pop quiz:

(1) What is the leximited representation of the number two hundred and ten?

(2) Is 212 a single legal leximited representation? If so, what number does it represent? If not, why not?

No leading zeros

Note that (for all numbers other than 0) we remove any leading zeros before lex-encoding the number. So if we are asked to encode the decimal number 007, for example, we will remove the leading zeros and lex-encode 7, giving the result 17.

Larger numbers

Although leximited shines for encoding short numbers, for completeness it needs a way of representing larger numbers as well. The largest number we can lex-encode using the rules we have so far is one less than a hundred million: 99999999, which in leximited is 899999999.

Here is the rule: To lex-encode numbers longer than eight digits, we first print a '9', then we print the (recursive) leximited encoding of the length of the number, then we print the number.

So for example, to represent two billion (2000000000) in leximited, first we note that it is 10 digits in length. That's greater than eight so we print a '9', then we recursively print 10 in leximited format, and then we print the number itself.

And, what is the leximited encoding of 10? Well, the length of 10 is two, so the lex encoding is '210'.

So to summarize, we have

9 (flag for long number) + 210 (lex-encoding of number length) + 2000000000 (the number itself)

for a final result of 92102000000000 as the leximited representation of two billion.

OK, another pop quiz:

(1) Is 91812345678 a single legal leximited representation? If so, what number does it represent? If not, why not?

(2) Is 919012345678 a single legal leximited representation? If so, what number does it represent? If not, why not?

(3) (Bonus question) What is the smallest number whose leximited representation begins with '99'? (A precise description suffices; writing down the actual number not required.)

Leximited strings

When leximited is applied to arbitrary strings, as opposed to numbers, only property 1, above, applies: The result is self-delimiting because the leximited header encodes the length of the string that follows. But, the leximited representation of the string no longer sorts lexicographically -- instead, leximited strings sort first by length, and lexicographically only for strings of the same length.

The process of lex-encoding a string is the same as that for a number, except the string is never altered -- leading zeros, if any, are not removed, and neither are leading or trailing whitespace, or any other strange byte values that might occur in the string. All that matters is the string length in bytes.

For example the leximited encoding of the string 'foo' is 3foo, and the leximited encoding of 'Bh3!!!' is 6Bh3!!!. Also, the leximited encoding of

a man, a plan, a guy: eleets

is

9228a man, a plan, a guy: eleets

and finally, the leximited encoding of '' -- the empty string, the unique string whose length is zero -- is:

0

.