Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating representation #33

Open
dvanhorn opened this issue Dec 13, 2024 · 1 comment
Open

Updating representation #33

dvanhorn opened this issue Dec 13, 2024 · 1 comment

Comments

@dvanhorn
Copy link
Member

I went down a rabbit hole of exploring an alternative encoding of values. You can see it in the new-tags-wip branch. It has some nice properties, but ultimately I don't think it's right.

It takes advantage of canonical addresses in x86_64 in which the upper 16-bits of an address are always all 0 or 1. This branch assumes they're all 0, which seems reasonable for now. The lsb is used to indicate whether a value is an immediate or a pointer (1 for pointer). The next 15 bits can encode a tag and the remaining bits hold the address. To recover the address, shift to the right 16 bits. To encode an address as a value, shift to the left 16 bits and xor the tag. That has some nice properties:

  • many more type tags available in the pointer; can encode things like mutability in the tag.
  • fewer bits taken away from the space of immediates, the big win being fixnums now have 62 usable bits instead of 60.

The drawbacks are:

  • fairly x86_64 specific.
  • creating a static value is no longer a single instruction because the tagging scheme can't be accomplished at assembly / link time (without some major contortions like linker scripts that aren't appropriate in this setting).

I think the right approach here is to stick with the scheme we've been using, but make some adjustments to handle things like mutability. I think it's also worth thinking about how these choices impact the GC, which needs to be actually updated and maintained for all the languages once it's introduced.

So assume we start with the existing coding scheme: immediates have 000 in the lsbs, leaving 7 tags for "core" pointer types. They are: boxes, conses, strings, vectors, symbols, procedures, and a catch all that uses tagged data instead of tagged pointers, i.e. the pointer points to memory that contains a type descriptor and then data; this is how structures are currently represented.

Fixnums have 0000 in their lsbs, so we can use this byte of memory to encode things like mutability in strings and vectors.

What about boxes? I'm not sure what the right move is here. Perhaps take them out of the core pointer types and have them point to a two (q)word object where the first word tags the data as being a mutable or immutable box. That doubles the storage requirement of a box, which is unfortunate, but how crucial are boxes? At some point, when adding floating point numbers it will probably be more important to have a core pointer for them. But then that complicates the pedagogical story when boxes are introduced because they would have a different strategy from pairs and knowing why relies on things that haven't been covered yet.

Of course, the other extreme here is to uniformly represent heap allocated values as (untagged) addresses and the first word contains the tag. This means 50% space overhead for pairs, which is brutal, but maybe that's the right choice for a compiler class. It also means type tag checking involves a memory dereference. Acknowledge that this isn't ideal and that you'd have to do something better in the future. This choice also make the garbage collector much easier to write since you can completely determine the type of objects from their memory in the heap. This also gets you back to 62-bit fixnums.

@dvanhorn
Copy link
Member Author

Another thing I realized, in the current tagging approach, we always erase the tag (with an xor) before the dereference, but there's really no need to: you can just adjust the offset by the tag, i.e. car changes from:

(seq (assert-cons rax)
     (Xor rax type-cons)
     (Mov rax (Offset rax 8)))

to

(seq (assert-cons rax)
     (Mov rax (Offset rax (- 8 type-cons))))

Also note that our current approach to type tag checking is to move the value to a temporary register, mask, and compare. If the tags were a byte instead of 3 bits, you could just use the byte register alias for a direct compare:

(seq (Cmp 'al type-cons)
     (Jne 'err)
     (Mov rax (Offset rax (- 8 type-cons))))

To get there, you'd need the heap to be 16-byte aligned. This requires wasting a (q)word of memory occasionally and complicates the allocation of vectors, strings, etc., but might be worth considering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant