-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[type-layout] Document bit validity of structs and padding #1630
base: master
Are you sure you want to change the base?
Conversation
This intentionally omits discussions of enums and unions. Enum bit validity is well-defined, but it is difficult to articulate concisely. Union bit validity is still not nailed down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments. The rules arround structs specifically are good though.
Note that the total set of possible bit patterns for any given byte has 257 | ||
elements - 256 initialized bit patterns and one "uninitialized" bit pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not include provenance-carrying bytes, which are complicated.
I'd either briefly mention them, or omit this paragraph entirely.
Each Rust type has "bit validity", which is the set of bit patterns which may | ||
appear in a value of that type. It is [undefined behavior] to produce a value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be better in its own section about bit validity in general, rather than introducing the concept in its entirety in a section specifically about composite types.
Padding bytes have no bit validity requirement: it is always well-defined | ||
to write any byte value - including an uninitialized byte - to any byte of | ||
padding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be useful to mention that padding bytes will get overwritten whenever a struct value is moved. I don't remember if we mentioned it elsewhere in the chapter though - if it's already there, I don't think it needs to be duplicated here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are no existing docs for typed copies (cc @RalfJung). Perhaps we could add a section documenting them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we ultimately shouldn't document bit validity at all. Bit validity is a derived concept. What actually matters is
- decoding a low-level representation into a high-level value
- and inverting that: encoding the value back into a low-level representation
"bit validity" is just the precondition for decoding to be well-defined. If we define decoding, that implicitly defines bit validity as well. But the decode/encode pair also explains other things like padding (and sometimes provenance) being lost on a typed copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described.
I don't see any reason why explicitly documenting encoding and decoding separately is a problem. I'd just want to make sure we can still derive the stronger notion of bit validity that I described from whatever guarantees are made regarding encoding and decoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not permit that, no, as we can relax validity invariants.
Does that hold for bit validities which are explicitly documented? E.g., would we consider unsafe code which assumes that a bool
can only be 0 or 1 [1] to be unsound?
[1] https://doc.rust-lang.org/reference/types/boolean.html:
The value false has the bit pattern 0x00 and the value true has the bit pattern 0x01. It is undefined behavior for an object with the boolean type to have any other bit pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described.
That's not a good idea. Whenever unsafe code permits inputs to not satisfy the safety invariant, that code should document what it actually requires. We can and will change validity rules to allow more values, which removes UB and is hence not considered a breaking change. If someone defines an unsafe function with precondition "anything goes, even if it breaks the safety invariant, as long as it satisfies the validity invariant", that unsafe code can become unsound through such a spec change.
I don't see any reason why explicitly documenting encoding and decoding separately is a problem.
It's fully redundant, and therefore makes it harder to form a mental model of this part of unsafe Rust. It makes it seem like one has to learn the encode/code part and the bit validity rules, when that is not actually necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g., would we consider unsafe code which assumes that a bool can only be 0 or 1 [1] to be unsound?
That's a question about safety invariants, which are a type system concern, not about validity invariant (which are an opsem concern). It is easier to consider this question with &str
since there, validity != safety: it is documented that this is valid UTF-8, so unsafe code may rely on that. But this has nothing to do with validity, and it is in fact not immediate UB to violate the UTF-8 property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any reason why explicitly documenting encoding and decoding separately is a problem.
It's fully redundant, and therefore makes it harder to form a mental model of this part of unsafe Rust. It makes it seem like one has to learn the encode/code part and the bit validity rules, when that is not actually necessary.
Sorry, to clarify, I was agreeing with you here 🙂 Namely, that (for our purposes) just defining encoding/decoding and not mentioning bit validity is fine.
IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described.
That's not a good idea. Whenever unsafe code permits inputs to not satisfy the safety invariant, that code should document what it actually requires. We can and will change validity rules to allow more values, which removes UB and is hence not considered a breaking change. If someone defines an unsafe function with precondition "anything goes, even if it breaks the safety invariant, as long as it satisfies the validity invariant", that unsafe code can become unsound through such a spec change.
E.g., would we consider unsafe code which assumes that a bool can only be 0 or 1 [1] to be unsound?
That's a question about safety invariants, which are a type system concern, not about validity invariant (which are an opsem concern). It is easier to consider this question with
&str
since there, validity != safety: it is documented that this is valid UTF-8, so unsafe code may rely on that. But this has nothing to do with validity, and it is in fact not immediate UB to violate the UTF-8 property.
Okay so IIUC, code which assumes (for soundness) that bools are 0x00
or 0x01
is sound, but only in virtue of the fact that bools have a safety invariant of only being 0 or 1? I think that's fine for our purposes.
I do have a more general concern, though.
It seems to me that, as it currently stands, we don't clearly specify in documentation which things are intended to document validity and which are intended to document safety, and furthermore, we don't clearly specify which are merely preconditions (which we could relax in the future) and which are postconditions (which would be a breaking change to relax).
As an example, consider the Reference page on bools:
The value false has the bit pattern
0x00
and the value true has the bit pattern0x01
. It is undefined behavior for an object with the boolean type to have any other bit pattern.
Clearly this is documenting a precondition for soundness, but is it also documenting a safety invariant? Saying "the value false has the bit pattern 0x00
and the value true has the bit pattern 0x01
" seems to me to be a promise, but maybe it wasn't intended that way.
The stdlib docs are even more nebulous:
If you cast a bool into an integer,
true
will be 1 andfalse
will be 0.
Is there some nomenclature convention I'm missing here, or is it just that we haven't gotten around to making our language consistent? Not that it's specifically your responsibility to own this or anything, but I think it's worth stating for the record: the ambiguity of the language - and especially ambiguity which allows preconditions to be misinterpreted as postconditions or invariants - makes it hard for external authors to be confident that they're writing unsafe
code correctly. Writing unsafe
code is my full-time job and I still keep running into situations where I thought I understood what was being promised only to learn that I was misreading the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so IIUC, code which assumes (for soundness) that bools are 0x00 or 0x01 is sound, but only in virtue of the fact that bools have a safety invariant of only being 0 or 1? I think that's fine for our purposes.
Yes.
It seems to me that, as it currently stands, we don't clearly specify in documentation which things are intended to document validity and which are intended to document safety,
Indeed. So far we haven't stably committed to this dichotomy of invariants, and more importantly, we haven't stably committed to how these two invariants are called. Personally I regret calling them "validity invariant" and "safety invariant"; I now think that "language invariant" and "library invariant" are better terms. They go along nicely with the terms "language UB" and "library UB" that we have also been using already (but also not documented stably yet -- mostly because nobody had the time to write that text, I think).
I opened rust-lang/unsafe-code-guidelines#539 to hopefully make progress on this question.
It does not permit that, no, as we can relax validity invariants.
…On Thu, Oct 10, 2024, 21:20 Joshua Liebow-Feeser ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/type-layout.md
<#1630 (comment)>:
> +Padding bytes have no bit validity requirement: it is always well-defined
+to write any byte value - including an uninitialized byte - to any byte of
+padding.
IMO bit validity (at least as I've used it) is a tad stronger than that:
it also permits unsafe code to assume that it won't see certain values
produced by safe Rust. That implies that it also constrains what low-level
representations can be produced by the "encoding" operation you described.
I don't see any reason why explicitly documenting encoding and decoding
separately is a problem. I'd just want to make sure we can still derive the
stronger notion of bit validity that I described from whatever guarantees
are made regarding encoding and decoding.
—
Reply to this email directly, view it on GitHub
<#1630 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGLD22HV3WP7ABLDQLOSETZ24RVTAVCNFSM6AAAAABOZMPY6WVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDGNRRG4ZTIOJZGM>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
@rustbot author |
☔ The latest upstream changes (possibly bf115a4) made this pull request unmergeable. Please resolve the merge conflicts. |
This intentionally omits discussions of enums and unions. Enum bit validity is well-defined, but it is difficult to articulate concisely. Union bit validity is still not nailed down.