-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC for a formalized notion on where to enforce reference propertes in MIR #2631
Conversation
cc @rust-lang/wg-unsafe-code-guidelines |
As an observer with no experience in the compiler codebase or background in type theory, everything in this RFC seems like either a restatement of something in #2582, or an implementation detail that we wouldn't normally want in an RFC. I guess that means I'm not entirely sure what this RFC is? |
Ralf Jung suggested writing it up in this format, and I have no experience how this could otherwise be proposed to the language team. Also, this formalization including additional lint opportunities could be used as justification for surface level syntax changes that provide the guarantees outlined herein in a stable manner and remove the requirement for |
Thanks for writing this up! I am a bit surprised because I thought after your last comment that we were in agreement, but okay. it's always great to talk about something concrete. :) However, I'm afraid I don't agree with the approach taken here. The answer to "where to enforce reference properties" is fairly simple, and it has nothing to do with references. There is the more general principle of a validity invariant, and it is quite simple: on every assignment statement in MIR, the validity invariant (defined by the type of the assignment) must hold. Period. The entire point of #2582 is to enable us to follow that approach for references the same way we do for any other type. I think all of this machinery defined here with It seems you do not agree with that approach, and I am still trying to understand why. Let me analyze your two examples and share my thoughts on them. Example 2You have the following example in the surface language (I replaced an #[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(input: *mut Foo, which: bool) -> *const usize {
// reference properties unused in all other usage.
let whole = unsafe { &*input };
if which {
// Cast to pointer, raw unused.
&whole.a
} else {
// Cast to pointer, raw unused.
&whole.b
}
} Under #2582, this code has defined behavior if we assume that // This program has defined behavior
let f = Foo { _layout: 0, a: 1, b: 2};
not_very_defined(&f, true);
not_very_defined(&f, false); The reason this is the case is hat the references created to the fields are immediately coerced to a raw pointer, and hence
The assignment of the result of taking this reference (first line in the MIR snippet above) happens at type So, for this example this RFC changes nothing. EDIT: Actually I realized there is a chance type inference might do the cast after the conditional. Not sure what it will do. However, if it does, then the lint from #2582 will kick in. This is certainly a good testcase, but I think we got it covered either way in the previous RFC. Example 1The other example is: let x = unsafe {
// Desugared version. `raw` is reference with `raw` status.
let raw = &(*packed).field;
// Cast with active raw status, raw pointer cast.
raw as *const _
}; It is correct that this example has UB under #2582. It translates to something like the following MIR: raw = &(*packed).field;
_tmp = raw as *const _;
_0 = _tmp; The first line is an assignment at type This was explicitly discussed in the previous RFC, and we reached consesus that making the semantics more complex to make the above code not UB is counterproductive: it is better to keep the semantics simple, easier to understand and then have a lint to tell people to add a cast if it looks like they want one. Notice that per the other RFC, your example would get linted: all uses of Lints get mentioned in this RFC as well, but TBH I am a bit confused about their role. The purpose of a lint is explicitly NOT to change what the MIR code does or how MIR gets constructed -- all behavior and all UB remains the same. The purpose of the lint is just to help the programmer write code that does not have UB under the semantics I described in the second paragraph of this post. So, from what I can see, the only point where this RFC differs from #2582 is a point that was already discussed, and where the consensus was to keep the semantics simple and have a lint instead. Did I miss anything? |
|
||
Since this is the only use of `x`, and `x` is never settle but originated in an | ||
unsafe block, this is a strong code smell. Specifically, it would be strictly | ||
safer for this reference to never leave the unsafe code block. By tracking the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be suggesting that the compiler ascribes significance to the boundaries of an unsafe block, which is untrue. The unit of unsafety with regards to optimization is traditionally the entire module that contains an unsafe
block.
IIUC, the other RFC does treat this as UB, but only because of the intermediate let
. unsafe { &packed.field } as *const _
should be allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it explicitely states this. But unsafe { }
is a block expression with it's own type-inference and all, so I don't see the significant difference to other block statements such as if-else
in Example 2.
#[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(input: *mut Foo, which: bool) -> *const usize {
// reference properties unused in all other usage.
let whole = unsafe { &*input };
if which {
// Cast to pointer, raw unused.
&whole.a
} else {
// Cast to pointer, raw unused.
&whole.b
}
}
If the definedness does not extend to other block statements, I find this a very dangerous special case in #2582 . And as noted by @RalfJung , this seems to not be the intention, here in an Edit under Example 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it's at a block boundary, it becomes a true reference value - it's somewhat equivalent to a function return (and therefore is UB)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be my interpretation of the other rfc as well. However, in this rfc I intended to keep the raw
status around for all basic blocks of the current function. Thus, function return makes it a true reference value but the block boundary would not yet invoke undefined behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, and I don't think that's ureasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So whether or not such a program has UB depends on a type that was written down far away (if we imagine this function was a bit longer). I would never accept such a program during code review, and always ask people to add the explicit cast at the place where it is needed. Otherwise, (a) it is way too hard to understand why this program is correct, and (b) someone changing the return type might not even be aware that they just introduced UB.
@RalfJung Thanks for the thorough feedback. I must admit that I had read the For example 2, it was intentional to consider the case where
Just because the path inside the object is not static, the code duplicates all pointer dereferencing, all explicit casts, and all convenience such as coercion is gone. I'd say that is a considerably increase in the mental burden of reading unsafe code. So, the only safe pattern requires the most typing and is harder to parse, yet its functionality should in my eyes be one of the safest uses of unsafe? That sounds quite strange, and I'm just a bit worried that this fact encouraged library writers to wrongly use the eaiser but wrong pattern. To address the comment of a lint making this obvious, since the other rfc proposes no explicit mechanisms for lints, without the lint functionality we can not judge the impact on the ecosystem yet. Thus, I'd rather see the compiler err on the safe side of not triggering UB until such an analysis can be done.
These are probably best addressed together? The main difference between HIR->MIR under this rfc and MIR now is that this proposes to sometimes give a surface level reference-type a pointer-type in MIR temporarily. The lint mechanism does not change any of this. The lint finds cases where code is written in a way such that having point representation could be unecessary at all. And as you noted yourself, the lints have a very similar outcome: to encourage writing a direct
If the semantic changes are too broad, alternatively one could view the |
|
||
Assume `packed: *const T`, `(*packed).field` is unaligned. All examples are | ||
assumed to be inside a single unsafe block each for the safe of brevity. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example (with the three code blocks) is confusing and not consistent with the rest of the text and will be reworked. Sorry for not noticing this before the PR...
|
||
Since this is the only use of `x`, and `x` is never settle but originated in an | ||
unsafe block, this is a strong code smell. Specifically, it would be strictly | ||
safer for this reference to never leave the unsafe code block. By tracking the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it explicitely states this. But unsafe { }
is a block expression with it's own type-inference and all, so I don't see the significant difference to other block statements such as if-else
in Example 2.
#[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(input: *mut Foo, which: bool) -> *const usize {
// reference properties unused in all other usage.
let whole = unsafe { &*input };
if which {
// Cast to pointer, raw unused.
&whole.a
} else {
// Cast to pointer, raw unused.
&whole.b
}
}
If the definedness does not extend to other block statements, I find this a very dangerous special case in #2582 . And as noted by @RalfJung , this seems to not be the intention, here in an Edit under Example 2.
Whether or not being initialized matters for references is open, and my take is that it does not -- see rust-lang/unsafe-code-guidelines#77. Regarding the blocks, I don't know if type inference will propagate the raw pointer type inwards and trigger coercions inside the block, or whether it will propagate a reference type outwards and trigger coercions outside the block. But anyway programs shouldn't depend on such subtle syntactic details, they should explicitly use raw pointers and not rely on coercions in such cases.
I can see where you are coming from. However, I think the mental burden when reading code is even higher with your proposal: now I need to make a dataflow analysis to make sure that this is actually a raw pointer disguised at reference type, not a real reference. The point here is to make sure that no references get accidentally created. When reviewing code, I want to be able to see this locally. With your proposal, I have to check every single use of this variable and make sure they are all raw pointers -- if a function call is involved, I may even have to go look up that function's signature. This is way too subtle. Also notice that your example is somewhat unrealistic, I think -- it is really dangerous to have a function return unaligned raw pointers. If I use your function as follows let f = Foo { _layout: 0, a: 1, b: 2};
let ptr = not_very_defined(&f, true);
*ptr = 13; then I have UB under all proposals, because the last write is an unaligned one. But leaving that aside, I think that it is important to see immediately that no references get created. The second version of this example that you showed is actually much clearer than the one you claim as superior in the RFC: #[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(whole: *mut Foo, which: bool) -> *const usize {
unsafe {
if which {
&(*whole).a as *const _
} else {
&(*whole).b as *const _
}
}
} Here, I can see immediately that there will be no references. With your proposed code, doing the same is much more complicated. Now, I think what you are complaining about is that the above is not ergonomic enough. So the first thing you do is add that Maybe you are complaining that Next, I think you are saying that the syntax for creating a raw reference is somewhat clumsy, and you don't like having to write it twice here:
That's why we have a lint, which would fire in both of your examples. The analysis for this is easy, so I am not sure what your problem is with the lint. To turn your argument around: we shouldn't add complicated rules for UB into our language until we are really sure that there is no other way (better syntax, better linting). It is better to have simple rules, even if these rules make slightly more programs UB.
I don't understand what you are trying to say here. The lint on my RFC and the semantics in yours have a similar outcome, except that your RFC complicates the language definition, meaning every single MIR optimization as well as the translation to LLVM needs to worry about this -- as well as everyone else who ever wants to think precisely about the behavior of Rust programs -- while my RFC confines such concerns to the implementation details of a lint, and nobody else has to think about it. That is, of course, a huge difference. I am proposing a simple semantics together with a lint to nudge people towards writing correct code in that semantics so the programmers get the benefit of avoiding mistakes and explicitly documenting what happens, which helps for reviews; and compiler authors and tool developers get the benefit of a simple semantics. You are, I think, proposing a language semantics change which means programs that get the lint under my proposal are UB-free under yours, at the same adding a huge burden to reviewers and anyone else reading code, and complicating MIR passes, LLVM translation and program analysis. To summarize my position: When it comes to unsafe code, keeping the semantics as simple as possible is even more important than in safe code. I am already concerned that also creating raw references in case of implicit coercions is "too magic": when we have |
I should probably expand a bit on this: there is an alternative that has almost the same effect as this RFC but without all the downsides, which is to do the dataflow analysis statically that would otherwise be used for the lint, and use that to inform HIR -> MIR lowering about whether to insert
That does not reflect my understanding of the discussion: we talked about doing what I just described in the previous paragraph, and abandoned it not because it was too vague but because it was too complicated to be put into the definition of what a program does. Admittedly we did not talk about something like your proposal, but at least as far as I am concerned it doesn't change anything about being "too much magic", while having even bigger disadvantages in terms of being able to understand and analyze MIR. |
@RalfJung Thank you again for the extended analysis. You made some good points about readability and clear semantics. I now understand that my idea was comparatively rash and rather narrow with too few considerations of the MIR ecosystem and wrongly implied correctness.
It was specifically intended not to make any more stable guarantees about the safeness of such code than currently, but the appearance of that guarantee (since not exploited by rustc and therefore never caught by e.g.
I think now we finally are more or less in agreement. Is there anything to do to close the pr? |
I suppose the best way is to get in contact with the compiler team, I am not sure how they'd usually organize implementing the lint.
Feel free to close it! |
Formalizes a local analysis during HIR->MIR lowering to track references on which
the compiler should not enforce reference properties. This complements MIR
level changes through which taking the address of a subobject pointed to by a
pointer does not accidentally require or guarantee reference properties on
those pointers.
Rendered
Cc @RalfJung @rust-lang/wg-unsafe-code-guidelines
This does not change any surface syntax and may thus not require an RFC. This is a nicer writeup of initial thoughts in #2582 and complements it, should it be merged (likely).