Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for a formalized notion on where to enforce reference propertes in MIR #2631

Closed
wants to merge 8 commits into from

Conversation

HeroicKatora
Copy link

Formalizes a local analysis during HIR->MIR lowering to track references on which
the compiler should not enforce reference properties. This complements MIR
level changes through which taking the address of a subobject pointed to by a
pointer does not accidentally require or guarantee reference properties on
those pointers.

Rendered

Cc @RalfJung @rust-lang/wg-unsafe-code-guidelines

This does not change any surface syntax and may thus not require an RFC. This is a nicer writeup of initial thoughts in #2582 and complements it, should it be merged (likely).

@jonas-schievink
Copy link
Contributor

cc @rust-lang/wg-unsafe-code-guidelines

@Ixrec
Copy link
Contributor

Ixrec commented Jan 27, 2019

As an observer with no experience in the compiler codebase or background in type theory, everything in this RFC seems like either a restatement of something in #2582, or an implementation detail that we wouldn't normally want in an RFC. I guess that means I'm not entirely sure what this RFC is?

@jonas-schievink jonas-schievink added T-lang Relevant to the language team, which will review and decide on the RFC. T-compiler Relevant to the compiler team, which will review and decide on the RFC. A-unsafe Unsafe related proposals & ideas labels Jan 27, 2019
@HeroicKatora
Copy link
Author

Ralf Jung suggested writing it up in this format, and I have no experience how this could otherwise be proposed to the language team. Also, this formalization including additional lint opportunities could be used as justification for surface level syntax changes that provide the guarantees outlined herein in a stable manner and remove the requirement for unsafe code blocks. That would definitely need a formal RFC, and this makes the text and definitions for such available.

@Centril Centril added A-references Proposals related to references A-machine Proposals relating to Rust's abstract machine. A-raw-pointers Proposals relating to raw pointers. labels Jan 27, 2019
@RalfJung
Copy link
Member

RalfJung commented Jan 29, 2019

Thanks for writing this up! I am a bit surprised because I thought after your last comment that we were in agreement, but okay. it's always great to talk about something concrete. :)

However, I'm afraid I don't agree with the approach taken here. The answer to "where to enforce reference properties" is fairly simple, and it has nothing to do with references. There is the more general principle of a validity invariant, and it is quite simple: on every assignment statement in MIR, the validity invariant (defined by the type of the assignment) must hold. Period. The entire point of #2582 is to enable us to follow that approach for references the same way we do for any other type.

I think all of this machinery defined here with settling and raw is unnecessary.

It seems you do not agree with that approach, and I am still trying to understand why. Let me analyze your two examples and share my thoughts on them.

Example 2

You have the following example in the surface language (I replaced an A by Foo and *mut usize by *const usize to make this compiler):

#[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(input: *mut Foo, which: bool) -> *const usize {
    // reference properties unused in all other usage.
    let whole = unsafe { &*input };
    if which {
        // Cast to pointer, raw unused.
        &whole.a
    } else {
        // Cast to pointer, raw unused.
        &whole.b
    }
}

Under #2582, this code has defined behavior if we assume that input is a proper reference to a Foo:

// This program has defined behavior
let f = Foo { _layout: 0, a: 1, b: 2};
not_very_defined(&f, true);
not_very_defined(&f, false);

The reason this is the case is hat the references created to the fields are immediately coerced to a raw pointer, and hence &whole.{a,b} gets translated (in MIR) to &raw whole.{a,b}:

_tmp = &raw whole.a;
_0 = _tmp; // this is determining the return type

The assignment of the result of taking this reference (first line in the MIR snippet above) happens at type *const usize. Hence, per the general principle above, we require the validity invariant of *const usize to hold -- which makes no alignment requirements.

So, for this example this RFC changes nothing.

EDIT: Actually I realized there is a chance type inference might do the cast after the conditional. Not sure what it will do. However, if it does, then the lint from #2582 will kick in. This is certainly a good testcase, but I think we got it covered either way in the previous RFC.

Example 1

The other example is:

let x = unsafe {
    // Desugared version. `raw` is reference with `raw` status.
    let raw = &(*packed).field;
    // Cast with active raw status, raw pointer cast.
    raw as *const _
};

It is correct that this example has UB under #2582. It translates to something like the following MIR:

raw = &(*packed).field;
_tmp = raw as *const _;
_0 = _tmp;

The first line is an assignment at type &_, hence we require the validity invariant for &_ to hold, which (among other things) requires the reference to be aligned. Notice how "being aligned" just falls out of the general principle that at every assignment, the validity invariant must hold!

This was explicitly discussed in the previous RFC, and we reached consesus that making the semantics more complex to make the above code not UB is counterproductive: it is better to keep the semantics simple, easier to understand and then have a lint to tell people to add a cast if it looks like they want one.

Notice that per the other RFC, your example would get linted: all uses of raw only need a raw pointer, and hence the lint would suggest to cast raw to a raw pointer immediately to avoid unnecessarily creating an intermediate reference.

Lints get mentioned in this RFC as well, but TBH I am a bit confused about their role. The purpose of a lint is explicitly NOT to change what the MIR code does or how MIR gets constructed -- all behavior and all UB remains the same. The purpose of the lint is just to help the programmer write code that does not have UB under the semantics I described in the second paragraph of this post.


So, from what I can see, the only point where this RFC differs from #2582 is a point that was already discussed, and where the consensus was to keep the semantics simple and have a lint instead. Did I miss anything?


Since this is the only use of `x`, and `x` is never settle but originated in an
unsafe block, this is a strong code smell. Specifically, it would be strictly
safer for this reference to never leave the unsafe code block. By tracking the
Copy link

@ExpHP ExpHP Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be suggesting that the compiler ascribes significance to the boundaries of an unsafe block, which is untrue. The unit of unsafety with regards to optimization is traditionally the entire module that contains an unsafe block.

IIUC, the other RFC does treat this as UB, but only because of the intermediate let. unsafe { &packed.field } as *const _ should be allowed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that it explicitely states this. But unsafe { } is a block expression with it's own type-inference and all, so I don't see the significant difference to other block statements such as if-else in Example 2.

#[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(input: *mut Foo, which: bool) -> *const usize {
    // reference properties unused in all other usage.
    let whole = unsafe { &*input };
    if which {
        // Cast to pointer, raw unused.
        &whole.a
    } else {
        // Cast to pointer, raw unused.
        &whole.b
    }
}

If the definedness does not extend to other block statements, I find this a very dangerous special case in #2582 . And as noted by @RalfJung , this seems to not be the intention, here in an Edit under Example 2.

Copy link

@strega-nil strega-nil Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it's at a block boundary, it becomes a true reference value - it's somewhat equivalent to a function return (and therefore is UB)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be my interpretation of the other rfc as well. However, in this rfc I intended to keep the raw status around for all basic blocks of the current function. Thus, function return makes it a true reference value but the block boundary would not yet invoke undefined behaviour.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, and I don't think that's ureasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So whether or not such a program has UB depends on a type that was written down far away (if we imagine this function was a bit longer). I would never accept such a program during code review, and always ask people to add the explicit cast at the place where it is needed. Otherwise, (a) it is way too hard to understand why this program is correct, and (b) someone changing the return type might not even be aware that they just introduced UB.

@HeroicKatora
Copy link
Author

@RalfJung Thanks for the thorough feedback.

I must admit that I had read the consensus comment and others of similar sentiment as: We'll do it simple because the other alternatives are too vague. The point of this RFC was to turn this around, to make this alternative the least vague. Very possible that I have missed the point here.


For example 2, it was intentional to consider the case where input does not yet point to initialized Foo, for example because this is part of a construction? And your edit also captures another important point, I wasn't sure how this was intended under #2582 . So, for handling the uninitialized cases as well, one has to write the kind of awkward:

fn not_very_defined(whole: *mut Foo, which: bool) -> *const usize {
    unsafe {
        if which {
            &(*whole).a as *const _
        } else {
            &(*whole).b as *const _
        }
    }
}

Just because the path inside the object is not static, the code duplicates all pointer dereferencing, all explicit casts, and all convenience such as coercion is gone. I'd say that is a considerably increase in the mental burden of reading unsafe code. So, the only safe pattern requires the most typing and is harder to parse, yet its functionality should in my eyes be one of the safest uses of unsafe?

That sounds quite strange, and I'm just a bit worried that this fact encouraged library writers to wrongly use the eaiser but wrong pattern. To address the comment of a lint making this obvious, since the other rfc proposes no explicit mechanisms for lints, without the lint functionality we can not judge the impact on the ecosystem yet. Thus, I'd rather see the compiler err on the safe side of not triggering UB until such an analysis can be done.


There is the more general principle of a validity invariant, and it is quite simple: on every assignment statement in MIR, the validity invariant (defined by the type of the assignment) must hold. […] Lints get mentioned in this RFC as well, but TBH I am a bit confused about their role. The purpose of a lint is explicitly NOT to change what the MIR code does or how MIR gets constructed -- all behavior and all UB remains the same.

These are probably best addressed together? The main difference between HIR->MIR under this rfc and MIR now is that this proposes to sometimes give a surface level reference-type a pointer-type in MIR temporarily. The lint mechanism does not change any of this. The lint finds cases where code is written in a way such that having point representation could be unecessary at all. And as you noted yourself, the lints have a very similar outcome: to encourage writing a direct &place.field as *const_ until the time where better surface syntax is available.


So, from what I can see, the only point where this RFC differs from #2582 is a point that was already discussed, and where the consensus was to keep the semantics simple and have a lint instead. Did I miss anything?

If the semantic changes are too broad, alternatively one could view the raw status tagging in HIR as some sort of explicit proposal for a mechanism for the implementation of all the items proposed in the other rfc 🤔 ? In accordance with the unorthodox way this rfc started from a comment in the other thread, the outcome of only making lint behaviour and lowering mechanisms in the other rfc more concrete is desirable for me on its own.


Assume `packed: *const T`, `(*packed).field` is unaligned. All examples are
assumed to be inside a single unsafe block each for the safe of brevity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example (with the three code blocks) is confusing and not consistent with the rest of the text and will be reworked. Sorry for not noticing this before the PR...


Since this is the only use of `x`, and `x` is never settle but originated in an
unsafe block, this is a strong code smell. Specifically, it would be strictly
safer for this reference to never leave the unsafe code block. By tracking the
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that it explicitely states this. But unsafe { } is a block expression with it's own type-inference and all, so I don't see the significant difference to other block statements such as if-else in Example 2.

#[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(input: *mut Foo, which: bool) -> *const usize {
    // reference properties unused in all other usage.
    let whole = unsafe { &*input };
    if which {
        // Cast to pointer, raw unused.
        &whole.a
    } else {
        // Cast to pointer, raw unused.
        &whole.b
    }
}

If the definedness does not extend to other block statements, I find this a very dangerous special case in #2582 . And as noted by @RalfJung , this seems to not be the intention, here in an Edit under Example 2.

@RalfJung
Copy link
Member

RalfJung commented Jan 31, 2019

For example 2, it was intentional to consider the case where input does not yet point to initialized Foo, for example because this is part of a construction?

Whether or not being initialized matters for references is open, and my take is that it does not -- see rust-lang/unsafe-code-guidelines#77.

Regarding the blocks, I don't know if type inference will propagate the raw pointer type inwards and trigger coercions inside the block, or whether it will propagate a reference type outwards and trigger coercions outside the block. But anyway programs shouldn't depend on such subtle syntactic details, they should explicitly use raw pointers and not rely on coercions in such cases.

Just because the path inside the object is not static, the code duplicates all pointer dereferencing, all explicit casts, and all convenience such as coercion is gone. I'd say that is a considerably increase in the mental burden of reading unsafe code.

I can see where you are coming from. However, I think the mental burden when reading code is even higher with your proposal: now I need to make a dataflow analysis to make sure that this is actually a raw pointer disguised at reference type, not a real reference.

The point here is to make sure that no references get accidentally created. When reviewing code, I want to be able to see this locally. With your proposal, I have to check every single use of this variable and make sure they are all raw pointers -- if a function call is involved, I may even have to go look up that function's signature. This is way too subtle.

Also notice that your example is somewhat unrealistic, I think -- it is really dangerous to have a function return unaligned raw pointers. If I use your function as follows

let f = Foo { _layout: 0, a: 1, b: 2};
let ptr = not_very_defined(&f, true);
*ptr = 13;

then I have UB under all proposals, because the last write is an unaligned one.

But leaving that aside, I think that it is important to see immediately that no references get created. The second version of this example that you showed is actually much clearer than the one you claim as superior in the RFC:

#[repr(packed)]
struct Foo { _layout: u8, a: u16, b: u16, };
fn not_very_defined(whole: *mut Foo, which: bool) -> *const usize {
    unsafe {
        if which {
            &(*whole).a as *const _
        } else {
            &(*whole).b as *const _
        }
    }
}

Here, I can see immediately that there will be no references. With your proposed code, doing the same is much more complicated.

Now, I think what you are complaining about is that the above is not ergonomic enough. So the first thing you do is add that input variable to get auto-deref, saving 3 characters by adding more than 20. Not sure what you are gaining by this - basically, you introduced auto-deref for raw pointers! But there is a reason that we don't do auto-deref for raw pointers (we want you to think about whether this pointer is actually safe to dereference here), and you are entirely subverting that.

Maybe you are complaining that (*foo).field is clumsy syntax. I agree! But the solution to clumsy syntax is to improve the syntax, not to complicate the semantics. I think we should have a postfix * operator, maybe something like whole.*.a or whole*.a or so?

Next, I think you are saying that the syntax for creating a raw reference is somewhat clumsy, and you don't like having to write it twice here: &foo as *const _. Again I agree the syntax is not great, but again the solution to clumsy syntax is not to make a mess out of our semantics. If you have a proposal for a better syntax, I'd be happy to discuss that -- which is also something that was already said during the other RFC. Note that we can add a better syntax in the future, my RFC is compatible with that.

That sounds quite strange, and I'm just a bit worried that this fact encouraged library writers to wrongly use the eaiser but wrong pattern.

That's why we have a lint, which would fire in both of your examples. The analysis for this is easy, so I am not sure what your problem is with the lint. To turn your argument around: we shouldn't add complicated rules for UB into our language until we are really sure that there is no other way (better syntax, better linting). It is better to have simple rules, even if these rules make slightly more programs UB.

These are probably best addressed together? The main difference between HIR->MIR under this rfc and MIR now is that this proposes to sometimes give a surface level reference-type a pointer-type in MIR temporarily. The lint mechanism does not change any of this. The lint finds cases where code is written in a way such that having point representation could be unecessary at all. And as you noted yourself, the lints have a very similar outcome: to encourage writing a direct &place.field as *const_ until the time where better surface syntax is available.

I don't understand what you are trying to say here. The lint on my RFC and the semantics in yours have a similar outcome, except that your RFC complicates the language definition, meaning every single MIR optimization as well as the translation to LLVM needs to worry about this -- as well as everyone else who ever wants to think precisely about the behavior of Rust programs -- while my RFC confines such concerns to the implementation details of a lint, and nobody else has to think about it. That is, of course, a huge difference. I am proposing a simple semantics together with a lint to nudge people towards writing correct code in that semantics so the programmers get the benefit of avoiding mistakes and explicitly documenting what happens, which helps for reviews; and compiler authors and tool developers get the benefit of a simple semantics. You are, I think, proposing a language semantics change which means programs that get the lint under my proposal are UB-free under yours, at the same adding a huge burden to reviewers and anyone else reading code, and complicating MIR passes, LLVM translation and program analysis.

To summarize my position: When it comes to unsafe code, keeping the semantics as simple as possible is even more important than in safe code. I am already concerned that also creating raw references in case of implicit coercions is "too magic": when we have bar(&foo), I'd always demand during review that we write bar(&foo as *const _) instead to not make behavior depend on the signature of bar. The case of let x: *const T = &foo is a good one though so I agreed we want to include coercions into our definition. I am convinced that moving even more towards magic is a big mistake, and doing this with a complicated MIR semantics (instead of a complicated MIR translation) is an even bigger mistake. We should not shift the burden of bad syntax to the semantics, where every single compiler pass will have to remember an unnecessary complex set of rules.

@RalfJung
Copy link
Member

doing this with a complicated MIR semantics (instead of a complicated MIR translation) is an even bigger mistake

I should probably expand a bit on this: there is an alternative that has almost the same effect as this RFC but without all the downsides, which is to do the dataflow analysis statically that would otherwise be used for the lint, and use that to inform HIR -> MIR lowering about whether to insert &raw operations or not. All of your examples would be legal under that proposal, and yet MIR semantics are exactly the same as in my proposal. This is something we discussed during my RFC, and we agreed it was "too much magic".

I must admit that I had read the consensus comment and others of similar sentiment as: We'll do it simple because the other alternatives are too vague.

That does not reflect my understanding of the discussion: we talked about doing what I just described in the previous paragraph, and abandoned it not because it was too vague but because it was too complicated to be put into the definition of what a program does.

Admittedly we did not talk about something like your proposal, but at least as far as I am concerned it doesn't change anything about being "too much magic", while having even bigger disadvantages in terms of being able to understand and analyze MIR.

@HeroicKatora
Copy link
Author

@RalfJung Thank you again for the extended analysis. You made some good points about readability and clear semantics. I now understand that my idea was comparatively rash and rather narrow with too few considerations of the MIR ecosystem and wrongly implied correctness.

I have to check every single use of this variable and make sure they are all raw pointers -- if a function call is involved, I may even have to go look up that function's signature.

It was specifically intended not to make any more stable guarantees about the safeness of such code than currently, but the appearance of that guarantee (since not exploited by rustc and therefore never caught by e.g. miri) can easily be mistaken for it. That probably leaves two points standing:

  1. Using the reference tracking as proposed for lints only. Since it's a monotonic lattice function it could potentially be quite efficient. Any points on how one could contribute something here?
  2. The critique of current abundance of unsafe for (fairly) common tasks
  • If you have a proposal for a better syntax, I'd be happy to discuss that -- which is also something that was already said during the other RFC. Note that we can add a better syntax in the future, my RFC is compatible with that.

  • I would hope so as well, but yes that is much less clear. I'd also be happy to discuss, internals forum or discord.

I think now we finally are more or less in agreement. Is there anything to do to close the pr?

@RalfJung
Copy link
Member

  1. Using the reference tracking as proposed for lints only. Since it's a monotonic lattice function it could potentially be quite efficient. Any points on how one could contribute something here?

I suppose the best way is to get in contact with the compiler team, I am not sure how they'd usually organize implementing the lint.

I think now we finally are more or less in agreement. Is there anything to do to close the pr?

Feel free to close it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-machine Proposals relating to Rust's abstract machine. A-raw-pointers Proposals relating to raw pointers. A-references Proposals related to references A-unsafe Unsafe related proposals & ideas T-compiler Relevant to the compiler team, which will review and decide on the RFC. T-lang Relevant to the language team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants