Representation of and operations on pointers and usize #251
Comments
Yes, all our integer types are.
These are contingent on details of pointer provenance and how it interacts with integer-pointer casts... so the answer is "we don't know". FWIW, that is also the answer for LLVM IR, and likewise in C/C++. (See this for some of the recent work on the C/C++ side, but I hope we can find something cleaner for Rust.) The opposite direction,
No, certainly not. For example, if
Yes. |
Thanks for the links I'll look at them later.
IIRC, pointers actually have trap representations on certain architectures in C. Since raw pointers in Rust are somewhat relaxed compared to C pointers, can we guarantee that any bit pattern is a valid raw pointer in Rust? |
Oh sorry, I misread. So regarding your question then, I think the answer is yes -- everything that's a valid integer will also be a valid pointer. The other way around might not be true though, that is again about pointer provenance. But really the proper answer is that Rust doesn't have a spec yet that can answer such detailed questions, sorry. :/
No -- see #71 for the related discussions on validity of integers. ("Trap representations" are not a thing in Rust, but we have a related concept of "invalid values".) If uninitialized integers are invalid, then uninitialized raw pointers will also be invalid. |
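As a hedged illustration of the "invalid values" point: if uninitialized raw pointers do turn out to be invalid, storage for a not-yet-initialized pointer would go through `MaybeUninit`. The sketch below uses only the standard-library `std::mem::MaybeUninit` API; the helper name `make_null` is invented for this example.

```rust
use std::mem::MaybeUninit;

/// Hypothetical helper: produces a raw pointer without ever
/// materializing an uninitialized `*const u8` value.
fn make_null() -> *const u8 {
    // The slot may hold arbitrary (uninitialized) bytes at this point.
    let mut slot = MaybeUninit::<*const u8>::uninit();
    // Initialize the slot before reading it as a pointer.
    slot.write(std::ptr::null());
    // Sound because the slot was written on the line above.
    unsafe { slot.assume_init() }
}

fn main() {
    assert!(make_null().is_null());
}
```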
Forgive my naivete, but what are the blockers behind both directions being trivial? If raw pointers are just treated as integers, then these casts become trivial to specify, and all the interesting work goes into |
Pointers in LLVM are not integers, so we cannot make them integers in Rust either. Potentially we could compile raw pointers differently, not using LLVM pointers, but I am not sure how well that works (and LLVM semantics in this area are so unclear, not to say buggy, that there is no telling if that would actually help). |
Oh also, Rust functions like I like the idea of only references having provenance, but I think it is unfortunately unrealistic. It would be good to have more data on this though, like numbers for what the perf impact would be if we used LLVM integers to compile Rust raw pointers. Also see these provenance-related discussions in the UCG. |
I will grant that pointers are complicated, but it seems that pointer provenance already gets wrapped up in discussions about "what's in a byte" even when only talking about integer types. So it seems like we can just make raw pointers exactly as "provenant" as |
I am working under the assumption that The big advantage of pointers is that their only "arithmetic" is "add an integer", so it is easy to say what happens to provenance: the result has the same provenance as the left input (the only pointer input). But provenance needs to end when pointers are cast to integers, and something needs to be done about pointers being transmuted to integers. The easiest option would be to declare it UB, and C's strict aliasing even almost does that. It could also behave like uninitialized integers, but it is unclear if that is any better. This C++ proposal has more detail; and here is even more material.
Quite a few people have thought about these problems for years; there does not seem to be an easy solution that we can "just" use. ;) |
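To make the "add an integer" remark concrete, here is a minimal sketch, assuming nothing beyond the standard `wrapping_add` method on raw pointers. The helper name `advance` is invented for illustration. The result is derived from the single pointer input, which is why its provenance is easy to describe, while the address difference is an ordinary integer.

```rust
/// Hypothetical helper: steps `count` elements forward from `base`.
/// The only pointer input is `base`, so the result is derived from it.
fn advance(base: *const u32, count: usize) -> *const u32 {
    base.wrapping_add(count)
}

fn main() {
    let data = [10u32, 20, 30];
    let p = data.as_ptr();
    let q = advance(p, 2);

    // On the integer side, the address moved by two elements.
    assert_eq!(q as usize - p as usize, 2 * std::mem::size_of::<u32>());

    // `q` was derived from `p` and stays inside `data`, so reading is fine.
    assert_eq!(unsafe { *q }, 30);
}
```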
What is the reason that all this mess has to be in the spec instead of just in the compiler though? SB already gives a semantics that tracks where you are or are not allowed to write, and once you are in raw pointer land most of the value tracking turns off. If the compiler can infer some provenance information, great, but the spec doesn't need to play that game. Put another way, what is a program that should be UB that SB says is fine, and relies on tracking pointer provenance through raw pointers and/or integers?
Isn't there an LLVM bug (that you filed, I think) that does this with pointers? |
Cases where this might not be true include:
|
I would hope that we can at least ensure that it has the same layout pattern as |
Because so far nobody has been able to propose a semantics that can hide this from the spec, but still do all the desired optimizations and at the same time provide pointer-integer casts.
It does, though. Stacked Borrows relies on this provenance information, and that effect cannot be entirely hidden from the semantics in a language with pointer-integer casts.
Stacked Borrows "gives up" on raw pointers (treating them as having much less provenance -- but there is still some provenance left, namely to track the allocation ID that the pointer originally pointed to), but that is intended as a temporary stepping-stone. #86 contains some examples of optimizations that we want to do, and that LLVM does, but that SB currently fails to support because it forgets provenance for raw pointers. I already mentioned
Do you mean this one? Yes, LLVM does this with pointers and that's wrong. The goal is to make it wrong for pointers but right for integers. |
I do not believe so. I've taken a quick look at annex J (unspecified/undefined/implementation-defined behavior) of C11:
People who want to write unsafe code often consult the C standard to see what works and what doesn't work. I believe it would be good to have an informal document that goes through the C standard and compares C semantics to the current Rust semantics. Besides the differences regarding pointers, another example is unions, which do not have an active field in Rust and can therefore be used to implement transmute. PS: I haven't yet had time to look at your links. |
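The union remark can be sketched as follows. This is an illustrative example, not taken from the thread: the type `FloatBits` and the function `f32_to_bits` are made-up names, and the standard `f32::to_bits` is used only to cross-check the result.

```rust
/// Hypothetical union: both fields share the same four bytes.
#[repr(C)]
union FloatBits {
    float: f32,
    bits: u32,
}

/// Reinterprets the bytes of an `f32` as a `u32` by writing one field
/// and reading the other. Rust unions have no "active field".
fn f32_to_bits(x: f32) -> u32 {
    let u = FloatBits { float: x };
    // Reading a union field is unsafe; here every bit pattern is a valid u32.
    unsafe { u.bits }
}

fn main() {
    assert_eq!(f32_to_bits(1.0), 0x3f80_0000);
    assert_eq!(f32_to_bits(-2.5), (-2.5f32).to_bits());
}
```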
The fact remains that both Note that raw pointers not having provenance does not really solve any of the hard problems, it just moves them around. Everything that is currently tricky about casting and transmuting between raw pointers and integers, then becomes tricky about casting and transmuting between raw pointers and references. We have to figure out answers to these questions anyway, we have to find some good way to handle the "boundary" between what has provenance and what does not. Where we put that boundary is mostly orthogonal. |
Actually, those two functions seem like the main argument for keeping provenance, and arguing in this way is very clearly letting LLVM design decisions "leak" into Rust, because we originally added the function because LLVM has it, but LLVM has weird semantics for it so now we have to support their weird semantics. Currently, With numbers on this we could make a more informed judgment about whether it is worthwhile to have two distinct pointer tracking mechanisms in rust (SB and pointer provenance). |
I am all in favor of gathering the data, but do not have the experience needed to do so. (Note that one would also have to change how
Note that SB's pointer tags are a form of pointer provenance, so terminology is a bit mixed up in that sentence. See here for a definition of provenance. C-style "tracking of the original allocation" is just one example of provenance; Stacked Borrows tags are another. Also, the discussion was about whether raw pointers should have any provenance or not. In your last sentence you are making a totally different statement. I do not know if using SB tags can entirely supplement tracking allocation IDs (for references or raw pointers) -- and even if they can in principle, I do not know if it is possible to do this in a way that is compatible with LLVM. Furthermore, if SB can supplement allocation IDs, then once raw pointers have SB, it could do so there as well! So we could have raw pointers with provenance, and still have only one form of provenance. To conclude, I see very little connection between the discussion we had before, and "do we need both SB tags and allocation IDs". You made a huge jump there in you reasoning that I think is not backed by arguments. I don't even see having allocation IDs as a problem, they are not particularly complicated. What is the problem you are trying to solve? All the time it seemed like you wanted to solve the issues around casting between raw pointers and usize, which are caused by raw pointers having provenance. So this has nothing to do with whether we have allocation IDs elsewhere in the semantics or not. Also see my argument above that as long as we have any kind of provenance (SB tags or alloc IDs) anywhere, this problem remains. So even if we only have SB tags and even if raw pointers have no provenance, we still have to solve this problem. |
I realize this; but SB pointer tags only operate on references, not raw pointers. As I've tried to say, I would really rather have pointers be integers (perhaps with some kind of undef but otherwise just a bit pattern), because tracking pointer chunks through arbitrary mathematical operations is pretty clearly a doomed enterprise (and one that leaves the spec in a shambles). Having SB for references and C style pointer provenance for raw pointers seems like the worst of both worlds.
Hm, this is an interesting thought. My gut says it should be possible, since an SB borrow implies that the reference has the same LLVM style pointer provenance as the source of the borrow, but it deserves a full exploration, probably in an issue of its own.
Is this referencing an extension of SB where raw pointers also get tags? I don't have a very good conception of what this would look like, and a suspicion that it is just as hard to get right as C style pointer provenance.
What I'm trying to do is solve the issues around ptr-to-int casts by declaring raw pointers to be integers with no provenance. References, on the other hand, are tracked by SB and so have a form of provenance. In my previous remarks I use the term SB for that and reserve "pointer provenance" to refer to C / LLVM alloc ID style provenance, but yes, I take your point that SB is also pointer (or rather reference) provenance.
referring to:
Once the boundary is drawn, it seems that the questions mostly answer themselves. If you cast a reference to a pointer, you lose the tag. If you cast a pointer to a reference, you get a reference with "raw" tag, which acts like a pointer until/unless it is retagged. SB already supports these operations, so I'm not sure what the hard part is. The main "issue" is that by making a choice, this spec is definite enough to determine a set of allowed optimizations, and so we can look at whether our LLVM lowerings are allowed. The most notable non-validated lowerings are |
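For readers who want to see the two casts being discussed, here is a small sketch of a reference-to-pointer cast followed by a pointer-to-reference cast. The function name `bump_via_raw` is invented, and the comments describe the intent of the example rather than any settled aliasing rules.

```rust
/// Hypothetical helper: increments the target twice, once through a raw
/// pointer and once through a reference re-created from that raw pointer.
fn bump_via_raw(x: &mut i32) {
    // Reference to raw pointer (an implicit coercion, same as `x as *mut i32`).
    let raw: *mut i32 = x;
    unsafe {
        // Access through the raw pointer.
        *raw += 1;
        // Raw pointer back to a reference.
        let r: &mut i32 = &mut *raw;
        *r += 1;
    }
}

fn main() {
    let mut v = 5;
    bump_via_raw(&mut v);
    assert_eq!(v, 7);
}
```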
As I said repeatedly, that is just the status quo and was never meant to be the final result. But without things like rust-lang/rust#64490, it is hard to write code correctly otherwise. Also, note that even currently this is only mostly true -- you can transmute a reference to a raw pointer, and then you have a raw pointer with a proper tag. So optimizations cannot assume things like "raw pointers have no tag".
Yes that's what I was referring to. And indeed the problems around ptr-int-casts are similar to the ones in C. But the same is true in the status quo, that just moves the problem from the raw-ptr / int boundary to the reference / raw-ptr boundary.
If it were that easy, we could do the same by drawing a clear boundary between raw ptrs and integers and saying that the casts do all the necessary adjustments. But this proposal solves nothing as it leaves the hard question unanswered: what if you transmute across the boundary? That is the unsolved issue around raw ptrs with provenance, and it is just as unsolved when references have provenance but raw ptrs have not. In fact, we can ignore raw pointers for this. We seem to agree that references have provenance but integers do not. Now, what is the behavior of transmuting a reference to an integer, or (equivalently) doing a
unsafe fn foo(x: *mut usize) {
    let val = *x;
    *x = val; // We should be able to remove this store, but if the load strips provenance, we cannot.
} |
I might be missing all the implications of this, but I would like to say that transmutes also strip provenance. I recall you arguing somewhere that transmutes can't change the value at all, but I forget the details of this, and at least in the simple case it seems like casting a If this isn't workable, it might also be possible to defer the casting, getting a value with provenance sitting in a struct which nominally is supposed to have a raw pointer or integer value at that position, and the provenance is stripped only lazily, when that field is actually loaded.
unsafe fn foo(x: *mut usize) {
    let val = *x;
    *x = val; // We should be able to remove this store, but if the load strips provenance, we cannot.
}
As long as provenance is stripped eagerly, there should be no problems justifying this, the value being written has no provenance but the original value didn't have any either. But if provenance is stripped lazily this is more complex, because this actually has an effect on the memory, assuming that the usize value there was actually a reference with a tag. One way to sidestep the problem is to have an operation
Originally, I wanted to say that we should be able to just randomly clear provenance in memory whenever we want, but I guess this can make non-UB code UB. What are the arguments for provenance existing in memory? Up until now I've mostly been thinking about reference values having tags, but I guess SB also allows them to be stored in memory, still with tags, in the manner of your "what's in a byte" post. I recognize the need for the borrow stack, but what goes wrong if all stores to memory just turn into 0's and 1's? It seems a bit bizarre that rust has untyped memory yet allows storing provenance in memory like that.
(Forgive the naive questions, I haven't been thinking about this problem as long as you so I'm sure my questions and solutions are naive.) |
That is the third option then. And yes I think you are missing the implications. ;)
Do you mean "transmuting a &mut T to a *mut T"? FWIW, most of this discussion is already carried out in this paper that I referenced above. Ctrl-F "Type Punning". Since you seem interested in this topic, I think it would be a good idea to do some background reading to learn about the options people already thought about and why they do not work. :)
I don't know what "eager provenance stripping" is... are you saying casting a reference should affect the memory it points to? That has no chance at all to work, it means casts have side-effects so they cannot be reordered any more. They also cannot be removed any more even if their result is unused as that would remove their side-effects. This is the same reason that this earlier model on int-ptr-casts does not work: casts must be pure operations not affecting memory and not depending on memory, otherwise optimizations around casts are too severely limited.
When you have a value |
You are right, this doesn't work.
Yes and no. You get back a "weaker" version of the value, about which you can prove fewer optimizations, but you can still do store forwarding. I guess it's equivalent to casting your reference to an integer as you write it and back to a reference when you read it, losing the tag. And here also you can use an intrinsic operation to strip provenance at the value level: that is,
fn foo(ptr: *mut &usize) {
    let x = 3;
    let y = &x;
    *ptr = y;
    bar(*ptr);
}
with store forwarding becomes
fn foo(ptr: *mut &usize) {
    let x = 3;
    let y = &x;
    *ptr = y;
    bar(strip_provenance(y));
}
where |
I am very doubtful that these strip_provenance operations will not be huge optimization barriers everywhere. At this point everything we can say about this semantics is highly speculative. Also this makes "the content of a local variable" something special, but it really is not -- |
I've read this paper now. In 4.9 loading a usize from a pointer-to-pointer was defined to yield poison. I assume this is also what you are referring to here. This seems rather strange since the safe-transmute RFC will most likely make transmuting |
Indeed, it is rather unsatisfying. But the other alternatives on the table are arguably worse. The safe-transmute RFC should probably hold back on explicitly blessing such transmutes... |
I suggest you make a comment to that effect there. I do not think many people there are aware of these issues. |
What are the problems with this model:
When store forwarding replaces an integer load/store, it must delete any provenance information at the target memory location instead of removing the operation entirely. |
Depending on what exactly you mean by this, this invalidates dead store elimination:
unsafe fn foo(x: *mut usize) {
    let val = *x;
    *x = val; // deletes provenance in target location
}
If we remove the store, that means the transformed program has more provenance information, which can introduce UB into a previously UB-free program.
What does it mean to "ignore" provenance? If you have two integers
There is no proposal for an aliasing model yet that actually satisfies this condition. If you have a valid safe Rust program and delete the provenance (Stacked Borrows tag) of some reference, it can become UB. |
The dead store cannot be removed from the IR, but at some point you're going to be taking the IR to actual machine code, at which point those stores can be removed. (Since provenance doesn't actually exist on real hardware)
If
Surely as soon as FFI is involved you're going to have to deal with pointers whose provenance is unknown. For example, let's say there's a C function It would be surprising to me if I could cause a program to exhibit UB simply by converting uses of |
Storing the result would be fine, but what if you instead take the result and transmute it back to a reference and use it? You could get into a situation where it is not UB if the provenance is erased by the |
Sure, but that will likely still be a problem for later optimization passes on the IR -- "this piece of code does not write to memory" is a useful property.
As @digama0 already said -- but what if I transmute the integer to a pointer? Then
In current Stacked Borrows, when a reference is cast to a pointer, the relevant memory gets marked as "accessible to pointers with unknown provenance". I hope to move this change to ptr-to-int casts at some point which probably better models LLVM behavior, but as this discussion shows, LLVM behavior is unclear. It is absolutely imperative that pointers with unknown provenance can not access memory that an |
I haven't read the rest of the conversation but I'm sure that this is not the case in current rust code if we apply the provenance rules from your paper. |
So this is UB?
extern "C" {
    fn identity(x: &mut i32) -> &mut i32;
}
fn bad(x: &mut i32) {
    // Calling a foreign function requires an unsafe block.
    unsafe { *identity(x) = 42; }
} |
Indeed, and that's why -- quoting from my previous message -- "when a reference is cast to a pointer, the relevant memory gets marked as 'accessible to pointers with unknown provenance'".
It is UB if it forgets provenance. But then really you should not call it
Defining precisely how FFI works is really complicated (and I am not an expert, I know just enough to avoid the topic), so I don't think we want to go there. I also don't see how it helps the discussion. If you want to talk about linking and FFI precisely, you need a shared memory model of the two sides that get linked -- so if the linking happens while provenance still exists, both sides need to be aware that there is provenance, and define what happens to it. If the linking happens on the assembly level, there is no provenance and thus no provenance-related problems.
We have to assume that C code does not have any capabilities to affect Rust code in a way that goes beyond what Rust code can do (otherwise, sound linking is simply impossible -- every way in which the foreign code could affect the Rust Abstract Machine needs to be explicitly accounted for in said Abstract Machine). So any counterexample you have in mind, you should be able to write entirely in Rust. FFI is a red herring. |
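As a sketch of what "write it entirely in Rust" could look like for the `identity` example, the foreign function can be replaced by a plain Rust one. This version returns its argument unchanged and therefore does not forget anything; the name `good` is invented to contrast with `bad` above.

```rust
/// Pure-Rust stand-in for the foreign function: returns its argument as is.
fn identity(x: &mut i32) -> &mut i32 {
    x
}

/// Hypothetical caller, mirroring `bad` but without any FFI involved.
fn good(x: &mut i32) {
    *identity(x) = 42;
}

fn main() {
    let mut v = 0;
    good(&mut v);
    assert_eq!(v, 42);
}
```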
Sure, I was just using FFI as an example of a black-box, something into which the compiler has no visibility. It sounds like you're saying that in these cases you need to mark the memory as "accessible to the whole program". If so, that's what I meant by "forgetting the provenance" of a pointer. |
Not really... that's just "an arbitrary sequence of Abstract Machine instructions, but we don't know which". However we know which instructions the Abstract Machine has, so we still can put some bounds on what that arbitrary code might do. So, the compiler has to assume that any value passed to that black box is available via some global variable to everything else. But on the Abstract Machine level, there's no "forgetting the provenance" happening here. |
How is that a meaningful distinction? If (in Rust, and under the "integers don't have provenance but pointers do" rules) I cast a pointer to an integer and back. It would seem to me to be exactly equivalent to passing the pointer into a black-box function and receiving a new pointer back. The black-box example must be as weak as the cast, since the black-box could by definition do anything (including cast). Finally, we know that replacing In both cases, we still know that it cannot alias with some other pointers (which we know did not escape to some global variable, or get cast to an integer). This is all that I mean by saying the provenance is unknown. It does not seem necessary to me to pick in advance and say that "all pointers have provenance and all integers do not". We can say that we have provenance information for some pointers and for some integers. We can define that (most) operations on integers return a new integer without provenance. We can define at the IR level an operation (like black-box) which destroys that provenance information, and we can define optimizations (like the equality optimization) for both integers and pointers, as long as that optimization first wraps the inputs in Admittedly, this replacement will have a cost (since provenance is useful for other optimizations) but that cost can be weighed as part of the trade-off in applying the optimization. It's just the same old "what order do we apply optimizations in" problem. At least the optimizations are valid whatever the order. |
I think there is a huge distinction between what we "bake in" to the Abstract Machine (and implement in Miri), and the reasoning principles that follow from that.
I cannot see any way in which these two would be "equivalent" -- so, I do not understand what notion of "equivalence" you mean. The standard notion of equivalence on programs is "contextual equivalence", which basically means the two pieces of code are equivalent if you can replace one by the other without changing program behavior in any way that can be observed from the outside. Under this notion, an int-ptr-roundtrip is clearly very distinct from an arbitrary function... the function might e.g. add 3 to the pointer address, which the int-ptr-roundtrip will never do.
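The observable difference mentioned above can be sketched with two small functions. Both names are invented for this example, and only addresses are compared; no pointer is dereferenced, so the sketch says nothing about provenance itself.

```rust
/// Hypothetical int-ptr roundtrip: cast to an integer and straight back.
fn roundtrip(p: *const u8) -> *const u8 {
    p as usize as *const u8
}

/// One of the many things an arbitrary function could do instead.
fn shift_by_three(p: *const u8) -> *const u8 {
    p.wrapping_add(3)
}

fn main() {
    let buf = [0u8; 8];
    let p = buf.as_ptr();

    // The roundtrip never changes the address...
    assert_eq!(roundtrip(p) as usize, p as usize);
    // ...while an arbitrary function may, so a context can tell them apart.
    assert_eq!(shift_by_three(p) as usize, p as usize + 3);
}
```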
No I think that's wrong. In particular, optimizations such as equality-test-based GVN (the first optimization in my blog post) rely on integers never having provenance. If there is even the possibility of a value of integer type having provenance, the optimization is wrong in the general case (the optimization only remains correct if we can prove that the integers being tested here do not have provenance, but of course that will be de-facto impossible in the general case). One way to view this is that by adding the possibility of integers having provenance, the set of possible values of integer type is increased, so each optimization working on such values now needs to also consider the new values that were added. (This is similar to how the mere existence of LLVM's But I also lost track of the problem we are trying to solve here. |
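A rough source-level picture of the equality-based replacement, written as two hand-made Rust functions rather than as an actual compiler pass (the names `before` and `after` are invented): inside the `a == b` branch, a use of `b` is replaced by `a`. For plain integers without provenance the two versions cannot be told apart.

```rust
/// Original form: uses `b` inside the branch guarded by `a == b`.
fn before(a: usize, b: usize) -> usize {
    if a == b { b.wrapping_mul(2) } else { 0 }
}

/// Rewritten form: `b` has been replaced by `a` inside that branch.
fn after(a: usize, b: usize) -> usize {
    if a == b { a.wrapping_mul(2) } else { 0 }
}

fn main() {
    // Spot-check that both forms agree on a handful of inputs.
    let samples = [0usize, 1, 7, 4096, usize::MAX];
    for &a in &samples {
        for &b in &samples {
            assert_eq!(before(a, b), after(a, b));
        }
    }
}
```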
I have a few questions regarding `usize` and raw pointers for sized `T`:
- Is `usize` guaranteed to have the same layout as one of the `uN`?
- Is `usize` guaranteed to have no padding bits?
- Is `usize as *mut T as usize` the identity function on `usize`?
- Is `usize as *mut T` guaranteed to be the same as `transmute::<usize, *mut T>`?
- Is `ptr::read_unaligned::<*mut T>` safe for all arguments for which `ptr::read_unaligned::<[u8; sizeof(*mut T)]>` is safe?
- Is `usize-literal as *mut T` guaranteed to have no special behavior? (e.g. for `0usize`)
The current documentation does not answer these questions afaict.
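For orientation, the casts in question can be written out as below. This is only a sketch of what a current compiler on a typical target happens to do; it does not settle which of these properties are guaranteed, which is exactly what the questions ask. The function name `int_ptr_int` is invented.

```rust
use std::mem::size_of;

/// Hypothetical helper spelling out `usize as *mut T as usize` for `T = u8`.
fn int_ptr_int(n: usize) -> usize {
    n as *mut u8 as usize
}

fn main() {
    // Sizes as observed on the current target (not a language guarantee).
    println!("size_of::<usize>()   = {}", size_of::<usize>());
    println!("size_of::<*mut u8>() = {}", size_of::<*mut u8>());

    // Observed behavior of the roundtrip on a few values.
    assert_eq!(int_ptr_int(0), 0);
    assert_eq!(int_ptr_int(0x1000), 0x1000);

    // A zero literal cast to a pointer compares equal to the null pointer.
    assert!((0usize as *mut u8).is_null());
}
```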