CometBuffer can potentially lead to concurrent modification of a held buffer (aka is "Unsound" in Rust terms) #1035
Comments
I don't know why this is tagged as bug... |
Apologies, I presumed that a security vulnerability was a bug... |
As you mentioned, the CometBuffer docs already document the unsafe behavior, so I am not sure why you also said it is not documented. I also don't know why you say it was being dismissed: we have dealt with it by deep copying the arrays in the necessary cases. The fact is that the |
I tried to update the title of this PR to better reflect what it is reporting. |
I am trying to report that your codebase has a critical bug in its handling of memory, without causing unnecessary angst. Modifying buffers behind the back of the Rust compiler is undefined behaviour, and can trivially lead to out of bounds memory access... This is typically considered a serious security defect...
CometBuffer does document that it has unsafe invariants, this isn't in and of itself an issue. Yes, it is unsound which Rust purists will complain about, but provided nothing uses it in a way that triggers undefined behaviour it isn't the end of the world. The problem is #1030 highlights that it is not being used in a manner that avoids undefined behaviour. This is what I think should be documented, highlighted, and arguably fixed. Edit: to expand on this, if, for example, comet were instead using |
As I mentioned earlier, we made an effort to avoid it for the cases we know... |
I believe that Comet scan was developed long before the mutable API came out... |
I wonder if there might be some way to use |
The scan code was not developed by me. I guess that may not work, as CometBuffer internally doesn't use a pointer like |
So basically it is safe predicated on undocumented assumptions about third-party code. This is at best optimistic 😅 I dunno, I'm going to disengage at this point but I have to say I have found the response here deeply disappointing and tbh rather troubling. I had hoped there'd at least be an acknowledgement that this API is hard to use safely, but instead the response is to suggest the problem lies in the third-party code for not conforming to your assumptions, and that you'll just add some extra copies to "workaround" this... There are ways to do this safely and I would have hoped that would be seen as a north star, even if not a priority at the moment... |
Well, I don't presume to know what is / isn't the right design for other systems. I think this ticket will serve as a good discussion for how to potentially improve the situation and I don't find the discussions or responses troubling. |
Makes sense -- thus it sounds like removing this assumption is likely a non-trivial effort. What I personally hope / plan to do is work with people to make the Current obsession is making pushdown predicates even faster with @XiangpengHao and @tustvold: apache/arrow-rs#5523 |
I do think that some valid concerns have been raised here. The questions at the top of my mind right now are:
|
Integration with the DataFusion Parquet reader would need to break some designs around the current native reader, as I mentioned in #1031. For now, it is not an easy transition, as simple as replacing A with B. It will take a huge effort, not to mention that a JVM reader frontend is important for some cases. |
I just took another look at the related code. An Arc / ref counting check won't be effective here, also because these arrays are passed across JVM/native through the C Data interface. Even if I add an Arc to guard exclusive modification of the buffer data, once we export the arrays through the JVM and import them into native, the imported buffers won't get the Arc ref counts back but start with new Arc pointers. We cannot get Arc working through import/export of arrays. |
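The point about reference counts not surviving an export/import round trip can be illustrated with a minimal standard-library sketch. This is not Comet or arrow-rs code; `ImportedBuffer` and `demo_lost_refcount` are hypothetical names, and the "export" here is just copying an address and a length, which is an assumption standing in for the C Data interface.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a buffer rebuilt from an exported address.
/// The pointer is only stored, never dereferenced, so the sketch stays safe.
#[allow(dead_code)]
pub struct ImportedBuffer {
    ptr: *const u8,
    len: usize,
}

/// Returns (strong count of the original Arc, strong count of the imported Arc).
pub fn demo_lost_refcount() -> (usize, usize) {
    let original: Arc<Vec<u8>> = Arc::new(vec![1, 2, 3]);

    // "Export": only the raw address and the length cross the boundary.
    let ptr = original.as_slice().as_ptr();
    let len = original.len();

    // "Import": the receiving side wraps what it was given in a brand new Arc.
    let imported = Arc::new(ImportedBuffer { ptr, len });
    let _second_reader = Arc::clone(&imported);

    // The original count never learns about the readers on the other side.
    (Arc::strong_count(&original), Arc::strong_count(&imported))
}
```

In this sketch the original `Arc` still reports a single owner even though two handles now refer to the same memory, which is why a check on the original count alone cannot tell whether the buffer is safe to reuse.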
This isn't an issue provided the JVM doesn't free or modify the buffers whilst any native code has them. arrow-rs will always treat externally shared memory as immutable, it requires the other participants to do likewise. Once the arrow reference count reaches 0 it will call release, and this can decrement the Java side reference count, or whatever data structure it is using to track this. This is how FFI with pyarrow and arrow-cpp works |
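The release mechanism described above can be sketched with the standard library alone. The names below (`ProducerHandle`, `live_exports`, `demo_release`) are hypothetical and only model the idea: the consumer holds a reference-counted owner object, and dropping the last reference runs a release step that tells the producer the memory is no longer in use. The real C Data interface uses a `release` callback on the exported struct rather than a Rust `Drop` impl.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical owner object handed to the consumer together with the data.
/// Dropping it plays the role of the C Data interface `release` callback.
pub struct ProducerHandle {
    live_exports: Arc<AtomicUsize>,
}

impl Drop for ProducerHandle {
    fn drop(&mut self) {
        // Tell the producer that this export is no longer referenced.
        self.live_exports.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Returns the producer-side count: while two consumers hold the export,
/// after one is dropped, and after both are dropped.
pub fn demo_release() -> (usize, usize, usize) {
    let live = Arc::new(AtomicUsize::new(0));

    // Producer exports one buffer and records it as live.
    live.fetch_add(1, Ordering::SeqCst);
    let consumer_a = Arc::new(ProducerHandle { live_exports: Arc::clone(&live) });
    let consumer_b = Arc::clone(&consumer_a);

    let while_held = live.load(Ordering::SeqCst);
    drop(consumer_a);
    let one_left = live.load(Ordering::SeqCst);
    drop(consumer_b); // last reference: `Drop` runs and the export is released
    let released = live.load(Ordering::SeqCst);

    (while_held, one_left, released)
}
```

The producer only sees the count fall once every consumer-side reference is gone, so it has a reliable signal for when freeing or reusing the memory is safe.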
@alamb asked whether we can use Arc to check that there are no remaining references to the buffer before it is reused. What I said above is that this cannot work, as we cannot keep the Arc reference count across JVM/native. I don't know how that is related to what you are talking about. Of course that's how FFI works, but it doesn't work for the proposed check. |
I interpreted "I wonder if there might be some way to use Arc / ref counting" as referring to the ref-counting machinery present in Edit: To expand further as there seems to be a lot of confusion here, the method linked in the original issue description, just needs to pass something instead of |
Huh? When exporting the array from native to JVM, I believe JVM doesn't care about this |
Indeed, currently in Comet the JVM doesn't use this parameter, and that is the problem 😅 Contrast this with, for example, how arrow-rs handles receiving data over the C Data interface, e.g. for interop with arrow-cpp or pyarrow. It first constructs an FFI_ArrowArray to manage the allocation, and then passes this as the |
I believe that the reason why |
Nope that is precisely what it is for - https://arrow.apache.org/docs/format/CDataInterface.html#memory-management Edit: I realize you could have been saying why Allocation is used in Comet, and yes you are right, however, by using it in this way the JVM has no way to know when it can safely free or reuse the buffer, which is my whole point. If you instead did something similar to what the C Data interface does and gave it an actual allocation object you wouldn't have this issue |
I feel that we still are not on the same page...
It clearly states that the data memory is allocated and maintained by the producer. The consumer must not interfere with the lifetime of the memory allocated by the producer. That is exactly what I said above. |
Okay. I posted the above reply before seeing your addition to previous comment. |
I'm still not sure what you referred to. It is clear to me that when I also feel it somehow distracts from what was discussed at the beginning.
This is what @alamb suggested at the beginning. However, as I posted above, Arc ref counting won't work across JVM/native. We cannot rely on it to do the check for remaining references.
But you claimed that is not an issue. I don't know why. And, Comet doesn't free or modify the buffers in JVM. |
Btw, to clarify, since I think there is some confusion even though this is not the focus: Comet doesn't implement the JVM Arrow FFI. It comes from Java Arrow. I remember Java Arrow has something similar to the arrow-rs |
Ok I will try one final time to clarify from first principles:
|
No, it is not the JVM reusing the buffer; the native reader reuses it. But I get your point. Let's replace it with the native parquet reader.
Okay, I get what you said. I believe that the confusion began with #1035 (comment), #1035 (comment) and #1035 (comment). I thought you were suggesting that we provide a custom Allocation to the JVM, which was confusing because the JVM doesn't take it when importing arrays through FFI. What you are suggesting can be simplified to one sentence: provide a custom Allocation when CometBuffer exports an arrow buffer, and use it to detect whether the exported buffer has been dropped by the importer. That makes sense. I will give it a try. Thanks. |
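A rough sketch of that one-sentence summary, using only the standard library: every export carries a clone of an owner token, and the buffer refuses to be refilled while any clone is still alive. `ReusableBuffer`, `ExportToken`, `export` and `try_refill` are hypothetical names for illustration, not Comet's API. In arrow-rs the token would, as far as I understand, be passed as the owner argument of `Buffer::from_custom_allocation`; that wiring is assumed and not shown here.

```rust
use std::sync::Arc;

/// Hypothetical token that travels with every exported view of the buffer.
pub struct ExportToken {
    _owner: Arc<()>,
}

/// Hypothetical reusable buffer that tracks outstanding exports.
pub struct ReusableBuffer {
    data: Vec<u8>,
    owner: Arc<()>,
}

impl ReusableBuffer {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data, owner: Arc::new(()) }
    }

    /// Hand out a token; the importer keeps it for as long as it reads the data.
    pub fn export(&self) -> ExportToken {
        ExportToken { _owner: Arc::clone(&self.owner) }
    }

    /// Overwrite the contents only when no exported token is still alive.
    pub fn try_refill(&mut self, new_data: &[u8]) -> Result<(), &'static str> {
        if Arc::strong_count(&self.owner) > 1 {
            return Err("buffer is still referenced by an importer");
        }
        self.data.clear();
        self.data.extend_from_slice(new_data);
        Ok(())
    }

    pub fn data(&self) -> &[u8] {
        &self.data
    }
}

/// Returns (refill succeeded while exported, refill succeeded after the token was dropped).
pub fn demo_reuse_guard() -> (bool, bool) {
    let mut buf = ReusableBuffer::new(vec![1, 2, 3]);
    let token = buf.export();
    let while_exported = buf.try_refill(&[4, 5, 6]).is_ok();
    drop(token);
    let after_drop = buf.try_refill(&[4, 5, 6]).is_ok();
    (while_exported, after_drop)
}
```

When the refill is rejected, the caller could fall back to allocating a fresh buffer or copying, so the cost is only paid in the cases where a reader is genuinely still holding the data.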
Describe the bug
It was brought to my attention in apache/arrow-rs#6616 that comet is currently violating the aliasing rules of the Rust compiler. In particular it is mutating memory without exclusive ownership.
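To make "mutating memory without exclusive ownership" concrete, here is a minimal sketch of what safe Rust requires instead. It uses only the standard library and is not taken from Comet or arrow-rs; `try_overwrite` and `demo_exclusive_access` are hypothetical helpers. `Arc::get_mut` grants mutable access only when the caller holds the sole reference.

```rust
use std::sync::Arc;

/// Hypothetical helper: overwrite the buffer in place only if nobody else holds it.
/// Returns false and leaves the data untouched when another reference exists.
pub fn try_overwrite(buf: &mut Arc<Vec<u8>>, new_data: &[u8]) -> bool {
    match Arc::get_mut(buf) {
        Some(inner) => {
            inner.clear();
            inner.extend_from_slice(new_data);
            true
        }
        None => false,
    }
}

/// Returns (result while a reader holds a clone, result after the reader is dropped).
pub fn demo_exclusive_access() -> (bool, bool) {
    let mut buf = Arc::new(vec![1u8, 2, 3]);
    let reader = Arc::clone(&buf);

    // A reader still holds the data, so in-place mutation is refused.
    let while_shared = try_overwrite(&mut buf, &[9, 9, 9]);

    drop(reader);
    // Exclusive again, so the overwrite is allowed.
    let after_release = try_overwrite(&mut buf, &[9, 9, 9]);

    (while_shared, after_release)
}
```

Writing through a raw pointer while such a reader exists skips this check entirely, which is the situation the aliasing rules forbid.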
The docs on CometBuffer actually call out that the type is unsound - https://github.com/apache/datafusion-comet/blob/main/native/core/src/common/buffer.rs#L166.
This is the underlying cause of #1030, which is a relatively harmless manifestation of what is ultimately undefined behaviour.
Even putting aside that UB effectively means the program could do literally anything, the exact scenario in #1030 could easily lead to out of bounds memory access, e.g. by unmasking a dictionary key that was previously null and now points to some arbitrary memory location.
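The dictionary scenario can be modelled with a small safe sketch. `DictColumn` is a hypothetical, heavily simplified stand-in for a dictionary-encoded array, not the arrow-rs type: the key stored in a null slot is never validated, so if the validity mask changes underneath a reader, that key is suddenly treated as a real index.

```rust
/// Hypothetical, simplified dictionary-encoded column: `keys[i]` indexes into
/// `values`, and `validity[i] == false` marks slot `i` as null.
pub struct DictColumn {
    pub keys: Vec<u32>,
    pub validity: Vec<bool>,
    pub values: Vec<&'static str>,
}

impl DictColumn {
    /// Checked lookup: `Some(Some(v))` for a valid value, `Some(None)` for a null
    /// slot, and `None` when a valid slot carries a key outside the dictionary.
    pub fn get(&self, i: usize) -> Option<Option<&'static str>> {
        if !self.validity[i] {
            return Some(None);
        }
        self.values.get(self.keys[i] as usize).copied().map(Some)
    }
}

/// Returns the lookup result for slot 1 before and after its null bit is flipped.
pub fn demo_unmasked_key() -> (Option<Option<&'static str>>, Option<Option<&'static str>>) {
    let mut col = DictColumn {
        keys: vec![0, 1_000_000], // slot 1 is null, so its key was never validated
        validity: vec![true, false],
        values: vec!["a", "b"],
    };
    let before = col.get(1);

    // Simulates the validity buffer being rewritten underneath an existing reader.
    col.validity[1] = true;
    let after = col.get(1);

    (before, after)
}
```

The checked `get` here reports the bad key as `None`; a reader that validated keys once up front and then skips bounds checks would instead read outside the dictionary, which is the out of bounds access described above.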
I debated filing this ticket, as I wasn't sure how it would be received, but I think it is a sufficiently critical vulnerability that it should at the very least be tracked / documented. The way it was being dismissed made me honestly a little uncomfortable. Ultimately CometBuffer is unsound, and there is a concrete example of this unsoundness leading to undefined behaviour in #1030.
Steps to reproduce
Expected behavior
No response
Additional context
FYI @viirya