Add shrink_to_fit to Array #6360
Comments
I've been thinking a bit on how to implement this. It seems to be a lot of work, but doable. Ideally we should be able to take any One thing we need to ensure is to only allocate and copy memory if the desired capacity is lower than the current capacity. That is:
fields: Vec<ArrayRef>, |
There is no easy way to opt out of this, though a user could use a heuristic like:
fn maybe_shrink(array: ArrayRef) -> ArrayRef {
    if array.get_buffer_memory_size() < 1024 {
        array // not worth the overhead of calling shrink_to_fit
    } else {
        array.shrink_to_fit()
    }
}
Perhaps this is good enough?
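The rule stated earlier in this comment, to only allocate and copy when the current capacity exceeds what is actually needed, can be sketched with a plain Vec. This is a minimal illustration that uses only the standard library; shrink_if_oversized is a hypothetical helper, not an arrow-rs function, and an Arrow buffer would need its own capacity check.

```rust
/// Hypothetical helper: shrinks `buf` only when it carries spare capacity.
/// Returns true if a shrink was requested, false if nothing needed to be done.
fn shrink_if_oversized<T>(buf: &mut Vec<T>) -> bool {
    if buf.capacity() > buf.len() {
        // Spare capacity exists, so giving it back is worth a reallocation.
        buf.shrink_to_fit();
        true
    } else {
        // Already tight: skip the call so no allocation or copy can happen.
        false
    }
}

fn main() {
    // Over-allocated buffer: room for 1024 elements, only 3 in use.
    let mut oversized: Vec<u64> = Vec::with_capacity(1024);
    oversized.extend([1, 2, 3]);
    assert!(shrink_if_oversized(&mut oversized));
    assert!(oversized.capacity() < 1024);
    assert_eq!(oversized, [1, 2, 3]);

    // Buffer built at its exact size: the helper leaves it alone.
    let mut exact: Vec<u64> = vec![1, 2, 3];
    assert!(!shrink_if_oversized(&mut exact));
}
```

The same guard could sit in front of any per-buffer shrink, which would make a size threshold like the one in maybe_shrink a tuning choice rather than a correctness requirement.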
Alternative designs
I don't really have any, except for the one in #6300
I am new to the code base though, so I might be missing something.
Request for feedback
Thoughts @alamb @tustvold @teh-cmc ?
If the fn shrink_to_fit(&self) -> ArrayRef plan sounds good, I can start implementing it.
What if you made
impl Array for BooleanArray {
    …
    fn shrink_to_fit(self) -> Self {
        Self {
            values: self.values.shrink_to_fit(),
            nulls: self.nulls.as_ref().map(|b| b.shrink_to_fit()),
        }
    }
}
Then if you already had an owned array no clone would be required at all. It would take some finagling to get this out of an ArrayRef - maybe something like
let arr: ArrayRef = ...;
if let bool_array_arc: Arc<BooleanArray> = arr.downcast() {
    bool_array_arc.unwrap_or_clone().shrink_to_fit()
}
Though I suppose you still need to rewrap into an Arc to get an array ref 🤔
If you made the API consuming then you wouldn't force a clone if you were working with the array types directly (though if you started with an ArrayRef and wanted to end with an ArrayRef it would probably end up with the same amount of allocations as your proposal)
You are right, a consuming API seems better, on all levels. I'll probably start work on this later this week.
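To make the consuming-API argument concrete, here is a minimal sketch that uses only the standard library. BoolColumn and shrink_shared are hypothetical stand-ins (not arrow-rs types), with Vec<bool> playing the role of the value and null buffers. It shows that an owned value is shrunk in place with no clone, and that a value behind an Arc is only cloned when the Arc is actually shared.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a typed array; not an arrow-rs type.
#[derive(Clone, Debug, PartialEq)]
struct BoolColumn {
    values: Vec<bool>,
    nulls: Option<Vec<bool>>,
}

impl BoolColumn {
    /// Consuming variant: takes ownership, so no clone is ever needed here.
    fn shrink_to_fit(mut self) -> Self {
        self.values.shrink_to_fit();
        if let Some(nulls) = self.nulls.as_mut() {
            nulls.shrink_to_fit();
        }
        self
    }
}

/// Shrinks a column held behind an Arc and rewraps the result.
fn shrink_shared(col: Arc<BoolColumn>) -> Arc<BoolColumn> {
    // A uniquely owned Arc is unwrapped without copying;
    // a shared one falls back to cloning the inner value.
    let owned = Arc::try_unwrap(col).unwrap_or_else(|shared| (*shared).clone());
    // Rewrapping costs one new Arc allocation, as noted in the thread.
    Arc::new(owned.shrink_to_fit())
}

fn main() {
    let mut values = Vec::with_capacity(256);
    values.extend([true, false, true]);
    let col = Arc::new(BoolColumn { values, nulls: None });

    let shrunk = shrink_shared(col);
    assert_eq!(shrunk.values, vec![true, false, true]);
    assert!(shrunk.values.capacity() < 256);
}
```

The rewrap into a fresh Arc is the cost the earlier comment points at: starting and ending with a reference-counted handle still takes one allocation even when the inner data is never cloned.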
Actually, no. We can make |
I wonder if we should give some thought to how this may integrate with types such as DictionaryArray or StringViewArray that can be compacted/GC'd. I worry a little that a shrink_to_fit API might be constraining, and wonder whether we would be better off adding a first-class compact kernel within e.g. arrow-select. This would then be able to take an options struct to control its behaviour.
The actual problem that we need to solve is the capacity of repeated I don't know the arrow codebase well enough to even know what |
It sounds like #6692 might be what you're actually wanting as a long-term API, although this will require someone very familiar with the arrow codebase to meaningfully action.
Pretty much, there are macros to make this slightly less obnoxious.
Pretty much all the kernels do some variation of this.
That all being said, shrink_to_fit can just implement the basic logic, and if/when we add a compact kernel, that can simply call shrink_to_fit. This could allow progress to be made on this now. Regarding the consuming vs non-consuming versions, etc... one idea might be to just have:
And then use
I have a working version of
I think
Yeah, I thought we might be able to get around this using Arc::get_mut for ArrayRef, but this won't work for the typed wrappers which only have access to an immutable ref. I think a compact kernel may be the only consistent way to proceed, it is certainly the approach that is more consistent with other functionality
The new approach of
The only thing left to implement is to handle the case of a shared
A) ignore it (if it is shared, we can't free up memory)
I'm actually happy with any of these three paths. Please take a look at #6790 and tell me what you think. I find it a rather clean approach, as it doesn't require the dynamic downcasting of
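The Arc::get_mut idea and option A above (ignore shared data) can be sketched together with the standard library alone. shrink_if_unique is a hypothetical helper, and Arc<Vec<u8>> merely stands in for a reference-counted buffer; this is not how #6790 is implemented, only an illustration of the ownership check.

```rust
use std::sync::Arc;

/// Hypothetical helper: shrinks the buffer in place when this Arc is the
/// only handle to it, and leaves it untouched when it is shared.
/// Returns true if the buffer was shrunk.
fn shrink_if_unique(buf: &mut Arc<Vec<u8>>) -> bool {
    match Arc::get_mut(buf) {
        // Sole owner: mutate in place, no clone and no new Arc.
        Some(inner) => {
            inner.shrink_to_fit();
            true
        }
        // Shared: other handles keep the allocation alive, so skip it.
        None => false,
    }
}

fn main() {
    let mut data: Vec<u8> = Vec::with_capacity(4096);
    data.extend_from_slice(&[1, 2, 3]);

    // Uniquely owned buffer is shrunk in place.
    let mut unique = Arc::new(data.clone());
    unique = {
        // Rebuild with spare capacity so there is something to free.
        let mut v: Vec<u8> = Vec::with_capacity(4096);
        v.extend_from_slice(&unique);
        Arc::new(v)
    };
    assert!(shrink_if_unique(&mut unique));
    assert!(unique.capacity() < 4096);

    // Shared buffer is left alone while a second handle exists.
    let mut shared = Arc::new(data);
    let alias = Arc::clone(&shared);
    assert!(!shrink_if_unique(&mut shared));
    assert!(shared.capacity() >= 4096);
    drop(alias);
}
```

Note that Arc::get_mut needs a mutable reference to the Arc itself, which is the limitation raised above for typed wrappers that only hold an immutable reference.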
We keep a lot of Arrow data in long-lived memory storage, and therefore must be guaranteed that the data has been optimized for space before it gets committed to storage, regardless of how it got built or how it got there.
This would also provide a strongly typed non-trait version, similar to what is done e.g. for slice.
See this PR for more background context: MutableBuffer::into_buffer leaking its extra capacity into the final buffer #6300