-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support interval multiplication and division by arbitrary numerics #6906
base: main
Are you sure you want to change the base?
Conversation
arrow-arith/src/numeric.rs
Outdated
if right == 0 { | ||
return Err(ArrowError::DivideByZero); | ||
} | ||
Ok((left as f64 / right as f64).round() as i32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why in floats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid premature truncation in case this is an i32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for division, would left / right as Self
or something like that work?
ie remaining in integer arithemtics. for example for math like 10000000000000000000000000000000000001 / 1
going thru f64 will change the value.
Might I suggest we just add support for f64 and i32 to start off with, this would drastically cut down on codegen, and make the code easier to parse. DF's type coercion machinery should be able to handle this transparently for users. In practice people will be performing these operations with scalars, and so this type coercion is effectively free. As an aside, coercing decimals to floats will yield incorrect answers |
that's exactly what would happen if |
if !value.is_finite() || value > i32::MAX as f64 || value < i32::MIN as f64 { | ||
Err(ArrowError::ComputeError( | ||
"Division result out of i32 range".to_string(), | ||
)) | ||
} else { | ||
Ok(value as i32) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a builtin faillible version of float_value as i32
that we could use instead of implementing the logic ourselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is approached in the way I described above you won't be converting floats to integers. When you get a floating point array representing the number of milliseconds and you need to convert that to IntervalDayTimeType
you will be dividing the input to MILLIS_IN_A_DAY and round the rest which comfortably fits in a 32-bit integer.
I think we should just support f64, i64 for now. This is all that is actually implemented anyway by this PR, we're just doing implicit coercion.
Right, my point is this PR is not implementing I'd recommend we leave out decimal support for now, and if people want it we can add support as a separate PR (it will be non-trivial). |
Sounds good, I'll make the change! Could you please clarify the comment about codegen? At which point is the codegen invoked?
You mean coercion from, say, f32 to f64? What are you referring to in this sentence? |
Rust relies on monomorophisation of generics, i.e. each distinct parametrisation of a generic generates a "copy" of that code. This can quickly add up and bloat binary size and compile times. See https://rustc-dev-guide.rust-lang.org/backend/monomorph.html
DataFusion has existing logic to coerce inputs to those compatible with a function signature, if we provide the ability to multiply |
Note, in the last commit, I've implemented i32 (but not i64) per:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestions require a whole different approach, so you might wait for @tustvold to confirm this is the right direction. This is my first arrow-rs
review, but this is exactly how I would approach in Arrow C++ with C++ templates that blow up in binary size similarly to Rust traits.
(Interval(unit), rhs) if rhs.is_numeric() && matches!(op, Op::Mul | Op::MulWrapping) => | ||
match unit { | ||
YearMonth => interval_mul_op::<IntervalYearMonthType>(op, l, l_scalar, r, r_scalar), | ||
DayTime => interval_mul_op::<IntervalDayTimeType>(op, l, l_scalar, r, r_scalar), | ||
MonthDayNano => interval_mul_op::<IntervalMonthDayNanoType>(op, l, l_scalar, r, r_scalar), | ||
}, | ||
(lhs, Interval(unit)) if lhs.is_numeric() && matches!(op, Op::Mul | Op::MulWrapping) => | ||
match unit { | ||
YearMonth => interval_mul_op::<IntervalYearMonthType>(op, r, r_scalar, l, l_scalar), | ||
DayTime => interval_mul_op::<IntervalDayTimeType>(op, r, r_scalar, l, l_scalar), | ||
MonthDayNano => interval_mul_op::<IntervalMonthDayNanoType>(op, r, r_scalar, l, l_scalar), | ||
}, | ||
(Interval(unit), rhs) if rhs.is_numeric() && matches!(op, Op::Div) => | ||
match unit { | ||
YearMonth => interval_div_op::<IntervalYearMonthType>(op, l, l_scalar, r, r_scalar), | ||
DayTime => interval_div_op::<IntervalDayTimeType>(op, l, l_scalar, r, r_scalar), | ||
MonthDayNano => interval_div_op::<IntervalMonthDayNanoType>(op, l, l_scalar, r, r_scalar), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to keep these patterns concerned only with the lhs/rhs types and leaving the switch on op
to interval_op
instead of adding interval_mul_op/interval_div_op
. Since the preparation to dispatching a mul or a div are similar, keeping it all inside interval_op
and switching on op
may reduce binary size.
fn mul_int(left: Self::Native, right: i32) -> Result<Self::Native, ArrowError>; | ||
fn mul_float(left: Self::Native, right: f64) -> Result<Self::Native, ArrowError>; | ||
fn div_int(left: Self::Native, right: i32) -> Result<Self::Native, ArrowError>; | ||
fn div_float(left: Self::Native, right: f64) -> Result<Self::Native, ArrowError>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of instantiating a single-value interval [/*] number
operation to build all the array operations we want, we can approach this in a way that avoids the combinatorial explosion.
To implement interval_array [/*] number
operations we need operations that convert intervals to integers that count the number of the smallest unit of the interval type, regular int and float multiplication/divisions (which already exist), and then the conversion back to interval types. These conversion operations are parameterized by a single type so the number of specializations isn't a product of the number of interval types and the number of int and float types. All these operations are at the array level and not at the single-value level.
-
IntervalYearMonthType
is already anint32
array (number of months) [1], so just fallback toint32
x ... kernels and convert the result toint32
number of months again (conversion might not even be needed depending on whatrhs
is) -
IntervalDayTimeType
is days and milliseconds (both int32) so if the whole array is converted to anint64
array of milliseconds you can delegate to regular*
/
and convert back the result -
IntervalMonthDayNanoType
similar idea.
[1] https://github.com/apache/arrow/blob/02a165922e46e5ed6dd3ed2446141cd0922a7c54/format/Schema.fbs#L398
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately intervals cannot be coerced to a duration as proposed, doing so changes their semantic meaning, 2 months is not the same as 60 days nor is 48 hours the same as 2 days (because of daylight savings). This has implications when performing arithmetic involving them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately intervals cannot be coerced to a duration as proposed...
For the purpose of multiplication/division, some of the interval types can (internally to the kernel) -- before returning the output array, they are converted back to interval type to preserve their semantic meaning.
YEAR_MONTH
Example: the double of 1 year and 1 month
is 2 years and 2 months
which is what you get if you convert (1y, 1m)
to 13m
then 2x to 26m
which should then become (2y, 2m)
. This is all fine because every year, no matter what, has 12 months. So 12 months can become a year in the interval. This is why the underlying implementation itself is in number of months.
Daylight savings and calendar consideration only come into play when you add an interval to a specific timestamp/date which is not the case here. It would be wrong to add 1y+1m
to 2024-12-20
as just adding the number of seconds that interval in 30-day months converts to, but the operation we are doing here is the scaling of the interval by some factor while preserving the semantic meaning.
But maybe the multiplication by a float should not be available. Any system that wants to support interval[YEAR_MONTH]*float
multiplication should round the float first or convert the interval to an interval type with more resolution.
DAY_TIME
IntervalDayTimeType
expects the millis part to be less than a whole day worth of milliseconds (i.e. leap seconds are not accounted for) [1][2]. Perhaps the output of scaling operations on them (mul/div) should be a MONTH_DAY_NANO
interval. Because unlike years that always have 12 months, days don't always have the same number of millis in them.
MONTH_DAY_NANO
For MonthDayNano
my approach breaks indeed -- the components should be scaled independently. And the number of nanos is unbounded -- it doesn't have to be less than a day's worth of nanos.
Multiplication by float and division gets really confusing though.
[1] https://github.com/apache/arrow/blob/02a165922e46e5ed6dd3ed2446141cd0922a7c54/format/Schema.fbs#L398
[2] apache/arrow@7f7f72d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz what are your thoughts on multiplication and division of interval types? I remember you being involved in some discussion regarding these types in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are converted back to interval type to preserve their semantic meaning
Right but this conversion is not possible for anything other than YearMonth (which is just a number of moths IIRC).
For anything else this conversion is lossy as days and months do not correspond to a fixed duration.
some of the interval types can (internally to the kernel)
If by some you mean one 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main point of my suggestion is to think of these operations in terms of whole-array operations instead of single-value operations.
To treat the interval components separately, we would only need functions that split an array of interval values into multiple arrays (each per component) so that all the arithmetic can be delegated to existing arithmetic operations for all int/float combinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a sound idea in principle, and this sort of approach to avoiding codegen is something I would generally advocate for, however, I don't think it works in this case.
In particular we don't support arithmetic operations between heterogeneous primitive types, e.g. int32 * float64, instead leaving such coercion to to query planners which are better placed to do this. The result is that we would effectively need the query engines to do this explosion for us, in order for things like interval * f64
to work.
Even discounting this, the performance characteristics of such an approach would be unfortunate, at that point you'd be better off using a StructArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkietz what are your thoughts on multiplication and division of interval types?
My contention is that (as noted above) intervals don't really map to a duration; they map to a pair of timestamps or dates. Indeed, when considered in isolation from any associated start or end date column, the only interpretation we have left for an interval column is in terms of its (flawed) conversion to duration. Therefore, the only interactions which should be supported on intervals is arithmetic with dates and timestamps:
YearMonthInterval +/- Date32 -> Date32
DayTimeInterval +/- Date64 -> Date64
MonthDayNanoInterval +/- TimestampNS -> TimestampNS
...
From these and an appropriate start or end date column, we can make the correct conversion to duration and do well defined multiplication and division.
if !value.is_finite() || value > i32::MAX as f64 || value < i32::MIN as f64 { | ||
Err(ArrowError::ComputeError( | ||
"Division result out of i32 range".to_string(), | ||
)) | ||
} else { | ||
Ok(value as i32) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is approached in the way I described above you won't be converting floats to integers. When you get a floating point array representing the number of milliseconds and you need to convert that to IntervalDayTimeType
you will be dividing the input to MILLIS_IN_A_DAY and round the rest which comfortably fits in a 32-bit integer.
/// Perform multiplication between an interval array and a numeric array | ||
fn interval_mul_op<T: IntervalOp>( | ||
op: Op, | ||
l: &dyn Array, | ||
l_s: bool, | ||
r: &dyn Array, | ||
r_s: bool, | ||
) -> Result<ArrayRef, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the first comment, these should become different cases in
fn interval_op<T: IntervalOp>(
op: Op,
l: &dyn Array,
l_s: bool,
r: &dyn Array,
r_s: bool,
) -> Result<ArrayRef, ArrowError> {
let l = l.as_primitive::<T>();
let r = r.as_primitive::<T>();
match op {
Op::Add | Op::AddWrapping => Ok(try_op_ref!(T, l, l_s, r, r_s, T::add(l, r))),
Op::Sub | Op::SubWrapping => Ok(try_op_ref!(T, l, l_s, r, r_s, T::sub(l, r))),
-- NEW CASES HERE --
And then instead of relying on try_op_ref!
that produces array operations based on single-value functions, you will first convert the interval array to a integer array (for both mul and div so no need for different traits), then delegate to either mul of div of numeric inputs (no interval types involved at this point), then you take that output and convert the desired interval type with the appropriate kernel. These conversion functions might be defined with try_op_ref!
.
The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win. Decimals can be covered by truncating the resulting decimals to integers [1] (e.g. number of millis or nanos) and then converting those int arrays back to the interval type. [1] Kernels that might already exist or should exist (taking the integer part of a whole decimal array) |
Unfortunately it is incorrect to treat intervals in this way - #6906 (comment) I will try to find some time to re-review this PR over the next few days |
Thanks a lot to both of you for the feedback! I'll wait for @tustvold's additional review in the next few days and proceed based on the combination of recommendations. |
So how shall we proceed with this PR? I feel like there was a lot of high level options thrown around but it isn't clear to me what the next steps are? Perhaps we could make a new PR for a subset of the feature (e.g. interval multiplication) and work out the pattern in that PR where it is simpler to review? |
Marking as draft as we don't seem to have reached any consensus and I am trying to make it clear what PRs are waiting on review in the review queue |
Which issue does this PR close?
Closes #6335
Rationale for this change
This adds support for interval multiplication and division. The end goal is to enable this in DataFusion to make it on par with Postgres in this regard.