Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support interval multiplication and division by arbitrary numerics #6906

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

milevin
Copy link

@milevin milevin commented Dec 19, 2024

Which issue does this PR close?

Closes #6335

Rationale for this change

This adds support for interval multiplication and division. The end goal is to enable this in DataFusion to make it on par with Postgres in this regard.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 19, 2024
arrow-arith/src/numeric.rs Outdated Show resolved Hide resolved
if right == 0 {
return Err(ArrowError::DivideByZero);
}
Ok((left as f64 / right as f64).round() as i32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why in floats?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid premature truncation in case this is an i32?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for division, would left / right as Self or something like that work?
ie remaining in integer arithemtics. for example for math like 10000000000000000000000000000000000001 / 1 going thru f64 will change the value.

arrow-arith/src/numeric.rs Outdated Show resolved Hide resolved
arrow-arith/src/numeric.rs Outdated Show resolved Hide resolved
@tustvold
Copy link
Contributor

tustvold commented Dec 19, 2024

Might I suggest we just add support for f64 and i32 to start off with, this would drastically cut down on codegen, and make the code easier to parse. DF's type coercion machinery should be able to handle this transparently for users. In practice people will be performing these operations with scalars, and so this type coercion is effectively free.

As an aside, coercing decimals to floats will yield incorrect answers

@findepi
Copy link
Member

findepi commented Dec 19, 2024

As an aside, coercing decimals to floats will yield incorrect answers

that's exactly what would happen if interval * decimal is not implemented directly.
Is this argument to support f64, i64 and decimal here? (and leave out smaller integer and floating point types)

Comment on lines +579 to +585
if !value.is_finite() || value > i32::MAX as f64 || value < i32::MIN as f64 {
Err(ArrowError::ComputeError(
"Division result out of i32 range".to_string(),
))
} else {
Ok(value as i32)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a builtin faillible version of float_value as i32 that we could use instead of implementing the logic ourselves?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is approached in the way I described above you won't be converting floats to integers. When you get a floating point array representing the number of milliseconds and you need to convert that to IntervalDayTimeType you will be dividing the input to MILLIS_IN_A_DAY and round the rest which comfortably fits in a 32-bit integer.

@tustvold
Copy link
Contributor

Is this argument to support f64, i64 and decimal here? (and leave out smaller integer and floating point types)

I think we should just support f64, i64 for now. This is all that is actually implemented anyway by this PR, we're just doing implicit coercion.

that's exactly what would happen if interval * decimal is not implemented directly.

Right, my point is this PR is not implementing interval * decimal it is implementing interval * float64(decimal) masquerading as interval * decimal. Such lossy conversion IMO should be opt-in and explicit.

I'd recommend we leave out decimal support for now, and if people want it we can add support as a separate PR (it will be non-trivial).

@milevin
Copy link
Author

milevin commented Dec 19, 2024

Might I suggest we just add support for f64 and i32 to start off with, this would drastically cut down on codegen, and make the code easier to parse.

Sounds good, I'll make the change! Could you please clarify the comment about codegen? At which point is the codegen invoked?

DF's type coercion machinery should be able to handle this transparently for users. In practice people will be performing these operations with scalars, and so this type coercion is effectively free.

You mean coercion from, say, f32 to f64? What are you referring to in this sentence?

@tustvold
Copy link
Contributor

Could you please clarify the comment about codegen? At which point is the codegen invoked?

Rust relies on monomorophisation of generics, i.e. each distinct parametrisation of a generic generates a "copy" of that code. This can quickly add up and bloat binary size and compile times. See https://rustc-dev-guide.rust-lang.org/backend/monomorph.html

You mean coercion from, say, f32 to f64? What are you referring to in this sentence?

DataFusion has existing logic to coerce inputs to those compatible with a function signature, if we provide the ability to multiply interval * i64, DataFusion has existing machinery to generalize this to interval * (type coercable to i64), including from things like strings.

@milevin
Copy link
Author

milevin commented Dec 19, 2024

Note, in the last commit, I've implemented i32 (but not i64) per:

Might I suggest we just add support for f64 and i32 to start off with

Copy link

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestions require a whole different approach, so you might wait for @tustvold to confirm this is the right direction. This is my first arrow-rs review, but this is exactly how I would approach in Arrow C++ with C++ templates that blow up in binary size similarly to Rust traits.

Comment on lines +233 to +250
(Interval(unit), rhs) if rhs.is_numeric() && matches!(op, Op::Mul | Op::MulWrapping) =>
match unit {
YearMonth => interval_mul_op::<IntervalYearMonthType>(op, l, l_scalar, r, r_scalar),
DayTime => interval_mul_op::<IntervalDayTimeType>(op, l, l_scalar, r, r_scalar),
MonthDayNano => interval_mul_op::<IntervalMonthDayNanoType>(op, l, l_scalar, r, r_scalar),
},
(lhs, Interval(unit)) if lhs.is_numeric() && matches!(op, Op::Mul | Op::MulWrapping) =>
match unit {
YearMonth => interval_mul_op::<IntervalYearMonthType>(op, r, r_scalar, l, l_scalar),
DayTime => interval_mul_op::<IntervalDayTimeType>(op, r, r_scalar, l, l_scalar),
MonthDayNano => interval_mul_op::<IntervalMonthDayNanoType>(op, r, r_scalar, l, l_scalar),
},
(Interval(unit), rhs) if rhs.is_numeric() && matches!(op, Op::Div) =>
match unit {
YearMonth => interval_div_op::<IntervalYearMonthType>(op, l, l_scalar, r, r_scalar),
DayTime => interval_div_op::<IntervalDayTimeType>(op, l, l_scalar, r, r_scalar),
MonthDayNano => interval_div_op::<IntervalMonthDayNanoType>(op, l, l_scalar, r, r_scalar),
},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to keep these patterns concerned only with the lhs/rhs types and leaving the switch on op to interval_op instead of adding interval_mul_op/interval_div_op. Since the preparation to dispatching a mul or a div are similar, keeping it all inside interval_op and switching on op may reduce binary size.

Comment on lines +571 to +574
fn mul_int(left: Self::Native, right: i32) -> Result<Self::Native, ArrowError>;
fn mul_float(left: Self::Native, right: f64) -> Result<Self::Native, ArrowError>;
fn div_int(left: Self::Native, right: i32) -> Result<Self::Native, ArrowError>;
fn div_float(left: Self::Native, right: f64) -> Result<Self::Native, ArrowError>;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of instantiating a single-value interval [/*] number operation to build all the array operations we want, we can approach this in a way that avoids the combinatorial explosion.

To implement interval_array [/*] number operations we need operations that convert intervals to integers that count the number of the smallest unit of the interval type, regular int and float multiplication/divisions (which already exist), and then the conversion back to interval types. These conversion operations are parameterized by a single type so the number of specializations isn't a product of the number of interval types and the number of int and float types. All these operations are at the array level and not at the single-value level.

  • IntervalYearMonthType is already an int32 array (number of months) [1], so just fallback to int32 x ... kernels and convert the result to int32 number of months again (conversion might not even be needed depending on what rhs is)

  • IntervalDayTimeType is days and milliseconds (both int32) so if the whole array is converted to an int64 array of milliseconds you can delegate to regular * / and convert back the result

  • IntervalMonthDayNanoType similar idea.

[1] https://github.com/apache/arrow/blob/02a165922e46e5ed6dd3ed2446141cd0922a7c54/format/Schema.fbs#L398

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately intervals cannot be coerced to a duration as proposed, doing so changes their semantic meaning, 2 months is not the same as 60 days nor is 48 hours the same as 2 days (because of daylight savings). This has implications when performing arithmetic involving them

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately intervals cannot be coerced to a duration as proposed...

For the purpose of multiplication/division, some of the interval types can (internally to the kernel) -- before returning the output array, they are converted back to interval type to preserve their semantic meaning.

YEAR_MONTH

Example: the double of 1 year and 1 month is 2 years and 2 months which is what you get if you convert (1y, 1m) to 13m then 2x to 26m which should then become (2y, 2m). This is all fine because every year, no matter what, has 12 months. So 12 months can become a year in the interval. This is why the underlying implementation itself is in number of months.

Daylight savings and calendar consideration only come into play when you add an interval to a specific timestamp/date which is not the case here. It would be wrong to add 1y+1m to 2024-12-20 as just adding the number of seconds that interval in 30-day months converts to, but the operation we are doing here is the scaling of the interval by some factor while preserving the semantic meaning.

But maybe the multiplication by a float should not be available. Any system that wants to support interval[YEAR_MONTH]*float multiplication should round the float first or convert the interval to an interval type with more resolution.

DAY_TIME

IntervalDayTimeType expects the millis part to be less than a whole day worth of milliseconds (i.e. leap seconds are not accounted for) [1][2]. Perhaps the output of scaling operations on them (mul/div) should be a MONTH_DAY_NANO interval. Because unlike years that always have 12 months, days don't always have the same number of millis in them.

MONTH_DAY_NANO

For MonthDayNano my approach breaks indeed -- the components should be scaled independently. And the number of nanos is unbounded -- it doesn't have to be less than a day's worth of nanos.

Multiplication by float and division gets really confusing though.

[1] https://github.com/apache/arrow/blob/02a165922e46e5ed6dd3ed2446141cd0922a7c54/format/Schema.fbs#L398
[2] apache/arrow@7f7f72d

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkietz what are your thoughts on multiplication and division of interval types? I remember you being involved in some discussion regarding these types in the past.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are converted back to interval type to preserve their semantic meaning

Right but this conversion is not possible for anything other than YearMonth (which is just a number of moths IIRC).

For anything else this conversion is lossy as days and months do not correspond to a fixed duration.

some of the interval types can (internally to the kernel)

If by some you mean one 😅

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main point of my suggestion is to think of these operations in terms of whole-array operations instead of single-value operations.

To treat the interval components separately, we would only need functions that split an array of interval values into multiple arrays (each per component) so that all the arithmetic can be delegated to existing arithmetic operations for all int/float combinations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a sound idea in principle, and this sort of approach to avoiding codegen is something I would generally advocate for, however, I don't think it works in this case.

In particular we don't support arithmetic operations between heterogeneous primitive types, e.g. int32 * float64, instead leaving such coercion to to query planners which are better placed to do this. The result is that we would effectively need the query engines to do this explosion for us, in order for things like interval * f64 to work.

Even discounting this, the performance characteristics of such an approach would be unfortunate, at that point you'd be better off using a StructArray

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felipecrv

@bkietz what are your thoughts on multiplication and division of interval types?

My contention is that (as noted above) intervals don't really map to a duration; they map to a pair of timestamps or dates. Indeed, when considered in isolation from any associated start or end date column, the only interpretation we have left for an interval column is in terms of its (flawed) conversion to duration. Therefore, the only interactions which should be supported on intervals is arithmetic with dates and timestamps:

YearMonthInterval +/- Date32 -> Date32
DayTimeInterval +/- Date64 -> Date64
MonthDayNanoInterval +/- TimestampNS -> TimestampNS
...

From these and an appropriate start or end date column, we can make the correct conversion to duration and do well defined multiplication and division.

Comment on lines +579 to +585
if !value.is_finite() || value > i32::MAX as f64 || value < i32::MIN as f64 {
Err(ArrowError::ComputeError(
"Division result out of i32 range".to_string(),
))
} else {
Ok(value as i32)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is approached in the way I described above you won't be converting floats to integers. When you get a floating point array representing the number of milliseconds and you need to convert that to IntervalDayTimeType you will be dividing the input to MILLIS_IN_A_DAY and round the rest which comfortably fits in a 32-bit integer.

Comment on lines +775 to +782
/// Perform multiplication between an interval array and a numeric array
fn interval_mul_op<T: IntervalOp>(
op: Op,
l: &dyn Array,
l_s: bool,
r: &dyn Array,
r_s: bool,
) -> Result<ArrayRef, ArrowError> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in the first comment, these should become different cases in

fn interval_op<T: IntervalOp>(
    op: Op,
    l: &dyn Array,
    l_s: bool,
    r: &dyn Array,
    r_s: bool,
) -> Result<ArrayRef, ArrowError> {
    let l = l.as_primitive::<T>();
    let r = r.as_primitive::<T>();
    match op {
        Op::Add | Op::AddWrapping => Ok(try_op_ref!(T, l, l_s, r, r_s, T::add(l, r))),
        Op::Sub | Op::SubWrapping => Ok(try_op_ref!(T, l, l_s, r, r_s, T::sub(l, r))),
        -- NEW CASES HERE --

And then instead of relying on try_op_ref! that produces array operations based on single-value functions, you will first convert the interval array to a integer array (for both mul and div so no need for different traits), then delegate to either mul of div of numeric inputs (no interval types involved at this point), then you take that output and convert the desired interval type with the appropriate kernel. These conversion functions might be defined with try_op_ref!.

@felipecrv
Copy link

felipecrv commented Dec 20, 2024

You mean coercion from, say, f32 to f64? What are you referring to in this sentence?

DataFusion has existing logic to coerce inputs to those compatible with a function signature, if we provide the ability to multiply interval * i64, DataFusion has existing machinery to generalize this to interval * (type coercable to i64), including from things like strings.

The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win.

Decimals can be covered by truncating the resulting decimals to integers [1] (e.g. number of millis or nanos) and then converting those int arrays back to the interval type.

[1] Kernels that might already exist or should exist (taking the integer part of a whole decimal array)

@tustvold
Copy link
Contributor

tustvold commented Dec 20, 2024

The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win.

Unfortunately it is incorrect to treat intervals in this way - #6906 (comment)

I will try to find some time to re-review this PR over the next few days

@milevin
Copy link
Author

milevin commented Dec 20, 2024

The approach I suggested doesn't rely on these Datafusion casts and doesn't increase binary size. It's a win-win.

Unfortunately it is incorrect to treat intervals in this way - #6906 (comment)

I will try to find some time to re-review this PR over the next few days

Thanks a lot to both of you for the feedback!

I'll wait for @tustvold's additional review in the next few days and proceed based on the combination of recommendations.

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

So how shall we proceed with this PR? I feel like there was a lot of high level options thrown around but it isn't clear to me what the next steps are?

Perhaps we could make a new PR for a subset of the feature (e.g. interval multiplication) and work out the pattern in that PR where it is simpler to review?

@alamb
Copy link
Contributor

alamb commented Jan 17, 2025

Marking as draft as we don't seem to have reached any consensus and I am trying to make it clear what PRs are waiting on review in the review queue

@alamb alamb marked this pull request as draft January 17, 2025 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multiple and divide on intervals
6 participants