Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Support for e notation using existing parse_decimal in string to decimal conversion #6905

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

himadripal
Copy link
Contributor

@himadripal himadripal commented Dec 19, 2024

Which issue does this PR close?

Closes #. apache/datafusion#10315

Rationale for this change

What changes are included in this PR?

Completed :

  • replaced string to decimal conversion with existing parse_decimal method
  • added rounding logic to existing parse_decimal method
  • fixed the cast error message to precision and scale in the decimal cast error instead of default precision and scale.
  • add rounding logic in parse_e_notation method.Existing function does not round based on scale for eNotations.
  • removed unused method port over parse_string_to_decimal_native and remove this fn

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 19, 2024
@himadripal himadripal changed the title Fix string to decimal e notation Fix: Support for e notation using existing parse_decimal in string to decimal conversion Dec 19, 2024
@himadripal
Copy link
Contributor Author

himadripal commented Dec 19, 2024

@andygrove @viirya @tustvold please take a first look. The one failing test will be fixed once I add the rounding logic in parse-e-notation function.

.and_then(|v| T::validate_decimal_precision(v, precision).map(|_| v))
parse_decimal::<T>(v, precision, scale).map_err(|_| {
ArrowError::CastError(format!(
"Cannot cast string '{}' to decimal type of precision {} and scale {}",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T:DATA_TYPE shows default Decimal(38,10) or Decimal256(76,..) in the error message, hiding the precision and scale provided for cast.

@@ -230,6 +231,7 @@ where
)?))
}

#[allow(dead_code)]
Copy link
Contributor Author

@himadripal himadripal Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fails in clippy, hence added #[allow(dead_code)], there is no use, if required we can remove it and cover existing tests with parse_decimal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this and port the tests, to ensure we aren't losing test coverage / accidentally changing behaviour

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be a breaking API change, as it changes the rounding behaviour of parse_decimal?

@himadripal
Copy link
Contributor Author

I think this might be a breaking API change, as it changes the rounding behaviour of parse_decimal?

Clippy did not complain and tests are passing, except one which I'm working on - rounding for e-notation. Would any others build task catch it?

@tustvold tustvold added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Dec 26, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems to remove a number of tests, and orphan some others. If we're changing what cast does, can we please remove the old implementation and port the old tests, so that we aren't losing test coverage.

Also as written this PR is a breaking change, as it alters the rounding behaviour of the parser.

@@ -1284,7 +1284,7 @@ mod tests {
assert_eq!("53.002666", lat.value_as_string(1));
assert_eq!("52.412811", lat.value_as_string(2));
assert_eq!("51.481583", lat.value_as_string(3));
assert_eq!("12.123456", lat.value_as_string(4));
assert_eq!("12.123457", lat.value_as_string(4));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can see this is a breaking change to the rounding behaviour

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to note, previous behavior was not correct.

12.12345678 cast to `Decimal128(38, 6)` =  12.123457

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It truncated rather than rounding, they're both valid behaviours, changing this is a breaking change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an argument for accepting the breaking change to use rounding since it would be consistent with how we cast floating point to decimal. However, do we want to consider adding a parameter to choose between truncation and rounding?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally wouldn't characterize this a breaking change, though I can see how others might.

In my opinion, adding a parameter to choose between the behaviors would be the safest thing (aka a field to CastOptions that defaults to the old, rounding, behavior) for https://docs.rs/arrow/latest/arrow/compute/kernels/cast/fn.cast_with_options.html

Maybe @liukun4515 who added much of the initial decimal support in arrow-rs has time to offer historical perspective on rounding vs truncation during casting?

@himadripal
Copy link
Contributor Author

himadripal commented Dec 26, 2024

This PR seems to remove a number of tests, and orphan some others. If we're changing what cast does, can we please remove the old implementation and port the old tests, so that we aren't losing test coverage.

Also as written this PR is a breaking change, as it alters the rounding behaviour of the parser.

Thanks @tustvold for the quick review. I've moved over most of the tests for parse_string_to_decimal_native to use parse_decimal whichever is not already covered by another test. let me know if I've missed anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate next-major-release the PR has API changes and it waiting on the next major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants