Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of *TRIM functions for StringViewArray #12387

Closed
Kev1n8 opened this issue Sep 9, 2024 · 4 comments · Fixed by #12395
Closed

Improve performance of *TRIM functions for StringViewArray #12387

Kev1n8 opened this issue Sep 9, 2024 · 4 comments · Fixed by #12395
Assignees
Labels
enhancement New feature or request

Comments

@Kev1n8
Copy link
Contributor

Kev1n8 commented Sep 9, 2024

Is your feature request related to a problem or challenge?

StringView provides us a way to directly modify views of an array instead of generating a new StringArray when it comes to string operations like substr, which was done in #12044. I think we can play a similar trick in trim.

Describe the solution you'd like

I've looked into the implementation of general_trim, it uses the str::trim_xxx_matches methods the obtain the "substring". Furthermore, inside the str::trim_xxx_matches method, it first computes the [start, end) boundary and slices the str. The index here is useful for modifying views. Unfortunately, currently the feature Pattern it uses is unstable.

Describe alternatives you've considered

I'm not very sure what alternatives we have. Perhaps either implement our own trim or wait until Pattern is stable in Rust.

Additional context

No response

@Kev1n8 Kev1n8 added the enhancement New feature or request label Sep 9, 2024
@Rachelint
Copy link
Contributor

Rachelint commented Sep 9, 2024

Seems we can get the index through this way:

  • Convert the originanl str1 to ptr1(*const [u8]) first.
  • Then we use trim_xxx_matches first to get the str2.
  • Convert the str2 to ptr2(*cosnt [u8]), and we get the index through ptr2 - ptr1?

Maybe I can try it.

@Kev1n8
Copy link
Contributor Author

Kev1n8 commented Sep 9, 2024

Sounds great! @Rachelint

@Rachelint
Copy link
Contributor

take

@Rachelint
Copy link
Contributor

Finally, after reminding from @alamb , I found we can get the index in safe way, detail can see:
https://github.com/Rachelint/arrow-datafusion/blob/f8174626e47d147e90c6715f5052ccfa269f0493/datafusion/functions/src/string/common.rs#L80

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants