[stdlib] Fix String.split() implementations #3528
base: main
Signed-off-by: martinvuyk <[email protected]>
Great job. I've just added a nitpick suggestion.
Also, is it possible to add a unit test?
Co-authored-by: Manuel Saelices <[email protected]> Signed-off-by: martinvuyk <[email protected]>
Hi, thanks for the review. Do you have any type of test in mind that the split test cases don't cover?
I thought this check would be broken in
❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")
BTW, I still think it's a good test to add.
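For reference, the expected values in this diff can be cross-checked outside Mojo. This is a minimal Python sketch, not the Mojo test itself, and it assumes `in6` is "Лорем ипсум долор сит амет" (the diff does not show its definition, so that value is a guess based on the `res6` assertions). It also shows why this separator is a useful case: "м" occupies two bytes in UTF-8, so byte offsets and character offsets diverge.

```python
# Assumed value of `in6`; the diff above does not show its definition.
in6 = "Лорем ипсум долор сит амет"

# Split on a non-ASCII separator, as the proposed test does.
res7 = in6.split("м")
print(res7)

# "м" is one code point but two bytes once encoded as UTF-8.
sep_bytes = "м".encode("utf-8")
print(len("м"), len(sep_bytes))

# Splitting the raw UTF-8 bytes on the encoded separator yields the same
# pieces, because UTF-8 is self-synchronizing.
res7_bytes = [part.decode("utf-8") for part in in6.encode("utf-8").split(sep_bytes)]
print(res7_bytes == res7)
```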
I'm not understanding, so the lines from
It's just a diff in case you want to complete it with more tests. LGTM anyway, so don't worry. Thanks for the contribution 🥇
…o into fix-split-implementations
@JoeLoser can we merge this? Keeping this on top of the tree for 5 months, with so many changes to the string code, is laborious.
@ConnorGray I'm un-deprecating
My plan is then, on top of this and similar to #3858, to build an iterator that takes this code, and then just add a layer that assembles its output into a list.
Main issue
Fix the String.split() implementations to use a generic implementation that does not assume indexing is by byte offset. Added all methods to StringLiteral and StringSlice. Some important optimizations were added by parametrizing and avoiding slicing with numeric tricks.

Changes in behavior
This PR changes split("") behavior to be non-raising and to return the separated unicode characters, analogous to when the whole string has the separator at the start, at the end, and in between every character. Closes #3635
The .split() methods of String, StringLiteral, and StringSlice now return a List[StringSlice].
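The `split("")` behavior described above can be modeled with a small Python sketch. The helper name `split_like_pr` is hypothetical and this is not the Mojo stdlib code; it only reproduces the described result, where an empty separator is treated as if a separator sat at the start, at the end, and between every character. Python's own `str.split("")` raises `ValueError`, which is why the helper handles that case separately.

```python
def split_like_pr(text: str, sep: str) -> list[str]:
    """Hypothetical model of the split behavior described in this PR."""
    if sep == "":
        # Empty separator: one piece per Unicode code point, plus the empty
        # pieces before the first and after the last character.
        return [""] + list(text) + [""]
    return text.split(sep)


# The empty-separator result matches splitting a string that has a real
# separator at the start, at the end, and between every character.
empty_sep_result = split_like_pr("123", "")
analogous_result = split_like_pr(",1,2,3,", ",")
print(empty_sep_result)
print(analogous_result)
```

Returning the leading and trailing empty pieces keeps the empty-separator case consistent with ordinary separators, so callers do not need a special rule for it.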
Benchmark results:
CPU: Intel® Core™ i7-7700HQ
Improvement metric: markdown percentage improvement, (old_value - new_value) / old_value
Average improvement for split with a sequence: 91.2486%. As a speedup factor, this is roughly an 11x improvement.
Average improvement for split on any whitespace: 99.9975%. As a speedup factor, this is roughly a 40,000x improvement.
bench_string_split[1000000]
bench_string_split_none[1000000]
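The relationship between the percentage metric and the quoted speedup factors can be checked with a short Python sketch. The function name `speedup_factor` is made up for illustration. Since the improvement is p = (old_value - new_value) / old_value, the speedup old_value / new_value equals 1 / (1 - p), which gives about 11.4x and about 40,000x for the two averages above.

```python
def speedup_factor(improvement_percent: float) -> float:
    """Convert (old - new) / old, given in percent, into old / new."""
    fraction = improvement_percent / 100.0
    return 1.0 / (1.0 - fraction)


# The two average improvements reported in the PR description.
averages = [
    ("split with a sequence", 91.2486),
    ("split on any whitespace", 99.9975),
]

for label, percent in averages:
    print(f"{label}: {speedup_factor(percent):.1f}x")
```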