Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Fix String.split() implementations #3528

Open
wants to merge 150 commits into
base: main
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Sep 22, 2024

Main issue

Fix String.split() implementations to use a generic implementation and without assuming that indexing is by byte offset. Added all methods to StringLiteral and StringSlice. Some important optimizations were added by parametrizing and avoiding slicing with numeric tricks.

Changes in behavior

This PR changes split("") behavior to be non-raising and return the separated unicode characters analogous to when the whole string has the separator at start, end, and in between every character. Closes #3635

String, StringLiteral, and StringSlice .split() now return a List[StringSlice].

Benchmark results:

CPU: Intel® Core™ i7-7700HQ

improvement metric: markdown percentage improvement ((old_value - new_value) / old_value)

Average improvement for split with a sequence: 91.2486% . In orders of magnitude, this is a 11x improvement
Average improvement for split on any whitespace: 99.9975% . In orders of magnitude, this is a 40k x improvement

Name old_value (ms) new_value (ms) improvement
bench_string_split[1000000] 5.6970877871796 0.503624904955618 0.912486298354641
bench_string_split_none[1000000] 147157.6808882 3.78480079182398 0.999975287456096

@martinvuyk martinvuyk requested a review from a team as a code owner September 22, 2024 23:17
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk changed the title [stdlib] Fix split implementations [stdlib] Fix String.split() implementations Sep 23, 2024
Copy link
Contributor

@msaelices msaelices left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job. I've just add a NIT-pick suggestion.

Also, is it possible to add a unit test?

Co-authored-by: Manuel Saelices <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

Great job. I've just add a NIT-pick suggestion.

Also, is it possible to add a unit test?

Hi, thanks for the review. Any type of test in mind that split tescases don't cover ?

@msaelices
Copy link
Contributor

msaelices commented Sep 23, 2024

Great job. I've just add a NIT-pick suggestion.
Also, is it possible to add a unit test?

Hi, thanks for the review. Any type of test in mind that split tescases don't cover ?

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

@martinvuyk
Copy link
Contributor Author

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

I'm not understanding, so the lines from var res7 ... onwards didn't get merged in another PR and you'd like me to add them here?

@msaelices
Copy link
Contributor

I thought this check would be broken in nightly but I am glad that it's not:

❯ git diff
diff --git a/stdlib/test/collections/test_string.mojo b/stdlib/test/collections/test_string.mojo
index a664d321..b7a85c6c 100644
--- a/stdlib/test/collections/test_string.mojo
+++ b/stdlib/test/collections/test_string.mojo
@@ -824,6 +824,11 @@ def test_split():
     assert_equal(res6[2], "долор")
     assert_equal(res6[3], "сит")
     assert_equal(res6[4], "амет")
+    var res7 = in6.split("м")
+    assert_equal(res7[0], "Лоре")
+    assert_equal(res7[1], " ипсу")
+    assert_equal(res7[2], " долор сит а")
+    assert_equal(res7[3], "ет")

BTW, I still think it's a good test to add.

I'm not understanding, so the lines from var res7 ... onwards didn't get merged in another PR and you'd like me to add them here?

It's just a diff if you want to complete it with more test. LGTM anyways so don't worry. Thanks for that contribution 🥇

@skongum02 skongum02 deleted the branch modular:main January 29, 2025 18:58
@skongum02 skongum02 closed this Jan 29, 2025
@skongum02 skongum02 reopened this Jan 29, 2025
@skongum02 skongum02 changed the base branch from nightly to main January 29, 2025 20:54
@martinvuyk martinvuyk marked this pull request as ready for review February 1, 2025 19:51
Signed-off-by: martinvuyk <[email protected]>
@martinvuyk
Copy link
Contributor Author

martinvuyk commented Feb 22, 2025

@JoeLoser can we merge this? 5 months of keeping this on top of the tree and with so many changes to string stuff is laborious.

@ConnorGray I'm un-deprecating __iter__ here because it is needed and is simply an iterator over codepoint slices. Here you can also see many examples where the performance of building a Codepoint (encoding from utf8 to utf32) would be unnecessarily expensive.

My plan is to then on top of this, similar to #3858, build an iterator taking this code. Then just adding a layer which assembles it into a list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Make String.split with sep non-raising
7 participants