[stdlib] Micro-optimize utf8 helper functions #3896

Closed

martinvuyk
Contributor

Micro-optimize utf8 helper functions

@martinvuyk martinvuyk requested a review from a team as a code owner December 18, 2024 15:25
@JoeLoser
Collaborator

@ConnorGray do you mind taking a look at this since you've been overhauling how we expose the UTF-8 helpers, Char, and friends?

@skongum02 skongum02 deleted the branch modular:main January 29, 2025 18:59
@skongum02 skongum02 closed this Jan 29, 2025
@skongum02 skongum02 reopened this Jan 29, 2025
@skongum02 skongum02 changed the base branch from nightly to main January 29, 2025 20:36
@martinvuyk
Contributor Author

martinvuyk commented Feb 1, 2025

Hi @JoeLoser and @ConnorGray, I'm back from vacation and had some time today to look over the new Char type. FWIW I think it's a good initiative, but IMO it should be renamed to UTF32Char or something along those lines. It is expensive to create when iterating through long stretches of text encoded in a different format (UTF-8, UTF-16). So IMO we need to create the equivalent UTF8Char and, if needed later, UTF16Char.

Each encoding has its own tricks that are faster on its own raw data format. Converting to UTF-32 and back just to check a byte is wasteful. UTF-32 is also inefficient for mostly ASCII text, since every code point takes 4 bytes where ASCII needs 1. It is also confusing for people why Char is different from List[Byte]. (In the case of UTF8Char, the underlying storage could be List[Byte] or, if we have a dynamically owned/CoW buffer one day, Variant[Span[Byte], List[Byte]].)
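To make the "own tricks" point concrete, here is a minimal sketch in Python (not Mojo, and not the code in this PR) of things that can be done directly on raw UTF-8 bytes without decoding to code points. The function names `utf8_seq_len`, `count_code_points`, and `contains_ascii_byte` are hypothetical and chosen only for illustration.

```python
def utf8_seq_len(lead: int) -> int:
    """Structural length of the UTF-8 sequence that starts with `lead`.

    Returns 0 when `lead` cannot start a sequence. This only inspects the
    high bits; it does not reject overlong or out-of-range encodings.
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead < 0xC0:
        return 0  # 10xxxxxx: continuation byte, never a lead byte
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    return 0      # 11111xxx: not used by UTF-8


def count_code_points(data: bytes) -> int:
    """Count code points by counting bytes that are not continuation bytes."""
    return sum((b & 0xC0) != 0x80 for b in data)


def contains_ascii_byte(data: bytes, needle: int) -> bool:
    """Search for an ASCII byte directly in the raw UTF-8 buffer.

    ASCII values never occur inside a multi-byte sequence, so no decoding
    to 32-bit code points is needed for this check.
    """
    assert needle < 0x80, "only valid for ASCII needles"
    return needle in data
```

None of these helpers builds a 32-bit code point, which is the kind of work a UTF-32 based Char would have to do first on UTF-8 input.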

I was focusing on building out our UTF-8 support as far as it can go, with all the little tricks I could find, and only then thinking about whether we want to support UTF-32 fully (side note: Python is migrating over to UTF-8 anyway). IMO the Char type is a nice-to-have that rounds out all the helper functions I've been tinkering with since I started contributing last year, but there is a lot to build out before I feel comfortable committing to making it a builtin type. UTF-32 is not even our main String encoding, so it is odd for Char to use it by default, and like I said, high-performance UTF-8 processing like #3528 needs every optimization it can get.

What I propose: let us build UTF8Char and rename Char to UTF32Char, since many conversions to and from Unicode are needed, as evidenced by the number of helper functions involved. Both should live inside the collections/string sub-package and not be builtins. Then we can move all the UTF-8 and UTF-32 helpers that are scattered around string_slice.mojo into them and try to set up a readable API. Then (months later) we can try to unify both under a single trait.
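As a rough illustration of the shape this proposal could take, here is a hedged sketch in Python (not Mojo). Every name in it (`UnicodeChar`, `to_code_point`, `is_ascii`, `to_utf8`, `to_utf32`) is an assumption made for the example and not an API from the Mojo standard library or from #3988.

```python
from dataclasses import dataclass
from typing import Protocol


class UnicodeChar(Protocol):
    """Hypothetical common interface, standing in for the eventual trait."""

    def to_code_point(self) -> int: ...
    def is_ascii(self) -> bool: ...


@dataclass(frozen=True)
class UTF32Char:
    """One character stored as a single 32-bit code point."""

    code_point: int

    def to_code_point(self) -> int:
        return self.code_point

    def is_ascii(self) -> bool:
        return self.code_point < 0x80

    def to_utf8(self) -> "UTF8Char":
        # Conversion cost is paid here, only when the caller asks for it.
        return UTF8Char(chr(self.code_point).encode("utf-8"))


@dataclass(frozen=True)
class UTF8Char:
    """One character stored as its 1 to 4 raw UTF-8 bytes."""

    data: bytes

    def to_code_point(self) -> int:
        # Decoding raises if `data` is not exactly one valid character.
        return ord(self.data.decode("utf-8"))

    def is_ascii(self) -> bool:
        # Answered from the first byte alone, without decoding.
        return len(self.data) == 1 and self.data[0] < 0x80

    def to_utf32(self) -> UTF32Char:
        return UTF32Char(self.to_code_point())
```

The point of the sketch is that each type answers cheap questions in its own representation and converts only on request, while a shared interface can be layered on top later.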

Edit: I opened a fuller, better-thought-out proposal: #3988

@ConnorGray
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Feb 24, 2025
@modularbot
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the main branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Feb 26, 2025
@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Feb 26, 2025
@modularbot
Collaborator

Landed in bdaca0f! Thank you for your contribution 🎉

@martinvuyk martinvuyk deleted the micro-optimize-utf8-seq-length branch February 26, 2025 21:20
6 participants