-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't require utf-8 source files #511
Conversation
@squeek502 when you get a chance can you try out this PR and see if it fixes the encoding issues you were seeing? You'll need a fairly recent build of the Zig compiler. It should also be somewhat faster than before since we're no longer scanning the whole file up front to validate UTF8-ness. |
Haven't investigated fully yet, but preliminary results are good. This branch brings the win32-samples-rc-tests results to:
up from
|
This removes the requirement that source files be UTF-8 encoded for tokenization and preprocessing. Only the parser cares about the source encoding, so we defer any checks until then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are some really nice changes!
const slice = p.tokSlice(p.tok_i); | ||
const view = std.unicode.Utf8View.init(slice) catch { | ||
try p.errTok(.invalid_utf8, p.tok_i); | ||
return error.FatalError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a place for future improvements.
This removes the requirement that source files always be UTF-8 encoded. Identifiers are still required to be UTF-8 encoded; and there is a todo item for char literals to match clang/gcc behavior when the source text of the literal is not UTF-8 encoded.
Some simple testing on my machine shows that skipping the up-front UTF-8 validation is significantly faster for large files - the 8.4 meg amalgamated sqlite3.c source parses in ~400ms on master; and ~340ms on this branch for me.