Tokenizer: Implement character references #11
Merged
Implements spec-compliant errors for character references, but does not otherwise process them. The character references themselves are emitted as part of their containing text/tag/attribute token.
Because character references are not converted into their mapped codepoint(s) (e.g. `&not;` -> `¬`), we only need to store a trie of named character references without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.
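To make the "trie flattened into an array" idea concrete, here is a minimal sketch (in Python, not the actual Zig implementation; the node fields and layout are illustrative assumptions): each node stores a character, an end-of-word flag, a flag marking the last node in its sibling group, and the index of its first child, so lookup is a linear scan over each sibling group.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    char: str
    end_of_word: bool    # a valid word ends at this node
    last_sibling: bool   # last node in this sibling group
    child_index: int     # index of first child; 0 means "no children"

# A tiny example automaton for the words "not" and "note" (hypothetical
# layout). Index 0 is a sentinel so child_index == 0 can mean "no children".
nodes = [
    Node("\0", False, True, 0),  # 0: sentinel
    Node("n", False, True, 2),   # 1: first child of the implicit root
    Node("o", False, True, 3),   # 2
    Node("t", True,  True, 4),   # 3: "not" ends here
    Node("e", True,  True, 0),   # 4: "note" ends here
]

def contains(word: str) -> bool:
    if not word:
        return False
    index = 1  # start at the root's first child
    matched = nodes[0]
    for ch in word:
        # Scan the sibling group starting at `index` for a matching char.
        while True:
            candidate = nodes[index]
            if candidate.char == ch:
                matched = candidate
                break
            if candidate.last_sibling:
                return False  # exhausted this sibling group
            index += 1
        index = matched.child_index
    return matched.end_of_word
```

The real DAFSA additionally merges shared suffixes (e.g. many named references end in `;`), which is what shrinks it below a plain trie.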
The DAFSA here needs 3872 nodes encoded as `packed struct(u22)`s, which, due to alignment, ends up as 15488 bytes (15.1 KiB). Using a `PackedIntArray` can reduce the number of bytes needed, but reduces performance in my testing (using the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca, tokenizing takes +28.1% ± 0.3% longer). Note also that using a regular struct instead of a packed struct increases `@sizeOf(Node)` to 6 bytes, and using `packed struct(u32)` makes ~no difference compared to `packed struct(u22)`.

Some examples of similar DAFSA PRs I've made in the past, if you're curious about how the DAFSA compares to other approaches:
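To illustrate where the 22 bits and the 15488-byte figure come from, here is a sketch (in Python; the exact field widths are assumptions, not taken from the PR) of one plausible 22-bit node layout: an 8-bit character, two 1-bit flags, and a 12-bit child index, which is enough to address all 3872 nodes.

```python
# Hypothetical 22-bit node layout:
#   bits 0-7   : character (8 bits)
#   bit  8     : end-of-word flag
#   bit  9     : last-sibling flag
#   bits 10-21 : first-child index (12 bits, enough for 3872 nodes)
def pack_node(char: int, end_of_word: bool, last_sibling: bool,
              child_index: int) -> int:
    assert 0 <= char < (1 << 8)
    assert 0 <= child_index < (1 << 12)
    return char | (end_of_word << 8) | (last_sibling << 9) | (child_index << 10)

def unpack_node(bits: int) -> tuple[int, bool, bool, int]:
    return (bits & 0xFF, bool((bits >> 8) & 1), bool((bits >> 9) & 1),
            bits >> 10)

# Size arithmetic from the text: a 22-bit packed struct gets padded to
# 4-byte alignment, so 3872 nodes occupy 3872 * 4 = 15488 bytes (~15.1 KiB).
assert 3872 * 4 == 15488
```

A `PackedIntArray` avoids the per-node padding (3872 * 22 bits is about 10.4 KiB), which is where the size/speed trade-off mentioned above comes from.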
Here's what the errors look like in Sublime Text:
And here's proof that the example from here is handled correctly:
Note: It's worth merging #10 before testing this branch, since it's pretty easy to run into the root node bug that's fixed in that PR.