Tokenizer: Implement character references
Implements spec-compliant errors for character references, but otherwise does not process character references. The characters themselves are emitted as part of their containing text/tag/attr token.

Because character references are not converted into their mapped codepoint(s) (e.g. `&not;` -> `¬`), we only need to store a trie of named character references, without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.
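To make the "trie with redundant nodes merged" idea concrete, here is a minimal Python sketch. It is not the generator used for this commit: the `Node` class, the helper names, and the five-entry word list are all hypothetical, and it keeps nodes as linked objects rather than flattening them into an array. It builds a plain trie, merges nodes whose outgoing structure is identical, and then checks membership only, since no codepoint mapping is stored.

```python
class Node:
    """Hypothetical trie/DAFSA node: an end-of-word flag plus child edges."""
    def __init__(self):
        self.end = False
        self.children = {}

def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.end = True
    return root

def minimize(node, registry):
    """Merge structurally identical subtrees, bottom-up.

    Two nodes are equivalent when they have the same end flag and the same
    edges leading to the same (already canonical) children.
    """
    for ch, child in list(node.children.items()):
        node.children[ch] = minimize(child, registry)
    key = (node.end,
           tuple(sorted((ch, id(c)) for ch, c in node.children.items())))
    # Canonical nodes stay referenced by the registry, so their ids are stable.
    return registry.setdefault(key, node)

def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.children.values())
    return len(seen)

def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.end

# A tiny illustrative subset of named character references.
words = ["amp;", "gt;", "lt;", "not;", "notin;"]
trie = build_trie(words)
trie_nodes = count_nodes(trie)      # 18 nodes in the plain trie
dafsa = minimize(trie, {})
dafsa_nodes = count_nodes(dafsa)    # 10 nodes once shared suffixes are merged
```

In this toy input the savings come from shared suffixes: every entry ends in `;`, and `gt;` and `lt;` share the whole tail `t;`, so those nodes collapse into one.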

Some resources:

- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- http://stevehanov.ca/blog/?id=115

The DAFSA here needs 3872 nodes encoded as `packed struct(u22)`s, which, due to alignment, ends up as 15488 bytes (15.1KiB). Using a PackedIntArray can reduce the number of bytes needed, but in my testing it reduces performance: using the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca I get +28.1% ± 0.3% when tokenizing it. Note also that using a regular struct instead of a packed struct increases the `@sizeOf(Node)` to 6 bytes, and using `packed(u32)` makes ~no difference compared to `packed(u22)`.
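The size figures above can be checked with a little arithmetic. The sketch below assumes each `packed struct(u22)` node occupies 4 bytes in an array (which is what 15488 / 3872 implies), and that a bit-packed array would use exactly 22 bits per node with no padding. The bit-packed total is derived here from that assumption and is not a number reported in the commit.

```python
NODE_COUNT = 3872
BITS_PER_NODE = 22

# One node per 4-byte slot: the layout the commit ends up using.
padded_bytes = NODE_COUNT * 4
padded_kib = padded_bytes / 1024

# Assumption: tight bit packing with 22 bits per node, rounded up to whole bytes.
bit_packed_bytes = (NODE_COUNT * BITS_PER_NODE + 7) // 8

# Regular (non-packed) struct at the 6 bytes per node mentioned above.
regular_struct_bytes = NODE_COUNT * 6

print(f"padded:         {padded_bytes} bytes ({padded_kib:.3f} KiB)")
print(f"bit-packed:     {bit_packed_bytes} bytes")
print(f"regular struct: {regular_struct_bytes} bytes")
```

Under those assumptions, tight packing would save a bit under 5 KiB, which is the space being traded against the tokenizing slowdown measured above.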
squeek502 committed Jul 6, 2024
1 parent 17c4eee commit 9d21231
Showing 2 changed files with 4,702 additions and 23 deletions.