Tokenizer: Implement character references
Implements spec-compliant errors for character references, but otherwise does not process character references. The characters themselves are emitted as part of their containing text/tag/attr token.

Because character references are not converted into their mapped codepoint(s) (e.g. `&not;` -> `¬`), we only need to store a trie of named character references, without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.

Some resources:
- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- http://stevehanov.ca/blog/?id=115

The DAFSA here needs 3872 nodes encoded as `packed struct(u22)`s, which, due to alignment, take 4 bytes each and end up as 15488 bytes (15.1KiB) in total. Using a PackedIntArray can reduce the number of bytes needed, but in my testing it reduces performance (using the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca I get +28.1% ± 0.3% when tokenizing it).

Note also that using a regular struct instead of a packed struct increases the `@sizeOf(Node)` to 6 bytes, and using `packed(u32)` makes ~no difference compared to `packed(u22)`.
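
For illustration only, here is a minimal Python sketch of the "trie flattened into an array" idea and of the size arithmetic above. It is not the Zig implementation from this commit. The node fields (`char`, `end_of_word`, `last_sibling`, `child_index`), the hand-built three-name table, and the `contains` helper are all assumptions invented for this example; the real 22-bit node layout is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical field layout, chosen only for this sketch.
    char: str           # character on the edge leading to this node
    end_of_word: bool   # True if a name may end at this node
    last_sibling: bool  # True if this is the last node in its sibling run
    child_index: int    # index of the first child, 0 meaning "no children"


# Hand-built table for the names "gt", "lt" and "not".
# Siblings are stored contiguously; the final "t" node is shared by all
# three names, which is the redundancy a DAFSA removes compared to a trie.
NODES = [
    Node("\0", False, True, 1),   # 0: root placeholder
    Node("g", False, False, 5),   # 1
    Node("l", False, False, 5),   # 2
    Node("n", False, True, 4),    # 3
    Node("o", False, True, 5),    # 4
    Node("t", True, True, 0),     # 5: shared suffix node
]


def contains(nodes, word):
    """Return True if `word` is one of the names encoded in `nodes`."""
    index = nodes[0].child_index
    node = None
    for ch in word:
        if index == 0:
            return False  # previous node had no children
        # Linear scan over the contiguous run of siblings.
        while True:
            node = nodes[index]
            if node.char == ch:
                break
            if node.last_sibling:
                return False
            index += 1
        index = node.child_index
    return node is not None and node.end_of_word


# Size arithmetic for the figures quoted in the commit message.
NODE_COUNT = 3872
BYTES_PACKED_STRUCT = NODE_COUNT * 4        # 22-bit nodes padded to 4 bytes: 15488
BYTES_REGULAR_STRUCT = NODE_COUNT * 6       # 6-byte regular struct: 23232
BYTES_TIGHT_22_BIT = NODE_COUNT * 22 // 8   # no padding at all: 10648

if __name__ == "__main__":
    for name in ("gt", "lt", "not", "no", "gtx"):
        print(name, contains(NODES, name))
    print(BYTES_PACKED_STRUCT, BYTES_REGULAR_STRUCT, BYTES_TIGHT_22_BIT)
```

The last constant is only the theoretical lower bound for storing 3872 values of 22 bits back to back. It shows where the savings from a tightly packed array would come from, and says nothing about the lookup cost measured above.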