Tokenizer: Implement character references · kristoff-it/superhtml@9d21231

Implements spec-compliant errors for character references, but does not otherwise process them. The characters of a reference are emitted unchanged as part of their containing text/tag/attr token.

Because character references are not converted into their mapped codepoint(s) (e.g. `&not;` -> `¬`), we only need to store a trie of the named character references, without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.
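To make the "trie flattened into an array" idea concrete, here is a minimal Python sketch of a lookup over a tiny hand-built DAFSA. The node fields (`char`, `end_of_word`, `last_sibling`, `child_index`), the three-word vocabulary, and the `contains` helper are all assumptions made for illustration; they are not taken from this commit's generated data.

```python
from typing import NamedTuple


class Node(NamedTuple):
    char: str           # character on the edge leading to this node
    end_of_word: bool   # True if a word may end at this node
    last_sibling: bool  # True if this is the last node in its sibling list
    child_index: int    # index of the first child; 0 means "no children"


# Hypothetical DAFSA for the words "amp;", "gt;" and "lt;".
# Siblings are stored contiguously. Nodes 6 and 7 are shared:
# both 'g' and 'l' point at the same 't' node, and both 'p' and 't'
# point at the same ';' node, which a plain trie would duplicate.
DAFSA = [
    Node("\0", False, True, 1),   # 0: root, only holds the first child index
    Node("a", False, False, 4),   # 1
    Node("g", False, False, 7),   # 2
    Node("l", False, True, 7),    # 3
    Node("m", False, True, 5),    # 4
    Node("p", False, True, 6),    # 5
    Node(";", True, True, 0),     # 6
    Node("t", False, True, 6),    # 7
]


def contains(word: str) -> bool:
    """Return True if `word` is one of the words stored in DAFSA."""
    index = DAFSA[0].child_index
    node = None
    for ch in word:
        if index == 0:
            return False  # ran out of children before the word ended
        # Linear scan over the contiguous sibling list.
        while True:
            node = DAFSA[index]
            if node.char == ch:
                break
            if node.last_sibling:
                return False
            index += 1
        index = node.child_index
    return node is not None and node.end_of_word
```

In this sketch a lookup walks one sibling list per input character, so the cost depends on the length of the name and the width of each sibling list, not on the total number of stored words.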

Some resources:

- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- http://stevehanov.ca/blog/?id=115

The DAFSA here needs 3872 nodes encoded as `packed struct(u22)`s. Because each node is aligned to 4 bytes, the array ends up as 15488 bytes (15.1KiB). Using a PackedIntArray can reduce the number of bytes needed, but it reduces performance in my testing: using the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca I get +28.1% ± 0.3% when tokenizing it. Note also that using a regular struct instead of a packed struct increases `@sizeOf(Node)` to 6 bytes, and using `packed struct(u32)` makes ~no difference compared to `packed struct(u22)`.
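The size trade-off can be checked with a few lines of arithmetic. The Python sketch below uses only the node count, bit width, and struct sizes stated above. The 8/1/1/12-bit field split in `pack` and `unpack` is a hypothetical layout chosen because it sums to 22 bits, not the layout confirmed by this commit.

```python
NODE_COUNT = 3872
NODE_BITS = 22

# Each 22-bit node padded out to a 4-byte slot.
aligned_bytes = NODE_COUNT * 4                     # 15488
# Nodes packed back to back with no padding bits.
bit_packed_bytes = (NODE_COUNT * NODE_BITS + 7) // 8  # 10648
# A regular (non-packed) struct at 6 bytes per node.
regular_struct_bytes = NODE_COUNT * 6              # 23232


def pack(char: int, end_of_word: bool, last_sibling: bool, child_index: int) -> int:
    """Pack a node into a 22-bit integer (hypothetical 8+1+1+12 layout)."""
    assert 0 <= char < 1 << 8 and 0 <= child_index < 1 << 12
    return char | (int(end_of_word) << 8) | (int(last_sibling) << 9) | (child_index << 10)


def unpack(value: int) -> tuple[int, bool, bool, int]:
    """Inverse of pack()."""
    return (
        value & 0xFF,
        bool((value >> 8) & 1),
        bool((value >> 9) & 1),
        (value >> 10) & 0xFFF,
    )
```

A 12-bit child index can address up to 4096 nodes, which is enough for 3872. The tight bit packing saves roughly a third of the bytes, but every access then needs shifts and masks across slot boundaries, which is one plausible reason for the slowdown measured above.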

squeek502 committed Jul 6, 2024

1 parent 17c4eee commit 9d21231
