
Tokenizer: Implement character references #11

Merged · 1 commit · Jul 6, 2024

Conversation

@squeek502 (Contributor) commented on Jul 6, 2024

Implements spec-compliant errors for character references, but otherwise does not process character references. The character references themselves are emitted as part of their containing text/tag/attr token.

Because character references are not converted into their mapped codepoint(s) (e.g. `&not;` -> `¬`), we only need to store a trie of named character references without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.
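
As a rough illustration of that flattened layout (the field split and helper below are assumptions for illustration, not necessarily the exact encoding used here), a node and its child lookup can look something like this:

```zig
/// Illustrative node layout for a u22-backed packed struct; the real
/// encoding in this PR may divide the bits differently.
const Node = packed struct(u22) {
    /// Character labeling the edge into this node.
    char: u7,
    /// A valid named character reference ends at this node.
    end_of_word: bool,
    /// This node is the last sibling in its child list.
    end_of_list: bool,
    /// Index of this node's first child in the flat array (0 = no children).
    child_index: u13,
};

/// Walks the sibling list starting at `nodes[index].child_index` and returns
/// the index of the child whose edge is labeled `char`, or null if none.
fn findChild(nodes: []const Node, index: u16, char: u8) ?u16 {
    var i: u16 = nodes[index].child_index;
    if (i == 0) return null;
    while (true) : (i += 1) {
        if (nodes[i].char == char) return i;
        if (nodes[i].end_of_list) return null;
    }
}
```

Matching a candidate reference is then just repeated child lookups while remembering whether the last accepted node had `end_of_word` set.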

Some resources:

- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- http://stevehanov.ca/blog/?id=115

The DAFSA here needs 3872 nodes encoded as `packed struct(u22)`s, which, due to alignment, ends up as 15488 bytes (15.1 KiB). Using a `PackedIntArray` can reduce the number of bytes needed, but in my testing it reduces performance: tokenizing the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca takes 28.1% ± 0.3% longer. Note also that using a regular struct instead of a packed struct increases `@sizeOf(Node)` to 6 bytes, and that `packed struct(u32)` makes essentially no difference compared to `packed struct(u22)`.
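
For anyone double-checking the math, a small comptime sketch (reusing the illustrative layout from the sketch above; only the u22 backing integer matters for the totals):

```zig
const std = @import("std");

// Illustrative layout; any packed struct(u22) gives the same sizes.
const Node = packed struct(u22) {
    char: u7,
    end_of_word: bool,
    end_of_list: bool,
    child_index: u13,
};

comptime {
    // A u22-backed packed struct has a 4-byte ABI size, which is where the
    // "due to alignment" figure comes from: 3872 * 4 = 15488 bytes (~15.1 KiB).
    std.debug.assert(@sizeOf(Node) == 4);
    std.debug.assert(3872 * @sizeOf(Node) == 15488);

    // Bit-packing the nodes instead (the PackedIntArray option above) would
    // need only 3872 * 22 / 8 = 10648 bytes, at the cost of the slower
    // lookups measured above.
    std.debug.assert(3872 * 22 / 8 == 10648);
}
```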


Some examples of similar DAFSA PRs I've made in the past if you're curious about how the DAFSA compares to other approaches:


Here's what the errors look like when using SublimeText:

(screenshots: char-ref-errors-inline, char-ref-errors)

And here's proof that the example from here is handled correctly:

(screenshot: char-ref-errors-spec)


Note: It's worth merging #10 before testing this branch, since it's pretty easy to run into the root node bug that's fixed in that PR.

@squeek502 force-pushed the character-references branch 2 times, most recently from c74a58b to f8c21ba on July 6, 2024 09:05
@squeek502 force-pushed the character-references branch from f8c21ba to 9d21231 on July 6, 2024 09:45
@kristoff-it (Owner) commented on Jul 6, 2024

Thank you squeek!!!!!
This PR is amazing.

I see that, according to the spec, there is a difference between a bad character reference in an attribute value vs. outside of one.

Since this parser is designed primarily for the use case of supporting human-written HTML, I've diverged from the spec on some occasions where strict adherence would prevent me from detecting a probable human error. As an example, respecting implicitly closed tags would prevent Super from reporting `<h1> foo <h1>` as an error.

With that in mind, do you think it might make sense to be more strict than the spec wrt bad character references in attributes?

My understanding is that if your intent is to actually write '&notit;' in an attribute, you can always come up with a correct encoding, like '&amp;notit;' (I think). That said, people might have different expectations, and it would forbid them from using a "shorthand encoding".
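
For context, the spec difference in question (as I read the named character reference state) is its "for historical reasons" exception: inside an attribute value, when the longest match doesn't end in ';' and the next character is '=' or an ASCII alphanumeric, the consumed characters are flushed as literal text and no missing-semicolon-after-character-reference error is emitted; outside an attribute, the error is emitted. Illustrative inputs (not tied to this parser's API):

```zig
// "&notit;" is the spec's own example: it matches the legacy reference
// "&not" but is not a valid full reference.

// Outside an attribute: the spec emits missing-semicolon-after-character-reference
// here (a converting parser would render it as "I'm ¬it; I tell you").
const text_case = "<p>I'm &notit; I tell you</p>";

// Inside an attribute value: the character after the "&not" match is 'i'
// (ASCII alphanumeric), so the spec flushes it as literal text and reports
// no error at all.
const attr_case = "<p title=\"I'm &notit; I tell you\">x</p>";
```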

What do you think?
