
Tokenizer: Implement character references #11

Merged · 1 commit · Jul 6, 2024

Conversation

@squeek502 (Contributor) commented on Jul 6, 2024

Implements spec-compliant errors for character references, but otherwise does not process character references. The character references themselves are emitted as part of their containing text/tag/attr token.

Because character references are not converted into their mapped codepoint(s) (e.g. `&not;` -> `¬`), we only need to store a trie of named character references without a mapping to the relevant codepoints. For this, a DAFSA was generated using a modified version of https://github.com/squeek502/named-character-references

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size.
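
As a rough illustration of that flattened layout (the field split and helper below are assumptions for illustration, not necessarily the exact encoding used here), a node and its child lookup can look something like this:

```zig
/// Illustrative node layout for a u22-backed packed struct; the real
/// encoding in this PR may divide the bits differently.
const Node = packed struct(u22) {
    /// Character labeling the edge into this node.
    char: u7,
    /// A valid named character reference ends at this node.
    end_of_word: bool,
    /// This node is the last sibling in its child list.
    end_of_list: bool,
    /// Index of this node's first child in the flat array (0 = no children).
    child_index: u13,
};

/// Walks the sibling list starting at `nodes[index].child_index` and returns
/// the index of the child whose edge is labeled `char`, or null if none.
fn findChild(nodes: []const Node, index: u16, char: u8) ?u16 {
    var i: u16 = nodes[index].child_index;
    if (i == 0) return null;
    while (true) : (i += 1) {
        if (nodes[i].char == char) return i;
        if (nodes[i].end_of_list) return null;
    }
}
```

Matching a candidate reference is then just repeated child lookups while remembering whether the last accepted node had `end_of_word` set.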

Some resources:

- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- http://stevehanov.ca/blog/?id=115

The DAFSA here needs 3872 nodes encoded as `packed struct(u22)`s, which, due to alignment, ends up as 15488 bytes (15.1 KiB). Using a `PackedIntArray` can reduce the number of bytes needed, but in my testing it reduces performance: tokenizing the named-char-test.html test file from https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca takes 28.1% ± 0.3% longer. Note also that using a regular struct instead of a packed struct increases `@sizeOf(Node)` to 6 bytes, and that `packed struct(u32)` makes essentially no difference compared to `packed struct(u22)`.
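
For anyone double-checking the math, a small comptime sketch (reusing the illustrative layout from the sketch above; only the u22 backing integer matters for the totals):

```zig
const std = @import("std");

// Illustrative layout; any packed struct(u22) gives the same sizes.
const Node = packed struct(u22) {
    char: u7,
    end_of_word: bool,
    end_of_list: bool,
    child_index: u13,
};

comptime {
    // A u22-backed packed struct has a 4-byte ABI size, which is where the
    // "due to alignment" figure comes from: 3872 * 4 = 15488 bytes (~15.1 KiB).
    std.debug.assert(@sizeOf(Node) == 4);
    std.debug.assert(3872 * @sizeOf(Node) == 15488);

    // Bit-packing the nodes instead (the PackedIntArray option above) would
    // need only 3872 * 22 / 8 = 10648 bytes, at the cost of the slower
    // lookups measured above.
    std.debug.assert(3872 * 22 / 8 == 10648);
}
```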


Some examples of similar DAFSA PRs I've made in the past if you're curious about how the DAFSA compares to other approaches:


Here's what the errors look like when using SublimeText:

(screenshots: char-ref-errors-inline, char-ref-errors)

And here's proof that the example from here is handled correctly:

(screenshot: char-ref-errors-spec)


Note: It's worth merging #10 before testing this branch, since it's pretty easy to run into the root node bug that's fixed in that PR.

@squeek502 force-pushed the character-references branch 2 times, most recently from c74a58b to f8c21ba on July 6, 2024 09:05
@squeek502 force-pushed the character-references branch from f8c21ba to 9d21231 on July 6, 2024 09:45
@kristoff-it (Owner) commented on Jul 6, 2024

Thank you squeek!!!!!
This PR is amazing.

I see that, according to the spec, there is a difference between a bad character reference in an attribute value vs. outside of one.

Since this parser is designed primarily for the use case of supporting human-written HTML, I've diverged from the spec on some occasions where strict adherence would prevent me from detecting a probable human error. As an example, respecting implicitly closed tags would prevent Super from reporting `<h1> foo <h1>` as an error.

With that in mind, do you think it might make sense to be more strict than the spec wrt bad character references in attributes?

My understanding is that if your intent is to actually write '&notit;' in an attribute, you can always come up with a correct encoding, like '&amp;notit;' (I think). That said, people might have different expectations, and it would forbid them from using a "shorthand encoding".
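
For context, the spec difference in question (as I read the named character reference state) is its "for historical reasons" exception: inside an attribute value, when the longest match doesn't end in ';' and the next character is '=' or an ASCII alphanumeric, the consumed characters are flushed as literal text and no missing-semicolon-after-character-reference error is emitted; outside an attribute, the error is emitted. Illustrative inputs (not tied to this parser's API):

```zig
// "&notit;" is the spec's own example: it matches the legacy reference
// "&not" but is not a valid full reference.

// Outside an attribute: the spec emits missing-semicolon-after-character-reference
// here (a converting parser would render it as "I'm ¬it; I tell you").
const text_case = "<p>I'm &notit; I tell you</p>";

// Inside an attribute value: the character after the "&not" match is 'i'
// (ASCII alphanumeric), so the spec flushes it as literal text and reports
// no error at all.
const attr_case = "<p title=\"I'm &notit; I tell you\">x</p>";
```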

What do you think?
