textContent strips <br/>s #32

Open
guidedways opened this issue May 10, 2018 · 5 comments

Comments

@guidedways

import HTMLKit

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "Line<br/>Breaks"
print("\(element.textContent)")

output:

LineBreaks

desired:

Line\nBreaks

At least, this is how NSAttributedString's initWithHTML works. Is there anything I need to do to get this to work properly?

@guidedways
Author

Actually, a better question would be: how do I get HTMLKit to behave just like NSAttributedString? The only reason I'm looking for an alternative is that NSAttributedString uses WebKit internally and keeps the run loop running on the main thread, which causes other asynchronous issues. It looks like HTMLKit is returning me a string with all tags stripped, whereas I'd like it to return the equivalent of what I'd get if I simply converted the HTML to plain text.

@iabudiab
Owner

iabudiab commented May 13, 2018

@guidedways textContent is behaving as it should, i.e. <br> tags are stripped because they are not textual content. See the MDN documentation for Node.textContent.

NSAttributedString

how do I get HTMLKit to behave just like NSAttributedString?

In order for HTMLKit to behave like NSAttributedString it should render the resulting HTML and then give back the resulting visual representation as a string. That's why NSAttributedString uses WebKit internally.

Plain Text

I'd like it to return me an equivalent to what I'd get if I simply turned HTML to plain text.

This is a much more complex topic than you would initially realise. The same goes for your other issue, #31.

Strictly speaking, the plain text variant of Line<br/>Breaks would be LineBreaks, because <br> is an HTML tag, i.e. the input is parsed to this DOM, assuming it is parsed as a fragment inside a <div>:

<div>Line<br>Breaks</div>

However HTML parsing is very lenient and even the most corrupt/invalid/unknown HTML would still produce a DOM tree that is more or less usable. Hence an input like this:

This is an <b>email</b>: John Do <[email protected]>

would produce this:

<div>This is an <b>email</b>: John Do <[email protected]></[email protected]></div>

Notice how the email <[email protected]> is now an element in the DOM.

Now let's take a look at another example, say the input is:

<table><tr><td>Hello<td>Plain<tr>Text

What would the plain text of this be? Is it HelloPlainText, or HelloPlain\nText, or Hello\tPlain\nText, or something completely different?

What I am trying to say is:

If you could provide a universally valid definition to turn HTML to plain text then maybe I could implement it.

The HTML standard specifies one such definition, and it is implemented via the textContent property. The bad news is that it is not usable for many purposes without further processing.
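
To make that definition concrete, here is a minimal sketch of the textContent rule (concatenate the data of all descendant text nodes in tree order) over a toy tree. The `Node` enum and the `textContent(of:)` function are hypothetical stand-ins for illustration only; they are not HTMLKit's types or its implementation.

```swift
// Toy DOM: a node is either a text node or an element with children.
// Hypothetical type for illustration; this is not HTMLKit's node class.
enum Node {
    case text(String)
    case element(tag: String, children: [Node])
}

// Concatenates descendant text in tree order.
// Elements contribute nothing themselves, which is why <br> vanishes.
func textContent(of node: Node) -> String {
    switch node {
    case .text(let data):
        return data
    case .element(_, let children):
        return children.map(textContent(of:)).joined()
    }
}

// <div>Line<br>Breaks</div>
let div = Node.element(tag: "div", children: [
    .text("Line"),
    .element(tag: "br", children: []),
    .text("Breaks"),
])

print(textContent(of: div)) // LineBreaks
```

Under this rule a line break never appears in the output unless the text nodes themselves contain one, so any "plain text" rendering has to be layered on top as a separate policy.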

...

All this to say, I don't have a solution for this issue, and I'm still not completely sure how to solve #31 in a general way.

I'll let you know when I come to a conclusion.

@Jcragons

Jcragons commented Apr 29, 2021

@iabudiab sorry, this is an old thread, but I'm new to the class :) I agree with your point about the strategy or algorithm for converting HTML elements to plain text. With a basic styled HTML string, though, the main issue here is that <br> turns into nothing. Could there be an option to map <br> to a space or to a newline? I used to see this in PHP classes where you can "map" what each HTML node returns, like <b> = **{textContent}** (Markdown style) or <b> = textContent for plain text only.
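
The kind of per-tag mapping described above could look roughly like the following sketch. It works on a toy tree, not on HTMLKit; `Node`, `Rule`, and `render(_:rules:)` are hypothetical names made up for this illustration, and each rule is assumed to receive the already-rendered text of the element's children.

```swift
// Toy DOM for illustration; not HTMLKit's node class.
enum Node {
    case text(String)
    case element(tag: String, children: [Node])
}

// A rule turns the rendered inner text of an element into its replacement.
typealias Rule = (String) -> String

// Renders a node to text; elements without a rule fall back to their inner text.
func render(_ node: Node, rules: [String: Rule]) -> String {
    switch node {
    case .text(let data):
        return data
    case .element(let tag, let children):
        let inner = children.map { render($0, rules: rules) }.joined()
        if let rule = rules[tag] {
            return rule(inner)
        }
        return inner
    }
}

// <b> = **{textContent}** (Markdown style), <br> = newline
let markdownRules: [String: Rule] = [
    "b": { "**\($0)**" },
    "br": { _ in "\n" },
]
// <b> = textContent (no rule needed), <br> = newline
let plainRules: [String: Rule] = [
    "br": { _ in "\n" },
]

// <p>Some <b>bold</b><br>text</p>
let paragraph = Node.element(tag: "p", children: [
    .text("Some "),
    .element(tag: "b", children: [.text("bold")]),
    .element(tag: "br", children: []),
    .text("text"),
])

let asMarkdown = render(paragraph, rules: markdownRules)
let asPlain = render(paragraph, rules: plainRules)
```

Here `asMarkdown` is "Some **bold**" followed by a newline and "text", while `asPlain` is "Some bold" followed by a newline and "text".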

@iabudiab
Owner

@Jcragons 👋 hey there. Let me see if I understood correctly. You want an option to be able to specify how some tags should be replaced when retrieving the textContent of a node, correct? i.e. something like

let element: HTMLElement = HTMLElement(tagName: "div")
element.innerHTML = "Hello<br/>World"
let text = element.textContent(withCustomRules: ["br": " "])
// text: Hello World

I guess this shouldn't be hard to implement. However, I won't promise anything about an ETA 😉
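
As a rough illustration of how such rules could work, here is a self-contained sketch over a toy tree. It is not HTMLKit code: the `Node` enum is a hypothetical stand-in, and this free function only mimics the proposed `withCustomRules` parameter, which does not exist in the library as of this discussion. It assumes a rule replaces the whole element, including its descendants.

```swift
// Toy DOM for illustration; not HTMLKit's node class.
enum Node {
    case text(String)
    case element(tag: String, children: [Node])
}

// Like textContent, except that an element whose tag has a rule is replaced
// by the rule's string instead of by the text of its descendants.
func textContent(of node: Node, withCustomRules rules: [String: String]) -> String {
    switch node {
    case .text(let data):
        return data
    case .element(let tag, let children):
        if let replacement = rules[tag] {
            return replacement
        }
        return children
            .map { textContent(of: $0, withCustomRules: rules) }
            .joined()
    }
}

// <div>Hello<br>World</div>
let div = Node.element(tag: "div", children: [
    .text("Hello"),
    .element(tag: "br", children: []),
    .text("World"),
])

print(textContent(of: div, withCustomRules: ["br": " "])) // Hello World
print(textContent(of: div, withCustomRules: [:]))         // HelloWorld
```

A string-valued rule fits void elements such as <br>. Wrapping an element's own text, for example surrounding bold text with markers, would need a function-valued rule instead.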

@Jcragons

@iabudiab yeah, exactly that :) No pressure on the ETA, I know :) Anyway, I think it could be a nice addition; a lot of people use an old PHP class just because it has this feature. I'm sure it could help a lot here :)


3 participants