Email addresses in HTML content are removed when sanitizing text coming from a plaintext email #126

istrasoft · 2017-08-23T14:25:02Z

When the string to sanitize comes from a plaintext email, items such as the following are present in the original content:

blah blah

From: Mark <mailto:[email protected]>
Sent: Wednesday, August 16, 2017 19:47
To: John <[email protected]>
Subject: Re: Document Test

Hello John

If the email was an HTML email, the < and > around "<[email protected]>" are already escaped as &lt; and &gt;, but if the email was plaintext, they are not.

In this specific case, the part <[email protected]> is considered to be an invalid HTML tag and is removed, along with all the following content from that point.

If option "Keep child nodes of removed elements" is chosen, then only these email tags are lost.

It would be great if, after testing a tag against the whitelist, an additional test were made to match it against these two standard and safe forms.

mganss · 2017-08-23T14:55:07Z

This is the same problem as #91. See there for a workaround.

istrasoft · 2017-08-24T10:00:04Z

Thanks @mganss !
However, unlike "custom" HTML tags, this specific case is an almost RFC-level standard occurrence, so I thought the project might support it out of the box rather than through a workaround :)

mganss · 2017-08-24T12:12:44Z

I'd like HtmlSanitizer to "do one thing well" and that's sanitize HTML, so adding this would be outside of this scope. If you have something that's not HTML, you'll need to do preprocessing.
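
As a rough illustration of the kind of preprocessing meant here, the sketch below escapes angle brackets around email-like tokens before the text reaches the sanitizer. It is not the workaround from #91; the class name `PlainTextEmailPreprocessor`, the method `EscapeBracketedEmails`, and the regular expression are assumptions made for this example, and the pattern is a loose heuristic rather than an RFC 5322 validator.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it is not part of HtmlSanitizer.
public static class PlainTextEmailPreprocessor
{
    // Matches "<address>" or "<mailto:address>" where address looks like local@domain.tld.
    // Tokens containing whitespace, '<', '>' or a second '@' are not matched.
    private static readonly Regex BracketedEmail = new Regex(
        @"<((?:mailto:)?[^\s<>@]+@[^\s<>@]+\.[^\s<>@]+)>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Replaces the angle brackets around email-like tokens with &lt; and &gt;
    // so that an HTML parser treats them as text instead of an unknown tag.
    public static string EscapeBracketedEmails(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        return BracketedEmail.Replace(
            input,
            m => "&lt;" + WebUtility.HtmlEncode(m.Groups[1].Value) + "&gt;");
    }
}
```

The result would then be passed to the sanitizer in the usual way, for example `sanitizer.Sanitize(PlainTextEmailPreprocessor.EscapeBracketedEmails(text))`. Escaping is the conservative choice: anything the pattern matches ends up as inert text, so a false positive cannot introduce markup.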

istrasoft · 2017-08-24T12:37:11Z

Indeed, makes sense. Maybe your gist could be included in the distribution and accessed through an additional call or option/flag. Thanks for the workaround and quick replies :)

mganss · 2017-08-24T14:24:29Z

I have added it to the Examples wiki page.

istrasoft · 2017-08-24T14:25:54Z

Thanks a lot @mganss !

sunitana · 2018-11-21T05:33:07Z

Hi, just FYI, here is how I am handling this issue.

Created a method which identifies if the tag to be removed is in email format.
    sanitizer.RemovingTag += (sender, evt) =>
    {
        if (IsValidMailAddress(evt.Tag))
        {
            // The tag won't be removed if it is in an email format.
            isValid = true;
            evt.Cancel = true;
        }
        else if (!invalidTags.ContainsKey(evt.Tag.TagName))
        {
            //invalidTags.Add(evt.Tag.TagName, evt.Reason.ToString());
            isValid = false;
        }
    };

    public static bool IsValidMailAddress(AngleSharp.Dom.IElement emailAddress)
    {
        try
        {
            string name = emailAddress.NodeName;
            if (name.ToLower().StartsWith("mailto:"))
            {
                // Validate only the part after the "mailto:" prefix.
                new System.Net.Mail.MailAddress(name.Substring("mailto:".Length));
            }
            else
            {
                new System.Net.Mail.MailAddress(name);
            }
            return true;
        }
        catch (Exception)
        {
            // MailAddress throws when the string is not a valid address.
            return false;
        }
    }

istrasoft closed this as completed Aug 24, 2017

mganss mentioned this issue Jul 24, 2019

How do we handle emails in angle brackets #179

Closed

mganss mentioned this issue Sep 7, 2023

data<text removes <text part instead of sanitizing to data<text #464

Closed
