lxml.clean cannot defang some raw text ("<3") #2

Open
jmoiron opened this issue Feb 7, 2012 · 0 comments
Comments

@jmoiron
Owner

jmoiron commented Feb 7, 2012

lxml.html.clean cannot clean some raw text (titles, descriptions, etc.), particularly text that looks like HTML but isn't. The first case I noticed was the element "<title><3</title>", which fails with ParserError('Document is empty'); this is likely an underlying libxml2 issue.

We cannot simply catch the error and pass the text through unchanged, because <3<script>alert('foo');</script> raises the same error and the script would then reach the output unsanitized. There is currently a regression test, "TestHeartParserError", which confirms this error in speedparser and confirms that feedparser reads this content.
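One possible direction, sketched here only as an illustration and not as the fix adopted in speedparser: when the cleaner fails, escape the text instead of passing it through. The names `defang` and `failing_cleaner` are hypothetical, and the cleaner is passed in as a plain callable so the sketch does not depend on lxml; in practice it would wrap whatever lxml.html.clean call is in use.

```python
import html
from typing import Callable


def defang(text: str, clean: Callable[[str], str]) -> str:
    """Return cleaned markup, or fully escaped text if the cleaner fails.

    `clean` is any callable that sanitizes HTML and raises on input it
    cannot parse (assumption: the lxml-based cleaner behaves this way
    for inputs such as "<3").
    """
    try:
        return clean(text)
    except Exception:
        # Assumption: a failure means the input was not parseable as HTML.
        # Escaping keeps "<3" readable and renders any "<script>" inert,
        # instead of handing the raw text to the caller.
        return html.escape(text)


def failing_cleaner(text: str) -> str:
    # Hypothetical stand-in that mimics the parser error from the report.
    raise ValueError("Document is empty")


heart = defang("<3", failing_cleaner)
attack = defang("<3<script>alert('foo');</script>", failing_cleaner)
```

With this fallback both inputs come back with every angle bracket escaped, so the heart survives as text and the script tag is never emitted as markup. The trade-off is that legitimate markup in an unparseable string is escaped as well.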
