XML parsing with CDATA not working #817
Comments
This seems to work using the XML parser instead of the HTML one, but you do need to specify the namespace correctly:
This works great! So my problem is solved, but I don't know whether the issue should be left open, since it should probably work with the default xpath setup as well?
I'm not sure. You're trying to parse XML with an HTML parser. From what I could see it should work, but it doesn't. I expect a simple test case using lxml etree on its own would be a good start: open an issue on the lxml bug tracker with sample code and see what happens. I don't see anything wrong with how urlwatch is using the library, but I'm not an expert.
I don't know either. But according to Wikipedia, XPath stands for "XML Path Language", and I found lots of XML examples without even searching for them. Maybe the library being used is not set up for XML? But that doesn't really make sense either. Let's keep this open for the moment and see what the dev(s) have to say about it.
By default urlwatch uses the HTMLParser class from lxml etree. My example switches it to the XML parser.
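For anyone who wants to reproduce this outside urlwatch, here is a minimal sketch using lxml on its own. The feed snippet is made up for illustration (it is not the real Factorio feed), and the helper names are hypothetical. It assumes the feed uses the standard Atom namespace, which is why the XML variant needs a prefix mapping while the HTML variant uses unprefixed paths like the one in the original report.

```python
from lxml import etree

# Hypothetical Atom snippet for illustration only (not the real Factorio feed).
FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>https://example.com/entry/1</id>
    <title type="html"><![CDATA[Version 1.0 released]]></title>
  </entry>
</feed>"""

# Prefix mapping for the standard Atom namespace; the prefix name is arbitrary.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def title_with_xml_parser(data):
    """Parse as XML; elements must be addressed through the namespace prefix."""
    root = etree.fromstring(data, parser=etree.XMLParser())
    return root.xpath("//atom:entry[1]/atom:title/text()", namespaces=ATOM_NS)


def title_with_html_parser(data):
    """Parse as HTML (what urlwatch does by default, per the comment above)."""
    root = etree.fromstring(data, parser=etree.HTMLParser())
    return root.xpath("//entry[1]/title/text()")


if __name__ == "__main__":
    print("XML parser: ", title_with_xml_parser(FEED))
    print("HTML parser:", title_with_html_parser(FEED))
```

With the XML parser the CDATA content comes back as ordinary text. Comparing the two printed results should show whether the HTML parser is what drops the CDATA content, which would also make a compact test case for the lxml bug tracker.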
I am trying to monitor new releases of Factorio, but it seems that fields containing CDATA are always returned empty.
Factorio publishes new releases in their phpBB forum, which has an Atom feed. The entry that should work IMHO is:
One of the entries looks like this:
I am able to get all the fields that do not contain CDATA, but none of the fields that do. For example, '//entry[1]/id/text()' works without a problem.