Special Chars getting butchered #2
Comments
The HTML parser accepts the HTML source as a string, so it doesn't need to worry about text encoding: a JavaScript string is expected to hold already-decoded Unicode text, not raw bytes in some other encoding. If you need to deal with data in encodings other than UTF-8, check out node-iconv: https://github.com/bnoordhuis/node-iconv
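To illustrate the point about decoding before parsing, here is a minimal sketch using only Node's built-in Buffer (not node-iconv and not node-readability itself). The byte values are an assumed sample: the start of "à Saint-Pierre" as an ISO-8859-1 server would send it.

```javascript
// "à Saint" encoded as ISO-8859-1: "à" is the single byte 0xE0.
const latin1Bytes = Buffer.from([0xe0, 0x20, 0x53, 0x61, 0x69, 0x6e, 0x74]);

// Decoding with the wrong charset: 0xE0 is not valid UTF-8 on its own,
// so it is replaced with U+FFFD (the "�" character).
const wrong = latin1Bytes.toString('utf8');

// Decoding with the charset the bytes were written in recovers the text.
const right = latin1Bytes.toString('latin1');

console.log(JSON.stringify(wrong));
console.log(JSON.stringify(right));
```

Buffer only knows a handful of encodings (`latin1`, `utf8`, `utf16le`, and a few others), which is why a converter such as node-iconv is needed for anything beyond those.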
I actually stepped back a level and handled all the encodings after I do the readability parsing since I need to pick out certain elements anyways. Thanks for the answer and the explanation though.
Can you post an example? I'm seeing this too.
I'm seeing an issue with HTML entities. Using the clean-proxy example, try to open this article: http://www.openforum.com/idea-hub/topics/marketing/article/what-we-can-learn-from-justin-bieber-guy-kawasaki ” converts to ” and similarly for other entities too.
Before:

Readability: ---DOM created
Braquage cette nuit � Saint-Pierre
<div id="corps"><p><p>Deux hommes, arriv�s � bord d’un scooter, ont fait irruption cette nuit vers 3h30 chez un marchand de fruits et l�gumes ouvert 24h/24 � Saint-Pierre. Ils auraient alors menac� d’une arme � feu le g�rant en r�clamant la caisse. Mais ne seraient repartis qu’avec la balance, croyant sans doute qu’elle pouvait contenir de l’argent. Pour le magasin, le pr�judice �conomique est donc plut�t l�ger. Mais si personne n’a �t� bless�, le braqu� est �videmment choqu�.</p>

After:

Readability: ---DOM created
Braquage cette nuit à Saint-Pierre
<div id="corps"><p><p>Deux hommes, arrivés à bord d’un scooter, ont fait irruption cette nuit vers 3h30 chez un marchand de fruits et légumes ouvert 24h/24 à Saint-Pierre. Ils auraient alors menacé d’une arme à feu le gérant en réclamant la caisse. Mais ne seraient repartis qu’avec la balance, croyant sans doute qu’elle pouvait contenir de l’argent. Pour le magasin, le préjudice économique est donc plutôt léger. Mais si personne n’a été blessé, le braqué est évidemment choqué.</p> <p><strong>Plus d’informations demain dans votre Journal de l’île.</strong></p></p></div>

If you use request (from mikeal, for example), you just have to do this:

    var readability = require('../lib/readability');
    var request = require('request');

    var url = "http://www.clicanoo.re/322520-braquage-cette-nuit-a-saint-pierre.html";

    // Ask for the body as a 'binary' string so the bytes are not decoded as UTF-8.
    request({url: url, encoding: 'binary'}, function (error, response, html) {
      // Check for errors first: response is undefined when the request fails.
      if (!error && response.statusCode == 200) {
        // Take the charset from a header such as "text/html; charset=ISO-8859-1".
        var content_type = response.headers['content-type'].split('=');
        var encoding = content_type[1].toUpperCase();
        readability.parse(html, url, {encoding: encoding}, function (result) {
          console.log(result.title, result.content);
        });
      }
    });
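Splitting the Content-Type header on "=" throws when the server sends no charset parameter at all. A more defensive version might look like the sketch below. `charsetFromContentType` is a hypothetical helper name, not part of request or node-readability, and the UTF-8 fallback is an assumption.

```javascript
// Hypothetical helper: extract the charset from a Content-Type header value.
// Falls back to the given default when the header is missing or has no charset.
function charsetFromContentType(header, fallback = 'UTF-8') {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(header || '');
  return match ? match[1].toUpperCase() : fallback;
}

console.log(charsetFromContentType('text/html; charset=iso-8859-1')); // ISO-8859-1
console.log(charsetFromContentType('text/html'));                     // UTF-8
```

The result can be passed wherever the snippet above uses `encoding`. Pages that only declare their charset in a `<meta>` tag would still need separate handling.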
Is there a way to specify the encoding, or does the parser already default to something? I'm using node-readability to parse lots and lots of blog pages (from a list of permalinks), but many special characters (all manner of quotes, hyphens, etc.) are turning into messy jumbles (e.g. ’).
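Jumbled quotes of this kind typically appear when UTF-8 bytes are decoded with a single-byte charset. The sketch below reproduces that with Node's built-in Buffer and shows that the damage is reversible as long as no bytes were lost. It assumes the mis-decoding was ISO-8859-1 (`latin1`); it does not show what node-readability does internally.

```javascript
const original = '\u2019'; // right single quotation mark, 3 bytes in UTF-8

// Mis-decode the UTF-8 bytes as Latin-1: one character becomes three.
const garbled = Buffer.from(original, 'utf8').toString('latin1');

// Reverse the mistake: re-encode as Latin-1, then decode as UTF-8.
const repaired = Buffer.from(garbled, 'latin1').toString('utf8');

console.log(garbled.length);        // 3
console.log(repaired === original); // true
```

If the text was mis-decoded as Windows-1252 instead, the same round trip needs a converter that supports that charset, such as node-iconv.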