Russian articles are not extracted #223

szhem · 2015-04-28T16:48:19Z

It seems that article.cleaned_text, as used in #135, does not work for Russian articles: article_text is always empty.

* STOP_WORDS are stored as unicode, so word candidates should be kept as unicode too; otherwise they cannot be compared against the dictionary correctly.
* In some languages, short stop words are joined to the next word in the sentence with a non-breaking space. To detect those stop words, the tokenizer should support nbsp as a separator. Russian is an example.

So this fixes grangier#223
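The two points above can be illustrated with a minimal Python 3 sketch. It is not the library's actual code: the names `RUSSIAN_STOP_WORDS`, `candidate_words`, and `count_stop_words` are hypothetical, and the stop-word set is a tiny hand-picked sample rather than the project's real dictionary. It only shows that splitting on the non-breaking space (U+00A0) and comparing unicode strings lets short Russian stop words be counted.

```python
import re

# Hypothetical sample; the real project loads stop words from per-language files.
RUSSIAN_STOP_WORDS = {"в", "на", "и", "не", "с", "что"}

# Split on runs of whitespace. U+00A0 (non-breaking space) is listed explicitly
# to document the intent, even though \s already covers it for str patterns.
_TOKEN_SPLIT = re.compile(r"[\s\u00a0]+")


def candidate_words(text):
    """Return lowercased unicode tokens, treating nbsp as a separator."""
    return [word for word in _TOKEN_SPLIT.split(text.lower()) if word]


def count_stop_words(text, stop_words=RUSSIAN_STOP_WORDS):
    """Count tokens that appear in the (unicode) stop-word set."""
    return sum(1 for word in candidate_words(text) if word in stop_words)


if __name__ == "__main__":
    sample = "Он живёт в\u00a0Москве и\u00a0работает на\u00a0заводе"
    print(candidate_words(sample))
    print(count_stop_words(sample))
```

If the tokenizer split only on the ASCII space, "в\u00a0Москве" would stay a single token and the stop word "в" would never match, which is consistent with the empty extraction reported in this issue.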

Lol4t0 linked a pull request Nov 13, 2015 that will close this issue

Fixed unicode handling, Python 3 support, Request as network backend, better content root extraction and other awesome features #248

Open

