Feature: Please add plain text output functionality #150

Open
yevgenpapernyk opened this issue Jul 24, 2020 · 6 comments
Comments

@yevgenpapernyk

Like .summary() but plain text instead of the .summary() html version.

E.g. as a new method or as an argument for the .summary() method.

That would be very useful for Natural Language Processing.

@yevgenpapernyk yevgenpapernyk changed the title Feature: Add plain text output functionality Feature: Please add plain text output functionality Jul 24, 2020
@buriy
Owner

buriy commented Jul 24, 2020

I highly recommend using html2text library on the .summary() output for that.

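    from html2text import HTML2Text

    def html_to_text(html):  # wrapper name is illustrative
        converter = HTML2Text()
        converter.ignore_links = True     # drop link targets
        converter.ignore_emphasis = True  # drop emphasis markers
        converter.body_width = 0          # do not hard-wrap lines
        text = converter.handle(html)
        return text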

Given that it's that easy, that different people need different rendering options, and that those options might change over time and would then have to be reflected in the library interface, I'd like to leave it as is.
However, I might consider adding a simple version; for that you just need .text_content() in lxml.
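
For reference, a minimal sketch of the .text_content() route, assuming lxml is installed and the input is the HTML string returned by .summary(). The helper name `summary_to_text` and the sample string are made up for illustration; neither is part of readability-lxml.

```python
import lxml.html


def summary_to_text(summary_html):
    """Return the text of an HTML string with all markup dropped.

    Hypothetical helper, not part of readability-lxml.
    """
    root = lxml.html.fromstring(summary_html)
    # text_content() concatenates every text node under the element
    text = root.text_content()
    # Collapse runs of whitespace (newlines, indentation) into single spaces
    return " ".join(text.split())


# Stand-in for the string that .summary() would return
sample = "<html><body><div><p>Hello <b>world</b>.</p></div></body></html>"
print(summary_to_text(sample))  # prints: Hello world.
```

Note that text_content() adds no separators of its own between block elements, so adjacent paragraphs can run together. That is one reason a renderer such as html2text may still be the better fit when paragraph structure matters.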

@adbar

adbar commented Jul 29, 2020

Shameless plug: trafilatura builds upon readability-lxml and can convert the output to TXT, XML, CSV and JSON.

@yevgenpapernyk
Author

yevgenpapernyk commented Aug 24, 2020

> However, I might consider adding a simple version, for that you need just .text_content() in lxml.

So I'll leave the issue open until you decide whether you want to add it, right?

@IdavalapatiRamanjaneyulu

Is there a plan to support textContent, as the JS module does (https://github.com/mozilla/readability#parse)?

@buriy
Owner

buriy commented Aug 17, 2021

Yes, if many people want an easy way to get text output, I'll add it.

@Seshu77

Seshu77 commented Aug 20, 2021

Could you please add support for getting clear text content?
