A Python package for cleaning Project Gutenberg books and datasets.
Requires the nltk package.
[sudo] pip install gutenberg-cleaner
It provides two functions: "simple_cleaner" and "super_cleaner".
from gutenberg_cleaner import simple_cleaner, super_cleaner
Removes only the lines that belong to the Project Gutenberg header or footer. It does not go deeper into the text to remove other things such as titles or footnotes.
simple_cleaner(book: str) -> str
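As an illustration of how a cleaner with the signature above might be applied to a book stored on disk, here is a minimal sketch. The helper name `clean_file` and the file paths are hypothetical and not part of the package; the cleaner is passed in as an argument so the helper works with any function that takes a string and returns a string.

```python
from pathlib import Path
from typing import Callable


def clean_file(src: Path, dst: Path, cleaner: Callable[[str], str]) -> str:
    """Read a book from src, run it through cleaner, and write the result to dst.

    clean_file is a hypothetical helper, not part of gutenberg-cleaner.
    """
    # UTF-8 is an assumption about how the book file is encoded.
    text = src.read_text(encoding="utf-8")
    cleaned = cleaner(text)
    dst.write_text(cleaned, encoding="utf-8")
    return cleaned
```

With the package installed, a call could look like `clean_file(Path("book.txt"), Path("book_clean.txt"), simple_cleaner)`, where both file names are placeholders.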
Cleans the book aggressively (titles, footnotes, images, book information, etc.). It may delete some good lines too.
super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str
min_token: The minimum number of tokens a paragraph must have if it is not a "dialog" or "quote". -1 means don't tokenize the text (faster, but less effective cleaning). max_token: The maximum number of tokens a paragraph may have.
Deleted paragraphs are marked with: [deleted]
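If the [deleted] markers are not wanted in the final text, they can be filtered out afterwards. The sketch below is one possible approach, not something the package provides: `drop_deleted` is a hypothetical helper, and it assumes that paragraphs are separated by blank lines and that a deleted paragraph is replaced by the marker on its own. Check the actual output of super_cleaner before relying on it.

```python
DELETED_MARKER = "[deleted]"


def drop_deleted(cleaned: str, marker: str = DELETED_MARKER) -> str:
    """Remove paragraphs that consist only of the deletion marker.

    Hypothetical helper; assumes paragraphs are separated by blank lines.
    """
    paragraphs = cleaned.split("\n\n")
    # Keep every paragraph that is not just the marker (ignoring whitespace).
    kept = [p for p in paragraphs if p.strip() != marker]
    return "\n\n".join(kept)
```

A typical pipeline would then be `drop_deleted(super_cleaner(book))`, or `drop_deleted(super_cleaner(book, min_token=-1))` to skip tokenization.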
- Peyman Mohseni kiasari
This project is licensed under the MIT License. See the LICENSE.md file for details.