Namu Wiki Extractor

This library strips all namu marks from a namu wiki document and extracts only its plain text.

Requirement

Python 3

Installation

pip install namu-wiki-extractor

Usage

Basic

import json
from namuwiki.extractor import extract_text

with open('namu_wiki.json', 'r', encoding='utf-8') as input_file:
    namu_wiki = json.load(input_file)

item = namu_wiki[1]
plain_text = extract_text(item['text'])
print(plain_text)

Extract deletions and footnotes separately

import json
from namuwiki.extractor import extract_text

with open('namu_wiki.json', 'r', encoding='utf-8') as input_file:
    namu_wiki = json.load(input_file)

item = namu_wiki[1]
document = extract_text(item['text'], separate_deletions=True, separate_footnotes=True)
print(document.text)
print(document.deletions)
print(document.footnotes)

Multiprocessing

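import json
from multiprocessing import Pool

from namuwiki.extractor import extract_text

def work(document):
    # Runs in a worker process: extracts the plain text of one document
    return {
        'title': document['title'],
        'content': extract_text(document['text'])
    }

if __name__ == '__main__':
    # The guard keeps worker processes from re-running this block on
    # platforms that start workers by re-importing the main module
    with open('namu_wiki.json', 'r', encoding='utf-8') as input_file:
        namu_wiki = json.load(input_file)

    with Pool() as pool:
        items = pool.map(work, namu_wiki)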
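
With pool.map, an exception raised while processing a single document propagates to the parent and the whole batch is lost. If that is a concern, one possible approach is to catch errors per document. The sketch below is not part of this library: safe_work and its extract parameter are hypothetical names, and whether extract_text ever raises on malformed input is an assumption.

```python
from typing import Callable, Dict, Optional


def safe_work(document: Dict[str, str],
              extract: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Extract one document, recording any failure instead of raising.

    `extract` is whatever extraction callable you pass in
    (for example extract_text); it is a parameter of this sketch only.
    """
    try:
        content = extract(document['text'])
        error = None
    except Exception as exc:
        # Keep the batch going; store a short description of the failure
        content = None
        error = repr(exc)
    return {
        'title': document.get('title'),
        'content': content,
        'error': error,
    }
```

To use it with a pool, the extraction function can be bound first, for example pool.map(functools.partial(safe_work, extract=extract_text), namu_wiki), and failed items can then be filtered by checking the 'error' field.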

API

namuwiki.extractor.extract_text(source: str, separate_deletions: bool = False, separate_footnotes: bool = False) -> Union[str, Document]

This function strips all namu marks from source and extracts its plain text. If neither separate_deletions nor separate_footnotes is True, it returns the extracted plain text as a str. Otherwise, it returns the extracted plain text, deletions, and footnotes as a Document.

Parameters

source: Text from a namu wiki document
separate_deletions: Whether deletions should be separately extracted from the source
separate_footnotes: Whether footnotes should be separately extracted from the source

namuwiki.extractor.Document(text: str, deletions: List[str], footnotes: List[str])

text: Plain text with all namu marks removed from the given source
deletions: Separately extracted deletions from the given source
footnotes: Separately extracted footnotes from the given source
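
Because extract_text can return either a str or a Document, calling code that accepts both shapes may want to normalize them. The following is a minimal sketch under stated assumptions: the Document class here is a local stand-in that only mirrors the three documented fields (it is not the library's own class), and as_document is a hypothetical helper name.

```python
from typing import List, NamedTuple, Union


class Document(NamedTuple):
    # Local stand-in with the three documented fields;
    # NOT the class shipped in namuwiki.extractor
    text: str
    deletions: List[str]
    footnotes: List[str]


def as_document(result: Union[str, Document]) -> Document:
    """Return a Document-shaped value for either return type."""
    if isinstance(result, str):
        # Plain-text result: no separately extracted parts
        return Document(text=result, deletions=[], footnotes=[])
    return result
```

In real use the stand-in class would presumably be replaced by importing Document from namuwiki.extractor; the isinstance check on str works the same way in either case.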

Note

A JSON dump file of namu wiki can be downloaded from here
