=========
BoilerPy
=========
About
---------------------------------------
BoilerPy is a native Python port of Christian Kohlschütter's Boilerpipe library, released under the Apache 2.0 Licence (http://code.google.com/p/boilerpipe/).
I created this port because I don't have access to Java on my web host and I wanted a pure Python version. Another Python version, which consists of Python hooks into the original Java library, can be found here: https://github.com/misja/python-boilerpipe. It might be a better option if you are able to run Java.
BoilerPy was created with the help of the excellent Java2Python library (https://github.com/natural/java2python).
Installation
---------------------------------------
BoilerPy was packaged with distutils. It can be installed from the command line with the following command:
``>python setup.py install``
Usage
---------------------------------------
``import boilerpy``
``boilerpy.extractors.ARTICLE_EXTRACTOR.getContentFromUrl('http://www.example.com/')``
``boilerpy.extractors.ARTICLE_EXTRACTOR.getContentFromFile('site/example.html')``
``htmlText='<html><body><h1>Example</h1></body></html>'``
``boilerpy.extractors.ARTICLE_EXTRACTOR.getContent(htmlText)``
Extractors
---------------------------------------
ARTICLE_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DEFAULT_EXTRACTOR. Works very well for most types of article-like HTML.
DEFAULT_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A quite generic full-text extractor. Usually worse than ARTICLE_EXTRACTOR, but simpler, with no heuristics.
LARGEST_CONTENT_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than DEFAULT_EXTRACTOR but usually worse than ARTICLE_EXTRACTOR.
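The idea of keeping only the largest text component can be illustrated with a toy sketch. This is not BoilerPy's implementation: it assumes the page has already been split into plain-text blocks, it measures size by word count, and ``largest_block`` is a hypothetical name.

```python
def largest_block(blocks):
    """Return the block with the most whitespace-separated words."""
    if not blocks:
        return ""
    return max(blocks, key=lambda block: len(block.split()))
```

A navigation bar or footer typically has only a handful of words, so on a simple article page the main paragraph would usually win this comparison.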
CANOLA_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A full-text extractor trained on the krdwrd Canola corpus, which uses a different definition of "boilerplate". You may give it a try.
KEEP_EVERYTHING_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A dummy extractor that marks everything as content, so it should return the input text. Use this to check whether your problem lies within a particular extractor or somewhere else.
NUM_WORDS_RULES_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
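A toy sketch of classifying a block from its own word count and those of its neighbours is given below. It does not reproduce the rules BoilerPy actually uses: the thresholds ``min_words`` and ``min_neighbour_words`` are made-up values, ``classify_blocks`` is a hypothetical name, and the input is assumed to be a list of plain-text blocks.

```python
def classify_blocks(blocks, min_words=15, min_neighbour_words=8):
    """Label each block as content (True) or boilerplate (False).

    A block counts as content when it is long itself, or when it is of
    moderate length and sits next to a long block. Both thresholds are
    illustrative assumptions.
    """
    counts = [len(block.split()) for block in blocks]
    labels = []
    for i, current in enumerate(counts):
        # Missing neighbours at either end are treated as empty blocks.
        previous = counts[i - 1] if i > 0 else 0
        following = counts[i + 1] if i + 1 < len(counts) else 0
        if current >= min_words:
            labels.append(True)
        elif current >= min_neighbour_words and max(previous, following) >= min_words:
            labels.append(True)
        else:
            labels.append(False)
    return labels
```

Looking at the neighbouring blocks lets a medium-length paragraph next to a long one be kept, while an isolated short block such as a menu entry is dropped.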
ARTICLE_SENTENCES_EXTRACTOR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A full-text extractor which is tuned towards extracting sentences from news articles.
Version
---------------------------------------
1.0 - Created 14 Feb 2013