pageone

a module for polling urls and stats from homepages

Install

$ pip install pageone

Tests

The tests require nose:

$ nosetests

Usage

pageone does two things: it extracts article urls from a site's homepage, and it uses selenium and phantomjs to find the relative positions of these urls on the page.

pageone provides a single interface:

import pageone

for link in pageone.get('http://www.propublica.org/', pattern='.*article.*'):
    print(link)

Here, pattern is the regular expression used to identify which urls are articles. If newslynx is installed and pattern is not provided, pageone defaults to newslynx.lib.url.is_article, which uses a series of heuristics to determine whether a url is an article.
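
As a rough illustration of how a pattern such as '.*article.*' separates article urls from other links, the sketch below filters a list of urls with Python's re module. The helper name filter_article_urls and the sample urls are hypothetical, and applying the pattern with re.match is an assumption; this is not pageone's own code.

```python
import re


def filter_article_urls(urls, pattern):
    """Keep only the urls that match `pattern`.

    Hypothetical helper for illustration; not part of pageone.
    """
    regex = re.compile(pattern)
    # re.match anchors at the start of the string; with a leading '.*'
    # the pattern effectively behaves like a substring search.
    return [url for url in urls if regex.match(url)]


# Made-up urls, for illustration only.
sample_urls = [
    'http://www.example.org/article/some-story',
    'http://www.example.org/about',
    'http://www.example.org/section/article-two',
]

articles = filter_article_urls(sample_urls, '.*article.*')
print(articles)
```

Note that a pattern only keeps urls that actually contain the text it describes, so it is worth checking it against a few real links from the target homepage first.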

All methods will return a list of dictionaries that look like this:

{
 'bucket': 8,
 'bucket_size': 200,
 'datetime': datetime.datetime(2015, 10, 6, 20, 21, 22, 422478),
 'domain': 'www.propublica.org',
 'font_size': 14,
 'n_links': 1,
 'page': 'http://www.propublica.org/',
 'text': u'The Stories of Everyday Lives, Hidden in Reams of Data',
 'url': u'https://www.propublica.org/nerds/item/the-stories-of-everyday-lives-hidden-in-reams-of-data/',
 'visible': True,
 'x': 61,
 'x_bucket': 1,
 'y': 1578,
 'y_bucket': 8
}
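
Because each result is a plain dictionary, ordinary list operations are enough to work with the output. The sketch below keeps only visible links and orders them from the top of the page down. The first record reuses values from the example above; the other two records and the helper name visible_top_to_bottom are hypothetical.

```python
# Records shaped like the example above, trimmed to the fields used here.
records = [
    {'url': 'https://www.propublica.org/nerds/item/the-stories-of-everyday-lives-hidden-in-reams-of-data/',
     'visible': True, 'x': 61, 'y': 1578, 'font_size': 14},
    {'url': 'http://www.example.org/hidden-link',  # hypothetical record
     'visible': False, 'x': 10, 'y': 20, 'font_size': 12},
    {'url': 'http://www.example.org/top-story',  # hypothetical record
     'visible': True, 'x': 300, 'y': 120, 'font_size': 32},
]


def visible_top_to_bottom(records):
    """Return the visible records ordered by y, then by x."""
    shown = [r for r in records if r['visible']]
    return sorted(shown, key=lambda r: (r['y'], r['x']))


for record in visible_top_to_bottom(records):
    print('{} {} {}'.format(record['y'], record['x'], record['url']))
```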

Here, the bucket variables represent where a link falls in a grid of 200x200 pixel cells. x_bucket counts from left to right, y_bucket counts from top to bottom, and bucket moves from top-left to bottom-right. You can customize the size of this grid by passing bucket_pixels to get, e.g.:

import pageone

for link in pageone.get('http://www.propublica.org/', bucket_pixels=100, pattern='.*article.*'):
    print(link)
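
To make the grid arithmetic concrete, here is a minimal sketch of how pixel coordinates could map to x_bucket and y_bucket. The formula (integer division by the cell size, plus one) is an assumption that happens to agree with the sample record above (x=61, y=1578 gives x_bucket=1, y_bucket=8); it is not taken from pageone's source, and grid_position is a hypothetical name. How the combined bucket value is derived is not covered here.

```python
def grid_position(x, y, bucket_pixels=200):
    """Map pixel coordinates to 1-based (x_bucket, y_bucket) grid cells.

    Assumed formula, not pageone's code: integer-divide each coordinate
    by the cell size and add one.
    """
    x_bucket = x // bucket_pixels + 1
    y_bucket = y // bucket_pixels + 1
    return x_bucket, y_bucket


print(grid_position(61, 1578))       # (1, 8), same as the sample record
print(grid_position(61, 1578, 100))  # (1, 16) with a finer 100-pixel grid
```

A smaller bucket_pixels gives a finer grid, so the same link lands in a higher-numbered cell.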

PhantomJS

pageone.get() requires phantomjs. By default, pageone looks for the phantomjs binary at /usr/bin/phantomjs; to use a different location, pass phantom_path to pageone.get:

import pageone

for link in pageone.get('http://www.propublica.org/', pattern='.*article.*', phantom_path="/usr/bin/phantomjs"):
    print(link)
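
If you are unsure where phantomjs lives on a given machine, a small check before calling pageone.get can save a confusing failure. The sketch below is one way to do that with the standard library. The helper name find_phantomjs and the second candidate path are assumptions for illustration; only /usr/bin/phantomjs comes from the text above.

```python
import os


def find_phantomjs(candidates=('/usr/bin/phantomjs', '/usr/local/bin/phantomjs')):
    """Return the first candidate that is an executable file, else None.

    Hypothetical helper; the second default path is only a guess at a
    common install location.
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


phantom_path = find_phantomjs()
if phantom_path is None:
    print('phantomjs not found; install it or pass phantom_path explicitly')
else:
    print('using phantomjs at {}'.format(phantom_path))
```

The resulting path can then be handed to pageone.get through the phantom_path argument.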
