inka.js - Reactive Web Crawler

Inka is especially useful for downloading the content of blogs, news sites, and similar article-based sites.

Requirements

Node.js

How to use

Take a look at the example directory.

In short: you configure which part of a page is an "article" and which kinds of links should be visited, decide what an article object should look like, and finally do whatever you want with that object.

config.js:

var self = module.exports = {
    url: 'http://example.com/html',
    rootUrl: 'http://example.com/', // for links starting with "/"
    debounce: 1000,
    selectors: {
        article: 'article',
        links: 'a'
    },
    callbacks: {
        toArticle: function($) {
            return {
                title: $('h2 > a').text(),
                date: $('time').attr('datetime'),
                body: $('section').html()
            };
        },
        shouldDownloadLink: function($) { // you probably want to follow only "next page" links
            return $('a').text() === '»';
        },
        extractUrlFromLink: function($) {
            return $('a').attr('href');
        }
    }
};
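
The comment on `rootUrl` says it is used for links starting with "/". How inka.js resolves such links internally is not shown here, so the following is only a minimal sketch of the idea, using Node's built-in `url` module. The helper name `resolveLink` is hypothetical and not part of inka.js.

```javascript
var URL = require('url').URL;

// Hypothetical helper (not part of inka.js): turns the value returned by
// extractUrlFromLink into an absolute URL, using rootUrl as the base.
function resolveLink(href, rootUrl) {
    // The URL constructor resolves relative references such as "/page/2"
    // against the base and leaves already absolute URLs unchanged.
    return new URL(href, rootUrl).href;
}

console.log(resolveLink('/page/2', 'http://example.com/'));
console.log(resolveLink('http://other.example/post', 'http://example.com/'));
```

With a helper like this, a "next page" link written as `/page/2` would be fetched from `http://example.com/page/2`.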

download.js:

var InkaCrawler = require('../lib/inka-crawler').InkaCrawler;
var crawler = new InkaCrawler(require('./config'));

crawler.toObservable().subscribe(function(article) {
    // save to the mongodb, publish on message broker, etc.
});
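
As one possible body for the subscriber above, the sketch below appends every article to a file, one JSON document per line. It runs without inka.js: `fakeObservable` is a hypothetical stand-in that only mimics the `subscribe(callback)` call, and the output path and the sample articles are assumptions made for illustration.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical stand-in for crawler.toObservable(); it only imitates
// subscribe(onNext) so that the sketch is runnable on its own.
function fakeObservable(articles) {
    return {
        subscribe: function(onNext) {
            articles.forEach(function(article) {
                onNext(article);
            });
        }
    };
}

// Assumed output location; any writable path would do.
var outputFile = path.join(os.tmpdir(), 'inka-articles-' + process.pid + '.jsonl');
fs.writeFileSync(outputFile, ''); // start with an empty file

// Append one article per line, serialized as JSON.
function saveArticle(article) {
    fs.appendFileSync(outputFile, JSON.stringify(article) + '\n');
}

fakeObservable([
    { title: 'First note', date: '2013-01-01', body: '<p>Hello</p>' },
    { title: 'Second note', date: '2013-01-02', body: '<p>World</p>' }
]).subscribe(saveArticle);
```

With the real crawler, `saveArticle` could be passed to `crawler.toObservable().subscribe(...)` in the same way.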

Limitations

For now, inka.js supports only sites whose listing pages contain the full content of each article. I will probably work on that.

inka?

A friend of mine has a cat named "Inka" and a blog with plenty of notes about the kitty, which she wanted to migrate to a new blog on wordpress.com. I decided to help her with that.

You can visit the new blog about cats, handmade works and sewing here (in Polish).
