Anansi

A web crawler and crawling framework

Anansi is a web crawler which includes specific support for Linked Data and can be operated out-of-the-box or used as a framework for developing your own crawling applications.

This software was developed as part of the Research & Education Space project and is actively maintained by a development team within BBC Design and Engineering. We hope you’ll find this project useful!

Requirements

Optionally, you may also wish to install:—

On Debian-based systems, the following will install those required packages which are generally available in APT repositories:

  sudo apt-get install -qq libjansson-dev libmysqlclient-dev libcurl4-gnutls-dev libxml2-dev librdf0-dev libltdl-dev uuid-dev automake autoconf libtool pkg-config clang build-essential xsltproc docbook-xsl-ns

Anansi has not yet been ported to non-Unix-like environments, and will install as shared libraries and command-line tools on macOS rather than frameworks and LaunchDaemons.

It ought to build inside Cygwin on Windows, but this is untested.

Contributions for building properly with Visual Studio or Xcode, and so on, are welcome (provided they do not significantly complicate the standard build logic).

Using Anansi

Configuring the crawler

The first time you install Anansi, an example crawl.conf will be installed to $(sysconfdir) (by default, /usr/local/etc).
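The installed example file is the authoritative reference for the available settings. Purely to give a flavour of what an INI-style crawler configuration looks like, here is a hypothetical sketch — the section and key names below are invented for illustration and are not Anansi's actual options:

```ini
; Hypothetical sketch only: these sections and keys are invented for
; illustration. Consult the installed example crawl.conf for the real options.
[crawler]
verbose=no

[queue]
; Anansi's build dependencies include the MySQL client library, so a
; database-backed queue of this general shape is plausible, but the
; actual connection settings may well be expressed differently.
db=mysql://localhost/anansi
```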

Invoking the crawler

The crawl daemon is installed by default as $(sbindir)/crawld, which will typically be /usr/local/sbin/crawld.

After you’ve initially configured the crawler, you should perform any database schema updates which may be required:

$ /usr/local/sbin/crawld -S

This happens automatically when you launch it, but the -S option will give you an opportunity to see the results of a first run without examining log files, and will cause the daemon to terminate after ensuring the schema is up to date.

To run the crawler in the foreground, with debugging enabled:

$ /usr/local/sbin/crawld -d

Or to run it in the foreground, without debug-level verbosity:

$ /usr/local/sbin/crawld -f

Alternatively, to run in the background:

$ /usr/local/sbin/crawld

If you want to perform a single test fetch of a URI using your current configuration, you can do this with:

$ /usr/local/sbin/crawld -t http://example.com/somelocation

Once you’ve configured the crawler, you can add a URI to its queue using the crawler-add utility, installed as $(bindir)/crawler-add (typically /usr/local/bin/crawler-add). Note that crawld does not have to be running in order to add URIs to the queue.
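For example, to queue a single URI for crawling (the exact argument syntax here is an assumption — run crawler-add with no arguments for usage information):

$ /usr/local/bin/crawler-add http://example.com/somelocation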

Components

  • crawler - the crawler daemon, its components, and command-line tools
  • libspider - a high-level library which implements the core of the Anansi crawler
  • libcrawl - a low-level library implementing the basic logic of a crawler
  • libsupport - utility library providing support for configuration files and logging

Bugs and feature requests

If you’ve found a bug, or have thought of a feature that you would like to see added, you can file a new issue. A member of the development team will triage it and add it to our internal prioritised backlog for development—but in the meantime we welcome contributions and encourage forking.

Building from source

You will need git, automake, autoconf and libtool. Also see the Requirements section.

$ git clone git://github.com/bbcarchdev/anansi.git
$ cd anansi
$ git submodule update --init --recursive
$ autoreconf -i
$ ./configure --prefix=/some/path
$ make
$ make check
$ sudo make install

Automated builds

We have configured Travis CI to automatically build Anansi and run its tests for new commits on each branch. See .travis.yml for the details.

You may wish to set up something similar for your own forks if you intend to maintain them.

The debian directory contains the logic required to build a Debian package for Anansi, except for the changelog. It is used by the system that auto-deploys packages for the production Research & Education Space, so if you need a modified version to suit your own deployment needs, it's probably easiest to maintain a fork of this repository containing your changes.

Contributing

If you’d like to contribute to Anansi, fork this repository and commit your changes to the develop branch.

For larger changes, you should create a feature branch with a meaningful name, for example one derived from the issue number.

Once you are satisfied with your contribution, open a pull request describing the changes you've made, and a member of the development team will take a look.

Information for BBC Staff

This is an open source project which is actively maintained and developed by a team within Design and Engineering. Please bear in mind the following:—

  • Bugs and feature requests must be filed in GitHub Issues: this is the authoritative list of backlog tasks.
  • Issues with the label triaged have been prioritised and added to the team’s internal backlog for development. Feel free to comment on the GitHub Issue in either case!
  • You should never add or remove the triaged label on your own or anybody else's GitHub Issues.
  • Forking is encouraged! See the “Contributing” section.
  • Under no circumstances may you commit directly to this repository, even if you have push permission in GitHub.
  • If you’re joining the development team, contact “Archive Development Operations” in the GAL to request access to GitLab (although your line manager should have done this for you in advance).

Finally, thanks for taking a look at this project! We hope it'll be useful; do get in touch with us if we can help with anything (“RES-BBC” in the GAL, and we have staff in BC and PQ).

License

Anansi is licensed under the terms of the Apache License, Version 2.0.

  • Copyright © 2013 Mo McRoberts
  • Copyright © 2014-2017 BBC