pip install hepcrawl
Warning
You may need to install additional system-level packages such as libffi, libssl, libxslt and libxml2.
For development, we start by creating a virtual environment for our Python packages:
mkvirtualenv hepcrawl
cdvirtualenv
mkdir src && cd src
Now we grab the code and install it in development mode:
git clone https://github.com/inspirehep/hepcrawl.git
cd hepcrawl
pip install -e .
Development mode ensures that any changes you make to the sources are picked up automatically, so there is no need to reinstall after every change.
Finally, run the tests to make sure everything is set up correctly:
python setup.py test
Thanks to the command line tools provided by Scrapy, we can easily test the spiders as we develop them. Here is an example using the arXiv spider on a sample record:
cdvirtualenv src/hepcrawl
scrapy crawl arXiv -a source_file=file://`pwd`/tests/responses/arxiv/sample_arxiv_record.xml
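To get a feel for what the spider does with such a record, here is a minimal sketch of parsing an arXiv-style XML record with the standard library. The XML snippet and field names below are illustrative stand-ins, not the actual test fixture shipped with hepcrawl:

```python
# Illustrative only: a tiny stand-in for an arXiv record, showing the kind of
# XML a spider consumes. The element names are assumptions for this sketch.
import xml.etree.ElementTree as ET

SAMPLE = """
<record>
  <id>1605.03844</id>
  <title>Some sample title</title>
  <categories>hep-th</categories>
</record>
"""

root = ET.fromstring(SAMPLE)
title = root.findtext("title")                      # extract the title field
categories = root.findtext("categories").split()    # categories are space-separated
print(title, categories)
```

The real spider performs the same kind of field extraction, but against the full arXiv metadata schema and via Scrapy's selector machinery.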
Run the crawler with INSPIRE (assuming you already have a virtualenv with everything set up).
The example below shows how to get all papers from 24 June 2016 to 26 June 2016 from arXiv whose subject area is hep-th (HEP Theory). We use the arXiv spider and assign the article workflow.
workon inspire-next
inspirehep oaiharvester harvest -m arXiv -u http://export.arxiv.org/oai2 -f 2016-06-24 -t 2016-06-26 -s 'physics:hep-th' -a 'spider=arXiv' -a 'workflow=article'
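Under the hood, a harvest like this corresponds to an OAI-PMH ListRecords request against the arXiv endpoint. The flag-to-parameter mapping sketched below (-f to from, -t to until, -s to set, -m to metadataPrefix) is an assumption based on the OAI-PMH protocol, not taken from the inspirehep CLI documentation:

```python
# Sketch of the OAI-PMH request implied by the harvest command above.
# The parameter mapping is an assumption based on the OAI-PMH protocol.
from urllib.parse import urlencode

base_url = "http://export.arxiv.org/oai2"
params = {
    "verb": "ListRecords",        # OAI-PMH verb for bulk harvesting
    "metadataPrefix": "arXiv",    # assumed to correspond to -m arXiv
    "from": "2016-06-24",         # -f: start of the date range
    "until": "2016-06-26",        # -t: end of the date range
    "set": "physics:hep-th",      # -s: OAI set restricting the subject area
}
url = base_url + "?" + urlencode(params)
print(url)
```

Fetching that URL in a browser returns the raw XML the harvester consumes, which can be handy when debugging an empty harvest.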
Thanks for contributing!