This page gives a more detailed description and further instructions on running the spider in various ways.
- Environment Variables
- Output Targets
- Elasticsearch
- Starting Spider Jobs
- Running Scrapy Scheduler
- Running For All Domains
- Adding New Spiders
When running the scheduler or benchmark, we support reading keys and values from a .env file in the project root. Otherwise, these variables must be exported through other means.
# Optional variables for process control and info
SCRAPY_LOG_LEVEL="INFO"
SPIDER_SCRAPY_MAX_WORKERS="5"
SPIDER_DEPTH_LIMIT="3"
SPIDER_CRAWL_SITES_FILE_NAME="crawl-sites-sample.json"
# Needed for elasticsearch Output target
SPIDER_ES_INDEX_NAME="i14y-documents-spider"
SPIDER_ES_INDEX_ALIAS="i14y-documents"
ES_HOSTS="http://localhost:9200"
ES_USER="username"
ES_PASSWORD="password"
# Needed for endpoint Output Target
SPIDER_URLS_API="https://jsonplaceholder.typicode.com/posts"
# Needed for deployment
SPIDER_PYTHON_VERSION="3.12"
SPIDER_RUN_WITH_UI="False"
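For reference, here is a minimal sketch of how these values could be read at startup, assuming the python-dotenv package (the project's actual loading code may differ):

```python
# Minimal sketch of loading the .env file, assuming the python-dotenv package;
# the project's actual startup code may load these values differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a .env file in the current working directory

log_level = os.getenv("SCRAPY_LOG_LEVEL", "INFO")
max_workers = int(os.getenv("SPIDER_SCRAPY_MAX_WORKERS", "5"))
depth_limit = int(os.getenv("SPIDER_DEPTH_LIMIT", "3"))
print(f"log level={log_level}, workers={max_workers}, depth limit={depth_limit}")
```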
Before setting the output target to elasticsearch for any domains, complete the following steps:
- Install the required nltk modules (only needed for the elasticsearch output target):
# make sure the virtual environment is activated
python ./search_gov_crawler/elasticsearch/install_nltk.py
- Start elasticsearch using the docker compose file at the project root:
# ensure current working directory is the project root
docker compose up
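Before starting a crawl with the elasticsearch output target, it can help to confirm the cluster is reachable. The sketch below is only an example, assuming the official elasticsearch Python client (8.x-style auth) and the ES_* variables from the Environment Variables section:

```python
# Quick connectivity check against the cluster started by docker compose.
# Assumes the official `elasticsearch` Python client (8.x); adjust auth to your setup.
import os

from elasticsearch import Elasticsearch

es = Elasticsearch(
    hosts=os.getenv("ES_HOSTS", "http://localhost:9200").split(","),
    basic_auth=(os.getenv("ES_USER", ""), os.getenv("ES_PASSWORD", "")),
)

if es.ping():
    print("Elasticsearch is up, version:", es.info()["version"]["number"])
else:
    print("Could not reach Elasticsearch; check ES_HOSTS and credentials")
```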
Make sure to follow the Quick Start steps before running any spiders.
- Navigate to the search_gov_crawler directory
- Run a scrapy crawl command
# write URLs to a CSV
scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com -a output_target=csv
# post URLs to an endpoint
scrapy crawl domain_spider -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com -a output_target=endpoint
# post documents to elasticsearch
scrapy crawl domain_spider_js -a allowed_domains=quotes.toscrape.com -a start_urls=https://quotes.toscrape.com/js -a output_target=elasticsearch
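The same crawls can also be started from Python rather than the scrapy CLI. Below is a minimal sketch using Scrapy's CrawlerProcess, assuming the spiders accept the same arguments passed with -a above and that it is run from within the Scrapy project:

```python
# Programmatic equivalent of the `scrapy crawl` commands above.
# Assumes this is run from within the Scrapy project so the settings can be found,
# and that the spider accepts the same arguments passed with -a on the CLI.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(
    "domain_spider",
    allowed_domains="quotes.toscrape.com",
    start_urls="https://quotes.toscrape.com",
    output_target="csv",
)
process.start()  # blocks until the crawl finishes
```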
The benchmark script is primarily intended for timing and testing scrapy runs. There are two ways to run it. In either case, if you want to redirect output to a log file and avoid tying up the terminal session, wrap your command with something like nohup <benchmark command> >> scrapy.log 2>&1 &
- To run a single domain, specifying the starting URL (-u), allowed domain (-d), and output target (-o):
python search_gov_spiders/benchmark.py -u https://www.example.com -d example.com -o csv
- To run multiple spiders simultaneously, provide a json file in the format of the crawl-sites-sample.json file as an argument:
python search_gov_spiders/benchmark.py -f /path/to/crawl-sites-like-file.json
There are other options available. Run python search_gov_spiders/benchmark.py -h for more info.
See the Running Scrapy Scheduler section below.
We do not actively use this option, but it remains in place for potential future use.
- Navigate to the Scrapy project root directory and start the server.
scrapyd
- Note: the directory where you start the server is arbitrary; it simply determines where the logs and the Scrapy project FEED destination (relative to the server directory) will be written.
- Run this command to eggify the Scrapy project and deploy it to the Scrapyd server:
scrapyd-deploy default
Note: this simply deploys to a local Scrapyd server. To add custom deployment endpoints, edit the scrapy.cfg file and add or customize endpoints.
For instance, if you wanted local and production endpoints:
[settings]
default = search_gov_spiders.settings
[deploy:local]
url = http://localhost:6800/
project = search_gov_spiders
[deploy:production]
url = <IP_ADDRESS>
project = search_gov_spiders
And then to deploy:
# deploy locally
scrapyd-deploy local
# deploy production
scrapyd-deploy production
- For an interface to view jobs (pending, running, finished) and logs, access http://localhost:6800/. However, to actually manipulate the spiders deployed to the Scrapyd server, you'll need to use the Scrapyd JSON API.
Some often-used commands:
- Schedule a job:
curl http://localhost:6800/schedule.json -d project=search_gov_spiders -d spider=<spider_name>
- Check load status of a service:
curl http://localhost:6800/daemonstatus.json
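The same JSON API can be driven from Python. Below is a minimal sketch, assuming the requests library and a local Scrapyd server on the default port (the spider name is illustrative):

```python
# Minimal sketch of calling the Scrapyd JSON API with the requests library.
# Assumes a local Scrapyd server on the default port; the spider name is illustrative.
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a job (same as the curl example above)
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "search_gov_spiders", "spider": "domain_spider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# List pending, running, and finished jobs for the project
jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": "search_gov_spiders"})
print(jobs.json())
```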
This process allows Scrapy to be run directly using an in-memory scheduler. The schedule is based on the initial schedule defined in the crawl-sites-sample.json file. The process will run until killed.
The JSON input file must be in a format similar to what is shown below. Validations, in place both when the file is read and in tests, help prevent this file from getting into an invalid state.
[
{
"name": "Example",
"allowed_domains": "example.com",
"allow_query_string": false,
"handle_javascript": false,
"schedule": "30 08 * * MON",
"starting_urls": "https://www.example.com"
}
]
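The project's actual validations live in its own code and tests, but a hypothetical sketch of the kind of checks applied when the file is read might look like this (the function name and specific rules are illustrative only):

```python
# Hypothetical sketch of validating a crawl-sites file; the project's real checks
# live in its own code and tests and may differ from what is shown here.
import json

REQUIRED_KEYS = {
    "name",
    "allowed_domains",
    "allow_query_string",
    "handle_javascript",
    "schedule",
    "starting_urls",
}

def validate_crawl_sites(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for record in records:
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"{record.get('name', '<unnamed>')}: missing keys {missing}")
        if not record["starting_urls"].startswith("http"):
            raise ValueError(f"{record['name']}: starting_urls must be a full URL")

validate_crawl_sites("crawl-sites-sample.json")
```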
- Source the virtual environment and update dependencies.
- Start the scheduler:
$ python search_gov_crawler/scrapy_scheduler.py
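For context, here is a rough sketch of how an in-memory cron schedule such as "30 08 * * MON" can be run, assuming the APScheduler library; the actual implementation in scrapy_scheduler.py may differ:

```python
# Rough sketch of an in-memory cron scheduler, assuming the APScheduler library;
# the project's scrapy_scheduler.py may be implemented differently.
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

def crawl_example_site():
    # placeholder for kicking off a scrapy crawl for one configured site
    print("starting crawl for example.com")

scheduler = BlockingScheduler()
# "30 08 * * mon" corresponds to the "30 08 * * MON" schedule in the JSON above
scheduler.add_job(crawl_example_site, CronTrigger.from_crontab("30 08 * * mon"))
scheduler.start()  # runs until the process is killed
```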
This method is not recommended. If you want to run a large number of domains, you should set up a schedule.
Navigate to search_gov_crawler/search_gov_spiders, then run the command below to crawl all URLs / domains that do not require javascript handling:
scrapy crawl domain_spider
To run for all sites that do require javascript handling, run:
scrapy crawl domain_spider_js
Note: these crawls will take a long time.
- Navigate to anywhere within the Scrapy project root directory and run this command:
$ scrapy genspider -t crawl <spider_name> "<spider_starting_domain>"
- Using the domain spider as an example, copy code to the new spider file.
- Modify the rules in the new spider as needed (see the hypothetical sketch after this list). Refer to the Scrapy rules documentation for the specifics.
- To update the Scrapyd server with the new spider, run:
$ scrapyd-deploy <default or endpoint_name>
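For the rules step mentioned above, a hypothetical new spider might restrict link extraction like this (the spider name, domain, and patterns are purely illustrative):

```python
# Hypothetical example of customizing `rules` in a new spider generated with
# `scrapy genspider -t crawl`; the name, domain, and patterns are illustrative only.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = "example_spider"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com"]

    rules = (
        # follow internal links, skipping obvious binary files, and parse every page
        Rule(
            LinkExtractor(allow_domains=["example.com"], deny=(r"\.pdf$", r"\.zip$")),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```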