u3 is a scraper and feeder for the univizor project.
Scraper | Homepage | State |
---|---|---|
rul | repozitorij.uni-lj.si | Done |
dkum | dk.um.si | Done |
bf | digitalna-knjiznica.bf.uni-lj.si | Done |
famnit | famnit.upr.si | Done |
ung | sabotin.ung.si | Done |
docker-compose run u3 bf -a categories=biologija -L INFO
If you need to rebuild the image:
docker build -t univizor/u3:latest .
Some crawling options can be seen in refresh.sh.
Please read NATIVE.md.
- refresh.sh - Script that starts scraping in parallel. New items are added to the collection. This script should be run periodically via cron.
- recreate_database.py - Drops all existing tables and creates new tables with the up-to-date structure.
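To illustrate what "scraping in parallel" can look like, here is a minimal Python sketch that launches one run per scraper concurrently. It is not the contents of refresh.sh; the helper names `build_command` and `run_all` are hypothetical, and it assumes each scraper can be started with the same `docker-compose run u3 <scraper>` form shown earlier, without extra `-a` arguments.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Scraper names taken from the table at the top of this README.
SCRAPERS = ["rul", "dkum", "bf", "famnit", "ung"]


def build_command(scraper, log_level="INFO"):
    """Return the argv for one scraper run (hypothetical helper).

    Mirrors the docker-compose example above, minus scraper-specific
    "-a" arguments, which are assumed to be optional.
    """
    return ["docker-compose", "run", "u3", scraper, "-L", log_level]


def run_all(scrapers=SCRAPERS, runner=subprocess.call):
    """Run every scraper concurrently; return {scraper: exit_code}.

    `runner` receives an argv list and returns an exit code. It defaults
    to subprocess.call and can be swapped for a stub in tests.
    """
    scrapers = list(scrapers)
    commands = [build_command(name) for name in scrapers]
    # One worker thread per scraper; each thread just waits on its process.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        exit_codes = list(pool.map(runner, commands))
    return dict(zip(scrapers, exit_codes))
```

Calling `run_all()` from a script scheduled by cron would start all five scrapers at once; a non-zero exit code in the returned mapping marks a scraper that failed.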
This is the default configuration. Any value can be overridden by setting environment variables.
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 3
FILES_STORE = ./data/files
HASHING_ALGORITHM = sha256
DATABASE_URL = ...
PERSIST_STATS_INTERVAL = 10
DOGSTATSD_ADDR = ...
DOGSTATSD_PORT = ...
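The override mechanism can be sketched in Python as follows. This is an illustration under assumptions, not the project's actual settings module: `DEFAULTS` and `load_settings` are hypothetical names, each environment variable is assumed to share its setting's name, and the entries listed as `...` above are left out because their values are not given.

```python
import os

# (default, cast) pairs for the settings whose defaults are listed above.
# The casts are an assumption: environment variables are always strings,
# so numeric settings need converting.
DEFAULTS = {
    "CONCURRENT_REQUESTS": (16, int),
    "DOWNLOAD_DELAY": (3, float),
    "FILES_STORE": ("./data/files", str),
    "HASHING_ALGORITHM": ("sha256", str),
    "PERSIST_STATS_INTERVAL": (10, int),
}


def load_settings(environ=None):
    """Return the effective settings: defaults, overridden by `environ`.

    `environ` defaults to os.environ; pass a plain dict to test overrides.
    """
    environ = os.environ if environ is None else environ
    settings = {}
    for name, (default, cast) in DEFAULTS.items():
        raw = environ.get(name)
        # Keep the default when the variable is unset, else cast the string.
        settings[name] = default if raw is None else cast(raw)
    return settings
```

For example, `load_settings({"DOWNLOAD_DELAY": "0.5"})` keeps every default except `DOWNLOAD_DELAY`, which becomes the float `0.5`.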
u3 now supports Sentry integration via the scrapy-sentry library.
To use it, set the SENTRY_DSN environment variable:
docker run -ti --rm \
--name u3 \
--link pg \
--env DATABASE_URL="postgresql://postgres:@pg:5432/u3_dev" \
--env SENTRY_DSN="http://public:secret@example.com/12345" \
univizor/u3:latest bf -a categories=biologija
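Because a malformed DSN is easy to miss until the first error fails to be reported, it can help to sanity-check the value before starting the container. The sketch below assumes the legacy DSN layout `scheme://public:secret@host/project`; `check_dsn` is a hypothetical helper, not part of u3 or scrapy-sentry, and the DSN in the usage note is a placeholder.

```python
from urllib.parse import urlsplit


def check_dsn(dsn):
    """Split a DSN into its parts, raising ValueError if one is missing.

    Assumes the layout scheme://public[:secret]@host/project.
    """
    parts = urlsplit(dsn)
    project = parts.path.strip("/")
    if parts.scheme not in ("http", "https"):
        raise ValueError("DSN must start with http:// or https://")
    if not parts.username:
        raise ValueError("DSN is missing the public key")
    if not parts.hostname:
        raise ValueError("DSN is missing the host")
    if not project:
        raise ValueError("DSN is missing the project id")
    return {
        "public_key": parts.username,
        "host": parts.hostname,
        "project": project,
    }
```

With the placeholder DSN `http://public:secret@example.com/12345`, `check_dsn` returns the public key `public`, the host `example.com`, and the project `12345`.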