Source code for a Django, PostgreSQL & AWS Lambda-based tool for scraping data from
Code (c) Research Action Design, LLC. Originally produced for Centro de los Derechos del Migrante, Inc.
Released under a GPL v3 license; see the LICENSE file for the full text of the license.
- Run
PIPENV_VENV_IN_PROJECT=1 pipenv install
- Configure environment variables in
file. You'll need, at a minimum:
- JOBS_API_KEY = API key for requests to's Microsoft Search back-end. It can be found by inspecting network requests on an individual job listing in a web browser.
- Run
pre-commit install
to install pre-commit hooks for the Black Python formatter.
- Run
python migrate --run-syncdb
to create a local database.
All of the scraper functionality can be run via Django's python ___
- Download the most recent RSS feed of job listings and create/update listing records for each item in the feed.
- scrape_listings = Query the API for the data of a single listing, save it to the database, then download the PDF of the full job listing application and save it to wherever local file uploads are stored.
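The create/update step for feed items can be sketched as follows. This is a minimal illustration, not the project's code: it assumes a standard RSS 2.0 feed, uses each item's link as the record key, and keeps records in a plain dict. The function name, the sample feed, and the stored fields are all hypothetical.

```python
import xml.etree.ElementTree as ET


def upsert_listings_from_rss(rss_xml, store):
    """Create or update one record per <item> in an RSS 2.0 feed.

    `store` is a plain dict standing in for the database table (assumption).
    Returns a (created, updated) tuple of counts.
    """
    created = updated = 0
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # no usable key for this item, so skip it
        fields = {
            "title": (item.findtext("title") or "").strip(),
            "pub_date": (item.findtext("pubDate") or "").strip(),
        }
        if link in store:
            store[link].update(fields)
            updated += 1
        else:
            store[link] = fields
            created += 1
    return created, updated


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Job A</title><link>https://example.com/a</link></item>
  <item><title>Job B</title><link>https://example.com/b</link></item>
</channel></rss>"""

store = {}
first_run = upsert_listings_from_rss(SAMPLE_FEED, store)
second_run = upsert_listings_from_rss(SAMPLE_FEED, store)
```

Running the same feed twice creates the records once and updates them the second time, which is the property a scheduled scraper needs. In a Django project the dict would typically be replaced by a model call such as `update_or_create`.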
The scraper is designed to run as an AWS Lambda function, saving listings to an RDS database and saving PDFs to an S3 bucket.
The file
contains a lambda function handler which essentially passes through commands to the Django management command parser. If no command or an invalid command is set, the lambda handler just returns some basic stats.
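The pass-through behavior described above might look roughly like this. It is a sketch under stated assumptions rather than the project's actual handler: `make_handler`, the `args` event key, and the shape of the returned dicts are hypothetical, and the command runner and stats function are injected so the example runs without Django.

```python
def make_handler(run_command, get_stats, valid_commands):
    """Build a Lambda-style handler(event, context) that dispatches on event["command"]."""

    def handler(event, context):
        event = event or {}
        command = event.get("command")
        if command not in valid_commands:
            # No command, or an unrecognized one: fall back to basic stats.
            return {"command": None, "stats": get_stats()}
        # "args" is an assumed, optional list of positional arguments.
        run_command(command, *event.get("args", []))
        return {"command": command, "status": "ok"}

    return handler


# Stand-ins for the real command runner and stats query.
calls = []
handler = make_handler(
    run_command=lambda name, *args: calls.append((name, args)),
    get_stats=lambda: {"listings": 0},
    valid_commands={"scrape_rss", "scrape_listings"},
)
ran = handler({"command": "scrape_rss"}, None)
fallback = handler({"command": "not_a_command"}, None)
```

In a Django project, `run_command` could be `django.core.management.call_command`, which accepts the command name followed by its arguments.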
In order to run on AWS, the following environment variables need to be set:
- password for postgres user on RDS instance
- AWS_PGHOST = domain name for RDS instance
- USE_AWS = flag to use AWS, should be set to anything other than False
- S3 bucket to store job order PDFs in
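As an illustration of how settings code might read these variables, here is a small sketch. It assumes that an unset USE_AWS means AWS mode is off and that a local run falls back to `localhost`. Both are assumptions, and the helper names are hypothetical.

```python
import os


def aws_enabled(environ=os.environ):
    """Treat USE_AWS as on for any value other than the literal string "False".

    Assumption: an unset variable counts as off.
    """
    return environ.get("USE_AWS", "False") != "False"


def database_host(environ=os.environ, local_default="localhost"):
    """Return the RDS host from AWS_PGHOST in AWS mode, else a local default."""
    if aws_enabled(environ):
        return environ["AWS_PGHOST"]
    return local_default
```

For example, `database_host({"USE_AWS": "1", "AWS_PGHOST": "db.example.internal"})` returns the RDS host, while `database_host({})` returns the local default.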
Additionally, the Lambda function must be in the same VPC as the RDS instance and have a role which has write access to the relevant S3 bucket. Lastly, the VPC needs to have a NAT Gateway in order for the scraper to successfully make outgoing requests. See this article for a full how-to.
- Run
from the console. This command will create a zip file with all project dependencies (from .venv/../site-packages), a special AWS Lambda-friendly version of psycopg2 (from the aws_psycopg2 directory), and the project code, and then upload it to AWS as a Lambda function.
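The bundling half of that step can be sketched in a few lines of Python. This is not the project's deploy command. `build_bundle` is a hypothetical helper that assumes each source directory's contents should sit at the root of the archive, which is where Lambda looks for importable packages.

```python
import zipfile
from pathlib import Path


def build_bundle(zip_path, source_dirs):
    """Write every file under each directory in `source_dirs` into one zip.

    Archive paths are relative to their source directory, so packages end up
    at the archive root. Returns the sorted list of archive names.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for source_dir in map(Path, source_dirs):
            for path in sorted(source_dir.rglob("*")):
                if path.is_file():
                    # Store the file under its path relative to source_dir.
                    bundle.write(path, path.relative_to(source_dir).as_posix())
        return sorted(bundle.namelist())
```

The output path should live outside the directories being bundled. The upload itself is left out here. One option would be boto3's Lambda client method `update_function_code`, which accepts the zip bytes through its `ZipFile` parameter.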
Schedule the scraper using Amazon EventBridge. The event input should be fixed JSON, e.g.:
{"command": "scrape_rss"}
To run migrations on AWS, set the .env variables so that your local environment connects to the AWS PostgreSQL database, then run migrations as normal. Note that your IP address must be allowed for inbound/outbound traffic in the security group.