I will try to provide steps on how to run this locally. Please take everything I share here with a grain of salt, as this is only a personal project without ongoing maintenance for deployments other than my own. After cloning the repo, the key piece needed to run this locally is properly setting up your environment variables. Let's get started!
I will walk through how to set it up as I have, with a Postgres database. For a simpler use case focused only on scraping, without dbt, you may want to consider using the `FEEDS` setting in `settings.py` and exporting the output to CSV or JSON; the Scrapy feed exports documentation covers this in more detail. If you go this route, you will also need to disable the `ITEM_PIPELINES` setting in `settings.py`. You may also want to look at `run_job_scraper_single.py`, which takes a single URL as a command-line parameter instead of scraping many URLs.
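As a rough sketch of that FEEDS-based alternative, the snippet below shows the general shape; the output path, format, and empty `ITEM_PIPELINES` are placeholder choices, not this project's configuration:

```python
# settings.py (sketch): export scraped items straight to a local JSON file
# instead of running them through the Postgres item pipeline.
FEEDS = {
    "output/job_postings.json": {  # placeholder path
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,
    },
}

# Disable the pipeline that writes to Postgres.
ITEM_PIPELINES = {}
```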
I'll go into detail on the environment variables that need to be defined to run the scraper; `.env.example` is a helpful resource. A few rough code sketches follow the list below to illustrate how these pieces fit together.
- You need an AWS account, as well as the following environment variables (the boto3 documentation has information on how to get these):
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_REGION
- You also need an S3 bucket to store the raw HTML you scrape, or you can omit this part by removing the `export_html` function call. I have the bucket name defined in my environment variables as `RAW_HTML_S3_BUCKET`.
- You need to connect your own Postgres database, or sign up for Neon as I have. You will probably be okay using the free tier of Neon for quite a while. You will need the following environment variables:
  - PG_DATABASE
  - PG_HOST
  - PG_USER
  - PG_PASSWORD
- There is one point in the code where we query Postgres to determine which URLs we need to scrape. You need to create a table in your Postgres DB with, at minimum, the columns `id` (I used the type SERIAL here), `url`, and `is_enabled`, which should default to true. This is the table where you insert the URLs you would like to scrape job postings from. You can change `is_enabled` to false if you want to stop scraping a certain URL; I had to do this with Ramp and OpenSea, as they opted for Ashby, which is not supported (yet!). You need the environment variable `PAGES_TO_SCRAPE_QUERY`, which should look something like `select distinct url from <YOUR_URLS_TABLE> where is_enabled;` (see the sketch after this list).
- Finally, I opted to determine a unique id for a Levergreen record using Hashids (it looks like they rebranded to Sqids; for now, I'll keep the existing hashids implementation). The hashids algorithm needs a salt, which can be any string. Save any string to the environment variable `HASHIDS_SALT`.
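Below are the rough sketches mentioned above. First, a quick way to confirm the AWS variables are wired up is a one-off boto3 call; this is just a verification snippet, not part of the project:

```python
import os
import boto3

# Sanity check: resolve the credentials from the environment and confirm
# they map to a real AWS identity. AWS_REGION is passed in explicitly.
session = boto3.session.Session(region_name=os.environ["AWS_REGION"])
identity = session.client("sts").get_caller_identity()
print(f"Authenticated as {identity['Arn']} in {session.region_name}")
```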
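Storing the raw HTML is ultimately just an S3 put. The repo's `export_html` handles this for you; the sketch below only illustrates the general idea, and the key naming is made up:

```python
import os
import boto3

def upload_raw_html(html: str, key: str) -> None:
    """Illustrative only: write one page of scraped HTML to the raw bucket."""
    s3 = boto3.client("s3", region_name=os.environ["AWS_REGION"])
    s3.put_object(
        Bucket=os.environ["RAW_HTML_S3_BUCKET"],
        Key=key,  # e.g. "raw_html/2024-01-01/acme.html" (hypothetical layout)
        Body=html.encode("utf8"),
    )
```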
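For the Postgres pieces, here is a minimal sketch that connects with the `PG_*` variables, creates a URLs table, and runs the same shape of query as `PAGES_TO_SCRAPE_QUERY`. The table name `scrape_urls` and the example URL are placeholders, and `sslmode="require"` is what hosted providers like Neon typically expect:

```python
import os
import psycopg2

conn = psycopg2.connect(
    dbname=os.environ["PG_DATABASE"],
    host=os.environ["PG_HOST"],
    user=os.environ["PG_USER"],
    password=os.environ["PG_PASSWORD"],
    sslmode="require",  # typically required by hosted Postgres such as Neon
)

with conn, conn.cursor() as cur:
    # Minimal version of the URLs table described above.
    cur.execute("""
        create table if not exists scrape_urls (
            id serial primary key,
            url text not null,
            is_enabled boolean not null default true
        );
    """)
    # Seed it with a job board URL you care about (placeholder shown here).
    cur.execute(
        "insert into scrape_urls (url) values (%s)",
        ("https://jobs.lever.co/example",),
    )
    # This is the shape PAGES_TO_SCRAPE_QUERY should take.
    cur.execute("select distinct url from scrape_urls where is_enabled;")
    print([row[0] for row in cur.fetchall()])
```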
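As for the salt, it is simply a string passed to the hashids constructor. Roughly, using the `hashids` package from PyPI (the integer input here is arbitrary):

```python
import os
from hashids import Hashids

# The same salt must be reused across runs so generated ids stay stable.
hashids = Hashids(salt=os.environ["HASHIDS_SALT"])
record_id = hashids.encode(12345)  # arbitrary example input
print(record_id)
```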
Once you've completed the prerequisites, you are ready to get scraping. My recommendation here is to use the GitHub Actions workflow `daily_job_board_scrape.yml` as a guide to see how we are scraping. The key steps are:
- Installing the dependencies defined in `requirements.txt` (a virtual environment is recommended).
- Running the file `run_job_scraper.py` with the necessary environment variables.
- Installing dbt packages using the `dbt deps` command.
- Running dbt to build our mart tables (`all_job_postings` and `active_job_postings`).
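If you'd rather drive those steps from one local script instead of reading the workflow, a rough Python equivalent is below; it assumes the environment variables above are already exported and that `dbt run` is the build command (check the workflow file for the exact commands and flags):

```python
import subprocess

# Rough local stand-in for the daily_job_board_scrape.yml steps.
for cmd in (
    ["pip", "install", "-r", "requirements.txt"],
    ["python", "run_job_scraper.py"],
    ["dbt", "deps"],
    ["dbt", "run"],  # assumption: builds all_job_postings / active_job_postings
):
    subprocess.run(cmd, check=True)
```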
I also have a testing workflow to make sure my data adheres to the tests I've defined in dbt.
This should be all that is needed to begin scraping jobs from companies you are interested in. I hope you can use this to find your dream role at an exciting company! Please star this repo and spread the word if you find this project useful. Best of luck!