- Collect: Gather key states' political campaign finance report data which should include recipient information, donor information, and transaction information.
- Transform: Define database schema for storing transaction and entity information and write code to transform and validate raw data to fit appropriate schema.
- Clean: Perform record linkage and fix likely data entry errors.
- Classify: Label all entities as fossil fuel, clean energy, or other.
- Graph: Construct a network graph of campaign finance contributions (see the sketch after this list).
- Analyze: Perform analysis on the network data and join it with other relevant datasets.
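As a rough illustration of the Graph step, the sketch below builds a directed contribution network with networkx. This is not the project's actual code, and the column names (`donor_id`, `recipient_id`, `amount`) are assumed placeholders rather than the real schema.

```python
# Illustrative sketch of the Graph step, not the project's actual code.
# Column names (donor_id, recipient_id, amount) are assumed placeholders.
import networkx as nx
import pandas as pd

transactions = pd.DataFrame(
    {
        "donor_id": ["d1", "d1", "d2"],
        "recipient_id": ["r1", "r2", "r1"],
        "amount": [500.0, 250.0, 1000.0],
    }
)

G = nx.DiGraph()
for row in transactions.itertuples():
    # Accumulate repeated donor -> recipient contributions on one edge.
    if G.has_edge(row.donor_id, row.recipient_id):
        G[row.donor_id][row.recipient_id]["amount"] += row.amount
    else:
        G.add_edge(row.donor_id, row.recipient_id, amount=row.amount)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```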
- Collect the data through one of the steps below:
  a. Collect each state's campaign finance data either by web scraping (AZ, MI, PA) or direct download (MN), OR
  b. Go to the project's Google Drive and download each state's data to your local repo following this format: repo_root / "data" / "raw" / state acronym / "file"
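For reference, here is a minimal sketch of that raw-data layout. The state list comes from the setup step above; `Path.cwd()` as the repo root is an assumption, and file names within each state folder are left as placeholders.

```python
# Minimal sketch of the raw-data layout described above; run from the
# repository root. File names within each state folder are placeholders.
from pathlib import Path

repo_root = Path.cwd()  # assumes you are running from the repo root
for state in ["AZ", "MI", "MN", "PA"]:
    state_dir = repo_root / "data" / "raw" / state
    state_dir.mkdir(parents=True, exist_ok=True)  # create if missing
    print(state_dir)
```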
- Run `pip install -r requirements.txt` and `pip install -e .` if not in Docker (not recommended for development).
The main components of the package are broken up into subpackages that can be imported and used in external code. To run the pipelines directly, use the scripts in the `scripts` directory. These scripts have already been dockerized and can be run with the make commands below.
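If you want to drive those dockerized pipelines from Python (for example, inside an orchestration script), a minimal sketch is to shell out to make with one of the targets documented below:

```python
# Hedged sketch: invoke a dockerized pipeline from Python by shelling
# out to make. Run from the repo root; the target is documented below.
import subprocess

# Equivalent to typing `make run-transform-pipeline` at the repo root.
subprocess.run(["make", "run-transform-pipeline"], check=True)
```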
- `make run-transform-pipeline`: Runs the pipeline to transform raw data from each state into the appropriate schema. Expects there to be a folder for each state in the `data/raw` folder; follow the setup instructions above to get the data.
- `make run-clean-classify-graph-pipeline`: Runs the pipeline to clean, classify, and graph data that is already in the correct schema. Expects `inds_mini.csv`, `orgs_mini.csv`, and `trans_mini.csv` in the `data/transformed` directory (should be in git by default); see the loading sketch below.
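As a minimal sketch, the three transformed tables that pipeline expects can be loaded like this. The file names and directory come from the README; the columns in each file are whatever the transform pipeline produced.

```python
# Load the transformed tables expected by the clean/classify/graph
# pipeline. Paths come from the README; column contents depend on the
# transform step's output.
from pathlib import Path

import pandas as pd

transformed = Path("data") / "transformed"
inds = pd.read_csv(transformed / "inds_mini.csv")
orgs = pd.read_csv(transformed / "orgs_mini.csv")
trans = pd.read_csv(transformed / "trans_mini.csv")
print(len(inds), len(orgs), len(trans))
```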
For development, please use either a Docker dev container or a Slurm compute cluster. See CONTRIBUTING.md for more details.
Contains the project's Python code.
Contains short, clean notebooks to demonstrate analysis.
Contains details of acquiring all raw data used in the repository. If the data is small (<50MB), it is okay to save it to the repo, making sure to clearly document how the data is obtained.
If the data is larger than 50MB, do not add it to the repo; instead, document how to get it in the README.md file in the data directory.
This README.md file should be kept up to date.
This folder is empty by default; the final outputs of the make commands are placed here.
Student Name: Nicolas Posner Student Email: [email protected]
Student Name: Alan Kagiri Student Email: [email protected]
Student Name: Adil Kassim Student Email: [email protected]
Student Name: Nayna Pashilkar Student Email: [email protected]
Student Name: Yangge Xu Student Email: [email protected]
Student Name: Bhavya Pandey Student Email: [email protected]
Student Name: Kaya Lee Student Email: [email protected]