Merge branch 'main' into improve-classify.py
trevorspreadbury authored May 22, 2024
2 parents e325e25 + af1f546 commit f904f4a
Showing 37 changed files with 3,817 additions and 81 deletions.
4 changes: 3 additions & 1 deletion .git-blame-ignore-revs
Original file line number Diff line number Diff line change
@@ -5,4 +5,6 @@
# migrate code style to ruff
7313695f4c091a3943d17a0abea351987cc02eb6
# ruff format src/utils/classify_infogroup_data.py
4e4336ea0ff4af1ec6a84d309f042073b7eea25e
4e4336ea0ff4af1ec6a84d309f042073b7eea25e
# fix code style of `collect_harvard_data` branch
4f978d2082a440f31479ca5cfbec90e8b7683b80
26 changes: 26 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,26 @@
Copyright (c) 2024 University of Chicago. All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
21 changes: 8 additions & 13 deletions README.md
@@ -5,9 +5,9 @@
1. Collect: Gather key states' political campaign finance report data which should include recipient information, donor information, and transaction information.
2. Transform: Define database schema for storing transaction and entity information and write code to transform and validate raw data to fit appropriate schema.
3. Clean: Perform record linkage and fix likely data entry errors.
4. Classify: Label all entities as fossil fuel, clean energy, or other
5. Graph: Construct a network graph of campaign finance contributions
6. Analyze: Perform analysis on network data and join with other relevant dataset
4. Classify: Label all entities as fossil fuel, clean energy, or other.
5. Graph: Construct a network graph of campaign finance contributions with micro-level and macro-level views.
6. Analyze: Perform analysis on network data and join with other relevant datasets.


## Setup
@@ -33,24 +33,19 @@ For developing, please use either a Docker dev container or slurm computer clust

### Network Visualization

# TODO: #101 document what we want to see in the visualization and decide how many types of visual are needed

The network visualizations and their associated metrics are housed in the `output` directory, specifically in [this folder](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/tree/main/output/network_graphs). The approaches used for these visuals are described in [this document](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/main/output/network_graphs/README.md).

## Repository Structure

### utils
Project python code
Project python code.

### notebooks
Contains short, clean notebooks to demonstrate analysis.
Contains short, clean notebooks to demonstrate analysis. This folder is dynamic: notebooks are added and removed as the current work requires.

### data

Contains details of acquiring all raw data used in repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how to the data is obtained.

If the data is larger than 50MB than you should not add it to the repo and instead document how to get the data in the README.md file in the data directory.

This [README.md file](/data/README.md) should be kept up to date.
Contains details on how to acquire all raw data used in the repository.

### output
This folder is empty by default. The final outputs of make commands will be placed here by default.
@@ -74,7 +69,7 @@ Student Email: [email protected]
Student Name: Yangge Xu
Student Email: [email protected]

Student Name: Bhavya Pandey
Student Name: Bhavya Pandey
Student Email: [email protected]

Student Name: Kaya Lee
31 changes: 31 additions & 0 deletions data/README.md
@@ -198,6 +198,8 @@ These companies were listed on the website in a table, which was copy and pasted
### How to access:
This file is called FFF_oil_companies.csv and can be downloaded from the climate cabinet drive in 2024-spring-clinic folder.

- Limitation: the companies listed are global, so not all of them may be relevant to our U.S.-based analysis.

This file should be saved in the path data/raw_classification/FFF_oil_companies.csv

### Features
@@ -250,3 +252,32 @@ This file is called SIC_codes and and should be downloaded as a csv from the cli
- SIC_code: the SIC code associated with the company. If the SIC code is shorter than 6 numbers, the code represents the first n numbers of an SIC code
- SIC_code_description: description associated with the SIC code
- classification: if the company is fossil fuel (f), clean energy (c), maybe fossil fuel (uf), maybe clean energy (uc)
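
The prefix rule for `SIC_code` described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the table entries are invented placeholders rather than rows from the real SIC_codes file, the function name `classify_sic` is hypothetical, and the choice to let the longest matching prefix win is an assumption that the README does not confirm.

```python
from typing import Dict, Optional

# Invented placeholder entries mirroring the SIC_code and classification
# columns; these are NOT taken from the real SIC_codes file.
SIC_TABLE: Dict[str, str] = {
    "13": "f",       # short code: matches any SIC code starting with "13"
    "4911": "uf",    # four-digit prefix
    "491103": "c",   # full six-digit code
}


def classify_sic(company_sic: str, table: Dict[str, str]) -> Optional[str]:
    """Return the label of the longest table entry that prefixes company_sic.

    Returns None when no entry matches. Preferring the longest prefix is an
    assumption made for this sketch.
    """
    code = company_sic.strip()
    # Try the full code first, then progressively shorter prefixes.
    for length in range(len(code), 0, -1):
        label = table.get(code[:length])
        if label is not None:
            return label
    return None
```

With this table, a company whose code starts with `4911` but is not `491103` would receive the `uf` label, while a code with no matching prefix is left unclassified.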

## State Legislative Election Returns (1967-2016)

### Overview
The [State Legislative Election Returns dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3WZFK9) records detailed election outcomes for individual candidates in state legislative races across the United States from 1967 to 2016. It supports historical analysis of electoral trends, candidate performance, and legislative turnover.

### Data Source
The dataset aggregates data from multiple authoritative sources, including state election boards and historical archives, to ensure comprehensive coverage and accuracy. It provides an invaluable resource for researchers focusing on political science, electoral behavior, and governance.
# TODO: #106 add a link to where this came from and where it is expected to be saved to run the pipeline

### Features
- **Temporal Coverage:** Includes data from 1967 to 2016, capturing a broad spectrum of political and historical contexts.
- **Utility:** Designed to support a wide range of analyses, from simple descriptive statistics to complex longitudinal studies.

### Key Variables
The dataset comprises several critical variables that capture the essentials of each election:
- **caseid:** A unique identifier for each election entry.
- **year, month, day:** The date on which the election was held.
- **sab:** State abbreviation, indicating the state in which the election took place.
- **cname:** Name of the county for localized analysis.
- **candid:** A unique identifier for each candidate.
- **vote:** The number of votes received by the candidate.
- **termz:** The actual length of term the elected candidate served.
- **cand:** Name of the candidate.
- **sen:** Indicates whether the election was for the state senate.
- **partyt:** The political party affiliation of the candidate.
- **outcome:** The result of the election for the candidate (e.g., won, lost).
- **last, first:** Last and first names of the candidate.
- **v19_20171211:** A standardized candidate name variable, updated as of December 11, 2017.
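
As a rough sketch of how the key variables above might be read, the snippet below parses a few of them with the Python standard library. The sample rows are invented placeholders, the helper names `read_returns` and `total_votes_by_election` are hypothetical, and the code assumes that `year`, `month`, `day`, and `vote` are always present as integers, which may not hold for the real file.

```python
import csv
import io
from datetime import date

# Invented sample rows that reuse the column names listed above;
# the values are placeholders, not real election results.
SAMPLE_CSV = """caseid,year,month,day,sab,candid,vote
1,2016,11,8,XX,100,5000
1,2016,11,8,XX,101,4000
"""


def read_returns(handle):
    """Parse rows, combining year/month/day into one date and casting vote to int."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append(
            {
                "caseid": raw["caseid"],
                "election_date": date(
                    int(raw["year"]), int(raw["month"]), int(raw["day"])
                ),
                "state": raw["sab"],
                "candid": raw["candid"],
                "vote": int(raw["vote"]),
            }
        )
    return rows


def total_votes_by_election(rows):
    """Sum the votes of all candidates that share a caseid."""
    totals = {}
    for row in rows:
        totals[row["caseid"]] = totals.get(row["caseid"], 0) + row["vote"]
    return totals


rows = read_returns(io.StringIO(SAMPLE_CSV))
totals = total_votes_by_election(rows)
```

Grouping by `caseid` here relies on the description of `caseid` as an identifier for each election entry; if the real data assigns it differently, the grouping key would need to change.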
Binary file not shown.
3 changes: 3 additions & 0 deletions notebooks/README.md
@@ -7,3 +7,6 @@
* `MN_EDA.ipynb` : Notebook containing the EDA and visualizations for Minnesota contribution and expenditure data

* `PA_EDA.ipynb` : This notebook contains the EDA for Pennsylvania datasets on contributions, filer information, and expenditure data from 2018-2023.

* `harvard_eda.ipynb`: This notebook contains the EDA for the Harvard datasets on election results from 1967-2016.

960 changes: 960 additions & 0 deletions notebooks/election_dedupe.ipynb

Large diffs are not rendered by default.
