Merge branch 'main' into improve-classify.py

uchicago-dsi · May 22, 2024 · f904f4a
2 parents e325e25 + af1f546
Showing 37 changed files with 3,817 additions and 81 deletions.
diff --git a/.git-blame-ignore-revs b/.git-blame-ignore-revs
@@ -5,4 +5,6 @@
 # migrate code style to ruff
 7313695f4c091a3943d17a0abea351987cc02eb6
 # ruff format src/utils/classify_infogroup_data.py
-4e4336ea0ff4af1ec6a84d309f042073b7eea25e
+4e4336ea0ff4af1ec6a84d309f042073b7eea25e
+# fix code style of `collect_harvard_data` branch
+4f978d2082a440f31479ca5cfbec90e8b7683b80
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,26 @@
+Copyright (c) 2024 University of Chicago. All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation
+and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its contributors
+may be used to endorse or promote products derived from this software without
+specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
+USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/README.md b/README.md
@@ -5,9 +5,9 @@
 1. Collect: Gather key states' political campaign finance report data which should include recipient information, donor information, and transaction information.
 2. Transform: Define database schema for storing transaction and entity information and write code to transform and validate raw data to fit appropriate schema.
 3. Clean: Perform record linkage and fix likely data entry errors.
-4. Classify: Label all entities as fossil fuel, clean energy, or other
-5. Graph: Construct a network graph of campaign finance contributions
-6. Analyze: Perform analysis on network data and join with other relevant dataset
+4. Classify: Label all entities as fossil fuel, clean energy, or other.
+5. Graph: Construct a network graph of campaign finance contributions with micro-level and macro-level views.
+6. Analyze: Perform analysis on network data and join with other relevant datasets.
 
 
 ## Setup
@@ -33,24 +33,19 @@ For developing, please use either a Docker dev container or slurm computer clust
 
 ### Network Visualization
 
-# TODO: #101 document what we want to see in the visualization and decide how many types of visual are needed
-
+The network visualizations and their associated metrics are stored in the `output` directory, specifically in [this](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/tree/main/output/network_graphs) folder. The approaches used for these visuals are described in [this](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/main/output/network_graphs/README.md) document.
 
 ## Repository Structure
 
 ### utils
-Project python code
+Project python code.
 
 ### notebooks
-Contains short, clean notebooks to demonstrate analysis.
+Contains short, clean notebooks to demonstrate analysis. The contents of this folder change over time: notebooks are added and removed as the current work requires.
 
 ### data
 
-Contains details of acquiring all raw data used in repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how to the data is obtained.
-
-If the data is larger than 50MB than you should not add it to the repo and instead document how to get the data in the README.md file in the data directory. 
-
-This [README.md file](/data/README.md) should be kept up to date.
+Contains details of acquiring all raw data used in repository.
 
 ### output
 This folder is empty by default. The final outputs of make commands will be placed here by default.
@@ -74,7 +69,7 @@ Student Email: [email protected]
 Student Name: Yangge Xu
 Student Email: [email protected]
 
-Student Name: Bhavya Pandey    
+Student Name: Bhavya Pandey
 Student Email: [email protected]
 
 Student Name: Kaya Lee

diff --git a/data/README.md b/data/README.md
@@ -198,6 +198,8 @@ These companies were listed on the website in a table, which was copy and pasted
 ### How to access: 
 This file is called FFF_oil_companies.csv and can be downloaded from the climate cabinet drive in 2024-spring-clinic folder. 
 
+- Limitation: these are global companies, so not all of them may be applicable to our U.S.-based analysis.
+
 This file should be saved in the path data/raw_classification/FFF_oil_companies.csv
 
 ### Features
@@ -250,3 +252,32 @@ This file is called SIC_codes and and should be downloaded as a csv from the cli
 - SIC_code: the SIC code associated with the company. If the SIC code is shorter than 6 numbers, the code represents the first n numbers of an SIC code
 - SIC_code_description: description associated with the SIC code
 - classification: if the company is fossil fuel (f), clean energy (c), maybe fossil fuel (uf), maybe clean energy (uc)
+
+## State Legislative Election Returns (1967-2016)
+
+### Overview
+This [dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3WZFK9) accompanies the State Legislative Election Returns dataset, which chronicles detailed election outcomes for individual candidates in state legislative races across the United States, covering the period from 1967 to 2016. This extensive dataset allows for historical analysis of electoral trends, candidate performance, and legislative turnover.
+
+### Data Source
+The dataset aggregates data from multiple authoritative sources, including state election boards and historical archives, to ensure comprehensive coverage and accuracy. It provides an invaluable resource for researchers focusing on political science, electoral behavior, and governance.
+# TODO: #106 add a link to where this came from and where it is expected to be saved to run the pipeline
+
+### Features
+- **Temporal Coverage:** Includes data from 1967 to 2016, capturing a broad spectrum of political and historical contexts.
+- **Utility:** Designed to support a wide range of analyses, from simple descriptive statistics to complex longitudinal studies.
+
+### Key Variables
+The dataset comprises several critical variables that capture the essentials of each election:
+- **caseid:** A unique identifier for each election entry.
+- **year, month, day:** The date on which the election was held.
+- **sab:** State abbreviation, indicating the state in which the election took place.
+- **cname:** Name of the county for localized analysis.
+- **candid:** A unique identifier for each candidate.
+- **vote:** The number of votes received by the candidate.
+- **termz:** The actual length of term the elected candidate served.
+- **cand:** Name of the candidate.
+- **sen:** Indicates whether the election was for the state senate.
+- **partyt:** The political party affiliation of the candidate.
+- **outcome:** The result of the election for the candidate (e.g., won, lost).
+- **last, first:** Last and first names of the candidate.
+- **v19_20171211:** A standardized candidate name variable, updated as of December 11, 2017.
diff --git a/data/raw/HV/SLERs1967to2016_20180927_Codebook.docx b/data/raw/HV/SLERs1967to2016_20180927_Codebook.docx
diff --git a/notebooks/README.md b/notebooks/README.md
@@ -7,3 +7,6 @@
 * `MN_EDA.ipynb` : Notebook containing the EDA and visualizations for Minnesota contribution and expenditure data
 
 * `PA_EDA.ipynb` : This notebook contains the EDA for Pennsylvania datasets on contributions, filer information, and expenditure data from 2018-2023.
+
+* `harvard_eda.ipynb`: This notebook contains the EDA for the Harvard datasets on election results from 1967 - 2016
+
diff --git a/notebooks/election_dedupe.ipynb b/notebooks/election_dedupe.ipynb