Merge branch 'dev-f23' into az_state_cleaner

uchicago-dsi · Dec 5, 2023 · 330ae38
2 parents 51b490d + 57390b3
Showing 22 changed files with 12,998 additions and 968 deletions.
diff --git a/.gitignore b/.gitignore
@@ -138,3 +138,4 @@ venv.bak/
 
 # data files
 *.avro
+data/*.txt
diff --git a/README.md b/README.md
@@ -1,16 +1,5 @@
 # 2023-fall-clinic-climate-cabinet
 
-## Project Background
-
-- Local politics are vital in enacting climate legislation, but information on local and state political campaign is often under-explored.
-- Climate Cabinate's Hypothesis: Powerful fossil fuel companies' interests do not align with climate-friendly policies in local and state legislature. Therefore, their contribution and the politicians under their fingers are holding us back from achieving green energy goal.
-- What are the impact of fossil fuel companies on local and state political campaign? Which donors are from fossil fuel companies? Which donors are from clean energy companies?
-- Problem Statement: In state and local races in select states, what is the disparity between campaign contributions from fossil fuel and clean energy companies? 
-- Final Goal:
-    1. Develop a graph of campaign finance networks in select states with entities (individuals, parties, PACs, etc.) as nodes and directed edges weighted by monetary connection.
-    2. Classify nodes as ‘Clean energy’, ‘Fossil fuel’, or ‘Other’
-    3. Analyze funding disparities in select races
-
 ## Data Science Clinic Project Goals
 
 1. Collect state's political campaign finance report data which should include
@@ -22,7 +11,6 @@ the conribution made by green energy company versus that by fossil
 fuel company in terms of state's political campaign activity
 
 
-
 ## Usage
 
 ### Docker
@@ -46,63 +34,17 @@ If you prefer to develop inside a container with VS Code then do the following s
 3. Click the blue or green rectangle in the bottom left of VS code (should say something like `><` or `>< WSL`). Options should appear in the top center of your screen. Select `Reopen in Container`.
 
 
-
-
-## Repository Structure
-
-### utils
-Project python code
-
-Files:
-- arizona.py: python code to implement Arizona's state cleaner abstract class
-- michigan.py: python code to implement Michigan's state cleaner abstract class
-- minnesota.py: python code to implement Minnesota's state cleaner abstract class
-- pennsylvania.py: python code to implement Pennsylvania's state cleaner abstract class
-- constants.py: the python script file to store any necessary constants used for state campaign finance data preprocess, clean, and stardandization
-- clean.py: python code for the state cleaner parent class implementation
-- pipeline.py: python code for running the state cleaner for 4 states. It generates the final database (DataFrame) through steps of preprocess, clean, standardize, and create table
-
-
-### notebooks
-Contains short, clean notebooks to demonstrate analysis, including information such as:
-1. Raw dataset format (file format, relational?)
-2. Raw dataset column information (type, content)
-3. Top 10 contributors and top 10 recipients in each state per year
-4. Bar charts to compare contributions by donor type (PAC, individual, etc) and to compare recipients by the office type they are running for
-5. Additional analysis: Yearly trend and possible explanation
-
-Files:
-- AZ_EDA
-- mi_campaign_eda
-- MN_EDA
-- PA_EDA
-
-### data
-
-Contains details of acquiring all raw data used in repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how to the data is obtained.
-
-If the data is larger than 50MB than you should not add it to the repo and instead document how to get the data in the README.md file in the data directory. 
-
-This [README.md file](/data/README.md) should be kept up to date.
-
-### output
-Should contain work product generated by the analysis. Keep in mind that results should (generally) be excluded from the git repository.
-
-Creating a searchable, relational database of Arizona, Michigan, Minnesota, and Pennsylvania campaign finance data to chart money flows from 2018 to 2023
-- individual table: include nidividual recipient and donor information of id, first name, last name, full name, entity type (Individual, Lobbyist), state, party, company
-- organization table: include organizational recipient and donor information of id, name, state, entity type (party, committee, corporation, etc.)
-- transaction table: include contribution and expenditure transaction information of transaction id, donor id, recipient id, year, amount, recipient office sought, purpose, and transaction type
-
-
 ### Project Pipeline
+
 1. Collect state's finance campaign data either from web scraping (AZ, MI, PA) or direct download (MN)
-2. User can go to [the shared Google Drive]('https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce') to download each state's data to their local repo following this format: repo_root / "data" / "file"
-3. Install all the necessary python packages listed in requirements.txt
+2. User can  go to [this shared Google Drive]('https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce') to download each state's data to their local repo following this format: repo_root / "data" / "raw" / <State Initial> / "file"
+3. Open in development container which installs all necessary packages.
 4. Use utils/pipeline.py to preprocess, clean, standardize, and create tables for each state and ultimately concatinate tables across 4 states into a comprehensive database
-5. The final result should be an individual DataFrame, an organization DataFrame, and a transaction DataFrame. They each contain all data in AZ, MI, MN, PA datasets
-6. For future reference, the above pipeline also stores the information mapping given id to our database id (generated via uuid) in a csv file in the format of (state)IDMap.csv
+5. The final result should be an individual DataFrame, an organization DataFrame, and a list of transaction DataFrames. The tables combine all data in AZ, MI, MN, PA datasets
+6. For future reference, the above pipeline also stores the information mapping given id to our database id (generated via uuid) in a csv file in the format of (state)IDMap.csv in the output folder
+
+## Team Members
 
-## Team Member
 Student Name: April Wang
 Student Email: [email protected]
 

diff --git a/data/README.md b/data/README.md
@@ -2,7 +2,6 @@
 
 This directory contains information for use in this project. 
 
-Please make sure to document each source file here.
 #### Arizona Campaign Finance Data
 
 ##### Summary
@@ -72,71 +71,52 @@ contribution data and READMEs in a Google Drive for the duration of this project
 #### Minnesota Campaign Finance Data
 
 ##### Summary
-- The Minnesota Campaign Finance data are publicly available on the 
-[Minnesota Campaign Finance and Public Disclosure Board](https://cfb.mn.gov/reports-and-data/self-help/data-downloads/campaign-finance/) in csv format and has no anti-webscraping defenses. 
+- The Minnesota Campaign Finance data are available in this shared
+[Google Drive](https://drive.google.com/drive/u/2/folders/1uA70woWDhTf3_0F8AbadDa_XIKraCeoc) in zip format and has no anti-webscraping defenses. Please first unzip it and store 12 csv files (10 candidate-recipient contribution dataset, 1 noncandidate-recipient dataset, and 1 expenditure dataset) to local repo in this format: repo root / "data" / "file name"
 
-- However, there is an glitch in the data available through the Data Downloads page above: this dataset does not include contributions reported by the committees of candidates for State Court of Appeals Judge. Consequently, I have utilized an alternative dataset provided by the Minnesota Campaign Finance website developer. This dataset comprises 10 separate CSV files, each documenting contributions made to a specific recipient type from 1998 to 2023. I have consolidated these files into a single dataset to ensure comprehensive coverage.
+- The above dataset is provided by the Minnesota Campaign Finance website developer. This dataset includes 10 separate CSV files, each documenting contributions made to a specific recipient type from 1998 to 2023. This dataset also includes a non-candidate contribution dataset dating back to 1998 and an independent expenditure dataset dating back to 2015.
 
-- The old dataset comprises itemized records of contributions and expenditures made since 2015, specifically including transactions exceeding $200, which aligns with the reporting threshold set at $200 in Minnesota campaign finance regulations. The new dataset itemizes all contributions to candidates from 1998 to 2023.
+- MN dataset comprises itemized records of contributions and expenditures exceeding $200, which aligns with the reporting threshold set at $200 in Minnesota campaign finance regulations. 
 
-- For the purpose of our project I will focus on contribution, not expenditure.
+- For the purpose of our project I will focus on contribution and independent expenditure from 2018 to 2023.
 
 ##### Features
 - Races / Office Sought:
-    1. Governor (GC) 
-    2. Attorney General(AG)
-    3. Secretary of State(SS)
-    4. State Auditor(SA)
-    5. State Treasurer (ST, this office was abolished in 2003 and no longer exists)
-    6. State Senator(Senate)
-    7. State Representative(House)
-    8. State Supreme Court Justice(SC)
-    9. State Appeals Court Judge(AP) 
-    10. State District Court Judge(DC)
-
-- This dataset covers 1998 to present
+    - AG: Attorney General
+    - AP: State Appeals Court Judge
+    - DC: State District Court Judge
+    - GC: Governor
+    - House: State Representative
+    - SA: State Auditor
+    - SC: State Supreme Court Justice
+    - Senate: State Senator
+    - SS: Secretary of State
+    - ST: State Treasurer (this office was abolished in 2003 and no longer exists)
+
+- Donor Types:
+    - I: Individual 
+    - L: Lobbyist  
+    - C: Candidate Committee 
+    - F: Political Committee/Fund  
+    - S: Supporting Association
+    - P: Party Unit
+    - B: Businness
+    - H: Hennepin County Local Candidate Committee
+    - U: Association Not Registered in Board
+    - O: Other 
+    - PTU: Political Party Unit
+    - PCF: Political Committee and Fund
 
 - Trasactions required to report and itemize: Contributions received from any particular source in excess of $200 within a calendar year
 
-- Limitation:
-    1. This new dataset only covers contributions made to candidates, i.e., all recipients are candidates
-    2. Only covers contributions over 200$ by MN campaign finance regulation
-    3. This dataset only dates back to 1998. Pre-1998 is not digitized so access to that data is limited to paper reports.
+- Limitation: Only covers contributions over 200$ by MN campaign finance regulation
 
 
 - Additional information: 
     1. in-kind: Donations of things other than money are in-kind contributions to the receiving entity
     2. For the purpose of our project, I created a separate column of total donation by summing both monetary donation and in-kind donation
-    3. Type and Subtype Acronym:
-        - PCC: Political Contribution Committee
-        - PTU: Political Party Unit
-        - PCF - Political Committee Fund
-        - PF: Political Fund
-        - PC: Political Committee
-        - PCN: Positive Community Norms
-        - PFN: Professional Fundraising Network
-        - IEF: Independent Expenditure Fund
-        - IEC: Independent Expenditure Committee
-        - BC: Ballot Committee
-    4. Recipient Type and Subtype: 
-        - Candidates: Recipient Type PCC 
-        - Party Units: Recipient Type PTU
-        - State Party Units: Recipient Type PTU, Recipient Subtype SPU
-        - Party Unit Caucus Committees: Recipient Type PTU, Recipient subtype CAU
-        - Local Party Units: Recipient Type PTU
-        - Committees and Funds: Recipient Type PCF, Recipient Subtype PF, PC, PCN, PFN, IEF, IEC, BC
-        - Independent Expenditure Committees and Funds: Recipient Type PCF, Recipient Subtype IEF, IE
-    5. Contributors whose total contributions exceed $200 are individually itemized in separate rows. Contributions from donors who each give $200 or less are reported as aggregate totals and are not included in this dataset by definition.
-    6. Contributor/donor Types:
-        - C: Candidate Committee 
-        - I: Individual 
-        - L: Lobbyist  
-        - F: Political Committee/Fund  
-        - S: Self 
-        - P: Party Unit 
-        - H: Registered with Hennepin County 
-        - O: Other 
-    7. The new dataset has 467 missing rows, of which belong to "Registration fee for Netroots event" and have no recipient, donor, or total donation amount.
+    3. Contributors whose total contributions exceed $200 are individually itemized in separate rows. Contributions from donors who each give $200 or less are reported as aggregate totals and are not included in this dataset by definition.
+    4. The dataset has 467 missing rows, of which belong to "Registration fee for Netroots event" and have no recipient, donor, or total donation amount.
 
 
 #### Pennsylvania Campaign Finance Data