Az state cleaner #45

Merged
merged 81 commits into from
Dec 5, 2023
Changes from 15 commits
Commits
034d462
need to access dev file
yuzhouw313 Nov 8, 2023
ce30f96
first steps to cleaning
nrposner Nov 11, 2023
5c37e58
adding state cleaner functionality and updating curl crawler
nrposner Nov 13, 2023
87413d2
added state extraction functionality
nrposner Nov 13, 2023
115ddc9
added state validation
nrposner Nov 13, 2023
a69b8bb
fixed case sensitivity for states
nrposner Nov 13, 2023
404025a
first draft of MN abstract class, entity map not done
yuzhouw313 Nov 14, 2023
d45ab06
removed unnecessary crawler element, made more efficient
nrposner Nov 15, 2023
3c88de0
just saving my work, no need for review
Nov 15, 2023
362c222
Merge branch 'dev' into MN_abstract_class
yuzhouw313 Nov 15, 2023
74b6e70
no need to check this commit, doing this before merging with dev
Nov 15, 2023
1f8db78
revised notebook changes
Nov 16, 2023
4e8c853
revised notebook changes
Nov 16, 2023
c9a0c4a
revised EDA after Avery's feedback
Nov 16, 2023
e7c38b3
linter tests passed after Avery's feedback
Nov 16, 2023
27cc164
minor update, commit to merge dev
yuzhouw313 Nov 16, 2023
b8a7756
second draft of MN abstract implement, added entity map
yuzhouw313 Nov 20, 2023
38503ee
restoring older commit to solve commit problems
Nov 21, 2023
be53e64
gitignore
Nov 21, 2023
5f30046
updated crawler, clean_utils, and clean to run smoothly from end to end
nrposner Nov 24, 2023
70383a4
Update
necabotheking Nov 26, 2023
40785c7
updated crawler and cleaner, almost complete end to end
nrposner Nov 27, 2023
6d18723
Merge branch 'dev' into az_state_cleaner
nrposner Nov 27, 2023
013c20f
changed class name to ArizonaCleaner
nrposner Nov 27, 2023
be9921c
Merge branch 'az_state_cleaner' of https://github.com/dsi-clinic/2023…
nrposner Nov 27, 2023
0e916b4
Merge branch 'dev' into MN_abstract_class
yuzhouw313 Nov 27, 2023
09fe283
fixed linter issues and merging conflict
yuzhouw313 Nov 27, 2023
096d1ec
Merge remote-tracking branch 'refs/remotes/origin/MN_abstract_class' …
yuzhouw313 Nov 27, 2023
a5f23f6
Update constants.py
necabotheking Nov 27, 2023
fd2721f
fixed constant.py linter test
yuzhouw313 Nov 27, 2023
c5d375d
update raw data google drive link
yuzhouw313 Nov 27, 2023
9e4d021
updated some docstrings and info, addressing comments still in progress
nrposner Nov 27, 2023
ff4c83f
Delete utils/PA_constants.py
averyschoen Nov 27, 2023
526b3a2
Implemented UUID mapping
necabotheking Nov 27, 2023
ca486e4
Delete notebooks/PennsylvaniaCleaner.py
averyschoen Nov 28, 2023
83f29e1
commiting changes before merging
Nov 29, 2023
ccd06f8
Finished create_organizations and create_individuals()
necabotheking Nov 30, 2023
fdcbc47
finish MichiganCleaner() and rename EDA notebook
necabotheking Nov 30, 2023
d627c23
finished minnesota.py and tested in jupyter notebook, updated dev, ut…
yuzhouw313 Nov 30, 2023
a1d1e66
updated notebook descriptions
nrposner Dec 1, 2023
1cf748f
updated AZ_EDA notebook to access needed data
nrposner Dec 1, 2023
2d2cbc1
update on PACleaner thus far. Still working on create_Tables
Dec 1, 2023
b2c473e
Delete utils/pennsylvania_helper_functions.py
alankagiri Dec 1, 2023
88d5ea8
made many changes for functionality and according to comments, employ…
nrposner Dec 1, 2023
a64d2a8
Delete utils/mn_state_cleaner.py
averyschoen Dec 1, 2023
17633f4
Merge branch 'dev' of github.com:dsi-clinic/2023-fall-clinic-climate-…
Dec 1, 2023
fc74fe2
Merge branch 'Pennsylvania_State_Cleaner' of github.com:dsi-clinic/20…
Dec 1, 2023
73e7578
rework michigan cleaner and add ID_MAP output
necabotheking Dec 3, 2023
fc4de31
fix transactions bug and linter error
necabotheking Dec 3, 2023
819fddc
preprocess done, create_tables almost done
Dec 3, 2023
54e5413
-linter check passed for pennsylvania.py
Dec 3, 2023
2ad79f5
-had to git rm PennsylvaniaCleaner.py to pass linter tests
Dec 3, 2023
6c09a75
moved the cleaner to its own file, updated crawler, cleaner, and add-ons
nrposner Dec 3, 2023
c28f285
updated filepaths and cleaner to run demo files
nrposner Dec 3, 2023
9317cea
updated some docstrings, fixed some bugs, moved towards schema
nrposner Dec 3, 2023
7c86a8c
changed name from arizona_cleaner to arizona
nrposner Dec 3, 2023
475b096
added note about readme
nrposner Dec 3, 2023
af8605e
added utils readme
nrposner Dec 3, 2023
1019f1e
updated readme
nrposner Dec 3, 2023
af60a51
remove functions and uncomment commented filepaths
necabotheking Dec 4, 2023
6a1438c
improved code quality based on Nico's input and updated dev README
yuzhouw313 Dec 4, 2023
7b19a74
fixed minor issue in creating mapping table csv
yuzhouw313 Dec 4, 2023
aaad2c6
Merge MN_abstract_class into dev-f23
trevorspreadbury Dec 4, 2023
01c104e
Merge remote-tracking branch 'origin/michigan-statecleaner' into dev-f23
trevorspreadbury Dec 4, 2023
500e67a
updated filepaths and setup
nrposner Dec 4, 2023
602862c
updated readme
nrposner Dec 4, 2023
b02ae29
Merge remote-tracking branch 'origin/Pennsylvania_State_Cleaner' into…
trevorspreadbury Dec 4, 2023
ecf212f
Merge remote-tracking branch 'origin/az_state_cleaner' into dev-f23
trevorspreadbury Dec 4, 2023
ac48cb2
ran minnesota on ipython with the whole dataset and produced right ou…
yuzhouw313 Dec 4, 2023
33a677a
updated readme
nrposner Dec 5, 2023
e909b85
Delete utils/arizona_cleaner.py
averyschoen Dec 5, 2023
28af64a
Delete utils/README.md
averyschoen Dec 5, 2023
5c79dd6
Update README.md
averyschoen Dec 5, 2023
51b490d
uncommented arizonacleaner in pipeline.py and imported
nrposner Dec 5, 2023
a832404
Update pipeline.py
averyschoen Dec 5, 2023
dcbf6ea
Update description in create_tables()
averyschoen Dec 5, 2023
930b2b8
Update clean.py
averyschoen Dec 5, 2023
261fa60
update for linter tests
Dec 5, 2023
40f92a1
Merge branch 'dev-f23' into MN_abstract_class
averyschoen Dec 5, 2023
57390b3
Merge pull request #51 from dsi-clinic/MN_abstract_class
averyschoen Dec 5, 2023
330ae38
Merge branch 'dev-f23' into az_state_cleaner
averyschoen Dec 5, 2023
97 changes: 34 additions & 63 deletions notebooks/AZ_EDA.ipynb
@@ -12,8 +12,8 @@
"import plotly.express as px\n",
"import plotly.graph_objects as go\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\")\n",
"warnings.filterwarnings('ignore')\n",
"from utils.helper_fns import pre_process_az\n",
"\n",
"ind_bla, pac_bla, org_bla, bla_cands = pre_process_az()"
]
@@ -203,7 +203,7 @@
}
],
"source": [
"# looking at just proper contributors, ignoring 'multiple contributors' in first place\n",
"#looking at just proper contributors, ignoring 'multiple contributors' in first place\n",
"ind_bla[1:11]"
]
},
@@ -364,12 +364,8 @@
}
],
"source": [
"# looking at all expenses by individual contributors, PACs and organizations\n",
"top_expenses = (\n",
" ind_bla.append(org_bla)\n",
" .append(pac_bla)\n",
" .sort_values(by=\"total_spending\", ascending=False)\n",
")\n",
"#looking at all expenses by individual contributors, PACs and organizations\n",
"top_expenses = ind_bla.append(org_bla).append(pac_bla).sort_values(by=\"total_spending\", ascending=False)\n",
"\n",
"top_expenses.head(11)"
]
@@ -1318,16 +1314,11 @@
}
],
"source": [
"# means\n",
"#means\n",
"\n",
"grp_cands = cands_14_22.groupby(by=\"Office Summary\").mean()\n",
"\n",
"fig = px.bar(\n",
" x=list(grp_cands.index),\n",
" y=grp_cands[\"Expense\"].values,\n",
" title=\"Mean Candidate Income by Office, 2014-22\",\n",
" labels={\"x\": \"Year\", \"y\": \"US Dollars\"},\n",
")\n",
"fig = px.bar(x=list(grp_cands.index), y=grp_cands[\"Expense\"].values, title=\"Mean Candidate Income by Office, 2014-22\", labels = {\"x\":\"Year\", \"y\":\"US Dollars\"})\n",
"\n",
"fig.show()"
]
@@ -2278,12 +2269,7 @@
"source": [
"grp_cands = cands_14_22.groupby(by=\"Office Summary\").sum()\n",
"\n",
"fig = px.bar(\n",
" x=list(grp_cands.index),\n",
" y=grp_cands[\"Expense\"].values,\n",
" title=\"Gross Candidate Income by Office, 2014-22\",\n",
" labels={\"x\": \"Year\", \"y\": \"US Dollars\"},\n",
")\n",
"fig = px.bar(x=list(grp_cands.index), y=grp_cands[\"Expense\"].values, title=\"Gross Candidate Income by Office, 2014-22\", labels = {\"x\":\"Year\", \"y\":\"US Dollars\"})\n",
"\n",
"fig.show()"
]
@@ -3246,16 +3232,11 @@
}
],
"source": [
"# recipient income by year\n",
"#recipient income by year\n",
"\n",
"by_year_cands_sum = cands_14_22.groupby(by=\"Year\").sum()\n",
"\n",
"fig = px.bar(\n",
" x=list(by_year_cands_sum.index),\n",
" y=by_year_cands_sum[\"Income\"].values,\n",
" title=\"Gross Candidate Income by Year, 2014-22\",\n",
" labels={\"x\": \"Year\", \"y\": \"US Dollars\"},\n",
")\n",
"fig = px.bar(x=list(by_year_cands_sum.index), y=by_year_cands_sum[\"Income\"].values, title=\"Gross Candidate Income by Year, 2014-22\", labels = {\"x\":\"Year\", \"y\":\"US Dollars\"})\n",
"\n",
"fig.show()"
]
@@ -4173,32 +4154,26 @@
}
],
"source": [
"# individual contributions by year\n",
"#individual contributions by year\n",
"\n",
"grp = inds_14_22.groupby(by=\"Year\").sum()\n",
"\n",
"grp[\"total_spending\"].values\n",
"\n",
"fig = go.Figure(\n",
" data=[\n",
" go.Bar(\n",
" name=\"Individual\",\n",
" x=list(range(2014, 2023)),\n",
" y=grp[\"total_spending\"].values,\n",
" )\n",
" ]\n",
")\n",
"fig = go.Figure(data = [\n",
" go.Bar(name = \"Individual\", x = list(range(2014, 2023)), y = grp[\"total_spending\"].values, )\n",
"])\n",
"\n",
"fig.update_layout(\n",
" title=\"Gross Individual Contributions by Year, 2014-22\",\n",
" xaxis_title=\"Year\",\n",
" yaxis_title=\"US Dollars\",\n",
" # legend_title=\"Legend Title\",\n",
" # font=dict(\n",
" # family=\"Courier New, monospace\",\n",
" # size=18,\n",
" # color=\"RebeccaPurple\"\n",
" # )\n",
"# legend_title=\"Legend Title\",\n",
"# font=dict(\n",
"# family=\"Courier New, monospace\",\n",
"# size=18,\n",
"# color=\"RebeccaPurple\"\n",
"# )\n",
")\n",
"\n",
"fig.show()"
@@ -5169,40 +5144,36 @@
}
],
"source": [
"# overall expenses by type by year\n",
"#overall expenses by type by year\n",
"\n",
"donors_14_22 = (\n",
" org_14_22[[\"Name\", \"total_spending\", \"type\", \"Year\"]]\n",
" .append(pac_14_22[[\"Name\", \"total_spending\", \"type\", \"Year\"]])\n",
" .append(inds_14_22[[\"Name\", \"total_spending\", \"type\", \"Year\"]])\n",
")\n",
"donors_14_22 = org_14_22[[\"Name\", \"total_spending\", \"type\", \"Year\"]].append(pac_14_22[[\"Name\", \"total_spending\", \"type\", \"Year\"]]).append(inds_14_22[[\"Name\", \"total_spending\", \"type\", \"Year\"]])\n",
"\n",
"donors_14_22.groupby(by=[\"type\", \"Year\"]).sum()\n",
"donors_14_22.groupby(by = [\"type\", \"Year\"]).sum()\n",
"\n",
"years = [\"2014\", \"2015\", \"2016\", \"2017\", \"2018\", \"2019\", \"2020\", \"2021\", \"2022\"]\n",
"years = ['2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022']\n",
"\n",
"yrs = []\n",
"\n",
"df = donors_14_22.groupby(by=[\"type\", \"Year\"]).sum()\n",
"df=donors_14_22.groupby(by = [\"type\", \"Year\"]).sum()\n",
"\n",
"for i in range(26):\n",
" yrs.append(df.take([i]).values[0][0])\n",
"\n",
"fig = go.Figure(\n",
" data=[\n",
" go.Bar(name=\"Individual\", x=years, y=yrs[0:9]),\n",
" go.Bar(name=\"Organization\", x=years, y=[yrs[9]] + [0] + list(yrs[10:17])),\n",
" go.Bar(name=\"PAC\", x=years, y=yrs[17:26]),\n",
" ]\n",
")\n",
"fig = go.Figure(data = [\n",
" go.Bar(name = \"Individual\", x = years, y = yrs[0:9]),\n",
" go.Bar(name = \"Organization\", x = years, y = [yrs[9]]+[0]+list(yrs[10:17])),\n",
" go.Bar(name = \"PAC\", x = years, y = yrs[17:26])\n",
"])\n",
"\n",
"fig.update_layout(\n",
" title=\"Gross Overall Expenses by Donor Type by Year, 2014-22\",\n",
" xaxis_title=\"Year\",\n",
" yaxis_title=\"US Dollars\",\n",
"\n",
")\n",
"\n",
"fig.show()"
"fig.show()\n",
"\n"
]
},
{
@@ -5232,7 +5203,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
"version": "3.10.12"
}
},
"nbformat": 4,
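Reviewer note on the `AZ_EDA.ipynb` cells in this diff: `top_expenses` and `donors_14_22` are built with chained `DataFrame.append` calls, which pandas deprecated in 1.4 and removed in 2.0, and the last plotting cell reads totals by hard-coded position (`range(26)`, `yrs[9]`, a manually inserted `0`). Below is a minimal sketch of one possible alternative, assuming the three frames share the `Name`, `total_spending`, `type` and `Year` columns shown in the diff. The sample frames and the helper names `combine_donors` and `spending_by_type_and_year` are hypothetical and not part of this PR.

```python
import pandas as pd

# Columns the notebook selects from each donor frame.
COLS = ["Name", "total_spending", "type", "Year"]

# Hypothetical stand-ins for inds_14_22, org_14_22 and pac_14_22.
inds = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "total_spending": [100.0, 50.0, 25.0],
    "type": ["Individual"] * 3,
    "Year": [2014, 2015, 2015],
})
orgs = pd.DataFrame({
    "Name": ["Acme"],
    "total_spending": [300.0],
    "type": ["Organization"],
    "Year": [2014],
})
pacs = pd.DataFrame({
    "Name": ["PAC One", "PAC One"],
    "total_spending": [10.0, 20.0],
    "type": ["PAC"] * 2,
    "Year": [2014, 2015],
})


def combine_donors(*frames):
    """Stack donor frames with pd.concat instead of chained .append calls."""
    return pd.concat([frame[COLS] for frame in frames], ignore_index=True)


def spending_by_type_and_year(donors, years):
    """Return a type-by-year table of summed spending, zero-filled where no rows exist."""
    totals = donors.groupby(["type", "Year"])["total_spending"].sum()
    # Move the Year level into columns; missing (type, year) pairs become 0.
    table = totals.unstack("Year", fill_value=0)
    # Make sure every requested year has a column, even if no donor gave that year.
    return table.reindex(columns=years, fill_value=0)


donors = combine_donors(orgs, pacs, inds)
table = spending_by_type_and_year(donors, [2014, 2015, 2016])
print(table)
```

With a table like this, each bar trace could take `y=table.loc[donor_type]`, so a donor type with no rows for a given year shows up as zero without manual padding or positional slicing. Whether `Year` is stored as integers or strings in the real data is not visible in the diff, so the `years` argument would need to match the actual column type.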
5 changes: 4 additions & 1 deletion notebooks/README.md
@@ -7,11 +7,14 @@ This should contain information about what is done in each notebook

* `az_webcrawler_3.ipynb` : This is a notebook of test code, including the final code used in `az_curl_crawler.py`

* `arizona_scraper_proof_of_concept` is a notebook containing proof of cocenpt for a curl-based webcrawler, which was later expanded on in `az_webcrawler_3.ipynb` and finally used to make `az_curl_crawler.py`
* `arizona_scraper_proof_of_concept` is a notebook containing proof of concept for a curl-based webcrawler, which was later expanded on in `az_webcrawler_3.ipynb` and finally used to make `az_curl_crawler.py`

* `mi_campaign_eda.ipynb`: This notebook contains the exploratory data analysis of the Michigan campaign contribution datasets, with a dropdown that allows the user to select different years to view.

* `mi_campaign_expenditure.ipynb`: This notebook contains the exploratory data analysis of the Michigan campaign expenditure datasets, with a dropdown that allows the user to select different years to view.

* `AZ_EDA` : A notebook containing the EDA and plots for Arizona.

* `PA_EDA.ipynb` : This notebook contains the EDA for Pennsylvania datasets on contributions, filer information, and expenditure data from 2018-2023.

* `az_cleaner_scratch.ipynb` and `az_cleaner_scratch_thanksgiving.ipynb` are notebooks which were used to test parallel versions of the code which concluded in `clean.py` and `cleaner_utils.py`