makefile path #69

Closed · wants to merge 273 commits

Commits (273)
c446aaf
updates to get_likely_name function after feedback to consider genera…
Jan 20, 2024
efc02e2
adjusted the sample usage output to single quotes as per Avery's sugg…
Jan 20, 2024
6c37c45
took care of empty strings that were adding extra whitespace to o output
Jan 20, 2024
81e52db
took care of empty strings that were adding extra whitespace to output
Jan 20, 2024
2dcb7d9
fixed error in sample usage output
Jan 20, 2024
a51c337
added explanation of jaro-winkler and reversed strings
Jan 24, 2024
98a1058
fixing linter error
Jan 24, 2024
d711a26
adding usaddress to requirements.txt file
adilkassim Jan 24, 2024
0ddd8b0
updated function with additional test cases
adilkassim Jan 24, 2024
94265ed
Merge pull request #10 from dsi-clinic/street_from_address_line_1
averyschoen Jan 24, 2024
adfa889
Merge branch 'main' into string_similarity
averyschoen Jan 24, 2024
6f882dd
fix for linter
Jan 24, 2024
762bbaf
Merge pull request #8 from dsi-clinic/string_similarity
averyschoen Jan 24, 2024
c8cdcaf
added row similarity and row match functions. these do not yet have t…
Jan 25, 2024
edfda3e
fixing linter error
Jan 25, 2024
4ca96ee
fixing merges
Jan 25, 2024
fa78221
fixing linter errors
Jan 25, 2024
427d5b6
fixing pytest errors
Jan 25, 2024
236bfc7
revised and added test case for calculate_row_similarity
Jan 25, 2024
f3e3b22
fixing pytest string error
Jan 25, 2024
26b9095
fixed some more pytest errors
Jan 25, 2024
20f4e93
adding cleaning_company_column function
adilkassim Jan 25, 2024
7e7fba3
Updated linkage.py
npashilkar Jan 26, 2024
fcb9a70
Merge branch 'main' into line_1_from_full_address
averyschoen Jan 26, 2024
87be7e2
run precommit
Jan 26, 2024
baf56f5
testing if merge was done correctly after git pull
Jan 29, 2024
ec5f9fd
checking merge outcome from previous git pull after resetting to last …
Jan 29, 2024
7bd294a
Merge branch 'main' of github.com:uchicago-dsi/climate-cabinet-campai…
Jan 29, 2024
d12970b
Merge branch 'uchicago-dsi-main' into get_likely_name_func
Jan 29, 2024
6e5f656
updated sample usage
npashilkar Jan 29, 2024
3d6500c
undoing the mistake of previous commit where I committed files from t…
Jan 29, 2024
90872c0
fixing last linter error for function sample output
Jan 29, 2024
ca8b3f7
standardizing corporate names function
npashilkar Jan 30, 2024
d3d3ebf
fixing row similarity test syntax
Jan 30, 2024
85021df
adding backspaces to fix string literals
Jan 30, 2024
44b89b3
fixing typos
Jan 30, 2024
1a39b72
trying out ChatGPT's recommendation to fix the pytest error
Jan 30, 2024
498009f
resolving linter error
Jan 30, 2024
663f08d
corp names function update
npashilkar Jan 31, 2024
45008e6
Merge pull request #11 from dsi-clinic/line_1_from_full_address
averyschoen Jan 31, 2024
1ab1d42
updated corp names
npashilkar Jan 31, 2024
b5f764a
addressed comments, but address function back in
Jan 31, 2024
df1dbd1
fixing linter error
Jan 31, 2024
6aad87e
moved dict to constants file
npashilkar Jan 31, 2024
5b4de8c
updated constants file
npashilkar Jan 31, 2024
e4fe9fc
updated constants file
npashilkar Jan 31, 2024
844d20e
updated constants file
npashilkar Jan 31, 2024
976fc3f
updated function
adilkassim Jan 31, 2024
87ea3da
Adding Avery's feedback
Jan 31, 2024
23a8c1f
Adding Avery's feedback
Jan 31, 2024
4081715
saving personal work before merging, no need to look or review @Avery…
Jan 31, 2024
585f63e
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Jan 31, 2024
50537f9
updating requirements.txt to include names-dataset package
adilkassim Jan 31, 2024
3fcbc5b
precommit checks
npashilkar Jan 31, 2024
fe540b6
Merge pull request #18 from dsi-clinic/standardizing-corporate-names
averyschoen Jan 31, 2024
f07dae2
get address number from line 1 function
npashilkar Jan 31, 2024
b21fd52
initial name_rank function
adilkassim Jan 31, 2024
8849f46
get address number from line 1 function
npashilkar Jan 31, 2024
d0086ef
get address number from line 1 function
npashilkar Jan 31, 2024
5f65159
attempt so far at dedup
Feb 1, 2024
28c0034
edited function
adilkassim Feb 1, 2024
71a3174
attempt so far at dedup
Feb 1, 2024
56cde5f
attempt so far at dedup
Feb 1, 2024
72eeffb
progress on dedup function
Feb 1, 2024
161a175
updates on linkage doc, ignore notebooks/Test.ipynb
Feb 1, 2024
8a75d81
Merge pull request #21 from dsi-clinic/get-building-number-from-address
averyschoen Feb 1, 2024
b519fa1
modifications to dedup function, not yet done, no need to review yet
Feb 1, 2024
de9bb3a
merging results for local to remote branches
Feb 1, 2024
4ac551f
passing pre-commits and doctests
adilkassim Feb 2, 2024
38ee4bc
Merge branch 'main' into cleaning_company_column
averyschoen Feb 2, 2024
37dcbf7
Update linkage.py
averyschoen Feb 2, 2024
7f9135f
finished dedup function with helper function to output to a csv_file …
Feb 4, 2024
fb10654
updated function
adilkassim Feb 5, 2024
29ee6bb
made modifications to the deduplication function
Feb 6, 2024
cfa15d0
received a git push error stating that the tip of my branch is behind…
Feb 6, 2024
3d26fde
trying to see what the git branch issues are...no need to review this…
Feb 7, 2024
9646030
Merge pull request #16 from dsi-clinic/cleaning_company_column
averyschoen Feb 7, 2024
5843485
implementing PR feedback
Feb 8, 2024
97b89dd
addressing linter tests failure due to formatting
Feb 8, 2024
61c731f
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Feb 8, 2024
7dc5b70
updating requirements.txt
adilkassim Feb 8, 2024
a3310a1
adding pre_process pipeline function
adilkassim Feb 8, 2024
270d532
fixed error in row_matches
nrposner Feb 13, 2024
99dc781
Merge branch 'main' into row_similarity
nrposner Feb 14, 2024
7b3e8f0
fixing linter errors
nrposner Feb 14, 2024
6655192
updates to dedup file and beginning steps on networkx
Feb 14, 2024
b24041d
Delete notebooks/Test.ipynb
averyschoen Feb 14, 2024
cbe4d1e
Merge pull request #20 from dsi-clinic/clean_rm_unnecessary_row_info
averyschoen Feb 14, 2024
869a2ea
(not complete) splink
npashilkar Feb 14, 2024
21e8575
trying to fix disconnect
Feb 14, 2024
495db81
updated classify
Feb 15, 2024
dbaad50
updated name_rank function
adilkassim Feb 15, 2024
fb48450
splink to .py
npashilkar Feb 15, 2024
a943442
Merge branch 'main' into splink-library-notebook
npashilkar Feb 15, 2024
fbc579c
Update linkage.py
averyschoen Feb 15, 2024
6a41aa0
discovered logic error in dedup function...no need to review yet
Feb 15, 2024
2e28eec
file that tests my deduplicate function
Feb 15, 2024
5621652
file that tests my deduplicate function
Feb 15, 2024
cfb6d26
testing if path to complete_orgs_table.csv is working
Feb 18, 2024
e8f2200
Merge branch 'main' into row_similarity
nrposner Feb 19, 2024
d9356a2
added test for row_matches
Feb 19, 2024
256caf7
changes to linkage
Feb 19, 2024
899263f
merging linkage
Feb 19, 2024
1d11b52
fixing linter issues
Feb 19, 2024
2c03548
fixing test
Feb 19, 2024
1b90fc4
fixing linter errors
Feb 19, 2024
3dd4b4d
fixing typo
Feb 19, 2024
4df8236
fixing typo again
Feb 19, 2024
7aab13e
added match_confidence function
Feb 19, 2024
8796fa6
fixing linter
Feb 19, 2024
f932563
removed duplicate test
Feb 19, 2024
ce54095
updating classifier
Feb 19, 2024
ff02e3d
fixing linter
Feb 19, 2024
caa3f99
Merge pull request #15 from dsi-clinic/row_similarity
averyschoen Feb 19, 2024
e8d246d
Merge branch 'main' into name_uniqueness
adilkassim Feb 19, 2024
4e35327
slight formatting changes
adilkassim Feb 19, 2024
0ed5538
Merge pull request #23 from dsi-clinic/name_uniqueness
averyschoen Feb 19, 2024
c3c8def
preprocess file and function initial commit
adilkassim Feb 19, 2024
97e78ae
update readme
Feb 19, 2024
71cbb37
adding tests to appropriate winter repo
Feb 19, 2024
204c330
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Feb 19, 2024
cccc7cc
slight edits
adilkassim Feb 19, 2024
25eaf60
fixing linter errors
Feb 19, 2024
57c6070
removing preprocess function from linkage.py
adilkassim Feb 19, 2024
2776636
slight changes
adilkassim Feb 19, 2024
d3df75b
changing branches, no need to review
Feb 19, 2024
531453c
changing branches, no need to review
Feb 19, 2024
75082e4
finishing up with dedup func
Feb 19, 2024
1ea09b4
Renaming File
adilkassim Feb 19, 2024
8be737b
Merge pull request #27 from dsi-clinic/afs/update_readme
trevorspreadbury Feb 19, 2024
6d5cee8
update on function to add nodes and their attributes to graph
Feb 19, 2024
e92192b
checking for issue with linter test
Feb 19, 2024
6e04344
Delete tests/tester.ipynb
averyschoen Feb 20, 2024
2ce3d66
Merge pull request #26 from dsi-clinic/dedup_func_testing
averyschoen Feb 21, 2024
976d2af
Saving notebook on networkx
Feb 21, 2024
462cbbc
Merge branch 'main' into preprocess
adilkassim Feb 21, 2024
9534af9
combine test files
Feb 21, 2024
f91ee0e
precommit
Feb 21, 2024
06e91e0
modified get_likely_name function to accommodate non-str inputs
Feb 21, 2024
2a30961
Delete tests directory
averyschoen Feb 21, 2024
9399f10
Merge pull request #30 from dsi-clinic/afs/fix_testing_files
averyschoen Feb 21, 2024
2bea6e7
finishing merge process, no need to review
Feb 21, 2024
e007f3c
splink notebook
npashilkar Feb 21, 2024
89af1cb
splink function clean-up
npashilkar Feb 21, 2024
811165a
Merge branch 'main' into splink-library-notebook
npashilkar Feb 21, 2024
a5eb7a1
splink function clean-up
npashilkar Feb 22, 2024
8a43ca5
Merge branch 'splink-library-notebook' of https://github.com/dsi-clin…
npashilkar Feb 22, 2024
ae1db64
splink function clean-up2
npashilkar Feb 22, 2024
21af2c9
updates
adilkassim Feb 22, 2024
4d7bdfb
adding output csv
adilkassim Feb 22, 2024
fa8c0da
Saving work on networkx branch
Feb 22, 2024
1e4a550
Saving work on networkx branch
Feb 22, 2024
0a043a9
updating docstring of dedup func based on feedback
Feb 22, 2024
cd94c08
pipeline progress so far on network linkage
Feb 24, 2024
22607e7
saving changes in networkx, no need for review
Feb 24, 2024
661feff
updated column names and docstring of dedup func based on Avery's fee…
Feb 24, 2024
4425afe
Merge pull request #31 from dsi-clinic/get_likely_name_func
averyschoen Feb 25, 2024
fa011a7
Merge pull request #32 from dsi-clinic/dedup_func_testing
averyschoen Feb 25, 2024
d58795b
splink output edit
npashilkar Feb 26, 2024
d0f36b6
saving Networkx work before merge...no need to review
Feb 26, 2024
2595098
concluding merge
Feb 26, 2024
f8df69f
saving work for merge, no need to review
Feb 26, 2024
7b2ca08
splink output edits
npashilkar Feb 27, 2024
42ca58e
pipeline changes
adilkassim Feb 28, 2024
77bc2b3
adding removed files
adilkassim Feb 28, 2024
485fe43
basics of makefile and added classify fns
Feb 28, 2024
e4b3a0a
linter fixes
Feb 28, 2024
3ce0c50
modifying classify to fit makefile
Feb 28, 2024
517f909
linter fixes
Feb 28, 2024
244fe94
make should run classification properly
Feb 28, 2024
c273a17
moved names to constants
Feb 28, 2024
9519d67
linter fixes
Feb 28, 2024
0611585
added classification wrapper
Feb 28, 2024
f8c4dc1
linter fix
Feb 28, 2024
3c61937
proper updates
adilkassim Feb 28, 2024
4e32543
removing duplicated function
adilkassim Feb 28, 2024
d94243a
attempting to pass dev checks
adilkassim Feb 28, 2024
4336d3b
modified readme
Feb 28, 2024
df41e42
reformatting files
adilkassim Feb 28, 2024
c687295
add usage instructions
Feb 28, 2024
ddfd126
Merge pull request #34 from dsi-clinic/update_readme
averyschoen Feb 28, 2024
f363bbe
splink changes + deleted notebook
npashilkar Feb 28, 2024
1901341
moved to original makefile
Feb 29, 2024
a7db7d8
Delete utils/Makefile
averyschoen Feb 29, 2024
56a78d8
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Feb 29, 2024
9561f82
Merge pull request #33 from dsi-clinic/make_file
averyschoen Feb 29, 2024
b626fc8
Update linkage.py
averyschoen Feb 29, 2024
807a69d
Merge branch 'main' into splink-library-notebook
npashilkar Feb 29, 2024
1fc2a2a
splink output edits
npashilkar Feb 29, 2024
ecd73d0
Merge pull request #25 from dsi-clinic/splink-library-notebook
averyschoen Feb 29, 2024
bfefce6
Merge branch 'main' into preprocess
adilkassim Feb 29, 2024
26d4773
classify function
adilkassim Feb 29, 2024
609220d
saving work for graph work. No need to review yet
Mar 3, 2024
3266ce7
slight changes
adilkassim Mar 4, 2024
0cdc61a
pulling from main and pipeline additions
adilkassim Mar 4, 2024
d262dee
possible splink implementation fix
adilkassim Mar 4, 2024
6e12bac
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Mar 4, 2024
5a81b23
graph work so far with plotly
Mar 4, 2024
b377acd
Test notebook with functions for merging datasets, no need to review,…
Mar 4, 2024
b8da98e
updating splink function
adilkassim Mar 4, 2024
0185093
pipeline updates
adilkassim Mar 4, 2024
f05778b
passing linter
adilkassim Mar 4, 2024
6de450d
linter
adilkassim Mar 4, 2024
96b8e0b
updated network graph work
Mar 4, 2024
51cc9de
updated classify test
Mar 4, 2024
4cc7ce4
fix pytest
Mar 4, 2024
7f3483c
updated classify and test_classifier
Mar 4, 2024
94f807c
Revert "fix pytest"
Mar 4, 2024
d62f3b7
Revert "updated classify test"
Mar 4, 2024
621e35a
expanded docstrings for classify
Mar 4, 2024
cdf035a
updated visualizations for the graph
Mar 4, 2024
74996a5
updates to the README files under the output and data directories
Mar 4, 2024
0cebc4c
latest version of networkx work
Mar 5, 2024
4625248
linkage.py clean up including additions to constants.py
npashilkar Mar 5, 2024
133dadc
addressing comments on classify and test_classify
Mar 5, 2024
feda102
Delete utils/tests/test_classifier.py
averyschoen Mar 5, 2024
863cfab
Merge pull request #36 from dsi-clinic/update_classify
averyschoen Mar 5, 2024
9c5ff3c
making revisions to data/README and network.py per Avery's feedback
Mar 5, 2024
cb9613c
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Mar 5, 2024
18a52ff
making revisions to data/README and network.py per Avery's feedback
Mar 5, 2024
743b306
updating readme and makefile as well as location of data for linkage_…
Mar 5, 2024
793b8af
removing unnecessary tests
npashilkar Mar 5, 2024
a571d91
slight update to splink_dedupe function
adilkassim Mar 5, 2024
1db2839
pre-commit fixes
adilkassim Mar 5, 2024
083f92f
last minute modifications to network file. final version
Mar 5, 2024
09aca55
removing main() from file
Mar 5, 2024
269998c
removing main() from file
Mar 5, 2024
d6167df
updated README.md to show networkX portion of the pipeline
Mar 5, 2024
8d62939
saving Test.ipynb work
Mar 5, 2024
c9752a0
Delete notebooks/Test.ipynb
averyschoen Mar 5, 2024
0f7d07e
Update Makefile
averyschoen Mar 5, 2024
3c2005f
Merge pull request #29 from dsi-clinic/networkx_record_linkage
averyschoen Mar 5, 2024
039768b
matching local branch with main
Mar 5, 2024
0af314c
Merge pull request #37 from dsi-clinic/linkage-code-clean-up
averyschoen Mar 5, 2024
e0147df
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Mar 6, 2024
d06da14
Merge branch 'main' into preprocess
adilkassim Mar 6, 2024
0a3b4e7
slight modifications to linkage.py for cleaning purposes
Mar 6, 2024
9f980ff
slight modifications to linkage.py for cleaning purposes, now passing…
Mar 6, 2024
7ebe2a2
slight changes
adilkassim Mar 6, 2024
51f82d4
revert changes to standardize_corp_names...the logic goes through man…
Mar 6, 2024
9a03521
renaming file
adilkassim Mar 6, 2024
d4161f6
updating functions to latest versions
adilkassim Mar 6, 2024
45347e2
slight changes to match function changes in linkage.py
adilkassim Mar 6, 2024
24ef142
Merge pull request #39 from dsi-clinic/networkx_record_linkage
averyschoen Mar 6, 2024
ad2ed0f
slight changes
adilkassim Mar 6, 2024
4b0de47
readme changes
adilkassim Mar 6, 2024
0c79023
data/ readme changes
adilkassim Mar 6, 2024
48470c2
pre-commit formatting changes
adilkassim Mar 6, 2024
3fbf913
Merge pull request #28 from dsi-clinic/preprocess
averyschoen Mar 6, 2024
bee198a
Update Makefile
adilkassim Mar 6, 2024
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
@@ -1,5 +1,5 @@
{
"name": "2023-fall-clinic-climate-cabinet-devcontainer",
"name": "2024-winter-clinic-climate-cabinet-devcontainer",
"build": {
"dockerfile": "../Dockerfile",
"context": "..",
11 changes: 9 additions & 2 deletions Makefile
@@ -7,8 +7,8 @@ current_abs_path := $(subst Makefile,,$(mkfile_path))

# pipeline constants
# PROJECT_NAME
project_image_name := "2023-fall-clinic-climate-cabinet"
project_container_name := "2023-fall-clinic-climate-cabinet-container"
project_image_name := "2024-winter-clinic-climate-cabinet"
project_container_name := "2024-winter-clinic-climate-cabinet-container"
project_dir := "$(current_abs_path)"

# environment variables
@@ -29,3 +29,10 @@ run-notebooks:
jupyter lab --port=8888 --ip='*' --NotebookApp.token='' --NotebookApp.password='' \
--no-browser --allow-root


#running the linkage pipeline and creating the network graph
#still waiting on linkage_pipeline completion to get this into final shape

run-linkage-and-network-pipeline:
docker build -t $(project_image_name) -f Dockerfile $(current_abs_path)
docker run -v $(current_abs_path):/project -t $(project_image_name) python utils/linkage_and_network_pipeline.py
43 changes: 33 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
# 2023-fall-clinic-climate-cabinet
# 2024-winter-clinic-climate-cabinet

## Data Science Clinic Project Goals

@@ -34,28 +34,51 @@ If you prefer to develop inside a container with VS Code then do the following s
3. Click the blue or green rectangle in the bottom left of VS code (should say something like `><` or `>< WSL`). Options should appear in the top center of your screen. Select `Reopen in Container`.


### Project Pipeline
### Data Collection and Standardization Pipeline
1. Collect the data through **<span style="color: red;">one</span>** of the steps below
a. Collect state's finance campaign data either from web scraping (AZ, MI, PA) or direct download (MN) OR
b. Go to the [Project's Google Drive]('https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce') to download each state's data to their local repo following this format: repo_root / "data" / "raw" / <State Initial> / "file"
b. Go to the [Project's Google Drive]('https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce') to download each state's data to their local repo following this format: repo_root / "data" / "raw" / state acronym / "file"
2. Open in development container which installs all necessary packages.
3. Run the project with ```python utils/pipeline.py``` or ```python3 utils/pipeline.py``` to run the processing pipeline that cleans, standardizes, and concatenates the individuals, organizations, and transactions data into one comprehensive database.
5. running ```pipeline.py``` returns the tables to the output folder as csv files containing the complete individuals, organizations, and transactions DataFrames combining the AZ, MI, MN, and PA datasets.
5. Running ```pipeline.py``` returns the tables to the output folder as csv files containing the complete individuals, organizations, and transactions DataFrames combining the AZ, MI, MN, and PA datasets.
6. For future reference, the above pipeline also stores the information mapping given id to our database id (generated via uuid) in a csv file in the format of (state)IDMap.csv (example: ArizonaIDMap.csv) in the output folder

## Team Members
### Record Linkage and Network Pipeline
1. Save the standardized tables "complete_individuals_table.csv", "complete_organizations_table.csv", and "complete_transactions_table.csv" (collected from the above pipeline or data from the project's Google Drive) in the following format: repo_root / "output" / "file"
2. **UPDATE:** Run the pipeline by calling ```make run-linkage-and-network-pipeline```. This pipeline will perform conservative record linkage, attempt to classify entities as neutral, fossil fuels, or clean energy, convert the standardized tables into a NetworkX Graph, and show an interactive network visual.
3. The pipeline will output the deduplicated tables saved as "cleaned_individuals_table.csv", "cleaned_organizations_table.csv", and "cleaned_transactions_table.csv". A mapping file, "deduplicated_UUIDs", tracks the UUIDs designated as duplicates. The pipeline will also output "Network Graph Node Data", which is the NetworkX Graph object converted into an adjacency list (a loading sketch follows this file's diff).

Student Name: April Wang
Student Email: [email protected]
## Repository Structure

### utils
Project python code

### notebooks
Contains short, clean notebooks to demonstrate analysis.

### data

Contains details of acquiring all raw data used in the repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how the data is obtained.

If the data is larger than 50MB then you should not add it to the repo; instead, document how to get the data in the README.md file in the data directory.

This [README.md file](/data/README.md) should be kept up to date.

### output
This folder is empty by default. The final outputs of the Makefile will be placed here, consisting of a NetworkX Graph object and a txt file containing graph metrics.



## Team Members

Student Name: Nicolas Posner
Student Email: [email protected]

Student Name: Aïcha Camara
Student Email: [email protected]

Student Name: Alan Kagiri
Student Email: [email protected].

Student Name: Adil Kassim
Student Email: [email protected]

Student Name: Nayna Pashilkar
Student Email: [email protected]
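
Step 3 of the Record Linkage and Network Pipeline section above says the final graph is exported as an adjacency list ("Network Graph Node Data"). The snippet below is a minimal, hypothetical sketch of reading such an export back into NetworkX; the file name and location are assumptions taken from that README text, not confirmed pipeline output.

```python
# Hypothetical sketch: load the exported adjacency list back into a NetworkX graph.
# The path "output/Network Graph Node Data" is an assumption based on the README text above.
import networkx as nx

graph = nx.read_adjlist("output/Network Graph Node Data")
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```
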
39 changes: 0 additions & 39 deletions notebooks/Test.ipynb

This file was deleted.

1 change: 1 addition & 0 deletions output/README.md
@@ -1,2 +1,3 @@
# Output README
---
'deduplicated_UUIDs.csv': Following the record linkage work in the record_linkage pipeline, this file stores all of the original uuids and indicates the uuids to which the deduplicated uuids have been matched.
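
As a rough illustration of how this mapping file might be consumed downstream, here is a small pandas sketch; the column names "original_uuid" and "mapped_uuid" are hypothetical placeholders, not names confirmed by the pipeline output.

```python
# Hypothetical sketch only: the column names below are placeholders, not the actual schema.
import pandas as pd

mapping = pd.read_csv("output/deduplicated_UUIDs.csv")
uuid_map = dict(zip(mapping["original_uuid"], mapping["mapped_uuid"]))
print(f"{len(uuid_map)} duplicate uuids point to a canonical uuid")
```
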
8 changes: 8 additions & 0 deletions requirements.txt
@@ -17,3 +17,11 @@ beautifulsoup4==4.11.1
numpy==1.25.0
Requests==2.31.0
setuptools==68.0.0
textdistance==4.6.1
usaddress==0.5.4
nameparser==1.1.3
names-dataset==3.1.0
networkx~=3.1
networkx~=3.1
splink==3.9.12
names-dataset==3.1.0
2 changes: 1 addition & 1 deletion setup.py
@@ -1,7 +1,7 @@
from setuptools import find_packages, setup

setup(
name="2023-fall-clinic-climate-cabinet",
name="2024-winter-clinic-climate-cabinet",
version="0.1.0",
packages=find_packages(
include=[
10 changes: 9 additions & 1 deletion utils/README.md
@@ -70,4 +70,12 @@ Util functions for MN EDA
classify the donor entities in the expenditures.
3. The Contributors datasets have 4 kinds of recipient entities: lobbyists,
candidates, committees, and nan. In order to fit the entries within the
schema, I code nan entries as 'Organization'
schema, I code nan entries as 'Organization'

#### classify.py
1. These functions take in the deduplicated and cleaned individuals and organizations
dataframes from the deduplication and linkage pipeline.
2. We classify based on substrings known to indicate clean energy or fossil fuels groups.
In particular, individuals are classified based on their employment by fossil fuel companies,
and organizations are classified by their names, prioritizing high-profile corporations/PACs
and those found by a manual search of the largest donors/recipients in the dataset.
107 changes: 107 additions & 0 deletions utils/classify.py
@@ -0,0 +1,107 @@
import pandas as pd

from utils.constants import c_org_names, f_companies, f_org_names


def classify_wrapper(
individuals_df: pd.DataFrame, organizations_df: pd.DataFrame
):
"""Wrapper for classification in linkage pipeline

Initialize the classify column in both dataframes and
call sub-functions classifying individuals and organizations

Args:
individuals_df: cleaned and deduplicated dataframe of individuals
organizations_df: cleaned and deduplicated dataframe of organizations

Returns:
individuals and organizations dataframes with a new
'classification' column containing 'neutral', 'f', or 'c'.
'neutral' status is the default for all entities, and those tagged
as 'neutral' are entities which we could not confidently identify as
either fossil fuel or clean energy organizations or affiliates.
Classification is very conservative, and we are very confident that
entities classified as one group or another are related to them.

"""

individuals_df["classification"] = "neutral"
organizations_df["classification"] = "neutral"

classified_individuals = classify_individuals(individuals_df)
classified_orgs = classify_orgs(organizations_df)

return classified_individuals, classified_orgs


def matcher(df: pd.DataFrame, substring: str, column: str, category: str):
"""Applies a label to the classification column based on substrings

We run through a given column containing strings in the dataframe. We
seek out rows containing substrings, and apply a certain label to
the classification column. We initialize using the 'neutral' label and
use the 'f' and 'c' labels to denote fossil fuel and clean energy
entities respectively.

Args:
df: a pandas dataframe
substring: the string to search for
column: the column name in which to search
category: the category to assign to the row, such as 'f', 'c', or 'neutral'

Returns:
A pandas dataframe in which rows matching the substring conditions in
a certain column are marked with the appropriate category
"""

bool_series = df[column].str.contains(substring, na=False)

df.loc[bool_series, "classification"] = category

return df


def classify_individuals(individuals_df: pd.DataFrame):
"""Part of the classification pipeline

We check if individuals work for a known fossil fuel company
and categorize them using the matcher() function.

Args:
individuals_df: a dataframe containing deduplicated
standardized individuals data

Returns:
an individuals dataframe updated with the fossil fuels category
"""

for i in f_companies:
individuals_df = matcher(individuals_df, i, "company", "f")

return individuals_df


def classify_orgs(organizations_df: pd.DataFrame):
"""Part of the classification pipeline

We apply the matcher function to the organizations dataframe
repeatedly, using a variety of substrings to identify fossil
fuel and clean energy companies.

Args:
organizations_df: a dataframe containing deduplicated
standardized organizations data

Returns:
an organizations dataframe updated with the fossil fuels
and clean energy category
"""

for i in f_org_names:
organizations_df = matcher(organizations_df, i, "name", "f")

for i in c_org_names:
organizations_df = matcher(organizations_df, i, "name", "c")

return organizations_df
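
For reference, a minimal usage sketch of the matcher helper defined above; the organization names and the substring "Oil" are made-up examples (not entries from utils/constants.py), and the import assumes the utils package is on the Python path.

```python
# Usage sketch for matcher(); sample data and substring are illustrative only.
import pandas as pd

from utils.classify import matcher

orgs = pd.DataFrame({"name": ["Acme Oil & Gas PAC", "Sunrise Solar Alliance"]})
orgs["classification"] = "neutral"  # default label, as in classify_wrapper
orgs = matcher(orgs, "Oil", "name", "f")
print(orgs)
#                      name classification
# 0      Acme Oil & Gas PAC              f
# 1  Sunrise Solar Alliance        neutral
```
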