makefile path #69

Closed · wants to merge 273 commits

Commits (273)
c446aaf
updates to get_likely_name function after feedback to consider genera…
Jan 20, 2024
efc02e2
adjusted the sample usage output to single quotes as per Avery's sugg…
Jan 20, 2024
6c37c45
took care of empty strings that were adding extra whitespace to o output
Jan 20, 2024
81e52db
took care of empty strings that were adding extra whitespace to output
Jan 20, 2024
2dcb7d9
fixed error in sample usage output
Jan 20, 2024
a51c337
added explanation of jaro-winkler and reversed strings
Jan 24, 2024
98a1058
fixing linter error
Jan 24, 2024
d711a26
adding usaddress to requirements.txt file
adilkassim Jan 24, 2024
0ddd8b0
updated function with additional test cases
adilkassim Jan 24, 2024
94265ed
Merge pull request #10 from dsi-clinic/street_from_address_line_1
averyschoen Jan 24, 2024
adfa889
Merge branch 'main' into string_similarity
averyschoen Jan 24, 2024
6f882dd
fix for linter
Jan 24, 2024
762bbaf
Merge pull request #8 from dsi-clinic/string_similarity
averyschoen Jan 24, 2024
c8cdcaf
added row similarity and row match functions. these do not yet have t…
Jan 25, 2024
edfda3e
fixing linter error
Jan 25, 2024
4ca96ee
fixing merges
Jan 25, 2024
fa78221
fixing linter errors
Jan 25, 2024
427d5b6
fixing pytest errors
Jan 25, 2024
236bfc7
revised and added test case for calculate_row_similarity
Jan 25, 2024
f3e3b22
fixing pytest string error
Jan 25, 2024
26b9095
fixed some more pytest errors
Jan 25, 2024
20f4e93
adding cleaning_company_column function
adilkassim Jan 25, 2024
7e7fba3
Updated linkage.py
npashilkar Jan 26, 2024
fcb9a70
Merge branch 'main' into line_1_from_full_address
averyschoen Jan 26, 2024
87be7e2
run precommit
Jan 26, 2024
baf56f5
testing if merge was done correctly after git pull
Jan 29, 2024
ec5f9fd
checking merge outcome from previous git pull after resetting to last …
Jan 29, 2024
7bd294a
Merge branch 'main' of github.com:uchicago-dsi/climate-cabinet-campai…
Jan 29, 2024
d12970b
Merge branch 'uchicago-dsi-main' into get_likely_name_func
Jan 29, 2024
6e5f656
updated sample usage
npashilkar Jan 29, 2024
3d6500c
undoing the mistake of previous commit where I committed files from t…
Jan 29, 2024
90872c0
fixing last linter error for function sample output
Jan 29, 2024
ca8b3f7
standardizing corporate names function
npashilkar Jan 30, 2024
d3d3ebf
fixing row similarity test syntax
Jan 30, 2024
85021df
adding backspaces to fix string literals
Jan 30, 2024
44b89b3
fixing typos
Jan 30, 2024
1a39b72
trying out ChatGPT's recommendation to fix the pytest error
Jan 30, 2024
498009f
resolving linter error
Jan 30, 2024
663f08d
corp names function update
npashilkar Jan 31, 2024
45008e6
Merge pull request #11 from dsi-clinic/line_1_from_full_address
averyschoen Jan 31, 2024
1ab1d42
updated corp names
npashilkar Jan 31, 2024
b5f764a
addressed comments, but address function back in
Jan 31, 2024
df1dbd1
fixing linter error
Jan 31, 2024
6aad87e
moved dict to constants file
npashilkar Jan 31, 2024
5b4de8c
updated constants file
npashilkar Jan 31, 2024
e4fe9fc
updated constants file
npashilkar Jan 31, 2024
844d20e
updated constants file
npashilkar Jan 31, 2024
976fc3f
updated function
adilkassim Jan 31, 2024
87ea3da
Adding Avery's feedback
Jan 31, 2024
23a8c1f
Adding Avery's feedback
Jan 31, 2024
4081715
saving personal work before merging, no need to look or review @Avery…
Jan 31, 2024
585f63e
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Jan 31, 2024
50537f9
updating requirements.txt to include names-dataset package
adilkassim Jan 31, 2024
3fcbc5b
precommit checks
npashilkar Jan 31, 2024
fe540b6
Merge pull request #18 from dsi-clinic/standardizing-corporate-names
averyschoen Jan 31, 2024
f07dae2
get address number from line 1 function
npashilkar Jan 31, 2024
b21fd52
initial name_rank function
adilkassim Jan 31, 2024
8849f46
get address number from line 1 function
npashilkar Jan 31, 2024
d0086ef
get address number from line 1 function
npashilkar Jan 31, 2024
5f65159
attempt so far at dedup
Feb 1, 2024
28c0034
edited function
adilkassim Feb 1, 2024
71a3174
attempt so far at dedup
Feb 1, 2024
56cde5f
attempt so far at dedup
Feb 1, 2024
72eeffb
progress on dedup function
Feb 1, 2024
161a175
updates on linkage doc, ignore notebooks/Test.ipynb
Feb 1, 2024
8a75d81
Merge pull request #21 from dsi-clinic/get-building-number-from-address
averyschoen Feb 1, 2024
b519fa1
modifications to dedup function, not yet done, no need to review yet
Feb 1, 2024
de9bb3a
merging results for local to remote branches
Feb 1, 2024
4ac551f
passing pre-commits and doctests
adilkassim Feb 2, 2024
38ee4bc
Merge branch 'main' into cleaning_company_column
averyschoen Feb 2, 2024
37dcbf7
Update linkage.py
averyschoen Feb 2, 2024
7f9135f
finished dedup function with helper function to output to a csv_file …
Feb 4, 2024
fb10654
updated function
adilkassim Feb 5, 2024
29ee6bb
made modifications to the deduplication function
Feb 6, 2024
cfa15d0
received a git push error stating that the tip of my branch is behind…
Feb 6, 2024
3d26fde
trying to see what the git branch issues are...no need to review this…
Feb 7, 2024
9646030
Merge pull request #16 from dsi-clinic/cleaning_company_column
averyschoen Feb 7, 2024
5843485
implementing PR feedback
Feb 8, 2024
97b89dd
addressing linter tests failure due to formatting
Feb 8, 2024
61c731f
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Feb 8, 2024
7dc5b70
updating requirements.txt
adilkassim Feb 8, 2024
a3310a1
adding pre_process pipeline function
adilkassim Feb 8, 2024
270d532
fixed error in row_matches
nrposner Feb 13, 2024
99dc781
Merge branch 'main' into row_similarity
nrposner Feb 14, 2024
7b3e8f0
fixing linter errors
nrposner Feb 14, 2024
6655192
updates to dedup file and beginning steps on networkx
Feb 14, 2024
b24041d
Delete notebooks/Test.ipynb
averyschoen Feb 14, 2024
cbe4d1e
Merge pull request #20 from dsi-clinic/clean_rm_unnecessary_row_info
averyschoen Feb 14, 2024
869a2ea
(not complete) splink
npashilkar Feb 14, 2024
21e8575
trying to fix disconnect
Feb 14, 2024
495db81
updated classify
Feb 15, 2024
dbaad50
updated name_rank function
adilkassim Feb 15, 2024
fb48450
splink to .py
npashilkar Feb 15, 2024
a943442
Merge branch 'main' into splink-library-notebook
npashilkar Feb 15, 2024
fbc579c
Update linkage.py
averyschoen Feb 15, 2024
6a41aa0
discovered logic error in dedup function...no need to review yet
Feb 15, 2024
2e28eec
file that tests my deduplicate function
Feb 15, 2024
5621652
file that tests my deduplicate function
Feb 15, 2024
cfb6d26
testing if path to complete_orgs_table.csv is working
Feb 18, 2024
e8f2200
Merge branch 'main' into row_similarity
nrposner Feb 19, 2024
d9356a2
added test for row_matches
Feb 19, 2024
256caf7
changes to linkage
Feb 19, 2024
899263f
merging linkage
Feb 19, 2024
1d11b52
fixing linter issues
Feb 19, 2024
2c03548
fixing test
Feb 19, 2024
1b90fc4
fixing linter errors
Feb 19, 2024
3dd4b4d
fixing typo
Feb 19, 2024
4df8236
fixing typo again
Feb 19, 2024
7aab13e
added match_confidence function
Feb 19, 2024
8796fa6
fixing linter
Feb 19, 2024
f932563
removed duplicate test
Feb 19, 2024
ce54095
updating classifier
Feb 19, 2024
ff02e3d
fixing linter
Feb 19, 2024
caa3f99
Merge pull request #15 from dsi-clinic/row_similarity
averyschoen Feb 19, 2024
e8d246d
Merge branch 'main' into name_uniqueness
adilkassim Feb 19, 2024
4e35327
slight formatting changes
adilkassim Feb 19, 2024
0ed5538
Merge pull request #23 from dsi-clinic/name_uniqueness
averyschoen Feb 19, 2024
c3c8def
preprocess file and function initial commit
adilkassim Feb 19, 2024
97e78ae
update readme
Feb 19, 2024
71cbb37
adding tests to appropriate winter repo
Feb 19, 2024
204c330
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Feb 19, 2024
cccc7cc
slight edits
adilkassim Feb 19, 2024
25eaf60
fixing linter errors
Feb 19, 2024
57c6070
removing preprocess function from linkage.py
adilkassim Feb 19, 2024
2776636
slight changes
adilkassim Feb 19, 2024
d3df75b
changing branches, no need to review
Feb 19, 2024
531453c
changing branches, no need to review
Feb 19, 2024
75082e4
finishing up with dedup func
Feb 19, 2024
1ea09b4
Renaming File
adilkassim Feb 19, 2024
8be737b
Merge pull request #27 from dsi-clinic/afs/update_readme
trevorspreadbury Feb 19, 2024
6d5cee8
update on function to add nodes and their attributes to graph
Feb 19, 2024
e92192b
checking for issue with linter test
Feb 19, 2024
6e04344
Delete tests/tester.ipynb
averyschoen Feb 20, 2024
2ce3d66
Merge pull request #26 from dsi-clinic/dedup_func_testing
averyschoen Feb 21, 2024
976d2af
Saving notebook on networkx
Feb 21, 2024
462cbbc
Merge branch 'main' into preprocess
adilkassim Feb 21, 2024
9534af9
combine test files
Feb 21, 2024
f91ee0e
precommit
Feb 21, 2024
06e91e0
modified get_likely_name function to accommodate non-str inputs
Feb 21, 2024
2a30961
Delete tests directory
averyschoen Feb 21, 2024
9399f10
Merge pull request #30 from dsi-clinic/afs/fix_testing_files
averyschoen Feb 21, 2024
2bea6e7
finishing merge process, no need to review
Feb 21, 2024
e007f3c
splink notebook
npashilkar Feb 21, 2024
89af1cb
splink function clean-up
npashilkar Feb 21, 2024
811165a
Merge branch 'main' into splink-library-notebook
npashilkar Feb 21, 2024
a5eb7a1
splink function clean-up
npashilkar Feb 22, 2024
8a43ca5
Merge branch 'splink-library-notebook' of https://github.com/dsi-clin…
npashilkar Feb 22, 2024
ae1db64
splink function clean-up2
npashilkar Feb 22, 2024
21af2c9
updates
adilkassim Feb 22, 2024
4d7bdfb
adding output csv
adilkassim Feb 22, 2024
fa8c0da
Saving work on networkx branch
Feb 22, 2024
1e4a550
Saving work on networkx branch
Feb 22, 2024
0a043a9
updating docstring of dedup func based on feedback
Feb 22, 2024
cd94c08
pipeline progress so far on network linkage
Feb 24, 2024
22607e7
saving changes in networkx, no need for review
Feb 24, 2024
661feff
updated column names and docstring of dedup func based on Avery's fee…
Feb 24, 2024
4425afe
Merge pull request #31 from dsi-clinic/get_likely_name_func
averyschoen Feb 25, 2024
fa011a7
Merge pull request #32 from dsi-clinic/dedup_func_testing
averyschoen Feb 25, 2024
d58795b
splink output edit
npashilkar Feb 26, 2024
d0f36b6
saving Networkx work before merge...no need to review
Feb 26, 2024
2595098
concluding merge
Feb 26, 2024
f8df69f
saving work for merge, no need to review
Feb 26, 2024
7b2ca08
splink output edits
npashilkar Feb 27, 2024
42ca58e
pipeline changes
adilkassim Feb 28, 2024
77bc2b3
adding removed files
adilkassim Feb 28, 2024
485fe43
basics of makefile and added classify fns
Feb 28, 2024
e4b3a0a
linter fixes
Feb 28, 2024
3ce0c50
modifying classify to fit makefile
Feb 28, 2024
517f909
linter fixes
Feb 28, 2024
244fe94
make should run classification properly
Feb 28, 2024
c273a17
moved names to constants
Feb 28, 2024
9519d67
linter fixes
Feb 28, 2024
0611585
added classification wrapper
Feb 28, 2024
f8c4dc1
linter fix
Feb 28, 2024
3c61937
proper updates
adilkassim Feb 28, 2024
4e32543
removing duplicated function
adilkassim Feb 28, 2024
d94243a
attempting to pass dev checks
adilkassim Feb 28, 2024
4336d3b
modified readme
Feb 28, 2024
df41e42
reformatting files
adilkassim Feb 28, 2024
c687295
add usage instructions
Feb 28, 2024
ddfd126
Merge pull request #34 from dsi-clinic/update_readme
averyschoen Feb 28, 2024
f363bbe
splink changes + deleted notebook
npashilkar Feb 28, 2024
1901341
moved to original makefile
Feb 29, 2024
a7db7d8
Delete utils/Makefile
averyschoen Feb 29, 2024
56a78d8
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Feb 29, 2024
9561f82
Merge pull request #33 from dsi-clinic/make_file
averyschoen Feb 29, 2024
b626fc8
Update linkage.py
averyschoen Feb 29, 2024
807a69d
Merge branch 'main' into splink-library-notebook
npashilkar Feb 29, 2024
1fc2a2a
splink output edits
npashilkar Feb 29, 2024
ecd73d0
Merge pull request #25 from dsi-clinic/splink-library-notebook
averyschoen Feb 29, 2024
bfefce6
Merge branch 'main' into preprocess
adilkassim Feb 29, 2024
26d4773
classify function
adilkassim Feb 29, 2024
609220d
saving work for graph work. No need to review yet
Mar 3, 2024
3266ce7
slight changes
adilkassim Mar 4, 2024
0cdc61a
pulling from main and pipeline additions
adilkassim Mar 4, 2024
d262dee
possible splink implementation fix
adilkassim Mar 4, 2024
6e12bac
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Mar 4, 2024
5a81b23
graph work so far with plotly
Mar 4, 2024
b377acd
Test notebook with functions for merging datasets, no need to review,…
Mar 4, 2024
b8da98e
updating splink function
adilkassim Mar 4, 2024
0185093
pipeline updates
adilkassim Mar 4, 2024
f05778b
passing linter
adilkassim Mar 4, 2024
6de450d
linter
adilkassim Mar 4, 2024
96b8e0b
updated network graph work
Mar 4, 2024
51cc9de
updated classify test
Mar 4, 2024
4cc7ce4
fix pytest
Mar 4, 2024
7f3483c
updated classify and test_classifier
Mar 4, 2024
94f807c
Revert "fix pytest"
Mar 4, 2024
d62f3b7
Revert "updated classify test"
Mar 4, 2024
621e35a
expanded docstrings for classify
Mar 4, 2024
cdf035a
updated visualizations for the graph
Mar 4, 2024
74996a5
updates to the README files under the output and data directories
Mar 4, 2024
0cebc4c
latest version of networkx work
Mar 5, 2024
4625248
linkage.py clean up including additions to constants.py
npashilkar Mar 5, 2024
133dadc
addressing comments on classify and test_classify
Mar 5, 2024
feda102
Delete utils/tests/test_classifier.py
averyschoen Mar 5, 2024
863cfab
Merge pull request #36 from dsi-clinic/update_classify
averyschoen Mar 5, 2024
9c5ff3c
making revisions to data/README and network.py per Avery's feedback
Mar 5, 2024
cb9613c
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Mar 5, 2024
18a52ff
making revisions to data/README and network.py per Avery's feedback
Mar 5, 2024
743b306
updating readme and makefile as well as location of data for linkage_…
Mar 5, 2024
793b8af
removing unnecessary tests
npashilkar Mar 5, 2024
a571d91
slight update to splink_dedupe function
adilkassim Mar 5, 2024
1db2839
pre-commit fixes
adilkassim Mar 5, 2024
083f92f
last minute modifications to network file. final version
Mar 5, 2024
09aca55
removing main() from file
Mar 5, 2024
269998c
removing main() from file
Mar 5, 2024
d6167df
updated README.md to show networkX portion of the pipeline
Mar 5, 2024
8d62939
saving Test.ipynb work
Mar 5, 2024
c9752a0
Delete notebooks/Test.ipynb
averyschoen Mar 5, 2024
0f7d07e
Update Makefile
averyschoen Mar 5, 2024
3c2005f
Merge pull request #29 from dsi-clinic/networkx_record_linkage
averyschoen Mar 5, 2024
039768b
matching local branch with main
Mar 5, 2024
0af314c
Merge pull request #37 from dsi-clinic/linkage-code-clean-up
averyschoen Mar 5, 2024
e0147df
Merge branch 'main' of https://github.com/dsi-clinic/2024-winter-clim…
Mar 6, 2024
d06da14
Merge branch 'main' into preprocess
adilkassim Mar 6, 2024
0a3b4e7
slight modifications to linkage.py for cleaning purposes
Mar 6, 2024
9f980ff
slight modifications to linkage.py for cleaning purposes, now passing…
Mar 6, 2024
7ebe2a2
slight changes
adilkassim Mar 6, 2024
51f82d4
revert changes to standardize_corp_names...the logic goes through man…
Mar 6, 2024
9a03521
renaming file
adilkassim Mar 6, 2024
d4161f6
updating functions to latest versions
adilkassim Mar 6, 2024
45347e2
slight changes to match function changes in linkage.py
adilkassim Mar 6, 2024
24ef142
Merge pull request #39 from dsi-clinic/networkx_record_linkage
averyschoen Mar 6, 2024
ad2ed0f
slight changes
adilkassim Mar 6, 2024
4b0de47
readme changes
adilkassim Mar 6, 2024
0c79023
data/ readme changes
adilkassim Mar 6, 2024
48470c2
pre-commit formatting changes
adilkassim Mar 6, 2024
3fbf913
Merge pull request #28 from dsi-clinic/preprocess
averyschoen Mar 6, 2024
bee198a
Update Makefile
adilkassim Mar 6, 2024
2 changes: 1 addition & 1 deletion .devcontainer/devcontainer.json
@@ -1,5 +1,5 @@
{
"name": "2023-fall-clinic-climate-cabinet-devcontainer",
"name": "2024-winter-clinic-climate-cabinet-devcontainer",
"build": {
"dockerfile": "../Dockerfile",
"context": "..",
11 changes: 9 additions & 2 deletions Makefile
@@ -7,8 +7,8 @@ current_abs_path := $(subst Makefile,,$(mkfile_path))

# pipeline constants
# PROJECT_NAME
project_image_name := "2023-fall-clinic-climate-cabinet"
project_container_name := "2023-fall-clinic-climate-cabinet-container"
project_image_name := "2024-winter-clinic-climate-cabinet"
project_container_name := "2024-winter-clinic-climate-cabinet-container"
project_dir := "$(current_abs_path)"

# environment variables
@@ -29,3 +29,10 @@ run-notebooks:
jupyter lab --port=8888 --ip='*' --NotebookApp.token='' --NotebookApp.password='' \
--no-browser --allow-root


#running the linkage pipeline and creating the network graph
#still waiting on linkage_pipeline completion to get this into final shape

run-linkage-and-network-pipeline:
docker build -t $(project_image_name) -f Dockerfile $(current_abs_path)
docker run -v $(current_abs_path):/project -t $(project_image_name) python utils/linkage_and_network_pipeline.py
43 changes: 33 additions & 10 deletions README.md
@@ -1,4 +1,4 @@
# 2023-fall-clinic-climate-cabinet
# 2024-winter-clinic-climate-cabinet

## Data Science Clinic Project Goals

@@ -34,28 +34,51 @@ If you prefer to develop inside a container with VS Code then do the following s
3. Click the blue or green rectangle in the bottom left of VS code (should say something like `><` or `>< WSL`). Options should appear in the top center of your screen. Select `Reopen in Container`.


### Project Pipeline
### Data Collection and Standardization Pipeline
1. Collect the data through **<span style="color: red;">one</span>** of the steps below
a. Collect state's finance campaign data either from web scraping (AZ, MI, PA) or direct download (MN) OR
b. Go to the [Project's Google Drive]('https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce') to download each state's data to their local repo following this format: repo_root / "data" / "raw" / <State Initial> / "file"
b. Go to the [Project's Google Drive]('https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce') to download each state's data to their local repo following this format: repo_root / "data" / "raw" / state acronym / "file"
2. Open in development container which installs all necessary packages.
3. Run the project with ```python utils/pipeline.py``` or ```python3 utils/pipeline.py``` to run the processing pipeline that cleans, standardizes, and concatenates the individuals, organizations, and transactions data into one comprehensive database.
5. running ```pipeline.py``` returns the tables to the output folder as csv files containing the complete individuals, organizations, and transactions DataFrames combining the AZ, MI, MN, and PA datasets.
5. Running ```pipeline.py``` returns the tables to the output folder as csv files containing the complete individuals, organizations, and transactions DataFrames combining the AZ, MI, MN, and PA datasets.
6. For future reference, the above pipeline also stores the information mapping given id to our database id (generated via uuid) in a csv file in the format of (state)IDMap.csv (example: ArizonaIDMap.csv) in the output folder

## Team Members
### Record Linkage and Network Pipeline
1. Save the standardized tables "complete_individuals_table.csv", "complete_organizations_table.csv", and "complete_transactions_table.csv" (collected from the above pipeline or data from the project's Google Drive) in the following format: repo_root / "output" / "file"
2. **UPDATE:** Run the pipeline by calling ```make run-linkage-and-network-pipeline```. This pipeline will perform conservative record linkage, attempt to classify entities as neutral, fossil fuels, or clean energy, convert the standardized tables into a NetworkX Graph, and show an interactive network visual.
3. The pipeline will output the deduplicated tables saved as "cleaned_individuals_table.csv", "cleaned_organizations_table.csv", and "cleaned_transactions_table.csv". A mapping file, "deduplicated_UUIDs", tracks the UUIDs designated as duplicates. The pipeline will also output "Network Graph Node Data", which is the NetworkX Graph object converted into an adjacency list (a loading sketch follows this file's diff).

Student Name: April Wang
Student Email: [email protected]
## Repository Structure

### utils
Project python code

### notebooks
Contains short, clean notebooks to demonstrate analysis.

### data

Contains details of acquiring all raw data used in the repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how the data is obtained.

If the data is larger than 50MB then you should not add it to the repo; instead, document how to get the data in the README.md file in the data directory.

This [README.md file](/data/README.md) should be kept up to date.

### output
This folder is empty by default. The final outputs of the Makefile will be placed here, consisting of a NetworkX Graph object and a txt file containing graph metrics.



## Team Members

Student Name: Nicolas Posner
Student Email: [email protected]

Student Name: Aïcha Camara
Student Email: [email protected]

Student Name: Alan Kagiri
Student Email: [email protected].

Student Name: Adil Kassim
Student Email: [email protected]

Student Name: Nayna Pashilkar
Student Email: [email protected]
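
Step 3 of the Record Linkage and Network Pipeline section above says the final graph is exported as an adjacency list ("Network Graph Node Data"). The snippet below is a minimal, hypothetical sketch of reading such an export back into NetworkX; the file name and location are assumptions taken from that README text, not confirmed pipeline output.

```python
# Hypothetical sketch: load the exported adjacency list back into a NetworkX graph.
# The path "output/Network Graph Node Data" is an assumption based on the README text above.
import networkx as nx

graph = nx.read_adjlist("output/Network Graph Node Data")
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```
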
39 changes: 0 additions & 39 deletions notebooks/Test.ipynb

This file was deleted.

1 change: 1 addition & 0 deletions output/README.md
@@ -1,2 +1,3 @@
# Output README
---
'deduplicated_UUIDs.csv': Following the record linkage work in the record_linkage pipeline, this file stores all of the original uuids and indicates the uuids to which the deduplicated uuids have been matched.
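
As a rough illustration of how this mapping file might be consumed downstream, here is a small pandas sketch; the column names "original_uuid" and "mapped_uuid" are hypothetical placeholders, not names confirmed by the pipeline output.

```python
# Hypothetical sketch only: the column names below are placeholders, not the actual schema.
import pandas as pd

mapping = pd.read_csv("output/deduplicated_UUIDs.csv")
uuid_map = dict(zip(mapping["original_uuid"], mapping["mapped_uuid"]))
print(f"{len(uuid_map)} duplicate uuids point to a canonical uuid")
```
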
8 changes: 8 additions & 0 deletions requirements.txt
@@ -17,3 +17,11 @@ beautifulsoup4==4.11.1
numpy==1.25.0
Requests==2.31.0
setuptools==68.0.0
textdistance==4.6.1
usaddress==0.5.4
nameparser==1.1.3
names-dataset==3.1.0
networkx~=3.1
networkx~=3.1
splink==3.9.12
names-dataset==3.1.0
2 changes: 1 addition & 1 deletion setup.py
@@ -1,7 +1,7 @@
from setuptools import find_packages, setup

setup(
name="2023-fall-clinic-climate-cabinet",
name="2024-winter-clinic-climate-cabinet",
version="0.1.0",
packages=find_packages(
include=[
10 changes: 9 additions & 1 deletion utils/README.md
@@ -70,4 +70,12 @@ Util functions for MN EDA
classify the donor entities in the expenditures.
3. The Contributors datasets have 4 kinds of recipient entities: lobbyists,
candidates, committees, and nan. In order to fit the entries within the
schema, I code nan entries as 'Organization'
schema, I code nan entries as 'Organization'

#### classify.py
1. These functions take in the deduplicated and cleaned individuals and organizations
dataframes from the deduplication and linkage pipeline.
2. We classify based on substrings known to indicate clean energy or fossil fuels groups.
In particular, individuals are classified based on their employment by fossil fuel companies,
and organizations are classified by their names, prioritizing high-profile corporations/PACs
and those found by a manual search of the largest donors/recipients in the dataset.
107 changes: 107 additions & 0 deletions utils/classify.py
@@ -0,0 +1,107 @@
import pandas as pd

from utils.constants import c_org_names, f_companies, f_org_names


def classify_wrapper(
individuals_df: pd.DataFrame, organizations_df: pd.DataFrame
):
"""Wrapper for classification in linkage pipeline

Initialize the classify column in both dataframes and
call sub-functions classifying individuals and organizations

Args:
individuals_df: cleaned and deduplicated dataframe of individuals
organizations_df: cleaned and deduplicated dataframe of organizations

Returns:
individuals and organizations dataframes with a new
'classification' column containing 'neutral', 'f', or 'c'.
'neutral' status is the default for all entities, and those tagged
as 'neutral' are entities which we could not confidently identify as
either fossil fuel or clean energy organizations or affiliates.
Classification is very conservative, and we are very confident that
entities classified as one group or another are related to them.

"""

individuals_df["classification"] = "neutral"
organizations_df["classification"] = "neutral"

classified_individuals = classify_individuals(individuals_df)
classified_orgs = classify_orgs(organizations_df)

return classified_individuals, classified_orgs


def matcher(df: pd.DataFrame, substring: str, column: str, category: str):
"""Applies a label to the classification column based on substrings

We run through a given column containing strings in the dataframe. We
seek out rows containing substrings, and apply a certain label to
the classification column. We initialize using the 'neutral' label and
use the 'f' and 'c' labels to denote fossil fuel and clean energy
entities respectively.

Args:
df: a pandas dataframe
substring: the string to search for
column: the column name in which to search
category: the category to assign to the row, such as 'f', 'c', or 'neutral'

Returns:
A pandas dataframe in which rows matching the substring conditions in
a certain column are marked with the appropriate category
"""

bool_series = df[column].str.contains(substring, na=False)

df.loc[bool_series, "classification"] = category

return df


def classify_individuals(individuals_df: pd.DataFrame):
"""Part of the classification pipeline

We check if individuals work for a known fossil fuel company
and categorize them using the matcher() function.

Args:
individuals_df: a dataframe containing deduplicated
standardized individuals data

Returns:
an individuals dataframe updated with the fossil fuels category
"""

for i in f_companies:
individuals_df = matcher(individuals_df, i, "company", "f")

return individuals_df


def classify_orgs(organizations_df: pd.DataFrame):
"""Part of the classification pipeline

We apply the matcher function to the organizations dataframe
repeatedly, using a variety of substrings to identify fossil
fuel and clean energy companies.

Args:
organizations_df: a dataframe containing deduplicated
standardized organizations data

Returns:
an organizations dataframe updated with the fossil fuels
and clean energy category
"""

for i in f_org_names:
organizations_df = matcher(organizations_df, i, "name", "f")

for i in c_org_names:
organizations_df = matcher(organizations_df, i, "name", "c")

return organizations_df
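
For reference, a minimal usage sketch of the matcher helper defined above; the organization names and the substring "Oil" are made-up examples (not entries from utils/constants.py), and the import assumes the utils package is on the Python path.

```python
# Usage sketch for matcher(); sample data and substring are illustrative only.
import pandas as pd

from utils.classify import matcher

orgs = pd.DataFrame({"name": ["Acme Oil & Gas PAC", "Sunrise Solar Alliance"]})
orgs["classification"] = "neutral"  # default label, as in classify_wrapper
orgs = matcher(orgs, "Oil", "name", "f")
print(orgs)
#                      name classification
# 0      Acme Oil & Gas PAC              f
# 1  Sunrise Solar Alliance        neutral
```
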