Merge branch 'refactor-network.py' of github.com:dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker into refactor-network.py
trevorspreadbury committed May 22, 2024
2 parents 6bdfb6a + 648f995 commit 4294e3c
Showing 6 changed files with 100 additions and 80 deletions.
21 changes: 8 additions & 13 deletions README.md
@@ -5,9 +5,9 @@
1. Collect: Gather key states' political campaign finance report data which should include recipient information, donor information, and transaction information.
2. Transform: Define database schema for storing transaction and entity information and write code to transform and validate raw data to fit appropriate schema.
3. Clean: Perform record linkage and fix likely data entry errors.
4. Classify: Label all entities as fossil fuel, clean energy, or other
5. Graph: Construct a network graph of campaign finance contributions
6. Analyze: Perform analysis on network data and join with other relevant dataset
4. Classify: Label all entities as fossil fuel, clean energy, or other.
5. Graph: Construct a network graph of campaign finance contributions with micro-level and macro-level views.
6. Analyze: Perform analysis on network data and join with other relevant datasets.


## Setup
@@ -32,24 +32,19 @@ For developing, please use either a Docker dev container or slurm computer cluster

### Network Visualization

# TODO: #101 document what we want to see in the visualization and decide how many types of visual are needed

The network visualizations created and their associated metrics are housed in the `/output` directory, specifically in [this](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/tree/main/output/network_graphs) folder. Details about the approaches adopted for these visuals are in [this](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/main/output/network_graphs/README.md) document.

## Repository Structure

### utils
Project python code
Project python code.

### notebooks
Contains short, clean notebooks to demonstrate analysis.
Contains short, clean notebooks to demonstrate analysis. This is a dynamic folder, with notebooks added and removed as the work evolves.

### data

Contains details of acquiring all raw data used in repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how to the data is obtained.

If the data is larger than 50MB than you should not add it to the repo and instead document how to get the data in the README.md file in the data directory.

This [README.md file](/data/README.md) should be kept up to date.
Contains details of acquiring all raw data used in the repository.

### output
This folder is empty by default; the final outputs of make commands are placed here.
Expand All @@ -73,7 +68,7 @@ Student Email: [email protected]
Student Name: Yangge Xu
Student Email: [email protected]

Student Name: Bhavya Pandey
Student Name: Bhavya Pandey
Student Email: [email protected]

Student Name: Kaya Lee
8 changes: 5 additions & 3 deletions output/README.md
@@ -1,5 +1,7 @@
# Output README
# Output
---
'deduplicated_UUIDs.csv' : Following record linkage work in the record_linkage pipeline, this file stores all the original uuids, and indicates the uuids to which the deduplicated uuids have been matched to.
`deduplicated_UUIDs.csv` : Following record linkage work in the record_linkage pipeline, this file stores all the original uuids and indicates the uuids to which the deduplicated uuids have been matched.

'network_metrics.txt' : Following the network graph creation, this file stores some summarizing metrics about the netowork including: 50 nodes of highest centrality (in-degree, out-degree, eigenvector, and betweenness), density, assortativity based on classification, and clustering.
`network_metrics.txt` : Following the network graph creation, this file stores some summarizing metrics about the network, including: 50 nodes of highest centrality (in-degree, out-degree, eigenvector, and betweenness), density, assortativity based on classification, and clustering.

This folder gets populated with output files upon running the `make` commands. The final network visualization graph outputs and metrics are housed here.
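For orientation, the metrics listed above are all standard networkx computations. Below is a minimal sketch on a toy graph; the real implementation lives in `src/utils/network.py` and differs in detail.

```python
import networkx as nx

# Toy stand-in for the campaign finance network.
G = nx.DiGraph()
G.add_edge("Donor PAC", "Committee A", amount=100.0)
G.add_edge("Committee A", "Donor PAC", amount=50.0)
nx.set_node_attributes(
    G, {"Donor PAC": "fossil_fuel", "Committee A": "other"}, "classification"
)

in_degree = nx.in_degree_centrality(G)
out_degree = nx.out_degree_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
betweenness = nx.betweenness_centrality(G, weight="amount")
density = nx.density(G)
assortativity = nx.attribute_assortativity_coefficient(G, "classification")
```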
Binary file added output/network_graphs/macro_level.png
27 changes: 20 additions & 7 deletions src/utils/README.md
@@ -1,5 +1,25 @@
# Utils README
---
#### classify.py
1. These functions take in the deduplicated and cleaned individuals and organizations
dataframes from the deduplication and linkage pipeline.
2. We classify based on substrings known to indicate clean energy or fossil fuel groups.
In particular, individuals are classified based on their employment by fossil fuel companies,
and organizations are classified by their names, prioritizing high-profile corporations/PACs
and those found by a manual search of the largest donors/recipients in the dataset (a sketch of this approach follows below).
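As a rough sketch of this substring approach (the keyword lists, column names, and label values below are illustrative assumptions, not the project's actual constants):

```python
import pandas as pd

# Assumed keyword lists; the real lists live in the project's constants.
FOSSIL_FUEL_KEYWORDS = ["oil", "gas", "coal", "petroleum"]
CLEAN_ENERGY_KEYWORDS = ["solar", "wind", "renewable"]


def classify_string(value: str) -> str:
    """Label a string as fossil fuel, clean energy, or other via substrings."""
    lowered = str(value).lower()
    if any(keyword in lowered for keyword in FOSSIL_FUEL_KEYWORDS):
        return "fossil_fuel"
    if any(keyword in lowered for keyword in CLEAN_ENERGY_KEYWORDS):
        return "clean_energy"
    return "other"


# Individuals are classified by employer, organizations by name
# (column names here are assumptions).
individuals = pd.DataFrame({"company": ["Acme Oil Co", "Corner Bakery"]})
organizations = pd.DataFrame({"name": ["Solar Action PAC", "Neighborhood Fund"]})
individuals["classification"] = individuals["company"].map(classify_string)
organizations["classification"] = organizations["name"].map(classify_string)
```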

#### constants.py
Declares constants to be used in various parts of the project. Specifies relative file paths and other static information to be used
uniformly across all code scripts.

#### linkage.py
Performs record linkage across the different datasets and deduplicates records.
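A heavily simplified sketch of the deduplication half of this module, matching on normalized name only for illustration (the real pipeline uses fuller record-linkage logic, and the column names are assumptions):

```python
import pandas as pd

records = pd.DataFrame(
    {
        "uuid": ["a1", "b2", "c3"],
        "name": ["Jane Doe", "JANE DOE ", "John Smith"],
    }
)

# Normalize names, then map every uuid to the first uuid seen for that name.
records["normalized"] = records["name"].str.lower().str.strip()
canonical = records.groupby("normalized")["uuid"].transform("first")
uuid_mapping = pd.DataFrame(
    {"original_uuid": records["uuid"], "deduplicated_uuid": canonical}
)

# Roughly the shape of the mapping that deduplicated_UUIDs.csv records.
print(uuid_mapping)
```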

#### network.py
Builds, visualizes, and analyzes the network graphs (both micro and macro level) that form the final outputs.

#### linkage_and_network_pipeline.py
Runs the final network visualization pipeline, calling the relevant functions to build the networks from cleaned, transformed, and classified data.
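To make the graph-building step concrete, here is a minimal sketch of turning transaction rows into a directed multigraph. The donor/recipient column names are assumptions, while the `amount` edge attribute mirrors the `weight="amount"` usage in the diffs below.

```python
import networkx as nx
import pandas as pd

transactions = pd.DataFrame(
    {
        "donor": ["Jane Doe", "Solar Action PAC"],
        "recipient": ["Committee A", "Committee A"],
        "amount": [500.0, 2500.0],
    }
)

# Entities become nodes; each transaction becomes a directed edge
# carrying the dollar amount as an attribute.
G = nx.MultiDiGraph()
for row in transactions.itertuples():
    G.add_edge(row.donor, row.recipient, amount=row.amount)

print(G.number_of_nodes(), G.number_of_edges())  # 3 nodes, 2 edges
```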

## Michigan Utils:
#### preprocess_mi_campaign_data.py
@@ -72,10 +92,3 @@ Util functions for MN EDA
candidates, committees, and nan. In order to fit the entries within the
schema, I code nan entries as 'Organization'

#### classify.py
1. These functions take in the deduplicated and cleaned individuals and organizations
dataframes from the deduplication and linkage pipeline.
2. We classify based on substrings known to indicate clean energy or fossil fuels groups.
In particular, individuals are classified based on their employment by fossil fuels companies,
and organizations are classified by their names, prioritizing high profile corporations/PACs
and those which were found by a manual search of the largest donors/recipients in the dataset
13 changes: 12 additions & 1 deletion src/utils/linkage_and_network_pipeline.py
@@ -25,6 +25,8 @@
from utils.network import (
combine_datasets_for_network_graph,
create_network_graph,
network_metrics,
plot_macro_level_graph,
run_network_graph_pipeline,
)

@@ -217,5 +219,14 @@ def clean_data_and_build_network(
g = create_network_graph(aggreg_df)
g_output_path = BASE_FILEPATH / "output" / "g.gml"
nx.write_graphml(g, g_output_path)
centrality_metrics, communities = network_metrics(g)

run_network_graph_pipeline(2018, 2022, [individuals, organizations, transactions])
# this creates the micro-level visualization which is
# stored in the output/network_graphs location
run_network_graph_pipeline(2018, 2023, [individuals, organizations, transactions])

# this creates the macro-level visualization - run this file in an interactive window in
# case the output figure is not displayed
plot_macro_level_graph(
g, communities, {"betweenness": nx.betweenness_centrality(g, weight="amount")}
)
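For context on the `communities` argument used above: the metrics file mentions "communities where k = 5", which suggests k-clique community detection. The sketch below assumes that interpretation rather than confirming it from this diff.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# k-clique communities are defined on undirected simple graphs,
# so a MultiDiGraph would first be flattened with nx.Graph(G).
G = nx.complete_graph(5)  # nodes 0-4 form a 5-clique
G.add_edge(0, 5)  # a pendant node that belongs to no 5-clique

communities = [set(c) for c in k_clique_communities(G, 5)]
print(communities)  # [{0, 1, 2, 3, 4}]
```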
111 changes: 55 additions & 56 deletions src/utils/network.py
@@ -1,12 +1,11 @@
"""Buidling, visualizing, and analyzing networks (micro-level)"""
"""Buidling, visualizing, and analyzing networks"""

import itertools
from pathlib import Path

import matplotlib.pyplot as plt
import networkx as nx

# import numpy as np
import numpy as np
import pandas as pd
import plotly.graph_objects as go

@@ -92,7 +91,6 @@ def combine_datasets_for_network_graph(dfs: list[pd.DataFrame]) -> pd.DataFrame:
return aggreg_df


# RETAINED
def create_network_graph(df: pd.DataFrame) -> nx.MultiDiGraph:
"""Creates network with entities as nodes, transactions as edges
@@ -130,8 +128,20 @@ def create_network_graph(df: pd.DataFrame) -> nx.MultiDiGraph:
return G


# Note: dict calls retained due to conventions in visualization package


def plot_network_graph(G: nx.MultiDiGraph, start_year: int, end_year: int) -> None:
"""Creates a plotly visualization of the nodes and edges with arrows indicating direction, and colors indicating classification."""
"""Creates a plotly visualization of the nodes and edges with arrows indicating direction, and colors indicating classification.
Args:
G: A Networkx MultiDiGraph with nodes and edges
start_year: starting year of the data subset used for the visualization
end_year: ending year of the data subset used for the visualization
Returns:
None; creates and displays the plotly visualization
"""
pos = nx.spring_layout(
G
) # position nodes using the spring layout - retained from original code
@@ -288,17 +298,17 @@ def network_metrics(net_graph: nx.Graph) -> None:
"density": density,
}

# with Path("output/network_metrics.txt").open("w") as file:
# file.write(f"in degree centrality: {in_degree}\n")
# file.write(f"out degree centrality: {out_degree}\n")
# file.write(f"eigenvector centrality: {eigenvector}\n")
# file.write(f"betweenness centrality: {betweenness}\n\n")
with Path("output/network_metrics.txt").open("w") as file:
file.write(f"in degree centrality: {in_degree}\n")
file.write(f"out degree centrality: {out_degree}\n")
file.write(f"eigenvector centrality: {eigenvector}\n")
file.write(f"betweenness centrality: {betweenness}\n\n")

# file.write(f"assortativity based on 'classification': {assortativity}\n\n")
file.write(f"assortativity based on 'classification': {assortativity}\n\n")

# file.write(f"density': {density}\n\n")
file.write(f"density': {density}\n\n")

# file.write(f"communities where k = 5': {communities}\n\n")
file.write(f"communities where k = 5': {communities}\n\n")

return metrics, communities

@@ -328,14 +338,14 @@ def run_network_graph_pipeline(
plot_network_graph(G, start_year, end_year)


# added for macro-level viz - Work in Progress
# added function for macro-level cluster viz
def additional_network_metrics(G: nx.Graph) -> None:
"""Calculate and print additional network metrics
Args:
G: network graph created
G: network graph created with edges and nodes
Returns:
some metrics requried for clustering viz
prints some additional metrics that may be required for clustering viz
"""
# switch the MultiDiGraph to DiGraph for computing
simple_graph = nx.DiGraph(G)
@@ -351,13 +361,6 @@ def additional_network_metrics(G: nx.Graph) -> None:
print("Average Clustering Coefficient:", clustering_coeff)


# for testing
individuals = pd.read_csv("output/cleaned/individuals_table.csv")
organizations = pd.read_csv("output/cleaned/organizations_table.csv")
transactions = pd.read_csv("output/cleaned/transactions_table.csv")
run_network_graph_pipeline(2018, 2021, [individuals, organizations, transactions])


def plot_macro_level_graph(
net_graph: nx.Graph, communities: list, centrality_metrics: dict
) -> None:
@@ -367,16 +370,19 @@
net_graph (nx.Graph): The networkx graph object.
communities (list of lists): Each sublist contains nodes that form a community.
centrality_metrics (dict): Dictionary containing various centrality measures.
Returns:
None, creates visualization
"""
pos = nx.spring_layout(net_graph)
plt.figure(figsize=(12, 8))
plt.figure(figsize=(15, 8))

# mapping each node to its community
# community_map = {
# node: idx for idx, community in enumerate(communities) for node in community
# }
# obtaining colors for each community
# community_colors = np.array([community_map[node] for node in net_graph.nodes()])
community_map = {
node: idx for idx, community in enumerate(communities) for node in community
}
# obtaining colors for each community for coloring of nodes
community_colors = np.array([community_map[node] for node in net_graph.nodes()])

# putting down nodes
node_sizes = [
@@ -385,51 +391,44 @@
nx.draw_networkx_nodes(
net_graph,
pos,
# node_color=community_colors,
node_color=community_colors,
node_size=node_sizes,
cmap=plt.cm.jet,
cmap=plt.get_cmap("viridis"),
ax=plt.gca(),
alpha=0.7,
)

# drawing edges
nx.draw_networkx_edges(net_graph, pos, alpha=0.5)

# labels for high centrality nodes
# adding labels for high centrality nodes
high_centrality_nodes = [
node
for node in centrality_metrics["betweenness"]
if centrality_metrics["betweenness"][node]
> sorted(centrality_metrics["betweenness"].values())[-10]
] # have to adjust threshold
] # can adjust threshold here to display labels
nx.draw_networkx_labels(
net_graph,
pos,
labels={node: node for node in high_centrality_nodes},
font_size=10,
)
mapper = plt.cm.ScalarMappable(cmap=plt.cm.viridis)
ax = plt.gca()
plt.colorbar(
mapper,
ax=ax,
orientation="horizontal",
label="Community ID",
fraction=0.036,
pad=0.04,
)

plt.title("Macro-Level Clustering View of Network Graph")
# plt.colorbar(
# plt.cm.ScalarMappable(cmap=plt.cm.jet),
# orientation="horizontal",
# label="Community ID",
# )
plt.title("Macro-Level Clustering View of Network Graph", fontsize=16)
plt.axis("off")
graphs_directory = Path("output/network_graphs")
graphs_directory.mkdir(parents=True, exist_ok=True)
filename = graphs_directory / f"macro_level_{next(iter(centrality_metrics))}.png"
plt.savefig(str(filename))
plt.show()


# testing usage of macro level viz function - change paths if needed and RUN IN AN INTERACTIVE WINDOW TO DISPLAY GRAPH
# TODO: make default paths more robust
# TODO: move script to scripts directory
individuals = pd.read_csv("/project/output/cleaned/individuals_table.csv")
organizations = pd.read_csv("/project/output/cleaned/organizations_table.csv")
transactions = pd.read_csv("/project/output/cleaned/transactions_table.csv")

aggreg_df = combine_datasets_for_network_graph(
[individuals, organizations, transactions]
)
G = create_network_graph(aggreg_df)
centrality_metrics, communities = network_metrics(G)
plot_macro_level_graph(
G, communities, {"betweenness": nx.betweenness_centrality(G, weight="amount")}
)
