Merge branch 'refactor-network.py' of github.com:dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker into refactor-network.py
trevorspreadbury committed May 22, 2024
2 parents 6bdfb6a + 648f995 commit 4294e3c
Showing 6 changed files with 100 additions and 80 deletions.
21 changes: 8 additions & 13 deletions README.md
@@ -5,9 +5,9 @@
1. Collect: Gather key states' political campaign finance report data which should include recipient information, donor information, and transaction information.
2. Transform: Define database schema for storing transaction and entity information and write code to transform and validate raw data to fit appropriate schema.
3. Clean: Perform record linkage and fix likely data entry errors.
4. Classify: Label all entities as fossil fuel, clean energy, or other
5. Graph: Construct a network graph of campaign finance contributions
6. Analyze: Perform analysis on network data and join with other relevant dataset
4. Classify: Label all entities as fossil fuel, clean energy, or other.
5. Graph: Construct a network graph of campaign finance contributions with micro-level and macro-level views.
6. Analyze: Perform analysis on network data and join with other relevant datasets.


## Setup
@@ -32,24 +32,19 @@ For developing, please use either a Docker dev container or slurm computer cluster

### Network Visualization

# TODO: #101 document what we want to see in the visualization and decide how many types of visual are needed

The network visualizations created and their associated metrics are housed in the `/output` directory, specifically in [this](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/tree/main/output/network_graphs) folder. Details about the approaches adopted for these visuals are in [this](https://github.com/dsi-clinic/2024-winter-climate-cabinet-campaign-finance-tracker/blob/main/output/network_graphs/README.md) document.

## Repository Structure

### utils
Project python code
Project python code.

### notebooks
Contains short, clean notebooks to demonstrate analysis.
Contains short, clean notebooks to demonstrate analysis. This is a dynamic folder, with notebooks added and removed as the work evolves.

### data

Contains details of acquiring all raw data used in repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how to the data is obtained.

If the data is larger than 50MB than you should not add it to the repo and instead document how to get the data in the README.md file in the data directory.

This [README.md file](/data/README.md) should be kept up to date.
Contains details of acquiring all raw data used in the repository.

### output
This folder is empty by default; the final outputs of make commands are placed here.
Expand All @@ -73,7 +68,7 @@ Student Email: [email protected]
Student Name: Yangge Xu
Student Email: [email protected]

Student Name: Bhavya Pandey
Student Name: Bhavya Pandey
Student Email: [email protected]

Student Name: Kaya Lee
8 changes: 5 additions & 3 deletions output/README.md
@@ -1,5 +1,7 @@
# Output README
# Output
---
'deduplicated_UUIDs.csv' : Following record linkage work in the record_linkage pipeline, this file stores all the original uuids, and indicates the uuids to which the deduplicated uuids have been matched to.
`deduplicated_UUIDs.csv` : Following record linkage work in the record_linkage pipeline, this file stores all the original uuids and indicates the uuids to which the deduplicated uuids have been matched.

'network_metrics.txt' : Following the network graph creation, this file stores some summarizing metrics about the netowork including: 50 nodes of highest centrality (in-degree, out-degree, eigenvector, and betweenness), density, assortativity based on classification, and clustering.
`network_metrics.txt` : Following the network graph creation, this file stores some summarizing metrics about the network, including: 50 nodes of highest centrality (in-degree, out-degree, eigenvector, and betweenness), density, assortativity based on classification, and clustering.

This folder gets populated with output files upon running the `make` commands. The final network visualization graph outputs and metrics are housed here.
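For orientation, the metrics listed above are all standard networkx computations. Below is a minimal sketch on a toy graph; the real implementation lives in `src/utils/network.py` and differs in detail.

```python
import networkx as nx

# Toy stand-in for the campaign finance network.
G = nx.DiGraph()
G.add_edge("Donor PAC", "Committee A", amount=100.0)
G.add_edge("Committee A", "Donor PAC", amount=50.0)
nx.set_node_attributes(
    G, {"Donor PAC": "fossil_fuel", "Committee A": "other"}, "classification"
)

in_degree = nx.in_degree_centrality(G)
out_degree = nx.out_degree_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
betweenness = nx.betweenness_centrality(G, weight="amount")
density = nx.density(G)
assortativity = nx.attribute_assortativity_coefficient(G, "classification")
```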
Binary file added output/network_graphs/macro_level.png
27 changes: 20 additions & 7 deletions src/utils/README.md
@@ -1,5 +1,25 @@
# Utils README
---
#### classify.py
1. These functions take in the deduplicated and cleaned individuals and organizations
dataframes from the deduplication and linkage pipeline.
2. We classify based on substrings known to indicate clean energy or fossil fuel groups.
In particular, individuals are classified based on their employment by fossil fuel companies,
and organizations are classified by their names, prioritizing high-profile corporations/PACs
and those found by a manual search of the largest donors/recipients in the dataset (a sketch of this approach follows below).
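As a rough sketch of this substring approach (the keyword lists, column names, and label values below are illustrative assumptions, not the project's actual constants):

```python
import pandas as pd

# Assumed keyword lists; the real lists live in the project's constants.
FOSSIL_FUEL_KEYWORDS = ["oil", "gas", "coal", "petroleum"]
CLEAN_ENERGY_KEYWORDS = ["solar", "wind", "renewable"]


def classify_string(value: str) -> str:
    """Label a string as fossil fuel, clean energy, or other via substrings."""
    lowered = str(value).lower()
    if any(keyword in lowered for keyword in FOSSIL_FUEL_KEYWORDS):
        return "fossil_fuel"
    if any(keyword in lowered for keyword in CLEAN_ENERGY_KEYWORDS):
        return "clean_energy"
    return "other"


# Individuals are classified by employer, organizations by name
# (column names here are assumptions).
individuals = pd.DataFrame({"company": ["Acme Oil Co", "Corner Bakery"]})
organizations = pd.DataFrame({"name": ["Solar Action PAC", "Neighborhood Fund"]})
individuals["classification"] = individuals["company"].map(classify_string)
organizations["classification"] = organizations["name"].map(classify_string)
```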

#### constants.py
Declares constants to be used in various parts of the project. Specifies relative file paths and other static information to be used
uniformly across all code scripts.

#### linkage.py
Performs record linkage across the different datasets and deduplicates records.
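A heavily simplified sketch of the deduplication half of this module, matching on normalized name only for illustration (the real pipeline uses fuller record-linkage logic, and the column names are assumptions):

```python
import pandas as pd

records = pd.DataFrame(
    {
        "uuid": ["a1", "b2", "c3"],
        "name": ["Jane Doe", "JANE DOE ", "John Smith"],
    }
)

# Normalize names, then map every uuid to the first uuid seen for that name.
records["normalized"] = records["name"].str.lower().str.strip()
canonical = records.groupby("normalized")["uuid"].transform("first")
uuid_mapping = pd.DataFrame(
    {"original_uuid": records["uuid"], "deduplicated_uuid": canonical}
)

# Roughly the shape of the mapping that deduplicated_UUIDs.csv records.
print(uuid_mapping)
```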

#### network.py
Builds, visualizes, and analyzes the network graphs (both micro and macro level) that form the final outputs.

#### linkage_and_network_pipeline.py
Runs the final network visualization pipeline, calling the relevant functions to build the networks from cleaned, transformed, and classified data.
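To make the graph-building step concrete, here is a minimal sketch of turning transaction rows into a directed multigraph. The donor/recipient column names are assumptions, while the `amount` edge attribute mirrors the `weight="amount"` usage in the diffs below.

```python
import networkx as nx
import pandas as pd

transactions = pd.DataFrame(
    {
        "donor": ["Jane Doe", "Solar Action PAC"],
        "recipient": ["Committee A", "Committee A"],
        "amount": [500.0, 2500.0],
    }
)

# Entities become nodes; each transaction becomes a directed edge
# carrying the dollar amount as an attribute.
G = nx.MultiDiGraph()
for row in transactions.itertuples():
    G.add_edge(row.donor, row.recipient, amount=row.amount)

print(G.number_of_nodes(), G.number_of_edges())  # 3 nodes, 2 edges
```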

## Michigan Utils:
#### preprocess_mi_campaign_data.py
@@ -72,10 +92,3 @@ Util functions for MN EDA
candidates, committees, and nan. In order to fit the entries within the
schema, I code nan entries as 'Organization'

#### classify.py
1. These functions take in the deduplicated and cleaned individuals and organizations
dataframes from the deduplication and linkage pipeline.
2. We classify based on substrings known to indicate clean energy or fossil fuels groups.
In particular, individuals are classified based on their employment by fossil fuels companies,
and organizations are classified by their names, prioritizing high profile corporations/PACs
and those which were found by a manual search of the largest donors/recipients in the dataset
13 changes: 12 additions & 1 deletion src/utils/linkage_and_network_pipeline.py
@@ -25,6 +25,8 @@
from utils.network import (
combine_datasets_for_network_graph,
create_network_graph,
network_metrics,
plot_macro_level_graph,
run_network_graph_pipeline,
)

@@ -217,5 +219,14 @@ def clean_data_and_build_network(
g = create_network_graph(aggreg_df)
g_output_path = BASE_FILEPATH / "output" / "g.gml"
nx.write_graphml(g, g_output_path)
centrality_metrics, communities = network_metrics(g)

run_network_graph_pipeline(2018, 2022, [individuals, organizations, transactions])
# this creates the micro-level visualization which is
# stored in the output/network_graphs location
run_network_graph_pipeline(2018, 2023, [individuals, organizations, transactions])

# this creates the macro-level visualization - run this file in an interactive window in
# case the output figure is not displayed
plot_macro_level_graph(
g, communities, {"betweenness": nx.betweenness_centrality(g, weight="amount")}
)
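For context on the `communities` argument used above: the metrics file mentions "communities where k = 5", which suggests k-clique community detection. The sketch below assumes that interpretation rather than confirming it from this diff.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# k-clique communities are defined on undirected simple graphs,
# so a MultiDiGraph would first be flattened with nx.Graph(G).
G = nx.complete_graph(5)  # nodes 0-4 form a 5-clique
G.add_edge(0, 5)  # a pendant node that belongs to no 5-clique

communities = [set(c) for c in k_clique_communities(G, 5)]
print(communities)  # [{0, 1, 2, 3, 4}]
```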
111 changes: 55 additions & 56 deletions src/utils/network.py
@@ -1,12 +1,11 @@
"""Buidling, visualizing, and analyzing networks (micro-level)"""
"""Buidling, visualizing, and analyzing networks"""

import itertools
from pathlib import Path

import matplotlib.pyplot as plt
import networkx as nx

# import numpy as np
import numpy as np
import pandas as pd
import plotly.graph_objects as go

@@ -92,7 +91,6 @@ def combine_datasets_for_network_graph(dfs: list[pd.DataFrame]) -> pd.DataFrame:
return aggreg_df


# RETAINED
def create_network_graph(df: pd.DataFrame) -> nx.MultiDiGraph:
"""Creates network with entities as nodes, transactions as edges
@@ -130,8 +128,20 @@ def create_network_graph(df: pd.DataFrame) -> nx.MultiDiGraph:
return G


# Note: dict calls retained due to conventions in visualization package


def plot_network_graph(G: nx.MultiDiGraph, start_year: int, end_year: int) -> None:
"""Creates a plotly visualization of the nodes and edges with arrows indicating direction, and colors indicating classification."""
"""Creates a plotly visualization of the nodes and edges with arrows indicating direction, and colors indicating classification.
Args:
G: A Networkx MultiDiGraph with nodes and edges
start_year: starting year of the data subset used for the visualization
end_year: ending year of the data subset used for the visualization
Returns:
None; creates and displays the plotly visualization
"""
pos = nx.spring_layout(
G
) # position nodes using the spring layout - retained from original code
@@ -288,17 +298,17 @@ def network_metrics(net_graph: nx.Graph) -> None:
"density": density,
}

# with Path("output/network_metrics.txt").open("w") as file:
# file.write(f"in degree centrality: {in_degree}\n")
# file.write(f"out degree centrality: {out_degree}\n")
# file.write(f"eigenvector centrality: {eigenvector}\n")
# file.write(f"betweenness centrality: {betweenness}\n\n")
with Path("output/network_metrics.txt").open("w") as file:
file.write(f"in degree centrality: {in_degree}\n")
file.write(f"out degree centrality: {out_degree}\n")
file.write(f"eigenvector centrality: {eigenvector}\n")
file.write(f"betweenness centrality: {betweenness}\n\n")

# file.write(f"assortativity based on 'classification': {assortativity}\n\n")
file.write(f"assortativity based on 'classification': {assortativity}\n\n")

# file.write(f"density': {density}\n\n")
file.write(f"density': {density}\n\n")

# file.write(f"communities where k = 5': {communities}\n\n")
file.write(f"communities where k = 5': {communities}\n\n")

return metrics, communities

@@ -328,14 +338,14 @@ def run_network_graph_pipeline(
plot_network_graph(G, start_year, end_year)


# added for macro-level viz - Work in Progress
# added function for macro-level cluster viz
def additional_network_metrics(G: nx.Graph) -> None:
"""Calculate and print additional network metrics
Args:
G: network graph created
G: network graph created with edges and nodes
Returns:
some metrics requried for clustering viz
prints some additional metrics that may be required for clustering viz
"""
# switch the MultiDiGraph to DiGraph for computing
simple_graph = nx.DiGraph(G)
@@ -351,13 +361,6 @@ def additional_network_metrics(G: nx.Graph) -> None:
print("Average Clustering Coefficient:", clustering_coeff)


# for testing
individuals = pd.read_csv("output/cleaned/individuals_table.csv")
organizations = pd.read_csv("output/cleaned/organizations_table.csv")
transactions = pd.read_csv("output/cleaned/transactions_table.csv")
run_network_graph_pipeline(2018, 2021, [individuals, organizations, transactions])


def plot_macro_level_graph(
net_graph: nx.Graph, communities: list, centrality_metrics: dict
) -> None:
@@ -367,16 +370,19 @@
net_graph (nx.Graph): The networkx graph object.
communities (list of lists): Each sublist contains nodes that form a community.
centrality_metrics (dict): Dictionary containing various centrality measures.
Returns:
None, creates visualization
"""
pos = nx.spring_layout(net_graph)
plt.figure(figsize=(12, 8))
plt.figure(figsize=(15, 8))

# mapping each node to its community
# community_map = {
# node: idx for idx, community in enumerate(communities) for node in community
# }
# obtaining colors for each community
# community_colors = np.array([community_map[node] for node in net_graph.nodes()])
community_map = {
node: idx for idx, community in enumerate(communities) for node in community
}
# obtaining colors for each community for coloring of nodes
community_colors = np.array([community_map[node] for node in net_graph.nodes()])

# putting down nodes
node_sizes = [
@@ -385,51 +391,44 @@
nx.draw_networkx_nodes(
net_graph,
pos,
# node_color=community_colors,
node_color=community_colors,
node_size=node_sizes,
cmap=plt.cm.jet,
cmap=plt.get_cmap("viridis"),
ax=plt.gca(),
alpha=0.7,
)

# drawing edges
nx.draw_networkx_edges(net_graph, pos, alpha=0.5)

# labels for high centrality nodes
# adding labels for high centrality nodes
high_centrality_nodes = [
node
for node in centrality_metrics["betweenness"]
if centrality_metrics["betweenness"][node]
> sorted(centrality_metrics["betweenness"].values())[-10]
] # have to adjust threshold
] # can adjust threshold here to display labels
nx.draw_networkx_labels(
net_graph,
pos,
labels={node: node for node in high_centrality_nodes},
font_size=10,
)
mapper = plt.cm.ScalarMappable(cmap=plt.cm.viridis)
ax = plt.gca()
plt.colorbar(
mapper,
ax=ax,
orientation="horizontal",
label="Community ID",
fraction=0.036,
pad=0.04,
)

plt.title("Macro-Level Clustering View of Network Graph")
# plt.colorbar(
# plt.cm.ScalarMappable(cmap=plt.cm.jet),
# orientation="horizontal",
# label="Community ID",
# )
plt.title("Macro-Level Clustering View of Network Graph", fontsize=16)
plt.axis("off")
graphs_directory = Path("output/network_graphs")
graphs_directory.mkdir(parents=True, exist_ok=True)
filename = graphs_directory / f"macro_level_{next(iter(centrality_metrics))}.png"
plt.savefig(str(filename))
plt.show()


# testing usage of macro level viz function - change paths if needed and RUN IN AN INTERACTIVE WINDOW TO DISPLAY GRAPH
# TODO: make default paths more robust
# TODO: move script to scripts directory
individuals = pd.read_csv("/project/output/cleaned/individuals_table.csv")
organizations = pd.read_csv("/project/output/cleaned/organizations_table.csv")
transactions = pd.read_csv("/project/output/cleaned/transactions_table.csv")

aggreg_df = combine_datasets_for_network_graph(
[individuals, organizations, transactions]
)
G = create_network_graph(aggreg_df)
centrality_metrics, communities = network_metrics(G)
plot_macro_level_graph(
G, communities, {"betweenness": nx.betweenness_centrality(G, weight="amount")}
)
