Refactoring API data pipeline for fetching institutions data and updating Readme #18

Merged 3 commits on Apr 25, 2024
30 changes: 22 additions & 8 deletions README.md
@@ -14,16 +14,16 @@ pyenv install 3.11.4

### Node

This code base is compatible with node 18 and above. Please use [the following instructions](https://nodejs.org/en/learn/getting-started/how-to-install-nodejs)
This code base is compatible with node 18 and above. Please use [these instructions](https://nodejs.org/en/learn/getting-started/how-to-install-nodejs)
to install node for your operating system if needed.

### Git

Please [follow the instructions on GitHub](https://github.com/git-guides/install-git) to install git on your system.
Please follow the [instructions on GitHub](https://github.com/git-guides/install-git) to install git on your system.

### Poetry

Please [follow the instractions on the Poetry website](https://python-poetry.org/docs/#installation) to install poetry on your system.
Please follow the [instructions on the Poetry website](https://python-poetry.org/docs/#installation) to install poetry on your system.

## Getting Started

@@ -45,14 +45,22 @@ You can then install project dependencies as follows:
poetry install
```

You need a `.env` file to store secrets as follows:
You need a `.env` file to store secrets and other environment variables as follows:

```
OPENALEX_EMAIL=<your-email-address>
INSTITUTIONS_FETCH_FILTER=hbcus
INSTITUTIONS_FETCH_COUNT=5
```

The OPENALEX_EMAIL secret is used to [speed up calls](https://docs.openalex.org/how-to-use-the-api/api-overview) to the OpenAlex REST API.

INSTITUTIONS_FETCH_FILTER (allowed values = `hbcus` or `howardu`) is used to configure which institutions will be fetched from the OpenAlex API and saved to `observable/docs/data/institutions.json`.

INSTITUTIONS_FETCH_COUNT determines how many institutions will be loaded in the application.

>**NOTE:** INSTITUTIONS_FETCH_FILTER and INSTITUTIONS_FETCH_COUNT are only used when running `fetch_custom_institutions.py` as a script. When using `invoke fetch` the default values of `hbcus` and `5` are used respectively.
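
The way `scripts/fetch_custom_institutions.py` in this PR resolves its two settings (command-line arguments first, then the two environment variables, then the defaults `hbcus` and `5`) can be sketched as a small pure function. This is a minimal illustration rather than the script's actual code: `resolve_fetch_settings` and its `argv`/`env` parameters are hypothetical names, and it assumes the environment variables have already been loaded from `.env` by whatever tooling the project uses.

```python
from typing import Mapping, Sequence

DEFAULT_FILTER = "hbcus"
DEFAULT_COUNT = 5


def resolve_fetch_settings(argv: Sequence[str], env: Mapping[str, str]) -> tuple[str, int]:
    """Pick the filter and count from CLI args, then env vars, then defaults.

    Hypothetical helper: argv holds the arguments after the script name,
    env is any mapping such as os.environ.
    """
    try:
        # Positional arguments win when both are present and the count is numeric.
        fetch_filter, fetch_count = str(argv[0]), int(argv[1])
    except (IndexError, ValueError):
        try:
            fetch_filter = env["INSTITUTIONS_FETCH_FILTER"]
            fetch_count = int(env["INSTITUTIONS_FETCH_COUNT"])
        except (KeyError, ValueError):
            fetch_filter, fetch_count = DEFAULT_FILTER, DEFAULT_COUNT
    # A non-positive count is treated as "use the default".
    if fetch_count <= 0:
        fetch_count = DEFAULT_COUNT
    return fetch_filter, fetch_count
```

Calling `resolve_fetch_settings(sys.argv[1:], os.environ)` would mirror how the script is started. Because `invoke fetch` passes `hbcus 5` explicitly, the environment variables are never consulted on that path.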

## Running

This project uses [Observable Framework](https://observablehq.com/framework/). You can run the site locally in development mode as follows
@@ -72,13 +80,19 @@ Deployments to this project on the Observable Cloud take place through the **Dep

You can run various other commands using `invoke` as follows.

Deploy the site to Observable Cloud.
Fetch data for the first 5 HBCU institutions from the OpenAlex API and save it to `observable/docs/data/institutions.json`:

```bash
invoke fetch
```

Deploy the site to Observable Cloud:

```bash
invoke deploy
```

Build the static web site locally.
Build the static web site locally:

```bash
invoke build
@@ -87,13 +101,13 @@ invoke build
Manually cause a graph.json refresh. This is needed because currently
observable framework doesn't notice if a dependent python module
has been changed when developing. It only monitors changes to
the particular page that is being displayed.
the particular page that is being displayed:

```bash
invoke touch
```

Delete local git branches that have already been merged.
Delete local git branches that have already been merged:

```bash
invoke clean-branches
37 changes: 0 additions & 37 deletions collabnext/custom.py

This file was deleted.

20 changes: 17 additions & 3 deletions collabnext/openalex/institutions.py
@@ -1,6 +1,20 @@
from pyalex import Institution, Institutions
import json
import sys


def get_institutions() -> list[Institution]:
    # Get 5 random institutions for now
    return [Institutions().random() for _ in range(5)]
def get_institutions(institutions_file_path: str = "docs/data/institutions.json") -> list[Institution]:
    institutions = []

    # Load institutions from JSON file
    try:
        institutions = json.load(open(institutions_file_path))
    except Exception as e:
        print("\nError loading institutions from JSON file", institutions_file_path, ":", e, "\n", file=sys.stderr)

    # Get 5 random institutions in case of error
    if institutions is None or len(institutions) == 0:
        print("No institutions found in JSON file, fetching random institutions\n", file=sys.stderr)
        institutions = [Institutions().random() for _ in range(5)]

    return institutions
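
A variant of the load-then-fall-back logic above, sketched with a context manager so the file handle is closed deterministically. `load_institutions` and the `fallback` callable are hypothetical names used for illustration. In the PR the fallback is five calls to `Institutions().random()`; here it is an injected function so the sketch runs without network access.

```python
import json
import sys
from typing import Callable


def load_institutions(path: str, fallback: Callable[[], list]) -> list:
    """Load institutions from a JSON file, calling fallback() if none are found."""
    institutions = []
    try:
        # The with-block closes the file even if json.load raises.
        with open(path, encoding="utf-8") as f:
            institutions = json.load(f)
    except (OSError, ValueError) as e:
        # OSError covers a missing or unreadable file, ValueError covers invalid JSON.
        print(f"Error loading institutions from {path}: {e}", file=sys.stderr)

    # A missing file, invalid JSON, null, and an empty list all end up here.
    if not institutions:
        print("No institutions found in JSON file, using fallback", file=sys.stderr)
        institutions = fallback()
    return institutions
```

Narrowing the `except` clause to `OSError` and `ValueError` is a design choice of this sketch: unrelated programming errors would then surface instead of silently triggering the fallback.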
5 changes: 0 additions & 5 deletions collabnext/settings.py

This file was deleted.

1 change: 0 additions & 1 deletion data/institutions_hbcus.json

This file was deleted.

1 change: 0 additions & 1 deletion data/institutions_howardu.json

This file was deleted.

14 changes: 1 addition & 13 deletions observable/docs/data/graph.sqlite.py
@@ -23,20 +23,8 @@
from collabnext.openalex.topics import get_work_topics
from collabnext.openalex.works import get_works_by_authors

from collabnext import settings, custom

# Get institutions
institutions = []
try:
    if settings.INSTITUTION_FILTER == "howardu":
        institutions = custom.get_institutions_howardu()
    elif settings.INSTITUTION_FILTER == "hbcus":
        institutions = custom.get_institutions_hbcus(settings.DATA_LOAD_TYPE)
except Exception as e:
    print("\nError getting custom institutions:", e, "\n")

if institutions is None or len(institutions) == 0:
    institutions = get_institutions()
institutions = get_institutions()

# Create nodes
institution_nodes = make_institution_nodes(institutions)
1 change: 1 addition & 0 deletions observable/docs/data/institutions.json

Large diffs are not rendered by default.

5 changes: 0 additions & 5 deletions requirements.txt

This file was deleted.

100 changes: 100 additions & 0 deletions scripts/fetch_custom_institutions.py
@@ -0,0 +1,100 @@

from pyalex import Institutions, Institution
import pandas as pd
import json
import os
import sys


def fetch_institutions_from_api(
    institutions_fetch_filter: str = "hbcus",
    institutions_fetch_count: int = 5,
    institutions_names_list_path: str = "scripts/hbcus_names_list.csv",
    save_to_file: bool = True,
    institutions_save_path: str = "observable/docs/data/institutions.json"
) -> list[Institution]:
    """
    Fetch institutions from the OpenAlex API based on the specified filter and save the data to a JSON file

    Args:
        institutions_fetch_filter (str): The filter to determine which institutions to fetch from the API
        institutions_fetch_count (int): The number of institutions for which to fetch data
        institutions_names_list_path (str): CSV file path containing the list of HBCUs names
        save_to_file (bool): Whether to save the institutions data to a JSON file
        institutions_save_path (str): JSON file path to save the institutions data to

    Returns:
        list[Institution]: The list of institutions fetched from the API
    """

    institutions = []

    try:
        if institutions_fetch_filter == "howardu":
            # Fetch Howard University based on OpenAlex ID
            institutions = Institutions().filter(openalex="I137853757").get()
            print("\nFetched institution data for Howard University")

        elif institutions_fetch_filter == "hbcus":

            # Read list of HBCUs Names from Eligibility Data
            inst_df = pd.read_csv(institutions_names_list_path)
            print("\nLoaded list of HBCUs names from:", institutions_names_list_path, "\n")
            inst_df["query"] = inst_df["name"].str.lower()
            inst_df["query"] = inst_df["query"].str.replace(" &", "")

            # Run API search for each HBCU name
            hbcu_inst_ids = []
            hbcu_inst_count = 0
            for query in inst_df["query"].tolist():
                # Break if the required number of institutions have been fetched, else proceed with search query
                if hbcu_inst_count >= institutions_fetch_count:
                    break
                institutions_query = Institutions().filter(display_name={"search": query}).get()

                # Check search results for name matches and add to institutions list if not already present
                for inst in institutions_query:
                    if (inst["display_name"] in inst_df["name"].tolist()) and (inst["id"] not in hbcu_inst_ids) and (hbcu_inst_count < institutions_fetch_count):
                        print("Adding institution:", inst["display_name"])
                        institutions.append(inst)
                        hbcu_inst_ids.append(inst["id"])
                        hbcu_inst_count += 1

            print("\nFetched data for", len(institutions), "out of", inst_df.shape[0], "institutions\n")

        else:
            print("Invalid value of institutions_fetch_filter, make sure to set it to 'hbcus' or 'howardu' (without the quotes) in your .env file")

        # Save institutions data to JSON file if required
        if (save_to_file) and (len(institutions) > 0):
            with open(institutions_save_path, "w") as f:
                json.dump(institutions, f)
            print("Institutions data saved to", institutions_save_path, "\n")

    except Exception as e:
        print("\nError fetching institutions from the API:", e, "\n")

    return institutions


if __name__ == "__main__":

    # Check system arguments and environment variables for filter and count of institutions to fetch
    try:
        institutions_fetch_filter = str(sys.argv[1])
        institutions_fetch_count = int(sys.argv[2])
    except Exception as e:
        print("\nError parsing system arguments:", e, "\n")
        try:
            institutions_fetch_filter = os.getenv("INSTITUTIONS_FETCH_FILTER")
            institutions_fetch_count = int(os.getenv("INSTITUTIONS_FETCH_COUNT"))
        except Exception as e:
            print("\nError fetching environment variables:", e, "\n")
            institutions_fetch_filter = "hbcus"
            institutions_fetch_count = 5
    if institutions_fetch_count is None or institutions_fetch_count <= 0:
        institutions_fetch_count = 5

    # Make the API call to fetch data
    institutions = fetch_institutions_from_api(institutions_fetch_filter, institutions_fetch_count)
    print("Completed fetching institutions data from the OpenAlex API\n")
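
The name normalization and the match, deduplicate, and cap loop in the script above can be separated from the API calls, which makes them easy to check offline. The sketch below is an illustration under assumptions, not the PR's code: `normalize_query` and `select_institutions` are hypothetical helpers, and each element of `results_per_query` stands in for the list that one `Institutions().filter(display_name={"search": query}).get()` call would return.

```python
from typing import Iterable


def normalize_query(name: str) -> str:
    """Lowercase a name and drop ' &', as the script does before searching."""
    return name.lower().replace(" &", "")


def select_institutions(
    results_per_query: Iterable[list[dict]],
    known_names: Iterable[str],
    limit: int,
) -> list[dict]:
    """Keep exact display-name matches, skip duplicate ids, stop at limit."""
    known = set(known_names)
    seen_ids: set[str] = set()
    selected: list[dict] = []
    for results in results_per_query:
        # Stop consuming further search results once enough were collected.
        if len(selected) >= limit:
            break
        for inst in results:
            if len(selected) >= limit:
                break
            if inst["display_name"] in known and inst["id"] not in seen_ids:
                selected.append(inst)
                seen_ids.add(inst["id"])
    return selected
```

Passing a generator as `results_per_query` keeps the API calls lazy, so no further searches are issued once `limit` institutions have been selected. Using sets for `known` and `seen_ids` also avoids rebuilding the name list on every comparison.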
10 changes: 5 additions & 5 deletions data/institutions_hbcus.csv → scripts/hbcus_names_list.csv
@@ -1,5 +1,9 @@
name
Alabama A & M University
Alabama Agricultural and Mechanical University
Fisk University
Howard University
Morehouse College
Texas Southern University
Alabama State University
Albany State University
Alcorn State University
@@ -26,7 +30,6 @@ Dillard University
Edward Waters College
Elizabeth City State University
Fayetteville State University
Fisk University
Florida Agricultural and Mechanical University
Florida Memorial University
Fort Valley State University
@@ -36,7 +39,6 @@ H Councill Trenholm State Community College
Hampton University
Harris-Stowe State University
Hinds Community College
Howard University
Huston-Tillotson University
J. F. Drake State Community and Technical College
Jackson State University
@@ -53,7 +55,6 @@ Livingstone College
Meharry Medical College
Miles College
Mississippi Valley State University
Morehouse College
Morehouse School of Medicine
Morgan State University
Morris Brown College
@@ -84,7 +85,6 @@ Stillman College
Talladega College
Tennessee State University
Texas College
Texas Southern University
Tougaloo College
Tuskegee University
University of Arkansas at Pine Bluff
6 changes: 6 additions & 0 deletions tasks.py
@@ -55,3 +55,9 @@ def clean_branches(c):
def touch(c):
    with cwd("observable/docs/data"):
        c.run("touch graph.sqlite.py")


@task
def fetch(c):
    with cwd("."):
        c.run("python scripts/fetch_custom_institutions.py hbcus 5")