Commitment erd #47

Closed
wants to merge 6 commits into from
96 changes: 96 additions & 0 deletions .github/workflows/test_update_api_database.yml
@@ -0,0 +1,96 @@
on: [push]

jobs:
  etl:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgis/postgis:15-3.4-alpine
        env:
          POSTGRES_PASSWORD: postgres
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          # Maps tcp port 5432 on service container to the host
          - 5432:5432
    steps:
      - name: check out repo code
        uses: actions/checkout@v4
      - name: Load Secrets
        uses: 1password/load-secrets-action@v1
        with:
          export-env: true
        env:
          OP_SERVICE_ACCOUNT_TOKEN: ${{ secrets.OP_SERVICE_ACCOUNT_TOKEN }}
          DO_SPACES_ENDPOINT: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_ENDPOINT"
          DO_SPACES_ACCESS_KEY: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_ACCESS_KEY"
          DO_SPACES_SECRET_KEY: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_SECRET_KEY"
          DO_SPACES_BUCKET_DISTRIBUTIONS: "op://AE Data Flow/Digital Ocean - S3 file storage/DO_SPACES_BUCKET_DISTRIBUTIONS"
          DO_ZONING_API_DB_HOST: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API/host"
          DO_ZONING_API_DB_PORT: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API/port"
          DO_ZONING_API_DB_USERNAME_DEV: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API dev/username"
          DO_ZONING_API_DB_PASSWORD_DEV: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API dev/password"
          DO_ZONING_API_DB_DATABASE_DEV: "op://AE Data Flow/Digital Ocean DB Cluster - Zoning API dev/database"
      - name: Set .env file
        run: |
          echo "BUILD_ENGINE_HOST=127.0.0.1" >> .env
          echo "BUILD_ENGINE_PORT=5432" >> .env
          echo "BUILD_ENGINE_USER=postgres" >> .env
          echo "BUILD_ENGINE_PASSWORD=postgres" >> .env
          echo "BUILD_ENGINE_DB=postgres" >> .env
          echo "DO_SPACES_ENDPOINT=$DO_SPACES_ENDPOINT" >> .env
          echo "DO_SPACES_ACCESS_KEY=$DO_SPACES_ACCESS_KEY" >> .env
          echo "DO_SPACES_SECRET_KEY=$DO_SPACES_SECRET_KEY" >> .env
          echo "DO_SPACES_BUCKET_DISTRIBUTIONS=$DO_SPACES_BUCKET_DISTRIBUTIONS" >> .env
          echo "ZONING_API_HOST=$DO_ZONING_API_DB_HOST" >> .env
          echo "ZONING_API_PORT=$DO_ZONING_API_DB_PORT" >> .env
          echo "ZONING_API_USER=$DO_ZONING_API_DB_USERNAME_DEV" >> .env
          echo "ZONING_API_PASSWORD=$DO_ZONING_API_DB_PASSWORD_DEV" >> .env
          echo "ZONING_API_DB=$DO_ZONING_API_DB_DATABASE_DEV" >> .env

      - name: Install prerequisite packages
        run: |
          sudo apt-get update
          sudo apt-get install -y wget
          sudo apt-get install -y git

      - name: Setup PostgreSQL
        uses: tj-actions/install-postgresql@v3
        with:
          postgresql-version: 15

      - name: Check postgres install
        run: pg_dump --version

      - name: Install minio client
        run: |
          sudo wget https://dl.min.io/client/mc/release/linux-amd64/mc
          sudo chmod +x mc
          sudo mv mc /usr/local/bin

      - name: Setup python
        uses: actions/setup-python@v5
        with:
          python-version-file: ".python-version"

      - name: Install python dependencies
        run: pip install -r requirements.txt

      - name: Install dbt dependencies
        run: dbt deps

      - name: Download
        run: ./bash/download.sh

      - name: Import
        run: ./bash/import.sh

      - name: Transform
        run: ./bash/transform.sh

      - name: Export
        run: ./bash/export.sh

53 changes: 53 additions & 0 deletions Dockerfile
@@ -0,0 +1,53 @@
FROM ubuntu:jammy

RUN apt-get update

# RUN apt install -y wget gpg gnupg2 software-properties-common apt-transport-https lsb-release ca-certificates
RUN apt-get install -y wget
RUN apt-get install -y software-properties-common

# ogr2ogr
RUN add-apt-repository ppa:ubuntugis/ppa
RUN apt-get update
RUN apt-get install -y gdal-bin

# psql from postgres-client
RUN sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
RUN wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
RUN apt-get update
RUN apt-get install -y postgresql-client-15


# minio client
RUN wget https://dl.min.io/client/mc/release/linux-amd64/mc
RUN chmod +x mc
RUN mv mc /usr/local/bin

# python
COPY requirements.txt /requirements.txt
RUN apt-get install -y python3 python3-pip libpq-dev
RUN pip install -r requirements.txt

# dbt
## config
COPY dbt_project.yml /dbt_project.yml
COPY package-lock.yml /package-lock.yml
COPY packages.yml /packages.yml
COPY profiles.yml /profiles.yml
## install
RUN apt-get install -y git
RUN dbt deps
## tests
COPY tests /tests

# etl
## scripts
COPY bash ./bash
## commands
COPY sql /sql
## local source files
COPY borough.csv /borough.csv
COPY land_use.csv /land_use.csv
COPY zoning_district_class.csv /zoning_district_class.csv

CMD ["sleep", "infinity"]
110 changes: 18 additions & 92 deletions README.md
@@ -5,25 +5,23 @@ This is the primary repository for the data pipelines of the Application Enginee
These pipelines are used to populate the databases used by our APIs and are called "data flows".

## Design
For all AE data flows, there is an ephemeral database within a dockerized runner.

For all AE data flows, there is one database cluster with a `staging` and a `prod` database. There are also `dev` databases. These are called data flow databases.

For each API, there is a database cluster with a `staging` and a `prod` database. The only tables in those databases are those that an API uses. These are called API databases.
For each API, there is a database cluster with a `data-qa` and a `prod` database. The only tables in those databases are those that an API uses. These are called API databases.

For each API and the relevant databases, this is the approach to updating data:

1. Load source data into the data flow database
1. Load source data into the data flow ephemeral database
2. Create tables that are identical in structure to the API database tables
3. Replace the rows in the API database tables

These steps are first performed on the `staging` sets of databases. When that process has succeeded and the API's use of it has passed QA, the same process is performed on the `prod` set of databases.
The exact data flow steps are refined while working in a `local` docker environment. Once the steps are stable, they are merged into `main`. From there, they are first run against a `data-qa` API database from within the `data-flow` GitHub Action. After the results pass quality checks, the `data-flow` GitHub Action is targeted at the `prod` API database.

This is a more granular description of those steps:

1. Download CSV files from Digital Ocean file storage
2. Copy CSV files into source data tables
3. Test source data tables
4. Create API tables in the data flow database
4. Create API tables in the data flow ephemeral database
5. Populate the API tables in the data flow database
6. Replace rows in API tables in the API database

@@ -41,112 +39,40 @@ We use a github action to perform API database updates.

We have three [environments](https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment) to configure the databases and credentials used for an API database update.

The `dev` environment can be used on any branch. The `staging` and `production` environments can only be used on the `main` branch.
The `dev` environment can be used on any branch. The `data-qa` and `production` environments can only be used on the `main` branch.

When an action attempts to use the `production` environment, specific people or teams specified in this repo's settings must approve the action run's access to the environment.

## Local setup

### Set up MinIO for S3 file transfers

> [!NOTE]
> These instructions are for local setup on macOS.

For non-public files like our CSVs in `/edm/distribution/`, we can use [minio](https://github.com/minio/minio) for authenticated file transfers.

#### Install

```bash
brew install minio/stable/mc
```

#### Add DO Spaces to the `mc` configuration

```bash
mc alias set spaces $DO_SPACES_ENDPOINT $DO_SPACES_ACCESS_KEY $DO_SPACES_SECRET_KEY
```

We use `spaces` here but you can name the alias anything. When you run `mc alias list` (`mc config host list` in older `mc` releases), you should see the newly added host with credentials from your `.env`.

### Setup python virtual environment

> [!NOTE]
> These instructions are for use of [pyenv](https://github.com/pyenv/pyenv) to manage python virtual environments. See [these instructions](https://github.com/pyenv/pyenv?tab=readme-ov-file#installation) to install it.
>
> If you are using a different approach like [venv](https://docs.python.org/3/library/venv.html) or [virtualenv](https://virtualenv.pypa.io/en/latest/), follow comparable instructions in the relevant docs.

The `.python-version` file defines which version of python this project uses.

#### Install

```bash
brew install pyenv
brew install pyenv-virtualenv
```

#### Create a virtual environment named `venv_ae_data_flow`

```bash
pyenv virtualenv venv_ae_data_flow
pyenv virtualenvs
```

#### Activate `venv_ae_data_flow` in the current terminal

```bash
pyenv activate venv_ae_data_flow
pyenv version
```

#### Install dependencies

```bash
python3 -m pip install --force-reinstall -r requirements.txt
pip list
dbt deps
```

### Setup postgres

We use `postgres` version 15 in order to use the `psql` CLI.

```bash
brew install postgresql@15
# Restart the terminal
psql --version
```

## Local usage
> These instructions depend on docker and docker compose.
> If you need to install docker compose, follow [these instructions](https://docs.docker.com/compose/install/).

### Set environment variables

Create a file called `.env` in the root folder of the project and copy the contents of `sample.env` into that new file.

Next, fill in the blank values.
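
A value left blank in `.env` tends to surface later as a confusing connection or download error. As a minimal sketch, the check below lists any variables that are still empty. `find_blank_env_vars` is a hypothetical helper and not part of this repo, and it only detects entries of the form `KEY=` (a quoted empty value such as `KEY=""` is not caught).

```bash
# Hypothetical helper (not part of this repo): print the names of variables
# in an env file that are still blank, one per line.
find_blank_env_vars() {
  local env_file="${1:-.env}"
  # A blank entry looks like KEY= followed only by optional whitespace.
  # Comment lines and filled-in values do not match this pattern.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=[[:space:]]*$' "$env_file" | cut -d= -f1
}
```

Running `find_blank_env_vars .env` prints nothing when every value has been filled in.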

> [!IMPORTANT]
> To use a local database, `sample_local.env` likely has the environment variable values you need.
>
> To use a deployed database in Digital Ocean, the values you need can be found in the AE 1password vault.
### Run the local zoning api database
The `data-flow` steps are run against the `zoning-api` database. Locally, this relies on these two containers running on the same network. The zoning-api creates the network, which the data-flow then joins.
Before continuing with the `data-flow` setup, follow the steps within `nycplanning/ae-zoning-api` to get its database running in a container.

### Run local database with docker compose

Next, use [docker compose](https://docs.docker.com/compose/) to stand up a local PostGIS database.
### Run data-flow local database with docker compose

```bash
./bash/utils/setup_local_db.sh
```
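
Postgres may need a few seconds after the container starts before it accepts connections. The sketch below retries a readiness check a fixed number of times, in the spirit of the `--health-retries` and `--health-interval` options used for the service container in the GitHub workflow. `wait_until_ready` is a hypothetical helper, not something this repo provides.

```bash
# Hypothetical helper (not part of this repo): retry a readiness command.
# Usage: wait_until_ready RETRIES INTERVAL_SECONDS COMMAND [ARGS...]
wait_until_ready() {
  local retries="$1"
  local interval="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$retries" ]; do
    # Succeed as soon as the readiness command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  # Every attempt failed.
  return 1
}
```

For example, `wait_until_ready 5 10 pg_isready -h "$BUILD_ENGINE_HOST" -p "$BUILD_ENGINE_PORT"` would wait up to five attempts, ten seconds apart, assuming `pg_isready` is installed where the command runs and the variables are loaded from your `.env`.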

If you need to install docker compose, follow [these instructions](https://docs.docker.com/compose/install/).

### Run each step
### Run each step to complete the data flow

```bash
./bash/download.sh
./bash/import.sh
./bash/transform.sh
./bash/export.sh
./bash/update_api_db.sh
docker compose exec data-flow bash ./bash/download.sh
docker compose exec data-flow bash ./bash/import.sh
docker compose exec data-flow bash ./bash/transform.sh
docker compose exec data-flow bash ./bash/export.sh
docker compose exec data-flow bash ./bash/update_api_db.sh
```
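
Because each step depends on the one before it, it can help to run them through a wrapper that stops at the first failure. This is a sketch, not a script from this repo: `run_data_flow` and the `RUNNER` variable are hypothetical. The default `RUNNER` copies the `docker compose exec data-flow bash` prefix from the commands above; note that `compose.yml` in this change names the service `runner`, so override `RUNNER` to match whichever service name your setup uses.

```bash
# Hypothetical wrapper (not part of this repo): run every data flow step in
# order and stop at the first one that fails.
# RUNNER is an assumption; set it to match your compose service name.
RUNNER="${RUNNER:-docker compose exec data-flow bash}"

run_data_flow() {
  local step
  for step in download import transform export update_api_db; do
    echo "Running ${step}"
    # RUNNER is left unquoted on purpose so it splits into command + arguments.
    if ! ${RUNNER} "./bash/${step}.sh"; then
      echo "Step ${step} failed" >&2
      return 1
    fi
  done
}
```

After sourcing the snippet, call `run_data_flow`; a non-zero exit status means the step named in the last message did not complete.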

If you receive an error, make sure the script has the correct permissions:
3 changes: 3 additions & 0 deletions bash/download.sh
@@ -11,6 +11,9 @@ source $ROOT_DIR/bash/utils/set_environment_variables.sh
# Setting Environmental Variables
set_envars

# set alias
mc alias set spaces $DO_SPACES_ENDPOINT $DO_SPACES_ACCESS_KEY $DO_SPACES_SECRET_KEY

# Download CSV files from Digital Ocean file storage
DATA_DIRECTORY=.data/
mkdir -p ${DATA_DIRECTORY} && (
5 changes: 5 additions & 0 deletions bash/import-project.sh
@@ -0,0 +1,5 @@
export PGPASSWORD=$BUILD_ENGINE_PASSWORD
ogr2ogr -nln project_point Pg:"dbname=$BUILD_ENGINE_DB host=$BUILD_ENGINE_HOST user=$BUILD_ENGINE_USER port=$BUILD_ENGINE_PORT" cpdb_projects_pts_23adopt -lco precision=NO -lco GEOMETRY_NAME=geom
ogr2ogr -nln project_polygon Pg:"dbname=$BUILD_ENGINE_DB host=$BUILD_ENGINE_HOST user=$BUILD_ENGINE_USER port=$BUILD_ENGINE_PORT" cpdb_projects_poly_23adopt -lco precision=NO -nlt PROMOTE_TO_MULTI -lco GEOMETRY_NAME=geom


30 changes: 0 additions & 30 deletions bash/utils/setup_local_db.sh

This file was deleted.

20 changes: 17 additions & 3 deletions compose.yml
@@ -1,13 +1,27 @@
services:
  db:
    container_name: ${BUILD_ENGINE_CONTAINER_NAME}
    build:
      context: ./db
      context: db/.
    environment:
      - POSTGRES_USER=${BUILD_ENGINE_USER}
      - POSTGRES_PASSWORD=${BUILD_ENGINE_PASSWORD}
      - POSTGRES_DB=${BUILD_ENGINE_DB}
    networks:
      - data
    ports:
      - "8001:5432"
  runner:
    build:
      context: .
    env_file:
      - .env
    networks:
      - data
    volumes:
      - ./db-volume:/var/lib/postgresql/data
      - ./tests:/tests
      - ./bash:/bash
      - ./sql:/sql
networks:
  data:
    name: ae-zoning-api_data
    external: true
2 changes: 1 addition & 1 deletion db/Dockerfile
@@ -1,4 +1,4 @@
FROM postgis/postgis:15-3.4
FROM postgres:15-bookworm

RUN apt update
RUN apt install -y postgresql-15-postgis-3
Binary file modified diagrams/infrastructure_api_data_flow.drawio.png