
Data flow for AE Zoning API #660

Closed · 13 tasks done
damonmcc opened this issue Mar 7, 2024 · 7 comments

Labels: project (Long-term work with multiple issues)

damonmcc commented Mar 7, 2024

Project Description

Application Engineering and Data Engineering will build a way to update the database used by the new Zoning API.

Outcomes

  1. "dev", "staging", and "production" instances of the Zoning API database, Zoning API, and AE Zoning Map PoC are hosted in Digital Ocean, Heroku, and Netlify, respectively.
  2. When a new version of PLUTO has been distributed by Data Engineering to a folder in Digital Ocean Spaces, AE can validate and transform it into the tables needed for the Zoning API database.
  3. Application Engineering can trigger a workflow to update the staging API database.
  4. An engineer or QA team member can QA the new data in the staging environment.
  5. Once the new data are deemed ready to promote to production, a workflow can be triggered to update the production API database.

Timing

Complete by end of March


Tasks

Bonus tasks

Design

We're now keeping the ae-data-flow README up to date with the latest design approach.

Diagram

(image: workflow_zoning_api_update)

Outline from 3/12

  • source data
    • PLUTO
      • will have been QA'd by GIS and will include Zoning Districts
    • from edm-distributions/dcp_pluto
      • DE will be pushing QA'd data there ASAP; they'll put it there manually for now
    • CSVs preferred because they're easy to copy into a database
  • load
    • copy CSVs into an AE database
  • transform
    • validate source data here?
    • transform tables via SQL queries
  • replace
    • replace the tables that the API points to
    • how? (see the SQL sketch after this list)
      • create new tables
        • tax_lots_new
      • rename old and new tables
        • old first, new second
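
A minimal sketch of the load-and-swap pattern described above, assuming a Postgres database and a hypothetical tax_lots table (the file path and table names are illustrative, not the actual schema):

```sql
-- Load: copy the distributed CSV into a new table with the same shape
-- (use psql's \copy instead if the file lives on the client machine).
CREATE TABLE tax_lots_new (LIKE tax_lots INCLUDING ALL);
COPY tax_lots_new FROM '/path/to/dcp_pluto.csv' WITH (FORMAT csv, HEADER true);

-- Transform: run whatever SQL transformations are needed against tax_lots_new here.

-- Replace: rename old first, new second, inside one transaction
-- so the API never sees a missing table.
BEGIN;
ALTER TABLE tax_lots RENAME TO tax_lots_old;
ALTER TABLE tax_lots_new RENAME TO tax_lots;
COMMIT;

DROP TABLE tax_lots_old;
```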

Motivations

Relevant discussion in ae-private repo here

The backend of the Zoning API (repo) is a Postgres database. The API table structures are defined in that repo in the /src/schema folder.

This API is used by the Zoning Map PoC (repo).

PLUTO and Zoning Tax Lots (ZTL) are the sources of the data in the API DB. These sources are updated monthly by Data Engineering, so some of the tables in the API DB must be updated monthly.

Notes

@TylerMatteo's poetry-boilerplate repo here

AE's equity-tool repo has an example here of a GitHub Action using staging & production environments.
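
As a rough sketch, such an action could look like the following, assuming a hypothetical manually triggered update workflow (the workflow name, secret, and script are placeholders, not AE's actual setup):

```yaml
# Hypothetical workflow: manually trigger a database update against
# a chosen GitHub environment (staging or production).
name: Update Zoning API database
on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        type: choice
        options: [staging, production]
jobs:
  update:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Run data flow
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
        run: ./update_database.sh  # placeholder script
```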

@damonmcc damonmcc added the project Long-term work with multiple issues label Mar 7, 2024
@damonmcc damonmcc moved this from New to 🏗 In progress in Data Engineering Mar 7, 2024
@damonmcc damonmcc moved this from 🏗 In progress to 📬 Next in Data Engineering Mar 8, 2024

damonmcc commented Mar 8, 2024

Data Engineering scripts that currently generate AE's normalized files here

DE data flow for building data products

fvankrieken commented

The actual logic to generate the outputs for AE is here. It's currently run only by a special GitHub Action that we trigger manually, which builds tables in a specific build schema and then exports them.

Since it's currently most convenient for us to generate these tables in the context of a PLUTO build (they use zoning district data and the PLUTO db table), we could do a few things:

  1. actually build these tables during the PLUTO build, then export (as in, the dumped tables land in the draft/publish folder with the other PLUTO outputs and can be pushed directly when we publish/package/trigger the AE ETL; sketched below). We could export as CSV, or export a subset of tables as a pg_dump (not a sql file, but --data-only compressed custom output that could be restored using pg_restore). That's a bit tidier than a sql file, and we don't have to worry about type safety. I think we could export a subset of tables as a single file this way too, but I need to refresh my memory on pg_dump.
  2. when we package, we build these tables. Maybe a tad cleaner conceptually, but it means zoning data would need to get pulled in during packaging, which is arguably less clean.
  3. keep the PLUTO build as is, and the AE ETL would need to actually process PLUTO to generate normalized output for loading. They would also need to pull in zoning data separately. Maybe a nice exercise to get them digesting our data and reading data from edm-recipes. But all of that is maybe adding unnecessary complexity right now.

I'm assuming we're thinking something along the lines of 1, right?
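
A minimal sketch of what option 1's export/restore could look like, assuming hypothetical table and database names (tax_lots, zoning_districts, pluto_build, and ae_zoning are placeholders):

```bash
# Export a subset of tables as a compressed, data-only, custom-format dump
# (one file, restorable with pg_restore).
pg_dump --data-only --format=custom --compress=9 \
  --table=tax_lots --table=zoning_districts \
  --dbname=pluto_build --file=ae_tables.dump

# Restore into the AE database; pg_restore replays the data loads
# into existing tables.
pg_restore --data-only --dbname=ae_zoning ae_tables.dump
```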

@damonmcc damonmcc self-assigned this Mar 11, 2024
damonmcc commented

per meeting today, we're going with option 3

> keep the PLUTO build as is, and the AE ETL would need to actually process PLUTO to generate normalized output for loading. They would also need to pull in zoning data separately. Maybe a nice exercise to get them digesting our data and reading data from edm-recipes. But all of that is maybe adding unnecessary complexity right now.


damonmcc commented Apr 1, 2024

One of our project outcomes is:

"dev", "staging", and "production" instances of the Zoning API database, Zoning API, and AE Zoning Map PoC are hosted in Digital Ocean, Heroku, and Netlify, respectively.

Instances of the Zoning API database are now hosted in Digital Ocean.

But deploying instances of the API and PoC seems unlikely to be accomplished within this project. The barriers seem to be AE deciding on:

  • deployment strategies (dev, staging, and prod environments)
  • hosting approaches (Heroku and Netlify were suggested, but AE is hesitant to keep using those cloud providers)
  • how/whether to containerize the API and PoC codebases (currently run locally via npm [command])

ideas

Zoning API deployment

The most modern AE approach seems to be the Population FactFinder API (repo), but it's hosted on Heroku.

We should probably host this on Digital Ocean or Azure.

Zoning Map PoC

The most modern AE approach seems to be the Population FactFinder app (repo), but it's hosted on Netlify.

We should probably host this on Digital Ocean or Azure.

suggested next steps

  • AE decides how to host new APIs and Apps


damonmcc commented Apr 3, 2024

notes from Project Review meeting on 4/3

It's got good bones.

next steps

  • deploy API and PoC on Heroku and Netlify
  • create issues for most impactful bonus tasks
  • close this project issue

fvankrieken commented

Just noting that pg_dump --data-only does use a COPY command under the hood
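
For illustration, the plain-format equivalent makes this visible (the table, database, and column names are placeholders):

```bash
# A plain-format, data-only dump is essentially a stream of
# COPY ... FROM stdin statements followed by the row data;
# the custom format encodes the same loads in binary form.
pg_dump --data-only --format=plain --table=tax_lots --dbname=ae_zoning | head
# Illustrative output shape:
#   COPY public.tax_lots (bbl, block, lot) FROM stdin;
#   1000010010	1	10
#   \.
```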

@damonmcc damonmcc moved this from 🏗 In progress to 🔍 In review in Data Engineering Apr 4, 2024

damonmcc commented Apr 5, 2024

Closing out this project!

All "bonus tasks" have been converted into issues in the ae-data-flow repo.

AE's new Data Flow workflow works with the current state of the Zoning API and Zoning Map PoC. Those two downstream consumers of data are currently developed and deployed locally.

Deploying remote instances of them is a significant task that is related to, but no longer a requirement of, this work being considered complete.

@damonmcc damonmcc closed this as completed Apr 5, 2024
@github-project-automation github-project-automation bot moved this from 🔍 In review to Done in Data Engineering Apr 5, 2024