
Data flow for AE Zoning API #660

Closed · 13 tasks done
damonmcc opened this issue Mar 7, 2024 · 7 comments

Labels: project (Long-term work with multiple issues)

damonmcc commented Mar 7, 2024

Project Description

Application Engineering and Data Engineering will build a way to update the database used by the new Zoning API.

Outcomes

  1. "dev", "staging", and "production" instances of the Zoning API database, Zoning API, and AE Zoning Map PoC are hosted in Digital Ocean, Heroku, and Netlify, respectively.
  2. When a new version of PLUTO has been distributed by Data Engineering to a folder in Digital Ocean Spaces, AE can validate and transform it into the tables needed for the Zoning API database.
  3. Application Engineering can trigger a workflow to update the staging API database.
  4. An engineer or QA team member can QA the new data in the staging environment.
  5. Once the new data are deemed ready to promote to production, a workflow can be triggered to update the production API database.

Timing

Complete by end of March


Tasks

Bonus tasks

Design

We're now keeping the ae-data-flow README up to date with the latest design approach.

Diagram

(image: workflow_zoning_api_update)

Outline from 3/12

  • source data
    • PLUTO
      • will have been QA'd by GIS and will include Zoning Districts
    • from edm-distributions/dcp_pluto
      • DE will be pushing QA'd data there ASAP; they'll put it there manually for now
    • CSVs preferred because they're easy to copy into a database
  • load
    • copy CSVs into an AE database
  • transform
    • validate source data here?
    • transform tables via SQL queries
  • replace
    • replace the tables that the API points to
    • how? (see the SQL sketch after this list)
      • create new tables
        • tax_lots_new
      • rename old and new tables
        • old first, new second
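
A minimal sketch of the load-and-swap pattern described above, assuming a Postgres database and a hypothetical tax_lots table (the file path and table names are illustrative, not the actual schema):

```sql
-- Load: copy the distributed CSV into a new table with the same shape
-- (use psql's \copy instead if the file lives on the client machine).
CREATE TABLE tax_lots_new (LIKE tax_lots INCLUDING ALL);
COPY tax_lots_new FROM '/path/to/dcp_pluto.csv' WITH (FORMAT csv, HEADER true);

-- Transform: run whatever SQL transformations are needed against tax_lots_new here.

-- Replace: rename old first, new second, inside one transaction
-- so the API never sees a missing table.
BEGIN;
ALTER TABLE tax_lots RENAME TO tax_lots_old;
ALTER TABLE tax_lots_new RENAME TO tax_lots;
COMMIT;

DROP TABLE tax_lots_old;
```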

Motivations

Relevant discussion in ae-private repo here

The backend of the Zoning API (repo) is a Postgres database. The API table structures are defined in that repo in the /src/schema folder.

This API is used by the Zoning Map PoC (repo).

PLUTO and Zoning Tax Lots (ZTL) are the sources of the data in the API DB. These sources are updated monthly by Data Engineering, so some of the tables in the API DB must be updated monthly.

Notes

@TylerMatteo's poetry-boilerplate repo here

AE's equity-tool repo has an example here of a GitHub Action using staging & production environments.
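
As a rough sketch, such an action could look like the following, assuming a hypothetical manually triggered update workflow (the workflow name, secret, and script are placeholders, not AE's actual setup):

```yaml
# Hypothetical workflow: manually trigger a database update against
# a chosen GitHub environment (staging or production).
name: Update Zoning API database
on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        type: choice
        options: [staging, production]
jobs:
  update:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Run data flow
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
        run: ./update_database.sh  # placeholder script
```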

@damonmcc damonmcc added the project Long-term work with multiple issues label Mar 7, 2024
@damonmcc damonmcc moved this from New to 🏗 In progress in Data Engineering Mar 7, 2024
@damonmcc damonmcc moved this from 🏗 In progress to 📬 Next in Data Engineering Mar 8, 2024

damonmcc commented Mar 8, 2024

Data Engineering scripts that currently generate AE's normalized files here

DE data flow for building data products

fvankrieken commented

The actual logic to generate the outputs for AE is here. It's currently run only by a special GitHub Action that we trigger manually, which builds tables in a specific build schema and then exports them.

Since it's currently most convenient for us to generate these tables in the context of a PLUTO build (they use zoning district data and the PLUTO db table), we could do a few things:

  1. actually build these tables during the PLUTO build, then export (as in, the dumped tables land in the draft/publish folder with the other PLUTO outputs and can be pushed directly when we publish/package/trigger the AE ETL; sketched below). We could export as CSV, or export a subset of tables as a pg_dump (not a sql file, but --data-only compressed custom output that could be restored using pg_restore). That's a bit tidier than a sql file, and we don't have to worry about type safety. I think we could export a subset of tables as a single file this way too, but I need to refresh my memory on pg_dump.
  2. when we package, we build these tables. Maybe a tad cleaner conceptually, but it means zoning data would need to get pulled in during packaging, which is arguably less clean.
  3. keep the PLUTO build as is, and the AE ETL would need to actually process PLUTO to generate normalized output for loading. They would also need to pull in zoning data separately. Maybe a nice exercise to get them digesting our data and reading data from edm-recipes. But all of that is maybe adding unnecessary complexity right now.

I'm assuming we're thinking something along the lines of 1, right?
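
A minimal sketch of what option 1's export/restore could look like, assuming hypothetical table and database names (tax_lots, zoning_districts, pluto_build, and ae_zoning are placeholders):

```bash
# Export a subset of tables as a compressed, data-only, custom-format dump
# (one file, restorable with pg_restore).
pg_dump --data-only --format=custom --compress=9 \
  --table=tax_lots --table=zoning_districts \
  --dbname=pluto_build --file=ae_tables.dump

# Restore into the AE database; pg_restore replays the data loads
# into existing tables.
pg_restore --data-only --dbname=ae_zoning ae_tables.dump
```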

@damonmcc damonmcc self-assigned this Mar 11, 2024
damonmcc commented

per meeting today, we're going with option 3

> keep the PLUTO build as is, and the AE ETL would need to actually process PLUTO to generate normalized output for loading. They would also need to pull in zoning data separately. Maybe a nice exercise to get them digesting our data and reading data from edm-recipes. But all of that is maybe adding unnecessary complexity right now.


damonmcc commented Apr 1, 2024

One of our project outcomes is:

"dev", "staging", and "production" instances of the Zoning API database, Zoning API, and AE Zoning Map PoC are hosted in Digital Ocean, Heroku, and Netlify, respectively.

Instances of the Zoning API database are now hosted in Digital Ocean.

But deploying instances of the API and PoC seems unlikely to be accomplished within this project. The barriers seem to be AE deciding on:

  • deployment strategies (dev, staging, and prod environments)
  • hosting approaches (Heroku and Netlify were suggested, but AE is hesitant to keep using those cloud providers)
  • how/whether to containerize the API and PoC codebases (currently run locally via npm [command])

ideas

Zoning API deployment

The most modern AE approach seems to be the Population FactFinder API (repo), but it's hosted on Heroku.

We should probably host this on Digital Ocean or Azure.

Zoning Map PoC

The most modern AE approach seems to be the Population FactFinder app (repo), but it's hosted on Netlify.

We should probably host this on Digital Ocean or Azure.

suggested next steps

  • AE decides how to host new APIs and Apps


damonmcc commented Apr 3, 2024

notes from Project Review meeting on 4/3

It's got good bones.

next steps

  • deploy API and PoC on Heroku and Netlify
  • create issues for most impactful bonus tasks
  • close this project issue

fvankrieken commented

Just noting that pg_dump --data-only does use a COPY command under the hood
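
For illustration, the plain-format equivalent makes this visible (the table, database, and column names are placeholders):

```bash
# A plain-format, data-only dump is essentially a stream of
# COPY ... FROM stdin statements followed by the row data;
# the custom format encodes the same loads in binary form.
pg_dump --data-only --format=plain --table=tax_lots --dbname=ae_zoning | head
# Illustrative output shape:
#   COPY public.tax_lots (bbl, block, lot) FROM stdin;
#   1000010010	1	10
#   \.
```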

@damonmcc damonmcc moved this from 🏗 In progress to 🔍 In review in Data Engineering Apr 4, 2024

damonmcc commented Apr 5, 2024

Closing out this project!

All "bonus tasks" have been converted into issues in the ae-data-flow repo.

AE's new Data Flow workflow works with the current state of the Zoning API and Zoning Map PoC. Those two downstream consumers of data are currently developed and deployed locally.

Deploying remote instances of them is a significant task that is related to, but no longer a requirement of, this work being considered complete.

@damonmcc damonmcc closed this as completed Apr 5, 2024
@github-project-automation github-project-automation bot moved this from 🔍 In review to Done in Data Engineering Apr 5, 2024