Repository Structure

fvankrieken edited this page Sep 19, 2024 · 5 revisions

admin

This folder is a catch-all for admin- and devops-related things. For now it contains two subfolders:

run_environment

This contains all files related to maintaining our run environments. This includes:

  • python requirements. We define our requirements in requirements.in and compile them with pip-tools, which resolves versions into requirements.txt. We also generate a constraints.txt, which is requirements.txt slightly reformatted so that pip can use it as a constraints file. This constraints file ensures that all of our docker images, regardless of which packages are actually installed on a given image, use the same version of each python package. The script that compiles them lives in admin/ops.
  • a docker subfolder. This folder contains bash scripts, Dockerfiles, and other files used to manage our various docker images. In addition to providing our dev container image for development, these images are used for most of our GitHub Actions. See more about our docker images here.
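As a rough illustration of the requirements.txt to constraints.txt step described above, here is a minimal sketch. It is not the actual script in admin/ops: the function name requirements_to_constraints is hypothetical, and the specific reformatting (dropping comments, pip options, and extras such as "[standard]", which pip does not accept in constraints files) is an assumption about what "slightly reformatted" involves.

```python
import re


def requirements_to_constraints(requirements_text: str) -> str:
    """Convert pip-compile style requirements text into constraints text.

    Hypothetical helper: the real reformatting rules may differ.
    """
    constraints = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comment lines, and pip options (e.g. --hash, --index-url).
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        # Drop trailing comments and line-continuation backslashes.
        line = line.split(" #", 1)[0].strip()
        line = line.rstrip("\\").strip()
        # Drop extras, e.g. "uvicorn[standard]==0.30.0" becomes "uvicorn==0.30.0".
        line = re.sub(r"\[[^\]]*\]", "", line, count=1)
        constraints.append(line)
    return "\n".join(constraints) + "\n"


sample = """\
# compiled from requirements.in
pandas==2.2.2
    # via -r requirements.in
uvicorn[standard]==0.30.0
"""
print(requirements_to_constraints(sample), end="")
```

A file produced this way could then be passed to pip with its -c option (for example, pip install -c constraints.txt pandas), so that an image installing only a subset of packages still gets the pinned versions.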

ops

This folder contains narrowly scoped scripts that we use for various devops-related tasks.

apps

Our apps folder contains any apps that we produce. For now, there is only one: our QA streamlit app, which is deployed on Digital Ocean. Each app folder should contain all code necessary for running and deploying it (outside of GitHub Actions).

bash

This folder contains bash utilities that are used across product builds, as well as a few bash scripts that are used for setting up environments, compiling python requirements, and other similar tasks in our builds/development. We're increasingly moving away from our bash utilities in favor of managing the control flow of our processes in python, but for now they're still used across the codebase.

dcpy

dcpy is our internal python package. Python is increasingly our language of choice for various parts of our product lifecycle, and dcpy contains numerous submodules for things like utilities, connectors to third parties, and our orchestrating lifecycle code. For more info, see dcpy

docs

Various code-generated documentation

products

products contains one folder for each of our data products, plus an extra one: "template", our sandbox data product for testing out new workflows and technology.

Each of these folders contains all information and code needed to build a product. The goal is for each one to consist of just two things:

  • a recipe file. This is a yaml file used by dcpy.lifecycle.builds to resolve versions of source datasets and load them into our build engine database.
  • transformation logic. We're moving in the direction of this being sql files (postgres) that are run by dbt, but we currently have a variety of structures and approaches across our products. In addition, every product still has some amount of bash scripting specific to that product, be it for running specific transformation steps (specifying the order of sql files for many products) or for generating export files. This logic will likely be moved to dcpy eventually as well, so that our product definitions can really be just two things: declarative metadata/instructions in yaml, and actual transformation logic in sql.
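To make "resolving versions of source datasets" more concrete, here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the recipe layout (shown as an already-parsed dict rather than yaml), the "latest" placeholder, the dataset names, and the resolve_versions function are hypothetical and do not reflect the actual recipe schema or the dcpy.lifecycle.builds API.

```python
from copy import deepcopy


def resolve_versions(recipe: dict, latest_versions: dict[str, str]) -> dict:
    """Return a copy of the recipe with every dataset pinned to a concrete version.

    Hypothetical helper: datasets with no version, or with "latest",
    are pinned using the supplied lookup of latest known versions.
    """
    resolved = deepcopy(recipe)  # leave the declarative recipe untouched
    for dataset in resolved["inputs"]["datasets"]:
        if dataset.get("version", "latest") == "latest":
            dataset["version"] = latest_versions[dataset["name"]]
    return resolved


# Hypothetical recipe, as it might look after parsing the yaml file.
recipe = {
    "name": "template",
    "inputs": {
        "datasets": [
            {"name": "dataset_a", "version": "latest"},
            {"name": "dataset_b", "version": "2024-01-01"},
            {"name": "dataset_c"},
        ]
    },
}

# Hypothetical lookup of the latest available version per dataset.
latest_versions = {"dataset_a": "2024-09-01", "dataset_c": "2024-08-15"}

resolved = resolve_versions(recipe, latest_versions)
for dataset in resolved["inputs"]["datasets"]:
    print(dataset["name"], dataset["version"])
```

Keeping the recipe declarative and producing a separate, fully pinned copy is one way a build could record exactly which source versions it used, while the recipe itself continues to say "latest".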