Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Analytics Engineering Project Guidance #415

Merged
merged 36 commits into from
Dec 18, 2024
Merged

Conversation

alex-pavlopoulos
Copy link
Contributor

Here we want to update AE related guidance to be inline with the ongoing discussion in AE. We are working through the DBT best practices and agreeing on where we want to diverge. This will serve as the basis for the guidance that will be available on the AP guide.

The main focus of this PR is to change the source/documentation/tools/create-a-derived-table/project-structure.md document and the source/documentation/tools/create-a-derived-table/data-modelling-concepts.md document.

Discussions for changes can be found here

- ❌ **Subdirectories based on loader.** Some people attempt to group by how the data is loaded (Fivetran, Stitch, custom syncs), but this is too broad to be useful on a project of any real size.
- ✅ **Subdirectories based on business grouping.** Dbt recommends against this practice, however crate-a-derved-table has been built in a way that necessitates domains as subdirectories so that we can control access through [data egineering database access](https://github.com/moj-analytical-services/data-engineering-database-access/tree/main/database_access/create_a_derived_table). This is a key deviation from Dbt guidance.
- **File names.** Creating a consistent pattern of file naming is [crucial in dbt](https://docs.getdbt.com/blog/on-the-importance-of-naming). File names must be unique and correspond to the name of the model when selected and created in the warehouse. We recommend putting as much clear information into the file name as possible, including a prefix for the layer the model exists in, important grouping information, and specific information about the entity or transformation in the model.
- ❌ `stg_[source]__[entity]s.sql` - ALthough it is recommended by DBT to follow this convention, where a double underscore is used between source system and entity, this will not work with create-a-derived-table. The naming convension in create-a-derived-table relies on the double underscore to distiguish between database and model names. Therefore, **you cannot use double underscores anywhere else in the name of a model**, this will result in an error when running the model.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response to the below comment, I have updated the point to this @hollyfurniss-moj ^^

#415 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this as 'stg_source' being the database name, and 'entity' the table name, in which case DBT's suggestion would work in CaDeT? But we do still need to make clear that double underscore can only be used like this in our environment

Copy link
Contributor

@benwaterfield benwaterfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few suggestions on the style section - thanks Alex!

Copy link
Contributor

@hollyfurniss-moj hollyfurniss-moj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I've included a few very pedantic comments about typos - mainly so I don't lose track of them (v happy to go through & fix myself). A couple of proper comments too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth renaming this file / title - style guide

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remembered we'll need to update the contents of the CaDeT guidance overall so our new pages are discoverable. Think it's here: https://github.com/moj-analytical-services/user-guidance/blob/main/source/documentation/tools/create-a-derived-table/index.md

Copy link

@ian-rickard ian-rickard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great - I learnt a lot.

I am keen to resolve the natural key vs. surrogate key issue

```

- **Folders.** Folder structure is extremely important in dbt. Not only do we need a consistent structure to find our way around the codebase, as with any software project, but our folder structure is also one of the key interfaces for understanding the knowledge graph encoded in our project (alongside the DAG and the data output into our warehouse). It should reflect how the data flows, step-by-step, from a wide variety of source-conformed models into fewer, richer business-conformed models. Moreover, we can use our folder structure as a means of selection in dbt [selector syntax](https://docs.getdbt.com/reference/node-selection/syntax). For example, with the above structure, if we got fresh xhibit data loaded and wanted to run all the models that build on our xhibit data, we can easily run `dbt build --select staging/xhibit_stg+` and we’re all set for building more up-to-date reports on payments.
- ✅ **Subdirectories based on the source system**. Our internal transactional database is one system, the data we get from Stripe's API is another, and lastly the events from our Snowplow instrumentation. We've found this to be the best grouping for most companies, as source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be clearer as many readers will lack context. Is 'events from our Snowplow instrumentation' another system? I have no idea what Snowplow or Stripe are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwaterfield @hollyfurniss-moj This does need changing to fit CaDT and Moj specific data, any thoughts on how this should read?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could drop the example altogether?

Suggested change
-**Subdirectories based on the source system**. Our internal transactional database is one system, the data we get from Stripe's API is another, and lastly the events from our Snowplow instrumentation. We've found this to be the best grouping for most companies, as source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.
-**Subdirectories based on the source system**. Source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.

select

---------- ids
id as cost_centre_id, -- primary key

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a best practice example we would not be using natural keys, but creating surrogate keys, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested new code:

        ----------  ids
        {{ dbt_utils.generate_surrogate_key(
          ['id', 
          'a_id', ]) }} as unique_id, -- primary key
        id as cost_centre_id, -- natural key
        a_id as analysis_code_id,
        objective as cobjective_code_id,

Copy link

@ian-rickard ian-rickard Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I know that we're not all in agreement about synthetic/surrogate keys and following Kimball in making all joins be based on surrogate keys. I think this is good as it shows how to generate a primary key while still maintaining ambiguity about what we might actually do with the keys. Others might disagree though!

- `✅ int_[entity]s_[verb]s.sql` - the variety of transformations that can happen inside of the intermediate layer makes it harder to dictate strictly how to name them. The best guiding principle is to think about _verbs_ (e.g. `pivoted`, `aggregated_to_user`, `joined`, `fanned_out_by_quantity`, `funnel_created`, etc.) in the intermediate layer. In our example project, we use an intermediate model to pivot payments out to the order grain, so we name our model `int_payments_pivoted_to_orders`. It’s easy for anybody to quickly understand what’s happening in that model, even if they don’t know [SQL](https://mode.com/sql-tutorial/). That clarity is worth the long file name. It’s important to note that we’ve dropped the double underscores at this layer. In moving towards business-conformed concepts, we no longer need to separate a system and an entity and simply reference the unified entity if possible. In cases where you need intermediate models to operate at the source system level (e.g. `int_shopify__orders_summed`, `int_core__orders_summed` which you would later union), you’d preserve the double underscores. Some people like to separate the entity and verbs with double underscores as well. That’s a matter of preference, but in our experience, there is often an intrinsic connection between entities and verbs in this layer that make that difficult to maintain.

:::tip Don’t over-optimize too early!
The example project is very simple for illustrative purposes. This level of division in our post-staging layers is probably unnecessary when dealing with these few models. Remember, our goal is a _single_ _source of truth._ We don’t want finance and marketing operating on separate `orders` models, we want to use our dbt project as a means to bring those definitions together! As such, don’t split and optimize too early. If you have less than 10 marts models and aren’t having problems developing and using them, feel free to forego subdirectories completely (except in the staging layer, where you should always implement them as you add new source systems to your project) until the project has grown to really need them. Using dbt is always about bringing simplicity to complexity.
Copy link

@ian-rickard ian-rickard Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it not be "except in the staging layer where you should always use them if you add new source systems"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about replacing "We don’t want finance and marketing operating on separate orders models" with something like "If two teams of analysts both have questions about counts of criminal court cases, we want them to be drawing upon the same data table(s)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re first point: Whats the difference between the two sentences?
re second poin, I have added the suggested changes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you should ignore that first point in hindsight. although I think having a staging domain will change the guidance.

alex-pavlopoulos and others added 3 commits August 21, 2024 11:35
* Making suggested edits

* adding marts discussion

* quick changes

* some updates

* base model update

* reading through and updating

* continuing edit

* first complete edit

* small changes

* adding links

* second edit of section 1

* second edit on staging

* second edit on intermediate

* quick fixes

* quick fixes

* quick fixes

* adding derived section and making suggested changes

---------

Co-authored-by: “Alex <“[email protected]”>
@alex-pavlopoulos alex-pavlopoulos marked this pull request as ready for review December 12, 2024 14:17
@alex-pavlopoulos alex-pavlopoulos requested review from a team as code owners December 12, 2024 14:17
Analytics Engineering is a relatively new function in the MoJ (as of November 2024). We sit in Data Modelling and Engineering Team (DMET), we are the 'Data Modelling' part! The title, [Analytics Engineer](https://www.getdbt.com/what-is-analytics-engineering) comes from the fact that we sit between Data Engineers on one side, who load the data onto our data warehouse; and Data Analysts (and other data users), who use the data on the other. There is a growing need for clean, standardised data that is consistent across the business and easy to use for any downstream users. That is where the Analytics Engineering function comes in. We aim to provide consistent, reusable data models conformed to common standards so that:
- Analysts and Modellers (the predicting ones) don't have to self-service and can focus on their analytical work.
- Analysts working in different parts of the organisation have consistent experiences when using data models
- Apparent discrepancies among numbers reported between different analytical teams is minimised
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Apparent discrepancies among numbers reported between different analytical teams is minimised
- Apparent discrepancies among numbers reported between different analytical teams are minimised

│ │
│ ├── courts_datamarts
│ │
│ └── criminal_courts_derived
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially confusing that we have both courts and criminal_courts here - are the models common between criminal courts & CFT until the 'derived' layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was originally trying to show that a derived layer could be a more specific purpose, but I think it is confusin gI will just make them all criminal_courts_...


### Datamarts: Introduction

Here is where everything comes together, where our atoms and molecules are brought together to make cells with well-defined identity and purpose. This is also where, normally, the end user will see the data. For this reason a lot of thought has been put into the naming conventions we want to abide by in the MoJ. This guidance document has come after many years of iteratively changing and improving our processes in the Data Modelling and Engineering team, as well as across Data and Analysis. We have therefore put a lot of thought into how we can clearly signify the different kinds of databases that will be and have already been developed, and who they have been developed by. In the following two sections (datamarts and then derived) we will lay out what we expect from a project and how we see the categorisation of these pieces of work.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Noting the atom/molecule/cell reference here

- ✅ `seeds` for lookup tables. The most common use case for seeds is loading lookup tables that are helpful for modelling but don’t exist in any source systems — think mapping postcodes to counties, or court codes to court meta data.
- ❌ `seeds` for loading source data. Do not use seeds to load data from a source system into your warehouse. If it exists in a system you have access to, you should be loading it with a proper EL tool into the raw data area of your warehouse. dbt is designed to operate on data in the warehouse, not as a data-loading tool.
- ✅ `tests` for testing multiple specific tables simultaneously. As dbt tests have evolved, writing singular tests has become less and less necessary. It's extremely useful for work-shopping test logic, but more often than not you'll find yourself either migrating that logic into your own custom generic tests or discovering a pre-built test that meets your needs from the ever-expanding universe of dbt packages (between the extra tests in [`dbt-utils`](https://github.com/dbt-labs/dbt-utils) and [`dbt-expectations`](https://github.com/calogica/dbt-expectations) almost any situation is covered). One area where singular tests still shine though is flexibly testing things that require a variety of specific models. If you're familiar with the difference between [unit tests](https://en.wikipedia.org/wiki/Unit_testing) [and](https://www.testim.io/blog/unit-test-vs-integration-test/) [integration](https://www.codecademy.com/resources/blog/what-is-integration-testing/) [tests](https://en.wikipedia.org/wiki/Integration_testing) in software engineering, you can think of generic and singular tests in a similar way. If you need to test the results of how several specific models interact or relate to each other, a singular test will likely be the quickest way to nail down your logic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be data_tests now?

+materialised: view
```

**:::TIP:** During development of a data model, it may be useful to materialise all your models as tables. This will allow you to more easily debug issues. When you are ready to merge your work with main then you can change the materialisation back to view. **:::**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Alex and others added 3 commits December 17, 2024 16:50
* initial changes

* few edits

* finishing changing analogy

---------

Co-authored-by: “Alex <“[email protected]”>
@alex-pavlopoulos alex-pavlopoulos merged commit 261e21c into main Dec 18, 2024
1 check passed
@alex-pavlopoulos alex-pavlopoulos deleted the ae-style-guide branch December 18, 2024 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants