Analytics Engineering Project Guidance #415

alex-pavlopoulos · 2024-06-25T11:29:50Z

Here we want to update AE related guidance to be inline with the ongoing discussion in AE. We are working through the DBT best practices and agreeing on where we want to diverge. This will serve as the basis for the guidance that will be available on the AP guide.

The main focus of this PR is to change the source/documentation/tools/create-a-derived-table/project-structure.md document and the source/documentation/tools/create-a-derived-table/data-modelling-concepts.md document.

Discussions for changes can be found here

source/documentation/tools/create-a-derived-table/project-structure.md

alex-pavlopoulos · 2024-06-26T13:11:06Z

source/documentation/tools/create-a-derived-table/project-structure.md

+  - ❌ **Subdirectories based on loader.** Some people attempt to group by how the data is loaded (Fivetran, Stitch, custom syncs), but this is too broad to be useful on a project of any real size.
+  - ✅ **Subdirectories based on business grouping.** Dbt recommends against this practice, however crate-a-derved-table has been built in a way that necessitates domains as subdirectories so that we can control access through [data egineering database access](https://github.com/moj-analytical-services/data-engineering-database-access/tree/main/database_access/create_a_derived_table). This is a key deviation from Dbt guidance.
+- **File names.** Creating a consistent pattern of file naming is [crucial in dbt](https://docs.getdbt.com/blog/on-the-importance-of-naming). File names must be unique and correspond to the name of the model when selected and created in the warehouse. We recommend putting as much clear information into the file name as possible, including a prefix for the layer the model exists in, important grouping information, and specific information about the entity or transformation in the model.
+  - ❌ `stg_[source]__[entity]s.sql` - ALthough it is recommended by DBT to follow this convention, where a double underscore is used between source system and entity, this will not work with create-a-derived-table. The naming convension in create-a-derived-table relies on the double underscore to distiguish between database and model names. Therefore, **you cannot use double underscores anywhere else in the name of a model**, this will result in an error when running the model. 


Response to the below comment, I have updated the point to this @hollyfurniss-moj ^^

#415 (comment)

I took this as 'stg_source' being the database name, and 'entity' the table name, in which case DBT's suggestion would work in CaDeT? But we do still need to make clear that double underscore can only be used like this in our environment

benwaterfield

a few suggestions on the style section - thanks Alex!

source/documentation/tools/create-a-derived-table/data-modelling-concepts.md

…ser-guidance into ae-style-guide

hollyfurniss-moj

Sorry I've included a few very pedantic comments about typos - mainly so I don't lose track of them (v happy to go through & fix myself). A couple of proper comments too.

source/documentation/tools/create-a-derived-table/data-modelling-concepts.md

source/documentation/tools/create-a-derived-table/project-structure.md

hollyfurniss-moj · 2024-07-23T13:24:02Z

source/documentation/tools/create-a-derived-table/data-modelling-concepts.md

Probably worth renaming this file / title - style guide

Just remembered we'll need to update the contents of the CaDeT guidance overall so our new pages are discoverable. Think it's here: https://github.com/moj-analytical-services/user-guidance/blob/main/source/documentation/tools/create-a-derived-table/index.md

rename file

source/documentation/tools/create-a-derived-table/project-structure.md

…ng-style-guide.md Removing as I've split off into a separate PR

ian-rickard

this is great - I learnt a lot.

I am keen to resolve the natural key vs. surrogate key issue

source/documentation/tools/create-a-derived-table/project-structure.md

ian-rickard · 2024-08-19T06:40:02Z

source/documentation/tools/create-a-derived-table/project-structure.md

+```
+
+- **Folders.** Folder structure is extremely important in dbt. Not only do we need a consistent structure to find our way around the codebase, as with any software project, but our folder structure is also one of the key interfaces for understanding the knowledge graph encoded in our project (alongside the DAG and the data output into our warehouse). It should reflect how the data flows, step-by-step, from a wide variety of source-conformed models into fewer, richer business-conformed models. Moreover, we can use our folder structure as a means of selection in dbt [selector syntax](https://docs.getdbt.com/reference/node-selection/syntax). For example, with the above structure, if we got fresh xhibit data loaded and wanted to run all the models that build on our xhibit data, we can easily run `dbt build --select staging/xhibit_stg+` and we’re all set for building more up-to-date reports on payments.
+  - ✅ **Subdirectories based on the source system**. Our internal transactional database is one system, the data we get from Stripe's API is another, and lastly the events from our Snowplow instrumentation. We've found this to be the best grouping for most companies, as source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.


I think this could be clearer as many readers will lack context. Is 'events from our Snowplow instrumentation' another system? I have no idea what Snowplow or Stripe are.

@benwaterfield @hollyfurniss-moj This does need changing to fit CaDT and Moj specific data, any thoughts on how this should read?

we could drop the example altogether?

Suggested change

- ✅ **Subdirectories based on the source system**. Our internal transactional database is one system, the data we get from Stripe's API is another, and lastly the events from our Snowplow instrumentation. We've found this to be the best grouping for most companies, as source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.

- ✅ **Subdirectories based on the source system**. Source systems tend to share similar loading methods and properties between tables, and this allows us to operate on those similar sets easily.

ian-rickard · 2024-08-19T06:46:32Z

source/documentation/tools/create-a-derived-table/project-structure.md

+    select
+
+        ----------  ids
+        id as cost_centre_id, -- primary key


In a best practice example we would not be using natural keys, but creating surrogate keys, no?

Suggested new code:

---------- ids {{ dbt_utils.generate_surrogate_key( ['id', 'a_id', ]) }} as unique_id, -- primary key id as cost_centre_id, -- natural key a_id as analysis_code_id, objective as cobjective_code_id,

Thanks. I know that we're not all in agreement about synthetic/surrogate keys and following Kimball in making all joins be based on surrogate keys. I think this is good as it shows how to generate a primary key while still maintaining ambiguity about what we might actually do with the keys. Others might disagree though!

source/documentation/tools/create-a-derived-table/project-structure.md

ian-rickard · 2024-08-19T07:39:27Z

source/documentation/tools/create-a-derived-table/project-structure.md

+  - `✅ int_[entity]s_[verb]s.sql` - the variety of transformations that can happen inside of the intermediate layer makes it harder to dictate strictly how to name them. The best guiding principle is to think about _verbs_ (e.g. `pivoted`, `aggregated_to_user`, `joined`, `fanned_out_by_quantity`, `funnel_created`, etc.) in the intermediate layer. In our example project, we use an intermediate model to pivot payments out to the order grain, so we name our model `int_payments_pivoted_to_orders`. It’s easy for anybody to quickly understand what’s happening in that model, even if they don’t know [SQL](https://mode.com/sql-tutorial/). That clarity is worth the long file name. It’s important to note that we’ve dropped the double underscores at this layer. In moving towards business-conformed concepts, we no longer need to separate a system and an entity and simply reference the unified entity if possible. In cases where you need intermediate models to operate at the source system level (e.g. `int_shopify__orders_summed`, `int_core__orders_summed` which you would later union), you’d preserve the double underscores. Some people like to separate the entity and verbs with double underscores as well. That’s a matter of preference, but in our experience, there is often an intrinsic connection between entities and verbs in this layer that make that difficult to maintain.
+
+:::tip Don’t over-optimize too early!
+The example project is very simple for illustrative purposes. This level of division in our post-staging layers is probably unnecessary when dealing with these few models. Remember, our goal is a _single_ _source of truth._ We don’t want finance and marketing operating on separate `orders` models, we want to use our dbt project as a means to bring those definitions together! As such, don’t split and optimize too early. If you have less than 10 marts models and aren’t having problems developing and using them, feel free to forego subdirectories completely (except in the staging layer, where you should always implement them as you add new source systems to your project) until the project has grown to really need them. Using dbt is always about bringing simplicity to complexity.


Should it not be "except in the staging layer where you should always use them if you add new source systems"

What about replacing "We don’t want finance and marketing operating on separate orders models" with something like "If two teams of analysts both have questions about counts of criminal court cases, we want them to be drawing upon the same data table(s)"

re first point: Whats the difference between the two sentences?
re second poin, I have added the suggested changes

maybe you should ignore that first point in hindsight. although I think having a staging domain will change the guidance.

…cture.md Co-authored-by: Ben Waterfield <[email protected]>

* Making suggested edits * adding marts discussion * quick changes * some updates * base model update * reading through and updating * continuing edit * first complete edit * small changes * adding links * second edit of section 1 * second edit on staging * second edit on intermediate * quick fixes * quick fixes * quick fixes * adding derived section and making suggested changes --------- Co-authored-by: “Alex <“[email protected]”>

source/documentation/tools/create-a-derived-table/project-structure.md

hollyfurniss-moj · 2024-12-13T13:21:17Z

source/documentation/tools/create-a-derived-table/project-structure.md

+Analytics Engineering is a relatively new function in the MoJ (as of November 2024). We sit in Data Modelling and Engineering Team (DMET), we are the 'Data Modelling' part! The title, [Analytics Engineer](https://www.getdbt.com/what-is-analytics-engineering) comes from the fact that we sit between Data Engineers on one side, who load the data onto our data warehouse; and Data Analysts (and other data users), who use the data on the other. There is a growing need for clean, standardised data that is consistent across the business and easy to use for any downstream users. That is where the Analytics Engineering function comes in. We aim to provide consistent, reusable data models conformed to common standards so that: 
+- Analysts and Modellers (the predicting ones) don't have to self-service and can focus on their analytical work.
+- Analysts working in different parts of the organisation have consistent experiences when using data models
+- Apparent discrepancies among numbers reported between different analytical teams is minimised


Suggested change

- Apparent discrepancies among numbers reported between different analytical teams is minimised

- Apparent discrepancies among numbers reported between different analytical teams are minimised

hollyfurniss-moj · 2024-12-13T13:23:54Z

source/documentation/tools/create-a-derived-table/project-structure.md

+      │   │
+      │   ├── courts_datamarts
+      │   │
+      │   └── criminal_courts_derived


potentially confusing that we have both courts and criminal_courts here - are the models common between criminal courts & CFT until the 'derived' layer?

I think I was originally trying to show that a derived layer could be a more specific purpose, but I think it is confusin gI will just make them all criminal_courts_...

source/documentation/tools/create-a-derived-table/project-structure.md

hollyfurniss-moj · 2024-12-13T13:49:34Z

source/documentation/tools/create-a-derived-table/project-structure.md

+
+### Datamarts: Introduction
+
+Here is where everything comes together, where our atoms and molecules are brought together to make cells with well-defined identity and purpose. This is also where, normally, the end user will see the data. For this reason a lot of thought has been put into the naming conventions we want to abide by in the MoJ. This guidance document has come after many years of iteratively changing and improving our processes in the Data Modelling and Engineering team, as well as across Data and Analysis. We have therefore put a lot of thought into how we can clearly signify the different kinds of databases that will be and have already been developed, and who they have been developed by. In the following two sections (datamarts and then derived) we will lay out what we expect from a project and how we see the categorisation of these pieces of work.


Noting the atom/molecule/cell reference here

hollyfurniss-moj · 2024-12-13T13:52:13Z

source/documentation/tools/create-a-derived-table/project-structure.md

+- ✅ `seeds` for lookup tables. The most common use case for seeds is loading lookup tables that are helpful for modelling but don’t exist in any source systems — think mapping postcodes to counties, or court codes to court meta data.
+- ❌ `seeds` for loading source data. Do not use seeds to load data from a source system into your warehouse. If it exists in a system you have access to, you should be loading it with a proper EL tool into the raw data area of your warehouse. dbt is designed to operate on data in the warehouse, not as a data-loading tool.
+- ✅ `tests` for testing multiple specific tables simultaneously. As dbt tests have evolved, writing singular tests has become less and less necessary. It's extremely useful for work-shopping test logic, but more often than not you'll find yourself either migrating that logic into your own custom generic tests or discovering a pre-built test that meets your needs from the ever-expanding universe of dbt packages (between the extra tests in [`dbt-utils`](https://github.com/dbt-labs/dbt-utils) and [`dbt-expectations`](https://github.com/calogica/dbt-expectations) almost any situation is covered). One area where singular tests still shine though is flexibly testing things that require a variety of specific models. If you're familiar with the difference between [unit tests](https://en.wikipedia.org/wiki/Unit_testing) [and](https://www.testim.io/blog/unit-test-vs-integration-test/) [integration](https://www.codecademy.com/resources/blog/what-is-integration-testing/) [tests](https://en.wikipedia.org/wiki/Integration_testing) in software engineering, you can think of generic and singular tests in a similar way. If you need to test the results of how several specific models interact or relate to each other, a singular test will likely be the quickest way to nail down your logic.


should this be data_tests now?

…cture.md Co-authored-by: hollyfurniss-moj <[email protected]>

benwaterfield · 2024-12-17T15:55:36Z

source/documentation/tools/create-a-derived-table/project-structure.md

+            +materialised: view
+    ```
+
+**:::TIP:** During development of a data model, it may be useful to materialise all your models as tables. This will allow you to more easily debug issues. When you are ready to merge your work with main then you can change the materialisation back to view. **:::**


https://www.freecodecamp.org/news/how-to-create-notice-blocks-in-markdown/
could be useful here

source/documentation/tools/create-a-derived-table/project-structure.md

…cture.md Co-authored-by: Ben Waterfield <[email protected]>

* initial changes * few edits * finishing changing analogy --------- Co-authored-by: “Alex <“[email protected]”>

project structure

51f08ef

hollyfurniss-moj reviewed Jun 25, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

alex-pavlopoulos added 2 commits June 26, 2024 12:48

some updates to staging

59617ff

spelling mistake

b1fafcb

alex-pavlopoulos commented Jun 26, 2024

View reviewed changes

alex-pavlopoulos and others added 8 commits June 26, 2024 15:10

more additions

ec5d4f3

adding style guide

582d776

small changes

eb5a2ab

big update to style guide

f6bd385

adding intermediate guide

1a3260f

formatting

0994fb1

more updates

b00f2b9

Update data-modelling-concepts.md

b2f8658

benwaterfield reviewed Jul 15, 2024

View reviewed changes

alex-pavlopoulos added 4 commits July 18, 2024 08:28

edits

5b788cd

fix

efa8e46

Merge branch 'ae-style-guide' of github.com:moj-analytical-services/u…

1838093

…ser-guidance into ae-style-guide

slack

364d0f6

hollyfurniss-moj reviewed Jul 19, 2024

View reviewed changes

hollyfurniss-moj reviewed Jul 23, 2024

View reviewed changes

hollyfurniss-moj added 3 commits July 24, 2024 15:35

Rename data-modelling-concepts.md to data-modelling-style-guide.md

0c8eaa2

rename file

minor restructure of content

039ada5

add dbt guidance version

083f336

benwaterfield reviewed Jul 24, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

benwaterfield reviewed Jul 24, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

benwaterfield reviewed Jul 24, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Show resolved Hide resolved

benwaterfield reviewed Jul 24, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

benwaterfield reviewed Jul 24, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

add jinja & yaml

35cae24

benwaterfield reviewed Jul 24, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

hollyfurniss-moj added 3 commits July 24, 2024 16:34

Update project-structure.md

5a197b8

fix some typos

280a707

Delete source/documentation/tools/create-a-derived-table/data-modelli…

40d71a8

…ng-style-guide.md Removing as I've split off into a separate PR

ian-rickard reviewed Aug 19, 2024

View reviewed changes

alex-pavlopoulos and others added 3 commits August 21, 2024 11:35

Update source/documentation/tools/create-a-derived-table/project-stru…

2f78b1f

…cture.md Co-authored-by: Ben Waterfield <[email protected]>

Making suggested edits (#437)

b327791

alex-pavlopoulos marked this pull request as ready for review December 12, 2024 14:17

alex-pavlopoulos requested review from a team as code owners December 12, 2024 14:17

“Alex added 4 commits December 12, 2024 14:50

adding derived section and making suggested changes

eec625a

Merge remote-tracking branch 'origin/main' into ae-style-guide

48bd24c

fixing some formatting and style issues

8015068

adding some commentary

b198f27

hollyfurniss-moj reviewed Dec 13, 2024

View reviewed changes

alex-pavlopoulos and others added 4 commits December 13, 2024 15:00

Update source/documentation/tools/create-a-derived-table/project-stru…

9dbe0a5

…cture.md Co-authored-by: hollyfurniss-moj <[email protected]>

Update source/documentation/tools/create-a-derived-table/project-stru…

ecc7d3c

…cture.md Co-authored-by: hollyfurniss-moj <[email protected]>

Update source/documentation/tools/create-a-derived-table/project-stru…

90853f5

…cture.md Co-authored-by: hollyfurniss-moj <[email protected]>

suggested changes

3b578f2

benwaterfield reviewed Dec 17, 2024

View reviewed changes

source/documentation/tools/create-a-derived-table/project-structure.md Outdated Show resolved Hide resolved

“Alex and others added 3 commits December 17, 2024 16:50

adding tip blocks

a80f9af

Update source/documentation/tools/create-a-derived-table/project-stru…

d3051e3

…cture.md Co-authored-by: Ben Waterfield <[email protected]>

Food preparation analogy (#477)

5ac5ac8

* initial changes * few edits * finishing changing analogy --------- Co-authored-by: “Alex <“[email protected]”>

bagg3rs approved these changes Dec 18, 2024

View reviewed changes

alex-pavlopoulos merged commit 261e21c into main Dec 18, 2024
1 check passed

alex-pavlopoulos deleted the ae-style-guide branch December 18, 2024 13:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Analytics Engineering Project Guidance #415

Analytics Engineering Project Guidance #415

alex-pavlopoulos commented Jun 25, 2024

alex-pavlopoulos Jun 26, 2024

benwaterfield Jul 24, 2024

benwaterfield left a comment

hollyfurniss-moj left a comment

hollyfurniss-moj Jul 23, 2024

benwaterfield Jul 24, 2024

ian-rickard left a comment •

edited

Loading

ian-rickard Aug 19, 2024

alex-pavlopoulos Aug 21, 2024

benwaterfield Aug 21, 2024

ian-rickard Aug 19, 2024

alex-pavlopoulos Aug 21, 2024

ian-rickard Aug 22, 2024 •

edited

Loading

ian-rickard Aug 19, 2024 •

edited

Loading

ian-rickard Aug 19, 2024

alex-pavlopoulos Aug 21, 2024

ian-rickard Aug 22, 2024

hollyfurniss-moj Dec 13, 2024

hollyfurniss-moj Dec 13, 2024

alex-pavlopoulos Dec 13, 2024

hollyfurniss-moj Dec 13, 2024

hollyfurniss-moj Dec 13, 2024

benwaterfield Dec 17, 2024

	- Apparent discrepancies among numbers reported between different analytical teams is minimised
	- Apparent discrepancies among numbers reported between different analytical teams are minimised


		### Datamarts: Introduction

		Here is where everything comes together, where our atoms and molecules are brought together to make cells with well-defined identity and purpose. This is also where, normally, the end user will see the data. For this reason a lot of thought has been put into the naming conventions we want to abide by in the MoJ. This guidance document has come after many years of iteratively changing and improving our processes in the Data Modelling and Engineering team, as well as across Data and Analysis. We have therefore put a lot of thought into how we can clearly signify the different kinds of databases that will be and have already been developed, and who they have been developed by. In the following two sections (datamarts and then derived) we will lay out what we expect from a project and how we see the categorisation of these pieces of work.

Analytics Engineering Project Guidance #415

Analytics Engineering Project Guidance #415

Conversation

alex-pavlopoulos commented Jun 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benwaterfield left a comment

Choose a reason for hiding this comment

hollyfurniss-moj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ian-rickard left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ian-rickard Aug 22, 2024 • edited Loading

Choose a reason for hiding this comment

ian-rickard Aug 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ian-rickard left a comment •

edited

Loading

ian-rickard Aug 22, 2024 •

edited

Loading

ian-rickard Aug 19, 2024 •

edited

Loading