NHSDigital · andyblundell · Feb 27, 2024 · Mar 18, 2024 · Mar 18, 2024 · Mar 19, 2024
@@ -40,7 +40,7 @@ The framework is a companion to:
 
 The framework consists of:
 
-* [Engineering principles](principles.md)
+* [Engineering principles](principles.md), [blueprints](blueprints.md) and [red lines](red-lines.md)
 * [Engineering quality review tool](insights/review.md)
 * [Communities of practice guidelines](communities/communities-of-practice.md) and active communities:
   * [Product Development Test Automation Working Group](communities/pd-test-automation-working-group.md)

@@ -0,0 +1,15 @@
+# Engineering blueprints
+
+This is a list of blueprint solutions to common problems which are referenced within this quality framework.
+
+Where possible this will be a set of fully working components / solutions you can use. Where that's not possible, it will be instructions.
+
+| Topic                                                                                    | Type of blueprint | Classification | Status     |
+| :--------------------------------------------------------------------------------------- | :---------------- | :------------- | :--------- |
+| [Creating GitHub repositories](https://github.com/nhs-england-tools/repository-template) | Full solution     | Recommended    | Published  |
+| [Purging commits on GitHub](practices/guides/commit-purge.md)                            | Instructions      | Mandatory      | Published  |
+| [Signing commits on GitHub](practices/guides/commit-signing.md)                          | Instructions      | Recommended    | Published  |
+| [Cross-account backups on AWS](blueprints/backups-aws.md)                                | Instructions      | In progress    | Draft      |
+| [Automating performance-test decisions using APDEX](practices/performance-testing.md)    | Instructions      | Recommended    | Published  |
+| [Scanning source code for secrets](tools/nhsd-git-secrets/README.md)                     | Full solution     | Recommended    | Published  |
+
@@ -0,0 +1,37 @@
+# Cross-account backups on AWS
+
+## Context
+
+- These notes are part of a broader set of [blueprints](../blueprints.md)
+- This blueprint relates to [service reliability](../practices/service-reliability.md) and specifically to use of [cloud services](../practices/cloud-services.md)
+
+## Requirements
+
+The Backup Policy is supported by two additional papers: the Backup Standard and the Backup Design Pattern. Together, these documents place the following backup requirements on NHS England services:
+
+- *"critical data is saved in multiple backup locations"*
+- *"at least 3 copies"*
+- *"on 2 separate devices"*
+- *"1 copy being stored off-site and offline or be immutable by online means"*
+- *"ensure information & systems can be restored after an incident including but not limited to ransomware and insider attack"*
+- *"in line with (RPO & RTO) Recovery Point Objectives and Recovery Time Objectives"*
+
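The quoted requirements above can be read as a checklist. The following is a minimal sketch of such a check, not part of the Backup Standard itself. The `BackupCopy` type, its field names and the function name are all hypothetical. It assumes the live data counts as one of the copies, and it takes one reading of the final quote: a copy qualifies if it is both off-site and offline, or if it is immutable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data. All field names are hypothetical."""
    device: str      # identifier of the device or service holding the copy
    off_site: bool   # stored away from the primary location
    offline: bool    # not reachable online
    immutable: bool  # cannot be altered or deleted by online means


def unmet_requirements(copies: list[BackupCopy]) -> list[str]:
    """Return the quoted requirements that the given copies do not satisfy."""
    unmet = []
    if len(copies) < 3:
        unmet.append("at least 3 copies")
    if len({c.device for c in copies}) < 2:
        unmet.append("on 2 separate devices")
    # Assumed reading of the quote: (off-site AND offline) OR immutable.
    if not any((c.off_site and c.offline) or c.immutable for c in copies):
        unmet.append("1 copy off-site and offline, or immutable")
    return unmet
```

An empty result means the copies pass this simplified check. It says nothing about RPO, RTO or restore testing, which need to be assessed separately.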
+## Purpose
+
+- To protect NHS patients' data
+- To ensure NHS services remain available to the patients
+- To comply with legal and regulatory requirements
+
+Associated documents:
+
+- Backup Policy
+- Backup Standard
+- Backup Design Pattern
+
+## Supported services
+
+![Supported AWS services](backup-aws-supported-services.png)
+
+## Example: DynamoDB & S3
+
+![Example: DynamoDB and S3 backups](backup-aws-example.png)
@@ -15,6 +15,10 @@
   - [ARCHITECTURE-CLOUD](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/public-cloud-first)
   - [ARCHITECTURE-REUSE](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/reuse-before-buy-build)
 
+## Red lines
+
+This pattern relates to [**RED-LINE**](red-lines.md): All new services must be developed on public cloud
+
 ## The pattern
 
 Use managed services where available and appropriate. The aim is to reduce operational burden by handing responsibility to the cloud provider. They have made a business from doing this better than most organisations can.

@@ -21,23 +21,34 @@
 - Prefer serverless platform as a service (PaaS) over infrastructure as a service (IaaS) (see [outsource bottom up](../patterns/outsource-bottom-up.md)).
 - Where not serverless use ephemeral and immutable infrastructure.
 - Engage your cloud supplier early on in the development process. They have various tools and processes to help you (e.g. [AWS Well-Architected Review](https://aws.amazon.com/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&wa-lens-whitepapers.sort-order=desc)).
-- Understand cloud supplier SLAs.
-- Make systems self-healing.
-  - Prefer technologies which are resilient by default.
-  - Favour global-scoped (e.g. [CloudFront](https://aws.amazon.com/cloudfront/) or [Front Door](https://azure.microsoft.com/en-gb/pricing/details/frontdoor/)) or region-scoped services (e.g. [S3](https://aws.amazon.com/s3/), [Lambda](https://aws.amazon.com/lambda/), [Azure Functions](https://azure.microsoft.com/en-gb/products/functions/)) to availability-zone (AZ) scoped (e.g. [VMs](https://azure.microsoft.com/en-gb/services/virtual-machines/), [RDS DBs](https://aws.amazon.com/rds/)) or single-instance services (e.g. [EC2 instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)).
+- Make systems self-healing and resilient:
+  - Be aware that terms such as "region" have different meanings across different cloud vendors
+    - For example, it is not valid to compare the number of UK "regions" in AWS and Azure
+    - High levels of resilience can be achieved using UK-based cloud services for providers such as AWS and Azure, if the full scope & resilience of the clouds is used
+  - Also sometimes conflated in terms of resilience are cross-account and cross-region:
+    - As a minimum, all systems should have a tamper-proof cross-account backup to protect against account compromise, e.g. ransomware attack: see [blueprint for AWS-based systems](../blueprints/backups-aws.md)
+    - You may wish to additionally consider cross-region backups to protect against region failure
+  - Be aware of the resilience of any systems on which your system depends - for example, in a region-failure scenario, a standby for your system in a second region won't help if your system relies on another system which only runs in the single region which has failed
+  - Be aware of the difference between the resilience of cloud and your system's resilience in cloud
+    - Understand the SLAs of the cloud services you use.
+    - Every cloud service you use introduces more dependencies and more opportunities for service issues ...
+    - ... but, bespoke engineering to avoid using cloud vendor services introduces additional complexity and opportunities for reliability issues
+    - ... and, the risks are typically far greater for bespoke engineering, therefore: favour cloud services over bespoke engineering
+  - Prefer technologies which are resilient by default: favour global-scoped (e.g. [CloudFront](https://aws.amazon.com/cloudfront/) or [Front Door](https://azure.microsoft.com/en-gb/pricing/details/frontdoor/)) or region-scoped services (e.g. [S3](https://aws.amazon.com/s3/), [Lambda](https://aws.amazon.com/lambda/), [Azure Functions](https://azure.microsoft.com/en-gb/services/functions/)) to availability-zone (AZ) scoped (e.g. [VMs](https://azure.microsoft.com/en-gb/services/virtual-machines/), [RDS DBs](https://aws.amazon.com/rds/)) or single-instance services (e.g. [EC2 instance storage](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)).
   - For AZ-scoped services, use redundancy to create required resilience (e.g. [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html) or [Azure Scale/Availability Sets](https://docs.microsoft.com/en-us/azure/virtual-machines/availability)), and:
     - For stateless components use active-active configurations across AZs (e.g. running stateless containers across multiple AZs using [AWS Elastic Kubernetes Service](https://aws.amazon.com/eks/))
     - For stateful components, e.g. databases, consider use of active-active configurations across AZs (e.g. [Aurora Multi-Master](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-multi-master.html)), but be aware of the added complexity conflict resolution for asynchronous replication can bring and potential performance impact where synchronous replication is chosen.
   - Consider use of multiple regions (e.g. for AWS eu-west-1 [Dublin] as well as eu-west-2 [London]) as a way to improve availability, though ensure data sovereignty implications are understood and accepted (see below).
   - Understand failover (e.g. [RDS failover](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html#:~:text=Failover%20times%20are%20typically%2060%E2%80%93120%20seconds.)) and failed instance replacement times and engineer to accommodate these.
-- Be aware of data sovereignty implications of using any systems hosted outside the UK.
-  - Make sure your information governance lead is aware and included in decision making.
-  - Consider SaaS tools the team uses as well as the systems we build.
+  - Be aware of data sovereignty implications of using any systems hosted outside the UK.
+    - Make sure your information governance lead is aware and included in decision making.
+    - Consider SaaS tools the team uses as well as the systems we build.
 - Services should scale automatically up and down.
   - If possible, drive scaling based on metrics which matter to users (e.g. response time), but balance this with the benefits of choosing leading indicators (e.g. CPU usage) to avoid slow scaling from impacting user experience.
   - Understand how rapidly demand can spike and ensure scaling can meet these requirements. Balance scaling needs with the desire to avoid over-provisioning, and use [pre-warming](https://petrutandrei.wordpress.com/2016/03/18/pre-warming-the-load-balancer-in-aws/) judiciously where required. Discuss this with the cloud provider well before go-live: they can assist with pre-warming processes ([AWS](https://aws.amazon.com/premiumsupport/programs/iem/)).
 - Infrastructure should always be fully utilised (if it isn't, it's generating waste).
   - Though balance this with potential need to run with some overhead to accommodate failed instance replacement times without overloading remaining instances.
+  - [**RED-LINE**](../red-lines.md): Development and test environments must not be run 24 by 7
 - Keep up to date.
   - Services/components need prompt updates to dependencies where security vulnerabilities are found &mdash; even if they are not under active development.
   - Services which use deprecated or unsupported technologies should be migrated onto alternatives as a priority.
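
Two of the resilience points in the list above (every cloud service used adds a dependency, and redundancy across AZs creates resilience) can be illustrated with simple availability arithmetic. This is a rough sketch only: the function names are hypothetical, the figures in the usage note are illustrative rather than real vendor SLAs, and it assumes failures are independent, which real systems often violate.

```python
import math


def composite_availability(availabilities: list[float]) -> float:
    """Availability of a system that needs every dependency to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability of several independent replicas where any one suffices."""
    return 1 - (1 - availability) ** replicas
```

For example, `composite_availability([0.999, 0.999, 0.999])` is a little under 0.999, because each extra dependency lowers the total, while `redundant_availability(0.99, 2)` is higher than 0.99, because both replicas would have to fail together.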

@@ -14,13 +14,16 @@
 
 Our principles guide the way we work and interact with each other. They are based on the seven Lean principles as expressed in Lean Software Development: An Agile Toolkit by Mary Poppendieck and Tom Poppendieck.
 
+A subset of these principles relate to our [red lines](red-lines.md) - these are identified by links like this: [**RED-LINE**](red-lines.md)
+
 ### 1. Eliminate waste
 
 Waste is anything that interferes with giving customers what they really value at the time and place where it will provide the most value. Here are some examples, listed against the seven types of waste identified by Lean.
 
 **Inventory &mdash; partially done work**, e.g. plans and designs, code. Limit work in progress (WIP) and use a pull-based approach.
 
 **[Inventory &mdash; unnecessary resources](practices/cloud-services.md)** [ARCHITECTURE-SUSTAINABILITY](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/deliver-sustainable-services), e.g. server over-provisioning, complicated tools where simple ones would do. Adopt a "just enough, not just in case" mindset.
+  * [**RED-LINE**](red-lines.md): Development and test environments must not be run 24 by 7
 
 **Overproduction &mdash; building unnecessary features.** Start simple and basic, get feedback and iterate.
 
@@ -31,6 +34,7 @@ Waste is anything that interferes with giving customers what they really value a
 **Overproduction &mdash; reinventing the wheel.** Solving the same problem repeatedly in an organisation. Make sure there are effective ways to share knowledge between teams to avoid this.
 
 **[Overproduction &mdash; building when you could instead reuse or buy](practices/cloud-services.md).** [ARCHITECTURE-REUSE](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/reuse-before-buy-build) Remember to consider all these alternatives.
+  * [**RED-LINE**](red-lines.md): All new services must be developed on public cloud
 
 **Overproduction &mdash; premature optimisation for reusability.** Before making something reusable, first make it usable. Prefer explicit logic to implicit. Excessively generic systems create accidental complexity. [KISS](http://principles-wiki.net/principles:keep_it_simple_stupid) and [YAGNI](https://www.martinfowler.com/bliki/Yagni.html) and the caveats in [structured code](practices/structured-code.md) again.
 

@@ -0,0 +1,34 @@
+# Engineering red lines
+
+## Context
+
+- This is part of a broader [quality framework](README.md)
+
+## Overview
+
+The engineering principles, practices and patterns in this framework provide guidance around best-practice engineering.
+
+There is a lot of this guidance - it's all important, but we consider some of these principles, and some specific practices related to them, to be especially significant: we refer to these as our engineering "red lines", and we consider them to be requirements rather than guidance.
+
+We have chosen to put in place a governance process for:
+  * Any exceptions to these red lines in the services we build
+  * The ongoing re-assessment of any exceptions to the red lines
+  * Changes to the list of red lines
+
+You will find references to the red lines throughout this framework (the references look like this: [**RED-LINE**](red-lines.md)). For convenience, the complete set of red lines is listed on this page.
+
+Drafting notes for any changes to this list:
+  * Red lines must be specific and measurable. For example, [Bake in security](practices/security.md) is a good principle but would not be a valid red line, because it's open-ended. Some of the specific security practices that fall under this general security principle would, however, be suitable candidates for red lines.
+
+## Details
+
+### Cloud / Infrastructure
+
+1. All new services must be developed on public cloud
+    * Red line for the principle [overproduction &mdash; building when you could instead reuse or buy](principles.md#1-eliminate-waste) and the pattern [outsource from the bottom up](patterns/outsource-bottom-up.md)
+    * Relates to the architecture principles [ARCHITECTURE-CLOUD](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/public-cloud-first) and [ARCHITECTURE-REUSE](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/reuse-before-buy-build)
+    * For further details please see [cloud services](practices/cloud-services.md)
+2. Development and test environments must not be run 24 by 7: they must either be serverless (so incur minimal charges when not being run), or be run on demand and shut down daily, or be run for less than 8 hours a day
+    * Red line for the principle [inventory &mdash; unnecessary resources](principles.md#1-eliminate-waste)
+    * Relates to the architecture principle [ARCHITECTURE-SUSTAINABILITY](https://digital.nhs.uk/about-nhs-digital/our-work/nhs-digital-architecture/principles/deliver-sustainable-services)
+    * For further details please see [cloud services](practices/cloud-services.md) ("services should scale automatically up and down", "infrastructure should always be fully utilised", etc)
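
Because red lines must be specific and measurable, red line 2 lends itself to an automated check. The sketch below is one possible reading, not the governance process itself: the `Environment` type, its fields and the function name are hypothetical, and it treats the three options (serverless, on demand with daily shutdown, or under 8 running hours a day) as alternatives where any one is sufficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical description of a development or test environment."""
    name: str
    serverless: bool       # incurs minimal charges when not being run
    on_demand: bool        # only started when it is needed
    shut_down_daily: bool  # stopped or torn down every day
    hours_per_day: float   # typical running hours per day


def complies_with_red_line_2(env: Environment) -> bool:
    """Assumed reading: any one of the three options is sufficient."""
    return (
        env.serverless
        or (env.on_demand and env.shut_down_daily)
        or env.hours_per_day < 8
    )
```

A check like this could run against an inventory of environments, with any that fail being referred to the exceptions process described above.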