Part of the Multi-team Software Delivery Assessment (README)
Copyright © 2018-2019 Conflux Digital Ltd
Licensed under CC BY-SA 4.0
Permalink: SoftwareDeliveryAssessment.com
Based on the operability assessment questions from Team Guide to Software Operability by Matthew Skelton, Alex Moore, and Rob Thatcher at OperabilityQuestions.com
Purpose: Assess the team's awareness and practices in relation to software operability (readiness for Production)
Method: Use the Spotify Squad Health Check approach to assess and record the team's answers to the following questions:
Question | Tired (1) | Inspired (5) |
---|---|---|
1. Collaboration - How often and in what ways do we collaborate with other teams on operational aspects of the system, such as operational features (logging, monitoring, alerting, etc.) and non-functional requirements (NFRs)? | We respond to the need for operational aspects after go-live, when tickets are raised by the live service teams | We collaborate on operational aspects from the very first week of the engagement/project
2. Spend on operability - What proportion of product budget and team effort is spent addressing operational aspects? How do we track this? [Ignore infrastructure costs and focus on team effort] | We try to spend as little time and effort as possible on operational aspects / We do not track the spend on operational aspects at all | We spend around 30% of our time and budget addressing operational aspects
3. Feature Toggles - How do we know which feature toggles (feature switches) are active for this subsystem? | We need to run diffs against config files to determine which feature toggles are active | We have a simple UI or API to report the active/inactive feature flags in an environment (see example sketch below)
4. Config deployment - How do we deploy a configuration change without redeploying the software? | We cannot deploy a configuration change without deploying the software or causing an outage | We simply run a config deployment separately from the software / We deploy config together with the software without an outage |
5. System health - How do we know that the system is healthy (or unhealthy)? | We wait for checks made manually by another team to tell us if our software is healthy | We query the software using a standard HTTP healthcheck URL, returning HTTP 200/500, etc. based on logic that we write in the code, and with synthetic transaction monitoring for key scenarios (see example sketch below)
6. Service KPIs - How do we track the main service/system Key Performance Indicators (KPIs)? What are the KPIs? | We do not have service KPIs defined | We use logging and/or time series metrics to emit service KPIs that are picked up by a dashboard (see example sketch below)
7. Logging working - How do we know that logging is working correctly? | We do not test if logging is working | We test that logging is working using BDD feature tests that search for specific log message strings after a particular application behaviour is executed, and we can see the logs appear correctly in the central log aggregation/search system (see example sketch below)
8. Testability - How do we show that the software system is easy to test? What do we provide and to whom? | We do not explicitly aim to make our software easily testable | We run clients and external test packs against all parts of our software within our deployment pipeline |
9. TLS Certs - How do we know when an SSL/TLS certificate is close to expiry? | We do not know when our certificates are going to expire | We use auto-renewal of certificates combined with certificate monitoring/alerting tools to keep a live check on when certs will expire, so we can take remedial action ahead of time (see example sketch below)
10. Sensitive data - How do we ensure that sensitive data in logs is masked or hidden? | We do not test for sensitive data in logs | We test that data masking is happening by using BDD feature tests that search for specific log message strings after a particular application behaviour is executed |
11. Performance - How do we know that the system/service performs within acceptable ranges? | We rely solely on the Performance team to validate the performance of our service or application | We run a set of indicative performance tests within our deployment pipeline that are run on every check-in to version control |
12. Failure modes - How can we see and share the different known failure modes (failure scenarios) for the system? | We do not really know how the system might fail | We use a set of error identifiers to define the failure modes in our software and we use these identifiers in our log messages (see example sketch below)
13. Call tracing - How do we trace a call/request end-to-end through the system? | We do not trace calls through the system | We use a standard tracing library such as OpenTracing to trace calls through the system. We collaborate with other teams to ensure that the correct tracing fields are maintained across component boundaries. (See example sketch below.)
14. Service status - How do we display the current service/system status to operations-facing teams? | Operations teams tend to discover the status indicators themselves | We build a dashboard in collaboration with the Operations teams so they have all the details they need in a user-friendly way with UX a key consideration |
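
The example sketches below illustrate some of the "Inspired" practices from the table above. They are minimal, hedged illustrations rather than prescribed implementations; names, endpoints, and thresholds are assumptions.

Question 3 (Feature Toggles): a minimal sketch of a reporting API for feature flags, assuming a Python/Flask service. The flag names and the in-memory FLAGS store are illustrative placeholders for whatever config store or flag service the team actually uses.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative flag store; a real service would read this from config,
# a database, or a dedicated feature-flag service.
FLAGS = {
    "new-checkout-flow": True,
    "beta-search": False,
}

@app.route("/feature-flags")
def feature_flags():
    """Report which feature toggles are active/inactive in this environment."""
    return jsonify(FLAGS)

if __name__ == "__main__":
    app.run(port=8080)
```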
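
Question 5 (System health): a minimal sketch of an HTTP healthcheck endpoint that returns 200/500 based on logic in the code, again assuming Flask. The check_database() and check_message_queue() helpers are illustrative placeholders; synthetic transaction monitoring for key scenarios would sit alongside this and is outside the scope of the sketch.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # Placeholder: e.g. run `SELECT 1` against the primary datastore.
    return True

def check_message_queue() -> bool:
    # Placeholder: e.g. verify the broker connection is open.
    return True

@app.route("/health")
def health():
    """Return HTTP 200 when all dependency checks pass, HTTP 500 otherwise."""
    checks = {
        "database": check_database(),
        "message_queue": check_message_queue(),
    }
    status = 200 if all(checks.values()) else 500
    return jsonify(checks), status

if __name__ == "__main__":
    app.run(port=8080)
```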
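
Question 6 (Service KPIs): a minimal sketch of emitting KPIs as time-series metrics using the prometheus_client library. The metric names and the order-processing KPI are assumptions made for illustration; a dashboard would chart whatever is scraped from the /metrics endpoint.

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative KPIs for a hypothetical order-processing service.
ORDERS_PROCESSED = Counter("orders_processed_total", "Orders processed successfully")
ORDER_LATENCY = Histogram("order_processing_seconds", "Time taken to process an order")

def process_order(order: dict) -> None:
    with ORDER_LATENCY.time():  # records processing latency as a time-series metric
        time.sleep(0.05)        # placeholder for the real business logic
    ORDERS_PROCESSED.inc()

if __name__ == "__main__":
    start_http_server(9100)     # exposes /metrics for the scraper feeding the dashboard
    while True:
        process_order({"id": 1})
        time.sleep(1)
```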
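
Question 7 (Logging working): a minimal sketch of a test that executes a behaviour and asserts the expected log message string was emitted, using pytest's caplog fixture. The reset_password() function and its message are illustrative assumptions; a fuller BDD feature test would also query the central log aggregation/search system for the same string.

```python
import logging

logger = logging.getLogger("accounts")

def reset_password(user: str) -> None:
    # Illustrative application behaviour under test.
    logger.info("Password reset requested for user=%s", user)

def test_password_reset_is_logged(caplog):
    # When: execute the behaviour that should produce a known log message.
    with caplog.at_level(logging.INFO, logger="accounts"):
        reset_password("test-user")
    # Then: the specific log message string was emitted.
    assert "Password reset requested" in caplog.text
```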
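
Question 9 (TLS Certs): a minimal sketch of a certificate-expiry check using Python's standard ssl module. The host list and the 30-day threshold are assumptions; in practice this kind of check would feed a monitoring/alerting tool and run alongside auto-renewal.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> float:
    """Connect to the host and return the number of days until its certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    for host in ["example.com"]:  # illustrative host list
        remaining = days_until_expiry(host)
        level = "WARNING" if remaining < 30 else "OK"  # illustrative 30-day threshold
        print(f"{level}: {host} certificate expires in {remaining:.0f} days")
```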
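
Question 12 (Failure modes): a minimal sketch of a shared catalogue of error identifiers that appear verbatim in log messages, so known failure scenarios are easy to search for and alert on. The identifiers themselves are invented for illustration.

```python
import logging
from enum import Enum

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")

class FailureMode(Enum):
    """Illustrative catalogue of known failure scenarios, shared with operations teams."""
    PAY_001_UPSTREAM_TIMEOUT = "PAY-001"  # payment provider did not respond in time
    PAY_002_INVALID_CARD = "PAY-002"      # card details rejected by the provider
    PAY_003_DB_UNAVAILABLE = "PAY-003"    # payment record could not be persisted

def record_failure(mode: FailureMode, detail: str) -> None:
    # The identifier appears verbatim in the log line, so it is easy to search and alert on.
    logger.error("[%s] %s: %s", mode.value, mode.name, detail)

record_failure(FailureMode.PAY_001_UPSTREAM_TIMEOUT, "no response after 3 retries")
```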
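
Question 13 (Call tracing): a minimal sketch using the OpenTracing API named in the question: start a span for a request and inject its context into the headers of a downstream call, so the trace continues across the component boundary. A concrete tracer (for example a Jaeger client) would be configured at startup; the default tracer is a no-op, and the handle_request/downstream_call names are illustrative.

```python
import opentracing
from opentracing.propagation import Format

def handle_request(downstream_call) -> None:
    tracer = opentracing.global_tracer()  # a concrete tracer is registered at startup
    with tracer.start_active_span("handle-request") as scope:
        headers: dict = {}
        # Inject the trace context into outgoing headers so the next service
        # can continue the same trace across the component boundary.
        tracer.inject(scope.span.context, Format.HTTP_HEADERS, carrier=headers)
        downstream_call(headers)

handle_request(lambda headers: print("outgoing headers:", headers))
```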