website: refine introduction to be more concise
jeffmccune committed Sep 24, 2024
1 parent c58510b commit a95abe6
Showing 1 changed file with 24 additions and 87 deletions.
111 changes: 24 additions & 87 deletions doc/md/introduction.md
@@ -19,93 +19,29 @@ platform.

## Backstory

At [Open Infrastructure Services], each of us has helped dozens of companies
build and operate their software development platforms over the course of our
careers. During the U.S. presidential election just before the pandemic, our
second-largest client, Twitter, had a global outage lasting the better part of
a day. At the time, we were helping the core infrastructure team by managing
their production configuration management system so the team could focus on
delivering key objectives for the business. In this position, we had a
front-row seat to what happened that day.

One of Twitter's employees, a close friend of ours and an engineer on the
team, landed a trivial one-line change to the firewall configuration. Less
than 30 minutes later, literally everything was down. The one-line change,
which passed code review and seemed harmless, caused the host firewall to
revert to its default state on hundreds of thousands of servers. All
connections to all servers globally were blocked and dropped. Except SSH,
thankfully. At least one Presidential candidate complained loudly.

This incident led us to think deeply about a few key problems with Twitter's
software development platform, problems that made their way all the way up to
the board of directors to solve.

1. **Lack of Visibility** - There was no way to clearly see the impact a simple
   one-line change could have on the platform. This lack of visibility made it
   difficult for engineers to reason about changes they were writing and reviewing.
2. **Large Blast Radius** - All changes, no matter how small or large, affected
   the entire global fleet within 30 minutes. Twitter needed a way to cap the
   potential blast radius to prevent global outages.
3. **Incomplete Tooling** - Twitter had the correct processes in place, but
   their tooling didn't support them. The one-line change was tested and
   peer-reviewed prior to landing, but the tooling didn't surface the
   information they needed when they needed it.

Over the next few years, we built features for Twitter's configuration
management system that solved each of these problems. At the same time, I
started exploring what these solutions would look like in the context of
Kubernetes and cloud-native software instead of a traditional configuration
management context.

As Google Cloud partners, we had the opportunity to work with Google's largest
customers and learn how they built their software development platforms on
Kubernetes. Over the course of the pandemic, we built a software development
platform largely the same way, taking off-the-shelf CNCF projects like ArgoCD,
the Kubernetes Prometheus Stack, Istio, Cert Manager, and External Secrets
Operator and integrating them into a holistic software development platform.
We started with the packaging recommended by each upstream project. Helm was
and still is the most common distribution method, but many projects also
provided plain YAML, Kustomize bases, or sometimes even Jsonnet in the case of
the Prometheus community. We then wrote scripts to mix in the resources we
needed to integrate each piece of software with the platform as a whole. For
example, we often passed Helm's output to Kustomize to add common labels or fix
bugs in the upstream chart, like missing namespace fields. We wrote umbrella
charts to mix in Ingress, HTTPRoute, and ExternalSecret resources alongside the
vendor-provided chart.

We viewed these scripts as necessary glue to assemble and fasten the components
together into a platform, but we were never fully satisfied with them. Umbrella
charts became difficult to maintain once there were multiple environments,
regions, and cloud providers in the platform. Nested for loops in YAML
templates created significant friction and were a challenge to troubleshoot
because they obfuscated what was happening. The scripts, too, made it difficult
to see what was happening and when, and to fix issues in them that affected all
components in the platform.

Despite the makeshift scripts and umbrella charts, the overall approach had a
significant advantage. The scripts always produced fully rendered manifests
stored in plain text files. We committed these files to version control and
used ArgoCD to apply them. The ability to make a one-line change, render the
whole platform, then see clearly what changed platform-wide resulted in less
time spent troubleshooting and fewer errors making their way to production.

For a while, we lived with the scripts and charts. I couldn't stop thinking
about the [Why are we templating
YAML?](https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=https%3A%2F%2Fleebriggs.co.uk%2Fblog%2F2019%2F02%2F07%2Fwhy-are-we-templating-yaml&sort=byDate&type=story)
post on Hacker News, though. I was curious what it would look like to replace
our scripts and umbrella charts with something that helped address the three
main problems Twitter experienced.

After doing quite a bit of digging and talking with folks, I found
[CUE](https://cuelang.org). I took our scripts and charts and rewrote the same
functionality we needed from the glue layer in Go and CUE. The result is a new
tool, `holos`, built to complement Helm, Kustomize, and Jsonnet, but not replace
them. Holos leverages CUE to make it easier and safer for teams to define
golden paths and paved roads without having to write bespoke, makeshift scripts
or text templates.

Thanks for reading this far. Give Holos a try locally with our [Quickstart]
guide.
At [Open Infrastructure Services], we've each helped dozens of companies build and operate their software development platforms. During the U.S. presidential election just before the pandemic, our second-largest client, Twitter, experienced a global outage that lasted nearly a full day. We were managing their production configuration management system, allowing the core infrastructure team to focus on business-critical objectives. This gave us a front-row seat to the incident.

A close friend and engineer on the team made a trivial one-line change to the firewall configuration. Less than 30 minutes later, everything was down. That change, which passed code review, caused the host firewall to revert to its default state on hundreds of thousands of servers, blocking all connections globally—except for SSH, thankfully. Even a Presidential candidate complained loudly.

This incident forced us to reconsider key issues with Twitter's platform:

1. **Lack of Visibility** - Engineers couldn't foresee the impact of even a small change, making it difficult to assess risks.
2. **Large Blast Radius** - Small changes affected the entire global fleet in under 30 minutes. There was no way to limit the impact of a single change.
3. **Incomplete Tooling** - The right processes were in place, but the tooling didn't fully support them. The change was tested and reviewed, but critical information wasn't surfaced in time.

Over the next few years, we built features to address these issues. Meanwhile, I began exploring how these solutions could work in the Kubernetes and cloud-native space.

As Google Cloud partners, we worked with large customers to understand how they built their platforms on Kubernetes. During the pandemic, we built a platform using CNCF projects like ArgoCD, Prometheus Stack, Istio, Cert Manager, and External Secrets Operator, integrating them into a cohesive platform. We started with upstream recommendations—primarily Helm charts—and wrote scripts to integrate each piece into the platform. For example, we passed Helm outputs to Kustomize to add labels or fix bugs, and wrote umbrella charts to add Ingress, HTTPRoute, and ExternalSecret resources.
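
To make that glue layer concrete, here is a minimal sketch of the kind of Helm-then-Kustomize pass described above; the chart name, file paths, and label values are illustrative assumptions, not our actual scripts.

```yaml
# kustomization.yaml — illustrative only. First render the vendor chart, e.g.:
#   helm template cert-manager jetstack/cert-manager > rendered/cert-manager.yaml
# then post-process the rendered output with Kustomize.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cert-manager                  # patch charts that omit metadata.namespace
commonLabels:
  app.kubernetes.io/part-of: platform    # add a common label to every resource
resources:
  - rendered/cert-manager.yaml           # fully rendered Helm output
  - externalsecret.yaml                  # hand-written mix-in (ExternalSecret, HTTPRoute, etc.)
```

Running `kustomize build` over a directory like this emits the fully rendered manifests mentioned below.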

These scripts served as necessary glue to hold everything together but became difficult to manage across multiple environments, regions, and cloud providers. YAML templates and nested loops created friction, making them hard to troubleshoot. The scripts themselves made it difficult to see what was happening and to fix issues affecting the entire platform.

Still, the scripts had a key advantage: they produced fully rendered manifests in plain text, committed to version control, and applied via ArgoCD. This clarity made troubleshooting easier and reduced errors in production.

Despite the makeshift nature of the scripts, I kept thinking about the "[Why are we templating YAML]?" post on Hacker News. I wanted to replace our scripts and charts with something more robust and easier to maintain—something that addressed Twitter's issues head-on.

I rewrote our scripts and charts using CUE and Go, replacing the glue layer. The result is **Holos**—a tool designed to complement Helm, Kustomize, and Jsonnet, making it easier and safer to define golden paths and paved roads without bespoke scripts or templates.

Thanks for reading. Take Holos for a spin on your local machine with our [Quickstart] guide.

[Guides]: /docs/guides/
[API Reference]: /docs/api/
@@ -114,3 +50,4 @@ guide.
[Author API]: /docs/api/author/
[Core API]: /docs/api/core/
[Open Infrastructure Services]: https://openinfrastructure.co/
[Why are we templating YAML]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=https%3A%2F%2Fleebriggs.co.uk%2Fblog%2F2019%2F02%2F07%2Fwhy-are-we-templating-yaml&sort=byDate&type=story
