website: refine introduction to be more concise
jeffmccune committed Sep 24, 2024
1 parent c58510b commit a95abe6
Showing 1 changed file with 24 additions and 87 deletions.
111 changes: 24 additions & 87 deletions doc/md/introduction.md
@@ -19,93 +19,29 @@ platform.

## Backstory

At [Open Infrastructure Services], each of us has helped dozens of companies
build and operate their software development platforms over the course of our
careers. During the U.S. presidential election just before the pandemic, our
second-largest client, Twitter, had a global outage lasting the better part of
a day. At the time, we were helping the core infrastructure team by managing
their production configuration management system so the team could focus on
delivering key objectives for the business. In this position, we had a
front-row seat to what happened that day.

One of Twitter's employees, a close friend of ours and an engineer on the
team, landed a trivial one-line change to the firewall configuration. Less
than 30 minutes later, literally everything was down. The one-line change,
which passed code review and seemed harmless, caused the host firewall to
revert to its default state on hundreds of thousands of servers. All
connections to all servers globally were blocked and dropped. Except SSH,
thankfully. At least one Presidential candidate complained loudly.

This incident led us to think deeply about a few key problems with Twitter's
software development platform, problems that made their way all the way up to
the board of directors to solve.

1. **Lack of Visibility** - There was no way to clearly see the impact a simple
   one-line change could have on the platform. This lack of visibility made it
   difficult for engineers to reason about changes they were writing and reviewing.
2. **Large Blast Radius** - All changes, no matter how small or large, affected
   the entire global fleet within 30 minutes. Twitter needed a way to cap the
   potential blast radius to prevent global outages.
3. **Incomplete Tooling** - Twitter had the correct processes in place, but
   their tooling didn't support them. The one-line change was tested and
   peer-reviewed prior to landing, but the tooling didn't surface the
   information they needed when they needed it.

Over the next few years, we built features for Twitter's configuration
management system that solved each of these problems. At the same time, I
started exploring what these solutions would look like in the context of
Kubernetes and cloud-native software instead of a traditional configuration
management context.

As Google Cloud partners, we had the opportunity to work with Google's largest
customers and learn how they built their software development platforms on
Kubernetes. Over the course of the pandemic, we built a software development
platform largely the same way, taking off-the-shelf CNCF projects like ArgoCD,
the Kubernetes Prometheus Stack, Istio, Cert Manager, and External Secrets
Operator and integrating them into a holistic software development platform.
We started with the packaging recommended by each upstream project. Helm was
and still is the most common distribution method, but many projects also
provided plain YAML, Kustomize bases, or sometimes even Jsonnet in the case of
the Prometheus community. We then wrote scripts to mix in the resources we
needed to integrate each piece of software with the platform as a whole. For
example, we often passed Helm's output to Kustomize to add common labels or fix
bugs in the upstream chart, like missing namespace fields. We wrote umbrella
charts to mix in Ingress, HTTPRoute, and ExternalSecret resources alongside the
vendor-provided chart.

We viewed these scripts as necessary glue to assemble and fasten the components
together into a platform, but we were never fully satisfied with them. Umbrella
charts became difficult to maintain once there were multiple environments,
regions, and cloud providers in the platform. Nested for loops in YAML
templates created significant friction and were a challenge to troubleshoot
because they obfuscated what was happening. The scripts, too, made it difficult
to see what was happening and when, and to fix issues in them that affected all
components in the platform.

Despite the makeshift scripts and umbrella charts, the overall approach had a
significant advantage. The scripts always produced fully rendered manifests
stored in plain text files. We committed these files to version control and
used ArgoCD to apply them. The ability to make a one-line change, render the
whole platform, then see clearly what changed platform-wide resulted in less
time spent troubleshooting and fewer errors making their way to production.

For a while, we lived with the scripts and charts. I couldn't stop thinking
about the [Why are we templating
YAML?](https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=https%3A%2F%2Fleebriggs.co.uk%2Fblog%2F2019%2F02%2F07%2Fwhy-are-we-templating-yaml&sort=byDate&type=story)
post on Hacker News, though. I was curious what it would look like to replace
our scripts and umbrella charts with something that helped address the three
main problems Twitter experienced.

After doing quite a bit of digging and talking with folks, I found
[CUE](https://cuelang.org). I took our scripts and charts and rewrote the same
functionality we needed from the glue layer in Go and CUE. The result is a new
tool, `holos`, built to complement Helm, Kustomize, and Jsonnet, but not replace
them. Holos leverages CUE to make it easier and safer for teams to define
golden paths and paved roads without having to write bespoke, makeshift scripts
or text templates.

Thanks for reading this far. Give Holos a try locally with our [Quickstart]
guide.
At [Open Infrastructure Services], we've each helped dozens of companies build and operate their software development platforms. During the U.S. presidential election just before the pandemic, our second-largest client, Twitter, experienced a global outage that lasted nearly a full day. We were managing their production configuration management system, allowing the core infrastructure team to focus on business-critical objectives. This gave us a front-row seat to the incident.

A close friend and engineer on the team made a trivial one-line change to the firewall configuration. Less than 30 minutes later, everything was down. That change, which passed code review, caused the host firewall to revert to its default state on hundreds of thousands of servers, blocking all connections globally—except for SSH, thankfully. Even a Presidential candidate complained loudly.

This incident forced us to reconsider key issues with Twitter's platform:

1. **Lack of Visibility** - Engineers couldn't foresee the impact of even a small change, making it difficult to assess risks.
2. **Large Blast Radius** - Small changes affected the entire global fleet in under 30 minutes. There was no way to limit the impact of a single change.
3. **Incomplete Tooling** - The right processes were in place, but the tooling didn't fully support them. The change was tested and reviewed, but critical information wasn't surfaced in time.

Over the next few years, we built features to address these issues. Meanwhile, I began exploring how these solutions could work in the Kubernetes and cloud-native space.

As Google Cloud partners, we worked with large customers to understand how they built their platforms on Kubernetes. During the pandemic, we built a platform using CNCF projects like ArgoCD, Prometheus Stack, Istio, Cert Manager, and External Secrets Operator, integrating them into a cohesive platform. We started with upstream recommendations—primarily Helm charts—and wrote scripts to integrate each piece into the platform. For example, we passed Helm outputs to Kustomize to add labels or fix bugs, and wrote umbrella charts to add Ingress, HTTPRoute, and ExternalSecret resources.
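
To make that glue layer concrete, here is a minimal sketch of the kind of Helm-then-Kustomize pass described above; the chart name, file paths, and label values are illustrative assumptions, not our actual scripts.

```yaml
# kustomization.yaml — illustrative only. First render the vendor chart, e.g.:
#   helm template cert-manager jetstack/cert-manager > rendered/cert-manager.yaml
# then post-process the rendered output with Kustomize.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cert-manager                  # patch charts that omit metadata.namespace
commonLabels:
  app.kubernetes.io/part-of: platform    # add a common label to every resource
resources:
  - rendered/cert-manager.yaml           # fully rendered Helm output
  - externalsecret.yaml                  # hand-written mix-in (ExternalSecret, HTTPRoute, etc.)
```

Running `kustomize build` over a directory like this emits the fully rendered manifests mentioned below.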

These scripts served as necessary glue to hold everything together but became difficult to manage across multiple environments, regions, and cloud providers. YAML templates and nested loops created friction, making them hard to troubleshoot. The scripts themselves made it difficult to see what was happening and to fix issues affecting the entire platform.

Still, the scripts had a key advantage: they produced fully rendered manifests in plain text, committed to version control, and applied via ArgoCD. This clarity made troubleshooting easier and reduced errors in production.

Despite the makeshift nature of the scripts, I kept thinking about the "[Why are we templating YAML]?" post on Hacker News. I wanted to replace our scripts and charts with something more robust and easier to maintain—something that addressed Twitter's issues head-on.

I rewrote our scripts and charts using CUE and Go, replacing the glue layer. The result is **Holos**—a tool designed to complement Helm, Kustomize, and Jsonnet, making it easier and safer to define golden paths and paved roads without bespoke scripts or templates.

Thanks for reading. Take Holos for a spin on your local machine with our [Quickstart] guide.

[Guides]: /docs/guides/
[API Reference]: /docs/api/
@@ -114,3 +50,4 @@ guide.
[Author API]: /docs/api/author/
[Core API]: /docs/api/core/
[Open Infrastructure Services]: https://openinfrastructure.co/
[Why are we templating YAML]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=https%3A%2F%2Fleebriggs.co.uk%2Fblog%2F2019%2F02%2F07%2Fwhy-are-we-templating-yaml&sort=byDate&type=story
