The philosophy of “fail fast” has long been part of the Agile lexicon. Both internally and externally, we champion this mantra of being willing to accept failure, and even to encourage it. The idea is that by quickly testing ideas and prototypes and accepting that some will fail, teams can iterate and improve upon their ideas more quickly. On its face, this seems like a great message. It aligns with Agile and DevOps values: we’re encouraging feedback loops and small batches of work.
But the truth is that “fail fast” is harmful messaging to everyone involved. Externally, it attaches the word “failure” to the transformation we’re trying to effect. It gives the impression that failure is an intentional part of the process, or even a goal of it. Internally, it can incentivize teams to prioritize speed at the cost of long-term stability, security, scalability, and maintainability. This is especially problematic in regulated environments like the Federal Government.
We can change this perception and how we think about digital transformation internally and externally.
Every change we make carries risk: the risk that it’s not what users want, that it will break something when we release it, or that it will impact upstream or downstream systems. There’s also risk in not making changes: unpatched production systems, software supply chain vulnerabilities, or simply changes in how users do their jobs. Business requires change. In this way, software is like driving a car: every time you choose to steer or not steer, you risk crashing. What if you could only steer the car once every 5 minutes?
We address the first part of this problem by working in small batches*. We make small changes frequently rather than large changes infrequently. Of course, more is needed: pushing to production frequently doesn’t mean we’re prepared to recover if something goes wrong or if we need to pivot. There are practices we can put in place to make recovery possible, or even easy; here are just a few.
Deployment Automation
If changes to production systems require manual work, they are slow and error-prone. Deployments (and rollbacks) should be managed through source control and 100% automated. This applies to more than source code; it also means automating changes to databases with database migrations, declaring and automating changes to infrastructure and platform with Infrastructure as Code, and potentially more, depending on your system.
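To make the idea of automated, reversible database changes concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. It is an illustration, not a recommendation of a particular tool: the MIGRATIONS list, the schema_version table, and the migrate function are all hypothetical names invented for this example, and a real team would more likely use an established migration tool.

```python
import sqlite3

# Hypothetical migrations for illustration: (version, "up" SQL, "down" SQL).
# In practice these would live in source control alongside the application.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)", "DROP TABLE users"),
    (2, "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)", "DROP TABLE orders"),
]


def current_version(conn):
    """Return the highest applied migration version, or 0 if none are applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, target):
    """Move the schema forward or backward to the target version."""
    version = current_version(conn)
    if target > version:
        # Roll forward: apply each pending "up" step in order.
        for v, up, _down in MIGRATIONS:
            if version < v <= target:
                conn.execute(up)
                conn.execute("INSERT INTO schema_version (version) VALUES (?)", (v,))
    else:
        # Roll back: apply "down" steps in reverse order.
        for v, _up, down in reversed(MIGRATIONS):
            if target < v <= version:
                conn.execute(down)
                conn.execute("DELETE FROM schema_version WHERE version = ?", (v,))
    conn.commit()
    return current_version(conn)
```

Because every step has a matching "down" statement, a rollback is the same scripted operation as a deployment, just with a lower target version. That is the property that lets a pipeline undo a change without manual work.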
Canary Releases
There are multiple ways to release changes to a subset of users, like establishing UAT or Beta environments and limiting access. With canary releases, we can control access to new features and perform A/B testing to get feedback in production. If we want to pivot for any reason (if it breaks or if users hate it, for example), we can simply revert the change, and the majority of users will never have seen it.
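One common way to pick the subset of users for a canary is deterministic bucketing: hash each user into one of 100 buckets and enable the feature for buckets below a rollout percentage. The sketch below assumes this approach; the in_canary function and its parameters are hypothetical names for illustration, not the API of any particular feature-flag product.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage for a feature.

    The bucket is derived from a hash of the feature name and user ID, so the
    same user always gets the same answer for the same feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # a stable bucket from 0 to 99
    return bucket < percent


# Example: expose a hypothetical "new-checkout" feature to 10% of users.
if in_canary("user-1234", "new-checkout", percent=10):
    pass  # serve the new experience
else:
    pass  # serve the existing experience
```

Raising the percentage only ever adds users, since a user's bucket never changes, and setting it to 0 is the revert: no deployment is needed to pull the feature back.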
Chaos Engineering
For more mature organizations, chaos engineering can help incentivize teams to think about resiliency in production as part of their development process. We can identify the impact of system-level failures and architect product and release strategies to account for them.
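At its smallest scale, the idea can be illustrated with fault injection around a single dependency call. The sketch below is a toy under stated assumptions, not a description of any chaos engineering tool: with_chaos, InjectedFailure, and fetch_with_fallback are hypothetical names, and real chaos experiments usually target infrastructure (hosts, networks, whole services) rather than one function.

```python
import random


class InjectedFailure(Exception):
    """Raised by the chaos wrapper to simulate a failing dependency."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that it fails with the given probability (0.0 to 1.0)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_with_fallback(fetch, key, default):
    """Call a dependency, degrading to a default value if it fails."""
    try:
        return fetch(key)
    except InjectedFailure:
        return default


def lookup_price(sku):
    # Stand-in for a call to a downstream pricing service.
    return {"apple": 3, "pear": 4}[sku]


# Inject failures into 30% of calls and confirm the caller still returns a value.
flaky_lookup = with_chaos(lookup_price, failure_rate=0.3, rng=random.Random(42))
prices = [fetch_with_fallback(flaky_lookup, "apple", default=None) for _ in range(10)]
```

The value of the exercise is in the second function: injecting the failure forces the team to decide, ahead of time, what the caller should do when the dependency is unavailable.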
There are plenty of other ways we can build resiliency and fault tolerance to reduce the cost of failure. This is just the first part of shifting the mindset away from “fail fast.” Next, we’ll discuss minimizing the risk of failure.