Testing in production is not a dirty word with Paige Cruz of Chronosphere
Here’s a sneak peek:
After rolling back, I was able to uncover the SNAFU that occurred by inspecting each step of the journey from the original pull request to the production rollout. Surprise, surprise, there was a major difference between what was running in lower environments and what was running in production.
For reasons I’ll explain in the talk at Shift Miami 2023, the major difference between environments was not immediately apparent to me or my pull request reviewers in the file where I introduced the change, didn’t raise any flags in the continuous integration testing phase, and had passed through the lower environments with flying colors.
The reason Paige likes to tell that story is to remind engineers that while, yes, testing is important – there is no place like production!
No matter how close an approximation your local development environment is to production, it’s just that – an approximation.
Everybody tests in production
Testing in production does not have to be ‘a dirty word’, especially if you keep in mind that everybody tests in production. Some just won’t admit it.
And, of course, there is a wrong way to do it (one that involves a ton of stress and potentially massive consequences for the business), but there are also right ways:
If you deployed a code change that passed all checks in local development and staging, then hit some case in production that was not and could not be accounted for, and had to roll back or toggle a feature flag, then… Surprise! Production “tested” your code.
People tend to get hung up on the word “test”, so to sidestep that I rephrase it as “there’s no place like production”. What it comes down to is that no matter how close an approximation your local development environment is to production, it’s just that – an approximation.
What’s crow got to do with it?
Operators and SREs, Paige says, likely have a good sense of the differences between environments, specifically the configuration and cloud components.
But if you’re a developer and that is mostly hidden from you, trying to suss out the difference between production and staging can be like playing #CrowOrNo if you’re not a birder.
To the untrained eye, many black birds look like a crow or a raven.
There is not just one way to test in production, and there is not just a good or a bad way to do it. Paige thinks of it as a spectrum:
The ideal end of the spectrum is having a great experience for instrumentation and local development, guardrails for experimentation (feature flags, canary or blue/green deploys; a minimal flag sketch follows below), actionable alerts tied to adverse customer or business impact via SLOs, and a culture of continuous, collaborative learning.
The middle ground, where many organizations are today, is a mix of fractured instrumentation spread across multiple tools; nonexistent training or enablement for using the data and tooling that is available; “set it and forget it” instrumentation with a mishmash of unstructured logs; metrics that aren’t referenced by any alert and haven’t been queried in eons; and a sea of out-of-the-box default dashboards to parse through.
Moving from the middle ground toward the ideal case, the level of investment and effort required will depend on the nuances of your sociotechnical system.
There is such a thing as over-instrumentation!
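To make one of those guardrails concrete, here is a minimal sketch of gating a change behind a feature flag so it can be switched off in production without a rollback. Everything in it (the FlagClient class, the pipeline functions, the flag name) is a hypothetical stand-in for illustration, not anything from Paige’s talk or Chronosphere’s product.

```python
# Minimal sketch: gate a risky code path behind a feature flag so production can
# "test" it safely and one toggle restores the known-good path. All names here
# are hypothetical; a real flag client would poll or stream flag state from a service.

class FlagClient:
    """Toy in-memory flag store standing in for a real feature flag service."""
    def __init__(self, flags: dict[str, bool]) -> None:
        self._flags = flags

    def is_enabled(self, name: str, default: bool = False) -> bool:
        return self._flags.get(name, default)


def new_pipeline(order_id: str) -> str:
    return f"charged {order_id} via new pipeline"     # the change being tried in production


def legacy_pipeline(order_id: str) -> str:
    return f"charged {order_id} via legacy pipeline"  # known-good path, one toggle away


flags = FlagClient({"new-billing-path": False})  # dark-launched: off by default


def charge(order_id: str) -> str:
    if flags.is_enabled("new-billing-path"):
        return new_pipeline(order_id)
    return legacy_pipeline(order_id)


if __name__ == "__main__":
    print(charge("order-123"))
```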
Have no shame in your learning game
The other reason Paige shares the story of what she calls ‘The Big One’ is that the incident triggered a chain of learning for her, and its impact was severe enough to get the executives’ attention:
In the aftermath, all eyes were on me waiting for answers. If I played my cards right, I got a chance to secure buy-in for reliability investments.
The outage afforded me more time to investigate and reconstruct the chain of events. Without distributed tracing, it was laborious and time-consuming. With the powers of hindsight… what could have happened in a highly observable system? Had I been able to compare traces of requests before and after my change was rolled out, it would’ve been straightforward to know what changed.
Most critically, I did not blame myself and had no shame in my learning game. In addition to a magnificent incident review, I assembled a tl;dr following “The Life of a Request” and presented it to multiple developer groups, as well as to my own team and department.
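Paige’s wish for comparable traces maps to what open instrumentation libraries offer today. Below is a minimal sketch assuming OpenTelemetry’s Python SDK, not what her team actually ran: every span carries the deployed service version, so traces of the same request path can be filtered and compared before and after a rollout. The service name, version string, and span names are made up for illustration.

```python
# Minimal sketch using the OpenTelemetry Python SDK (pip install opentelemetry-sdk).
# Tagging every span with the deployed version makes "before vs. after my change"
# a simple filter in whatever backend receives the spans.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

resource = Resource.create({
    "service.name": "checkout",           # hypothetical service name
    "service.version": "2023.05.02-rc1",  # the attribute you'd compare across deploys
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")


def handle_request(order_id: str) -> None:
    # One span per step of "the life of a request"; child spans nest automatically.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("lookup_inventory"):
            pass  # real work elided
        with tracer.start_as_current_span("charge_payment"):
            pass  # real work elided


if __name__ == "__main__":
    handle_request("order-123")
```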
From a horror story to a new movement
Not so long ago, “testing in production” was a phrase that would strike fear in developers’ and engineers’ hearts. But the proliferation of modern observability tools providing real-time insight into a system’s behavior has enabled something of a new movement advocating for it.
Chronosphere is one of those tools – a cloud-native observability platform. Chronosphere’s take is that observability enables DevOps teams to know, triage, and understand what’s happening in their infrastructure and apps so they can take action before problems turn into incidents that impact the customer experience.
Paige adds that observability tools should opt for open-source instrumentation (“We want you to choose us because of the value we bring, not because we’ve made leaving a Sisyphean task!”), standardized data, and a friendly UI:
Troubleshooting happens across teams and silos and layers of the stack – when engineers have a familiar set of attributes available across the board it makes passing information along easier and better.
Observability tools should also be built with a cohesive, friendly UI, so that engineering teams can focus on training and on democratizing access to observability data beyond engineering.
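The “familiar set of attributes” idea is easy to picture with a small, hypothetical example: if logs (and spans, and metrics) share the same attribute keys, a finding can be handed across teams without translation. The key names below loosely follow OpenTelemetry semantic conventions and the service name is made up; nothing here is Chronosphere-specific.

```python
# Hypothetical sketch: one shared attribute vocabulary on a structured log line,
# using the same keys a tracing or metrics pipeline would carry.
import json
import logging

# Keys loosely modeled on OpenTelemetry semantic conventions (an assumption, not a mandate).
COMMON_ATTRS = {
    "service.name": "checkout",
    "deployment.environment": "production",
    "http.request.method": "POST",
    "http.route": "/v1/charge",
}

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")


def record_request(status_code: int) -> None:
    attrs = {**COMMON_ATTRS, "http.response.status_code": status_code}
    # Any team reading this line can pivot to traces or metrics on the same keys.
    log.info(json.dumps({"event": "request.completed", **attrs}))


if __name__ == "__main__":
    record_request(200)
```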
Meet Paige and find out more on this topic at Shift Miami on May 23rd.