Over the last 10 years, I have had the chance to be part of significant engineering changes at Poppulo. Our recent transition to a (micro)service-oriented architecture was a conscious decision to let our department scale up, so that we could fulfil our vision of becoming the leader in internal communication.
Aiming for perfection is counterproductive
Building distributed systems comes with a set of harsh realities: more places where things may fail and more ways for them to fail. As complexity increases, it quickly becomes clear that focusing solely on pre-production is not enough to guarantee the quality of what gets deployed. Worse than diminishing returns, the belief that we can build the perfect thing tends to be counterproductive: when something inevitably fails, we won’t be prepared for it and recovery will be painful.
Pre-production testing has some value, but no test data in the world (even with sophisticated data generation) will ever be a substitute for the entropy of real-life usage. The best we can do is embrace the fact that things will fail (maybe in ways we never thought of) and dedicate time to being ready for it.
Giving unknown-unknowns the time they deserve
Production issues can be grouped into two categories: known-unknowns and unknown-unknowns. Known-unknowns are things that we know can go wrong, for reasons we can’t control. Most of the time, we have plans for what to do if this happens. This class of issues is typically covered by monitoring: we know what to look for, what normal looks like and when to alert outside of normal thresholds.
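To make the monitoring side concrete, here is a minimal sketch of threshold-based alerting for known-unknowns. The metric names, the ranges and the `Threshold` and `check` helpers are all hypothetical, chosen only for illustration; a real setup would normally rely on a dedicated monitoring system rather than hand-written checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """The 'normal' range we expect for one metric (hypothetical values)."""
    metric: str
    low: float
    high: float


def check(thresholds, samples):
    """Return one alert message per metric that is missing or out of range."""
    alerts = []
    for t in thresholds:
        value = samples.get(t.metric)
        if value is None:
            # Missing data is itself something we know how to look for.
            alerts.append(f"{t.metric}: no data")
        elif not (t.low <= value <= t.high):
            alerts.append(f"{t.metric}: {value} outside [{t.low}, {t.high}]")
    return alerts


# Illustrative thresholds: we decided in advance what "normal" looks like.
THRESHOLDS = [
    Threshold("error_rate", 0.0, 0.01),
    Threshold("queue_depth", 0, 1000),
]

if __name__ == "__main__":
    for alert in check(THRESHOLDS, {"error_rate": 0.05, "queue_depth": 120}):
        print(alert)
```

Every alert here corresponds to a question we already knew to ask. A failure that shows up in none of the predefined metrics passes this check silently.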
By its nature, monitoring isn’t enough to deal with the second category. Unknown-unknowns are situations that were never on the radar, and we are left on our own to understand them through exploration, investigation and debugging. This is where observability matters.
Observability is a system quality, defined as how well the internal states of a system can be inferred from knowledge of its external outputs. Good observability is achieved through extensive instrumentation, such as metrics, traces, events, and structured, correlated logs.
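As one illustration of structured and correlated logs, the sketch below uses only the Python standard library to emit each log entry as a JSON object carrying a shared correlation ID. The field names, the `/send` route and the `handle_request` function are assumptions made up for this example, not a description of how any particular system is instrumented.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID that ties together every log line of one request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object (a structured log line)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Extra key/value pairs passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    """Hypothetical request handler: reuse the caller's ID or mint a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received", extra={"fields": {"route": "/send"}})
        logger.info("email queued", extra={"fields": {"recipients": 3}})
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(build_logger("demo", buffer), incoming_id="abc123")
    print(buffer.getvalue(), end="")
```

Because every line is machine-readable and shares the same `correlation_id`, entries from different services handling the same request can be filtered and joined after the fact, which is what makes open-ended investigation possible.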
Upcoming talk at QCon London
I am thrilled to talk about this subject at QCon London 2018. Join me to hear more details about why we should all care about observability and how you can combine the different instrumentation techniques to build a clearer picture of distributed systems in production.
Who am I?
I lead the Poppulo Site Reliability Engineering team within a department of more than 50 engineers. I strongly believe that DevOps is a key turning point for our industry, bringing together Continuous Delivery practices and Lean philosophy, all supported by a safe culture of learning and innovation.
You can find me on Twitter at @PierreVincent