It all boils down to three components: empathy, awareness and recovery.

Not very long ago, I wanted to print a confirmation of a transaction from my bank account. Not a very complicated thing, I thought. In fact, I had downloaded one a week before. Couldn't be any simpler. This time, however, the download button wouldn't respond.

How often does a similar thing happen to you?

Whether it's online banking, grocery shopping or entertainment, it's inevitable: things break.

Have Empathy

The natural first reaction is frustration. Then comes understanding.

Someone might have been in a rush. Or perhaps under pressure to deliver on time, or fixing a more important bug at the cost of breaking a small feature.

We all make mistakes; some have a bigger impact, some a smaller one. When was the last time you committed a faulty change on purpose?

I guess your answer is very likely to be never.

Seeking blame is not the answer. Seek understanding.

Awareness

When we are full of understanding and empathy, it's time to learn.

Could we make the system more resilient to similar issues? Can we at least make them less likely to happen? Or at least lower the impact?

Not if we don't know when they happen.

Be it passively or actively, monitoring the health of the systems we build is key.

Passive monitoring means someone has done the work for us and takes note of everything that happens in our system. We pay the price later, blocking out time to analyse the collected data, looking for patterns and trying to make sense of them.

Does this error look familiar? Is this event important?

Very useful, but quite mundane, and we lack the context.
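To make the "looking for patterns" part concrete, here is a minimal sketch of what that later analysis could look like. The log lines and their format (date, time, level, message) are made up for illustration, and error_patterns is a hypothetical helper, not part of any particular monitoring tool.

```python
import re
from collections import Counter

# Hypothetical log lines; a real system would read these from its log store.
LOG_LINES = [
    "2024-01-05 10:00:01 ERROR download failed for statement 1841: timeout",
    "2024-01-05 10:00:07 INFO login ok for user 17",
    "2024-01-05 10:01:13 ERROR download failed for statement 977: timeout",
    "2024-01-05 10:02:45 ERROR payment 5521 rejected: insufficient funds",
]


def error_patterns(lines):
    """Count ERROR messages after replacing numbers with a placeholder."""
    counts = Counter()
    for line in lines:
        # Assumed layout: date, time, level, then the free-text message.
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[2] != "ERROR":
            continue
        # Collapse identifiers so that similar errors group together.
        pattern = re.sub(r"\d+", "<n>", parts[3])
        counts[pattern] += 1
    return counts.most_common()


for pattern, count in error_patterns(LOG_LINES):
    print(count, pattern)
```

Grouping like this answers "does this error look familiar?", but it still can't tell us whether the error matters to the business.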

On the other side we have active monitoring. We pay the price at the beginning: defining the set of capabilities we promise to deliver, deciding which of them are crucial to the business, and how we are going to measure whether they are healthy. Then we measure.

This capability seems to be under-performing recently, do we know why?

Also useful, but it takes intention and some effort to work.
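As a rough sketch of the active approach, a capability can be written down together with a tolerance threshold and then checked against measurements. The Capability class, the is_healthy function, the 99% threshold and the numbers below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    name: str
    critical: bool            # is it crucial to the business?
    min_success_rate: float   # tolerance threshold, between 0 and 1


def is_healthy(capability, successes, attempts):
    """Return True when the measured success rate meets the threshold."""
    if attempts == 0:
        # No traffic means nothing to judge; this sketch treats it as healthy.
        return True
    return successes / attempts >= capability.min_success_rate


# Hypothetical capability and measurements.
download = Capability(
    name="download transaction confirmation",
    critical=True,
    min_success_rate=0.99,
)
print(is_healthy(download, successes=940, attempts=1000))  # False
```

The upfront effort is in choosing the capability and the threshold. Once they exist, "is it under-performing?" becomes a yes-or-no question.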

Implement both. One gives us clues, while the other helps us ask the right questions.

Recovery

Visibility into failures is one thing. Time to recovery is another.

Take a look at the table below and try to answer the following question.

Does every failure require human intervention?

Great                                                         | Even Better
--------------------------------------------------------------|------------------------------------------------------------
Check the health dashboard daily                              | Receive a notification when something goes wrong
Investigate an issue during the incident                      | Follow a playbook from the last time a similar thing happened
Follow the playbook                                           | Have the system heal itself
Learn about system performance during an outage, under stress | Exercise and stress the system, in a controlled environment

While things on the left are very important, things on the right can make the experience even more delightful.
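One way to picture "have the system heal itself" is a small wrapper that tries a recovery step before it bothers a human. This is only a sketch: run_with_recovery, the recover and notify callbacks, and the flaky download are hypothetical names, and a real recovery step would be something like reconnecting or restarting a worker.

```python
import time


def run_with_recovery(action, recover, notify, attempts=3, delay_seconds=0.0):
    """Run action; on failure, recover and retry; notify only as a last resort."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                recover()                  # self-healing step
                time.sleep(delay_seconds)  # give the system time to settle
    # Every attempt failed, so now a human needs to know.
    notify(f"giving up after {attempts} attempts: {last_error}")
    raise last_error


# Hypothetical capability that fails twice and then works.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backend unavailable")
    return "confirmation.pdf"


messages = []
result = run_with_recovery(flaky_download, recover=lambda: None, notify=messages.append)
print(result, messages)  # confirmation.pdf []
```

Here the third attempt succeeds, so nobody gets notified. A notification goes out only when the system could not recover by itself.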

In summary

It's a process. Things will go wrong. Start with empathy and understanding. Help your team develop the skills to monitor the system and to stress it in a controlled environment, so you discover both the foreseen and the unforeseen consequences. Finally, make it more resilient by helping the system recover by itself.

There are some tools, techniques and good practices which might come in handy, for example:

  • Defined Capabilities
  • Tolerance Thresholds
  • Monitoring / Health Checks
  • Alerts / Notifications
  • Playbooks
  • Self-Healing
  • Game Days
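For the last item on the list, a game day can start as small as deliberately making a dependency fail in a test environment and watching how the rest of the system reacts. The sketch below assumes a hypothetical with_injected_failures wrapper; it is not taken from any specific chaos-testing tool.

```python
import random


def with_injected_failures(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of the calls raise an error."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise RuntimeError("injected failure (game day)")
        return func(*args, **kwargs)

    return wrapper


def fetch_confirmation():
    # Stand-in for a real dependency call.
    return "confirmation.pdf"


# A rate of 0.0 never fails and a rate of 1.0 always fails.
stable = with_injected_failures(fetch_confirmation, failure_rate=0.0)
broken = with_injected_failures(fetch_confirmation, failure_rate=1.0)

print(stable())  # confirmation.pdf
try:
    broken()
except RuntimeError as error:
    print(error)  # injected failure (game day)
```

Running the monitoring and recovery pieces against a wrapper like this is one way to learn how they behave before a real outage does the teaching.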

What are your thoughts on reliability — should we pay more attention to it?

Please let me know your thoughts!