Sunday, April 20, 2014

IT Catastrophes: Triage and Compression with Fries and Coke

Triage (noun)
(b) the sorting of patients (as in an emergency room) according to the urgency of their need for care
Compression (noun)
The state of being compressed (re: reduced in size or volume, as by pressure)
(source: Merriam-Webster online dictionary)
This is another one of my silly-brained IT monologues about subjects which are rarely discussed.

What I'm talking about is a loose comparison and contrast of these two words as they relate to the medical and technology fields.  It is, however, a very real subject (or subjects) for those of us who occasionally deal with critical outages, especially those which involve things like:

  • Highly-Available Hosting Services (think: Google, Microsoft, Facebook, etc.)
  • Mission Critical Systems (think: Defense, Lifesaving, etc.)
  • Service Level Agreements (the dreaded SLAs)
It's kind of funny how most businesses feel their operations are "mission critical" or "highly-available", when they're being subjective.  From an objective view however, it's not always as "critical" when things go "down" for a few minutes; even a few hours.

By the way: Compression, as it pertains to this article, relates to the compression of the time in which you have to operate: the time between a failure and sufficient restoration of services.

When dealing with a system outage in a truly "critical" environment, the first steps are pretty much the same as what an Emergency Medical Technician (EMT) would have to consider:
  1. What exactly is not working?
  2. How serious is the impact?
  3. What is known about what led to this outage?
  4. How long has it been down?
  5. How much time is left?
You were probably thinking of the Who, What, Where, When, Why and How sequence.  I kind of tripped you up with two What's and three How's.  (Technically, #4 could be a "when", and #2 could be a "who" or "where", but whatever).  Let's move along.

With regards to a human, the general rule of thumb is 4-6 minutes, total.  That's about how long the brain can go without oxygen and still recover.  Compression CPR is usually the first course of action to sustain blood flow; keeping the remaining oxygen-rich blood reserves moving through the brain.  Enough pseudo-medical blabbering.  The main point is that there is a "first course of action" to resort to in most cases.

What aspects are shared between a medical outage and an IT system outage?
  • There are measurable limits to assessing what can be saved and how
  • There are identifiable considerations with regards to impact on various courses of action
  • Techniques can be developed and stored for more efficient use when needed
  • Steps can be taken to identify probable risks and apply risk mitigation
With regards to a system-wide outage, the general rule of thumb is not so clear-cut as the 4-6 minute rule.  It truly varies by what the system does and who (or what) it supports.  Consider the two following scenarios:

Scenario 1

The interplanetary Asteroid tracking system you maintain is monitoring a projectile traveling at an extremely high velocity towards planet Earth.  The system "goes down" during a window of time in which it would be able to assess a specific degree of variation of its trajectory.  The possible margin of error from the last known projected path could have it hit the Earth, or miss it by a few hundred miles.  The sooner the system is back online, the sooner a more precise forecast can be derived.

Every hour the system is offline is an hour in which the margin of error could have been recalculated (and reduced) by a considerable amount, possibly ruling out a direct hit.  The best estimate of a direct impact places the date and time somewhere around one year from right now.  Your advisers state that it would require at least six months to prepare and launch an interceptor vehicle in time to deflect or divert the projectile away from a direct Earth impact.

Scenario 2

Your order-tracking system is down and customers are unable to place orders for new sucky shoes.  Your financial manager estimates that during this particular period of the year, using past projections combined with figures collected up until the outage, you are losing $500,000 of potential sales revenue for every hour the system is offline.  The system has reportedly been offline for two hours.  So far, that's a million bucks.
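For the number-minded, the math in Scenario 2 is just hours offline times the hourly loss.  Here's a minimal sketch of it; the function name `revenue_lost` is made up for illustration, and the only figures used are the ones from the scenario above.

```python
def revenue_lost(hours_offline: float, loss_per_hour: float) -> float:
    """Estimated sales revenue lost so far: hours offline times hourly loss."""
    return hours_offline * loss_per_hour


# Figures from Scenario 2: $500,000 per hour, two hours offline.
lost = revenue_lost(hours_offline=2, loss_per_hour=500_000)
print(f"${lost:,.0f}")  # $1,000,000
```

The real value of writing it down is that the number keeps growing while you're reading this.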

Which of these scenarios is more critical? 

Answer: It depends

What are the takeaways from each scenario?
  • How long do you have to restore operations before things get really bad?
  • Having the time window defined, what options can you consider to diagnose and restore services?
  • How prepared are you with regards to the outage at hand?
  • What resources are at your disposal, for how long, and how soon?
In the first scenario, you have roughly six months to get things going.  Odds are generally good that you can restore services sooner than that, but what if the outage was caused by an earthquake that decimated your entire main data center?  Ouch.

In the second scenario, the margin would depend on the objective scale of revenue your business could withstand losing.  If you're Google, a million-dollar outage might be bad, but not catastrophic.  If you're a much smaller business, it could wipe you out entirely.
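To make that margin concrete, you could turn "how much can we withstand losing" into "how many hours do we have left".  The sketch below is only an illustration: the function name `hours_until_limit` is hypothetical, and the $5 million limit is an invented figure, not something from the scenario.  The hourly loss and the two hours already down are from Scenario 2.

```python
def hours_until_limit(max_absorbable_loss: float,
                      loss_per_hour: float,
                      hours_already_down: float) -> float:
    """Hours left before cumulative losses reach the limit (never negative)."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    total_hours = max_absorbable_loss / loss_per_hour
    return max(0.0, total_hours - hours_already_down)


# ASSUMPTION: the business can absorb $5 million in lost sales (made-up number).
remaining = hours_until_limit(max_absorbable_loss=5_000_000,
                              loss_per_hour=500_000,
                              hours_already_down=2)
print(remaining)  # 8.0
```

Swap in a smaller limit and the answer drops to zero fast, which is the whole point: the same outage is a bad day for one business and the end of another.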

What's really most important (besides the questions about what systems are down, why, when and how) is knowing what the "limits" are.  Remember the 4-6 minute rule?  SLAs are obviously important, but an SLA is like a life insurance policy; not like a record of discussion between the EMT in the ambulance and the attending physician back at the hospital ER.  One is prescriptive and didactic.  The other is matter-of-fact, holy shit, no time to f*** around.

QUESTION:  When was the last time you or your organization sat down and clearly defined what losses it can absorb and where the line exists whereby you would have to consider filing for bankruptcy?

Is your IT infrastructure REALLY critical to the business, or just really important?  In other words: could your business continue to operate at ANY level without the system in operation?

Forget all the confidence you have in your DR capabilities for just a minute.  Imagine if ALL of your incredibly awesome risk avoidance preparation were to fail.  How long could you last as a business?  At what point would you lose your job?  At what point would your department, division or unit fail?  At what point would the organization fail?  Or do you think it's fail-proof?
