Tuesday, February 12, 2008

Why the crack(berry) stopped working... and counting down to the next outage.

The WSJ is reporting that yesterday's BlackBerry outage came during an "expansion of their network operations center" due to service growth. That sounds a lot like RIM spinning the outage so Wall Street hears: "hey, we are growing like crazy so please ignore the fact that we're a communications company and cut off most of our US customers for the better part of a workday".

This was RIM's second major outage in less than a year. Surely the media will place a big portion of the blame (as they previously have) on RIM's highly centralized hub-and-spoke style architecture. But I'm hoping savvy tech readers will realize that the redundancy issue is a cop-out. Blaming the lack of a safety net doesn't change the fact that these failures are caused by poor change control and ineffective processes.

This isn't exactly an uncommon problem. Many popular services are able to survive deployment screw-ups with various forms of a brownout, enabling them to play fast and loose with "uptime" statistics. But why is that even tolerated?

Until the discipline and science of operations is discussed as openly and freely within IT communities as it is within mature industries like manufacturing or energy, I really don't expect much to change. Money and goodwill will keep hitting the floor, and CEOs will just scratch their heads and shrug at the convoluted excuses offered up by their IT organizations.
