Delta’s recent system-wide IT meltdown took out pretty much every mission-critical, customer-facing system they had: reservations, check-in, flight tracking, even their website and mobile app. All it took to spark the massive failure was a malfunction in a power control module at their technology command center, which caused a surge to the on-site transformer and subsequently a loss of power.
Though power was quickly restored, core systems and network gear either failed to switch over to backups or were unstable once they cut over. Delta ended up canceling close to 2,000 flights and stranding passengers nationwide as it scrambled to restore normal flight operations and crew rotations over two-plus days.
How much will all those missed bookings, ticket refunds, flight vouchers, hotel rooms, etc. cost the world’s largest airline? For comparison, Southwest Airlines pegged the cost of a similar outage it suffered last month, one that lasted just an hour yet forced the cancellation of 2,300 flights, at somewhere between $54 million and $82 million.
Those sobering figures don’t even include reputational impacts, such as consumers choosing other airlines. (I have to fly to Atlanta soon, and I’m thinking twice about relying on Delta’s IT systems right now.) Prior to this event, Delta had the best on-time record and probably the strongest reputation among major US airlines. Ironically, that solid reputation could make the damage from a blunder like this even worse.
What happened to Delta wasn’t solely the failure of a power control module, but also a failure of disaster recovery planning and testing. Delta’s most critical IT systems were taken out by a single point of failure.
Even if Delta did plan for a power outage, they clearly didn’t adequately test their plan. They may have thought—or hoped—that they had a viable recovery capability, but clearly they didn’t. Given today’s complex and interdependent IT environments, relying on a single location, a single data center, or a single component is a disaster waiting to happen. The only way to establish confidence that your systems will fail over as required is to test them and find out what you don’t know.
That’s not easy for organizations like major airlines that have 24×7 uptime requirements for critical, customer-facing systems. But obviously, the consequences of not doing it are far worse.
Delta’s CEO acknowledged in a public message that despite the company investing “hundreds of millions of dollars in technology infrastructure, including backup systems,” it’s “not clear the priorities in our investment have been in the right place.” In other words, Delta not only failed to test its DR capability, it also failed to perform an adequate technical infrastructure review to identify single points of failure and assess their risk and business impact in the first place.
One can only hope that organizations large and small will take a lesson from this wake-up call, because there is a high likelihood that many companies have likewise failed to perform adequate reviews, risk assessments, and DR testing.
You’re never too big or too small to look humbly at your operations. You need to find out where your vulnerabilities lie and take effective steps to mitigate them. That means playing “what-if” and taking the results seriously.
You don’t have the luxury of saying that “it won’t happen here”—it’s going to happen. It’s just a question of when and how bad.
To help you see “the big picture” of risk and vulnerabilities in your organization, and how best to apply resources to ensure business continuity, contact Pivot Point Security.