Virginia's IT outage doesn't pass management sniff test

It has been a rough few days for anyone interacting with the state of Virginia following an IT outage that affected 26 state agencies. Can a storage area networking failure really cripple a state’s IT systems?

The failure of Virginia's IT infrastructure, which is managed by Northrop Grumman, has prompted a few statements from agencies. Notably, Virginia's Department of Motor Vehicles hasn't been able to process requests for licenses and ID cards. These systems are supposed to be up and running on Tuesday, six days after the outages started to appear.

Meanwhile, the Virginia Information Technologies Agency (VITA) said in a statement that teams have been working throughout the weekend to restore data. In a nutshell, the IT infrastructure of the state of Virginia was reportedly crushed by an EMC storage area network failure. The Richmond Times-Dispatch reports that several systems are still down. The same paper said that Northrop Grumman will have to pay a fine for the failure. And the real kicker is that Virginia recently revised its contract with Northrop Grumman and extended the deal for three years. The state paid an additional $236 million for better service from Northrop Grumman.

Needless to say, Virginia residents aren't pleased. We've received a few emails and calls, and the comments on the Richmond Times-Dispatch site are summed up by this one:

Highlights of the Revised Contract

Operational Efficiencies

Consolidates and strengthens Performance Level Standards with a 15% increase in penalties across the board if Northrop Grumman fails to perform on clearly identified and measured performance standards. - PAY-UP

Improves Incident Response teams to determine technology failures and expedite repair - FAILED

Institutes clear performance measurements for Northrop Grumman that agencies can easily track - FAILED

Adds new services to contract such as improved disaster recovery and enhanced security features - FAILED

Among the key parts of the VITA statement:

  • Successful repair to the storage system hardware is complete, and all but three or possibly four agencies out of the 26 agency systems have been restored. Agencies continue to perform verification testing.
  • Progress continues, but work is not yet complete for the three or four agencies that have some of the largest and most complex databases. These databases make the restoration process extremely time consuming. The unfortunate result is the agencies will not be able to process some customer transactions until additional testing and validation are complete.
  • According to the manufacturer of the storage system (EMC), the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999% and no similar failure in one billion hours of run time.
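
To put the 99.999% figure in perspective, here is a minimal sketch of the arithmetic that converts an availability percentage into a yearly downtime allowance. It assumes the figure is measured over a full calendar year and treats the outage as lasting roughly six days, as described above; the function and variable names are illustrative, and EMC's actual method of calculating availability is not stated in the VITA statement.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, including leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR * 60.0


# Downtime budget implied by "five nines" (assumption: measured per year).
five_nines_budget = allowed_downtime_minutes(99.999)

# Roughly six days, based on the restoration timeline described in the article.
outage_minutes = 6 * 24 * 60
years_of_budget = outage_minutes / five_nines_budget

print(f"99.999% allows about {five_nines_budget:.2f} minutes of downtime per year")
print(f"A six-day outage uses about {years_of_budget:.0f} years of that budget")
```

Under those assumptions, five nines leaves a little over five minutes of downtime a year, so an outage measured in days consumes well over a thousand years of that allowance. A fleet-wide availability average can still hold even when one site has a very bad week, which may be how both claims coexist.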

The official explanation for the outage leaves a bit to be desired and frankly doesn't pass the sniff test. The outage was blamed on the failure of two circuit boards installed and maintained by EMC.

Simply put, it's a bit disconcerting that two circuit boards can bring down a state's IT infrastructure for nearly a week. Talk about a lack of redundancy.
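
A rough sketch of why a double board failure looks so odd: if two redundant components fail independently, the chance of both failing in the same window is the product of their individual probabilities. The per-board probability below is purely hypothetical and is not an EMC figure, and independence is itself an assumption that a shared cause (a bad firmware push, a botched repair, a power event) would break.

```python
def both_fail_probability(p_single: float) -> float:
    """Probability that two independent, identical components both fail
    in the same time window (assumes the failures are uncorrelated)."""
    return p_single * p_single


# Hypothetical per-board failure probability for one window; not a vendor figure.
p_board = 1e-4
p_pair = both_fail_probability(p_board)

print(f"single board: {p_board:.0e}, redundant pair: {p_pair:.0e}")
```

The point of the sketch is the squaring, not the specific number: redundancy only buys that improvement when failures are truly independent, which is why a simultaneous failure usually points to a common cause or a process problem.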

Among the things that don't add up in the Virginia IT outage:

  • Why wouldn't these boards be replaced quickly?
  • Why was there a single point of failure?
  • According to the Washington Post, service was restored for 16 agencies, but 10 require "a lengthy restoration of data." Where was the disaster planning? After all, Northrop Grumman touted its disaster recovery for the state just two years ago.
  • Where did the IT management fail?

We're told that Northrop Grumman knows about its IT management issues and is working on correcting the problems. Northrop Grumman was awarded a $2.3 billion IT services contract in 2005. And the company has touted some of the state's successes. Meanwhile, Northrop Grumman even relocated to Virginia. Hopefully, that proximity will lead to better IT management.

10 comments
stokesje3

Question: where is the profit in this project? Where has Northrop Grumman spent its capital in all this? My thought is that what they want is data storage: build the server farms and outsource the rest. Once I have your data stored, I have you!

kanugent

VA state employees, who have been furloughed this year due to budget shortfalls, are not happy that Northrop Grumman has delivered such poor service at such a high price. I invite the news media to look into what Northrop Grumman charges for its services and how slow it has been to deliver on all of its services, not just Contingency Planning.

WishtobeIT

I work in Northern VA and this situation was by far the WORST in the nearly 6 years that I've been a county employee. This nightmare actually started on Thursday, 8/26, when our agency (not counting the other agencies that were affected) had TWO MAJOR systems go down. To this date, the Commissioner of our agency (statewide) HAS NOT offered an explanation as to WHY THIS HAPPENED--what was the cause? That is UNACCEPTABLE. We could not do work. I work in social services and this outage has AND STILL WILL affect clients' benefits, etc. We cannot serve the customers when our tools are INOPERABLE. I called our IT department (Help Desk-Tier 1) around 7:15 a.m. that morning to report it. They DIDN'T EVEN KNOW. This is so intolerable and I hope that your email reaches senior IT management in the state of VA and the Commissioner, as things still are not 100 percent. The DMV (Motor Vehicles) is STILL CLOSED (closed yesterday and today) because of this FIASCO!

josh.krischer

"According to the manufacturer of the storage system (EMC), the events that led to the outage appear to be unprecedented. The manufacturer reports that the system and its underlying technology have an exemplary history of reliability, industry-leading data availability of more than 99.999% and no similar failure in one billion hours of run time. The official explanation for the outage leaves a bit to be desired and frankly doesn't pass the sniff test. The outage was blamed on the failure of two circuit boards installed and maintained by EMC." These statements given by the vendor are questionable: 1. Statistics proving that the DMX-3 has industry-leading data availability do not exist, and the claim is most probably not true, because Hitachi commits to 100% availability with its comparable product. 2. How is the 99.999% availability calculated? 3. The probability of two cards (in a system with 99.999% availability) failing at once is close to 0. Some other questions: 1. When was the last time that Northrop Grumman tested DR failover for VITA? 2. Was the DR infrastructure audited by an external DR specialist? 3. If the platform was System z, then the Hyperswap function could have prevented this outage; if it is another platform, Hitachi Storage Cluster of USP-V could do the same. Josh Krischer, Analyst and consultant specialized in DR and Business Continuity with many years of hardware engineering experience (repairing mainframe to chip level, e.g.). www.joshkrischer.com

ksec2960

This does not add up. How can you have such a monumental single point of failure? My home network / data systems have better redundancy than this. Seems like poor management on NG's side and on the state's side. No disaster recovery planning or testing?

aikimark

One reason for the poor service is that the former state employees who went to work for NG were let go after ~18 months. NG had a bad track record for such IT takeovers before being awarded the initial contract. After the failures mentioned in this article, they got a raise. Someone needs to audit the Virginia politicians as well as NG. As far as I'm concerned, NG could just as easily be an acronym for "No Good", or "Nothing Gained".

josh.krischer

It may happen in outsourcing (ERP) deals: initially the deal is lucrative for the customer (usually ca. -20%), but once the customer is "locked in" the savings start to erode. In addition to savings made by economies of scale, some of the ERP providers try to increase the margin by saving on cheap infrastructure. ERP deals should be benchmarked and renegotiated before renewals; however, being in a lock-in situation weakens the leverage. Maybe it will prompt some companies considering "external clouds" to have second thoughts.

smankinson

I worked in another situation that was quite disastrous. The media never had a grasp on the situation, and never will. How can someone outside of such a complicated situation even begin to speculate or know when they are close to the truth? The best answer outside of those in the situation is, unfortunately, any resulting court cases. Good luck understanding that process!

Ron_007

(snip) According to the Washington Post, service was restored for 16 agencies, but 10 require "a lengthy restoration of data." Where was the disaster planning? (/snip) How do you know that this isn't the agreed-upon time frame in the DR plan? You can't recover everything at once. Maybe these apps are designated "low priority" for recovery? Dept Motor Vehicles is obviously not a priority system to recover (heavy sarcasm). I do agree that a single point of failure for 26 state agencies represents very poor DR planning. Isn't centralized IT great.