"Do not ask whether any particular data center has failures. Ask what they do when they have a failure," asserts Jason Weckworth, senior vice president and COO at RagingWire Data Centers. Weckworth, a battle-hardened veteran, offers a personal example involving a debilitating power outage at a RagingWire data center. "I distinctly recall sitting in front of the Board of Directors and Executives at 2:00 AM trying to explain what we knew up to that point in time," writes Weckworth. "But we didn't yet have a root-cause analysis."
The conversation then became tense. "You haven't slept in two days," says one of the chief executives. "We know we are stable at the current time, but we don't yet have an answer for the root cause of the failure, and we have Fortune 500 companies that are relying on us to give them an answer immediately as our entire business is at risk."
Next came the ultimatum. "So make no mistake. There will be a fall guy, and it's going to be you if we don't have the answer to prove this will never happen again," continues the executive. "You have four hours, or you and all your engineers are fired!"
The story ends well. However, nine years later, Weckworth has not forgotten the incident. "Our common goal as operators is to mitigate risk, address incidents quickly and thoroughly," explains Weckworth, "and return the facility to its original, normal condition, with full redundancy."
Uptime management and operations-approved sites
Weckworth feels operators can reduce the impact of data-center incidents by joining the Uptime Institute Network. "The Uptime Institute Network offers meaningful peer-to-peer interaction and a safe forum for knowledge transfer free from the influence of vendors or concern over trade secrets," asserts Uptime Institute. "Membership includes access to evidence-based best-practice information; benchmarking and reports; detailed error and incident tracking; and trends, regional events, and behind-the-scenes tours of state-of-the-art data-center facilities."
CenturyLink, a member company, agrees with Uptime Institute's silo-busting philosophy. During a phone conversation, Joel Stone, vice president of global data-center operations at CenturyLink, said: "The simple fact is data centers require human intervention to clear up failures and problems that are not programmed or recognized by automated processes."
Besides belonging to the peer network and obtaining tier-rating certifications from Uptime Institute, CenturyLink has committed to having every one of its 58 data centers receive Uptime Institute's Management and Operations (M&O) Stamp of Approval within the next two years. Stone feels the approval will reduce incidents, ensure consistency, and give personnel the transparency needed to make certain everyone follows the same processes globally. CenturyLink is the first data-center hosting provider to commit globally to Uptime Institute's M&O guidelines.
Uptime Institute's M&O guidelines
Obtaining Uptime Institute's M&O Stamp of Approval means the data center passed muster in the following management and operational processes. The exact criteria examined during the M&O approval audit are available from Uptime Institute.
Staffing and Organization: Having enough qualified staff to run the data center and perform maintenance is a must. Uptime Institute wants to verify that roles and responsibilities are defined and approved by management, making certain the entire organization is focused on achieving the desired uptime objective.
Maintenance: Uptime Institute considers preventive and predictive maintenance programs, vendor support, adequate resources, and tracking capabilities necessary. "A preventive maintenance program that keeps equipment in top performance condition is the most effective way to minimize equipment failures," suggests the Uptime site. "Fully-scripted processes and procedures for accomplishing all necessary maintenance activities need to exist."
Training: Training is another obvious consideration. However, changes to a site's uptime objective or complexity are not always taken into account when training is planned. "As the uptime objective or site complexity increases, so does the requirement for a more comprehensive and rigorous training program to prevent human error," the guide asserts.
Uptime Institute also checks third-party vendor training, in order to ensure visitors are aware of site-specific policies and procedures.
Planning, Coordination, and Management: This portion of the approval process looks at site policies, financial-management policies, and site-infrastructure libraries, specifically how well they are understood and followed. In addition, a complete infrastructure reference library and current as-built drawings of the data center should be available on-site.
Operating Conditions: Uptime Institute wants consistent and documented management of capacity and setpoints that will make sure adequate power and cooling exist for the IT equipment. Regarding electrical capacity, the guide mentions, "Load management decisions need to be established, documented, and practiced based on electrical capacity components to ensure maximum loads are not exceeded and capacity is reserved for switching between components."
Why this is important to Uptime Institute and CenturyLink
Uptime Institute, as a certifying organization, feels approved behavior in each of the above process areas will provide the best chance of obtaining a solid 24x7 data-center operation. "Focus on the recommended behaviors will assist in attaining the full performance and site uptime potential with the installed infrastructure, improve the efficiency of operations, and realize opportunities for energy efficiency," adds the guide.
Something else to consider: CenturyLink believes the M&O Stamp of Approval will become as prevalent as the Uptime tier-rating certification.