Vigilent CTO Cliff Federspiel authored the following article for the Data Centre Dynamics website, discussing how the promise of 99.999 percent uptime distracts from the real reliability problems:
Deep down, everyone knows that the five nines we see in 99.999 percent uptime promises everywhere is an idealized reliability concept. Five nines means only a 0.001 percent probability of failure over a given interval. From a time perspective, it means that a given service will be down for no more than 0.001 percent of the time, which translates to roughly five minutes per year.
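As a rough sketch of that arithmetic (a minimal Python example; the calendar constant is an approximation, not a figure from the article):

```python
# Convert an availability target into the downtime it allows per year
availability = 0.99999                    # "five nines"
minutes_per_year = 365.25 * 24 * 60       # about 525,960 minutes
allowed_downtime = (1 - availability) * minutes_per_year
print(round(allowed_downtime, 1))         # -> 5.3 minutes per year
```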
This type of high-nines reliability metric is commonly applied to components of technical systems, such as line cards in a switch or power supplies. But a data center is a complex, interconnected system of components, and its overall reliability is driven by its least reliable components. Since not all components are equally reliable, the concept of high-nines reliability, even if correct for some components, is more a marketing statement than an accurate assessment of overall data center reliability.
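To illustrate the weakest-link effect with made-up availability figures rather than anything measured: if a service depends on several subsystems that must all work, the overall availability is at most the product of the subsystem availabilities, so a single two-nines subsystem pulls the whole chain down to roughly two nines.

```python
# Illustrative only: availability of a chain of subsystems that must all work
subsystems = {
    "power": 0.99999,    # five nines
    "network": 0.99999,  # five nines
    "cooling": 0.99,     # two nines, e.g. a weather-limited design
}
overall = 1.0
for availability in subsystems.values():
    overall *= availability
print(round(overall, 5))  # -> ~0.98998: the weakest link dominates
```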
“The five nines, in the majority of cases, is a marketing figure that doesn’t stand up in practice, isn’t supported by evidence, and doesn’t show forward-looking risk,” according to Andy Lawrence, VP of Research at Uptime Institute. In addition, a 2018 survey by 451 Research found that 48 percent of respondents had experienced a major IT or data center outage within the previous three years, two of which involved 911 emergency switching gear that had been moved into data centers.
Clearly data centers don’t deliver five nines of service reliability. So, what are the weak links in the reliability chain? Take a look at cooling.
Data center cooling is typically designed to withstand a 50-year or a 100-year weather event, which sounds like very high reliability. But a 100-year design means a one percent probability of such an event occurring in any given year, which is just two nines, not five! If the life of the data center is 20 years, then designing it for a 100-year weather event translates to an 18 percent chance that the weather will exceed the design condition at some point during the life of the data center.
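A quick sketch of that calculation, assuming each year's weather is independent (the 20-year lifetime and one percent annual probability come from the paragraph above):

```python
# Chance that a "100-year" event occurs at least once over the facility's life,
# assuming each year is independent
annual_probability = 0.01   # the 100-year design event
lifetime_years = 20
p_exceed = 1 - (1 - annual_probability) ** lifetime_years
print(round(p_exceed, 3))   # -> 0.182, i.e. roughly an 18 percent chance
```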
What has made this risk more tolerable is that most data centers don’t run at full load. But that is not an acceptable business strategy. Everyone in the data center business is pushing for higher loads in existing facilities. So if your design is only two nines, and the only thing saving you from failure is a sales guy who can’t make quota, you have a business problem.
Another consideration is that cooling capacity degrades with time due to wear and tear. So, if a data center started off able to withstand a 100-year weather event at full load, it may only be able to withstand a 50-year weather event after a number of years.
Cooling system reliability becomes an even bigger concern with climate change. Average temperatures around the globe are increasing, and extreme weather events are getting more extreme. 100-year weather events are becoming much more frequent. Both the imminent and the probable impacts of these changing climate conditions are well documented by 451 Research and others.
What users really care about
I agree with Andy Lawrence’s opinion about five nines. Even if five nines is a reasonable reliability standard for the internal components of a data center, it has little bearing on the reliability of the facility as a whole.
Ultimately, consumers of data center services don’t care why service outages happen. They care that outages might happen, and that they did happen. I think it is time for more focus on the weak links in the reliability chain and less reliance on five-nines claims. The combined impact of natural wear and tear and climate change makes the cooling system one of the weakest links in the entire data center reliability chain, and one worthy of reliability and optimization focus.