Because those old components were left in place, they incorrectly reported that usage was at zero. The outage would have occurred earlier if not for a grace period the company had put in place. Unfortunately, that grace period expired, and the company’s automated systems started to behave as if the problem were real. Google had safeguards in place to prevent these types of issues, but they weren’t built to handle the exact case that occurred on Monday morning.
“We would like to apologize for the scope of impact that this incident had on our customers and their businesses,” Google said. “We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span multiple regions.”
While the company’s engineers were able to address the problem relatively quickly, Google says it plans to implement new measures to prevent a similar situation in the future. In particular, one of its goals is to communicate more clearly when an outage takes down its services. It also plans to improve its monitoring systems so that it can catch incorrect configurations sooner.