Google Cloud Preemptible

Creating the reliable cloud with unreliable components


Team Agilicus has been working very hard to build a secure, reliable, economical hybrid cloud for municipalities moving infrastructure applications online. The secure part is done with a set of best practices, cryptography, etc. The reliable part and economical part, however, can often conflict. How do we achieve both, creating a reliable cloud with unreliable components?

Traditionally, reliability came from making each individual component as close to infinitely reliable as possible. Servers had redundant power supplies, redundant fans, ECC memory, redundant power grids, etc. This creates great cost. But, ironically, it also creates a limit to reliability: all of those extra components introduce failure modes of their own. Eventually there are diminishing, and then negative, returns.

So we have to focus on system reliability. And this is the big mental leap in cloud-native: each individual component is expected to fail often enough to observe. How do we then make the overall system reliable enough that failures are not observed?

Once you have this mindset in place, you start looking to embrace the failures. Is there something that is 10% less available and 50% cheaper? Let’s use that! Enter the preemptible node. We use Google Cloud, which has the concept of a preemptible VM. In a nutshell, you tell Google: this node I’m running on, if you need to move that server or do maintenance in the datacentre, go ahead, 30 seconds’ notice is fine, pull the power. Since this gives Google greater flexibility, they make that capacity available for less money. And, since we expect nodes to fail anyway, and have designed a system that makes those failures unobservable, we embrace it.
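To make that 30-second warning concrete, here is a minimal sketch (not our production code) of how a process on a preemptible VM can watch Google’s instance metadata endpoint for the preemption flag and shut down gracefully. The metadata URL and header are Google’s documented GCE interface; the drain_and_exit() hook is a hypothetical placeholder for whatever cleanup your workload needs.

```python
# Minimal sketch: watch the GCE metadata server for a preemption notice and
# begin a graceful shutdown. On GKE the platform handles much of this for you;
# this just illustrates the ~30-second warning described above.
import time
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)

def is_preempted() -> bool:
    # The endpoint returns "TRUE" once Google has issued the preemption notice.
    req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode().strip() == "TRUE"

def drain_and_exit():
    # Hypothetical cleanup: stop accepting work, flush buffers, deregister, etc.
    print("Preemption notice received; draining...")

if __name__ == "__main__":
    while True:
        if is_preempted():
            drain_and_exit()
            break
        time.sleep(5)  # poll well inside the ~30 s warning window
```

The same idea is what lets the rest of the system stay unobservably healthy: the doomed node announces it is leaving, work drains to its neighbours, and the scheduler replaces it.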

Now, how often does this occur? If it occurred almost never, we would not have confidence that we actually handle it. Looking at the trailing 30 days of metrics for our clusters in Google Cloud’s Montreal region, and plotting a histogram of when preemption events occur by time of day, we see a cluster around 6pm. My guess is that a lot of people run ‘git commit && git push’ just before they head home, which cranks up their CI and causes a capacity spike and rebalance.
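For illustration, that kind of histogram is easy to produce from a list of preemption event timestamps. The timestamps below are placeholders, not our actual metrics; the bucketing by hour of day is the point.

```python
# Sketch: bucket preemption events by hour of day to spot patterns
# like the ~6pm cluster mentioned above. Timestamps are placeholders.
from collections import Counter
from datetime import datetime

events = [
    "2019-06-03T17:58:12",
    "2019-06-03T18:04:40",
    "2019-06-04T18:11:03",
    "2019-06-05T09:22:51",
]

by_hour = Counter(datetime.fromisoformat(ts).hour for ts in events)

for hour in range(24):
    print(f"{hour:02d}:00  {'#' * by_hour[hour]}")
```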

We did have a choice. We could have chosen non-preemptible nodes, costing us (and our customers) more money. But, and this is key, we would then have either reduced reliability (by assuming the non-preemptible nodes were infinitely reliable) or done the same work with Kubernetes, service meshes, etc. anyway, leaving the cost savings behind for no reason. We chose to create a reliable cloud with unreliable components.

The moral of the story: embrace failure, design for it, and use it to reduce your cost while increasing your reliability.