Google Compute Engine – Preemptible Instance Considerations


Overview

Compute Engine, the virtual machine infrastructure-as-a-service offering from Google Cloud Platform (GCP), and the equivalent of AWS EC2, provides the concept of Preemptible Instances.

Preemptible instances are similar to EC2 Spot Instances in that they provide access to lower cost compute resources, but with a trade-off in terms of the availability of the resources. However, Preemptible Instances on GCP differ from Spot Instances on AWS in that there is no concept of a “spot price” or bidding for the resources.  With Preemptible Instances, you simply get an 80% discount on the price of a regular instance, at all times.  This approach is much simpler than the Spot Instances concept, and can also be cheaper in certain scenarios, but at the same time does not offer the possibility of bidding a higher price to retain your instances.

Preemptible Instance Limitations

Preemptible Instances have three key limitations when compared to regular instances:

  1. They can be terminated at any time by the GCP framework, with only 30 seconds' notice.
  2. They never run for more than 24 hours, and are always terminated after this period of time, if not before.
  3. Availability of resources to allow creation of new preemptible instances is not guaranteed.

In practice, these limitations are easily accommodated, provided that you use preemptible instances for the right types of workload, and provided that you implement them correctly.
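
As one illustration of accommodating the first limitation, a worker process can watch for a shutdown signal and stop picking up new work once it arrives. The following is a minimal sketch, assuming that the guest operating system delivers SIGTERM to the worker process when the instance is shut down. The class name and the task model are hypothetical and are not taken from any Priocept tooling.

```python
import signal


# Assumption: only about 30 seconds are available after the preemption
# notice, so each unit of work should be short or checkpointed.
class PreemptionAwareWorker:
    """Runs queued tasks and stops cleanly once a shutdown is requested."""

    def __init__(self):
        self.stop_requested = False

    def handle_sigterm(self, signum, frame):
        # Only set a flag here; the main loop decides when it is safe to stop.
        self.stop_requested = True

    def install(self):
        # Register the handler. This must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, tasks):
        """Run tasks in order and return the tasks that were never started."""
        pending = list(tasks)
        while pending and not self.stop_requested:
            task = pending.pop(0)
            task()
        return pending
```

Any tasks returned by `run` can be handed back to a queue so that a replacement instance picks them up later.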

Preemptible Instance Use Cases

The limitations and characteristics described above mean that preemptible instances are generally not appropriate for use with long-running services, such as a web server or database.  However, they are a good match for running batch processes and other computational tasks that will run for a discrete period of time.  One example is the use of preemptible instances for build agents as part of your Continuous Integration / Continuous Delivery infrastructure.  We often use preemptible instances at Priocept to implement TeamCity build agents, for example.  The build agents never last for more than 24 hours, but this is not an issue, and the total cost of the build agent infrastructure is one fifth of what it would be with regular instances.
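
The "one fifth" figure follows directly from the 80% discount. The arithmetic can be sketched as below; the function name and the monthly cost used in the usage note are hypothetical and are not real pricing.

```python
def preemptible_cost(regular_cost: float, discount: float = 0.80) -> float:
    """Return the cost after applying a flat preemptible discount."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    # An 80% discount leaves 20% of the regular price, i.e. one fifth.
    return regular_cost * (1.0 - discount)
```

For example, a build agent pool that would cost a hypothetical 1000 per month on regular instances would cost roughly 200 per month on preemptible instances.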

Effective Preemptible Instance Implementation

To effectively implement preemptible instances, you will need to ensure that launching of your instances is completely automated so that the termination of an instance and its replacement with a new instance every 24 hours is as transparent to your end users as possible.

To achieve this, you can use the following pattern:

1. Automated Instance Configuration

Ensure that the launching (“bootstrapping”) of your instances is completely automatic.  You can use a custom image to achieve this, a configuration management tool such as Ansible or Puppet that automatically configures each new instance when it is launched, or a combination of both.  At Priocept we typically use a combination of Ansible and cmbootstrap to automatically configure a new instance.  cmbootstrap is launched from a standardized Compute Engine startup script (equivalent to userdata on EC2).  Based on the configured instance metadata, it retrieves the appropriate Ansible playbooks from the specified repository, then installs and runs Ansible in local mode to configure the instance.  Where this process is excessively slow, the configuration can be “baked” into a custom image so that it starts immediately with all the required configuration already in place.
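
To illustrate how a bootstrap step can read its configuration from instance metadata, here is a minimal sketch that queries the Compute Engine metadata server for a custom attribute. It does not show how cmbootstrap itself works. The attribute name `playbook-repo` used in the usage note is a hypothetical example, and the real keys depend on your own tooling.

```python
import urllib.request

# The metadata server is only reachable from inside a Compute Engine instance.
METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"


def build_metadata_request(attribute: str) -> urllib.request.Request:
    """Build a request for one custom metadata attribute of this instance."""
    url = f"{METADATA_ROOT}/instance/attributes/{attribute}"
    # The metadata server rejects requests that lack this header.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def read_attribute(attribute: str, timeout: float = 5.0) -> str:
    """Fetch the attribute value as text; raises URLError if unreachable."""
    request = build_metadata_request(attribute)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A startup script could call `read_attribute("playbook-repo")` to discover which repository to clone before running Ansible in local mode.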

2. Placement of Instances within Instance Groups

The step above will ensure that the instance will be configured automatically when it launches.  However, you also need to ensure that when the instance is terminated after 24 hours, it is automatically replaced with a new instance immediately thereafter.  If you were to just create a preemptible instance manually via the GCP web console, then once it was terminated by the GCP framework, it would be gone forever and there would be no replacement.  To address this, the instance must be configured to launch within an Instance Group.  Instance Groups are the GCP equivalent of Auto Scaling Groups in AWS.  You will specify a template for your instance (including the base image to use and startup-script configuration described above), plus rules on how many instances should be created.  It is perfectly valid to create an Instance Group that is configured with a minimum of one instance and maximum of one instance.  This means that the Instance Group will never auto-scale, but it does mean that when the preemptible instance is terminated after 24 hours by the GCP framework, it will be automatically replaced with a new instance.  This process will repeat indefinitely every 24 hours.
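
The template and single-instance group described above can be sketched as the request bodies you might send to the Compute Engine API. This is an illustrative sketch rather than a definitive definition. The field names reflect our understanding of the REST API and should be checked against the current documentation, and the function names and the default network are assumptions.

```python
def build_instance_template(name, machine_type, source_image, startup_script):
    """Return a request body for a preemptible instance template."""
    return {
        "name": name,
        "properties": {
            "machineType": machine_type,
            "scheduling": {
                # Marks instances created from this template as preemptible.
                "preemptible": True,
                # Preemptible instances cannot restart themselves or live-migrate.
                "automaticRestart": False,
                "onHostMaintenance": "TERMINATE",
            },
            "disks": [{
                "boot": True,
                "autoDelete": True,
                "initializeParams": {"sourceImage": source_image},
            }],
            # Assumption: the project's default network is used.
            "networkInterfaces": [{"network": "global/networks/default"}],
            "metadata": {
                "items": [{"key": "startup-script", "value": startup_script}],
            },
        },
    }


def build_single_instance_group(name, template_url):
    """Return a request body for a managed group fixed at one instance."""
    return {
        "name": name,
        "baseInstanceName": name,
        "instanceTemplate": template_url,
        # A fixed size of one: no auto-scaling, but lost instances are recreated.
        "targetSize": 1,
    }
```

Keeping the definitions as plain data makes it straightforward to store them in source control and submit them with whichever client or tool you prefer.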

Using this pattern, you can reduce your compute costs for batch processing tasks by 80%, without introducing any new overhead in terms of having to manually configure or launch new instances when instances are lost.

Note that if your preemptible instances need to allow inbound connectivity to a network service that they implement, then you will most likely need a load balancer in front of the instance(s) to present a consistent address as new instances come and go. This will introduce additional cost which may cancel out a significant proportion of savings achieved by using preemptible instances. However, preemptible instances are generally used where they can “pull” data from an external source to complete their work, such as the TeamCity build agent example above, or other similar batch processes where load balancing of incoming requests is not necessary. Note also that if you implement inbound network services using preemptible instances behind a load balancer, there will be periods when inbound requests are lost due to a lag between instances being terminated and the load balancer ceasing to send traffic to them – this means that this architecture is only suitable where you are in control of the retry logic of the associated client software.
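
Where you do control the client, the retry logic mentioned above might look like the following minimal sketch, which retries a failed call with exponential backoff. The function name, the default values, and the choice to retry only on `ConnectionError` are assumptions made for illustration.

```python
import time


def call_with_retries(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait 0.5s, 1s, 2s, ... so a replacement instance can come up.
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter exists only so that the delay can be replaced in tests; in normal use the default `time.sleep` applies.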

Regional Considerations

Note the third limitation described above, which means that the availability of a new preemptible instance is not guaranteed.  This is because the instances are given lower priority within a GCP region, relative to regular instances, and if the region is short of compute resources, then you can expect that preemptible instances may be terminated or fail to start at all.

Not only does this mean that instances may be terminated earlier than 24 hours and with virtually no notice – it also means that once an instance is terminated, the Instance Group may not be able to replace it with a new instance.  You may see your Instance Group “stuck” with a target instance count of 1, but an actual instance count of 0, as spare compute resources are not available to launch your new instance.

You can protect against this scenario by spreading your preemptible instances across multiple regions.  For example, on a recent Priocept project we found that our preemptible instance count in the London (europe-west2) region went down from the usual number of instances to zero.  Many hours later this was still the case and the instances were still not available.  But by splitting the original number of instances that were part of a single Instance Group in the London region, into two separate Instance Groups in two different regions (Belgium europe-west1 as well as London) and with half the total instance count allocated to each region, we were able to ensure that at least half the instances were available at all times.  A simultaneous lack of capacity in both regions would now need to occur to end up with zero running instances.  If necessary, this strategy can be extended to span the preemptible instances across three, four or as many regions as you wish. An alternative strategy would be to use preemptible instances as your primary compute resources, with a separate pool of standard instances as backup resources.
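
The even split across regions can be sketched as a small helper that computes the target size for each regional Instance Group. The function name is hypothetical, and handing any remainder to the first regions in the list is simply one reasonable choice.

```python
def split_across_regions(total, regions):
    """Divide a total instance count as evenly as possible across regions."""
    if total < 0:
        raise ValueError("total must not be negative")
    if not regions:
        raise ValueError("at least one region is required")
    base, extra = divmod(total, len(regions))
    # The first `extra` regions each receive one additional instance.
    return {
        region: base + (1 if index < extra else 0)
        for index, region in enumerate(regions)
    }
```

For example, `split_across_regions(10, ["europe-west1", "europe-west2"])` allocates five instances to each region, and each value becomes the target size of that region's Instance Group.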

Summary

Preemptible instances on Google Cloud Platform can reduce compute costs by 80% with little to no downside for the right kinds of workload, but it is necessary to make some investment in infrastructure automation to achieve this.

However, this automation is considered best practice for cloud infrastructure and should be part of your plans in any case, and since preemptible instances are inherently ephemeral they bring the advantage of enforcing an automated infrastructure-as-code approach in place of manually configured server instances. Using preemptible instances will give you the confidence that if anyone goes “out of process” and makes manual changes to an instance, these changes will never persist more than 24 hours and will automatically be unwound when the instance is replaced. For example, let’s say that a member of your ops team decided to manually SSH into an instance with root privileges, to diagnose and temporarily fix a problem. You can no longer be sure that the server is in a known and secure state, but you do know that any associated risk exposure will never last for more than 24 hours.

The overhead of using preemptible instances is therefore very small in practice, if you are already planning to “do things right” and build fully software-defined infrastructure.
