Atomic Operations on EC2

You have many instances running on Amazon’s EC2.

Each instance is stateless, identical and load balanced.

You need to run an ssh command on every instance, and if it fails on any of them you need to run a compensating rollback.

Incidentally, this is exactly the problem James and I faced whilst building our continuous delivery pipeline. We were following suggestions straight out of Jez Humble’s brilliant book, Continuous Delivery, and our deployment stage needed to update (and potentially roll back) code across multiple live EC2 instances.

What We Did

The codebase we deploy is node.js, so it felt natural to write the deployment code in node as well. That lets us write tests against the deployment (mocking out the actual ssh calls) and run them earlier, in the commit stage, to check that it behaves as expected before running it for real against the production system.
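A minimal sketch of how such a test can mock out ssh, assuming the deployment step receives its ssh runner as an argument. The names updateInstance and fakeSsh, the command string and the host addresses are hypothetical and exist only for illustration; they are not part of our pipeline or of any library.

```javascript
// Hypothetical deployment step. The ssh runner is passed in, so a test can
// substitute a fake and no real connection is ever opened.
function updateInstance(runSsh, host, callback) {
  runSsh(host, 'deploy.sh update', function (err) { // hypothetical command
    if (err) {
      return callback(new Error('update failed on ' + host + ': ' + err.message));
    }
    callback(null);
  });
}

// Fake ssh runner: records every call and fails for the listed hosts.
function fakeSsh(failingHosts) {
  var run = function (host, command, callback) {
    run.calls.push({ host: host, command: command });
    if (failingHosts.indexOf(host) !== -1) {
      return callback(new Error('exit code 1'));
    }
    callback(null);
  };
  run.calls = [];
  return run;
}

// Usage: simulate a failure on one host and check how the step reports it.
var ssh = fakeSsh(['10.0.0.2']);
updateInstance(ssh, '10.0.0.2', function (err) {
  console.log(err ? 'failed as expected: ' + err.message : 'unexpected success');
});
```

Because the fake records its calls, a test can also assert on exactly which commands would have been sent to which hosts.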

We realised this might be a common problem and created a devops module we called ec2-each. You can use it to query EC2 for instances using a filter (for example, instances that are running or that carry a particular tag) and then perform the same task on each instance. If any of the operations fails, the callback receives an error and you can perform a compensating action.

The following example loops over each running instance and simply logs its reservation id:

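// Assumes the module is installed under the name ec2-each.
var each = require('ec2-each');

var logReservationIds = function(callback) {
  // Placeholder credentials; substitute your own.
  var config = {
    accessKeyId: "AAAABBBBCCCCDDDDEEEE",
    secretAccessKey: "aaaa2222bbbb3333cccc4444dddd5555eeeefff",
    awsAccountId: 123456789012,
    region: "eu-west-1"
  };

  // Task run once per instance; pass an error to the callback to signal failure.
  var logReservationId = function(item, state, callback) {
    console.log(item.reservationId);
    callback(null);
  };

  var ec2 = new each.EC2(config);
  ec2.running(function(err, instances) {
    // Stop early if the query for running instances failed.
    if (err) {
      return callback(err);
    }
    ec2.each(instances, logReservationId, null, callback);
  });
};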

We actually use a number of loops to stage the upgrade, calling a different ssh command in each.

This allows a controlled rollback of all instances if any single instance fails to update, giving us an atomic update of the codebase across all running production instances. Because the deployment script is also tested earlier in the pipeline, we can be confident that a release into production is controlled and will roll back if anything unexpected happens.
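The staging idea can be sketched without any AWS calls. The code below is an illustration under stated assumptions, not our actual deployment script: forEachInstance is a hypothetical stand-in for a helper that applies a task to every instance and reports the first error, and each stage is assumed to supply an apply task and an undo task, where undo is safe to run even on an instance whose apply failed.

```javascript
// Hypothetical stand-in for "run this task on every instance".
// Calls back once all tasks have finished, passing the first error seen.
function forEachInstance(instances, task, callback) {
  var remaining = instances.length;
  var firstError = null;
  if (remaining === 0) {
    return callback(null);
  }
  instances.forEach(function (instance) {
    task(instance, function (err) {
      if (err && !firstError) {
        firstError = err;
      }
      remaining -= 1;
      if (remaining === 0) {
        callback(firstError);
      }
    });
  });
}

// Runs stages in order. If any stage fails on any instance, every attempted
// stage (including the failed one) is undone on all instances, newest first.
function deploy(instances, stages, callback) {
  var attempted = [];

  function rollback(cause) {
    if (attempted.length === 0) {
      return callback(cause);
    }
    var stage = attempted.pop();
    forEachInstance(instances, stage.undo, function () {
      // Undo errors are ignored to keep the sketch short;
      // real code should report them.
      rollback(cause);
    });
  }

  function runStage(index) {
    if (index === stages.length) {
      return callback(null);
    }
    attempted.push(stages[index]);
    forEachInstance(instances, stages[index].apply, function (err) {
      if (err) {
        return rollback(err);
      }
      runStage(index + 1);
    });
  }

  runStage(0);
}
```

In this sketch the caller sees either a null error (every stage succeeded everywhere) or the original failure after all undo tasks have run, which is what makes the update look atomic from the outside.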