Dear Loyal Readers,
High availability has many benefits. I feel that people focus on the nines of uptime (99% or 99.999%), but the real benefit is the ability to live-update different parts of your website. That is as much of a bonus for continuous deployment as it is for uptime.
Netflix’s Chaos Monkey blog post is now iconic, but what if you know you are not up for that challenge yet?
The first thing you need is a way to check the health of your application. The check needs to exercise the various back ends your website depends on. That means forcing the front ends of the application to traverse at least some part of each dependent pathway (i.e. each database, the messaging cluster, and any back-end applications). On our project, this is someone’s full-time job. This test will run during every HA exercise you execute, so it needs to be able to run continuously.
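A minimal sketch of such a health check, in Python. Everything here is an assumption for illustration: the names `check_health`, `is_healthy`, and `watch` are hypothetical, and each "probe" stands in for whatever call makes your front end touch one dependency (a trivial database query, a test message through the messaging cluster, and so on).

```python
import time
from typing import Callable, Dict, List

# Hypothetical probe type: a zero-argument callable that raises on failure.
Probe = Callable[[], None]


def check_health(probes: Dict[str, Probe]) -> Dict[str, str]:
    """Run every probe once and report 'ok' or the failure per dependency."""
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:  # any exception marks this pathway unhealthy
            results[name] = f"FAIL: {exc}"
    return results


def is_healthy(results: Dict[str, str]) -> bool:
    """The application is healthy only if every dependent pathway is."""
    return all(status == "ok" for status in results.values())


def watch(probes: Dict[str, Probe], rounds: int,
          interval_s: float = 1.0, sleep=time.sleep) -> List[Dict[str, str]]:
    """Run the check `rounds` times, pausing between rounds, and keep history."""
    history = []
    for i in range(rounds):
        results = check_health(probes)
        history.append(results)
        if not is_healthy(results):
            print(f"round {i}: {results}")
        if i < rounds - 1:
            sleep(interval_s)
    return history
```

Keeping the probes as plain callables means the same loop can run unchanged while you take boxes down; only the probe list is specific to your application.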
Next we’ll organize our boxes into clusters by functionality. For MongoDB each shard would be a cluster, but for Tomcat we would have to group identical app deployments together. Each instance in a cluster provides the same service and, by the laws of High Availability, the cluster should be able to continue providing service even if one instance within it goes down.
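As a rough illustration of that grouping, here is a small Python sketch. The inventory format, host names, and function names are all hypothetical. It groups boxes by the service they provide and flags any cluster with fewer than two instances, since such a cluster cannot survive losing one.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_into_clusters(inventory: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (hostname, cluster_key) pairs into cluster_key -> [hostnames]."""
    clusters = defaultdict(list)
    for host, cluster_key in inventory:
        clusters[cluster_key].append(host)
    return dict(clusters)


def single_points_of_failure(clusters: Dict[str, List[str]]) -> List[str]:
    """Return the clusters that have only one instance (or none)."""
    return sorted(key for key, hosts in clusters.items() if len(hosts) < 2)


# Hypothetical inventory: one cluster per shard, one per app deployment.
inventory = [
    ("mongo-a1", "mongo-shard-a"),
    ("mongo-a2", "mongo-shard-a"),
    ("mongo-a3", "mongo-shard-a"),
    ("tomcat-1", "tomcat-storefront"),
    ("tomcat-2", "tomcat-storefront"),
    ("tomcat-3", "tomcat-admin"),
]
clusters = group_into_clusters(inventory)
print(single_points_of_failure(clusters))
```

Passing this check is necessary but not sufficient: two instances in a cluster only help if the service actually fails over, which is what the exercises are meant to prove.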
Finally, we take a box down from a cluster in progressively less graceful ways to see how the clusters and application react.
- service [servicename] stop
This is the nicest way to take down a service. This should allow any cleanup code to execute, and for primaries to be migrated as nicely as possible. This will help you detect when your application isn’t handling the happy-path case.
- shutdown / reboot
This should work the same as the service stop above, but is the full simulation of a maintenance operation. We are in decent shape if we can reboot any node and the application doesn’t suffer.
- kill -9
Now we are playing dirty. This begins to simulate the less-happy cases of when a random service dies. How long does it take your cluster to respond and other primaries to take over? Does your application see any errors during this time?
- ifconfig eth0 down
This is the “pull the plug” scenario. It simulates a network outage, where a box suddenly becomes unreachable in a different way than in the previous scenarios. You’ll need hypervisor access to recover from this one :-P.
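The four steps above can be sketched as an ordered "failure ladder" in Python. This is only an illustration under stated assumptions: the function names are hypothetical, `eth0` is taken from the example above, the commands are meant to be run on the target box, and the sketch defaults to a dry run that only prints each command. `longest_outage` is a hypothetical helper for the "how long does it take" question; it expects time-ordered health samples collected while a step is running.

```python
import shlex
import subprocess
from typing import Iterable, List, Tuple


def failure_ladder(service: str, pid: int, iface: str = "eth0") -> List[Tuple[str, List[str]]]:
    """Return the four takedown steps, from most to least graceful."""
    return [
        ("graceful stop", ["service", service, "stop"]),
        ("reboot", ["reboot"]),
        ("hard kill", ["kill", "-9", str(pid)]),
        ("network down", ["ifconfig", iface, "down"]),
    ]


def run_step(argv: List[str], dry_run: bool = True) -> str:
    """Print the command in dry-run mode; otherwise execute it locally."""
    line = shlex.join(argv)
    if dry_run:
        print(f"[dry-run] {line}")
        return line
    subprocess.run(argv, check=True)  # raises if the command fails
    return line


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest gap in seconds from a first failed sample to the next healthy one.

    An outage that never recovers within the samples is not counted.
    """
    longest = 0.0
    started = None
    for timestamp, healthy in samples:
        if not healthy and started is None:
            started = timestamp
        elif healthy and started is not None:
            longest = max(longest, timestamp - started)
            started = None
    return longest
```

Run one step at a time against one box per cluster, keep the health check running throughout, and only move down the ladder once the previous step causes no errors.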
Until the robots evolve,