A Crash Course In Failure

By Craig Stuntz


When is the last time you intentionally unplugged a live, production server? Better still, when is the last time you intentionally unplugged a rack of live, production servers? I can think of a couple of reasons why the answer might be "never."

  • Your installation has no redundancy or very low redundancy. This is a legitimate decision when low prices are considered more important than high availability. It means that when hardware fails or software crashes, the system will be unavailable.
  • Redundancy has been designed into your system, but you don't actually trust it.
Hardware will fail. Software will crash. Those are facts of life. In a service the size of Hotmail, hardware failures are not even uncommon; they are a routine part of the day-to-day operations of the service.
While few people would claim the software they produce and the hardware it runs on never fails, it is not uncommon to design a software architecture under the presumption that everything will work. Error handling is then added as a sort of afterthought to be invoked in the unlikely event it is ever needed. This is in contrast to, for example, the way that aikido and other martial arts are taught; the student first learns to fall, and only later learns how to throw other students. What would happen if we created software architectures in the same way?
In a short and very readable paper, "On Designing and Deploying Internet-Scale Services," James Hamilton notes that this approach may be the only way to create a massively scalable system:
Once the service has scaled beyond 10,000 servers and 50,000 disks, failures will occur multiple times a day. If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed.
Read that again: "... never shut the service down normally." This runs counter to just about everything we've been told about administering servers, which typically hide their power switches behind lock and key, if they have one at all. But Hamilton, who is presently Vice President and Distinguished Engineer at Amazon Web Services, and was previously an architect of Windows Live Services, would appear to know a thing or two about administering servers.
So let's consider this for a moment. What is likely to happen when you hard-fail a server? We can split this into three categories:
  1. Fault-intolerant hardware and software will not recover, at least not without manual intervention. This includes, for example, database configurations which do not do careful write ordering or journaling.
  2. Fault-recoverable hardware and software will come back online, eventually. This would include filesystems which don't do journaling, but could recover with something like fsck. It would also include database servers which use in-place updates and log files; to roll back incomplete transactions, portions of the log file may have to be replayed when the database server restarts.
  3. "Crash-only" hardware and software is designed to withstand failure and recover very quickly. Examples include journaling filesystems such as ext3 and MVCC databases such as InterBase (when careful writes or journaling are enabled) which can recover from a crash almost instantaneously.
If you're going to take Hamilton's advice it would be best if all of your hardware and software is "crash-only," and you'd better not have any fault-intolerant components. More importantly, if you believe you have designed for redundancy and availability, but are afraid to hard-fault a rack due to the presence of non-crash-only hardware or software, then you're fooling yourself.
This leads to an interesting question: Why do we design for "normal" shutdowns at all? In their paper "Crash-Only Software," George Candea and Armando Fox note that most OSs recover more quickly from a crash than they do from a "normal" restart. In their view, a "normal" shutdown is mostly an expensive means of avoiding testing.
There is a performance cost for crash tolerance; techniques such as filesystem journaling or careful write orders must be employed to ensure that written data is recoverable. So while restarting the OS may be faster when recovering from a crash, day-to-day performance may be a bit slower in order to ensure that this recovery is possible. However, this is only a true “cost” if it is more expensive than the cost of recovering a crashed system when such safeguards are not in place.
OK, but what does this mean for me?
The discussion above may seem academically interesting, but most of us do not administer Hotmail. In my office even business-critical services such as source control do not employ redundant servers (although we use RAID arrays, which is a form of hardware redundancy). We have made an economic choice to tolerate the occasional unavailability due to hardware failure rather than pay the hardware and administration costs of full redundancy.
But we can still learn from Hamilton's experience. We all experience hardware and software failures and, more critically, so do our customers.
The first computer my family ever owned was a Commodore 64. This was a crash-only device. There was no shutdown command; instead, you shut the computer off by hard-faulting the device (i.e., turning the power off). Most DOS machines worked the same way. So it is clear that crash-only designs are not strictly for Internet-scale applications.
Build crash-only software
In many cases, building crash-only software is not difficult if the requirement is considered from the outset. For example, the HTTP protocol is stateless by nature. As long as you do not add non-re-constructible state to your application and as long as you handle requests in an atomic manner, you can build a crash-only web application in a very natural way.
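As a minimal sketch of what "handle requests in an atomic manner" can look like, the Python below persists each update by writing a temporary file and renaming it over the old one. The names save_atomically and handle_request, the JSON file store, and the "set key" request are all hypothetical; the point is only that a crash at any moment leaves either the old state or the new state on disk, never a half-written file.

```python
import json
import os
import tempfile


def save_atomically(path, data):
    """Write data as JSON so a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_path, path)  # replaces the destination atomically on POSIX
    except BaseException:
        # Remove the partial temp file; the original file is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def handle_request(store_path, key, value):
    """Handle one hypothetical 'set key' request as a single atomic update."""
    state = {}
    if os.path.exists(store_path):
        with open(store_path) as f:
            state = json.load(f)
    state[key] = value
    save_atomically(store_path, state)
    return state
```

Because nothing here depends on a shutdown step, killing the process between requests, or in the middle of one, needs no special recovery code.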
When you design for Windows Vista, one of the requirements is that your software should shut down quickly, without interrupting system shutdown with prompts for the user to save their work and the like:
  • All applications must be “restart manager aware” by listening and responding to the following shutdown messages:
    1. WM_QUERYENDSESSION with LPARAM = ENDSESSION_CLOSEAPP(0x1): GUI applications must respond (TRUE) immediately in preparation for a restart.
    2. WM_ENDSESSION with LPARAM = ENDSESSION_CLOSEAPP(0x1): Applications must return a 0 value within 30 seconds and shutdown.
    3. CTRL C: Console applications that receive this message should shutdown immediately.
One (relatively difficult) way to achieve this requirement is to heavily optimize the work you do at shutdown. Another way is to make your software crash-only. In order to respond immediately to WM_QUERYENDSESSION, you must be prepared to shut down without ever prompting the user to save their work. In order to respond quickly to WM_ENDSESSION, you could write an optimized method to put all savable user data into a recoverable state, or you could maintain savable user data in a recoverable state for the entire lifecycle of your application, and simply do nothing when the system restarts.
As many people have observed, it is sort of absurd that users have to tell software that they would like to save their work. In truth, users nearly always want to save their work. Extra action should only be required in the unusual case where the user would like to throw their work away.
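One way to keep user data in a recoverable state for the whole lifecycle of an application is an append-only journal: every edit is written to disk before it is applied in memory, and startup simply replays the journal. The sketch below is an illustration under assumptions, not a recipe from any particular product; the Journal class, its JSON-lines file format, and the field/value document model are all made up for the example.

```python
import json
import os


class Journal:
    """Append-only log of user edits (hypothetical format: one JSON pair per line)."""

    def __init__(self, path):
        self.path = path
        self.document = self._recover()
        self._log = open(path, "a", encoding="utf-8")

    def _recover(self):
        """Rebuild the document by replaying every complete journal record."""
        document = {}
        if not os.path.exists(self.path):
            return document
        good_bytes = 0
        with open(self.path, "rb") as log:
            for raw in log:
                if not raw.endswith(b"\n"):
                    break  # torn final record from a crash mid-write
                try:
                    field, value = json.loads(raw)
                except (ValueError, TypeError):
                    break  # corrupt record: stop replaying here
                document[field] = value
                good_bytes += len(raw)
        # Discard anything after the last complete record so new edits append cleanly.
        with open(self.path, "r+b") as log:
            log.truncate(good_bytes)
        return document

    def edit(self, field, value):
        """Record the edit durably, then apply it in memory."""
        self._log.write(json.dumps([field, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.document[field] = value
```

With this shape there is no "save" step to run at shutdown. The application can do nothing when the session ends, and opening the same path again restores every edit that was completely written.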
There is failure, and then there is failure…
Not all “failures” are created equal. Some aren’t even failures. For example, if a user enters invalid data, then that should be allowed within the contract of the component which accepts user data. It is distinct from events outside of the contract of that component, like running out of memory. Therefore the failure path is different. Invalid user data should be handled correctly (for example, by informing the user), but would not be presumed to leave the system in an invalid state. If the system runs out of memory, however, the system is in an invalid state and should crash.
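The distinction can be made visible in code by giving in-contract problems their own exception type and catching only that type. This is a small sketch with made-up names (ValidationError, parse_age, handle_form) and an arbitrary 0-150 range; it is meant to show the shape of the two failure paths, not a complete input-handling design.

```python
class ValidationError(Exception):
    """Invalid user input: an expected outcome inside the component's contract."""


def parse_age(text):
    """Return the age as an int, or raise ValidationError for bad input."""
    try:
        age = int(text)
    except ValueError:
        raise ValidationError(f"{text!r} is not a whole number") from None
    if not 0 <= age <= 150:
        raise ValidationError(f"{age} is outside the range 0-150")
    return age


def handle_form(text):
    """Report bad input to the user; let everything else propagate."""
    try:
        return f"Age recorded: {parse_age(text)}"
    except ValidationError as error:
        # Inside the contract: tell the user and carry on in a valid state.
        return f"Please correct your entry: {error}"
    # Deliberately no 'except Exception' here: MemoryError and similar events
    # are outside the contract, so they propagate and end the process.
```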
Unit test failure
Go look at any introductory presentation on unit testing, and count the number of tests you see for failure cases. In most of the presentations I've seen, the number has ranged between zero and one. But testing for failure is really important; as we know, failures happen all the time. In order to have "complete" unit test coverage (by "complete" I don't mean 100% of code; I mean 100% of unit testable features), we must test error handling as well as other features. Most unit test frameworks have a notion of "expected exceptions" built in. Also test the case where a dependent method throws an exception. Use input fuzzing to test the ability of methods to respond to unexpected input.
You’ve heard of “test-first development”? I’m proposing “test-failure-first development.”
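Here is roughly what failure-first tests might look like with Python's built-in unittest module. The functions under test (parse_quantity and total_price) are hypothetical stand-ins; the tests cover an expected exception, a dependency that throws, and a crude form of input fuzzing with a fixed seed.

```python
import random
import unittest
from unittest import mock


def parse_quantity(text):
    """Hypothetical function under test: parse a positive integer quantity."""
    quantity = int(text)  # raises ValueError for non-numeric text
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity


def total_price(text, price_lookup):
    """Multiply the parsed quantity by a price fetched from a dependency."""
    return parse_quantity(text) * price_lookup()


class FailureTests(unittest.TestCase):
    def test_rejects_non_numeric_input(self):
        with self.assertRaises(ValueError):
            parse_quantity("three")

    def test_rejects_zero(self):
        with self.assertRaises(ValueError):
            parse_quantity("0")

    def test_dependency_failure_propagates(self):
        # The dependency is replaced by a mock that always throws.
        failing_lookup = mock.Mock(side_effect=ConnectionError("price service down"))
        with self.assertRaises(ConnectionError):
            total_price("2", failing_lookup)

    def test_fuzzed_input_only_fails_as_documented(self):
        rng = random.Random(0)  # fixed seed keeps the test repeatable
        for _ in range(1000):
            length = rng.randrange(0, 8)
            text = "".join(chr(rng.randrange(32, 127)) for _ in range(length))
            try:
                self.assertGreater(parse_quantity(text), 0)
            except ValueError:
                pass  # the documented failure mode; anything else fails the test
```

Three of the four tests here exercise a failure path, which is about the ratio the paragraph above argues for.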
Manage dependencies
It is important to test not only the behavior of the component when that component itself fails but also how the component reacts when a dependent component either clearly fails or simply fails to respond. What is the correct behavior in this case? Well, it depends...
  • In some cases, when a dependent component fails, the entire operation should fail. Perhaps other things should fail, too. If you don't manage state carefully, the state of the entire application could be questionable. I have previously commented on Erlang's elegant mechanism for this, termed "let it crash" programming.
  • When a dependent component fails or fails to respond, the calling component could wait for a little while and then retry the entire operation from the start. Be careful, though. People often use this as their default response without thinking the consequences through. If components at multiple levels wait and retry, you could end up waiting for a very long time. Raymond Chen notes, "I've seen this go wrong many times. So much so that my personal recommendation is simply never to retry automatically."
  • It is almost never correct to continue on as though the failure had not occurred. This is why eating exceptions is dangerous.
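If you do decide that retrying is right for a particular call, the options above suggest some guard rails: retry at one level only, retry only failures you believe are transient, cap the number of attempts, and fail the whole operation loudly when the budget runs out. The following is a sketch of that policy, not a recommendation to retry by default; call_with_budget and DependencyFailed are hypothetical names, and treating ConnectionError as the transient case is an assumption.

```python
import time


class DependencyFailed(Exception):
    """Raised when a dependency still fails after the retry budget is spent."""


def call_with_budget(operation, attempts=3, delay_seconds=0.1, sleep=time.sleep):
    """Run operation(); retry a bounded number of times, then fail loudly."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError as error:  # retry only failures assumed transient
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
        # Any other exception is not caught: it propagates immediately
        # instead of being eaten or retried.
    raise DependencyFailed(f"gave up after {attempts} attempts") from last_error
```

The worst-case wait is attempts times delay_seconds at this one layer. If every layer of a call stack wraps its callee this way, those waits multiply, which is the trap described in the second bullet.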
Notify
One of the dangers of highly crash-tolerant software and hardware is that crashes can go unnoticed. When an InterBase database server crashes, it restarts so quickly that clients quite commonly don't notice that the server has crashed at all; they simply see that one query has failed. But the fact that the server has crashed indicates that an application or server bug needs to be fixed. Make sure that there is a reporting process by which serious failures can be communicated to the appropriate person. Note that the appropriate person is often the developer rather than the user.
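In a Python program, one simple reporting hook is sys.excepthook, which the interpreter calls for any uncaught exception just before the process exits. The sketch below logs the full traceback at CRITICAL level; install_crash_reporter is a hypothetical helper, and where the logger sends its records (a file, email, an error tracker) is left as an assumption about your environment.

```python
import logging
import sys
import traceback


def install_crash_reporter(logger):
    """Log every uncaught exception before the process dies."""
    previous_hook = sys.excepthook

    def report(exc_type, exc_value, exc_traceback):
        details = "".join(
            traceback.format_exception(exc_type, exc_value, exc_traceback)
        )
        logger.critical("Uncaught exception, process exiting:\n%s", details)
        # Chain to the hook that was installed before, so the usual
        # traceback output is preserved.
        previous_hook(exc_type, exc_value, exc_traceback)

    sys.excepthook = report
    return report
```

The process still crashes, and the fast restart still happens; the difference is that a record of the crash reaches someone who can fix the underlying bug.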
Contract out redundancy
Redundancy can be expensive to implement and administer in a dedicated server for a small office. Shared hosting and cloud hosting might make more sense in this situation.
Practice server recovery from failure before it is needed
Making regular database or server backups is useless if the backups you produce cannot be restored. Anyone who does a lot of backup restores can give you a long list of reasons why they can fail. A backup plan without a tested restore plan is not a plan.
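A restore rehearsal can be automated so that it runs every time a backup is made. As an illustration only, the sketch below uses SQLite because it ships with Python; the table name commits and the row-count check are hypothetical, and a real rehearsal would use your own database's backup and restore tools and a more meaningful verification query.

```python
import os
import sqlite3
import tempfile


def back_up(source_path, backup_path):
    """Copy a SQLite database using the sqlite3 online backup API."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def verify_restore(backup_path, expected_rows):
    """Restore the backup into a scratch database and check its contents."""
    scratch_path = os.path.join(tempfile.mkdtemp(), "restored.db")
    back_up(backup_path, scratch_path)  # the "restore": copy backup to a fresh file
    restored = sqlite3.connect(scratch_path)
    try:
        integrity = restored.execute("PRAGMA integrity_check").fetchone()[0]
        # 'commits' is a hypothetical table standing in for your real data.
        rows = restored.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
    finally:
        restored.close()
    return integrity == "ok" and rows == expected_rows
```

The check never touches the live database; it proves only that the backup file by itself is enough to bring the data back.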
There's more

One of the conceits of software architecture is that we like to design around the presumption that things mostly tend to work. This probably implies that we don't talk enough to users, who are often happy to inform us of just how often things fail. Although software architecture is a new field, it is well established in comparison to the architecture of software failure. We have a lot to learn yet. Please share your experiences and techniques for handling failure.

Craig Stuntz is Senior Software Developer and Architect for Vertex Systems, Inc. He is a member of TeamB and the ACM. Read his blog at blogs.teamb.com/craigstuntz.