Preparing for failure: How to design distributed applications that fail well

Distributed applications give companies the flexibility to hook up for data exchange, but the downside is a growing number of potential points of failure. Build your applications so they can identify incidents, notify users about them, and recover from them.

The upside of distributed applications is that the enabling technologies are so flexible that just about any company can hook up with just about any other company and create whatever exchange of data they both find useful. The downside is that such applications have a far greater number of opportunities to fail.

It’s difficult enough to design a robust application for an ERP environment, wherein multiple data sources are used to create real-time records on the spot for the user. It’s more difficult still to extend that application across business units and beyond the company’s boundaries. Designing and implementing an application that pleases users at home and abroad, especially if it’s a real-time application, is quite an accomplishment. But if you haven’t covered the myriad possibilities for failure, you’ve done a halfway job.

The temptation to build a crooked house
Does anybody remember Tinkertoys? They included a wonderful set of wooden construction components with plastic joints and connectors, out of which a kid could build just about anything. So flexible were these components that you could conceivably keep adding on to your creation until it collapsed under its own weight.

J2EE, Web services, .NET, and all their cousins present a similar problem. It is now possible to create a platform-independent application, distribute it to all your company’s business partners, send it sniffing for multiple sources of data, and have it give every user something different, yet function perfectly. It is also now possible to create such an application for little more than a single-user, single-database application would have cost five years ago.

The price we pay for such miracles is that such an application, like a Tinkertoy machine, has a great many points where a break can occur. This creates a whole series of problems, each worse than it would be in a more conventional application:
  • The failure point is often hard to find.
  • There is seldom any single person in-house who has expertise covering all the possible failure points.
  • It's possible that the failure point is beyond your department’s reach.
  • All the failure points probably have different failure modes, not necessarily affecting the same user groups.
  • The failure may also cut off the communications route by which an error message would travel to let the user know what went wrong.

The result is quite a mess, and it is not an unlikely scenario. What can you do? Several things, but you must think them through ahead of time, and it’s a good idea to game them out with other concerned parties (especially the user).

Identify all the failure points in the design phase
Remember the old television show Get Smart!, with Don Adams as the secret agent who passes through a long series of doors in order to get into the phone booth that takes him to his agency’s secret headquarters? In a distributed app—particularly one that lives on the Internet—data is passing through just such a long series of doors. If any one door fails to open, the data doesn’t arrive and the app fails. So you must identify every door and list it as a possible obstacle.

This is not easy to do when dealing with unfamiliar data transport technologies and when relying on outside service providers. Apply extra diligence here, and inquire deeply about the points at which a downed server or some similar occurrence could interrupt your application. If you are relying on software provided by a vendor, over which you have no control, apply similar diligence and learn its weaknesses. List these as possible failure points as well. Your failure trail should include every last point at which application data makes an appearance, from its point of origin in its home database to the user’s eyes.
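A failure trail of this kind can be written down as data rather than kept in someone's head. The following is a minimal sketch, assuming a made-up four-stop route; the point names, owners, and the FailurePoint type are all hypothetical and would be replaced by whatever your own design review turns up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    name: str       # hypothetical label for one "door" the data passes through
    owner: str      # hypothetical contact responsible for this point
    in_house: bool  # False when the point is beyond your department's reach


# Illustrative trail, ordered from the home database to the user's eyes.
FAILURE_TRAIL = [
    FailurePoint("inventory-db", "dba-team", True),
    FailurePoint("app-server", "dev-team", True),
    FailurePoint("partner-web-service", "vendor-support", False),
    FailurePoint("user-browser", "help-desk", True),
]


def external_points(trail):
    """Return the points that lie outside your own team's control."""
    return [point for point in trail if not point.in_house]


def owner_of(trail, name):
    """Look up who to call for a named point, or None if it is not listed."""
    for point in trail:
        if point.name == name:
            return point.owner
    return None
```

Keeping the list ordered by the route the data takes makes it easy to reuse later, when deciding where to log and whom to notify.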

Establish checkpoints for later audit
Now take that failure trail and identify all the points at which the successful passage of data can be logged. What you record for later audit is up to you and should be determined by the nature of the communication, the complexity of the routes the data travels (and to which user or group), and the seriousness of a failure at any particular point. The log entry can range from a simple local-server text message in a temp file that is erased upon successful application termination to a compiled log submitted to the end user.
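One way to picture such checkpoints is a small in-memory log that records each point the data passed successfully. This is only a sketch: the CheckpointLog class, the request ID, and the checkpoint names are hypothetical, and a real application would decide for itself where the entries are stored and how long they are kept.

```python
import time


class CheckpointLog:
    """Records, in order, each checkpoint that one request passed successfully."""

    def __init__(self, request_id):
        self.request_id = request_id
        self.entries = []  # list of (timestamp, checkpoint_name) tuples

    def passed(self, checkpoint):
        """Note that the data made it through the named checkpoint."""
        self.entries.append((time.time(), checkpoint))

    def last_passed(self):
        """Name of the most recent checkpoint reached, or None if none were."""
        return self.entries[-1][1] if self.entries else None

    def missing(self, expected_trail):
        """Checkpoints on the expected route that were never reached."""
        seen = {name for _, name in self.entries}
        return [name for name in expected_trail if name not in seen]


# Hypothetical route and a request that stalled halfway along it.
ROUTE = ["inventory-db", "app-server", "partner-web-service", "user-browser"]
log = CheckpointLog("req-001")
log.passed("inventory-db")
log.passed("app-server")
```

For the stalled request above, log.last_passed() gives the last door that opened and log.missing(ROUTE) lists the doors that never did, which is the information an auditor needs first.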

Creating this trail is critical for a distributed application if there are multiple hand-offs of application data. It is all the more important because distributed applications today typically yield user-specific results and are (potentially) accessible at various points down the line from the source database.

Remember that in a distributed application, particularly one spanning the Internet, the distribution of data is often uneven. Different user groups, particularly those outside your company, can have very different requirements and may be culling customized subsets of data from a common source. Moreover, some users may be interested not in the available data, but in the fact of its use. For example, a consulting marketing partner may be interested only in your company’s Web site traffic and not so concerned with the data being passed. These users, and the data they are gathering, are part of your application too, even if they are thought of as peripheral. Include these points as well in constructing your audit trail.

Build in notification of success
If your distributed app has multiple data sources, you should consider several important points. First, let your end user know the sources of the information accessed by the app, whether in real time or by on-demand log. This strategy is important because in a distributed app with multiple sources of data, the application can partially fail. Your user could be a party outside your company accessing your inventory databases in several different physical locations in real time, receiving aggregate numbers summarized from each. It may be that four out of five of these databases successfully give up the requested numbers, but that the fifth does not. It must be clear to the user that the information being presented is incomplete. This doesn’t happen without patient forethought; figure out what it will take to let the user know what’s going on, and do the extra work to put it in place.

The partial failure (or partial success) of an application’s use is called graceful degradation, and it’s a powerful alternative to straight-up success or failure. In many applications, some information is better than none at all, and often its usefulness is a question of degree. However, if you go this route, it’s more important than ever that the partial failure be reported in detail, that the user be forewarned that the data is incomplete, and that its sources be known (or, alternatively, that any sources not successfully accessed be specified).
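The five-database inventory scenario can be sketched in a few lines. The example below is an illustration under stated assumptions, not a prescribed design: the warehouse names, the stock counts, and the aggregate_inventory function are invented, and each data source is stood in for by a plain Python callable.

```python
def aggregate_inventory(sources):
    """Sum stock counts from every source that answers and report the rest.

    `sources` maps a source name to a zero-argument callable returning an int.
    """
    total = 0
    succeeded, failed = [], []
    for name, fetch in sources.items():
        try:
            total += fetch()
            succeeded.append(name)
        except Exception as exc:  # broad on purpose: one bad source must not abort the rest
            failed.append((name, str(exc)))
    return {
        "total": total,
        "complete": not failed,    # False tells the user the number is partial
        "sources_used": succeeded,
        "sources_failed": failed,  # which sources were missed, and why
    }


def unreachable():
    """Stand-in for a database that cannot be contacted."""
    raise ConnectionError("warehouse-5 did not respond")


result = aggregate_inventory({
    "warehouse-1": lambda: 120,
    "warehouse-2": lambda: 80,
    "warehouse-3": lambda: 45,
    "warehouse-4": lambda: 200,
    "warehouse-5": unreachable,
})
```

The caller still receives a total from the four sources that answered, but the "complete" flag and the "sources_failed" list travel with it, so the presentation layer has what it needs to mark the figure as incomplete.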

Have failure modes established for all potential failure points
When things go wrong, make certain the user knows it and that the notification is as detailed as possible. Whether that user is requesting data or sending it elsewhere, if the intended communication doesn’t take place, the user is at the top of the notification list. (Be certain that you’ve chosen an application development environment that provides reliable and flexible error handling; Java, with its structured exception handling, is one strong option.)

The notification of a failure to the user should offer remedial detail. If possible, the error message should say what went wrong and who to tell about it. When this isn't possible, the user (or, on the other side, a system administrator) should be notified of the last point the requested data reached unimpeded, so that the location of the failure point can at least be narrowed down.
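The two levels of notification described here, full remedial detail when it is available and the last good location when it is not, might look something like the following. The FailureNotice type, its field names, and the sample messages are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureNotice:
    what_failed: Optional[str]           # the failure point, if it is known
    contact: Optional[str]               # who to tell about it, if known
    last_good_checkpoint: Optional[str]  # last point the data reached unimpeded

    def message(self):
        """Build the most detailed message the available facts allow."""
        if self.what_failed and self.contact:
            return (f"Request failed at {self.what_failed}. "
                    f"Please contact {self.contact}.")
        if self.last_good_checkpoint:
            return ("Request failed at an unknown point. "
                    f"Data last arrived intact at {self.last_good_checkpoint}.")
        return "Request failed before reaching any checkpoint."


# Hypothetical cases: a diagnosed failure and an undiagnosed one.
diagnosed = FailureNotice("partner-web-service", "vendor-support", "app-server")
undiagnosed = FailureNotice(None, None, "app-server")
```

Falling back to the last good checkpoint keeps the message useful even when nobody yet knows what broke.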

There’s one additional step you can take. If the failure occurs beyond your company’s borders, or beyond your team’s (or your maintenance group’s) domain, then you should have communications in place to the caretakers of the Web service or server where the failure occurred. Agree ahead of time, when the service is contracted, on a failure protocol that will result in a rapid diagnosis and repair (indeed, you should make this a selection criterion), and that will instruct the service to send an agreed-upon notification that meets your needs. If you’ve designed well for failure, an external user should receive a failure notification in lieu of data when something goes awry.
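An agreed-upon failure notification is easier to act on if both sides settle its shape in advance. The sketch below assumes, purely for illustration, that the partner returns JSON with a "status" field plus either "data" or a "failure" object; that envelope, the field names, and the error codes are inventions for this example and not any real service's format.

```python
import json


def read_partner_response(body):
    """Return (data, failure) from a partner response; exactly one is None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # The service broke protocol entirely, e.g. by returning an HTML error page.
        return None, {"code": "UNREADABLE", "detail": "response was not valid JSON"}
    if not isinstance(payload, dict):
        return None, {"code": "UNREADABLE", "detail": "response was not a JSON object"}
    if payload.get("status") == "ok" and "data" in payload:
        return payload["data"], None
    # Anything else is treated as a failure, with or without detail from the partner.
    failure = payload.get("failure") or {
        "code": "UNSPECIFIED",
        "detail": "no failure detail supplied",
    }
    return None, failure


ok_data, ok_failure = read_partner_response('{"status": "ok", "data": {"count": 3}}')
bad_data, bad_failure = read_partner_response(
    '{"status": "failed", "failure": {"code": "DB_DOWN", "contact": "vendor-support"}}'
)
```

Because the function never returns data and a failure together, the calling code is forced to handle the failure case explicitly rather than display an empty result as if it were real.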

How do you justify this extra effort? It takes maybe two minutes: Imagine your distributed application with all these things missing. Next, imagine any of the breakdowns on your failure list occurring. Now imagine how much time it will take to track down the failure, and imagine you’re the user, wondering what’s going on. Finally, imagine the phone call you’re going to get.

About Scott Robinson

Scott Robinson is a 20-year IT veteran with extensive experience in business intelligence and systems integration. An enterprise architect with a background in social psychology, he frequently consults and lectures on analytics, business intelligence...
