Application recovery undoubtedly sits high in your company’s IT disaster recovery plan. If you thrive day to day on incoming orders, then one of the first things you want to do after a major crash is to get order entry back up and running. This is harder than it used to be, since that system (probably like most of your in-house systems) is more tightly integrated with other systems than it originally was.
Now it’s tougher still. In this age of extended and distributed systems, your critical applications could be part of a larger macrosystem composed of your company and other companies that operate in partnership with you. Where information once flowed between you and partner companies through telephones and fax machines, there may now be a direct flow of data binding you all into one big happy system. When your systems crash, what happens to these big super apps?
Are you running intercompany applications?
Do you have applications that require a higher level of disaster planning? Here’s a quick set of qualifiers:
- Does the application require storage, retrieval, and/or processing by both companies in order to be up and running?
- Would the application crash (functionally or otherwise) if either side abruptly ceased operation?
- Does either side of the application require automated information feedback of any kind from the other side, without which it could not function?
- Does the application perform a function to satisfy external customers or users that cannot be fulfilled without integrated execution on both sides?
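The qualifiers above reduce to a simple decision rule: a single "yes" means the application needs a higher level of disaster planning. As a minimal sketch (all names here are hypothetical, chosen to mirror the four questions):

```python
# Hypothetical sketch: evaluating whether an intercompany application
# needs joint disaster planning, per the four qualifiers above.
from dataclasses import dataclass

@dataclass
class IntercompanyApp:
    needs_both_sides_to_run: bool       # storage/retrieval/processing on both sides
    fails_if_either_side_stops: bool    # crashes if either side abruptly ceases
    requires_automated_feedback: bool   # can't function without the other side's feedback
    serves_external_users_jointly: bool # customer-facing function needs both sides

def needs_joint_recovery_plan(app: IntercompanyApp) -> bool:
    """Any single 'yes' answer qualifies the app for joint planning."""
    return any((
        app.needs_both_sides_to_run,
        app.fails_if_either_side_stops,
        app.requires_automated_feedback,
        app.serves_external_users_jointly,
    ))

order_hub = IntercompanyApp(True, True, True, True)
print(needs_joint_recovery_plan(order_hub))  # True
```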
Note that if the application's sole shared component is a conduit for data transfer, it may or may not be a shared application. If partner companies feed data directly into an application your company hosts, and the data goes no further, it isn't a true shared application. The test is simple: if half a dozen partner companies feed data into an app you're hosting, a crash at any one of those companies won't crash the app. But in some data pipeline scenarios, the participating companies do form a super app, and a single crash can bring everything to a screeching halt.
For example, as Figure A illustrates, many point-of-sale retailers send orders to a single broker, who then rewrites the orders and submits them to a group of manufacturers. Everyone’s database is in sync with everyone else’s, and the whole system is a smoothly humming order fulfillment system. Note that if any one manufacturer or retailer has a catastrophic crash, the system as a whole marches on. But if the broker’s system crashes, everything comes apart.
The distinction, then, is simple: If the crash of one company’s system in a multicompany application disables only that company’s participation in the application, the measures put forth below are purely discretionary. But if a crash at a particular point can take down the entire app for everyone, a collective application recovery procedure is called for.
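One way to apply this distinction is to model the participating companies as a graph of data flows and ask whether removing any single node disconnects the rest. A minimal sketch, with a topology assumed to mirror the broker example of Figure A (company names hypothetical):

```python
# Hypothetical sketch: does losing any one participant split the
# intercompany app? Model data flows as an undirected graph and
# check connectivity with each node removed in turn.

def connected(nodes, edges, removed):
    """Depth-first search over the graph with `removed` taken out."""
    remaining = [n for n in nodes if n != removed]
    if not remaining:
        return True
    adjacency = {n: set() for n in remaining}
    for a, b in edges:
        if a != removed and b != removed:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, stack = set(), [remaining[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacency[n] - seen)
    return len(seen) == len(remaining)

def single_points_of_failure(nodes, edges):
    """Nodes whose loss disconnects the surviving participants."""
    return [n for n in nodes if not connected(nodes, edges, n)]

# Figure A topology: retailers and manufacturers all talk through one broker.
nodes = ["retailer1", "retailer2", "broker", "mfr1", "mfr2"]
edges = [("retailer1", "broker"), ("retailer2", "broker"),
         ("broker", "mfr1"), ("broker", "mfr2")]
print(single_points_of_failure(nodes, edges))  # ['broker']
```

A crash at any retailer or manufacturer leaves the rest of the graph connected; only the broker's loss splits it, so only the broker's site demands a collective recovery procedure.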
First and foremost, sound the alarm
If a major application crashes within your company's walls, everyone's going to know about it within minutes. Any user attempting to run the application will get a real-time message saying it's down; if the underlying system itself has crashed, that too will be immediately apparent.
However, external users may or may not be interfacing directly with your in-house databases, and they’re probably getting into your system by way of their own. It’s your system that has crashed, not theirs. So it may be that your super app is not informing every participant in the chain when there’s a problem.
To illustrate why this matters, we can look at an ancestor of distributed applications that persists in the real world today: electronic data interchange (EDI). When purchase orders first started flowing electronically, the receiver of an electronic P.O. would send a generic functional acknowledgment back to the P.O. sender: "I got your message at date/time." This, however, turned out not to be good enough. If there was a problem of any kind with the receipt of the P.O., its contents, or its ultimate fulfillment, the sender didn't learn of it until the order shipped or the invoice arrived. This delay in spotting errors became costly. So a new EDI document, the Purchase Order Acknowledgment, was widely implemented.
Far more detailed than a simple functional acknowledgment, it said: "I got your P.O. at date/time, and I note that it includes 12 widgets and 24 whatsits, to be shipped to Locations A, B, and C." Ideally, this P.O. Acknowledgment was not a simple echoing of the original purchase order. Instead, it was generated from the supplier's database after the original purchase order was posted to the supplier's own system, ensuring that the order was in fact processed and flagging any problems or shortages up front. Costly delays were eliminated.
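The distinction between echoing an order and acknowledging it from the database can be sketched as follows (inventory figures and field names are hypothetical, echoing the widgets-and-whatsits example):

```python
# Hypothetical sketch: a P.O. Acknowledgment generated from the
# supplier's database after the order is posted -- not an echo of
# the inbound purchase order -- so shortages are flagged up front.

inventory = {"widget": 12, "whatsit": 10}  # the supplier's own records

def acknowledge(po_number, lines):
    """Post each line against inventory and report what will actually ship."""
    ack = {"po": po_number, "lines": [], "problems": []}
    for item, qty in lines:
        available = inventory.get(item, 0)
        shipping = min(qty, available)
        ack["lines"].append((item, shipping))
        if shipping < qty:
            ack["problems"].append(f"{item}: short {qty - shipping}")
    return ack

ack = acknowledge("PO-1001", [("widget", 12), ("whatsit", 24)])
print(ack["problems"])  # ['whatsit: short 14']
```

An echo would have confirmed 24 whatsits regardless; the database-driven acknowledgment surfaces the 14-unit shortfall before the invoice ever arrives.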
The point: Any distributed system across companies should have such database synchronization built into it! Whatever EDI does wrong, it has tried with all its might to do database synchronization right. If you’ve entered into a partnership with other companies in an extended super app, you must build in tag-up mechanisms to ensure that the passing of business objects from one system to another is acknowledged in real time! This is not only a sound business practice, but it will guarantee that a system crash at any point in the chain will be instantly flagged.
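A tag-up mechanism of this kind can be as simple as tracking every handoff until it is acknowledged, and raising an alarm when an acknowledgment is overdue. A minimal sketch (the 30-second deadline is an assumed service level, not from the article):

```python
# Hypothetical sketch: every business object handed to a partner must
# be acknowledged within a deadline, or an alarm fires -- so a crash
# anywhere in the chain is flagged at once rather than discovered later.
import time

ACK_DEADLINE_SECONDS = 30  # assumed service level

class HandoffTracker:
    def __init__(self):
        self.pending = {}  # object_id -> time sent

    def sent(self, object_id, now=None):
        self.pending[object_id] = now if now is not None else time.time()

    def acknowledged(self, object_id):
        self.pending.pop(object_id, None)

    def overdue(self, now=None):
        """Handoffs past the deadline: sound the alarm for these."""
        now = now if now is not None else time.time()
        return [oid for oid, t in self.pending.items()
                if now - t > ACK_DEADLINE_SECONDS]

tracker = HandoffTracker()
tracker.sent("order-42", now=0.0)
tracker.sent("order-43", now=0.0)
tracker.acknowledged("order-43")
print(tracker.overdue(now=60.0))  # ['order-42']
```

If a partner's system goes dark, its acknowledgments stop, and the overdue list makes the outage visible to the sender within one deadline window.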
Now for the tough part: Shared redundancy
There’s a final step you can take, which on the surface will seem like heresy but perhaps not so much if you’re already doing extended intercompany apps. Your extended app may be one that absolutely needs to stay up. When disaster takes it down, it stays down only a matter of minutes because the server it’s on has a twin somewhere ready to take over. The problem is, it’s enough of a challenge to set up your in-house network so that this auxiliary server is available to all of your users if the main one goes down. When your entire system crumbles, how are users on the outside going to get to it?
Here’s the very radical solution: Share the redundancy. If you’re part of an extended intercompany app, consider allowing your partner companies to host the mirror server(s) that cover your part of the app, and consider doing the same for them (see Figure B for an example).
Why do this? First, because the application can recover almost instantly even if your entire in-house network crumbles; second, because recovery of the application and all the interim transactions is nearly effortless once your side is back up; and third, because bottlenecks won't develop as interim transaction data piles up somewhere waiting to be processed. A standby server that keeps processing through the crash ensures continuity.
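From a participant's point of view, shared redundancy looks like a priority-ordered list of destinations: try the in-house server first, then the mirror a partner hosts. A minimal sketch of that failover logic (server names and the simulated outage are hypothetical):

```python
# Hypothetical sketch: shared redundancy -- if the in-house server is
# unreachable, route transactions to the mirror a partner company hosts.

def submit(transaction, servers):
    """Try each server in priority order; return the name of the one
    that accepted the transaction."""
    for name, send in servers:
        try:
            send(transaction)
            return name
        except ConnectionError:
            continue  # this server is down; fall through to the next mirror
    raise RuntimeError("no server available; transactions will queue")

def in_house(tx):
    raise ConnectionError("in-house network is down")  # simulated crash

def partner_mirror(tx):
    pass  # partner-hosted twin accepts the transaction

servers = [("in-house", in_house), ("partner-mirror", partner_mirror)]
print(submit({"order": 99}, servers))  # partner-mirror
```

In practice the `send` callables would wrap real network clients, but the design point survives the simplification: the partner's mirror keeps interim transactions flowing until your own server rejoins the list.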
But, you ask, if it's possible to hand off processing to a partner company in the first place, why host part of the app yourself at all? Two reasons, and both matter to your users. First, the whole point of such extended applications is to make information immediately available to individuals throughout the chain; where the processing occurs may be irrelevant at any particular point. Second, where it is important that processing occur on-site, it's almost always because human beings must review, amend, or otherwise interface with the data being compiled in the course of processing or reporting. That is, value is ultimately added to the data by your in-house people. Once the app is back up and running in your own basement, it will be where it should be. Handing it off for a few hours doesn't change that.
The scenarios in which these principles can be applied are numerous and extremely varied, so we can’t illuminate them all. Consider the unique features of your own extended apps, and decide for yourself what measures you can take to achieve rapid recovery.