May 05, 2006
Distributed Computing Fallacies Explained: "The Network Is Reliable"
The first fallacy is "The network is reliable." Why is this a fallacy? Well, when was the last time you saw a switch fail? After all, even basic switches these days have an MTBF (Mean Time Between Failures) of 50,000 operating hours or more.
If your application is a mission-critical, 24x7, 365-days-a-year kind of application, you may well hit that failure, and Murphy will make sure it happens at the most inappropriate moment. Nevertheless, most applications are not like that. So what's the problem?
Well, there are plenty of problems: power failures, someone tripping on the network cord, clients that all of a sudden connect wirelessly, and so on. And if hardware isn't enough, the software can fail as well, and it does.
The situation is more complicated if you collaborate with an external partner, such as an e-commerce application working with an external credit-card processing service. Their side of the connection is not under your direct control. Lastly, there are security threats such as DDoS attacks.
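To put that MTBF figure in perspective, here is a rough back-of-the-envelope sketch. It assumes a constant failure rate (an exponential failure model) and independent devices, which is my simplification and not something a vendor datasheet promises; the function name and the device counts are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # a switch that is never powered down


def failure_probability(mtbf_hours, operating_hours, device_count=1):
    """Chance of at least one failure among device_count devices.

    Assumption: constant failure rate and independent devices.
    """
    expected_failures = device_count * operating_hours / mtbf_hours
    return 1.0 - math.exp(-expected_failures)


# One switch, and a hypothetical path that depends on ten of them.
one_switch = failure_probability(50_000, HOURS_PER_YEAR)
ten_switches = failure_probability(50_000, HOURS_PER_YEAR, device_count=10)
print(f"one switch: {one_switch:.0%}, ten switches: {ten_switches:.0%}")
```

Under these assumptions a single switch has roughly a 16% chance of failing within a year of continuous operation, and a setup that depends on ten of them is more likely than not to see at least one failure.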
What does that mean for your design?
On the infrastructure side, you need to think about hardware and software redundancy and weigh the risks of failure versus the required investment.
On the software side, you need to think about messages/calls getting lost whenever you send a message/make a call over the wire. For one, you can use a communication medium that supplies full reliable messaging; WebSphere MQ or MSMQ, for example. If you can't use one, prepare to retry, acknowledge important messages, identify/ignore duplicates (or use idempotent messages), reorder messages (or not depend on message order), verify message integrity, and so on.
One note regarding WS-ReliableMessaging: the specification supports several levels of delivery guarantee: at most once, at least once, exactly once, and in order. Remember, though, that it only takes care of delivering the message as long as the end nodes are up and running. It doesn’t handle persistence, so for a complete solution you still need to take care of that yourself (or use a vendor solution that does it for you).
To sum up, the network is unreliable, and we as software architects/designers need to address that.
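A few of these measures (retry, acknowledgment, duplicate detection, and an integrity check) can be sketched in a handful of lines. This is only an illustration under stated assumptions: `Receiver`, `FlakyChannel`, and `send_reliably` are hypothetical names, the "network" is an in-memory stand-in that fails on a script, and message ordering is left out entirely.

```python
import hashlib


class Receiver:
    """Acknowledges every intact delivery but processes each message ID once."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, envelope):
        digest = hashlib.sha256(envelope["body"].encode("utf-8")).hexdigest()
        if digest != envelope["checksum"]:
            return False  # corrupted in transit: no ack, so the sender retries
        if envelope["id"] not in self.seen_ids:
            self.seen_ids.add(envelope["id"])
            self.processed.append(envelope["body"])
        return True  # duplicates are acked too, so the sender can stop


class FlakyChannel:
    """Hypothetical stand-in for the wire; fails according to a script."""

    def __init__(self, receiver, outcomes):
        self.receiver = receiver
        self.outcomes = list(outcomes)  # "drop_request", "drop_ack" or "ok"

    def send(self, envelope):
        outcome = self.outcomes.pop(0) if self.outcomes else "ok"
        if outcome == "drop_request":
            raise TimeoutError("request lost before reaching the receiver")
        acked = self.receiver.handle(envelope)
        if outcome == "drop_ack":
            raise TimeoutError("receiver got the message, but the ack was lost")
        return acked


def send_reliably(channel, message_id, body, max_attempts=5):
    """Resend the same envelope until it is acked; return the attempts used."""
    checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
    envelope = {"id": message_id, "body": body, "checksum": checksum}
    for attempt in range(1, max_attempts + 1):
        try:
            if channel.send(envelope):
                return attempt
        except TimeoutError:
            pass  # a real sender would back off before the next attempt
    raise ConnectionError(f"no ack for {message_id} after {max_attempts} attempts")


receiver = Receiver()
channel = FlakyChannel(receiver, ["drop_request", "drop_ack", "ok"])
attempts = send_reliably(channel, "order-42", "charge card")
print(attempts, receiver.processed)
```

Note the lost-ack case: the receiver already processed the message, so the retry arrives as a duplicate. Because the receiver keys on the message ID, the card is charged once even though the message was delivered twice.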
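If you do end up handling persistence yourself, one common approach is to write each outgoing message to durable storage before sending it and to mark it done only once it has been acknowledged. The sketch below uses Python's built-in `sqlite3` module; the `Outbox` class and its table layout are my own illustration, not part of WS-ReliableMessaging or of any vendor product.

```python
import sqlite3


class Outbox:
    """Hypothetical durable outbox: messages survive a sender restart."""

    def __init__(self, path):
        # Use a file path in real use; ":memory:" is only good for experiments.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id TEXT PRIMARY KEY, body TEXT NOT NULL, "
            "acked INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def enqueue(self, message_id, body):
        # INSERT OR IGNORE makes enqueueing the same ID twice harmless.
        self.db.execute(
            "INSERT OR IGNORE INTO outbox (id, body) VALUES (?, ?)",
            (message_id, body),
        )
        self.db.commit()

    def pending(self):
        """Messages not yet acknowledged, oldest first."""
        rows = self.db.execute(
            "SELECT id, body FROM outbox WHERE acked = 0 ORDER BY rowid"
        )
        return rows.fetchall()

    def mark_acked(self, message_id):
        self.db.execute("UPDATE outbox SET acked = 1 WHERE id = ?", (message_id,))
        self.db.commit()


outbox = Outbox(":memory:")
outbox.enqueue("m1", "reserve stock")
outbox.enqueue("m2", "charge card")
outbox.mark_acked("m1")
print(outbox.pending())
```

After a crash, the sender reopens the same database file and resends whatever `pending()` returns. That can produce duplicates, so it only works if the receiving side ignores messages it has already seen.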
Posted by Arnon Rotem-Gal-Oz at 09:33 AM