Architecture Blog: Reliability
by Arnon Rotem-Gal-Oz
March 28, 2007

Reliability

Steve Jones started a series of posts on achieving five nines availability with SOA. I recommend you read it before continuing, but to summarize: Steve calculates the availability each service needs based on the number of interactions it has, and shows that even with reliable messaging the availability required is ludicrous.

Let's leave SOA out of it for a minute and think about system architecture in general. Steve talks about interactions and calculates the availability of each service. But look at the system level. If the architecture has a single component, the availability of that component is the availability of our system. When we have two components and the system needs both of them, the availability of the system is the product of their availabilities (assuming they fail independently), even if they don't interact with each other at all. When components depend on other components, the individual reliability of each dependent component gets worse, as Steve demonstrates in his post.
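To make the multiplication rule concrete, here is a minimal sketch. It assumes independent failures and a system that needs every component to be up; the function names are illustrative and do not come from Steve's posts.

```python
import math


def series_availability(availabilities):
    """Availability of a system that needs every listed component to be up.

    Assumes the components fail independently of one another.
    """
    return math.prod(availabilities)


def required_component_availability(target, n):
    """Availability each of n equally reliable components needs to hit the target."""
    return target ** (1.0 / n)


# Two components at "three nines" each.
two_components = series_availability([0.999, 0.999])

# What each of ten components must deliver for a five-nines system.
per_component = required_component_availability(0.99999, 10)
```

Two components at 99.9% each give a system availability of roughly 99.8%, and ten components would each need roughly six nines to deliver five nines overall, which is the kind of number Steve calls ludicrous.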

Well, this isn't exactly true.

First we need to distinguish between MTBF and MTBCF. MTBF is the Mean Time Between Failures. The more components we have in the system, the lower the MTBF will get: as explained above, the probability that some component in the system fails grows as we add more and more components. MTBCF, the Mean Time Between Critical Failures, counts only the failures that stop the system from doing its job, and it is the reason redundancy works. If we add a second instance of a component, the chance that at least one of them is available is better than the availability of either instance alone.
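The two effects can be put side by side in a short sketch. It assumes independent components with constant failure rates, which is a common textbook simplification rather than anything the posts state; the function names are illustrative.

```python
def series_mtbf(mtbfs):
    """MTBF of the whole system when a failure of any component counts.

    Assumes constant, independent failure rates, so the rates simply add up.
    """
    return 1.0 / sum(1.0 / m for m in mtbfs)


def redundant_availability(availability, instances):
    """Availability when at least one of several independent instances must be up."""
    return 1.0 - (1.0 - availability) ** instances


# Two components with an MTBF of 1000 hours each: something fails twice as often.
system_mtbf = series_mtbf([1000.0, 1000.0])

# Two redundant instances at 99% each: the service itself is rarely down.
pair_availability = redundant_availability(0.99, 2)
```

Under these assumptions the second instance halves the time between failures somewhere in the system (about 500 hours instead of 1000) while raising the availability of the service from 99% to about 99.99%. Operations sees more failures; users see fewer outages.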

The second issue has to do with interactions. If a component (let's call it "A") needs to interact with another component (let's call it "B") to complete a request it just got, then yes, the effect of B's availability is immediate and A's availability is hampered as well. If, for example, we reverse the communication and have B push information to A, which in turn caches previous results, the reliability of A is not directly affected. In this example what suffers is the freshness of B's data, and A can continue to operate for a while (the exact time depends on the type of data B sends).
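A minimal sketch of the reversed interaction might look like the following. The class name, the `max_staleness_seconds` parameter and the injectable clock are all hypothetical, chosen only to illustrate the idea of A answering from data that B pushed earlier.

```python
import time


class CachingConsumer:
    """Component A: answers requests from the last value B pushed to it."""

    def __init__(self, max_staleness_seconds, clock=time.monotonic):
        self._max_staleness = max_staleness_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._value = None
        self._received_at = None

    def on_push(self, value):
        """Called by B whenever it has new data."""
        self._value = value
        self._received_at = self._clock()

    def handle_request(self):
        """Answer from the cache; fail only when the data is too old to trust."""
        if self._received_at is None:
            raise RuntimeError("no data received from B yet")
        age = self._clock() - self._received_at
        if age > self._max_staleness:
            raise RuntimeError("cached data from B is too stale")
        return self._value, age
```

While B is down, `handle_request` keeps returning the cached value together with its age. A only starts failing once the age passes the limit, and that limit is where "the type of data B sends" enters the design.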

The three lessons from this discussion are:

  • We need to think about the granularity of our components (services or not). Adding more and more components might be good for flexibility, but it can be problematic for reliability, especially if these components need separate tiers.
  • We need to think about MTBCF and MTBF. System reliability comes from the first, but operations costs come from the second.
  • We need to think about the way components interact. Request/reply is the easiest but not always the best option. (Note that synchronous communication doesn't have to spell synchronous interaction; see, for example, my post on EDA vs. Synchronous Request/Reply.)

    Posted by Arnon Rotem-Gal-Oz at 06:47 AM




 