Stability Patterns: Use Timeouts

“A resilient system keeps processing transactions, even when there are transient impulses, persistent stresses, or component failure disrupting normal processing.”

This is Michael Nygard's definition of stability. In his book “Release It!” he describes design and architecture patterns that stop cracks from propagating and preserve at least partial functionality instead of letting the whole system crash.

So, what are the problems?

Why do distributed systems crash? Frequently because their design relies on one of these false assumptions, known as the fallacies of distributed computing:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

Networks are not reliable and latency is not zero. That's why a call can fail at every point where a subsystem is integrated. Without timeouts, a calling system that waits on a slow or unresponsive remote system slows down itself and becomes vulnerable to stability problems. Failures in remote systems propagate quickly this way and can turn into a cascading failure.

Failure Modes

Accept that failures will happen and design how your system reacts to specific failures. This is what Nygard calls failure modes. Failure modes contain the damage and protect the rest of the system; this kind of self-protection determines the resilience of the system as a whole. They work like crumple zones in a car: areas designed to protect the passengers by failing first. So think about your system. Which features are indispensable? Then build failure modes around them.
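As a rough sketch of what such a failure mode could look like in code: an optional feature is called with a time limit, and if it hangs or fails, the caller degrades to a fallback instead of failing completely. The class name, the "recommendations" feature and the fallback value are made up for illustration and are not taken from Nygard's book.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class RecommendationFailureMode {

    // Hypothetical fallback shown when the optional feature is unavailable.
    static final String FALLBACK = "bestsellers";

    // Calls the optional remote feature, but never waits longer than timeoutMs.
    static String recommendations(Callable<String> remoteCall, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Failure mode: degrade to the fallback instead of failing the whole request.
            return FALLBACK;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return FALLBACK;
        } finally {
            future.cancel(true);    // no effect if the call already finished
            executor.shutdownNow(); // release the worker thread
        }
    }

    public static void main(String[] args) {
        // A healthy call returns its own result.
        System.out.println(recommendations(() -> "personalized", 500));
        // A hanging call is cut off after 200 ms and replaced by the fallback.
        System.out.println(recommendations(() -> {
            Thread.sleep(5_000);
            return "personalized";
        }, 200));
    }
}
```

Creating an executor per call keeps the sketch short; a real service would more likely share one bounded thread pool.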

Using Timeouts

Using timeouts is one pattern you can use to defend your system against such failures. If you look at different Java specifications, e.g. JDBC, JMS or JAX-RS, you will find methods that take a timeout parameter. With Jersey you can easily set a timeout on the client side like this:

[code]
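import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import org.glassfish.jersey.client.ClientProperties;

// Jersey client on top of HttpURLConnection; both timeouts are given in milliseconds
Client restClient = ClientBuilder.newClient();
restClient.property(ClientProperties.CONNECT_TIMEOUT, 2000);
restClient.property(ClientProperties.READ_TIMEOUT, 2000);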
[/code]

Why are there two timeout properties? The Sockets API defines two types of timeouts. The connection timeout is the maximum time that may elapse before the connection is established or an error occurs. The socket (read) timeout is the maximum period of inactivity between two consecutive data packets arriving on the client side after a connection has been established.
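The two timeouts can also be seen directly on a plain java.net.Socket. The following is only a sketch: the class and method names are made up, and the "silent server" is a local ServerSocket that lets clients connect but never sends data, so the read timeout is the one that fires.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeouts {

    // Connects to host:port and tries to read a single byte.
    // Returns "data", "closed" or "read timeout", depending on what happens first.
    static String readOneByte(String host, int port, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Connection timeout: maximum time allowed for establishing the connection.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Socket (read) timeout: maximum silence tolerated while waiting for data.
            socket.setSoTimeout(readTimeoutMs);
            try {
                return socket.getInputStream().read() == -1 ? "closed" : "data";
            } catch (SocketTimeoutException e) {
                return "read timeout";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The listening socket completes the TCP handshake but never writes anything.
        try (ServerSocket silentServer = new ServerSocket(0)) {
            System.out.println(readOneByte("127.0.0.1", silentServer.getLocalPort(), 2000, 500));
        }
    }
}
```

Without the setSoTimeout call, the read would block for as long as the server stays silent.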

Okay, and how can I time out a call when a third-party library doesn't provide a method with a timeout parameter?

[code]
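import java.util.concurrent.*;

ExecutorService executorService = Executors.newSingleThreadExecutor();

// Run the blocking call on a separate thread so that the caller can stop waiting for it.
Future<String> task = executorService.submit(new Callable<String>() {
    @Override
    public String call() throws Exception {
        long heavyTaskInMs = 6000; // simulates a call that takes longer than the timeout
        Thread.sleep(heavyTaskInMs);
        return "Hello Timeout";
    }
});

try {
    long timeoutInSeconds = 5;
    // Waits at most five seconds, then throws a TimeoutException.
    System.out.println(task.get(timeoutInSeconds, TimeUnit.SECONDS));
} catch (TimeoutException e) {
    task.cancel(true); // interrupt the worker so that it does not keep running
    e.printStackTrace();
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
} finally {
    executorService.shutdown(); // otherwise the idle worker thread keeps the JVM alive
}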
[/code]

But wait! Don’t reinvent the wheel. You could use Guava’s SimpleTimeLimiter to solve that problem 😉
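If pulling in Guava is not an option, the JDK itself offers something similar since Java 9: CompletableFuture.orTimeout. The sketch below is one way to use it; slowCall is a hypothetical stand-in for a third-party method without a timeout parameter, and the class and method names are made up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OrTimeoutExample {

    // Hypothetical stand-in for a library call that offers no timeout parameter.
    static String slowCall(long durationMs) {
        try {
            Thread.sleep(durationMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "Hello Timeout";
    }

    // Returns the result, or "timed out" if it does not arrive within timeoutMs.
    static String callWithTimeout(long durationMs, long timeoutMs) {
        CompletableFuture<String> future = CompletableFuture
                .supplyAsync(() -> slowCall(durationMs))
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS); // available since Java 9
        try {
            return future.get();
        } catch (ExecutionException e) {
            // orTimeout completes the future exceptionally with a TimeoutException.
            if (e.getCause() instanceof TimeoutException) {
                return "timed out";
            }
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(50, 1000));  // fast enough: prints "Hello Timeout"
        System.out.println(callWithTimeout(1000, 200)); // too slow: prints "timed out"
    }
}
```

Note that orTimeout only stops the caller from waiting. The slow call itself is not interrupted and keeps running in the background until it finishes.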

Defend with Timeouts

Always use methods that take a timeout parameter. If no such method is provided, make sure by other means that your call will come back.

Examples on GitHub