Wednesday, February 25, 2009

On Parallel Programming with Processes

About once a week, a well-meaning person stops by my office to ask a question like this:

I need to run about 1000 simulations that take about an hour each. I can't wait a thousand hours for the results, so I need to parallelize my simulation. So, should I re-write my application using threads, MPI, or something else?

For some reason, they are always disappointed by my response:

Just run multiple copies of your program at once.

The reasoning is very simple. You already have a complete, debugged program. You have multiple processors, and your operating system knows how to use them. Running four processes at once on a four-CPU machine will give you four times the number of results in the same amount of time. Your work will be done in 250 hours instead of 1000. In fact, you can take the same sequential program and submit it to a large batch system that runs it on 100 different processors at once, completing one hundred simulations every hour. If you only get 99 hosts, that's ok, you will still get a 99x improvement.
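To make this concrete, here is a purely illustrative sketch of launching several copies of an unmodified sequential program at once and waiting for them all to finish. The program name and input files are hypothetical, and in practice you would more likely start the copies from a shell prompt or hand them to a batch system:

/* Illustrative only: run several copies of the same sequential program
   at once, one per CPU, each on a different input.  There are no threads,
   no messages, and no shared state to debug. */

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const int copies = 4;    /* one copy per CPU */

    for (int i = 0; i < copies; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            char arg[32];
            snprintf(arg, sizeof(arg), "input.%d", i);
            /* "simulate" stands in for your existing, debugged program. */
            execl("./simulate", "simulate", arg, (char *) 0);
            _exit(1);        /* only reached if exec fails */
        }
    }

    /* Wait for every copy to complete. */
    while (wait(NULL) > 0) { }

    return 0;
}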

The alternative is almost too awful to contemplate. Anyone who has written multithreaded or message passing programs knows that it sounds great on the chalkboard, but the reality is much more complicated. Debugging tools are ineffective on parallel programs. Many existing libraries are not thread safe. You have to deal with synchronization problems and an endless slew of tuning parameters. If you write a message passing program that requires eight hosts, then you must wait until you have exactly eight hosts available for your exclusive use. It is all too likely that you will spend more time debugging the program than actually running it.

The funny part is, many people do not like this advice. But... that's not... parallel! Or, if they concede it's parallel, it's merely embarrassingly parallel, or even worse, shamefully parallel. (As if using 100 CPUs simultaneously with processes were somehow less parallel than using 8 CPUs with threads.) They were hoping to do some really difficult, macho programming, but now find there is a simple solution to the problem.

Now, I'm over-simplifying a little bit. There are certainly cases where it makes sense to take an existing program and parallelize it to use multiple processors. There are a few good reasons for doing so. First, if you really need one particular result as soon as possible, then it makes sense to parallelize. For example, if you are predicting tomorrow's weather, you need the result before tomorrow. Second, if your sequential program has fully consumed another resource on the machine, then it may make sense to parallelize. For example, if your simulation uses all available memory on a machine, then you cannot run two copies at once on the same machine. Third, if one program will run for longer than you can realistically keep a computer up without rebooting, then it may make sense to parallelize. However, none of these cases are as common as you might think, and it's usually best to avoid threads or message passing until the necessity has been proven.

A middle ground that we have been exploring in my lab is the use of abstractions to represent large parallel programs. In this approach, the user provides a sequential program that performs the key kernel of computation in which they specialize. Many invocations of the kernel are then combined together to run very large parallel programs with parallelism measured in hundreds of CPUs. You can read more about the BXGrid, Wavefront, All-Pairs, and Classify abstractions.

Saturday, February 21, 2009

Exponential Backoff in Distributed Systems

In response to my previous article, a commenter asked:

Why exponential backoff? To put a finer point on the question, How should I choose the parameters for my exponential backoff algorithm? I think many people choose parameters that back off too much, too fast.

The idea of exponential backoff in distributed systems goes back quite a few years. An early example can be found in the Ethernet network. In its original form, an Ethernet consisted of a single cable connecting all stations on the network. Unlike some other computer networks at the time, it had no direct means of controlling which station could transmit at any time. If one station transmitted while everyone else was silent, then the message would be received by all stations. But, if two (or more) transmitted at once, every station would receive a corrupted message.

Here's an analogy. Imagine a school gymnasium with people lined up along the walls. People have to shout to be heard, and there are multiple conversations going on at once. As you probably know from experience, this can only work if one person speaks at a time. So, each person waits for a quiet moment to speak. Occasionally, two people try to speak simultaneously, and then you have a silly game of each waiting a bit and then trying again until the tie is broken.

That is essentially how Ethernet works. Each party that wants to transmit waits for a quiet moment, and then sends a message. The sender also simultaneously listens to see if it can hear its own message. If the message is corrupted, it means another party transmitted at the same time, so both wait a bit and try again.

The essential question is: How long should each station wait?

It does no good to have each party wait a fixed amount of time -- say, one microsecond -- because then each will try again at the same time, and the situation repeats forever. A better idea is to choose a random time -- say, between one and ten microseconds, which will break the tie in a small number of attempts. However, if many parties are trying to talk at once, the result will still be a chaotic mess of messages, with no-one making any progress.

A more robust solution is for each party to use exponentially increasing delays. For example, delay one microsecond plus a random factor the first time, then two, then four, and so on. This solution works regardless of the number of competing parties, because it tends to thin the traffic out over time until the congestion is eased.

I wrote a paper titled The Ethernet Approach to Grid Computing on this topic a few years back, making the observation that this strategy is needed everywhere in distributed systems. Whenever you talk to a file server, a batch system, a print server, or file your taxes online, failures are possible, so you need to use Ethernet-like strategies. To encourage this, I wrote a simple language called the Fault Tolerant Shell which looks a lot like a conventional shell with exceptions. For example, here is how to reliably submit a Condor job:

try for 1 hour
condor_submit job.file

Or, if you have a choice of three different places to fetch a file from:

forany server in X, Y, Z
wget http://$server/myfile

Internally, the shell takes care of all of the error detection, retries, and so forth, so that the programmer can concentrate on the essential issues. The end result is that the system becomes much more robust to load bursts. For example, the following graph shows the performance of many clients submitting batch jobs to a queue using three methods: the Ethernet approach, the Aloha approach (an intermediate step), and a simple fixed retry:

As you can see, the fixed approach crashes to zero after about 400 clients, whereas the Ethernet approach continues to maintain a high level of throughput. It is not as high as the performance under low load, but it is relatively stable over a wide range of load.

The disadvantage of using exponential backoff is that it extends the time to recovery after a failure by about a factor of two. Suppose that you are a client talking to a web server which crashes. You wait one second, try again, then wait two seconds, and so on. If the web server is unavailable for thirty seconds and then recovers, the client will not notice right away, because it will be in the middle of waiting thirty seconds before trying again. Now, extending a thirty second outage to a sixty second outage is unlikely to cause any real heartache. But what about extending a thirty-minute outage to sixty minutes? That could be irate customer territory.

So, you need to balance the needs of your customers against the capacity of your system. If you want to handle 1000 clients and have a maximum recovery-after-failure time of one second, then you had better make sure that your system can handle 1000 failed requests per second at a sustained rate. That may sound easy, but if each failed request involves a database query, a write to a log file, and an email to an administrator, then you will be quickly overwhelmed.

Now let's answer the original question: How should I pick the backoff parameters?

Let's assume that the delay chosen at any point is based on an initial timeout (T), an exponential factor (F), the number of retries so far (N), a random number (R), and a maximum timeout (M). Then:

delay = MIN( R * T * F ^ N , M )

  • R should be a random number in the range [1, 2], so that it spreads the load out over time; because it is never less than one, the delay is never shorter than plain exponential backoff.
  • T is the initial timeout, and should be set at the outer limit of expected response times for the service. For example, if your service responds in 1ms on average but within 10ms for 99% of requests, then set T=10ms.
  • F doesn't matter much, so choose 2 as a nice round number. (It's the exponential nature that counts.)
  • M should be as low as possible to keep your customers happy, but high enough that the system can definitely handle requests from all clients at that sustained rate.
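To see how the pieces fit together, here is a small sketch of the delay calculation wrapped in a retry loop. The function names and the placeholder operation are hypothetical, not taken from any particular system:

/* A minimal sketch of the backoff calculation described above. */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Compute the delay (in seconds) before retry number n. */
double backoff_delay(double T, double F, int n, double M)
{
    double R = 1.0 + (double) rand() / RAND_MAX;   /* random factor in [1,2] */
    double delay = R * T;
    for (int i = 0; i < n; i++) delay *= F;        /* R * T * F^n */
    return delay < M ? delay : M;                  /* never exceed the cap M */
}

int main(void)
{
    double T = 1.0;    /* initial timeout: outer limit of normal response time */
    double F = 2.0;    /* exponential factor */
    double M = 300.0;  /* maximum delay the customers will tolerate */

    srand((unsigned) time(NULL));

    for (int attempt = 0; attempt < 10; attempt++) {
        /* if (try_the_operation()) break;    -- hypothetical operation */
        double d = backoff_delay(T, F, attempt, M);
        printf("attempt %d failed, sleeping %.1f seconds\n", attempt, d);
        sleep((unsigned int) d);
    }
    return 0;
}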

Sunday, February 8, 2009

Fail Fast, Fail Often

A common misconception among programmers is that software should always attempt to hide failures in distributed systems. This idea seems sensible at first, because distributed systems are full of failures of all kinds: machines crash, software fails, and networks go down, just to name a few. If I am writing a function called transfer_file() which copies a file from one place to another, then I should try to connect multiple times and restart failed transfers for about an hour before giving up, right?

It turns out that transparent fault tolerance is exactly the wrong approach in a large, layered software system. Instead, each layer of software should carefully define consistent failure conditions, and then feel free to fail as often as it wants to.

Here is why: If someone else builds an application that calls transfer_file(), the application itself knows a whole lot more about what kind of fault tolerance is needed. It may turn out that the application knows about several file servers, and if one cannot be reached immediately, then another will do just fine. On the other hand, perhaps transfer_file will be used in some batch workload that will run for weeks, so it is vital that the transfer be retried until success.

If you want to build a controllable system, then your building blocks must have very precise failure semantics. Unfortunately, many system calls have such vague semantics that they are nearly impossible to use correctly in the presence of failures. Consider, for example, the Unix system call connect(), which initiates a TCP connection to a remote host. Here are some possible results from connect():
  1. If the host does not respond to IP traffic, connect() will block for an undetermined amount of time configured by the kernel (anywhere from minutes to hours), and then return ETIMEDOUT.
  2. If a router or switch determines that the host is not routable, then in a few seconds connect() will return with the error EHOSTUNREACH.
  3. If the host is up, but there is no process listening on the port, then connect() will return almost immediately with ECONNREFUSED.

Depending on the precise nature of the failure, the call might return immediately, or it might return after a few hours. And, the distinction between these failure modes hardly matters to the user: in each case, the requested service is simply not available. Imagine trying to build an application that will quickly connect to the first available server, out of three. Yuck.

To get around this, all our software uses an intermediate layer that does a fair amount of work to place consistent failure semantics on system calls. For example, instead of using BSD sockets directly, we have a layer called link with operations like this:

  • link_connect( address, port, timeout );
  • link_accept( link, timeout );
  • link_read( link, buffer, length, timeout );

Inside each of these operations, the library carefully implements the desired failure semantics. If an operation fails quickly, then it is retried (with an exponential backoff) until the timeout has expired. If an operation fails slowly, then it is cleanly aborted when the timeout expires. With these in place, we can build higher level operations that rely on network communication without getting unexpectedly stuck.
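For the curious, here is a rough sketch of how a connect-with-timeout operation might be built directly on BSD sockets. It is not the actual implementation of the link library, just an illustration of the idea of collapsing every failure mode into one bounded, consistent outcome:

/* Sketch: connect to address:port, but never block longer than timeout
   seconds.  Returns a connected descriptor, or -1 on any failure. */

#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int connect_with_timeout(const char *address, int port, int timeout)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    /* Non-blocking mode, so connect() returns immediately. */
    fcntl(fd, F_SETFL, O_NONBLOCK);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, address, &addr.sin_addr) != 1) {
        close(fd);
        return -1;
    }

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0
        && errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    /* Wait for the socket to become writable, or for the timeout to pass.
       Unreachable, refused, and silent hosts all resolve within the bound. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { .tv_sec = timeout, .tv_usec = 0 };

    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0) {
        close(fd);
        return -1;    /* timed out or failed */
    }

    /* Confirm that the connection actually succeeded. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        close(fd);
        return -1;
    }

    return fd;
}

With a bound like this in place, connecting to the first available of several servers becomes a simple loop over addresses, rather than a guessing game about how long connect() might hang.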

Here is an example where precise failure detection really matters. In an earlier post, I wrote about the Wavefront abstraction, which is a distributed computing model with a lot of dependencies. In a Wavefront problem, we must first execute one process in the lower left hand corner. Once that is complete, we can run two adjacent functions, then three, and so on:

If we run a Wavefront on a system of hundreds of processors, then delays are inevitable. What's more, a delay in the computation of any one result slows down the whole system. To avoid this problem, we keep running statistics on the expected computation time of any node, and set timeouts appropriately. If any one computation falls more than a few standard deviations beyond the average, we abort it and try it on another processor. We call this technique "Fast Abort".
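Here is a small sketch of the running-statistics check behind Fast Abort. The structure and the threshold parameter are hypothetical; this is not the actual Wavefront code:

#include <math.h>

/* Running statistics over the runtimes of completed tasks. */
struct task_stats {
    int    count;
    double sum;      /* sum of completed task times */
    double sumsq;    /* sum of squared task times   */
};

/* Record the runtime of a task that completed normally. */
void stats_record(struct task_stats *s, double seconds)
{
    s->count++;
    s->sum   += seconds;
    s->sumsq += seconds * seconds;
}

/* Should a still-running task be aborted and sent to another processor?
   Yes, if it has already run more than k standard deviations past the mean. */
int should_fast_abort(const struct task_stats *s, double elapsed, double k)
{
    if (s->count < 2) return 0;    /* not enough history to judge */

    double mean   = s->sum / s->count;
    double var    = s->sumsq / s->count - mean * mean;
    double stddev = var > 0 ? sqrt(var) : 0;

    return elapsed > mean + k * stddev;
}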

Here is the effect of this technique on a large wavefront problem. The X axis shows time, and the Y axis shows the number of tasks currently running. The bottom line shows the reliable technique of waiting and retrying tasks until they succeed. The top line shows what happens with Fast Abort. As you can see, this technique much more rapidly reaches a high degree of parallelism.

The moral of the story is: Make failure conditions an explicit part of your interface. If you make it very clear how and when a call can fail, then it is very easy for applications to implement fault tolerance appropriate to the situation at hand.