In this blog, I'll discuss open problems and new developments in the field of distributed systems.
A distributed system is a set of independent computers that accomplish a meaningful task by working together. The field covers a huge range of interesting systems, from sensor networks to multi-tier web server farms. For the most part, I will be writing about the sort of large computing systems used to attack very large problems generated by big science and big business.
These systems go by many different names that mean very similar things:
- A cluster is a distributed system that consists of a number of identical machines owned by a single entity, usually stacked up in a closet or a machine room. Clusters became the most common form of high performance computing in the 1990s and are the type of system now dominating the Top 500 List of supercomputers.
- A grid is a distributed system that enables people to access computing resources from different institutions over the wide area. The term grid computing was coined by Ian Foster and Carl Kesselman in the late 1990s to describe easy access to large scale computational power. Examples of grids include TeraGrid and the Open Science Grid.
- A cloud is a distributed system where the user doesn't care exactly what resources are used to carry out an operation; this is virtualization in the most abstract sense. There exist commercial clouds such as Amazon EC2, as well proprietary clouds found in nearly any industrial scale web site.
Although people love to argue about these terms, the differences aren't terribly important. In fact, the terms often represent different facets of the same systems. When you submit a job to a cluster, you are treating it like a cloud, because you don't care on exactly which CPU the job runs. A grid is usually built up from multiple clusters and connected together by wide area networks and software. A cloud is usually contained in a single data center. If you need to access it remotely, then you need a grid.
If you want to really want to mix it up, read about Ed Walker's MyCluster, which is a system that uses a grid to allocate a personal cluster, which you can then use as a cloud!
Regardless of what we call these systems, the challenges are the same. How do we design the interactions between pieces so that the system is robust to failures, has acceptable performance, and protects the interests of all of the parties involved? Future posts in this column will focus on these fundamental problems.
So that you know where I am coming from, I am a professor at the University of Notre Dame, where I teach classes in distributed systems, operating systems, and compilers. I direct the Cooperative Computing Lab where a great group of students does the hard work to design and test new distributed systems. We develop solutions to problems in physics, chemistry, biometrics, and other fields, and them evaluate them on our shared testbed of 700 CPUs. I'll be writing more about these ideas in this column.