Tuesday, November 16, 2010

The Virtualization Theorem Ignored for Three Decades

Today, in my graduate operating systems class, we discussed what I believe is the most important result in computer science ever to be persistently ignored:

Popek and Goldberg, Formal Requirements for Virtualizable Third Generation Architectures, Communications of the ACM, Volume 17, Issue 7, July 1974.

This paper puts forth a very simple principle that a CPU must observe in order to be virtualizable, that is, capable of hosting a virtual machine. First, two definitions:
  • A sensitive instruction reads or modifies supervisor state.
  • A privileged instruction traps if attempted in user mode.
And this central theorem:
  • All sensitive instructions must be privileged.
Here is why this is important. A conventional operating system (OS) is in charge of the whole machine, and is free to modify the processor status, page tables, I/O devices, and other sensitive aspects of the machine in order to run normal processes.

But, if you take that OS and put it in a virtual machine (VM), it is no longer in charge of the whole machine. All of those actions on sensitive state must be translated in some way by the virtual machine monitor. The simplest way to accomplish that translation is to run the OS in user mode, allowing the VMM to execute sensitive operations on its behalf. To make sure that the VMM gets all of the sensitive operations, they must all be forced to trap.
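
To make the trap-and-emulate idea concrete, here is a toy model in Python. It is purely an illustration of the control flow described above, with made-up instruction names and classes, not a real machine: the guest runs de-privileged, every sensitive instruction traps, and the VMM emulates it against the guest's private copy of the machine state.

# Toy model of trap-and-emulate virtualization.
# Instruction names and classes are invented for illustration.

SENSITIVE = {"read_status", "write_status", "set_page_table"}
PRIVILEGED = {"read_status", "write_status", "set_page_table"}

# Popek & Goldberg: virtualizable only if every sensitive
# instruction is also privileged (and therefore traps in user mode).
assert SENSITIVE <= PRIVILEGED, "architecture is not virtualizable"

class Trap(Exception):
    """Raised when a privileged instruction is attempted in user mode."""
    def __init__(self, instr, operand):
        self.instr, self.operand = instr, operand

class CPU:
    def __init__(self):
        self.user_mode = True       # the guest OS runs de-privileged
        self.status = 0             # real (host) supervisor state

    def execute(self, instr, operand=None):
        if instr in PRIVILEGED and self.user_mode:
            raise Trap(instr, operand)   # forced trap to the VMM
        # ... direct execution of ordinary instructions would go here ...

class VMM:
    """Catches traps and emulates them against per-guest virtual state."""
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_status = 0
        self.virtual_page_table = None

    def run_guest(self, program):
        for instr, operand in program:
            try:
                self.cpu.execute(instr, operand)
            except Trap as t:
                self.emulate(t)

    def emulate(self, trap):
        if trap.instr == "read_status":
            print("guest reads status =", self.virtual_status)
        elif trap.instr == "write_status":
            self.virtual_status = trap.operand
        elif trap.instr == "set_page_table":
            self.virtual_page_table = trap.operand
            # a real VMM would install a shadow page table here

# The guest OS believes it is manipulating the real machine state:
vmm = VMM(CPU())
vmm.run_guest([("write_status", 0x2000),
               ("read_status", None),
               ("set_page_table", "guest_pt_root")])

Note the assert at the top: it is exactly the Popek and Goldberg condition. If even one sensitive instruction is missing from the privileged set, the guest can read or write real machine state behind the VMM's back, and the illusion falls apart.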

This principle was articulated very clearly in 1974, when virtualization was already a widely applied technique in the world of mainframe computing. Unfortunately, the principle didn't make the leap into the microcomputer world. In fact, there was an enduring tradition of releasing processors that were not virtualizable, only to realize the mistake and issue a second version with a minor fix.

For example, the venerable Motorola 68000 was first released in 1978, and was heralded as a "mainframe on a chip". Except, it had one little problem: a read from the sensitive status register did not trap, preventing the construction of a virtual memory system. So, Motorola issued the 68010, which was almost identical, except that a read from the status register forced a trap, enabling correct virtualization of memory.

Unfortunately, not everybody got the memo.

For nearly three decades, the Intel x86 series of processors did not have this property. In user mode, many instructions could be used to view sensitive state (SGDT and SIDT, for example, reveal the descriptor table registers without trapping), and many attempts to write sensitive state would fail silently without a trap (POPF simply ignores changes to the interrupt flag). From the 1970s until the late 1990s, efficient virtualization was basically impossible on the most widely used processor family.

Around the year 2000, virtualization became of renewed interest as a way to serve the multi-tenancy needs of large internet services. A number of solutions were developed simultaneously to work around the limitations of the Intel chips. One approach, used in VMware, was to translate the guest's code at runtime, converting sensitive instructions into deliberate traps to the VMM. Another approach, used in the Xen hypervisor, was to modify the operating system code so that it explicitly called the VMM instead of invoking sensitive instructions, a technique known as paravirtualization.
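
To give a flavor of the binary translation approach, here is a greatly simplified sketch in Python, using made-up instruction names. A real translator works on actual x86 machine code, caches translated blocks, and patches branches, but the core idea is simply to rewrite sensitive instructions into explicit calls to the VMM before letting the block run:

# A greatly simplified sketch of binary translation: scan a block of
# (made-up) guest instructions and rewrite any sensitive instruction
# into an explicit call into the VMM. Not how any real translator is
# implemented; only the core idea.

SENSITIVE = {"popf", "sgdt", "mov_to_cr3"}

def translate_block(block):
    """Replace sensitive instructions with explicit VMM calls."""
    translated = []
    for instr in block:
        if instr[0] in SENSITIVE:
            translated.append(("vmm_call",) + instr)   # deliberate trap
        else:
            translated.append(instr)                   # safe to run directly
    return translated

guest_block = [("mov", "eax", 1),
               ("popf",),                 # would fail silently in user mode
               ("add", "eax", 2),
               ("mov_to_cr3", "new_pt")]  # would be ignored or fault

for instr in translate_block(guest_block):
    print(instr)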

There are many other approaches to working around the limitation. Suffice it to say that they are all rather complicated, but they can be made to work.

Finally, in 2005, Intel and AMD began introducing hardware virtualization extensions (VT-x and AMD-V) to their processors, enabling classic trap-and-emulate virtualization, some 31 years after the Popek and Goldberg theorem was widely circulated.

So, what's the moral of the story?

Monday, November 8, 2010

Sometimes It All Comes Together

Most days, software engineering involves compromises and imperfect solutions. It's rare for two pieces of software to mesh perfectly -- you always have to work to overcome the limitations or assumptions present in different modules. But, every once in a while, the pieces just come together in a satisfying way.

A few weeks back, we ran into a problem with BXGrid, our system for managing biometric data. Our users had just ingested a whole pile of new images and videos, and were browsing and validating the data. Because the data was so new, no thumbnails had been generated yet, so every screenful required a hundred or so thumbnails to be created on the fly from the high resolution images. Multiply that by each active user, and you have a web site stuck in the mud.

A better approach than generating thumbnails on demand in the web server is to generate all of the missing ones offline in an orderly way. Since many of the transcoding operations are compute intensive, it makes sense to farm them out to a distributed system.

Peter Bui -- a graduate student in our group -- solved this problem elegantly by putting together almost all of our research software simultaneously. He used Weaver as the front-end language to query BXGrid and determine what thumbnails needed to be generated. Weaver generated a Makeflow to perform all of the transcodings. Makeflow used Work Queue to execute the tasks, with the Workers submitted to our campus Condor pool.
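
I don't have Peter's actual script in front of me, but the Weaver front end for this job looks roughly like the sketch below. The dataset name, the thumbnail field, and the convert_to_thumbnail script are all hypothetical; the structure mirrors the iris example in my earlier Weaver post.

# Hypothetical sketch of the thumbnail workflow in Weaver.
# Dataset, field, and script names are made up for illustration.

db      = SQLDataSet('db', 'biometrics', 'files')
missing = Query(db, db.thumbnail == None)       # files with no thumbnail yet

thumb  = SimpleFunction('convert_to_thumbnail', outsuffix='jpg')
thumbs = Map(thumb, missing)

Weaver then compiles this into a Makeflow with one transcoding rule per missing thumbnail, which is exactly the kind of workload that Work Queue and Condor are happy to chew through.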

So far, so good. But, the missing piece was that Makeflow expects data to be available as ordinary files. At the time, this would have required that we copy several terabytes out of the archive onto the local disk, which wasn't practical. So, Peter wrote a module for Parrot which enabled access to BXGrid files under paths like /bxgrid/fileid/123. Then, while attached to Parrot, Makeflow could access the files, being none the wiser that they were actually located in a distributed system.

Put it all together, and you have a complete pipeline: Weaver queries BXGrid and emits a Makeflow, Makeflow dispatches the transcoding tasks through Work Queue to workers running in our Condor pool, and Parrot lets each task read its input directly from the archive.

Sometimes, it all comes together.

Monday, November 1, 2010

Compiling Workflows with Weaver

Over the last year, our Makeflow system has become quite popular here at Notre Dame. Briefly, Makeflow takes a workload expressed in the plain old Make format, and executes it in a distributed system, using the dependency information to set up the appropriate remote execution environment. It does not require a distributed filesystem, so it's easy to get your applications going on hundreds to thousands of processors from the cloud or the grid. Makeflow is currently the back-end engine for our science portals in bioinformatics, biometrics, and molecular dynamics.
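
For those who haven't seen one, a Makeflow file is just a Makefile-style list of rules: each rule names the files it produces, the files it depends on, and the command to run. A tiny made-up example:

stats.txt: analyze.py data.csv
    python analyze.py data.csv > stats.txt

plot.png: plot.py stats.txt
    python plot.py stats.txt plot.png

Because every rule declares its inputs and outputs explicitly, Makeflow knows exactly which files each task needs, which is what lets it ship work to remote machines without a distributed filesystem.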

It didn't take long before our users started writing scripts in Perl or Python to generate Makeflows with tens of thousands of nodes. Those scripts all did similar things (query a database, break a dataset into N pieces), but they also started to get unruly and difficult to debug. It wasn't easy to look at one of these generator scripts and determine what it was trying to accomplish.

Enter Weaver, which is the creation of Peter Bui, one of our graduate students. Weaver is a high level Python framework that, in a few simple lines, can generate enormous (optimized) Makeflows. Peter presented a paper about Weaver at the workshop on Challenges of Large Applications in Distributed Environments at HPDC earlier this year.

Consider this common biometrics workload: extract all of the blue irises from our repository, then convert each iris into a 'template' data type, then compare all of them to each other. Here is how you do it in Weaver:

db = SQLDataSet('db', 'biometrics', 'irises')
nefs = Query(db, db.color == 'Blue')

conv = SimpleFunction('convertiristotemplate', outsuffix='bit')
bits = Map(conv, nefs)

cmp = SimpleFunction('compareiristemplates')
AllPairs(cmp, bits, bits, output='matrix.txt')

In the simplest case, Weaver just emits one gigantic Makeflow that performs all of the operations. However, sometimes there are patterns that can be executed more efficiently, given some better underlying tool. AllPairs is the perfect example of this optimization -- you can do an AllPairs using Makeflow, but it won't be as efficient as our native implementation. If the various data types line up appropriately, Weaver will simply call the All-Pairs abstraction. If not, it will expand the call into Makeflow in the appropriate way.

In principle, this is a lot like a C compiler: under certain conditions, the addition of two arrays can be accomplished with a vector add instruction; otherwise, it must be expanded into a larger number of conventional instructions. So, we think of Weaver as a compiler for workflows: it chooses the best implementation available to execute a complex program, leaving the programmer to worry about the higher level objectives.
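
You can think of the decision Weaver makes as something like the following paraphrase. This is plain Python for illustration, not Weaver's actual internals, and the command names are invented:

# A paraphrase of the optimization choice, not Weaver's real code.
# Command and helper names are invented for illustration.

def emit_rule(outputs, inputs, command):
    """Print one rule in Makeflow (i.e. Make) syntax."""
    print('%s: %s\n\t%s\n' % (' '.join(outputs), ' '.join(inputs), command))

def compile_allpairs(func, set_a, set_b, output, native_ok):
    if native_ok:
        # The data types line up, so emit a single rule that runs the
        # native All-Pairs implementation over the whole cross product.
        emit_rule([output], set_a + set_b,
                  './allpairs %s %s %s > %s' %   # hypothetical command name
                  (' '.join(set_a), ' '.join(set_b), func, output))
    else:
        # Otherwise, expand into one Makeflow rule per comparison,
        # plus a final rule that gathers the results into the matrix.
        parts = []
        for a in set_a:
            for b in set_b:
                part = '%s.vs.%s' % (a, b)
                emit_rule([part], [a, b, func],
                          './%s %s %s > %s' % (func, a, b, part))
                parts.append(part)
        emit_rule([output], parts, 'cat %s > %s' % (' '.join(parts), output))

compile_allpairs('compareiristemplates',
                 ['a.bit', 'b.bit'], ['a.bit', 'b.bit'],
                 'matrix.txt', native_ok=False)

In the native case, one rule hands the entire cross product to the All-Pairs tool; in the fallback case, the same work becomes a much larger set of ordinary Makeflow rules plus a final gathering step.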