Friday, November 2, 2012

CCTools 3.6.1 Released!


The Cooperative Computing Lab is pleased to announce the release of version 3.6.1 of the Cooperative Computing Tools, including Parrot, Chirp, Makeflow, WorkQueue, SAND, All-Pairs, and other software.



This is a bug fix release of version 3.6.0. No new features were added.



The software may be downloaded here: http://www.cse.nd.edu/~ccl/software/download



Changes:




  • [Work Queue] Fixed bugs that resulted in a cancelled task becoming a zombie. [Dinesh Rajan]

  • [Makeflow] Various corrections to the Makeflow manual and HTML documentation. [Li Yu]

  • [Makeflow] -I and -O options now correctly output file list to stdout. [Li Yu]

  • [*] Added missing debug flag for ldflags in configure. [Douglas Thain]

  • [Work Queue] Now correctly removes directories during cleanup. [Dinesh Rajan]

  • [Chirp] The -b option is now documented in the man page and -h output. [Patrick Donnelly]

  • [SAND] Fixed an incorrect error message. [Peter Bui, Li Yu]

  • [Catalog Server] -T option now properly accepts an argument. [Patrick Donnelly]

  • [*] Fixed a bug where the wrong version of perl was used in configure. [Dinesh Rajan]



Thanks go to the contributors for this release: Dinesh Rajan, Patrick Donnelly, Peter Bui, Li Yu, and Douglas Thain.



Enjoy!


Thursday, November 1, 2012

Applied Cyber Infrastructure Class at U. Arizona

The Applied Cyber Infrastructure Concepts course at the University of Arizona makes use of the Cooperative Computing Tools to teach principles of large-scale distributed computing. The project-based course introduces fundamental concepts and tools for analyzing large datasets on national computing resources and commercial cloud providers. For the final project, students developed an appliance for annotating multiple rice genomes using the MAKER software, distributed across Amazon and FutureGrid resources via the Work Queue framework. They also developed a web application to visualize, in real time, the performance and utilization of each cloud appliance and the overall progress of the rice annotation process, and they integrated the final results into a genome browser.

- Nirav Merchant and Eric Lyons, University of Arizona

Thursday, October 4, 2012

NSF Grant: Data and Software Preservation for Open Science


Mike Hildreth, Professor of Physics, Jarek Nabrzyski, Director of the Center for Research Computing and Concurrent Associate Professor of Computer Science and Engineering, and Douglas Thain, Associate Professor of Computer Science and Engineering, are the lead investigators on a project that will explore solutions to the problems of preserving data and analysis software, and of capturing how these relate to the results obtained from the analysis of large datasets.


Known as Data and Software Preservation for Open Science (DASPOS), the project is focused on High Energy Physics data from the Large Hadron Collider (LHC) and the Fermilab Tevatron. The group will also survey and incorporate the preservation needs of other communities, such as Astrophysics and Bioinformatics, where large datasets and the results derived from them are becoming the core of emerging science.


The three-year, $1.8M program, funded by the National Science Foundation, will include several international workshops and the design of a prototype data- and software-preservation architecture that provides the functionality needed by the scientific disciplines. What is learned from building this prototype will inform the design and construction of the global data- and software-preservation infrastructure for the LHC, and potentially for other disciplines.


The multi-disciplinary DASPOS team includes particle physicists, computer scientists, and digital librarians from Notre Dame, the University of Chicago, the University of Illinois Urbana-Champaign, the University of Nebraska at Lincoln, New York University, and the University of Washington, Seattle.

Wednesday, October 3, 2012

Tutorial on Scalable Programming at Notre Dame


Tutorial: Introduction to Scalable Programming with Makeflow and Work Queue

October 24th, 3-5PM, 303 Cushing Hall


Register here (no fee) to reserve your spot in the class:

http://www.nd.edu/~ccl/software/tutorials/ndtut12


Would you like to learn how to write programs that can scale up to
hundreds or thousands of machines?


This tutorial will provide an introduction to writing scalable
programs using Makeflow and Work Queue. These tools are used at Notre
Dame and around the world to attack large problems in fields such as
biology, chemistry, data mining, economics, physics, and more. Using
these tools, you will be able to write programs that can scale up to
hundreds or thousands of machines drawn from clusters, clouds, and
grids.


This tutorial is appropriate for new graduate students, undergraduate
researchers, and research staff involved in computing in any
department on campus. Some familiarity with Unix and the ability to
program in Python, Perl, or C are required.


The class will consist of half lecture and half hands-on instruction
in a computer-equipped classroom. The instructors are Dinesh Rajan
and Michael Albrecht, developers of the software who are PhD students
in the CSE department.


For questions about the tutorial, contact Dinesh Rajan, dpandiar AT nd.edu.

Monday, October 1, 2012

Global Access to High Energy Physics Software with Parrot and CVMFS

Scientists searching for the Higgs boson have profited from Parrot's new support for the CernVM Filesystem (CVMFS), a network filesystem tailored to providing world-wide access to software installations. By using Parrot, CVMFS, and additional components integrated by the Any Data, Anytime, Anywhere project, physicists working in the Compact Muon Solenoid experiment have been able to create a uniform computing environment across the Open Science Grid. Instead of maintaining large software installations at each participating institution, Parrot is used to provide access to a single highly-available CVMFS installation of the software from which files are downloaded as needed and aggressively cached for efficiency. A pilot project at the University of Wisconsin has demonstrated the feasibility of this approach by exporting excess compute jobs to run in the Open Science Grid, opportunistically harnessing 370,000 CPU-hours across 15 sites with seamless access to 400 gigabytes of software in the Wisconsin CVMFS repository.
  
- Dan Bradley, University of Wisconsin and the Open Science Grid

Wednesday, September 19, 2012

CCTools 3.6.0 Released!


The Cooperative Computing Lab is pleased to announce the release of version 3.6.0 of the Cooperative Computing Tools, including Parrot, Chirp, Makeflow, WorkQueue, SAND, All-Pairs, and other software.



The software may be downloaded here: http://www.cse.nd.edu/~ccl/software/download

This is a minor release which adds numerous features and fixes several bugs:




  • [WQ] Added API for logging functionality. [Christopher Bauschka]

  • [WQ] Python bindings now have more complete access to the API available from C, and the documentation has been improved (a brief usage sketch follows this list). [Dinesh Rajan]

  • [WQ] Work Queue no longer manually redirects stdin/stdout/stderr by editing the user-provided shell string; it now sets file descriptors directly, so user redirections are no longer overridden. [Patrick Donnelly]

  • [WQ, Makeflow] The Torque batch submission system is now supported. [Michael Albrecht, Douglas Thain]

  • [Parrot] Now supports extended attributes. [Patrick Donnelly]

  • [Makeflow] Now supports garbage collection of intermediate files. [Peter Bui]

  • [Makeflow] Now supports lexical scoping of Makeflow variables. [Peter Bui]

  • [Makeflow] New MAKEFLOW keyword for recursive Makeflows. [Peter Bui]

  • [WQ] Bindings for WQ now support SWIG versions >= 1.3.29. [Peter Bui]

  • [Parrot] iRODS now supports putfile/getfile operations for much faster file copies. [Douglas Thain]

  • [Parrot] Now includes watchdog support that runs alongside Parrot. [Douglas Thain, Brian Bockelman]

  • [*] CCTools programs now report version information via the -v option. Version information is also included in debug output with the `-d debug' flag. [Patrick Donnelly]

  • [WQ] work_queue_status output has been cosmetically improved. [Douglas Thain]

  • [WQ] New $WORK_QUEUE_SANDBOX environment variable. [Dinesh Rajan]
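
For readers curious what the Python bindings look like in practice, here is a minimal sketch of submitting one task. It is illustrative only: the command and file names are placeholders, and the Work Queue manual remains the authoritative reference for the API.

```python
from work_queue import WorkQueue, Task

# Minimal sketch: create a queue, describe one task with explicit input and
# output files, submit it, and wait for it to finish. Inputs are staged into
# the task's sandbox directory (cf. the new $WORK_QUEUE_SANDBOX variable).
q = WorkQueue(port=9123)

t = Task("./simulate < input.dat > output.dat")
t.specify_input_file("simulate")      # executable shipped to the worker
t.specify_input_file("input.dat")
t.specify_output_file("output.dat")   # fetched back when the task completes

q.submit(t)

while not q.empty():
    t = q.wait(5)                     # wait up to 5 seconds for a result
    if t:
        print("task %d returned %d" % (t.id, t.return_status))
```

Workers can then be started by hand or through a batch system and pointed at the listening port.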



Thanks go to the contributors for many minor features and bug fixes:




  • Michael Albrecht

  • Christopher Bauschka

  • Brian Bockelman

  • Dan Bradley

  • Peter Bui

  • Iheanyi Ekechukwu

  • Patrick Donnelly

  • Dinesh Rajan

  • Douglas Thain



Please send any feedback to the CCTools discussion mailing list. Enjoy!


Tuesday, September 18, 2012

Papers at e-Science Conference


Members of the CCL will present two papers and two posters at the upcoming IEEE Conference on e-Science in Chicago.




Wednesday, August 1, 2012

Rapid Processing of LIDAR Data in the Field with Makeflow

Makeflow is used to manage the data processing workflow of the Airborne Lidar Processing System (ALPS) for the Experimental Advanced Airborne Research Lidar (EAARL) at the USGS. Over the course of several hours of flight time, about a gigabyte of raw LIDAR waveform data is collected. This data must be geo-referenced in order to convert it into point clouds broken into 2km tiles suitable for GIS use. When we collect this data in the field, it is critical that the field crew can process the data rapidly and take a look at it for obvious problems so they can be corrected before the next flight.

Using Makeflow, the data can be processed on a portable 32-core cluster in the field in about 20 minutes. That is fast enough to do some cursory analysis and also to re-process the data a few times if needed to troubleshoot issues. Using Makeflow, it is easy to run exactly the same workflow in the field on the portable cluster or back in the office on a multi-core system.

- David Nagle, US Geological Survey


Tuesday, July 24, 2012

CCTools 3.5.2 Released


The Cooperative Computing Lab is pleased to announce the release of version 3.5.2 of the Cooperative Computing Tools, including Parrot, Chirp, Makeflow, WorkQueue, SAND, All-Pairs, and other software.



This is a bug fix release of version 3.5.1. A shell script executable has been added for Torque worker compatibility.



The software may be downloaded here: http://www.cse.nd.edu/~ccl/software/download




Changes:


  • [WQ] Improved some debug messages. [Dinesh Rajan]

  • [AP] Fixed a minor bug in handling comparison commands that produce no output. [Douglas Thain]

  • [WQ] Fixed a bug where the stdout buffer was not reset at the beginning of every task. [Douglas Thain]

  • [WQ] Documented -C option for work_queue_status. [Patrick Donnelly]

  • [WQ] Fixed a bug where pool configurations with an absolute path would result in a segfault. [Patrick Donnelly]

  • [Parrot] Fixed a bug where Parrot mistakenly thought it correctly wrote to memory using /proc/pid/mem. [Patrick Donnelly]

  • [*] Fixed a bug on OSX where non-blocking connects would result in an infinite loop. [Douglas Thain]

  • [*] Support for SWIG 1.3.29 added. [Peter Bui]

  • [*] Support has been added for workers using Torque. [Michael Albrecht, Douglas Thain]

  • [Makeflow] Fixed option parsing. [Patrick Donnelly]




Thursday, June 28, 2012

CCTools 3.5.1 Released


The Cooperative Computing Lab is pleased to announce the release of version 3.5.1 of the Cooperative Computing Tools, including Parrot, Chirp, Makeflow, WorkQueue, SAND, All-Pairs, and other software.



This is a bug fix release of version 3.5.0. No new features were added.



The software may be downloaded here: http://www.cse.nd.edu/~ccl/software/download



Changes:




  • Fixed a file descriptor leak in WorkQueue. [Joe Fetsch, Dinesh Rajan, Patrick Donnelly]

  • Better detection of fast memory access in Linux kernels >= 3.0. [Michael Hanke, Douglas Thain]


Monday, February 6, 2012

Some Open Computer Science Problems in Workflow Systems

In the previous article, I extolled the virtues of Makeflow, which has been very effective at engaging new users and allowing them to express their workflows in a way that facilitates parallel and distributed computing. We can consistently get new users going from one laptop to 100 cores in a distributed system with very little effort.

However, as we develop experience in scaling up workflows to thousands of cores across wide area distributed systems, a number of interesting computer science challenges have emerged. These problems are not specific to Makeflow, but can be found in most workflow systems:

Capacity Management
Just because a workflow expresses thousand-way concurrency doesn't mean that it is actually a good idea to run it on one thousand nodes! The cost of moving data to and from the execution nodes may outweigh the benefit of the added computational power. If one uses fewer nodes than the available parallelism, then it may be possible to pay the data movement cost once and then exploit it multiple times. For most workflows, there is a "sweet spot" at which performance is maximized. Of course, users don't want to discover this by experiment; they need a tool that can recommend an appropriate size for a given workflow.
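
To make the tradeoff concrete, here is a toy model of my own (the linear transfer cost and all of the numbers are assumptions for illustration, not measurements from any real workflow) that estimates total runtime as data-movement time plus compute time and picks the best worker count among a few candidates:

```python
# Toy capacity model: each worker first pulls its own copy of the input data
# over a shared link, then the tasks are spread evenly across the workers.
def estimated_runtime(n_workers, n_tasks, task_time_s, data_per_worker_mb, link_mbps):
    transfer_s = n_workers * data_per_worker_mb * 8.0 / link_mbps   # serialized transfers
    compute_s = (n_tasks / float(n_workers)) * task_time_s          # perfect task balance
    return transfer_s + compute_s

if __name__ == "__main__":
    candidates = [10, 50, 100, 500, 1000]
    times = {n: estimated_runtime(n, n_tasks=5000, task_time_s=60,
                                  data_per_worker_mb=500, link_mbps=1000)
             for n in candidates}
    for n in candidates:
        print("%4d workers -> %8.0f s" % (n, times[n]))
    print("sweet spot among candidates: %d workers" % min(times, key=times.get))
```

Under these made-up numbers the minimum falls at 500 workers rather than 1000: beyond that point the cost of shipping data outweighs the extra parallelism, which is exactly the effect a capacity-management tool would need to predict automatically.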

Software Engineering Tools
A workflow is just another kind of program: it has source code that must be managed, dependencies that must be gathered, and a history of modification that must be tracked. In this sense, we are badly in need of tools for manipulating workflows in coherent ways. For example, we need a linker that can take a workflow, find all the dependent components, and gather them together in one package. We need a loader that can take an existing workflow, load it into a computing system, and then update file names and other links to accommodate it. We need a profiler that can report on the time spent across multiple runs of a workflow, so as to determine where the problem spots are.
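
As a toy illustration of the "linker" idea (entirely my own sketch, not a tool that ships with Makeflow; it understands only plain "targets : sources" rule lines), one could scan a workflow and report the external files it depends on, namely the inputs that no rule produces:

```python
import sys

# Toy "workflow linker" sketch: collect files that appear as rule sources but
# are never produced as targets; those are the external dependencies that a
# real linker would have to gather into a package alongside the workflow.
def external_dependencies(path):
    targets, sources = set(), set()
    with open(path) as f:
        for raw in f:
            if raw.startswith("\t"):          # command lines start with a tab
                continue
            line = raw.strip()
            if not line or line.startswith("#") or ":" not in line:
                continue
            left, right = line.split(":", 1)  # "targets : sources"
            targets.update(left.split())
            sources.update(right.split())
    return sorted(sources - targets)

if __name__ == "__main__":
    for dep in external_dependencies(sys.argv[1]):
        print(dep)
```

A real tool would also have to chase shared libraries, scripts invoked by the commands, and environment assumptions, which is what makes the problem interesting.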

Portability and Reproducibility
Makeflow itself enables portability across execution systems. For example, you can run your application on Condor or SGE without modification. However, that doesn't mean that your applications are actually portable. If one cluster runs Blue Beanie Linux 36.5 and another runs Green Sock Linux 82.7, your chances of the same executable running on both are close to zero. Likewise, if you run a workflow one day, then set it aside for a year, it's possible that your existing machine has been updated to the point where the old workflow no longer runs.

However, if we also explicitly state the execution environment in the workflow, then this can be used to provide applications with what they need to run. The environment might be as simple as a directory structure with the applications, or as complex as an entire virtual machine. Either way, the environment becomes data that must be managed and moved along with the workflow, which affects the performance and cost issues discussed above.

Composability
Everything in computing must be composable. That is, once you get one component working, the very next step is to hook it up to another so that it runs as a subcomponent. While we can technically hook up one Makeflow to another, this doesn't currently happen in a way that results in a coherent program. For example, the execution method and resource limits don't propagate from one makeflow to another. To truly enable large scale structures, we need a native method of connecting workflows together that connects not only the control flow, but the resource allocation, capacity management, and everything else discussed above.

Effortless Scalability

As a rule of thumb, I tell brand new users that running a Makeflow on 10 cores simultaneously is trivial, running on 100 cores is usually easy, and getting to 1000 cores will require some planning and debugging. Going over 1000 cores is possible (our largest system is running on 5000 cores) but requires a real investment of time by the user.

Why does scale make things harder? One reason is that computer systems are full of artificial limits that are not widely known or managed effectively. On a Unix-like system, a process has a limited number of open file descriptors, and a directory can hold only a limited number of files. (Most people don't discover this until they hit the limit, and then the work must be restructured to accommodate it.) A complex network with translation devices may allow only a limited number of simultaneous network connections. A data structure that was small enough to ignore suddenly becomes unmanageable when there are 10,000 entries.
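
One practical habit is simply to check the relevant limits before scaling up a run. A small sketch (the particular limits inspected here are my choice for illustration; which ones matter depends on the system and the workload):

```python
import resource

# Sketch: print the soft and hard limits for two per-process resources that
# commonly bite large workflow runs on Unix-like systems.
for name, rlimit in [("open file descriptors", resource.RLIMIT_NOFILE),
                     ("processes per user", resource.RLIMIT_NPROC)]:
    soft, hard = resource.getrlimit(rlimit)
    print("%-22s soft=%s hard=%s" % (name, soft, hard))
```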

To have a software system that can scale to enormous size, you need to address these known technical issues, but you also need methods of accommodating limits that you didn't expect. You need an architecture that can scale naturally and observe its own limits to understand when they are reached. An ideal implementation would know its own limits and not require additional experts in order to scale up.

---

Each of these problems, though briefly described here, is a hefty problem once you start digging into it. Some are large enough to earn a PhD. (In fact, some are already in progress!) They all share the common theme of making data-intensive workflows manageable, usable, portable, and productive across a wide variety of computing systems.

More to follow.

Wednesday, February 1, 2012

Why Makeflow Works for New Users

In past articles, I have introduced Makeflow, which is a large scale workflow engine that we have created at Notre Dame.

Of course, Makeflow is certainly not the first or only workflow engine out there. But Makeflow does have several unique properties that make it an interesting platform for bringing new people into the world of distributed computing. And it sits at the right level of abstraction to let us address some of the fundamental computer science problems that result.

Briefly, Makeflow is a tool that lets the user express a large number of tasks by writing them down as a conventional makefile. You can see an example on our web page. A Makeflow can be just a few rules long, or it can consist of hundreds to thousands of tasks, such as an EST pipeline workflow.

Once the workflow is written down, you can then run Makeflow in several different ways. You can run it entirely on your workstation, using multiple cores. You can ask Makeflow to send the jobs to your local Condor pool, PBS or SGE cluster, or other batch system. Or, you can start the (included) Work Queue system on a few machines that you happen to have handy, and Makeflow will run the jobs there.

Over the last few years, we have had very good experience getting new users to adopt Makeflow, ranging from highly sophisticated computational scientists all the way to college sophomores learning the first principles of distributed computing. There are several reasons why this is so:
  • A simple and familiar language. Makefiles are already a well known and widely used way of expressing dependency and concurrency, so it is easy to explain. Unlike more elaborate languages, it is brief and easy to read and write by hand. A text-based language can be versioned and tracked by any existing source control method.
  • A neutral interface and a portable implementation. Nothing in a Makeflow references any particular batch system or distributed computing technology, so existing workflows can be easily moved between computing systems. If I use Condor and you use SGE, there is nothing to prevent my workflow from running on your system.
  • The data needs are explicit. A subtle but significant difference between Make and Makeflow is that Makeflow treats your statement of file dependencies very seriously. That is, you must state exactly which files (or directories) your computation depends upon. This is slightly inconvenient at first, but it vastly improves the ability of Makeflow to create the right execution environment, verify a correct execution, and manage storage and network resources appropriately.
  • An easy on-ramp to large resources. We have gone to great lengths to make it absolutely trivial to run Makeflow on a single machine with no distributed infrastructure. Using the same framework, you can move to harnessing a few machines in your lab (with Work Queue) and then progressively scale up to enormous scale using clusters, clouds, and grids. We have users running on 5 cores, 5000 cores, and everything in between.
Of course, our objective is not simply to build software. Makeflow is a starting point for engaging our research collaborators, which allows us to explore some hard computer science problems related to workflows. In the next article, I will discuss some of those challenges.