Cooperative Computing Lab News: May 2016

Thursday, May 26, 2016

Work Queue from Raspberry Pi to Azure at SPU

"At Seattle Pacific University we have used Work Queue in the CSC/CPE 4760 Advanced Computer Architecture course in Spring 2014 and Spring 2016. Work Queue serves as our primary example of a distributed system in our “Distributed and Cloud Computing” unit for the course. Work Queue was chosen because it is easy to deploy, and undergraduate students can quickly get started working on projects that harness the power of distributed resources."

The main project in this unit had the students obtain benchmark results for three systems: a high performance workstation; a cluster of 12 Raspberry Pi 2 boards, and a cluster of A1 instances in Microsoft Azure. The task for each benchmark used Dr. Peter Bui’s Work Queue MapReduce framework; the students tested both a Word Count and Inverted Index on the Linux kernel source. In testing the three systems the students were exposed to the principles of distributed computing and the MapReduce model as they investigated tradeoffs in price, performance, and overhead.
- Prof. Aaron Dingler, Seattle Pacific University.

Tuesday, May 24, 2016

Lifemapper analyzes biodiversity using Makeflow and Work Queue

Lifemapper is a high-throughput, webservice-based, single- and multi-species modeling and analysis system designed at the Biodiversity Institute and Natural History Museum, University of Kansas. Lifemapper was created to compute and web publish, species distribution models using available online species occurrence data. Using the Lifemapper platform, known species localities georeferenced from museum specimens are combined with climate models to predict a species’ “niche” or potential habitat availability, under current day and future climate change scenarios. By assembling large numbers of known or predicted species distributions, along with phylogenetic and biogeographic data, Lifemapper can analyze biodiversity, species communities, and evolutionary influences at the landscape level.

Lifemapper has had difficulty scaling recently as our projects and analyses are growing exponentially. For a large proof-of-concept project we deployed on the XSEDE resource Stampede at TACC, we integrated Makeflow and Work Queue into the job workflow. Makeflow simplified job dependency management and reduced job-scheduling overhead, while Work Queue scaled our computation capacity from hundreds of simultaneous CPU cores to thousands. This allowed us to perform a sweep of computations with various parameters and high-resolution inputs producing a plethora of outputs to be analyzed and compared. The experiment worked so well that we are now integrating Makeflow and Work Queue into our core infrastructure. Lifemapper benefits not only from the increased speed and efficiency of computations, but the reduced complexity of the data management code, allowing developers to focus on new analyses and leaving the logistics of job dependencies and resource allocation to these tools.

Information from C.J. Grady, Biodiversity Institute and Natural History Museum, University of Kansas.

Condor Week 2016 presentation

We presented in Condor Week 2016 our approach to create a comprehensive resource feedback loop to execute tasks of unknown size. In this feedback look, tasks are monitored and measured in user-space as they run; the resource usage is collected into an online archive; and further instances are provisioned according to the application's historical data to avoid resource starvation and minimize resource waste. We present physics and bioinformatics case studies consisting of more than 600,000 tasks running on 26,000 cores (96% of them from opportunistic resources), where the proposed solution leads to an overall increase in throughput (from 10% to 400% across different workflows), and a decrease in resource waste compared to workflow executions without the resource feedback-loop.

condor week 2016 slides

Monday, May 23, 2016

Containers, Workflows, and Reproducibility

The DASPOS project hosted a workshop on Container Strategies for Data and Software Preservation that Promote Open Science at Notre Dame on May 19-20, 2016. We had a very interesting collection of researchers and practitioners, all working on problems related to reproducibility, but presenting different approaches and technologies.

Prof. Thain presented recent work by CCL grad students Haiyan Meng and Peter Ivie on Combining Containers and Workflow Systems for Reproducible Execution.

The Umbrella tool created by Haiyan Meng allows for a simple, compact, archival representation of a computation, taking into account hardware, operating system, software, and data dependencies. This allows one to accurately perform computational experiments and give each one a DOI that can be shared, downloaded, and executed.

The PRUNE tool created by Peter Ivie allows one to construct dynamic workflows of connected tasks, each one precisely specified by execution environment. Provenance and data are tracked precisely, so that the specification of a workflow (and its results) can be exported, imported, and shared with other people. Think of it like git for workflows.

Monday, May 9, 2016

Balancing Push and Pull in Confuga, an Active Storage Cluster File System for Scientific Workflows

Patrick Donnelly has published a journal article in Concurrency and Computation: Practice and Experience on the Confuga active cluster file system. The journal article presents the use of controlled transfers to distribute data within the cluster without destabilizing the resources of storage nodes:

Patrick Donnelly and Douglas Thain, Balancing Push and Pull in Confuga, an Active Storage Cluster File System for Scientific Workflows, Concurrency and Computation: Practice and Experience, May 2016.

Confuga is a new active storage cluster file system designed for executing regular POSIX workflows. Users may store extremely large datasets on Confuga in a regular file system layout, with whole files replicated across the cluster. You may then operate on your dataset using regular POSIX applications, with defined inputs and outputs.

Jobs execute with full data locality with all whole-file dependencies available in its own private sandbox. For this reason, Confuga will first copy a job's missing data to the target storage node prior to dispatching the job. This journal article examines two transfer mechanisms used in Confuga to manage this data movement: push and pull.

A push transfer is used to direct a storage node to copy a file to another storage node. Pushes are centrally managed by the head node which allows it to schedule transfers in a way that avoids destabilizing the cluster or individual storage nodes. To avoid some potential inefficiencies with centrally managed transfers, Confuga also uses pull transfers which resemble file access in a typical distributed file system. A job will pull its missing data dependencies from other storage nodes prior to execution.

This journal article examines the trade-offs of the two approaches and settles on a balanced approach where pulls are used for transfers of smaller files and pushes are used for larger files. This results in significant performance improvements for scientific workflows with large data dependencies. For example, two bioinformatics workflows we studied, a Burrows-Wheeler Aligner (BWA) workflow and an Iterative Alignments of Long Reads (IALR) workflow, achieved 48% and 77% reductions in execution time compared to using either an only push or only pull strategy.

For further details, please check out our journal article here. Confuga is available as part of the Cooperative Computing Toolset distributed under the GNU General Public License. For usage instructions, see the Confuga manual and man page.

See also our last blog post on Confuga which introduced Confuga.

Tuesday, May 3, 2016

Interships at Red Hat and CERN

Two CCL graduate students will be off on internships in summer 2016:

Haiyan Meng will be interning at Red Hat, working on container technologies.

Tim Shaffer will be a summer visitor at CERN, working with the CVMFS group on distributed filesystems for HPC and high energy physics.