Tuesday, October 25, 2016

Reproducibility Papers at eScience 2016

CCL students presented two papers at the IEEE 12th International Conference on eScience on the theme of reproducibility in computational science:
Congrats to Haiyan for winning a Best of Conference award for her paper:

CCL Workshop 2016

The 2016 CCL Workshop on Scalable Scientific Computing was held on October 19-20 at the University of Notre Dame.  We offered tutorials on Makeflow, Work Queue, and Parrot, and gave highlights of the many new capabilities relating to reproducibility and container technologies. Our user community gave presentations describing how these technologies are used to accelerate discovery in genomics, high energy physics, molecular dynamics, and more.
Everyone got together to share a meal, solve problems, and generate new ideas. Thanks to everyone who participated, and see you next year!

Tuesday, September 20, 2016

NSF Grant to Support CCTools Development

We are pleased to announce that our work will continue to be supported by the National Science Foundation through the Division of Advanced Cyberinfrastructure.

The project is titled "SI2-SSE: Scaling up Science on Cyberinfrastructure with the Cooperative Computing Tools". It will advance the development of the Cooperative Computing Tools to meet the changing technology landscape in three key respects: exploiting container technologies, making efficient use of local concurrency, and performing capacity management at the workflow scale.  We will continue to focus on active user communities in high energy physics, which rely on Parrot for global-scale filesystem access in campus clusters and the Open Science Grid; bioinformatics users executing complex workflows via the VectorBase, LifeMapper, and CyVerse disciplinary portals; and ensemble molecular dynamics applications that harness GPUs from XSEDE and commercial clouds.

Thursday, September 15, 2016

Announcement: CCTools 6.0.0 released

The Cooperative Computing Lab is pleased to announce the release of version 6.0.0 of the Cooperative Computing Tools including Parrot, Chirp, Makeflow, WorkQueue, Umbrella, Prune, SAND, All-Pairs, Weaver, and other software.

The software may be downloaded here:
http://ccl.cse.nd.edu/software/download

This is a major release which adds several features and bug fixes. Among them:

  • [Catalog]   Automatic fallback to a backup catalog server. (Tim Shaffer)
  • [Makeflow]  Accept DAGs in JSON format. (Tim Shaffer)
  • [Makeflow]  Fixes for multiple documentation omissions. (Nick Hazekamp and Haiyan Meng)
  • [Makeflow]  Send information to catalog server. (Kyle Sweeney)
  • [Makeflow]  Syntax directives (e.g., .SIZE to indicate file size). (Nick Hazekamp)
  • [Parrot] Fix cvmfs logging redirection. (Jakob Blomer)
  • [Parrot] Multiple bug-fixes. (Tim Shaffer, Patrick Donnelly, Douglas Thain)
  • [Parrot] Timewarp mode for reproducible runs. (Douglas Thain)
  • [Parrot] Use new libcvmfs interfaces if available. (Jakob Blomer)
  • [Prune]     Use SQLite as backend. (Peter Ivie)
  • [Resource Monitor] Record the time where a resource peak occurs. (Ben Tovar)
  • [Resource Monitor] Report the peak number of cores used. (Ben Tovar)
  • [Work Queue] Add a transactions log. (Ben Tovar)
  • [Work Queue] Automatic resource labeling and monitoring. (Ben Tovar)
  • [Work Queue] Better capacity worker autoregulation. (Ben Tovar)
  • [Work Queue] Creation of disk allocations per task. (Nate Kremer-Herman)
  • [Work Queue] Extensive updates to wq_maker. (Nick Hazekamp)
  • [Work Queue] Improvements in computing the master's task capacity. (Nate Kremer-Herman)
  • [Work Queue] Raspberry Pi compilation fixes. (Peter Bui)
  • [Work Queue] Throttle work_queue_factory with --workers-per-cycle. (Ben Tovar)
  • [Work Queue] Unlabeled tasks are assumed to consume 1 core, 512 MB RAM, and 512 MB disk; see the sketch after this list. (Ben Tovar)
  • [Work Queue] Worker disconnects when its node no longer has the promised resources. (Ben Tovar)
  • [Work Queue] Work Queue statistics cleanup (see work_queue.h for deprecated names). (Ben Tovar)
  • [Work Queue] work_queue_status respects terminal column settings. (Mathias Wolf)
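
As a concrete illustration of the resource-labeling items above, here is a minimal sketch of labeling a task explicitly with the Work Queue Python bindings, so that it does not fall back to the defaults of 1 core, 512 MB RAM, and 512 MB disk. The port, command, and file names are illustrative, and the method names reflect our understanding of the 6.0.0 API; check the installed documentation for your version.

    # Minimal sketch: labeling task resources with the Work Queue Python bindings.
    # Port, command, and file names are illustrative; API names assumed from 6.0.0.
    from work_queue import WorkQueue, Task

    q = WorkQueue(port=9123)

    t = Task("./simulate.sh input.dat > output.dat")
    t.specify_input_file("simulate.sh")
    t.specify_input_file("input.dat")
    t.specify_output_file("output.dat")

    # Explicit labels; an unlabeled task is assumed to use 1 core, 512 MB RAM, 512 MB disk.
    t.specify_cores(2)
    t.specify_memory(1024)   # MB
    t.specify_disk(2048)     # MB

    q.submit(t)
    while not q.empty():
        t = q.wait(5)
        if t:
            print("task %d finished with return status %d" % (t.id, t.return_status))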

We will have tutorials on the new features in our upcoming workshop, October 19 and 20. Refer to http://ccl.cse.nd.edu/workshop/2016 for more information. We hope you can join us!

Thanks go to the contributors for many features, bug fixes, and tests:

  • Jakob Blomer
  • Peter Bui
  • Patrick Donnelly
  • Nathaniel Kremer-Herman
  • Kenyi Hurtado-Anampa
  • Peter Ivie
  • Kevin Lannon
  • Haiyan Meng
  • Tim Shaffer
  • Douglas Thain
  • Ben Tovar
  • Kyle Sweeney
  • Mathias Wolf
  • Anna Woodard
  • Chao Zheng

Please send any feedback to the CCTools discussion mailing list:

http://ccl.cse.nd.edu/community/forum

Enjoy!

Friday, August 19, 2016

Summer REU Projects in Data Intensive Scientific Computing

We recently wrapped up the first edition of the summer REU in Data Intensive Scientific Computing at the University of Notre Dame.  Ten undergraduate students came to ND from around the country and worked on projects encompassing physics, astronomy, bioinformatics, network sciences, molecular dynamics, and data visualization with faculty at Notre Dame.

To learn more, see these videos and posters produced by the students:

       


Wednesday, August 10, 2016

Simulation of HP24stab with AWE and Work Queue


The villin headpiece subdomain "HP24stab" is a recently discovered 24-residue stable supersecondary structure that consists of two helices joined by a turn. Simulating 1μs of motion for HP24stab can take days or weeks depending on the available hardware, and folding events take place on a scale of hundreds of nanoseconds to microseconds.  Using the Accelerated Weighted Ensemble (AWE), a total of 19μs of trajectory data were simulated over the course of two months using the OpenMM simulation package. These trajectories were then clustered and sampled to create an AWE system of 1000 states and 10 models per state. A Work Queue master dispatched 10,000 simulations to a peak of 1000 connected 4-core workers, for a total of 250ns of concurrent simulation time and 2.5μs per AWE iteration. As of August 8, 2016, the system has run continuously for 18 days and completed 71 iterations, for a total of 177.5μs of simulation time. The data gathered from these simulations will be used to report the mean first passage time, or average time to fold, for HP24stab, as well as the major folding pathways.  - Jeff Kinnison and Jesús Izaguirre, University of Notre Dame
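
For readers unfamiliar with the master/worker pattern described above, the following rough sketch shows how a Work Queue master in Python can dispatch many independent simulation tasks to connected workers. The script name, file names, and project name are hypothetical; AWE itself layers its clustering and resampling logic on top of this pattern.

    # Hypothetical sketch of the master/worker pattern described above; the
    # worker command and file names are illustrative, not the actual AWE code.
    from work_queue import WorkQueue, Task

    q = WorkQueue(port=9123)
    q.specify_name("awe-hp24stab")   # workers locate the master by project name

    for i in range(10000):
        t = Task("python run_openmm.py start_%06d.pdb walk_%06d.dat" % (i, i))
        t.specify_input_file("run_openmm.py")
        t.specify_input_file("start_%06d.pdb" % i)
        t.specify_output_file("walk_%06d.dat" % i)
        q.submit(t)

    while not q.empty():
        t = q.wait(60)
        if t and t.return_status == 0:
            print("walker task %d completed" % t.id)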

ND Leads DOE Grant on Virtual Clusters for Scientific Computing

Prof. Douglas Thain is leading a new $2.2M DOE-funded project titled "VC3: Virtual Clusters for Community Computation" in an effort to make our national supercomputing facilities more effective for collaborative scientific computing.  The project team brings together researchers from the University of Notre Dame, the University of Chicago, and Brookhaven National Lab.



Our current NSF and DOE supercomputers are very powerful, but they each have different operating systems and software configurations, which makes it difficult and time consuming for new users to deploy their codes and share results.  The new service will create virtual clusters on the existing machines that have the custom software and other services needed to easily run advanced scientific codes from fields such as high energy physics, bioinformatics, and astrophysics.  If successful, users of this service will be able to easily move applications between university and national supercomputing facilities.





Friday, July 29, 2016

2016 DISC Summer Session Wraps Up

Congratulations to our first class of summer students participating in the Data Intensive Scientific Computing research experience!  Twelve students from around the country came to Notre Dame to learn how computing drives research in high energy physics, climatology, bioinformatics, astrophysics, and molecular dynamics.

At our closing poster session in Jordan Hall (along with several other REU programs), students presented their work and results to faculty and guests from across campus.

If you are excited to work at the intersection of scientific research and advanced computing, we invite you to apply to the 2017 DISC summer program at Notre Dame!

Wednesday, June 22, 2016

New Work Queue Visualization

Nate Kremer-Herman has created a new, convenient way to look up information about Work Queue masters. This new visualization tool provides real-time updates on the status of each Work Queue master that contacts our catalog server. We hope this tool will both help users understand what their Work Queue masters are doing and help them determine when it may be time to take corrective action.

A Comparative View


A Specific Master


Among other things, the tool reports the number of tasks currently running, the number of tasks waiting to run, and the master's total task capacity. For example, a user might see a large number of waiting tasks, a small number of running tasks, and a task capacity somewhere in between; in that case we would recommend requesting more workers. We hope users will take advantage of this new way to view and manage their work.
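
The same numbers shown in the visualization are also available programmatically, so a submit script can watch for this situation itself. Below is a small sketch that polls the master's statistics and prints a hint when many tasks are waiting; the statistics field names are assumptions based on our reading of the Work Queue API of this era and may differ in your version (see work_queue.h).

    # Sketch: inspecting a running master's statistics from the submit script.
    # Field names on q.stats are assumptions; check work_queue.h / the Python docs.
    from work_queue import WorkQueue

    q = WorkQueue(port=9123)
    # ... submit tasks here ...

    while not q.empty():
        q.wait(10)
        s = q.stats
        print("running=%d waiting=%d" % (s.tasks_running, s.tasks_waiting))
        if s.tasks_waiting > 10 * max(s.tasks_running, 1):
            print("hint: many tasks are waiting; consider starting more workers")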

Thursday, May 26, 2016

Work Queue from Raspberry Pi to Azure at SPU

"At Seattle Pacific University we have used Work Queue in the CSC/CPE 4760 Advanced Computer Architecture course in Spring 2014 and Spring 2016.  Work Queue serves as our primary example of a distributed system in our “Distributed and Cloud Computing” unit for the course.  Work Queue was chosen because it is easy to deploy, and undergraduate students can quickly get started working on projects that harness the power of distributed resources."

The main project in this unit had the students obtain benchmark results for three systems: a high-performance workstation, a cluster of 12 Raspberry Pi 2 boards, and a cluster of A1 instances in Microsoft Azure.  The task for each benchmark used Dr. Peter Bui’s Work Queue MapReduce framework; the students tested both a Word Count and an Inverted Index on the Linux kernel source. In testing the three systems, the students were exposed to the principles of distributed computing and the MapReduce model as they investigated tradeoffs in price, performance, and overhead.
 - Prof. Aaron Dingler, Seattle Pacific University. 
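
A classroom-scale version of the word-count benchmark can also be expressed directly with the Work Queue Python bindings, without the full MapReduce framework. The sketch below fans per-file counts out to workers and merges them at the master; the input paths and shell pipeline are illustrative only.

    # Hypothetical fan-out word count with Work Queue (not Prof. Bui's MapReduce
    # framework itself); input paths and the shell pipeline are illustrative.
    from collections import Counter
    from work_queue import WorkQueue, Task

    inputs = ["kernel/fork.c", "kernel/sched.c", "mm/mmap.c"]

    q = WorkQueue(port=9123)
    for i, path in enumerate(inputs):
        out = "count_%d.txt" % i
        t = Task("tr -cs 'a-zA-Z' '\\n' < input.txt | sort | uniq -c > " + out)
        t.specify_input_file(path, "input.txt")
        t.specify_output_file(out)
        t.specify_tag(out)
        q.submit(t)

    totals = Counter()
    while not q.empty():
        t = q.wait(30)
        if t and t.return_status == 0:
            # "Reduce" step: merge the per-file counts back at the master.
            for line in open(t.tag):
                n, word = line.split()
                totals[word] += int(n)

    print(totals.most_common(10))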

Tuesday, May 24, 2016

Lifemapper analyzes biodiversity using Makeflow and Work Queue

Lifemapper is a high-throughput, webservice-based, single- and multi-species modeling and analysis system designed at the Biodiversity Institute and Natural History Museum, University of Kansas. Lifemapper was created to compute and web-publish species distribution models using available online species occurrence data.  Using the Lifemapper platform, known species localities georeferenced from museum specimens are combined with climate models to predict a species’ “niche” or potential habitat availability under current-day and future climate change scenarios. By assembling large numbers of known or predicted species distributions, along with phylogenetic and biogeographic data, Lifemapper can analyze biodiversity, species communities, and evolutionary influences at the landscape level.

Lifemapper has had difficulty scaling recently as our projects and analyses are growing exponentially.  For a large proof-of-concept project we deployed on the XSEDE resource Stampede at TACC, we integrated Makeflow and Work Queue into the job workflow.  Makeflow simplified job dependency management and reduced job-scheduling overhead, while Work Queue scaled our computation capacity from hundreds of simultaneous CPU cores to thousands.  This allowed us to perform a sweep of computations with various parameters and high-resolution inputs producing a plethora of outputs to be analyzed and compared.  The experiment worked so well that we are now integrating Makeflow and Work Queue into our core infrastructure.  Lifemapper benefits not only from the increased speed and efficiency of computations, but the reduced complexity of the data management code, allowing developers to focus on new analyses and leaving the logistics of job dependencies and resource allocation to these tools.


Information from C.J. Grady, Biodiversity Institute and Natural History Museum, University of Kansas.
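
The pattern described above (one modeling job per species, with Makeflow tracking the dependencies) can be sketched with a short script that writes Makeflow rules: targets and sources on one line, then a tab-indented command. The species list and the lm_model command are placeholders, not Lifemapper's actual tooling.

    # Hypothetical generator of a Makeflow for per-species modeling jobs.
    # The lm_model command and file names are placeholders, not Lifemapper code.
    species = ["aedes_aegypti", "anopheles_gambiae", "culex_pipiens"]

    with open("models.makeflow", "w") as mf:
        for s in species:
            occ = "occurrences/%s.csv" % s
            out = "models/%s.tif" % s
            # Makeflow rule: outputs : inputs, then a tab-indented command.
            mf.write("%s: %s lm_model\n" % (out, occ))
            mf.write("\t./lm_model --occurrences %s --out %s\n\n" % (occ, out))

    print("wrote models.makeflow; run it with, e.g., 'makeflow -T wq models.makeflow'")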

Condor Week 2016 presentation

At Condor Week 2016 we presented our approach to creating a comprehensive resource feedback loop for executing tasks of unknown size. In this feedback loop, tasks are monitored and measured in user space as they run; the resource usage is collected into an online archive; and further instances are provisioned according to the application's historical data to avoid resource starvation and minimize resource waste. We presented physics and bioinformatics case studies consisting of more than 600,000 tasks running on 26,000 cores (96% of them from opportunistic resources), where the proposed solution leads to an overall increase in throughput (from 10% to 400% across different workflows) and a decrease in resource waste compared to workflow executions without the resource feedback loop.

Condor Week 2016 slides
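
The essence of the feedback loop can be illustrated without any of the archival machinery: label each new task with the historical peak usage of its kind, falling back to conservative defaults the first time a kind is seen. In the sketch below, the history dictionary stands in for the online archive described in the talk, and the command, file names, and numbers are invented.

    # Illustrative sketch of the feedback-loop idea: provision new tasks from
    # historical peak usage. The 'history' dict stands in for the online archive,
    # and the command, file names, and numbers are invented.
    from work_queue import WorkQueue, Task

    history  = {"align": {"cores": 1, "memory": 1800, "disk": 3500}}  # measured peaks (MB)
    defaults = {"cores": 1, "memory": 512, "disk": 512}               # first-run guess

    def make_task(kind, command):
        t = Task(command)
        r = history.get(kind, defaults)
        t.specify_cores(r["cores"])
        t.specify_memory(r["memory"])   # MB
        t.specify_disk(r["disk"])       # MB
        return t                        # input/output file declarations omitted

    q = WorkQueue(port=9123)
    q.submit(make_task("align", "./align.sh reads.fq > out.sam"))
    while not q.empty():
        q.wait(10)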

Monday, May 23, 2016

Containers, Workflows, and Reproducibility

The DASPOS project hosted a workshop on Container Strategies for Data and Software Preservation that Promote Open Science at Notre Dame on May 19-20, 2016.  We had a very interesting collection of researchers and practitioners, all working on problems related to reproducibility, but presenting different approaches and technologies.

Prof. Thain presented recent work by CCL grad students Haiyan Meng and Peter Ivie on Combining Containers and Workflow Systems for Reproducible Execution.


The Umbrella tool created by Haiyan Meng allows for a simple, compact, archival representation of a computation, taking into account hardware, operating system, software, and data dependencies.  This allows one to accurately perform computational experiments and give each one a DOI that can be shared, downloaded, and executed.
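
For a sense of scale, an Umbrella specification is a small JSON document describing those four kinds of dependencies. The sketch below, written from Python for convenience, shows roughly what one looks like; the section names follow our recollection of the format and every value is invented, so treat it as illustrative rather than a template.

    # Roughly what an Umbrella specification looks like; section names follow our
    # recollection of the format and all values are invented for illustration.
    import json

    spec = {
        "comment":  "toy analysis environment",
        "hardware": {"arch": "x86_64", "cores": "1", "memory": "2GB", "disk": "10GB"},
        "kernel":   {"name": "linux", "version": ">=2.6.32"},
        "os":       {"name": "redhat", "version": "6.5"},
        "software": {"python-2.7": {"mountpoint": "/software/python-2.7"}},
        "data":     {"input.dat":  {"mountpoint": "/tmp/input.dat"}},
        "environ":  {"PATH": "/software/python-2.7/bin:/usr/bin:/bin"},
    }

    with open("spec.umbrella", "w") as f:
        json.dump(spec, f, indent=2)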


The PRUNE tool created by Peter Ivie allows one to construct dynamic workflows of connected tasks, each one precisely specified by execution environment.  Provenance and data are tracked precisely, so that the specification of a workflow (and its results) can be exported, imported, and shared with other people.  Think of it like git for workflows.



Monday, May 9, 2016

Balancing Push and Pull in Confuga, an Active Storage Cluster File System for Scientific Workflows

Patrick Donnelly has published a journal article in Concurrency and Computation: Practice and Experience on the Confuga active cluster file system. The journal article presents the use of controlled transfers to distribute data within the cluster without destabilizing the resources of storage nodes:



Confuga is a new active storage cluster file system designed for executing regular POSIX workflows. Users may store extremely large datasets on Confuga in a regular file system layout, with whole files replicated across the cluster. They may then operate on their datasets using regular POSIX applications, with defined inputs and outputs.


Jobs execute with full data locality, with all whole-file dependencies available in their own private sandboxes. For this reason, Confuga will first copy a job's missing data to the target storage node prior to dispatching the job. This journal article examines two transfer mechanisms used in Confuga to manage this data movement: push and pull.

A push transfer is used to direct a storage node to copy a file to another storage node. Pushes are centrally managed by the head node which allows it to schedule transfers in a way that avoids destabilizing the cluster or individual storage nodes. To avoid some potential inefficiencies with centrally managed transfers, Confuga also uses pull transfers which resemble file access in a typical distributed file system. A job will pull its missing data dependencies from other storage nodes prior to execution.

This journal article examines the trade-offs of the two approaches and settles on a balanced approach where pulls are used for transfers of smaller files and pushes are used for larger files. This results in significant performance improvements for scientific workflows with large data dependencies. For example, two bioinformatics workflows we studied, a Burrows-Wheeler Aligner (BWA) workflow and an Iterative Alignments of Long Reads (IALR) workflow, achieved 48% and 77% reductions in execution time compared to using either a push-only or pull-only strategy.
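
The balanced policy can be restated in a few lines of illustrative Python: small missing dependencies are pulled on demand by the executing node, while large ones are pushed under the head node's control. The size threshold below is an arbitrary placeholder, not the value used in the paper, and this is a restatement of the idea rather than Confuga's actual code.

    # Illustrative restatement of the balanced transfer policy (not Confuga's code).
    # The threshold is an arbitrary placeholder, not the value from the paper.
    from collections import namedtuple

    FileDep = namedtuple("FileDep", ["name", "size"])   # size in bytes
    PUSH_THRESHOLD = 64 * 1024 * 1024                   # placeholder cutoff

    def plan_transfers(dependencies, files_on_node):
        """Split a job's missing dependencies into pull and push transfers."""
        pulls, pushes = [], []
        for f in dependencies:
            if f.name in files_on_node:
                continue                     # already replicated on this node
            if f.size < PUSH_THRESHOLD:
                pulls.append(f)              # node fetches it directly, like a DFS read
            else:
                pushes.append(f)             # head node schedules a controlled copy
        return pulls, pushes

    deps = [FileDep("genome.fa", 3200000000), FileDep("config.json", 2048)]
    print(plan_transfers(deps, files_on_node={"reads.fq"}))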

For further details, please check out our journal article here. Confuga is available as part of the Cooperative Computing Tools, distributed under the GNU General Public License. For usage instructions, see the Confuga manual and man page.

See also our last blog post on Confuga which introduced Confuga.

Tuesday, May 3, 2016

Internships at Red Hat and CERN

Two CCL graduate students will be off on internships in summer 2016:

Haiyan Meng will be interning at Red Hat, working on container technologies.


Tim Shaffer will be a summer visitor at CERN, working with the CVMFS group on distributed filesystems for HPC and high energy physics.

Wednesday, April 6, 2016

Ph.D. Defense: Patrick Donnelly

Patrick Donnelly successfully defended his Ph.D. dissertation, titled "Data Locality Techniques in an Active Cluster Filesystem for Scientific Workflows".

Congratulations to Dr. Donnelly!


The continued exponential growth of storage capacity has catalyzed the broad acquisition of scientific data which must be processed. While today's large data analysis systems are highly effective at establishing data locality and eliminating inter-dependencies, they are not so easily incorporated into scientific workflows that are often complex and irregular graphs of sequential programs with multiple dependencies. To address the needs of scientific computing, I propose the design of an active storage cluster file system which allows for execution of regular unmodified applications with full data
locality.

This dissertation analyzes the potential benefits of exploiting the structural information already available in scientific workflows -- the explicit dependencies -- to achieve a scalable and stable system. I begin with an outline of the design of the Confuga active storage cluster file system and its applicability to scientific computing. The remainder of the dissertation examines the techniques used to achieve a scalable and stable system. First, file system access by jobs is scoped to explicitly defined dependencies resolved at job dispatch. Second, the workflow's structural information is harnessed to direct and control necessary file transfers to enforce cluster stability and maintain performance. Third, control of transfers is selectively relaxed to improve performance by limiting any negative effects of centralized transfer management.

This work benefits users by providing a complete batch execution platform joined with a cluster file system. The user does not need to redesign their workflow or provide additional consideration to the management of data dependencies. System stability and performance are managed by the cluster file system while providing jobs with complete data locality.

Tuesday, March 22, 2016

Searching for Exo-Planets with Makeflow and Work Queue

Students at the University of Arizona made use of Makeflow and Work Queue to build an image processing pipeline on the Chameleon cloud testbed at TACC.

The course project was to build an image processing pipeline to accelerate the research of astronomer Jared Males, who designs instruments to search for exo-planets by observing the changes in appearance of a star.  This results in hundreds of thousands of images of a single star, which must then be processed in batch to eliminate noise and align the images.

The students built a solution (Find-R) which consumed over 100K CPU-hours on Chameleon, distributed using Makeflow and Work Queue.

Read more here:
https://www.tacc.utexas.edu/-/in-search-of-a-planet

Friday, March 18, 2016

Parrot talk at OSG All-hands meeting 2016

Ben Tovar gave a talk on using Parrot to access CVMFS as part of the Open Science Grid (OSG) all-hands meeting in Clemson, SC.

Software access with parrot and CVMFS

Tuesday, February 16, 2016

CCTools 5.4.0 released

The Cooperative Computing Lab is pleased to announce the release of version 5.4.0 of the Cooperative Computing Tools including Parrot, Chirp, Makeflow, WorkQueue, Umbrella, SAND, All-Pairs, Weaver, and other software.

The software may be downloaded here:

http://www.cse.nd.edu/~ccl/software/download

This minor release adds several features and bug fixes. Among them:

  • [Catalog]  Catalog server communication is now done using JSON encoded queries and replies. (Douglas Thain)
  • [Makeflow] --skip-file-check added to mitigate overhead on network file systems. (Nick Hazekamp)
  • [Makeflow] Added Amazon batch job interface. (Charles Shinaver)
  • [Resource Monitor] Network bandwidth, bytes received, and sent are now recorded. (Ben Tovar)
  • [Work Queue] Tasks may be grouped into categories, for resource control and fast abort. (Ben Tovar)
  • [Work Queue] work_queue_pool was renamed to work_queue_factory. (Douglas Thain)
  • [Work Queue] --condor-requirements to specify arbitrary HTCondor requirements in work_queue_factory. (Chao Zheng)
  • [Work Queue] --factory-timeout to terminate work_queue_factory when no master is active. (Neil Butcher)
  • [Work Queue] Compile-time option to specify default local settings in sge_submit_workers. (Ben Tovar)
  • [Umbrella]   Several bugfixes. (Haiyan Meng, Alexander Vyushkov)
  • [Umbrella]   Added OSF and S3 communication. (Haiyan Meng)
  • [Umbrella]   Added EC2 execution engine. (Haiyan Meng)
  • [Parrot] Several bug-fixes for memory mappings. (Patrick Donnelly)
  • [Parrot] All compiled services are shown under /. (Tim Shaffer)
  • [Parrot] POSIX directory semantics. (Tim Shaffer)
  • [Parrot] Added new syscalls from Linux kernel 4.3. (Patrick Donnelly)

Thanks go to the contributors for many features, bug fixes, and tests:

  • Jakob Blomer
  • Neil Butcher
  • Patrick Donnelly
  • Nathaniel Kremer-Herman
  • Nicholas Hazekamp
  • Peter Ivie
  • Kevin Lannon
  • Haiyan Meng
  • Tim Shaffer
  • Douglas Thain
  • Ben Tovar
  • Alexander Vyushkov
  • Rodney Walker
  • Mathias Wolf
  • Anna Woodard
  • Chao Zheng

Please send any feedback to the CCTools discussion mailing list:

http://ccl.cse.nd.edu/software/help/

Enjoy!

Wednesday, February 3, 2016

Preservation Talk at Grid-5000

Prof. Thain gave a talk titled Preservation and Portability in Distributed Scientific Applications at the Grid-5000 Winter School on distributed computing in Grenoble, France.  The talk gave a broad overview of our recent efforts, including distributing software with Parrot and CVMFS, preserving high energy physics applications with Parrot, specifying environments with Umbrella, and preserving workflows with PRUNE.


Tuesday, January 12, 2016

Summer REU in DISC at Notre Dame

REU in Data Intensive Scientific Computing (DISC) at the University of Notre Dame
DISC combines big data, big science, and big computers at the University of Notre Dame. The only thing missing is you!


We invite outstanding undergraduates to apply for a summer research experience in DISC at the University of Notre Dame.  Students will spend ten weeks learning how to use high performance computing and big data technologies to attack scientific problems.  Majors in computer science, physics, biology, and related topics are encouraged to apply.

For more information:
http://disc.crc.nd.edu