Wednesday, March 22, 2017

Ph.D. Defense: Haiyan Meng

Haiyan Meng successfully defended her dissertation, titled "Improving the Reproducibility of Scientific Applications with Execution Environment Specifications."  Congratulations!

Thursday, March 9, 2017

Makeflow Examples Archive

We recently updated our archive of example Makeflows so that they are significantly easier to download, execute, and reshape to various sizes.  For each one, we provide instructions on how to obtain the underlying binary program, generate some sample data, and then create a workload of arbitrary size.  This allows you to experiment with Makeflow at a small scale, and then dial things up when you are ready to run on thousands of nodes:
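Each example in the archive follows the same make-style pattern: a rule names its output files, its input files, and the command that produces them. As a minimal sketch (the program and file names here are hypothetical, not taken from the archive):

```make
# Hypothetical Makeflow rule: produce out.dat from in.dat using sim.exe.
# Outputs come before the colon, inputs after; the command is indented below.
out.dat: in.dat sim.exe
	./sim.exe < in.dat > out.dat
```

Running `makeflow example.mf` executes the rules on the local machine; switching to a batch system with an option like `-T condor` scales the same workload up to a cluster.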

Friday, February 17, 2017

Big CMS Data Analysis at Notre Dame

Analyzing the data produced by the Compact Muon Solenoid (CMS), one of the experiments at the Large Hadron Collider, requires a collaboration of physicists and computer scientists who harness hundreds of thousands of computers at universities and research labs around the world.  The contribution of each site to the global effort, whether small or large, is reported on a regular basis.

This recent graph tells an interesting story about contributions to CMS computing in late 2016.  Each color in the bar graph represents the core-hours provided by a given site over the course of a week:

The various computing sites are divided into tiers:

  • Tier 0 is CERN, which is responsible for providing data to the lower tiers.
  • Tier 1 contains the national research labs, like Fermi National Lab (FNAL) and Rutherford Appleton Lab in the UK, that facilitate analysis work for universities in their countries.
  • Tier 2 contains universities like Wisconsin, Purdue, and MIT, that have significant shared computing facilities dedicated to CMS data analysis.
  • Tier 3 is everyone else performing custom data analysis, sometimes on private clusters, and sometimes on borrowed resources. Most of those are so small that they are compressed into black at the bottom of the graph.

Now, you would think that the big national sites would produce most of the cycles, but there are a few interesting exceptions at the top of the list.

First, there are several big bursts in dark green that represent the contribution of the HEPCloud prototype, which is technically a Tier-3 operation, but is experimenting with consuming cycles from Google and Amazon.  This approach has been successful at delivering big bursts of computation, and the next question is whether it will be cost-effective over the long term.

Next, the Tier-2 at the University of Wisconsin consistently produces a huge number of cycles from their dedicated facility and opportunistic resources from the Center for High Throughput Computing.  This group works closely with the HTCondor team at Wisconsin to make sure every cycle gets used, 365 days a year.

Following that, you have the big computing centers at CERN and FNAL, which is no surprise.

And, then the next contributor is our own little Tier-3 at Notre Dame, which frequently produces more cycles than most of the Tier-2s and some of the Tier-1s!  The CMS group at ND harnesses a small dedicated cluster, and then adds to that unused cycles from our campus Center for Research Computing by using Lobster and the CCL Work Queue software on top of HTCondor.

The upshot is, on a good day, a single grad student from Notre Dame can perform data analysis at a scale that rivals our national computing centers!

Thursday, February 2, 2017

IceCube Flies with Parrot and CVMFS

IceCube is a neutrino detector built at the South Pole by instrumenting about a cubic kilometer of ice with 5160 light sensors. The IceCube data is analyzed by a collaboration of about 300 scientists from 12 countries. Data analysis relies on the precise knowledge of detector characteristics, which are evaluated by vast amounts of Monte Carlo simulation.  On any given day, 1000-5000 jobs are continuously running.

Recently, the experiment began using Parrot to get their code running on GPU clusters at XSEDE sites (Comet, Bridges, and xStream) and the Open Science Grid.  IceCube relies on software distribution via CVMFS, but not all execution sites provide the necessary FUSE modules.  By using Parrot, jobs can attach to remote software repositories without requiring special privileges or kernel modules.
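As a sketch of the approach (the repository name below is IceCube's public CVMFS repository; the exact payload command is hypothetical), a job simply wraps its work in `parrot_run`, which intercepts filesystem calls so that `/cvmfs` paths resolve entirely in user space:

```shell
# Run a command under Parrot so /cvmfs is visible without FUSE or root access.
parrot_run ls /cvmfs/icecube.opensciencegrid.org

# The same wrapper works for a whole job script:
parrot_run /bin/sh run_simulation.sh
```

Because Parrot runs as an ordinary user process, the same job script works unchanged on sites with or without a native CVMFS mount.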

- Courtesy of Gonzalo Merino, University of Wisconsin - Madison

Tuesday, October 25, 2016

Reproducibility Papers at eScience 2016

CCL students presented two papers at the IEEE 12th International Conference on eScience on the theme of reproducibility in computational science:
Congrats to Haiyan for winning a Best of Conference award for her paper:

CCL Workshop 2016

The 2016 CCL Workshop on Scalable Scientific Computing was held on October 19-20 at the University of Notre Dame.  We offered tutorials on Makeflow, Work Queue, and Parrot, and gave highlights of the many new capabilities relating to reproducibility and container technologies. Our user community gave presentations describing how these technologies are used to accelerate discovery in genomics, high energy physics, molecular dynamics, and more.
Everyone got together to share a meal, solve problems, and generate new ideas. Thanks to everyone who participated, and see you next year!

Tuesday, September 20, 2016

NSF Grant to Support CCTools Development

We are pleased to announce that our work will continue to be supported by the National Science Foundation through the Division of Advanced Cyberinfrastructure.

The project is titled "SI2-SSE: Scaling up Science on Cyberinfrastructure with the Cooperative Computing Tools."  It will advance the development of the Cooperative Computing Tools to meet the changing technology landscape in three key respects: exploiting container technologies, making efficient use of local concurrency, and performing capacity management at the workflow scale.  We will continue to focus on active user communities: high energy physics experiments, which rely on Parrot for global-scale filesystem access on campus clusters and the Open Science Grid; bioinformatics users executing complex workflows via the VectorBase, LifeMapper, and CyVerse disciplinary portals; and ensemble molecular dynamics applications that harness GPUs from XSEDE and commercial clouds.

Thursday, September 15, 2016

Announcement: CCTools 6.0.0 released

The Cooperative Computing Lab is pleased to announce the release of version 6.0.0 of the Cooperative Computing Tools, including Parrot, Chirp, Makeflow, Work Queue, Umbrella, Prune, SAND, All-Pairs, Weaver, and other software.

The software may be downloaded here:

This is a major release which adds several features and bug fixes. Among them:

  • [Catalog]   Automatic fallback to a backup catalog server. (Tim Shaffer)
  • [Makeflow]  Accept DAGs in JSON format. (Tim Shaffer)
  • [Makeflow]  Multiple documentation omission bugs. (Nick Hazekamp and Haiyan Meng)
  • [Makeflow]  Send information to catalog server. (Kyle Sweeney)
  • [Makeflow]  Syntax directives (e.g. .SIZE to indicate file size). (Nick Hazekamp)
  • [Parrot] Fix cvmfs logging redirection. (Jakob Blomer)
  • [Parrot] Multiple bug-fixes. (Tim Shaffer, Patrick Donnelly, Douglas Thain)
  • [Parrot] Timewarp mode for reproducible runs. (Douglas Thain)
  • [Parrot] Use new libcvmfs interfaces if available. (Jakob Blomer)
  • [Prune]     Use SQLite as backend. (Peter Ivie)
  • [Resource Monitor] Record the time where a resource peak occurs. (Ben Tovar)
  • [Resource Monitor] Report the peak number of cores used. (Ben Tovar)
  • [Work Queue] Add a transactions log. (Ben Tovar)
  • [Work Queue] Automatic resource labeling and monitoring. (Ben Tovar)
  • [Work Queue] Better capacity worker autoregulation. (Ben Tovar)
  • [Work Queue] Creation of disk allocations per task. (Nate Herman-Kremer)
  • [Work Queue] Extensive updates to wq_maker. (Nick Hazekamp)
  • [Work Queue] Improvements in computing the master's task capacity. (Nate Herman-Kremer)
  • [Work Queue] Raspberry Pi compilation fixes. (Peter Bui)
  • [Work Queue] Throttle work_queue_factory with --workers-per-cycle. (Ben Tovar)
  • [Work Queue] Unlabeled tasks are assumed to consume 1 core, 512 MB RAM and 512 MB disk. (Ben Tovar)
  • [Work Queue] Worker disconnects when its node no longer has the resources promised. (Ben Tovar)
  • [Work Queue] Clean up of work queue statistics (see work_queue.h for deprecated names). (Ben Tovar)
  • [Work Queue] work_queue_status respects terminal column settings. (Mathias Wolf)
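Among the new features above, the JSON DAG support lets workflows be generated programmatically instead of written by hand. A minimal sketch of what such a DAG might look like (the file and program names are hypothetical, and the exact command-line option for selecting JSON input may vary by version, so check the manual):

```json
{
  "rules": [
    {
      "command": "./sim.exe < in.dat > out.dat",
      "inputs": [ "sim.exe", "in.dat" ],
      "outputs": [ "out.dat" ]
    }
  ]
}
```

The structure mirrors the classic Makeflow rule: each entry names its command, inputs, and outputs, which is all Makeflow needs to schedule the task and move its files.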

We will have tutorials on the new features at our upcoming workshop, October 19 and 20. We hope you can join us!

Thanks go to the contributors for many features, bug fixes, and tests:

  • Jakob Blomer
  • Peter Bui
  • Patrick Donnelly
  • Nathaniel Kremer-Herman
  • Kenyi Hurtado-Anampa
  • Peter Ivie
  • Kevin Lannon
  • Haiyan Meng
  • Tim Shaffer
  • Douglas Thain
  • Ben Tovar
  • Kyle Sweeney
  • Mathias Wolf
  • Anna Woodard
  • Chao Zheng

Please send any feedback to the CCTools discussion mailing list:


Friday, August 19, 2016

Summer REU Projects in Data Intensive Scientific Computing

We recently wrapped up the first edition of the summer REU in Data Intensive Scientific Computing at the University of Notre Dame.  Ten undergraduate students came to ND from around the country and worked on projects encompassing physics, astronomy, bioinformatics, network sciences, molecular dynamics, and data visualization with faculty at Notre Dame.

To learn more, see these videos and posters produced by the students:


Wednesday, August 10, 2016

Simulation of HP24stab with AWE and Work Queue

The villin headpiece subdomain "HP24stab" is a recently discovered 24-residue stable supersecondary structure that consists of two helices joined by a turn. Simulating 1μs of motion for HP24stab can take days or weeks depending on the available hardware, and folding events take place on a scale of hundreds of nanoseconds to microseconds.

Using the Accelerated Weighted Ensemble (AWE), a total of 19μs of trajectory data were simulated over the course of two months using the OpenMM simulation package. These trajectories were then clustered and sampled to create an AWE system of 1000 states and 10 models per state. A Work Queue master dispatched 10,000 simulations to a peak of 1000 connected 4-core workers, for a total of 250ns of concurrent simulation time and 2.5μs per AWE iteration.

As of August 8, 2016, the system has run continuously for 18 days and completed 71 iterations, for a total of 177.5μs of simulation time. The data gathered from these simulations will be used to report the mean first passage time, or average time to fold, for HP24stab, as well as the major folding pathways.

- Jeff Kinnison and Jesús Izaguirre, University of Notre Dame
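The reported totals hang together under a quick back-of-the-envelope check (the per-task trajectory length below is derived from the stated figures, not taken from the original):

```python
# Back-of-the-envelope check of the AWE / Work Queue numbers reported above.
tasks_per_iteration = 10_000
ns_per_iteration = 2_500                               # 2.5 us per AWE iteration, in ns
ns_per_task = ns_per_iteration / tasks_per_iteration   # implies ~0.25 ns simulated per task

peak_workers = 1_000
concurrent_ns = peak_workers * ns_per_task             # 250 ns running concurrently at peak

iterations = 71
total_us = iterations * ns_per_iteration / 1_000       # 177.5 us of total simulation time

print(ns_per_task, concurrent_ns, total_us)
```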