Monday, December 4, 2017

CCL on Chameleon Cloud with ACIC

As has been a tradition for several years, the CCL has had the opportunity to teach about the CCTools and distributed computing as part of the Applied Cyberinfrastructure Concepts (ACIC) course at the University of Arizona, taught by Dr. Nirav Merchant and Dr. Eric Lyons. Because of the number of features added recently, this year we focused primarily on Makeflow and how we use it on the Cloud and with containers. The topics we talked about were:
  • Thinking opportunistically
  • Overview of the Cooperative Computing Tools
  • Makeflow
  • Makeflow using Work Queue as a batch system
  • Makeflow using the Cloud as a batch system
  • Specifying and managing resources
  • Using containers in and on a Makeflow
  • Specifying Makeflow with JSON and JX
The major topic I want to focus on here is running Makeflow on the Cloud. For several months we have supported Makeflow submitting to Amazon EC2 directly, and an upcoming release will incorporate support for the Amazon Batch system. For this class we also worked on deploying CCTools on Chameleon Cloud, which is a "configurable experimental environment for large-scale cloud research." Chameleon Cloud is a great test bed for researchers to deploy cloud instances, and it utilizes the OpenStack KVM interface.

TPDS Paper: Storage Management in Makeflow

As the scale of workflows and their data grows, it becomes increasingly difficult to execute them within the provided storage. The problem is exacerbated when running multiple workflows at the same time, often sharing the same resources. Until recently, users would often guess at the total size of the workflow and run until failure, then remove older experiments and data to make room, creating a time-consuming cycle of run->fail->clean until enough space was available.

The initial solution, which is effective in many cases, is to turn on garbage collection. Garbage collection in Makeflow tracks each file from its creation until it is no longer needed, at which point it is deleted. This works well to limit the active footprint of the workflow. However, the user still does not know how much space is needed to execute.
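
To make the idea concrete, here is a minimal sketch of reference-counted garbage collection over a workflow. It is not Makeflow's implementation: the rule names, file names, and the `simulate_gc` helper are all hypothetical, and the rules are assumed to be listed in a valid execution order.

```python
from collections import Counter

def simulate_gc(rules, keep):
    """Report which files become deletable after each rule finishes.

    rules: list of (name, inputs, outputs), assumed to be in a valid order.
    keep:  files that must survive the run (for example, final results).
    """
    # Count how many rules still need each file as an input.
    pending = Counter(f for _, inputs, _ in rules for f in inputs)
    created = set()   # files produced by the workflow and still on disk
    events = []
    for name, inputs, outputs in rules:
        created.update(outputs)
        deleted = []
        for f in inputs:
            pending[f] -= 1
            # Only delete files the workflow itself created, never the
            # original inputs, and only once no later rule needs them.
            if pending[f] == 0 and f in created and f not in keep:
                created.discard(f)
                deleted.append(f)
        events.append((name, sorted(deleted)))
    return events

# Hypothetical four-rule workflow used only for illustration.
rules = [
    ("split",   ["input.fa"],       ["a.part", "b.part"]),
    ("align_a", ["a.part"],         ["a.out"]),
    ("align_b", ["b.part"],         ["b.out"]),
    ("merge",   ["a.out", "b.out"], ["result.txt"]),
]
events = simulate_gc(rules, keep={"result.txt"})
for name, deleted in events:
    print(name, deleted)
```

Each intermediate file is removed as soon as its last consumer finishes, which keeps the active footprint small but says nothing in advance about how large that footprint will get.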

To resolve this, we added an algorithm that estimates the size of the workflow and the minimum storage needed to execute it. This is done by determining the different paths of execution and finding the resulting minimum path(s) through the workflow. The estimate is most accurate when the files in the Makeflow are labeled with estimated sizes:
.SIZE test.dat 1024K
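
As an illustration of what such a label carries, the sketch below reads `.SIZE` lines into a table of byte counts. It is not Makeflow's parser: the helper names are hypothetical, and it assumes the K, M, and G suffixes denote multiples of 1024.

```python
# Assumption: suffixes are binary multiples (K = 1024 bytes, and so on).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(text):
    """Convert a label such as '1024K' or '5M' to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # no suffix: treat the value as plain bytes

def read_size_labels(lines):
    """Collect '.SIZE <file> <size>' lines into a {file: bytes} dict."""
    sizes = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == ".SIZE":
            sizes[parts[1]] = parse_size(parts[2])
    return sizes

labels = read_size_labels([".SIZE test.dat 1024K"])
print(labels)
```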

Using this information, Makeflow can statically analyze the workflow and tell you the minimum and maximum storage needed to execute. This information can then be coupled with a run-time storage manager and garbage collection to stay within a user-specified limit. Instead of actively trying to schedule in an order that prevents going over the limit, a node is submitted when there is enough space for it to run and for all of its children. This allows more concurrency when space permits. Below are images that show how this limit can be used at different levels.
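
The admission rule described above can be sketched as follows. This is an illustration under stated assumptions, not the run-time manager itself: the node names and sizes are made up, sizes are in MB, and "space for all of its children" is read here as the outputs of a node's direct children only.

```python
def reservation(node, output_size, children):
    """Space to set aside before submitting a node: its own outputs
    plus the outputs of its direct children (an assumption)."""
    return output_size[node] + sum(output_size[c] for c in children.get(node, []))

def admit(ready, output_size, children, in_use, limit):
    """Submit ready nodes, in order, while their reservations fit under the limit."""
    submitted = []
    for node in ready:
        need = reservation(node, output_size, children)
        if in_use + need <= limit:
            submitted.append(node)
            in_use += need
    return submitted, in_use

# Hypothetical workflow: three independent branches, each with one child.
output_size = {"branch1": 2048, "branch2": 2048, "branch3": 2048,
               "post1": 2048, "post2": 2048, "post3": 2048}
children = {"branch1": ["post1"], "branch2": ["post2"], "branch3": ["post3"]}
ready = ["branch1", "branch2", "branch3"]

roomy = admit(ready, output_size, children, in_use=0, limit=10240)
tight = admit(ready, output_size, children, in_use=0, limit=4096)
print(roomy)
print(tight)
```

With the larger limit two branches run concurrently, while the tighter limit serializes them. Nothing is reordered; nodes simply wait until enough space has been freed.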

This first image shows a bioinformatics workflow running using the minimum required space. We can see several (10) peaks in the workflow. Each of these corresponds to a larger set of intermediate files that can be removed later. In the naive case, where we don't track storage, these can all occur at the same time, using more storage than may be available.
In this second case, we set a limit on storage that is higher than the minimum. We can see similar spikes, but the run-time manager only schedules as many branches as can run under the limit.
To use this feature, which is now released and available in the main branch of CCTools, here are the steps:
  1. Label the files. A slight over-estimate works well, since the exact size is usually not known ahead of time. The default size is 1G.
    .SIZE test.dat 5M
  2. Find the estimated size of the Makeflow.
    makeflow --storage-print storage-estimate.dat
  3. Run Makeflow, setting a mode and a limit. The limit can be anywhere between the min and the max. Type 1 selects min tracking, which holds storage use below the set limit.
    makeflow --storage-type 1 --storage-limit 10G