Friday, March 27, 2015

Confuga: Scalable Data Intensive Computing for POSIX Workflows


Patrick Donnelly will present his work on the Confuga distributed filesystem at CCGrid 2015 in China:

Patrick Donnelly, Nicholas Hazekamp, Douglas Thain, Confuga: Scalable Data Intensive Computing for POSIX Workflows, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May, 2015.
Confuga is a new active storage cluster file system designed for executing regular POSIX workflows. You may store extremely large datasets on Confuga in a regular file system layout, with whole files replicated across the cluster. You may then operate on your dataset using regular POSIX applications with defined inputs and outputs.



Confuga handles the details of placing jobs near data and minimizing network load so that the cluster's disk and network resources are used efficiently. Each job executes with all of its input file dependencies local to its execution, within a sandbox.
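
To make the idea of placing jobs near data concrete, here is a minimal sketch of one possible greedy heuristic: run each job on the node that already holds the most bytes of its inputs, so that the fewest bytes must cross the network. This is an illustration only and is not Confuga's actual scheduler. The function names (pick_node, bytes_to_transfer) and the dictionary layout are hypothetical.

```python
from typing import Dict, Set


def pick_node(job_inputs: Dict[str, int],
              replicas: Dict[str, Set[str]],
              nodes: Set[str]) -> str:
    """Return the node that already stores the most input bytes for a job.

    job_inputs maps an input file name to its size in bytes.
    replicas maps a file name to the set of nodes holding a whole-file replica.
    (Both structures are assumptions made for this sketch.)
    """
    def local_bytes(node: str) -> int:
        return sum(size for name, size in job_inputs.items()
                   if node in replicas.get(name, set()))

    # Iterating in sorted order makes ties resolve to the smallest node name.
    return max(sorted(nodes), key=local_bytes)


def bytes_to_transfer(job_inputs: Dict[str, int],
                      replicas: Dict[str, Set[str]],
                      node: str) -> int:
    """Bytes that must be copied to `node` before the job can run there."""
    return sum(size for name, size in job_inputs.items()
               if node not in replicas.get(name, set()))
```

With inputs {"a.dat": 100, "b.dat": 10}, where a.dat is stored on n1 and b.dat on n2 and n3, the heuristic picks n1 and only the 10 bytes of b.dat need to be moved.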

For those familiar with CCTools, Confuga operates as a cluster of Chirp servers with a single Chirp server operating as the head node. You may use the Chirp library, Chirp CLI toolset, FUSE, or even Parrot to upload and manipulate the data on Confuga.

To run a workflow on Confuga, we encourage you to use Makeflow. Makeflow will submit the jobs to Confuga using the Chirp job protocol and take care of ordering the jobs based on their dependencies.
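
As a rough illustration of what ordering jobs by their dependencies means, the sketch below sorts a list of rules so that every rule runs after the rules that produce its inputs (a standard topological sort, Kahn's algorithm). It is not Makeflow's implementation. The rule representation as (outputs, inputs, command) tuples and the name order_rules are assumptions made for this example.

```python
from collections import deque


def order_rules(rules):
    """Return the commands of `rules` in an order that respects file dependencies.

    rules is a list of (outputs, inputs, command) tuples (a hypothetical,
    simplified stand-in for the rules of a workflow).
    """
    # Map each produced file to the index of the rule that creates it.
    producer = {}
    for i, (outputs, _, _) in enumerate(rules):
        for name in outputs:
            producer[name] = i

    dependents = [[] for _ in rules]   # rules waiting on rule i
    pending = [0] * len(rules)         # unfinished prerequisites per rule
    for i, (_, inputs, _) in enumerate(rules):
        for p in {producer[name] for name in inputs if name in producer}:
            dependents[p].append(i)
            pending[i] += 1

    ready = deque(i for i, count in enumerate(pending) if count == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(rules[i][2])
        for j in dependents[i]:
            pending[j] -= 1
            if pending[j] == 0:
                ready.append(j)

    if len(order) != len(rules):
        raise ValueError("rules contain a dependency cycle")
    return order
```

Rules whose inputs already exist (that is, are produced by no other rule) are ready immediately; everything else becomes ready once its producers have finished.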


Tuesday, March 24, 2015

Makeflow Visualization with Cytoscape

We have created a new Makeflow visualization module which exports a workflow into an XGMML file compatible with Cytoscape. Cytoscape is a powerful network graphing application with support for custom styles, layouts, annotations, and more. While this program is best known for visualizing molecular networks in biology, it can be used for any kind of graph, and we believe it is a powerful tool for visualizing makeflow tasks. Our visualization module was designed for and tested on Cytoscape 3.2. The following picture is a Cytoscape visualization of the example makeflow script provided in the User’s Manual (http://ccl.cse.nd.edu/software/manuals/makeflow.html):



To generate a Cytoscape graph from your makeflow script, simply run:

makeflow_viz -D cytoscape workflow.mf > workflow.xgmml
workflow.xgmml can then be opened in Cytoscape through File -> Import -> Network -> File. We have created a clean style specifically for visualizing makeflow tasks. It is saved as style.xml in the present working directory when you run makeflow_viz. To apply the style in Cytoscape, select File -> Import -> Style, and select the style.xml file. Next, right-click the imported network and select “Apply Style…”. Select “makeflow” from the dropdown menu and our style will be applied. This will add the proper colors, edges, arrows, and shapes for processes and files.
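
Before importing a large workflow, it can help to know how big the exported graph is. The following is a small sketch that counts the nodes and edges in an XGMML document using only the Python standard library. It assumes the exported file uses the standard XGMML element names node and edge, and the function name count_graph is our own.

```python
import xml.etree.ElementTree as ET


def count_graph(xgmml_data):
    """Return (number of nodes, number of edges) found in an XGMML document.

    xgmml_data is the file content as bytes or str. Element names are
    compared with any XML namespace prefix stripped, on the assumption
    that the file uses the standard XGMML 'node' and 'edge' elements.
    """
    root = ET.fromstring(xgmml_data)
    nodes = edges = 0
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" if present
        if tag == "node":
            nodes += 1
        elif tag == "edge":
            edges += 1
    return nodes, edges
```

For example, count_graph(open("workflow.xgmml", "rb").read()) would report the size of the file generated by the command above.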

Cytoscape also has a built-in layout function which can be used to automatically rearrange nodes according to their hierarchy. To access this, select Layout -> Settings, and a new window will pop up. Simply select “Hierarchical Layout” from the dropdown menu, change the settings for that layout to your liking, and select “Execute Layout.” There is a caveat with this function: with larger makeflow tasks, the auto-layout can take a long time to complete. Cytoscape is designed for all types of graphs, and it does not appear to implement algorithms specific to DAGs that would offer better time complexity. We have tested the auto-layout function with the following test cases:

Number of nodes    Number of edges    Time to layout nodes
114                258                20-30 seconds
2213               11526              2.5 hours
15245              30478              23 hours
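
To illustrate the kind of DAG-specific algorithm mentioned above, here is a minimal sketch of longest-path layering, which assigns each node a hierarchy level in time linear in the number of nodes and edges. It covers only the level-assignment step of a hierarchical layout (not node ordering within a level or edge routing), and it makes no claim about how Cytoscape works internally. The function name layer_dag and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict, deque


def layer_dag(edges):
    """Map each node to its layer: the length of the longest path reaching it.

    edges is an iterable of (source, target) pairs. Runs in O(V + E) by
    processing nodes in topological order.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))

    # Nodes with no incoming edges form layer 0.
    layer = {n: 0 for n in nodes if indegree[n] == 0}
    queue = deque(layer)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            layer[v] = max(layer.get(v, 0), layer[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:   # all predecessors of v are placed
                queue.append(v)

    if visited != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return layer
```

In a makeflow graph, files and processes alternate along each path, so the layer number of a node directly gives its row in a top-to-bottom drawing.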

After the layout completes, the graph should be cleanly arranged, and you can customize the display further with the various options available in Cytoscape. For more information about Cytoscape, visit http://cytoscape.org