Monday, May 9, 2016

Balancing Push and Pull in Confuga, an Active Storage Cluster File System for Scientific Workflows

Patrick Donnelly has published a journal article in Concurrency and Computation: Practice and Experience on the Confuga active cluster file system. The journal article presents the use of controlled transfers to distribute data within the cluster without destabilizing the resources of storage nodes:

Confuga is a new active storage cluster file system designed for executing regular POSIX workflows. Users may store extremely large datasets on Confuga in a regular file system layout, with whole files replicated across the cluster. They may then operate on those datasets using regular POSIX applications with defined inputs and outputs.

Each job executes with full data locality: all of its whole-file dependencies are available in its own private sandbox. To make this possible, Confuga copies a job's missing data to the target storage node before dispatching the job. This journal article examines two transfer mechanisms used in Confuga to manage this data movement: push and pull.

A push transfer directs a storage node to copy a file to another storage node. Pushes are centrally managed by the head node, which allows it to schedule transfers in a way that avoids destabilizing the cluster or individual storage nodes. To avoid some potential inefficiencies of centrally managed transfers, Confuga also uses pull transfers, which resemble file access in a typical distributed file system: a job pulls its missing data dependencies from other storage nodes prior to execution.
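As a rough illustration of why central management helps, here is a minimal sketch of a head node that queues push requests and starts a transfer only when both the source and the target storage node have spare capacity. This is not Confuga's actual scheduler: the class name `PushScheduler`, the `max_per_node` limit, and the node names are all hypothetical and chosen only for the example.

```python
from collections import defaultdict, deque


class PushScheduler:
    """Toy head-node scheduler (hypothetical, not Confuga's implementation).

    It caps the number of concurrent transfers that touch any one storage node.
    """

    def __init__(self, max_per_node=1):
        self.max_per_node = max_per_node  # assumed per-node transfer limit
        self.active = defaultdict(int)    # node -> transfers currently in flight
        self.pending = deque()            # queued (file_id, source, target) tuples

    def request(self, file_id, source, target):
        """Queue a push of file_id from source to target."""
        self.pending.append((file_id, source, target))

    def dispatch(self):
        """Start every queued transfer whose source and target have capacity."""
        started, waiting = [], deque()
        while self.pending:
            transfer = self.pending.popleft()
            _, source, target = transfer
            if (self.active[source] < self.max_per_node
                    and self.active[target] < self.max_per_node):
                self.active[source] += 1
                self.active[target] += 1
                started.append(transfer)
            else:
                waiting.append(transfer)  # keep it queued, in arrival order
        self.pending = waiting
        return started

    def complete(self, transfer):
        """Release the capacity held by a finished transfer."""
        _, source, target = transfer
        self.active[source] -= 1
        self.active[target] -= 1
```

With a limit of one transfer per node, two pushes that share a source node run one after the other, while a push between two unrelated nodes can start right away. Uncoordinated pulls have no such gatekeeper, which is the trade-off the article explores.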

This journal article examines the trade-offs of the two approaches and settles on a balanced approach in which pulls are used to transfer smaller files and pushes are used for larger files. This results in significant performance improvements for scientific workflows with large data dependencies. For example, two bioinformatics workflows we studied, a Burrows-Wheeler Aligner (BWA) workflow and an Iterative Alignments of Long Reads (IALR) workflow, achieved 48% and 77% reductions in execution time, respectively, compared to using either a push-only or a pull-only strategy.
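The balanced policy can be pictured as a simple size cutoff applied to each dependency that is missing from the target node. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the 64 MiB default threshold is a placeholder, since this post does not state the cutoff Confuga actually uses.

```python
# Placeholder cutoff for illustration only; the real value is not given here.
ASSUMED_THRESHOLD_BYTES = 64 * 1024 * 1024


def choose_transfer(size_bytes, threshold_bytes=ASSUMED_THRESHOLD_BYTES):
    """Small files are pulled by the job; large files are pushed by the head node."""
    return "pull" if size_bytes < threshold_bytes else "push"


def plan_transfers(dependencies, present_on_target,
                   threshold_bytes=ASSUMED_THRESHOLD_BYTES):
    """Split a job's missing dependencies into pull and push lists.

    dependencies: dict mapping file name -> size in bytes
    present_on_target: set of file names already replicated on the target node
    """
    plan = {"pull": [], "push": []}
    for name, size in sorted(dependencies.items()):
        if name in present_on_target:
            continue  # already local, nothing to transfer
        plan[choose_transfer(size, threshold_bytes)].append(name)
    return plan
```

For a job whose inputs include a multi-gigabyte reference file and a few small files, this plan would hand the large file to the centrally scheduled push path and let the job fetch the small ones itself.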

For further details, please check out our journal article here. Confuga is available as part of the Cooperative Computing Toolset distributed under the GNU General Public License. For usage instructions, see the Confuga manual and man page.

See also our previous blog post, which introduced Confuga.
