Monday, August 1, 2022

iSURE Project: Visualizing and Right Sizing Work Queue Applications

Samuel Huang, an exchange student in the iSURE program, recently completed a summer project with the Cooperative Computing Lab at the University of Notre Dame.  He developed tools for visualizing the performance and behavior of distributed Work Queue applications.  These applications can run on thousands of nodes and may have surprisingly complex behavior over time.  Visualization is key to understanding what's going on.

For example, this display shows an application consisting of about 30,000 tasks.  Each line segment shows one task from beginning to end, sorted by submission time.  (The color indicates the type of each task: preprocessing, processing, or accumulation.). As this display clearly shows, this application goes through several distinct phases, in which tasks of different types take increasing amounts of time.  In fact, the last few thousands tasks take much longer, showing a classic "long tail" behavior common to distributed applications.

This display is of the same application, but showing the utilization of the worker processes in the system.  Here you can see delays in provisioning.  The first 60-some workers arrive quite quickly, and the manager is able to dispatch (preprocessing) tasks to them quickly.  The next 200-some workers arrive another minute in, and task some time to be fully utilized, due to the requirements of moving software dependencies for each task.  Finally, at the end of execution, some additional workers become available, but go unutilized due to the change in task resource consumption.

Both these display are now integrated into CCTools in the work_queue_graph_workers tool, which generates a dynamic webpage for digging into the detailed data.

REU Project: TopEFT Performance Analysis: Solving Bottlenecks in Data Transfer and Task Resource Management

Andrew Hennessy, a junior at Notre Dame, recently completed a summer REU project in which he analyzed and improved the performance of TopEFT, a high energy physics analysis application built using the Coffea framework and the Work Queue distributed execution system.

This applications runs effectively in production, but takes about one hour to complete an analysis on O(1000) cores -- we would like to get it down to fifteen minutes or less in order to enable "near real time" analysis capability.   Andrew built a visualization of the accumulation portion and observed one problem: the data flow is highly unbalanced, resulting in the same data moving around multiple times.  He modified the scheduling of the accumulation step, resulting in a balanced tree with reduced I/O impact.

Next, he observed that processing tasks acting on different datasets have different resource needs: tasks consuming monte carlo (simulated) data take much more time and memory than tasks consuming Production (acquired) data.  This results in a slowdown as the system gradually adjusts to the changing task size.  The solution here is to ask the user to label the datasets appropriately, and place the Monte Carlo and Production tasks in different "categories" in Work Queue, so that resource prediction will be more accurate.

See the entire poster here:

REU Project: Integrating Serverless and Task Computation in Work Queue

David Simonetti, a junior undergraduate at Notre Dame, recently completed a summer REU project in which he added "serverless" computing capabilities to the Work Queue distributed computing framework.

Work Queue has historically used a "task" abstraction in which a complete program with its input files is submitted for remote execution.  David added a capability in which a coprocessor is attached to each worker, providing a hot function execution environment.  Then, lightweight tasks representing single function executions can be sent throughout the distributed system, making use of the existing scheduling, resource management, and data movement capabilities of Work Queue.

This allows for the integrated execution of both conventional whole-process tasks and lightweight functions within the same framework.  Check out the full poster here: