Cooperative Computing Lab News

Monday, August 24, 2020

CCTools version 7.1.7 released

 

The Cooperative Computing Lab is pleased to announce the release of version 7.1.7 of the Cooperative Computing Tools including Parrot, Chirp, JX, Makeflow, WorkQueue, and other software.

The software may be downloaded here:
http://ccl.cse.nd.edu/software/download


This is a bug-fix release with some new features. Changes include:

  • [Batch] Set number of MPI processes for SLURM. (Ben Tovar)
  • [General] Use the right signature when overriding gettimeofday. (Tim Shaffer)
  • [Resource Monitor] Add context-switch count to final summary. (Ben Tovar)
  • [Resource Monitor] Fix kbps to Mbps typo in final summary. (Ben Tovar)
  • [WorkQueue] Update example apps to python3. (Douglas Thain)

Thanks go to the contributors for many features, bug fixes, and tests:

  • Ben Tovar
  • Cami Carballo
  • Douglas Thain
  • Nathaniel Kremer-Herman
  • Tanner Juedeman
  • Tim Shaffer

Please send any feedback to the CCTools discussion mailing list:

http://ccl.cse.nd.edu/community/forum


Enjoy!






Posted by Benjamin Tovar at 9:25 AM

Friday, August 14, 2020

Resource usage histograms for Work Queue using Python's pandas + matplotlib

Work Queue is a framework for writing and executing master-worker applications. A master process, which can be written in Python, Perl, or C, generates tasks that are then executed remotely by worker processes. You can learn more about Work Queue on the CCL web site: http://ccl.cse.nd.edu

Work Queue can automatically measure the resources, such as cores, memory, disk, and network bandwidth, used by each task. In Python, this is enabled as follows:

import work_queue as wq
q = wq.WorkQueue(port=9123)
q.enable_monitoring()


The resources measured are available as part of the task structure:

# wait up to 5 seconds for a task to finish
t = q.wait(5)
if t:
    print("Task {id} measured memory: {memory} MB" 
            .format(id=t.id, memory=t.resources_measured.memory))

The resources measured are also written to Work Queue's transaction log. This log can be enabled when declaring the master's queue:

import work_queue as wq
q = wq.WorkQueue(port=9123, transactions_log='my_wq_trans.log')
q.enable_monitoring()

This log is also generated by Makeflow when using Work Queue as a batch system (-Twq).

The per-task resource information appears as a JSON object in transactions of the form: TASK id DONE end-state exit-code resources-exhausted resources-measured. Here is an example of what a DONE transaction looks like:

1595431788501342 10489 TASK 1 DONE SUCCESS  0  {} {"cores": 2, ...}
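As a sanity check, the JSON payload can be pulled out of such a line with a regular expression. The line below is a hypothetical complete transaction, since the example above elides the full JSON:

```python
import json
import re

# a made-up complete DONE transaction, for illustration only
line = ('1595431788501342 10489 TASK 1 DONE SUCCESS  0  {} '
        '{"cores": 2, "memory": [52, "MB"], "bandwidth": [104.1, "Mbps"]}')

m = re.match(r'\d+\s+\d+\s+TASK\s+\d+\s+DONE\s+SUCCESS\s+0\s+{}\s+({.*})\s*$',
             line)
resources = json.loads(m.group(1))

print(resources['cores'])      # → 2
print(resources['memory'][0])  # → 52, the value part of [value, "unit"]
```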

With a regular expression incantation, we can extract the resource information into Python's pandas. Say, for example, that we are interested in the memory and bandwidth distributions among the executed tasks. We can read these resources as follows:

import json
import re
import pandas as pd
import matplotlib.pyplot as plt

# the list of the resources we are interested in
resources = 'memory bandwidth'.split()
df = pd.DataFrame(columns=resources)

input_file = 'my_wq_trans.log'

with open(input_file) as input:
    for line in input:
        # timestamp master-pid TASK id (continue next line)
        # DONE SUCCESS exit-code exceeded measured
        m = re.match(r'\d+\s+\d+\s+TASK\s+\d+\s+'
                     r'DONE\s+SUCCESS\s+0\s+{}\s+({.*})\s*$', line)
        if not m:
            continue

        # the resources JSON is captured by the only
        # parenthesized group in the pattern:
        s = json.loads(m.group(1))

        # append the new resources to the panda's data frame.
        # Resources are represented in a json array as
        # [value, "unit"], such as [1000, "MB"],
        # so we only take the first element of the array:
        df.loc[len(df)] = list(s[r][0] for r in resources)
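
To try the loop above without a real workload, we can feed it a couple of synthetic log lines (the values are made up); lines that do not match the pattern are simply skipped:

```python
import json
import re

import pandas as pd

resources = ['memory', 'bandwidth']
df = pd.DataFrame(columns=resources)

# two made-up DONE transactions, plus a line the regex should skip
log_lines = [
    '1595431788501342 10489 TASK 1 DONE SUCCESS  0  {} '
    '{"memory": [40, "MB"], "bandwidth": [800, "Mbps"]}',
    '1595431788501400 10489 TASK 2 RUNNING',
    '1595431788501500 10489 TASK 2 DONE SUCCESS  0  {} '
    '{"memory": [60, "MB"], "bandwidth": [900, "Mbps"]}',
]

pattern = re.compile(r'\d+\s+\d+\s+TASK\s+\d+\s+'
                     r'DONE\s+SUCCESS\s+0\s+{}\s+({.*})\s*$')
for line in log_lines:
    m = pattern.match(line)
    if not m:
        continue
    s = json.loads(m.group(1))
    # keep only the numeric value of each [value, "unit"] pair
    df.loc[len(df)] = [s[r][0] for r in resources]

print(len(df))             # → 2
print(list(df['memory']))  # → [40, 60]
```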


For a quick view, we can directly use pandas' histogram method:

df.hist()
plt.show()
However, we can use matplotlib's facilities for subfigures and add titles, units, etc. to the histograms:

# size width x height in inches
fig = plt.figure(figsize=(5,2))

# 1 row, 2 columns, 1st figure of the array
mem = plt.subplot(121)
mem.set_title('memory in MB')
mem.set_ylabel('task count')
mem.hist(df['memory'], range=(0,100))

# 1 row, 2 columns, 2nd figure of the array
mbp = plt.subplot(122)
mbp.set_title('bandwidth in Mbps')
mbp.hist(df['bandwidth'], range=(0,1200))

fig.savefig(input_file + '.png')
Posted by Benjamin Tovar at 8:51 AM

Thursday, August 13, 2020

Tim Shaffer Awarded DOE Fellowship

CCL grad student Tim Shaffer was recently awarded a DOE SCGSR fellowship for his work titled "Enabling Distributed HPC for Loosely‐Coupled Dataflow Applications".  He will be working with Ian Foster and Kyle Chard at Argonne National Lab on data intensive applications that combine the Parsl system from Argonne and the Work Queue runtime from Notre Dame.  Congratulations Tim!


Posted by Douglas Thain at 12:35 PM

WRENCH Simulation of Work Queue

Our colleagues Henri Casanova (U Hawaii) and Rafael Ferreira da Silva (USC), along with their students, have recently published a paper highlighting their work on the WRENCH project. They have constructed a series of simulators that model the behavior of distributed systems, for the purposes of both performance prediction and education.

In their paper "Developing accurate and scalable simulators of production workflow management systems with WRENCH", they describe simulators corresponding to the Pegasus workflow management system and to our own Work Queue distributed execution framework.

Of course, any simulation is an imperfect approximation of a real system, but what's interesting about the WRENCH simulations is that they allow us to verify the basic assumptions and behavior of a software implementation. In this example, the real system and the simulation show the same overall behavior, except that the real system exhibits a stair-step pattern.


So, does that mean the simulation is "wrong"?  Not really!  In this case, the software is showing an undesirable behavior that is due either to incorrect logging or possibly a convoy effect.  In short, the simulation helps us to find a bug relative to the "ideal" design.  Nice!

https://www.sciencedirect.com/science/article/pii/S0167739X19317431

Posted by Douglas Thain at 12:20 PM

About the CCL

At the University of Notre Dame, we design software that enables computing on thousands of machines at once in order to enable new discoveries through computing in fields such as physics, chemistry, bioinformatics, biometrics, and data mining.

See our main web site for software, publications, and much more information.

