Cooperative Computing Lab News: July 2020

Coffea + Work Queue Presentation at PyHEP 2020

CCL grad student Cami Carballo gave an interactive notebook talk on scaling up data analysis workloads at the PyHEP 2020 conference on Python for high energy physics.

This Python notebook (Integrating-Coffea-and-WorkQueue.ipynb) demonstrates the Coffea data analysis framework running on the Work Queue distributed execution system, all packaged within a Jupyter notebook.

A particular challenge in cluster environments is making sure that the remote execution nodes have the proper Python execution environment needed by the end user. Scientific applications change quickly, and so it's important to have exactly the right Python interpreter along with the precise set of libraries (Python and native) installed. To accomplish this, the Coffea-WorkQueue module performs a static analysis of the dependencies needed by an application, and ships them along with the remote tasks, deploying them as needed so that multiple independent applications can run simultaneously on the cluster.

Coffea + Work Queue is under active development as we continue to tune and scale the combined system.

Troubleshooting at PEARC 2020

CCL grad student Nate Kremer-Herman presented his work on troubleshooting distributed systems at the PEARC 2020 conference:

Nathaniel Kremer-Herman and Douglas Thain, Log Discovery for Troubleshooting Open Distributed Systems with TLQ, Practice and Experience of Advanced Research Computing (PEARC), July, 2020.

Abstract:

Troubleshooting a distributed system can be incredibly difficult. It is rarely feasible to expect a user to know the fine-grained interactions between their system and the environment configuration of each machine used in the system. Because of this, work can grind to a halt when a seemingly trivial detail changes. To address this, there is a plethora of state-of-the-art log analysis tools, debuggers, and visualization suites. However, a user may be executing in an open distributed system where the placement of their components is not known before runtime. This makes the process of tracking debug logs almost as difficult as troubleshooting the failures these logs have recorded because the location of those logs is usually not transparent to the user (and by association the troubleshooting tools they are using). We present TLQ, a framework designed from first principles for log discovery to enable troubleshooting of open distributed systems. TLQ consists of a querying client and a set of servers which track relevant debug logs spread across an open distributed system. Through a series of examples, we demonstrate how TLQ enables users to discover the locations of their system’s debug logs and in turn use well-defined troubleshooting tools upon those logs in a distributed fashion. Both of these tasks were previously impractical to ask of an open distributed system without significant a priori knowledge. We also concretely verify TLQ’s effectiveness by way of a production system: a biodiversity scientific workflow. We note the potential storage and performance overheads of TLQ compared to a centralized, closed system approach.

Container Management at IPDPS 2020

CCL grad student Tim Shaffer recently presented his work on container management at IPDPS 2020:

Container technologies are seeing wider use at advanced computing facilities for managing highly complex applications that must execute at multiple sites. However, in a distributed high throughput computing setting, the unrestricted use of containers can result in the container explosion problem. If a new container image is generated for each variation of a job dispatched to a site, shared storage is soon exceeded. On the other hand, if a single large container image is used to meet multiple needs, the size of that container may become a problem for storage and transport. To address this problem, we observe that many containers have an internal structure generated by a structured package manager, and this information could be used to strategically combine and share container images. We develop LANDLORD to exploit this property and evaluate its performance through a combination of simulation studies and empirical measurement of high energy physics applications.

Tim Shaffer, Nicholas Hazekamp, Jakob Blomer, and Douglas Thain, "Solving the Container Explosion Problem for Distributed High Throughput Computing", International Parallel and Distributed Processing Symposium, May, 2020.
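The trade-off described in the abstract (one image per job variation versus one oversized image) can be illustrated with a small sketch that treats each container image as a set of packages. This is only an illustration under assumed rules, not LANDLORD's actual algorithm: the function `place_request`, the Jaccard-similarity rule, the 0.5 threshold, and the package names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two package sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def place_request(images, request, threshold=0.5):
    """Pick an image for a job's package request and return its index.

    images    -- list of frozensets, one per existing image (mutated in place)
    request   -- iterable of package names the job needs
    threshold -- assumed minimum similarity for merging into an existing image
    """
    request = frozenset(request)
    # 1. Reuse any image that already contains every requested package.
    for i, img in enumerate(images):
        if request <= img:
            return i
    # 2. Otherwise merge into the most similar image, if it is similar enough.
    best, best_sim = None, 0.0
    for i, img in enumerate(images):
        sim = jaccard(img, request)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= threshold:
        images[best] = images[best] | request
        return best
    # 3. Fall back to creating a new image for this request.
    images.append(request)
    return len(images) - 1

images = []
requests = [
    {"root", "numpy"},
    {"root", "numpy", "scipy"},   # similar enough: merged into image 0
    {"numpy"},                    # subset of image 0: reused
    {"tensorflow", "keras"},      # unrelated: new image 1
]
placements = [place_request(images, r) for r in requests]
print(placements)   # [0, 0, 0, 1]
print(len(images))  # 2
```

Under these assumptions, raising the threshold yields more, smaller images, while lowering it yields fewer, larger ones, which mirrors the storage trade-off the abstract describes.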

Thursday, July 23, 2020
