Monday, August 9, 2021

New PythonTask Interface in Work Queue

The most recent version of Work Queue supports two kinds of tasks.  A standard Task describes a Unix command line and its corresponding input and output files, just as before.  The new PythonTask describes a Python function invocation and its corresponding arguments:

import work_queue as wq

def my_sum(x, y):
    # any modules the function needs must be imported inside its body
    return x + y

queue = wq.WorkQueue(port=9123)

task = wq.PythonTask(my_sum, 10, 20)
queue.submit(task)
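For comparison, here is a minimal sketch of a standard command-line Task, which names the command and explicitly declares its input and output files.  The program and file names below are only placeholders:

# a standard Task runs a Unix command with explicitly declared files
t = wq.Task("./simulate.exe params.cfg > output.txt")
t.specify_input_file("simulate.exe")
t.specify_input_file("params.cfg")
t.specify_output_file("output.txt")
queue.submit(t)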
When the task is returned, the function's return value is available as task.output:
task = queue.wait(5)
if task:
    print("task {} completed with result {}".format(task.id, task.output))
Underneath, a PythonTask serializes the desired function and arguments and turns them into a standard task that can be executed remotely, using all of the existing capabilities of Work Queue.  As a result, a PythonTask can be given a resource allocation, time limits, tags, and everything else needed to manage distributed execution:
task.specify_cores(4)
task.specify_gpus(1)
task.specify_tag("hydrazine")
Thanks to new CCL graduate student Barry Sly-Delgado for adding this new capability to Work Queue!  See the full documentation.


Monday, August 2, 2021

Harnessing HPC at User Level for High Energy Physics

Ben Tovar presented some recent work at the (virtual) CHEP 2021 conference:  "Harnessing HPC Resources for CMS Jobs Using a Virtual Private Network".

The future computing needs of the Compact Muon Solenoid (CMS) experiment will require making use of national HPC facilities.  These facilities have substantial computational power, but for a variety of reasons, are not set up to allow network access from computational nodes out to the Internet.  This presents a barrier for CMS analysis workloads, which expect to make use of wide area data federations (like XROOTD) and global filesystems (like CVMFS) in order to execute.

In this paper, Ben demonstrates a prototype that creates a user-level virtual private network (VPN) that is dynamically deployed alongside a running analysis application.  The trick is to make the whole thing work without requiring any root-level privileges, since those are simply not available at an HPC facility.  The solution brings together a variety of technologies -- namespaces, openconnect, slirp4netns, ld_preload -- to provide a complete user-level solution.
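The complete scripts live in the repository linked below.  As a rough sketch of the core idea only (and not the project's own code), an unprivileged network namespace can be given outbound connectivity by pairing unshare with slirp4netns:

# illustrative sketch: user-level networking without root privileges
import subprocess, time

# start a long-running process inside new user + network namespaces
child = subprocess.Popen(
    ["unshare", "--user", "--map-root-user", "--net", "sleep", "3600"])

time.sleep(1)  # give the namespaces a moment to be created

# attach a user-mode TCP/IP stack to the child's network namespace
slirp = subprocess.Popen(
    ["slirp4netns", "--configure", "--mtu=65520", str(child.pid), "tap0"])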

The performance of this setup is a bit surprising: when running a single worker node, the throughput of a single file transfer is substantially lower when tunneled (196 Mbps) than when run natively (872 Mbps).  However, when running twenty or so workers, the tunneled solution achieves the same aggregate bandwidth as the native one (872 Mbps).  The most likely explanation is that tunneling a TCP connection over another TCP connection incurs a substantial start-up penalty while both stacks perform slow start.

You can try the solution yourself here:

https://github.com/cooperative-computing-lab/userlevel-vpn-tun-tap 


New CCL Swag!

Check out the new CCL swag! Send us a brief note describing how you use Work Queue or Makeflow to scale up your computational work, and we'll send you some laptop stickers and other items as a small thank-you.


Thursday, July 29, 2021

CCTools Version 7.3.0 released

The Cooperative Computing Lab is pleased to announce the release of version 7.3.0 of the Cooperative Computing Tools including Parrot, Chirp, JX, Makeflow, WorkQueue, and other software.

The software may be installed from here: http://ccl.cse.nd.edu/software/download

This is a minor release with some new features and bug fixes. Among them:

- [WorkQueue] PythonTask to directly execute Python functions as WorkQueue tasks. (Barry Sly-Delgado)
- [WorkQueue] Fix max mode allocation to work as a high-water mark when dispatching tasks. (Ben Tovar)
- [WorkQueue] Reworked documentation in https://cctools.readthedocs.io. (Douglas Thain)
- [WorkQueue] API to show summary of workers connected. (David Simonetti)
- [WorkQueue] Adds --wall-time limit to workers. (Thanh Son Phung)
- [Resource Monitor] Time is now reported in seconds, rather than microseconds. (Ben Tovar)
- [JX] jx_repl tool for JX language exploration. (Jack Rundle)


Thanks go to the contributors for many features, bug fixes, and tests:

- David Rundle
- Barry Sly-Delgado
- Thanh Son Phung
- Tim Shaffer
- David Simonetti
- Douglas Thain
- Ben Tovar

Please send any feedback to the CCTools discussion mailing list:

http://ccl.cse.nd.edu/community/forum

Enjoy!

Wednesday, April 21, 2021

Lightweight Function Paper at IPDPS

Tim Shaffer, a Ph.D. student in the CCL, will be presenting a paper "Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications" at the International Parallel and Distributed Processing Symposium (IPDPS) in May 2021.  This work is the result of a collaboration between our group and the Parsl workflow team at the University of Chicago, led by Kyle Chard.

Emerging large-scale science applications may consist of a large number of dynamic tasks to be run across a large number of workers.  When written in a Python-oriented framework like Parsl or FuncX, those tasks are not heavyweight Unix processes, but rather lightweight invocations of individual functions.  Running these functions at large scale presents two distinct challenges:


1 - The precise software dependencies needed by the function must be made available at each worker node.  These dependencies must be chosen accurately: too few, and the function won't work; too many, and the cost of distribution is too high.  We show a method for determining, distributing, and caching the exact dependencies needed at runtime, without user intervention.

2 - The right number of functions must be "packed" into large worker nodes that may have hundreds of cores and many GB of memory.  Too few, and the system is badly underutilized; too many, and performance will suffer or invocations will crash.  We show an automatic method for monitoring and predicting the resources consumed by categories of functions.  This results in resource allocation that is much more efficient than an unmanaged approach, and very close to an "oracle" that predicts perfectly.  (A sketch of how this looks to a Work Queue user appears just after this list.)
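To give a flavor of the second point, Work Queue already lets similar tasks be grouped into a named category, monitored, and allocated resources automatically.  The sketch below is only an illustration of that API, not the paper's exact mechanism; the category name and function are stand-ins:

import work_queue as wq

def classify(x):
    # stand-in for a real analysis function
    return x * 2

q = wq.WorkQueue(port=9123)
q.enable_monitoring()

# let Work Queue measure this category and size its allocations automatically
q.specify_category_mode("classify", wq.WORK_QUEUE_ALLOCATION_MODE_MAX_THROUGHPUT)

for x in range(100):
    t = wq.PythonTask(classify, x)
    t.specify_category("classify")
    q.submit(t)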


The techniques shown in this paper are integrated into the Parsl workflow system from U-Chicago, and the Work Queue distributed execution framework from Notre Dame, both of which are open source software supported by the NSF CSSI program.

Citation:
  • Tim Shaffer, Zhuozhao Li, Ben Tovar, Yadu Babuji, TJ Dasso, Zoe Surma, Kyle Chard, Ian Foster, and Douglas Thain, Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications, IEEE International Parallel & Distributed Processing Symposium, May, 2021. 

Ph.D. Defense - Nathaniel Kremer-Herman

Congratulations to Dr. Kremer-Herman, who successfully defended his Ph.D. dissertation "Log Discovery, Log Custody, and the Web Inspired Approach for Open Distributed Systems Troubleshooting".  His work created TLQ (Troubleshooting via Log Queries), a system that enables structured queries over the distributed data logged by independent components, including workflows, batch systems, and application file access.  Prof. Kremer-Herman recently began a faculty position at Hanover College in Indiana.  Congrats!

Thursday, February 18, 2021

CCTools 7.2.0 released

The Cooperative Computing Lab is pleased to announce the release of version 7.2.0 of the Cooperative Computing Tools including Parrot, Chirp, JX, Makeflow, WorkQueue, and other software.

The software may be downloaded here:
http://ccl.cse.nd.edu/software/download

This is a minor release with some new features and bug fixes. Among them:

  • [Batch] Improved GPU handling with HTCondor. (Douglas Thain)
  • [WorkQueue] Resource usage report with work_queue_status per task and worker. (Thanh Son Phung)
  • [WorkQueue] Improved GPU handling. (Douglas Thain, Tim Shaffer)
  • [WorkQueue] Assign tasks to specific GPUs with CUDA_VISIBLE_DEVICES. (Douglas Thain)
  • [Makeflow] Several fixes for sge_submit_makeflow. (Ben Tovar)


Thanks go to the contributors for many features, bug fixes, and tests:

  • Ben Tovar
  • Douglas Thain
  • Nathaniel Kremer-Herman
  • Thanh Son Phung
  • Tim Shaffer


Please send any feedback to the CCTools discussion mailing list:

http://ccl.cse.nd.edu/community/forum

Enjoy!

Friday, December 18, 2020

CCTools version 7.1.12 released

The Cooperative Computing Lab is pleased to announce the release of version 7.1.12 of the Cooperative Computing Tools including Parrot, Chirp, JX, Makeflow, WorkQueue, and other software.

The software may be downloaded here:
http://ccl.cse.nd.edu/software/download

This is a bug fix release:

  • [Batch interface] Adds sge_submit_workers to installed scripts directory. (Ben Tovar)
  • [Batch interface] Adds LSF as a batch type. (Douglas Thain)


Thanks go to the contributors for many features, bug fixes, and tests:

  • Ben Tovar
  • Cami Carballo
  • Douglas Thain
  • Nathaniel Kremer-Herman
  • Thanh Son Phung
  • Tim Shaffer



Please send any feedback to the CCTools discussion mailing list:

http://ccl.cse.nd.edu/community/forum

Enjoy!

Tuesday, December 15, 2020

OpenTopography + EEMT + Makeflow

The OpenTopography service provides online access to geospatial data and computational tools in support of the earth sciences.  The Effective Energy and Mass Transfer (EEMT) tool computes energy transfer in the Earth's critical zone, taking into account topography, vegetation, weather, and so forth.  To scale these computations up to large clusters, the CCL's Makeflow and Work Queue frameworks are used to construct large-scale parallel workflows at the touch of a button from the OpenTopography website.

Source: Tyson Swetnam, University of Arizona

Analyzing Agriculture with Work Queue

The Field Scanalyzer at the University of Arizona is a massive robot that uses sensors, cameras, and GPS devices to collect vast quantities of agricultural data from crop fields.  In the background, distributed computing and deep learning techniques are used to understand and improve agricultural efficiency in hot, dry climates.  Processing all this data requires reliable computation on large clusters: the PhytoOracle software from the Lyons Lab at UA makes this possible, building on the Work Queue software from the Cooperative Computing Lab at Notre Dame.

Source: Eric Lyons, University of Arizona