Tuesday, October 21, 2025

Undergraduate Researcher Showcases PLEDGE Project at APANAC 2025 in Panama

On Thursday, October 2, 2025, undergraduate student Andrés Iglesias attended APANAC 2025, the National Congress dedicated to Science and Technology held in Panama. Andrés participated in the poster session with “PLEDGE: Accelerating Data Intensive Scientific Applications with Consistency Contracts.”

Andrés joined the Cooperative Computing Lab (CCL) at the University of Notre Dame as a summer research student through the iSURE program, spending three months on campus. During his time at Notre Dame, he contributed to the initial implementation of the PLEDGE Tracer, based on Colin’s previous work, which observes a scientific application and generates a consistency contract. He also worked on the PLEDGE Enforcer, which ensures the scientific application respects the consistency contract at runtime.

We are proud of Andrés’s contributions and delighted to see his work showcased at a prestigious national conference!



Tuesday, October 14, 2025

Reducing Overhead of LLM-integrated Applications on GPU Clusters with Parsl+TaskVine

Large Language Models (LLMs) are becoming a key tool for scientific discovery, but using them on High-Performance Computing (HPC) clusters is challenging due to the limitations of traditional resource allocation methods. Static allocation, which assigns a dedicated set of GPUs to a task, is rigid: it can lead to long queues of frustrated users and wasted resources, as the allocated GPUs sit idle while waiting for the next job to start. Opportunistic allocation, by contrast, allows tasks to use available but not guaranteed resources. While this improves overall cluster utilization, it is problematic for LLM applications: loading a multi-billion-parameter LLM is time-consuming, and since tasks in an opportunistic environment can be preempted at any moment, this expensive startup often has to be repeated from scratch.

To solve this, we propose a new technique called Pervasive Context Management. The core idea is to decouple the LLM initialization context from the actual inference tasks and keep this context persistent on GPUs until it is no longer needed. This transforms the high startup cost into a one-time, amortizable expense. When a task is preempted, it can be quickly rescheduled to another GPU that already has the necessary context loaded, eliminating the need to re-initialize the model. Our Parsl+TaskVine system can also transfer existing context between nodes to bootstrap new GPUs, reducing data transfer time and avoiding bottlenecks.
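To make the pattern concrete, here is a minimal sketch (not our actual Parsl+TaskVine implementation) of worker-side context caching: the expensive model load happens at most once per worker process, and every later inference task that lands on that worker reuses the warm context. The _DummyModel class and its two-second sleep are hypothetical stand-ins for a real multi-billion-parameter model and its load time.

    import time

    _CONTEXT = {}   # module-level cache; survives across tasks within one worker process

    class _DummyModel:
        """Hypothetical stand-in for a real multi-billion-parameter LLM."""
        def __init__(self, name):
            time.sleep(2)            # imitate the expensive one-time load onto the GPU
            self.name = name

        def generate(self, prompt):
            return f"[{self.name}] verdict for: {prompt[:40]}..."

    def get_context(model_name):
        """Load the model at most once per worker; later tasks reuse the warm context."""
        if model_name not in _CONTEXT:
            _CONTEXT[model_name] = _DummyModel(model_name)
        return _CONTEXT[model_name]

    def verify_claim(model_name, claim, evidence):
        """An inference task: cheap whenever the context is already resident."""
        model = get_context(model_name)
        return model.generate(f"Claim: {claim}\nEvidence: {evidence}")

In the real system the context lives on GPUs managed by TaskVine workers rather than in a Python dictionary, but the cost structure is the same: pay the startup once per worker, then amortize it across many tasks.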



To demonstrate the effectiveness of this approach, we transformed a fact-verification application to use Pervasive Context Management and conducted a comprehensive evaluation. Our results show a significant improvement in performance on both static and opportunistic resources. With Pervasive Context Management enabled, the end-to-end execution time of our application was reduced by 72.1%, from 3 hours to just 48 minutes, using the same number of GPUs. The application also scaled efficiently to 186 GPUs (32.8% of all GPUs in the cluster), further reducing the execution time to a mere 13 minutes.



Additionally, Pervasive Context Management helps users avoid the complex problem of tuning the inference batch size. Because the expensive startup cost is now a one-time event, the application's performance becomes much more stable regardless of the batch size chosen. This removes the burden of manual tuning and ensures near-optimal execution. In summary, our findings show that this new approach is a viable solution for running high-throughput LLM inference applications efficiently on heterogeneous and opportunistic HPC clusters.



Tuesday, October 7, 2025

TaskVine Insights: Storage Management – PFS vs. NLS

There are two primary storage layers when running workflows in HPC environments:

  • Parallel File System (PFS): A shared file system accessible to all users in the same cluster, such as VAST, Lustre, BeeGFS, and CephFS.

  • Node-Local Storage (NLS): Each worker’s local disk, accessed directly without relying on the network, usually the temporary directory of the local file system.



PFS and NLS each have their own advantages and disadvantages.

PFS is convenient because it can be easily accessed by users. It usually has large capacity, often hundreds of terabytes, making it an ideal choice for storing big datasets. It is also stable and reliable, ensuring that data is not lost. However, the main drawback of PFS is that its I/O bandwidth is shared among many users, so it can become saturated when many jobs perform I/O at the same time, turning data access into a major bottleneck for parallel computation.

In contrast, NLS provides isolated storage on distributed workers. Each worker has its own local disk and can read and write data directly. This allows the total I/O bandwidth to aggregate across all workers, with data transfers between nodes happening through peer-to-peer communication. This design effectively reduces I/O contention on the PFS and helps workflows scale to larger sizes. On the flip side, NLS also has limitations: its capacity is small, typically a few hundred gigabytes per node, and it is less reliable, since node failures are unpredictable and any data stored on a node that is preempted or goes offline is lost.

The figure below shows the performance difference between PFS and NLS. We test concurrent reads and writes from 1 to 128 threads and measure their average bandwidth. The PFS test runs directly on the parallel file system, while the NLS test runs on 8 workers, each with 16 cores. It is clear that running parallel I/O on NLS provides much higher average bandwidth. Each thread maintains about 1 GB/s of throughput, whereas the average bandwidth on PFS drops sharply as concurrency increases.
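For readers who want to try a rough version of this comparison themselves, the sketch below measures aggregate write bandwidth from a configurable number of concurrent threads against a target directory; pointing it at a PFS mount versus a node-local directory such as /tmp exposes the same trend. The directory paths, file size, and block size are arbitrary illustrative choices, not the exact parameters of our test.

    import os
    import time
    import tempfile
    from concurrent.futures import ThreadPoolExecutor

    def write_file(directory, size_mb=256, block_mb=4):
        """Write one file of size_mb megabytes in block_mb chunks; return elapsed seconds."""
        block = os.urandom(block_mb * 1024 * 1024)
        start = time.time()
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            for _ in range(size_mb // block_mb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        return time.time() - start

    def average_bandwidth(directory, threads, size_mb=256):
        """Run `threads` concurrent writers; return the average per-thread MB/s."""
        with ThreadPoolExecutor(max_workers=threads) as pool:
            times = list(pool.map(lambda _: write_file(directory, size_mb), range(threads)))
        return sum(size_mb / t for t in times) / threads

    if __name__ == "__main__":
        # Hypothetical mount points: a shared PFS directory and the node-local /tmp.
        for directory in ["/scratch/shared", "/tmp"]:
            for threads in [1, 8, 32, 128]:
                mbps = average_bandwidth(directory, threads)
                print(f"{directory}  {threads:4d} threads  {mbps:8.1f} MB/s per thread")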


The key takeaway from this comparison is that running large-scale or data-intensive workflows on HPC systems requires relying on NLS for its high aggregate I/O bandwidth, ensuring that the entire workflow is not slowed down by a deluge of reads and writes.

Our team has been studying and improving how to better leverage NLS to accelerate large-scale workflow computations in HPC systems through TaskVine, a workflow management system we have been developing over the past few years. TaskVine’s key advantage is its ability to use each worker’s NLS to reduce I/O contention on the PFS, enabling faster data access and quicker workflow completion. It also employs a range of data management techniques and strategies to ensure that NLS is used effectively and efficiently, keeping data safe and handling unpredictable node failures with care.
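As a small taste of what this looks like in practice, the sketch below keeps an intermediate file entirely in the workers' node-local caches by declaring it as a temporary file, so it never touches the PFS and can be fetched peer to peer by the consuming task. It assumes the current TaskVine Python API (ndcctools.taskvine), and the simulate and analyze commands are placeholders; see the TaskVine documentation for the authoritative interface.

    import ndcctools.taskvine as vine

    m = vine.Manager(9123)

    # An intermediate file that lives only in the workers' node-local caches;
    # it is never written back to the shared parallel file system.
    intermediate = m.declare_temp()

    producer = vine.Task("simulate > out.dat")       # "simulate" is a placeholder command
    producer.add_output(intermediate, "out.dat")
    m.submit(producer)

    consumer = vine.Task("analyze < in.dat")         # "analyze" is a placeholder command
    consumer.add_input(intermediate, "in.dat")
    m.submit(consumer)

    while not m.empty():
        t = m.wait(5)
        if t:
            print(f"task {t.id} finished with exit code {t.exit_code}")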

This blog series shares how we manage data carefully, address the challenges of using NLS when local disk space is limited and nodes are prone to failures, and achieve massive scalability.

Stay tuned for upcoming technical insights, code examples, and updates!

Wednesday, October 1, 2025

eScience 2025: Liberating the Data Aware Scheduler to Achieve Locality in Layered Scientific Workflow Systems

     On September 16 graduate student Colin Thomas presented the paper titled: Liberating the Data Aware Scheduler to Achieve Locality in Layered Scientific Workflow Systems at the 21st IEEE International Conference on eScience in Chicago, Illinois. 

 

 

  This work engages multiple topics, including workflow systems, data management, and task scheduling. The title describes two key components: the data aware scheduler and layered scientific workflow systems. Data aware schedulers are capable of understanding task data dependencies and making scheduling decisions based on that information, primarily to benefit from data locality when intermediate workflow data is already cached somewhere in the cluster. It is beneficial to schedule tasks that consume this data to the site where the data was created, so that the dependencies do not have to move through the network, or perhaps even out of memory.

     The term "layered workflow system" is a way to describe multiple popular workflow systems in the HPC community such as Parsl and Dask. These workflow systems consist of two primary components. The DAG manager and executor. The DAG manager understands the workflow composition and data dependencies. The executor receives tasks from the DAG manager and uses its understanding of the cluster and available resources to place tasks on their execution sites. 

     The primary argument of the paper highlights the obstacles created by using a data aware scheduler in this layered execution scheme. If we take a DAG such as the one described by Figure 1, we can easily identify opportunities for data locality in the groups of sequentially dependent tasks. However, the data aware scheduler is not privy to this picture of the DAG. Rather, the scheduler, or executor, is only aware of tasks that are ready to run, while the DAG manager withholds future task information until those tasks are ready to run as well. This forces a data aware scheduler to make last-minute scheduling decisions based on individual tasks. In many cases a task may end up occupying a node that a later task would have been better suited to, given its data locality opportunities.

     The paper presents an implementation of a modified Parsl DAG manager and TaskVine data aware scheduler that passes sequentially dependent tasks through to the data aware scheduler before all of them are ready to run. This allows TaskVine to identify the group dependencies and the ideal execution pattern, and to schedule these task groups in batches rather than on an individual basis.
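As a rough illustration of the idea, and not the actual Parsl or TaskVine code, the sketch below walks a task DAG and collects runs of sequentially dependent tasks into groups that could be handed to a data aware scheduler as a unit, so each chain runs where its intermediate data already lives.

    def chain_groups(deps):
        """Group a DAG into chains of sequentially dependent tasks.

        deps maps each task to the list of tasks it depends on, listed in
        topological order (parents before children). A task joins its parent's
        chain only when it is the sole consumer of a sole parent, so the data
        produced by the parent has exactly one place to go.
        """
        consumers = {t: [] for t in deps}
        for task, parents in deps.items():
            for p in parents:
                consumers[p].append(task)

        groups, assigned = [], {}
        for task, parents in deps.items():
            if len(parents) == 1 and len(consumers[parents[0]]) == 1:
                group = assigned[parents[0]]    # extend the parent's chain
                group.append(task)
            else:
                group = [task]                  # start a new chain
                groups.append(group)
            assigned[task] = group
        return groups

    # A small DAG: "a" fans out into two independent chains, b -> d and c -> e.
    deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]}
    print(chain_groups(deps))   # [['a'], ['b', 'd'], ['c', 'e']]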

 This increases the data locality achieved on the two workflows in the evaluation. In addition, it reduces the total number of scheduling operations by a factor of the average task group size.

 

The link to the full paper can be found below:

https://ccl.cse.nd.edu/research/papers/liberating-escience-2025.pdf 

     

Tuesday, September 23, 2025

Floability at eScience 2025: Making Notebooks Portable with Backpacks Across HPC Clusters

Grad student Saiful Islam presented our paper “Backpacks for Notebooks: Enabling Containerized Notebook Workflows in Distributed Environments” at the 2025 IEEE eScience Conference in Chicago, Illinois.

Notebooks have become the de facto interface for scientific computing, but moving from a local notebook to large-scale HPC clusters is far from seamless—especially when the notebook contains a distributed workflow that must be submitted across multiple nodes. A notebook file alone doesn’t capture the full execution context—it misses environment specifications, data locations, and resource requirements. As a result, the same notebook often runs on one system but fails on another.


This paper introduces the concept of a backpack—a lightweight companion that travels with the notebook and captures everything needed for execution. A backpack makes explicit the software environment, data sources, and resource requirements that are often left implicit in code. With Floability, our implementation of backpack specifications, backpacks transform ordinary notebooks into portable, reproducible workflows that can execute across heterogeneous HPC clusters with zero to minimal code modification.

We evaluated Floability on three representative scientific workflows—distributed image convolution, climate trend analysis, and high-energy physics data analysis—running them across five heterogeneous HPC systems (Notre Dame CRC, Purdue Anvil, UT Stampede3, OSPool, and AWS). In each case, backpacks successfully captured the required software and data dependencies, provisioned worker environments, and reproduced execution without code changes. While runtime varied due to site-specific infrastructure like schedulers and storage, all workflows completed consistently, demonstrating that backpacks enable portable, reproducible, and scalable execution of notebook workflows across diverse HPC environments.

For all the details, please check out our paper here: 

Md Saiful Islam, Talha Azaz, Raza Ahmad, A D M Shahadat Hossain, Furqan Baig, Shaowen Wang, Kevin Lannon, Tanu Malik, and Douglas Thain, Backpacks for Notebooks: Enabling Containerized Notebook Workflows in Distributed Environments, IEEE Conference on eScience, pages 9, September 2025.

Project website: https://floability.github.io

Example backpacks: https://github.com/floability/floability-examples

Tuesday, September 16, 2025

Workshop on Harmonizing Python Workflows at IEEE e-Science 2025

We helped host the Workshop on Harmonizing Python Workflows at the IEEE International Conference on e-Science on Monday, 15 Sep 2025.


This workshop is one component of our NSF POSE project to explore the creation of an open source ecosystem encompassing a variety of workflow technologies, including the Parsl Project (Kyle Chard at the University of Chicago), the RADICAL tools (Shantenu Jha, Rutgers University), and TaskVine (Douglas Thain, University of Notre Dame).  

Workshop attendees conducted several working groups where they identified key barriers to workflow creation, deployment, and adoption; proposed activities for an open source ecosystem to support these needs; and provided feedback on the possible structure of an ecosystem.

The organizers will be following up soon with a workshop report and draft proposal for next steps. 

Thank you to everyone who participated!




Tuesday, September 9, 2025

Welcome Back, Colin!

This past summer, 4th year PhD student Colin Thomas completed an internship at the National Energy Research Scientific Computing Center (NERSC) located at the Lawrence Berkeley National Laboratory.

Colin worked with a team of researchers and fellow interns to develop and deploy an Inference-as-a-Service (IaaS) platform for particle physics experiments, including DUNE and ATLAS. Colin organized a network of services deployed on Kubernetes and the Perlmutter supercomputer that enabled remote scientists to run their applications and analyze their data. The IaaS deployment included metrics collection used to profile the system, identify bottlenecks, and scale compute resources to meet user demand in real time. The team profiled multiple inference-serving technologies, such as Ray Serve and NVIDIA Triton. The effort resulted in a number of successes, including tests with scientists from multiple institutions, valuable profiling data for Ray Serve and Triton, and a system prototype that can serve as an example for future work in scientific inference serving.

Colin shared with us his wonderful experience during the summer and also gave an engaging talk about the work he has done during our first team meeting. Below is the poster he created and presented. We are excited to learn new ideas and look forward to more inspiring discussions ahead.



Tuesday, September 2, 2025

New Semester, New Faces


The new semester is here, and we’re excited to welcome three new colleagues and roll out a clear plan for the months ahead.

New faces. Lax joins as a first-year Ph.D. student. Ryan just completed his M.S. in our lab and begins his Ph.D. this semester. Abby joins as a first-year M.S. student. We’re glad to have them on board!

Each semester we adjust our routines and schedules to keep the lab running smoothly and to support a diverse team. This semester we’re trying a few new approaches. Colin will organize our weekly team talks with a schedule planned about a month ahead, host practice sessions for conference presentations, and invite remote guests for ~30-minute Zoom talks and discussion. Jin will handle outreach and support: a short blog every Tuesday, cross-posting to LinkedIn and other channels, encouraging GitHub Discussions as a support path, and serving as the first responder for incoming technical questions. Saiful will lead the website migration to a Jekyll site on GitHub, move over most existing content, convert prior posts into native Jekyll, and switch our papers database from PHP + database to a JSON dataset. Ben, the most experienced and supportive engineer who knows every line of the codebase, will guide release engineering: review PRs and give feedback, prioritize and maintain issues, keep integration tests in working order, and plan and ship regular software releases.

This semester, alongside our long-standing systems research, we will explore practical ways to pair our work with LLMs: for example, improving developer documentation, identifying code invariants, visualizing logs, and troubleshooting workflows. Ground rules: you are responsible for your output; do not create extra work for others; use ND-authorized tools such as Gemini and Gemini-CLI to protect data. This week each person will pick a modest task and tool to explore. Next week we will report on methods, results, and observations.

Project highlights.

  • CMS Computing (Barry, Jin, Ben, Alan): keep the current TaskVine release stable and usable; share successes with the HEP community; publish the new cms-taskvine-example repository; advance an IPDPS/TPDS paper on dynamic data reduction.

  • NSF CSSI Floability (Saiful, Abby, Ben): keep the release stable, usable, and documented; attach straightforward data handling for existing applications; grow users and collaborations; build interactive visualization of notebook workflows to support troubleshooting and performance; then extend the approach to new notebook and workflow models.

  • NSF Consistency Contracts (Colin + Jin): complete and release the basic toolchain to measure, summarize, and enforce; choose baseline techniques that exploit application behavior.

  • DOE XGFabric (Ryan, Thanh): capture the full software stack; port to multiple ACCESS sites and troubleshoot issues as they appear; evaluate scale, responsiveness, and resource use; develop solutions for HPC batch queue delays.

  • NSF HARMONY: run a fall workshop on workflow collaboration with the eScience conference; capture examples across Parsl-TaskVine (astro), DaskVine (HEP), and RADICAL (xgfabric).

  • NASA-SADE (Lucas, Lax): demonstrate the simulation infrastructure at the September Year 2 review; evaluate OS scalability and performance; begin integrating native SADE components.

We’ll post a short update every Tuesday and syndicate it to LinkedIn and other channels. Questions and ideas are welcome on GitHub Discussions.



Thursday, May 22, 2025

CCL Team at GCASR 2025


 
Members of the CCL team traveled to Chicago, Illinois on May 8 to attend GCASR 2025 (12th Greater Chicago Area Systems Research Workshop).
 
CCL team members presented their work in both poster sessions.
 
Barry Sly-Delgado presented: Task Graph Restructuring via Function Based Annotation For Large-Scale Scientific Applications.
 
Colin Thomas presented: Enabling Tailored Optimizations for Scientific Workflows through I/O Tracing.
 
Md. Saiful Islam presented: Floability: Enabling Portable Scientific Workflows Across HPC Facilities.
 
Jin Zhou presented: Effectively Exploiting Node-Local Storage For Data-Intensive Scientific Workflows.
 
 

Tuesday, January 14, 2025

Reshaping High Energy Physics Applications Using TaskVine @ SC24

    Barry Sly-Delgado presented our paper titled "Reshaping High Energy Physics Applications for Near-Interactive Execution Using TaskVine" at the 2024 Supercomputing Conference in Atlanta, Georgia. The paper investigates the steps needed to convert long-running, high-throughput high energy physics applications to high-concurrency, near-interactive execution, which included incorporating new functionality within TaskVine. It presents the speedup gained as changes were incorporated into the workflow execution stack for the application DV3, eventually achieving a 13X speedup.

 

Configurations for each workflow execution stack as improvements were made.

Starting with stack 1, the first change was to the storage system where the initial data sets are stored. This change showed little improvement, reducing the runtime from 3545s to 3378s, since much of the data handling during application execution involves intermediate results. With stacks 1 and 2, intermediate data movement is handled by a centralized manager.

 

Data movement during application execution between Work Queue and TaskVine. With TaskVine, the most data exchanged between any two nodes tops out around 4GB (the manager is node 0). With Work Queue, the most data transferred is 40GB.

Stack 3 incorporates a change of scheduler to TaskVine, which allows intermediate results to be stored on node-local storage and transferred between peer compute nodes. This relieves strain on the centralized manager and allows it to schedule tasks more effectively. This change drops the runtime to 730s.

 

CDF of task runtimes within the application per execution paradigm. With Function Calls, individual tasks execute faster.

Our final improvement changes the task execution paradigm within TaskVine. Initially, "PythonTasks" serialized functions along with their arguments and distributed them to compute nodes, invoking the Python interpreter for each individual task. Our new execution paradigm, "Function Calls", stands up a persistent Python process, the "Library Task", which contains function definitions that can be invoked via individual function calls. Thus, invocations of the Python interpreter are reduced from per-task to per-compute-node. This change reduces the runtime to 272s, a 13X speedup over our initial configuration.

Application execution comparison between stack configurations.
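For readers curious how this looks in code, below is a minimal sketch of the serverless style described above: functions are bundled into a library that runs as a persistent process on each worker, and each unit of work becomes a lightweight function call against it. The method names follow the TaskVine Python API as we understand it, and the histogram_events function is a placeholder; see the TaskVine manual for the authoritative interface.

    import ndcctools.taskvine as vine

    def histogram_events(dataset_slice):
        # Placeholder for the real analysis of one chunk of events.
        return f"processed {dataset_slice}"

    m = vine.Manager(9123)

    # Bundle the function into a library: one persistent Python process per
    # worker, so the interpreter starts once per node instead of once per task.
    libtask = m.create_library_from_functions("analysis_library", histogram_events)
    m.install_library(libtask)

    # Each unit of work is now a lightweight call into the resident library.
    for i in range(100):
        t = vine.FunctionCall("analysis_library", "histogram_events", f"slice-{i}")
        m.submit(t)

    while not m.empty():
        t = m.wait(5)
        if t:
            print(t.output)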