Tuesday, October 14, 2025

Reducing Overhead of LLM-integrated Applications on GPU Clusters with Parsl+TaskVine

Large Language Models (LLMs) are becoming a key tool for scientific discovery, but running them on High-Performance Computing (HPC) clusters is challenging due to the limitations of traditional resource allocation methods. Static allocation, which assigns a dedicated set of GPUs to a task, is rigid: it can lead to long queues of frustrated users and wasted resources, as allocated GPUs often sit idle between jobs. Opportunistic allocation, in contrast, lets tasks use available but not guaranteed resources. While this improves overall cluster utilization, it is problematic for LLM applications: loading a multi-billion-parameter LLM is time-consuming, and since opportunistic tasks can be preempted at any moment, this expensive startup often has to be repeated from scratch.

To solve this, we propose a new technique called Pervasive Context Management. The core idea is to decouple the LLM initialization context from the actual inference tasks and keep this context persistent on GPUs until it is no longer needed. This transforms the high startup cost into a one-time, amortizable expense. When a task is preempted, it can be quickly rescheduled to another GPU that already has the necessary context loaded, eliminating the need to re-initialize the model. Our Parsl+TaskVine system can also transfer existing context between nodes to bootstrap new GPUs, reducing data transfer time and avoiding bottlenecks.
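To make the idea concrete, here is a minimal, self-contained Python sketch of per-worker context caching. The names (FakeLLM, verify_claim) and the simulated load time are purely illustrative and are not the actual Parsl+TaskVine API, which manages the persistent context on the user's behalf:

    import time

    class FakeLLM:
        """Stand-in for a multi-billion-parameter model with a slow startup."""
        def __init__(self, name):
            time.sleep(2)                # simulate the expensive one-time load
            self.name = name

        def generate(self, prompt):
            return f"[{self.name}] answer to: {prompt}"

    _CONTEXT = {}                        # lives as long as the worker process

    def get_context(name):
        # Load the model the first time it is requested on this GPU/worker,
        # then reuse it for every later task scheduled here.
        if name not in _CONTEXT:
            _CONTEXT[name] = FakeLLM(name)
        return _CONTEXT[name]

    def verify_claim(name, claim):
        # An inference task: cheap once the context is already resident.
        return get_context(name).generate(f"Is this claim supported? {claim}")

    if __name__ == "__main__":
        for claim in ["water boils at 100 C", "the moon is made of cheese"]:
            print(verify_claim("demo-llm", claim))  # only the first call pays the load

A preempted task that is rescheduled onto a worker whose context cache already holds the model skips the load entirely; that repeated startup is exactly the cost Pervasive Context Management amortizes away.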



To demonstrate the effectiveness of this approach, we transformed a fact-verification application to use Pervasive Context Management and conducted a comprehensive evaluation. Our results show a significant performance improvement on both static and opportunistic resources. Enabling Pervasive Context Management reduced the end-to-end execution time of the application by 72.1%, from 3 hours to just 48 minutes, using the same number of GPUs. The application also scaled efficiently to 186 GPUs (32.8% of all GPUs in the cluster), further reducing the execution time to just 13 minutes.



Additionally, Pervasive Context Management helps users avoid the complex problem of tuning the inference batch size. Because the expensive startup cost is now a one-time event, the application's performance becomes much more stable regardless of the batch size chosen. This removes the burden of manual tuning and ensures near-optimal execution. In summary, our findings show that this new approach is a viable solution for running high-throughput LLM inference applications efficiently on heterogeneous and opportunistic HPC clusters.
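A rough back-of-the-envelope model shows why. In the sketch below, all numbers (load time, per-item inference time, workload size) are made up for illustration; only the shape of the comparison matters:

    LOAD_TIME = 300      # seconds to initialize the LLM (hypothetical)
    PER_ITEM = 0.5       # seconds of inference per input (hypothetical)
    ITEMS = 10_000       # total inputs to process (hypothetical)

    def runtime(batch_size, persistent_context):
        # Each batch is one task. Without a persistent context, every task
        # pays the model-loading cost; with it, the load is paid only once.
        batches = -(-ITEMS // batch_size)        # ceiling division
        loads = 1 if persistent_context else batches
        return loads * LOAD_TIME + ITEMS * PER_ITEM

    for bs in (16, 128, 1024):
        print(f"batch={bs:5d}  without: {runtime(bs, False):8.0f} s"
              f"  with: {runtime(bs, True):6.0f} s")

Without the persistent context, the total runtime in this toy model swings by more than an order of magnitude as the batch size changes; with it, the curve is essentially flat, which is why manual batch-size tuning stops mattering.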


