Wednesday, December 17, 2025
Accelerating Coffea Workflows with Persistent Preprocessing Cache
High-energy physics analysis at scale depends on efficient data processing pipelines. When working with ROOT files in distributed computing environments, even small inefficiencies compound quickly, especially when the same preprocessing work is repeated across hundreds or thousands of workers. That's the problem we're tackling in this pull request for Coffea: eliminating redundant preprocessing overhead through file-backed caching.
Coffea's preprocessor currently recomputes metadata for ROOT files on every run. While this guarantees freshness, it also means that workflows repeatedly pay the cost of extracting file structure, branch information, and entry counts even when those files haven't changed in weeks or months. In clustered environments this creates a particular bottleneck: preprocessing must complete before the actual analysis begins, so workers with lighter preprocessing loads sit idle while stragglers finish. The result is wasted CPU cycles, longer time-to-results, and underutilized infrastructure.
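To make the idea concrete, here is a minimal sketch of file-backed caching around preprocessing. It is illustrative rather than the interface added by the pull request: it assumes coffea.dataset_tools.preprocess as the preprocessing entry point and stores its output on disk, keyed by a hash of the fileset specification, so an unchanged fileset skips recomputation on later runs.

```python
import hashlib
import json
import os

# Assumed entry point; recent coffea releases expose preprocess() here.
from coffea.dataset_tools import preprocess


def cached_preprocess(fileset, cache_path="preprocess_cache.json", **kwargs):
    """Run preprocessing once and reuse the result while the fileset is unchanged.

    The cache key is a hash of the fileset specification; if any file or tree
    name changes, the key changes and preprocessing runs again.
    """
    key = hashlib.sha256(
        json.dumps(fileset, sort_keys=True).encode()
    ).hexdigest()

    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
        if cache.get("key") == key:
            return cache["runnable"], cache["updated"]

    # Cache miss: pay the preprocessing cost once and persist the result.
    # preprocess() is assumed to return (runnable, updated) plain dicts,
    # which are JSON-serializable in recent coffea releases.
    runnable, updated = preprocess(fileset, **kwargs)
    with open(cache_path, "w") as f:
        json.dump({"key": key, "runnable": runnable, "updated": updated}, f)
    return runnable, updated
```

In this sketch the whole cache is invalidated whenever the fileset changes at all; a finer-grained scheme could key the cache per file so that only new or modified files are preprocessed.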
Tuesday, December 9, 2025
Exploring Execution Strategies and Compositional Trade-Offs in the Context of Large-Scale HEP Workflows
The European Organization for Nuclear Research (CERN) has four main High Energy Physics experiments, the Compact Muon Solenoid (CMS) being one of them. These experiments are already approaching the Exabyte-scale, and data rates are planned to increase significantly with the High-Luminosity Large Hadron Collider (HL-LHC).
Flexible workload specification and execution are critical to the success of the CMS Physics program, which currently utilizes approximately half a million CPU cores across the Worldwide LHC Computing Grid (WLCG) and other HPC centers, executing Directed Acyclic Graph (DAG) workflows for data reprocessing and Monte Carlo production. Each node in the DAG corresponds to a taskset; tasksets can have different resource requirements, such as operating system version, CPU cores, memory, and GPUs, and each taskset can spawn anywhere from one to millions of grid jobs.
In this research, we explore the hybrid spectrum of workflow composition by interpolating between the two currently supported specifications, known as TaskChain and StepChain. They employ distinct workflow paradigms: TaskChain executes a single physics payload per grid job, whereas StepChain processes multiple payloads within the same job. Given heterogeneous workflow requirements and an increasingly diverse set of resources, an adaptive workflow specification is essential for efficient resource utilization and increased event throughput.
A DAG workflow simulator, named DAGFlowSim, has been developed to understand the trade-offs involved in the different workflow constructions. The simulator provides insights into event throughput, resource utilization, disk I/O requirements, and more. Given a sequential DAG with 5 heterogeneous tasksets/nodes, we can analyse the CPU utilization and time per event for the 16 possible workflow constructions (for each of the 4 edges between consecutive tasksets, the two tasksets are either fused into the same job or split, giving 2^4 = 16 groupings). Construction 1 and Construction 16 are the extreme cases already supported, representing a tightly-dependent taskset execution (StepChain-like) and a fully independent execution (TaskChain-like), respectively.
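As a sanity check on that count, the sketch below enumerates the ways a sequential chain of tasksets can be grouped into grid jobs. It is an illustrative enumeration, not DAGFlowSim itself, and its ordering need not match the Construction 1-16 labels used in the study.

```python
from itertools import product


def chain_constructions(tasksets):
    """Enumerate every grouping of a sequential taskset chain into grid jobs.

    Each of the n-1 edges between consecutive tasksets is either fused (both
    tasksets run in the same job, StepChain-like) or cut (they run in separate
    jobs, TaskChain-like), giving 2**(n-1) constructions in total.
    """
    n = len(tasksets)
    for cuts in product([False, True], repeat=n - 1):
        jobs, current = [], [tasksets[0]]
        for taskset, cut in zip(tasksets[1:], cuts):
            if cut:
                jobs.append(current)
                current = [taskset]
            else:
                current.append(taskset)
        jobs.append(current)
        yield jobs


constructions = list(chain_constructions(["T1", "T2", "T3", "T4", "T5"]))
print(len(constructions))    # 16
print(constructions[0])      # [['T1', 'T2', 'T3', 'T4', 'T5']] - single job, StepChain-like
print(constructions[-1])     # [['T1'], ['T2'], ['T3'], ['T4'], ['T5']] - fully split, TaskChain-like
```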
For more details on the methodology and simulation results, please check out the slide deck presented at the ACAT 2025 Workshop.
Tuesday, December 2, 2025
TaskVine Insights - Storage Management: Depth-Aware Pruning
Modern scientific workflows often span tens or hundreds of thousands of tasks, forming deep DAGs (directed acyclic graphs) that handle large volumes of intermediate data. The large number of tasks primarily results from ever-growing experimental datasets and their partitioning into data chunks. Each data chunk is applied to a subgraph to form a branch, and different branches typically have an identical graph structure. Collectively, these form a gigantic computing graph. This graph is then fed into a WMS (workflow management system), which manages task dependencies, distributes the tasks across an HPC (high performance computing) cluster, and retrieves outputs when they are available. Below is the general architecture of the data partitioning and workflow computation process.
TaskVine serves as the core WMS component in such an architecture. The unique feature of TaskVine lies in its ability to leverage NLS (node-local storage) to alleviate I/O burdens toward the centralized PFS (parallel file system). This allows all intermediate data (or part of it) to be stored and accessed on individual workers' local disks, with peer-to-peer transfers performed on demand. Because node-local disks are limited, intermediate files are pruned once they are no longer needed; pruning too eagerly, however, means a worker failure can force lost files to be recomputed from much earlier in the graph. So, to retain some redundancy and minimize recomputation in response to worker failures, TaskVine provides the option to tune, for each file, how many generations of consumer tasks must finish before it can be pruned.
In the implementation, pruning is triggered the moment a task finishes, when its outputs are checked for eligibility to be discarded. Each file carries a pruning depth parameter that specifies how many generations of consumer tasks must finish before the file can be pruned. Depth-1 pruning is equivalent to aggressive pruning, while depth-k pruning (k >= 2) retains the file until every output of its consumer tasks has itself satisfied the depth-(k-1) criterion. Formally, the pruning time satisfies P_k(f) = max{ P_{k-1}(g) : g is an output of one of f's consumer tasks }.
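The recurrence can be illustrated with a toy model. The sketch below is not the TaskVine implementation: it assumes a hand-written task table with completion times and computes, for each file, the earliest time it could be pruned at a given depth.

```python
import math
from functools import lru_cache

# Toy linear workflow: task -> (input files, output files, completion time).
# The names and times are made up purely for illustration.
TASKS = {
    "t1": {"inputs": ["raw"], "outputs": ["a"], "done": 10},
    "t2": {"inputs": ["a"],   "outputs": ["b"], "done": 20},
    "t3": {"inputs": ["b"],   "outputs": ["c"], "done": 30},
}


def consumers(f):
    """Tasks that read file f."""
    return [t for t, spec in TASKS.items() if f in spec["inputs"]]


@lru_cache(maxsize=None)
def prune_time(f, depth):
    """Earliest time file f may be pruned under depth-`depth` pruning."""
    tasks = consumers(f)
    if not tasks:                      # final outputs are never pruned
        return math.inf
    if depth == 1:                     # aggressive: wait only for direct consumers
        return max(TASKS[t]["done"] for t in tasks)
    # depth >= 2: every output of every consumer must satisfy depth-(depth-1)
    return max(prune_time(g, depth - 1)
               for t in tasks for g in TASKS[t]["outputs"])


print(prune_time("raw", 1))   # 10: pruned as soon as t1 finishes
print(prune_time("raw", 2))   # 20: also waits until 'a' satisfies depth-1
print(prune_time("a", 2))     # 30: waits until 'b' satisfies depth-1 (t3 done)
```

Raising the depth pushes each file's prune time further down the graph, trading extra local storage for the ability to restart from a more recent intermediate after a failure.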
Depth-aware pruning can work in conjunction with two other fault tolerance mechanisms in TaskVine: peer replication, in which every intermediate file is replicated across multiple workers, and checkpointing, in which a portion of intermediates are written into the PFS for persistent storage. Subsequent articles will cover these topics to help readers better understand the underlying tradeoffs and know how to select the appropriate parameters to suit their own needs.
TaskVine users can explore the manual to see how to adjust these parameters to balance storage efficiency against fault tolerance. For depth-aware pruning, we recommend aggressive pruning (depth-1) as the general default to maximize storage savings, and increasing the depth parameter when worker failures are frequent and additional data redundancy is needed.


