Tuesday, December 9, 2025

Exploring Execution Strategies and Compositional Trade-Offs in the Context of Large-Scale HEP Workflows

The European Organization for Nuclear Research (CERN) hosts four main High Energy Physics experiments, one of which is the Compact Muon Solenoid (CMS). These experiments are already approaching the exabyte scale, and data rates are set to increase significantly with the High-Luminosity Large Hadron Collider (HL-LHC).



Flexible workload specification and execution are critical to the success of the CMS physics program, which currently uses approximately half a million CPU cores across the Worldwide LHC Computing Grid (WLCG) and HPC centers to execute Directed Acyclic Graph (DAG) workflows for data reprocessing and Monte Carlo production. Each node in the DAG corresponds to a taskset. Tasksets can have different resource requirements, such as operating system version, CPU cores, memory, and GPUs, and each taskset can spawn anywhere from one to millions of grid jobs.

In this research, we explore the hybrid spectrum of workflow composition by interpolating between the two currently supported specifications, known as TaskChain and StepChain. The two employ distinct workflow paradigms: TaskChain executes a single physics payload per grid job, whereas StepChain processes multiple payloads within the same job. Given heterogeneous workflow requirements and an increasingly diverse set of resources, an adaptive workflow specification is essential for efficient resource utilization and increased event throughput.
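To make the trade-off concrete, here is a deliberately simplified toy model, not the model used in our simulator. It assumes that a grid job running several payloads back to back reserves the largest core count any of them needs for the whole job, so payloads that use fewer cores leave the rest idle. The `Taskset` class, the `group_cpu_utilization` function, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Taskset:
    name: str
    cores: int            # cores the payload can keep busy (hypothetical)
    sec_per_event: float  # wall-clock seconds per event (hypothetical)


def group_cpu_utilization(group):
    """Fraction of reserved core-seconds actually used by one grid job.

    Toy assumption: the job reserves max(cores) for its whole duration
    and runs the payloads of the group one after another.
    """
    allocated = max(t.cores for t in group)
    wall = sum(t.sec_per_event for t in group)
    used = sum(t.cores * t.sec_per_event for t in group)
    return used / (allocated * wall)


# Two hypothetical tasksets with very different core requirements.
multi_core = Taskset("multi_core_step", cores=8, sec_per_event=10.0)
single_core = Taskset("single_core_step", cores=1, sec_per_event=10.0)

# StepChain-like: both payloads share one job, so 7 cores idle half the time.
together = group_cpu_utilization([multi_core, single_core])

# TaskChain-like: each payload gets a job sized to its own needs.
separate = [group_cpu_utilization([t]) for t in (multi_core, single_core)]
```

Under these assumptions the combined job uses only part of its reserved core-seconds, while the separate jobs use all of theirs. The model ignores what running payloads separately costs in practice, such as staging intermediate output between jobs, which is exactly the kind of effect a fuller simulation has to account for.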

A DAG workflow simulator, named DAGFlowSim, has been developed to understand the trade-offs involved in the different workflow constructions. The simulator provides insights into event throughput, resource utilization, and disk I/O requirements, among other metrics. Given a sequential DAG with 5 heterogeneous tasksets (nodes), we can analyse the CPU utilization and time per event for the 16 possible workflow constructions. Construction 1 and Construction 16 are the extreme cases already supported, representing tightly dependent taskset execution (StepChain-like) and fully independent execution (TaskChain-like), respectively.
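One way to arrive at the count of 16: in a sequential chain of 5 tasksets there are 4 links between neighbours, and each link can either stay inside a grid job or be cut into a job boundary, giving 2^4 = 16 groupings. The sketch below enumerates them. It is an illustration and not DAGFlowSim code: the `constructions` function and the names `T1` to `T5` are hypothetical, and the ordering (no cuts first, all cuts last) is an assumption picked so that the first and last entries line up with the two extremes described above.

```python
def constructions(tasksets):
    """Yield every split of a non-empty sequential chain into contiguous groups.

    Bit i-1 of `mask` decides whether the link between tasksets[i-1] and
    tasksets[i] is cut (1) or kept inside the same grid job (0).
    """
    n = len(tasksets)
    for mask in range(2 ** (n - 1)):
        groups = []
        current = [tasksets[0]]
        for i in range(1, n):
            if (mask >> (i - 1)) & 1:
                # Cut here: close the current job and start a new one.
                groups.append(current)
                current = [tasksets[i]]
            else:
                # Keep this taskset in the same job as its predecessor.
                current.append(tasksets[i])
        groups.append(current)
        yield groups


chain = ["T1", "T2", "T3", "T4", "T5"]  # placeholder taskset names
all_constructions = list(constructions(chain))

for number, groups in enumerate(all_constructions, start=1):
    print(number, groups)
```

With this ordering, the first entry keeps all five tasksets in one job (StepChain-like) and the last entry places each taskset in its own job (TaskChain-like); the 14 entries in between are the hybrid constructions.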



For more details on the methodology and simulation results, please check out the slide deck presented at the ACAT 2025 Workshop.
