In an upcoming paper to be presented at IPDPS 2022, we discuss our experience designing and executing high-throughput, data-intensive applications for high energy physics. The application itself is pretty cool: TopEFT is a physics application that uses the Coffea framework for parallelization, the Work Queue framework for distributed execution, and XRootD for remote data access:
Configuring such applications to run on large clusters is a substantial end-user problem. It's not enough to write a correct application: one must also select a wide variety of performance parameters, like the data chunk size, the task length, the amount of memory per task, and so on. When these are chosen well, everything runs smoothly. But even one parameter out of tune can result in the application taking orders of magnitude longer than necessary, wasting thousands of resources, or simply not running at all. Here is the end-to-end runtime for a few configurations with slight variations:
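To make the memory-per-task problem concrete, here is a minimal, hypothetical sketch of one adaptive strategy: start each task with a small first-guess allocation, and when a task exhausts its allocation, retry it with a larger one up to a hard cap. The function names and numbers below are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of adaptive per-task memory allocation.
# All names and defaults are illustrative, not from the paper.

def run_task(task_mem_needed_mb, allocated_mb):
    """Stand-in for task execution: succeeds only if the allocation is big enough."""
    return allocated_mb >= task_mem_needed_mb

def shape_allocation(task_mem_needed_mb, first_guess_mb=512, cap_mb=8192):
    """Return the allocation that finally succeeds, doubling after each failure."""
    alloc = first_guess_mb
    while alloc <= cap_mb:
        if run_task(task_mem_needed_mb, alloc):
            return alloc
        alloc *= 2  # task exceeded its allocation: retry with a larger one
    raise RuntimeError("task exceeds the per-task memory cap")

# A task that actually needs 1.3 GB succeeds on the third attempt
# (512 MB -> 1024 MB -> 2048 MB):
print(shape_allocation(1300))  # -> 2048
```

The upside of this kind of scheme is that most tasks run with a lean allocation (so more tasks pack onto each worker), while the occasional outlier still completes after a retry rather than crashing the run.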
With these techniques, we are able to relieve the user of the burden of setting a variety of controls, and allow the system to find its own stable configuration. Check it out:
- Ben Tovar, Ben Lyons, Kelci Mohrman, Barry Sly-Delgado, Kevin Lannon, and Douglas Thain, Dynamic Task Shaping for High Throughput Data Analysis Applications in High Energy Physics, IEEE International Parallel and Distributed Processing Symposium, June, 2022.
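To give a flavor of another control that can be tuned automatically, here is a hedged sketch of chunk-size feedback: scale the number of events per task so that measured task runtimes move toward a target, assuming runtime grows roughly linearly with chunk size. The function, target, and bounds are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: adapt the chunk size (events per task) toward a
# target task runtime. Names and defaults are assumptions, not the paper's.

def next_chunksize(current_chunksize, measured_runtime_s, target_runtime_s=60,
                   min_chunk=1_000, max_chunk=500_000):
    """Scale the chunk size by target/measured runtime, clamped to sane bounds."""
    scaled = int(current_chunksize * target_runtime_s / measured_runtime_s)
    return max(min_chunk, min(max_chunk, scaled))

# Tasks finishing in 15 s are too short to amortize overheads,
# so the chunk size is quadrupled:
print(next_chunksize(50_000, 15))  # -> 200000
```

Clamping matters here: without bounds, one anomalously fast or slow task could swing the chunk size wildly from one round to the next.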