Emerging large scale science applications may consist of a large number of dynamic tasks to be run across a large number of workers. When written in a Python-oriented framework like Parsl or FuncX those tasks are not heavyweight Unix processes, but rather lightweight invocations of individual functions. Running these functions at large scale presents two distinct challenges:
1 - The precise software dependencies needed by the function must be made available at each worker node. These dependencies must be chosen accurately: too few, and the function won't work; too many, and the cost of distribution is too high. We show a method for determining, distributing, and caching the exact dependencies needed at runtime, without user intervention.
2 - The right number of functions must be "packed" into large worker nodes that may have hundreds of cores and many GB of memory. Too few, and the system is badly underutilized; too many, and performance will suffer or invocations will crash. We show an automatic method for monitoring and predicting the resources consumed by categories of functions. This results in resource allocation that is much more efficient than an unmanaged approach, and is very close to an "oracle" that predicts perfectly.
The techniques shown in this paper are integrated into the Parsl workflow system from U-Chicago, and the Work Queue distributed execution framework from Notre Dame, both of which are open source software supported by the NSF CSSI program.
- Tim Shaffer, Zhuozhao Li, Ben Tovar, Yadu Babuji, TJ Dasso, Zoe Surma, Kyle Chard, Ian Foster, and Douglas Thain, Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications, IEEE International Parallel & Distributed Processing Symposium, May, 2021.