CCL grad student Cami Carballo gave an interactive notebook talk on scaling up data analysis workloads at the PyHEP 2020 conference on Python for high energy physics.
This Python notebook (Integrating-Coffea-and-WorkQueue.ipynb) demonstrates the Coffea data analysis framework running on the Work Queue distributed execution system, all packaged within a Jupyter notebook.
A particular challenge in cluster environments is ensuring that the remote execution nodes have the Python execution environment the end user needs. Scientific applications change quickly, so the remote nodes must run exactly the right Python interpreter along with the precise set of libraries (both Python and native). To accomplish this, the Coffea-WorkQueue module performs a static analysis of the dependencies an application needs and ships them along with the remote tasks, deploying them on demand so that multiple independent applications can run simultaneously on the same cluster.
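To give a feel for what static dependency analysis involves, here is a minimal illustrative sketch (not the actual Coffea-WorkQueue implementation, which also resolves versions and native libraries) that uses Python's standard `ast` module to collect the top-level module names a script imports, without executing it:

```python
import ast

def imported_modules(source: str) -> set:
    """Statically collect top-level module names imported by a script.

    Hypothetical sketch of the first step of dependency analysis:
    parse the source into an AST and walk it, recording the root
    package of every import statement.
    """
    tree = ast.parse(source)
    mods = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import numpy as np" -> "numpy"
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from coffea import processor" -> "coffea"
            mods.add(node.module.split(".")[0])
    return mods

# Example: a toy analysis script (contents are illustrative only)
script = (
    "import numpy as np\n"
    "from coffea import processor\n"
    "import awkward.layout\n"
)
print(sorted(imported_modules(script)))
```

A real system would then map each discovered module to a concrete package and version and bundle the result for shipment to the workers along with the tasks.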
Coffea + Work Queue is under active development as we continue to tune and scale the combined system.