Tuesday, September 23, 2025

Floability at eScience 2025: Making Notebooks Portable with Backpacks Across HPC Clusters

Grad student Saiful Islam presented our paper “Backpacks for Notebooks: Enabling Containerized Notebook Workflows in Distributed Environments” at the 2025 IEEE eScience Conference in Chicago, Illinois.

Notebooks have become the de facto interface for scientific computing, but moving from a local notebook to large-scale HPC clusters is far from seamless—especially when the notebook contains a distributed workflow that must be submitted across multiple nodes. A notebook file alone doesn’t capture the full execution context—it misses environment specifications, data locations, and resource requirements. As a result, the same notebook often runs on one system but fails on another.


This paper introduces the concept of a backpack—a lightweight companion that travels with the notebook and captures everything needed for execution. A backpack makes explicit the software environment, data sources, and resource requirements that are often left implicit in code. With Floability, our implementation of backpack specifications, backpacks transform ordinary notebooks into portable, reproducible workflows that can execute across heterogeneous HPC clusters with zero to minimal code modification.

We evaluated Floability on three representative scientific workflows—distributed image convolution, climate trend analysis, and high-energy physics data analysis—running them across five heterogeneous HPC systems (Notre Dame CRC, Purdue Anvil, UT Stampede3, OSPool, and AWS). In each case, backpacks successfully captured the required software and data dependencies, provisioned worker environments, and reproduced execution without code changes. While runtime varied due to site-specific infrastructure like schedulers and storage, all workflows completed consistently, demonstrating that backpacks enable portable, reproducible, and scalable execution of notebook workflows across diverse HPC environments.

For all the details, please check out our paper here: 

Md Saiful Islam, Talha Azaz, Raza Ahmad, A D M Shahadat Hossain, Furqan Baig, Shaowen Wang, Kevin Lannon, Tanu Malik, and Douglas Thain, Backpacks for Notebooks: Enabling Containerized Notebook Workflows in Distributed Environments, IEEE Conference on eScience, pages 9, September 2025.

Project website: https://floability.github.io

Example backpacks: https://github.com/floability/floability-examples

Tuesday, September 16, 2025

Workshop on Harmonizing Python Workflows at IEEE e-Science 2025

We helped host the Workshop on Harmonizing Python Workflows at IEEE International Conference on e-Science on Monday, 15 Sep 2025. 


This workshop is one component of our NSF POSE project to explore the creation of an open source ecosystem encompassing a variety of workflow technologies, including the Parsl Project (Kyle Chard at the University of Chicago), the RADICAL tools (Shantenu Jha, Rutgers University), and TaskVine (Douglas Thain, University of Notre Dame).  

Workshop attendees conducted several working groups where they identified key barriers to workflow creation, deployment, and adoption; proposed activities for an open source ecosystem to support these needs; and provided feedback on the possible structure sof an ecosystem. 

The organizers will be following up soon with a workshop report and draft proposal for next steps. 

Thank you to everyone who participated!




Tuesday, September 9, 2025

Welcome Back, Colin!

This past summer, 4th year PhD student Colin Thomas completed an internship at the National Energy Research Scientific Computing Center (NERSC) located at the Lawrence Berkeley National Laboratory.

Colin worked with a team of researchers and fellow interns to develop and deploy an Inference-as-a-Service (IaaS) platform for particle physics experiments including DUNE and ATLAS. Colin organized a network of services deployed on Kubernetes and the Perlmutter supercomputer which enabled remote scientists to run their applications and perform analysis of their data. The IaaS deployment consisted of metrics collection capabilities used to profile the system, identify bottlenecks, and inform the system about the need to scale the compute resources to meet the demands of users in real-time. The team profiled multiple inference-serving technologies such as Ray Serve and NVIDIA Triton. The team effort resulted in a number of successes including tests with scientists from multiple institutions, the gathering of valuable data from the profiles of Ray and Triton, and the system prototype which can serve as an example for future work in scientific inference serving.

Colin shared with us his wonderful experience during the summer and also gave an engaging talk about the work he has done during our first team meeting. Below is the poster he created and presented. We are excited to learn new ideas and look forward to more inspiring discussions ahead.



Tuesday, September 2, 2025

New Semester, New Faces


The new semester is here, and we’re excited to welcome three new colleagues and roll out a clear plan for the months ahead.

New faces. Lax joins as a first-year Ph.D. student. Ryan just completed his M.S. in our lab and begins his Ph.D. this semester. Abby joins as a first-year M.S. student. We’re glad to have them on board!

Each semester we adjust our routines and schedules to keep the lab running smoothly and to support a diverse team. This semester we’re trying a few new approaches. Colin will organize our weekly team talks with a schedule planned about a month ahead, host practice sessions for conference presentations, and invite remote guests for ~30-minute Zoom talks and discussion. Jin will handle outreach and support: a short blog every Tuesday, cross-posting to LinkedIn and other channels, encouraging GitHub Discussions as a support path, and serving as the first responder for incoming technical questions. Saiful will lead the website migration to a Jekyll site on GitHub, move over most existing content, convert prior posts into native Jekyll, and switch our papers database from PHP + database to a JSON dataset. Ben, the most experienced and supportive engineer who knows every line of the codebase, will guide release engineering: review PRs and give feedback, prioritize and maintain issues, keep integration tests in working order, and plan and ship regular software releases.

This semester, alongside our long-standing systems research, we will explore practical ways to pair our work with LLMs. For example, improving developer documentation, identifying code invariants, visualizing logs, and troubleshooting workflows. Ground rules: you are responsible for your output; do not create extra work for others; use ND-authorized tools such as Gemini and Gemini-CLI to protect data. This week each person will pick a modest task and tool to explore. Next week we will report on methods, results, and observations.

Project highlights.

  • CMS Computing (Barry, Jin, Ben, Alan): keep the current TaskVine release stable and usable; share successes with the HEP community; publish the new cms-taskvine-example repository; advance an IPDPS/TPDS paper on dynamic data reduction.

  • NSF CSSI Floability (Saiful, Abby, Ben): keep the release stable, usable, and documented; attach straightforward data handling for existing applications; grow users and collaborations; build interactive visualization of notebook workflows to support troubleshooting and performance; then extend the approach to new notebook and workflow models.

  • NSF Consistency Contracts (Colin + Jin): complete and release the basic toolchain to measure, summarize, and enforce; choose baseline techniques that exploit application behavior.

  • DOE XGFabric (Ryan, Thanh): capture the full software stack; port to multiple ACCESS sites and troubleshoot issues as they appear; evaluate scale, responsiveness, and resource use; develop solutions for HPC batch queue delays.

  • NSF HARMONY: run a fall workshop on workflow collaboration with the eScience conference; capture examples across Parsl-TaskVine (astro), DaskVine (HEP), and RADICAL (xgfabric).

  • NASA-SADE (Lucas, Lax): demonstrate the simulation infrastructure at the September Year 2 review; evaluate OS scalability and performance; begin integrating native SADE components.

We’ll post a short update every Tuesday and syndicate it to LinkedIn and other channels. Questions and ideas are welcome on GitHub Discussions.