Wednesday, December 17, 2025
Accelerating Coffea Workflows with Persistent Preprocessing Cache
High-energy physics analysis at scale depends on efficient data processing pipelines. When working with ROOT files in distributed computing environments, even small inefficiencies compound quickly, especially when the same preprocessing work gets repeated across hundreds or thousands of workers. That's the problem we're tackling in this pull request for Coffea: eliminating redundant preprocessing overhead through file-backed caching.
Coffea's preprocessor currently recomputes metadata for ROOT files on every run. While this guarantees freshness, it also means that workflows repeatedly pay the cost of extracting file structure, branch information, and entry counts even when those files haven't changed in weeks or months. In clustered environments this creates a particularly acute problem: preprocessing must complete before the actual analysis begins, so workers with lighter preprocessing loads sit idle while stragglers finish. The result is wasted CPU cycles, longer time-to-results, and underutilized infrastructure.
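To make the idea concrete, here is a minimal sketch of file-backed metadata caching, not Coffea's actual implementation: the cache file name, key scheme, and helper names are illustrative assumptions.
# A minimal sketch of file-backed preprocessing caching (illustrative, not Coffea's code).
# Metadata is keyed by file path plus modification time, so unchanged ROOT files
# reuse the cached entry instead of re-extracting branches and entry counts.
import json, os

CACHE_PATH = "preprocess_cache.json"   # assumed cache location for this sketch

def load_cache():
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def save_cache(cache):
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)

def preprocess_file(path, extract_metadata, cache):
    """Return cached metadata if the file is unchanged, otherwise recompute and store it."""
    key = f"{path}:{os.path.getmtime(path)}"
    if key not in cache:
        cache[key] = extract_metadata(path)   # branch names, entry counts, file structure
    return cache[key]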
Tuesday, December 9, 2025
Exploring Execution Strategies and Compositional Trade-Offs in the Context of Large-Scale HEP Workflows
The European Organization for Nuclear Research (CERN) has four main High Energy Physics experiments, the Compact Muon Solenoid (CMS) being one of them. These experiments are already approaching the Exabyte-scale, and data rates are planned to increase significantly with the High-Luminosity Large Hadron Collider (HL-LHC).
Flexible workload specification and execution are critical to the success of the CMS Physics program, which currently utilizes approximately half a million CPU cores across the Worldwide LHC Computing Grid (WLCG) and other HPC centers, executing Directed Acyclic Graph (DAG) workflows for data reprocessing and Monte Carlo production. Each node in the DAG corresponds to a taskset; tasksets can have different resource requirements, such as operating system version, CPU cores, memory, and GPUs, and each taskset can spawn anywhere from one to millions of grid jobs.
In this research, we explore the hybrid spectrum of workflow composition by interpolating between the two currently supported specifications, known as TaskChain and StepChain. They employ very distinct workflow paradigms: TaskChain executes a single physics payload per grid job, whereas StepChain processes multiple payloads within the same job. To address the challenge of heterogeneous workflow requirements, together with an increasingly diverse set of resources, an adaptive workflow specification is essential for efficient resource utilization and increased event throughput.
A DAG workflow simulator, named DAGFlowSim, has been developed to understand the tradeoffs involved in the different workflow constructions. The simulator provides insights into event throughput, resource utilization, disk I/O requirements, and more. Given a sequential DAG with 5 heterogeneous tasksets/nodes, we can analyse the CPU utilization and time per event for the 16 possible workflow constructions. Construction 1 and Construction 16 are the extreme cases already supported, representing a tightly-dependent taskset execution (StepChain-like) and a fully independent execution (TaskChain-like), respectively.
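The count of 16 presumably follows from the chain structure: each of the 4 edges between the 5 sequential tasksets can either be fused into the same grid job (StepChain-style) or split across jobs (TaskChain-style), giving 2^4 = 16 combinations. A small sketch, not part of DAGFlowSim, that enumerates them:
# Enumerate the 16 constructions of a sequential 5-taskset chain: each edge between
# consecutive tasksets is either fused (same grid job) or split (separate jobs).
from itertools import product

tasksets = ["T1", "T2", "T3", "T4", "T5"]

constructions = []
for cuts in product([False, True], repeat=len(tasksets) - 1):
    groups, current = [], [tasksets[0]]
    for task, cut in zip(tasksets[1:], cuts):
        if cut:
            groups.append(current)   # split: the next taskset starts a new grid job
            current = [task]
        else:
            current.append(task)     # fused: the next taskset runs in the same grid job
    groups.append(current)
    constructions.append(groups)

print(len(constructions))   # 16
print(constructions[0])     # fully fused, StepChain-like: [['T1', 'T2', 'T3', 'T4', 'T5']]
print(constructions[-1])    # fully split, TaskChain-like: [['T1'], ['T2'], ['T3'], ['T4'], ['T5']]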
For more details on the methodology and simulation results, please check out the slide deck presented at the ACAT 2025 Workshop.
Tuesday, December 2, 2025
TaskVine Insights - Storage Management: Depth-Aware Pruning
Modern scientific workflows often span tens or hundreds of thousands of tasks, forming deep DAGs (directed acyclic graphs) that handle large volumes of intermediate data. The large number of tasks primarily results from ever-growing experimental datasets and their partitioning into various data chunks. Each data chunk is applied to a subgraph to form a branch, and different branches typically have an identical graph structure. Collectively, these form a gigantic computing graph. This graph is then fed into a WMS (workflow management system), which manages task dependencies, distributes the tasks across an HPC (high performance computing) cluster, and retrieves outputs when they are available. Below is the general architecture of the data partitioning and workflow computation process.
TaskVine serves as the core WMS component in such an architecture. The unique feature of TaskVine lies in leveraging NLS (node-local storage) to alleviate I/O burdens toward the centralized PFS (parallel file system). This allows all intermediate data (or part of it) to be stored and accessed on individual workers' local disks, with peer-to-peer transfers performed on demand. To preserve a degree of redundancy and minimize recomputation in response to worker failures, TaskVine provides an option to tune, for each file, how many generations of consumer tasks must finish before it can be pruned.
In the implementation, pruning is triggered at the moment a task finishes, when its outputs are checked for eligibility to be discarded. Each file carries a pruning depth parameter that specifies how many generations of consumer tasks must finish before the file can be pruned. Depth-1 pruning is equivalent to aggressive pruning, while depth-k pruning (k ≥ 2) retains the file until every output of its consumer tasks has itself satisfied the depth-(k-1) criterion. Formally, the pruning time is defined as P(f) = min { P_k(g) : g is an output of a consumer of f }.
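To make the recursion concrete, here is a small sketch of the depth-k rule, not TaskVine's actual implementation; the data structures are illustrative.
# A toy sketch of the depth-k pruning rule (illustrative, not TaskVine's C implementation).
# Depth 1: a file is prunable once all of its consumer tasks have finished.
# Depth k: a file is prunable once every output of its consumers is prunable at depth k-1.
def prunable(f, depth, consumers, outputs, finished):
    """
    f         : file identifier
    depth     : pruning depth parameter attached to f
    consumers : dict mapping a file to the tasks that read it
    outputs   : dict mapping a task to the files it produces
    finished  : set of tasks that have completed
    """
    if any(t not in finished for t in consumers.get(f, [])):
        return False                 # some consumer of f has not finished yet
    if depth <= 1:
        return True                  # depth-1 is equivalent to aggressive pruning
    return all(
        prunable(g, depth - 1, consumers, outputs, finished)
        for t in consumers.get(f, [])
        for g in outputs.get(t, [])
    )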
Depth-aware pruning can work in conjunction with two other fault tolerance mechanisms in TaskVine: peer replication, in which every intermediate file is replicated across multiple workers, and checkpointing, in which a portion of intermediates are written into the PFS for persistent storage. Subsequent articles will cover these topics to help readers better understand the underlying tradeoffs and know how to select the appropriate parameters to suit their own needs.
TaskVine users can explore the manual to see how to adjust these parameters and navigate the tradeoff between storage efficiency and fault tolerance. For depth-aware pruning, we recommend aggressive pruning (depth-1) as the general default, since it saves the most storage space, and increasing the depth parameter when worker failures are common and additional data redundancy is needed.
Tuesday, November 25, 2025
Your First Distributed Workflow on ACCESS CI: A Grad Student’s Checklist with TaskVine
What This Guide Is (and Isn’t)
This guide walks you through how to:
- get an ACCESS CI allocation,
- choose the right resources (Purdue Anvil, Stampede3, etc.),
- log in for the first time, and
- run a minimal TaskVine workflow across real HPC nodes.
What Is ACCESS CI
ACCESS CI is the NSF-funded program that provides US researchers and students with allocations on national HPC systems. Allocations come in four tiers:
- Explore — Small, quick-start allocations for testing resources or class projects.
- Discover — Moderate allocations for active research or larger coursework.
- Accelerate — Mid-scale allocations for sustained, collaborative research.
- Maximize — Large-scale, high-impact research requiring major resources.
Most students start with Discover or Explore, and approvals usually take only a few days.
ACCESS CI vs OSPool (super short version)
OSPool is another free national computing service—part of the OSG/OSDF ecosystem—that provides high-throughput compute on opportunistic resources. You log in to a single access point and submit your workflow, and OSPool automatically distributes your tasks to whatever compute resources are available across the country.
ACCESS CI, in contrast, uses an allocation and credit-based model. You request an allocation, receive credits, and then “spend” those credits on specific HPC systems (like Anvil or Stampede3). You get scheduled jobs, predictable queue behavior, and access to fixed hardware configurations (e.g., large-memory nodes, GPUs, high-speed filesystems).
In short:
- OSPool is great for workloads that are embarrassingly parallel, tolerant of variable performance, and don’t require specific node types.
- ACCESS CI is great when you need predictable resources, specific hardware (like GPUs or large-memory nodes), or access to a particular HPC environment.
Checklist Overview (The Whole Journey at a Glance)
This is the full path we’ll walk through in the rest of the post:
- Create an ACCESS CI account
- Find the resources you need for your project
- Prepare and submit your request
- Pick a site (exchange credits)
- Set up SSH keys and log in to the sites
- Install TaskVine
- Run your first TaskVine workflow
- Debugging & Next Steps
Where these steps are documented
- Steps 1–4 are well-documented on the ACCESS CI site:
  👉 https://allocations.access-ci.org/get-your-first-project
- Step 5 (logging in, SSH, MFA, environment modules) varies by site and is documented in each system’s user guide (e.g., Anvil, Stampede3).
- Steps 6 and 7 (installing TaskVine and running a TaskVine workflow) are covered in the official TaskVine manual.
This guide brings all of them together in one place, filling in the gaps so you can follow a smooth end-to-end workflow—from getting an allocation to running your very first distributed TaskVine program on an NSF HPC machine.
Step 1 — Create an ACCESS CI Account
Most universities are supported through institutional login, so you can usually sign in with your existing campus credentials. If your university isn’t listed, you can register by creating an ACCESS-specific username, password, and Duo MFA.
Full instructions are here:
👉 https://operations.access-ci.org/identity/new-user
Step 2 — Choosing Resources for Your Project
How to pick the right resource
- Choose the simplest site that fits your needs. For most introductory TaskVine workflows, standard CPU-only systems are sufficient.
- Consider credit cost. Each site charges different ACCESS credits per CPU-hour or GPU-hour. If you don’t need GPUs or large-memory nodes, pick a CPU-only site to make your credits last longer.
- Match the site to your workflow.
  - CPU-only workloads → pick sites that offer regular CPU nodes (no GPUs needed).
  - GPU workloads → choose GPU-capable systems.
  - High-memory jobs → pick sites offering LM/XL nodes.
- Need help choosing? You can ask the ACCESS team for personalized resource suggestions here:
  👉 https://ara.access-ci.org/
For this guide, we’ll use Purdue Anvil and TACC Stampede3 because they are beginner-friendly and well-documented.
Step 3 — Prepare and Submit Your Request
To submit an Explore request, you only need a few pieces of information and some simple PDFs. The full instructions are here:
👉 https://allocations.access-ci.org/get-your-first-project
What to prepare
- A project title
- A short overview describing what you plan to run and why you need ACCESS
- Your own information (you are the PI for this ACCESS project)
- Optional: grant number if your work is funded
- Two PDF files:
  - Your CV or résumé (max 3 pages)
  - If you are a graduate student PI, a brief advisor support letter (PDF)
    (For class projects, upload a syllabus instead of a letter)
- Available Resources
  When the form asks which resources you want, simply check “ACCESS Credits.”
Submitting
- Go to 👉 https://allocations.access-ci.org/
- Click Request New Project
- Select Explore ACCESS
- Fill out the form, upload your PDFs, and click Submit
Step 4 — Pick a Site (Exchange Credits)
How to exchange credits
- Open your project and go to Credits + Resources.
- Click Add a Resource.
- In the dropdown, search for the site you selected in Step 2 (e.g., a CPU-only site).
- Enter how many SUs, Node Hours, or GPU Hours you want to allocate.
What this means
Exchanging credits “activates” your access to that cluster.
After approval, you’ll be able to log in and submit jobs there.
How long it takes
Exchange requests are usually approved within a few days (up to one week).
Step 5 — Set Up SSH Keys and Log In
Here is a brief summary of what to expect:
Anvil (Purdue) — Quick Summary
- Visit Anvil OnDemand:
  👉 https://ondemand.anvil.rcac.purdue.edu
- Log in and set up ACCESS Duo MFA.
- Use the OnDemand interface to upload your SSH public key.
- Then you can log in via SSH:
  ssh -l x-ACCESSusername anvil.rcac.purdue.edu
Important: Your Anvil username
- Your Anvil username starts with x-, e.g., x-mislam5.
- This is not the same as your ACCESS username.
- You can see your exact Anvil username on your Credits + Resources page next to the Anvil resource entry.
Notes from the official docs
- Anvil does not accept passwords — only SSH keys.
- Your ACCESS password will not work for SSH.
- SSH keys must be created or uploaded through the OnDemand interface.
Stampede3 (TACC) — Quick Summary
- Log in to the TACC Accounts Portal:
  https://accounts.tacc.utexas.edu/login?redirect_url=profile
- Accept the usage agreement and set up TACC’s MFA.
- Then SSH into the system:
  ssh myusername@stampede3.tacc.utexas.edu
  You’ll be prompted for your password and MFA token.
Full instructions:
👉 https://docs.tacc.utexas.edu/hpc/stampede3/#access
Step 6 — Install TaskVine
Once you can log in to your HPC site, the next step is installing TaskVine.
TaskVine is part of the CCTools suite and is easiest to install through Conda.
Most HPC sites provide their own Conda modules, but these are often outdated.
We recommend installing a fresh Miniforge in your home directory and using that environment for TaskVine.
Official installation guide:
👉 https://cctools.readthedocs.io/en/stable/install/
Here’s the quick version:
Install CCTools (TaskVine) with Conda
# Create your environment (run once)
conda create -n cctools-env -y -c conda-forge --strict-channel-priority python ndcctools
# Activate the environment (run every time)
conda activate cctools-env
After this, TaskVine is ready to use on your HPC site.
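As a quick sanity check (our suggestion, not an official step), confirm that the Python bindings import from your new environment:
# check_install.py: verify that the TaskVine Python bindings are importable
import ndcctools.taskvine as vine
print("TaskVine bindings loaded from:", vine.__file__)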
Step 7 — Run Your First TaskVine Workflow
Now that TaskVine is installed, let’s run your very first distributed workflow.
We’ll use the TaskVine Quickstart example from the official docs, with two small changes:
- Instead of choosing a fixed port, we set port=0, which lets TaskVine automatically pick an available port.
- We give the manager a name (tv-quickstart-blog) so vine_factory can discover it without you typing hostnames or ports.
Create a working directory on the cluster (for example, $HOME on Anvil or /work/... on Stampede3). Inside it, create a file named quickstart.py and paste the quickstart code; a minimal sketch is shown below.
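Here is a minimal sketch along the lines of the official quickstart; the task commands are illustrative placeholders, but port=0 and the manager name are the two changes described above.
# quickstart.py: a minimal sketch in the spirit of the TaskVine quickstart.
import ndcctools.taskvine as vine

# port=0 lets TaskVine pick any available port; the name lets vine_factory
# discover this manager through the TaskVine catalog.
m = vine.Manager(port=0, name="tv-quickstart-blog")
print(f"TaskVine manager listening on port {m.port}")

# Submit a handful of trivial shell tasks (placeholders for real work).
for i in range(10):
    t = vine.Task(f"echo hello from task {i}")
    t.set_cores(1)
    m.submit(t)

# Collect results as workers connect and run tasks.
while not m.empty():
    t = m.wait(5)
    if t:
        print(f"task {t.id} exited with code {t.exit_code}: {t.output.strip()}")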
Run the manager
Open one terminal (or one tmux pane) and activate your environment:
conda activate cctools-env
python quickstart.py
Leave this running. The manager will start, pick an open port, and publish itself to the TaskVine catalog.
Start workers using vine_factory
Instead of running workers manually, we use vine_factory to submit workers through Slurm.
Basic command:
vine_factory -Tslurm --manager-name tv-quickstart-blog
However, each cluster usually requires a few Slurm batch options, which you pass using --batch-options.
Stampede3 example
Stampede3 generally expects a partition and a wall-clock limit:
vine_factory -Tslurm --manager-name tv-quickstart-blog \
  --batch-options "-p icx -t 01:00:00"
More details: see the Stampede3 user guide.
Anvil example
Anvil requires selecting a partition (queue). For example:
vine_factory -Tslurm --manager-name tv-quickstart-blog \
  --batch-options "-p wide"
More details: see the Anvil user guide.
What happens next
vine_factory submits Slurm jobs that run Vine workers on compute nodes. When those workers start, they automatically connect to the manager using the name tv-quickstart-blog (through the TaskVine catalog). The manager sends tasks to each worker as they connect.
In the first terminal, you will eventually see the results. Each task is computed on a node in the cluster. Depending on the queue and resource availability, it may take some time for workers to start and complete the tasks.
Debugging & Next Steps
If you’ve made it this far and your workflow ran successfully — congratulations!
You’ve just deployed a real distributed TaskVine workflow on an ACCESS HPC cluster. But if things didn’t work on the first try, here are the most common issues and how to fix them.
Things That Can Go Wrong (and How to Fix Them)
1. Workers are running, but no tasks ever complete
You checked squeue and see the Vine worker jobs are active, but the manager terminal never shows progress. This almost always means the compute node cannot connect back to the manager’s TCP port.
To confirm, run a simple TCP test Slurm job (outside TaskVine) that tries to connect from a compute node → login node; a small sketch of such a test follows below. If that fails, workers will never reach the manager.
This problem depends on site network rules, so test it early.
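A minimal sketch of such a test is below; tcp_test.py is our own helper (not part of TaskVine), and the srun options are placeholders you should adapt to your site.
# tcp_test.py: check whether a compute node can open a TCP connection back to the manager.
# Submit it through Slurm, e.g.:
#   srun -p <partition> -t 00:05:00 python tcp_test.py <manager-hostname> <manager-port>
import socket
import sys

host, port = sys.argv[1], int(sys.argv[2])
try:
    with socket.create_connection((host, port), timeout=10):
        print(f"OK: reached {host}:{port} from this node")
except OSError as e:
    print(f"FAILED: could not reach {host}:{port}: {e}")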
2. Missing or incorrect Slurm batch options
Every cluster has slightly different expectations.
Check the site’s documentation for required options like:
- partition (-p)
- wall time (-t)
- constraints
- account or project flags
If workers never start or get rejected, this is usually the reason.
3. Vine Factory should use a shared filesystem
The worker scripts need to access some files provided by vine_factory.
To keep things simple and reliable, you should either:
- run vine_factory from a shared filesystem (one that is mounted on the compute nodes), or
- set --scratch-dir to a directory on a shared filesystem.
Check your site’s documentation to see which partitions/directories are visible from compute nodes. For example, some clusters share $HOME, while others require you to use a /work or /project directory.
What’s Next
If your quickstart worked—great! You now have a full TaskVine workflow running on an ACCESS CI cluster.
Here are a few things you can try next:
- Read the TaskVine manual to explore caching, retries, and performance tuning
- Increase the number of workers and test scaling
- Try larger or real research datasets
- Experiment with additional ACCESS CI sites
TaskVine grows with your research, and you now have the foundation to scale up your workflows with confidence.
Tuesday, November 18, 2025
Scaling SADE (Safety Aware Drone Ecosystem): A Hybrid UAV Simulation System for High-Fidelity Research
Autonomous drones are moving into increasingly complex, real-world environments where safety, compliance, and reliability have to be built in from the start. That's the motivation behind SADE—the Safety Aware Drone Ecosystem. SADE brings together physics-accurate simulation and photorealistic rendering, paired with PX4-based autonomy, to give researchers and developers a realistic, repeatable space to design, test, and validate UAV behavior before it ever leaves the lab.
Under the hood, SADE combines Gazebo for flight dynamics and physics with Unreal Engine 5 for immersive visuals, augmented by Cesium for globe-scale terrain that closely mirrors real geography. PX4 anchors flight control, and Mavlink backed by a "Mavlink Router" keeps telemetry and commands flowing reliably across multiple endpoints. This setup makes it possible to stream camera feeds from the simulated environment back into the autopilot loop, exercise computer vision pipelines under challenging conditions, and verify perception-driven behaviors in a safe, controlled feedback cycle.
The architecture is designed for scale and clarity. A React front end, integrated with Cesium-JS, lets users define zones, geometry, and mission parameters visually. Those configurations are saved by the backend and published to an MQTT broker, which the workflow runner listens to. From there, the runner launches isolated Docker Compose networks to run concurrent simulations without interference. In these stacks, PX4, Gazebo, pose-sending utilities, and shepherd tools (suas-engine) coordinate flight behavior, while an Unreal Engine container—accelerated via NVIDIA's container toolkit—renders the world. A Pixel Streaming signaling server brings UE5 to the browser, and the Mavlink Router consolidates telemetry to a single ground control station for consistent ops.
One of SADE's most useful ideas is the "SADE Zone." These are policy-aware regions that translate rules directly into drone behavior—think altitude caps, no-capture constraints for sensitive areas, or mandatory permission requests before entering and when exiting. Instead of treating compliance as something to bolt on later, SADE embeds policy into the simulation itself so drones "self-behave" according to the rules of the environment. That blend of physical realism and operational policy is essential for regulated and safety-critical use cases.
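As a purely illustrative sketch of what a zone definition might carry (field names are our assumptions, not SADE's actual schema), a policy-aware zone could be expressed as a small structured message like the one the backend publishes to the MQTT broker:
# An illustrative sketch of a policy-aware zone definition (field names are assumptions,
# not SADE's actual schema). A configuration like this could be serialized to JSON and
# published to the MQTT broker for the workflow runner to pick up.
import json

zone = {
    "name": "campus-test-zone",
    "polygon": [                      # lon/lat vertices of the zone boundary
        [-86.24, 41.70],
        [-86.23, 41.70],
        [-86.23, 41.71],
        [-86.24, 41.71],
    ],
    "rules": {
        "max_altitude_m": 60,         # altitude cap inside the zone
        "capture_allowed": False,     # no-capture constraint for sensitive areas
        "entry_permission_required": True,
        "exit_notification_required": True,
    },
}

payload = json.dumps(zone)
print(payload)                        # what would be sent to the broker / backend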
From a user standpoint, the SADE UI is where everything comes together. It gives you a flexible way to draw zones, set rules, and turn those constraints into autonomous responses. By capturing precise coordinates and dimensions, the UI drives compliance automatically, freeing teams to focus on algorithm design, perception, and mission planning. This approach is especially powerful for rare or risky scenarios that are hard to reproduce in the field but vital for building robust autonomy. SADE is built to handle multiple users and runs at once, intelligently distributing CPU and GPU resources so experiments remain responsive at scale. Looking ahead, the roadmap emphasizes richer workflows, larger multi-vehicle scenarios, and deeper policy enforcement to narrow the gap between simulation and real operations.
If you're working on UAV autonomy, simulation tooling, or regulated operations and want to explore policy-aware zones in photorealistic environments, check out the demo: https://sade.crc.nd.edu/
Tuesday, November 11, 2025
Wrangling Massive Task Graphs with Dynamic Hierarchical Composition
With a Dynamic Data Reduction (DDR), we take advantage of structure inherent in many HEP applications: when processing multiple collision events, the accumulation (reduction) step is typically both associative and commutative. This means it is unnecessary to pre-determine which processed events are reduced together, and the grouping can instead leverage factors such as data locality. Further, the number of events processed together can respond dynamically to the resources available, and datasets can be processed independently.
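As a toy illustration (not the DDR API), an associative and commutative accumulator lets chunks be combined in any grouping or order, which is exactly what frees the reduction tree to be shaped around data location and available resources:
# A toy illustration (not the DDR API) of why an associative, commutative
# accumulation step allows processed chunks to be reduced in any grouping or order.
from functools import reduce

def accumulate(a, b):
    # e.g., merging per-chunk counters or histograms
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}

chunks = [
    {"events": 10, "selected": 3},
    {"events": 12, "selected": 5},
    {"events": 8,  "selected": 2},
]

# Any order or pairing gives the same result, so the reduction can be grouped
# dynamically around data locality and the resources available at run time.
left_to_right = reduce(accumulate, chunks)
reordered = reduce(accumulate, list(reversed(chunks)))
assert left_to_right == reordered
print(left_to_right)   # total events: 30, selected: 10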
In the DDR application stack, TaskVine acts as the execution platform that distributes the computation to the cluster.
As an example, we ran Cortado, a HEP application that processes 419 datasets, 19,631 files, and 14 TB of data (totaling roughly 12 billion events), in about 5.5 hours using over 1,600 cores at any one time. During the run, some of these cores had to be replaced because of resource eviction.
For more information, please visit the DDR PyPI page at https://pypi.org/project/dynamic-data-reduction/
Tuesday, November 4, 2025
TaskVine Insights - Storage Management: Disk Load Shifting
The University of Notre Dame operates an HTCondor cluster with roughly 20,000 cores for scientific computing. The system consists of heterogeneous machines accessible to all users and is managed by the Center for Research Computing (CRC). As newer, more advanced nodes are added, the cluster continues to expand its computational capacity, enabling increasingly sophisticated research. TaskVine primarily runs on this HTCondor environment, connecting hundreds of workers and executing thousands of tasks while optimizing data locality, disk usage, and scheduling efficiency to maximize throughput. The screenshot below shows the cluster’s activity at 12:10 p.m. on November 4, taken from the cluster’s status website.
Leveraging node-local storage (NLS) to cache data locally and reduce I/O bottleneck on the centralized parallel filesystem (PFS) is a core feature of TaskVine. However, in a heterogeneous cluster where nodes differ in hardware and architecture, NLS can be both a strength and a liability. While it provides high I/O bandwidth for intermediate data, heterogeneity often leads to imbalance: some workers may run out of disk space while others remain half empty. This imbalance can cause worker crashes, unnecessary recomputation, and wasted cluster resources.
Disk load skew can arise from several factors. One major cause is cluster heterogeneity. As newer and more powerful compute nodes are added to an HPC cluster, they naturally execute tasks faster and produce more data, leading to uneven disk consumption.
To evaluate how heterogeneity contributes to this imbalance, we executed 20,000 identical tasks involving both computation and disk I/O on 100 workers, each with 16 cores. The figure below shows the number of completed tasks per worker over time. Some workers consistently outperformed others: by the end of the workflow, the fastest worker completed around 350 tasks, while the slowest managed only about 100, which is nearly a fourfold difference. As these faster nodes process tasks more quickly, they accumulate intermediate data at a higher rate and thus face an increased risk of local disk exhaustion.
On the other hand, the imbalance can also arise from the algorithm that determines task placement and data storage. To minimize the overhead of transferring data between nodes, a common approach is to schedule a task where its input data already resides. Once the task completes, its child tasks are often dispatched to the same node to exploit data locality. Over time, this strategy can lead to a subset of nodes accumulating a large amount of intermediate data.
To examine this effect, we ran a DAG-structured high-energy physics (HEP) workflow, DV5, consisting of approximately 250,000 tasks on 100 workers, each with 16 cores, and monitored the storage consumption on each node over time. We performed two runs: one prioritized assigning tasks to workers that already had the input files, while the other favored workers with the most available disk space, shown in the following two figures, respectively.
The results were somewhat surprising: despite the theoretical expectation that the latter policy should yield a more balanced distribution, both runs showed similarly skewed disk usage patterns. One possible explanation is that task scheduling occurs very rapidly and at high concurrency, so when a task becomes eligible, only a few workers are available. As a result, tasks naturally tend to execute on workers that just completed their parent tasks, maintaining locality even without explicit intent.
To counter this imbalance, TaskVine implements Disk Load Shifting. The technique is triggered when a worker reports that a temporary file has been cached: the manager’s cache-update handler inspects the newly staged replica and, if shifting is enabled, scans the worker pool to find a lighter host. Only workers that are active, have transfer capability, and do not already hold the file are considered. Candidates that would end up heavier than the source after receiving the file are skipped, leaving the lightest eligible worker to take the replica, so that each transfer moves free space toward balance instead of swapping hot spots.
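The selection logic just described can be summarized in a short sketch (illustrative Python; TaskVine's actual implementation is in C on the manager side, and the attribute names here are our own):
# A simplified sketch (illustrative, not TaskVine's C code) of choosing a
# lighter destination worker for a newly cached replica.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    active: bool
    can_transfer: bool
    disk_used: int                    # bytes currently used on node-local storage
    files: set = field(default_factory=set)

def pick_destination(workers, source, file_size, file_id):
    """Return the lightest eligible worker to receive the replica, or None."""
    best = None
    for w in workers:
        if w is source or not w.active or not w.can_transfer:
            continue                  # only other active, transfer-capable workers qualify
        if file_id in w.files:
            continue                  # skip workers that already hold the file
        if w.disk_used + file_size >= source.disk_used:
            continue                  # skip candidates that would end up heavier than the source
        if best is None or w.disk_used < best.disk_used:
            best = w                  # keep the lightest eligible candidate
    return best                       # None means no migration would improve balance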
The migration reuses TaskVine’s existing peer-transfer pipeline. The destination streams the file directly from the source, the replica catalog tracks its state from CREATING to READY, and both workers update their transfer counters for admission control. Once the new replica is confirmed, the original worker releases its redundant copy to the cleanup routine, reclaiming disk space that just served its purpose. The work involved is modest, requiring only a single hash-table scan and one network transfer per staged replica, but the payoff is immediate: fast workers stay ahead of their disk usage, slower nodes lend idle capacity, and heterogeneous clusters keep their node-local storage evenly balanced without reducing throughput.
The following figure compares the NLS usage across all workers over time in the DV5 workflow, before and after enabling the two techniques.
After enabling Redundant Replica Removal and Disk Load Shifting, the NLS usage among workers became much more balanced. As shown in the bottom figure, storage consumption across nodes stayed within a narrow range under 10 GB, compared to over 20 GB and a significant skew before optimization. This indicates that the two techniques effectively prevented disk hotspots and improved overall resource utilization. In terms of overhead, the pre-optimization run completed in 206.85 seconds, while the optimized run took 311.92 seconds, indicating that the additional data transfers introduced a noticeable slowdown.
Both techniques are implemented on the TaskVine manager side in C, but from the user’s perspective they are simple to enable. After creating a manager object through the Python interface, for example:
m = vine.Manager(port=[9123, 9130]),
you can activate them individually with:
m.tune("clean-redundant-replicas", 1) and m.tune("shift-disk-load", 1).
While these modes are effective, they are not always recommended, since the additional data transfers and computations may introduce overhead and reduce overall throughput. However, if your workflow runs on disk-constrained nodes or workers are being evicted due to insufficient storage and you cannot request more disk space, enabling these options can significantly improve stability and performance.
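Putting the pieces above together, a complete snippet (the same calls as quoted above, wrapped with the import) looks like:
# Enable both storage-balancing options on a TaskVine manager.
import ndcctools.taskvine as vine

m = vine.Manager(port=[9123, 9130])        # listen on a free port in this range
m.tune("clean-redundant-replicas", 1)      # Redundant Replica Removal
m.tune("shift-disk-load", 1)               # Disk Load Shifting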
Tuesday, October 28, 2025
Simulating Digital Agriculture in Near Real-Time with xGFabric
Advanced scientific applications in digital agriculture require coupling distributed sensor networks with high-performance computing facilities, but this integration faces significant challenges. Sensor networks provide low-performance, unreliable data access from remote locations, while HPC facilities offer enormous computing power through high-latency batch processing systems. For example, the Citrus Under Protective Screening (CUPS) project monitors environmental conditions in agricultural facilities to detect protective screening damage and requires real-time computational fluid dynamics simulations to guide interventions. Traditional networking approaches struggle to bridge these contrasting computing domains with the low latency and high reliability needed for near real-time decision making.
To address this challenge, we developed xGFabric, an end-to-end system that couples sensor networks with HPC facilities through private 5G wireless networks. The key innovation is using 5G network slicing to provide reliable, high-throughput connectivity between remote sensors and centralized computing resources. xGFabric integrates the CSPOT distributed runtime system for delay-tolerant networking and the Laminar dataflow programming environment to manage the entire pipeline from data collection to simulation execution. The system masks network interruptions, device failures, and batch-queuing delays by storing all program state in persistent logs, allowing computations to pause and resume seamlessly when resources become available.
To demonstrate the system's effectiveness, we deployed a working prototype connecting sensors at the University of Nebraska-Lincoln via private 5G networks to HPC facilities at multiple institutions. The evaluation showed that the private 5G network achieved uplink throughput of up to 65.97 Mbps and scaled effectively across multiple devices. Message latency through the CSPOT system averaged just 101 milliseconds over 5G, meeting the application's real-time requirements. The OpenFOAM CFD simulation completed in approximately 7 minutes using 64 cores, enabling the system to generate updated environmental predictions within the 30-minute decision window required by agricultural operators.
Overall, xGFabric demonstrated that private 5G networks can successfully bridge the gap between edge sensor deployments and HPC resources for time-critical scientific applications. By providing a unified software stack across all device scales and leveraging 5G's low latency and reliability, the system enables "in-the-loop" high-performance computing for real-time decision support in digital agriculture and other domains requiring coupled sensor-simulation workflows.
This work will be presented by Liubov Kurafeeva at the XLOOP workshop at SC '25 in St. Louis, Missouri.
For more information you can visit our website at: https://sites.google.com/view/xgfabric