Tuesday, November 25, 2025

Your First Distributed Workflow on ACCESS CI: A Grad Student’s Checklist with TaskVine

What This Guide Is (and Isn’t)

If you’re a student who wants to run real distributed computing experiments—without fighting half a dozen HPC manuals—you’re in the right place. This post is a quick, practical checklist to help you:

  • get an ACCESS CI allocation,
  • choose the right resources (Purdue Anvil, Stampede3, etc.),
  • log in for the first time, and
  • run a minimal TaskVine workflow across real HPC nodes.

This is not a replacement for the official docs. ACCESS CI, Anvil, Stampede3, and TaskVine all have excellent detailed documentation. You’ll find direct links throughout this guide.

Think of this post as your “start here” map—a student-friendly path from “I’ve never used ACCESS” to “I ran my distributed workflow on an NSF-funded HPC machine.”

What Is ACCESS CI

ACCESS CI (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support) is a nationwide, NSF-funded program that gives U.S.-based researchers free computing and storage resources. If you’re doing academic research—even as a first-year grad student or undergrad—you can request allocations and run your workflows on top-tier HPC systems.

When you request an allocation, you create an ACCESS project and choose a project type:
  • Explore — Small, quick-start allocations for testing resources or class projects.
  • Discover — Moderate allocations for active research or larger coursework.
  • Accelerate — Mid-scale allocations for sustained, collaborative research.
  • Maximize — Large-scale, high-impact research requiring major resources.

Most students start with Discover or Explore, and approvals usually take only a few days.

ACCESS CI vs OSPool (super short version)

OSPool is another free national computing service—part of the OSG/OSDF ecosystem—that provides high-throughput compute on opportunistic resources. You log in to a single access point and submit your workflow, and OSPool automatically distributes your tasks to whatever compute resources are available across the country.

ACCESS CI, in contrast, uses an allocation and credit-based model. You request an allocation, receive credits, and then “spend” those credits on specific HPC systems (like Anvil or Stampede3). You get scheduled jobs, predictable queue behavior, and access to fixed hardware configurations (e.g., large-memory nodes, GPUs, high-speed filesystems).

In short:

  • OSPool is great for workloads that are embarrassingly parallel, tolerant of variable performance, and don’t require specific node types.

  • ACCESS CI is great when you need predictable resources, specific hardware (like GPUs or large-memory nodes), or access to a particular HPC environment.

Checklist Overview (The Whole Journey at a Glance)

This is the full path we’ll walk through in the rest of the post:

  1. Create an ACCESS CI account
  2. Find the resources you need for your project
  3. Prepare and submit your request
  4. Pick a site (exchange credits)
  5. Set up SSH keys and log in to the sites
  6. Install TaskVine
  7. Run your first TaskVine workflow
  8. Debugging & Next Steps

Where these steps are documented

Each step is covered in the official ACCESS CI and TaskVine documentation, and direct links appear throughout the sections below. This guide brings all of them together in one place, filling in the gaps so you can follow a smooth end-to-end workflow—from getting an allocation to running your very first distributed TaskVine program on an NSF HPC machine.

Step 1 — Create an ACCESS CI Account

There are two ways to register for an ACCESS account, but the fastest option is to log in with your existing university identity. Most U.S. universities are supported—just choose your campus from the dropdown, click Log On, and follow the email-verification step.


If your university isn’t listed, you can register by creating an ACCESS-specific username and password and setting up Duo MFA.

Full instructions are here:

👉 https://operations.access-ci.org/identity/new-user

Step 2 — Choosing Resources for Your Project

ACCESS CI offers many HPC systems—CPUs, GPUs, large-memory nodes, and cloud resources. The easiest way to browse them is the ACCESS Resource Catalog:


How to pick the right resource

  • Choose the simplest site that fits your needs.
    For most introductory TaskVine workflows, standard CPU-only systems are sufficient.

  • Consider credit cost.
    Each site charges different ACCESS credits per CPU-hour or GPU-hour.
    If you don’t need GPUs or large-memory nodes, pick a CPU-only site to make your credits last longer. 

  • Match the site to your workflow.

    • CPU-only workloads → pick sites that offer regular CPU nodes (no GPUs needed).

    • GPU workloads → choose GPU-capable systems.

    • High-memory jobs → pick sites offering LM/XL nodes.

  • Need help choosing?
    You can ask the ACCESS team for personalized resource suggestions here:
    👉 https://ara.access-ci.org/

For this guide, we’ll use Purdue Anvil and TACC Stampede3 because they are beginner-friendly and well-documented.

Step 3 — Prepare and Submit Your Request

To submit an Explore request, you only need a few pieces of information and some simple PDFs. The full instructions are here:

👉 https://allocations.access-ci.org/get-your-first-project

What to prepare

  • A project title

  • A short overview describing what you plan to run and why you need ACCESS

  • Your own information (you are the PI for this ACCESS project)

  • Optional: grant number if your work is funded

  • Two PDF files:

    • Your CV or résumé (max 3 pages)

    • If you are a graduate student PI, a brief advisor support letter (PDF)

    • (For class projects, upload a syllabus instead of a letter)

  • Available Resources 
    • When the form asks which resources you want, simply check “ACCESS Credits.”

Submitting

Once everything is ready, submit the request from the Allocations Portal. Explore requests are usually approved by the next business day, so keep an eye on your email.

Step 4 — Pick a Site (Exchange Credits)

Once your Explore project is approved, you’ll see it listed in the My Projects section of the Allocations Portal.
 


How to exchange credits

  1. Open your project and go to Credits + Resources.

  2. Click Add a Resource.

  3. In the dropdown, search for the site you selected in Step 2 (e.g., a CPU-only site).

  4. Enter how many SUs, Node Hours, or GPU Hours you want to allocate.

What this means

Exchanging credits “activates” your access to that cluster.
After approval, you’ll be able to log in and submit jobs there.

How long it takes

Exchange requests are usually approved within a few days (up to one week).

Step 5 — Set Up SSH Keys and Log In

Each ACCESS site has its own login process, but the most popular systems—Anvil and Stampede3—are very well documented. On your Credits + Resources page, you’ll see a small book icon next to every site. Clicking it takes you directly to that site’s official documentation.

For convenience, here are the two sites we are using in this guide:

Anvil Documentation:
👉 https://www.rcac.purdue.edu/knowledge/anvil

Stampede3 Documentation:
👉 https://docs.tacc.utexas.edu/hpc/stampede3/

You should always follow the official documentation, because login steps (especially MFA) may change after this blog is written.
Here is a brief summary of what to expect:

Anvil (Purdue) — Quick Summary

  1. Visit Anvil OnDemand:
    👉 https://ondemand.anvil.rcac.purdue.edu

  2. Log in and set up ACCESS Duo MFA.

  3. Use the OnDemand interface to upload your SSH public key.

  4. Then you can log in via SSH:

    ssh -l x-ACCESSusername anvil.rcac.purdue.edu

Important: Your Anvil username

  • Your Anvil username starts with x-, e.g., x-mislam5.

  • This is not the same as your ACCESS username.

  • You can see your exact Anvil username on your Credits + Resources page next to the Anvil resource entry.

Notes from the official docs

  • Anvil does not accept passwords — only SSH keys.

  • Your ACCESS password will not work for SSH.

  • SSH keys must be created or uploaded through the OnDemand interface.

Stampede3 (TACC) — Quick Summary

  1. Log in to the TACC Accounts Portal:
    https://accounts.tacc.utexas.edu/login?redirect_url=profile

  2. Accept the usage agreement and set up TACC’s MFA.

  3. Then SSH into the system:

    ssh myusername@stampede3.tacc.utexas.edu

You’ll be prompted for your password and MFA token.

Full instructions:
👉 https://docs.tacc.utexas.edu/hpc/stampede3/#access

Step 6 — Install TaskVine

Once you can log in to your HPC site, the next step is installing TaskVine.
TaskVine is part of the CCTools suite and is easiest to install through Conda.

Most HPC sites provide their own Conda modules, but these are often outdated.
We recommend installing a fresh Miniforge in your home directory and using that environment for TaskVine.

Official installation guide:
👉 https://cctools.readthedocs.io/en/stable/install/

Here’s the quick version:

Install CCTools (TaskVine) with Conda

# Create your environment (run once)
conda create -n cctools-env -y -c conda-forge --strict-channel-priority python ndcctools

# Activate the environment (run every time)
conda activate cctools-env

After this, TaskVine is ready to use on your HPC site.
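
To double-check the installation, a minimal sanity test (assuming the cctools-env environment is active; the file name check_taskvine.py is just an example) is to import the module and print where it was loaded from:

# check_taskvine.py -- quick sanity check that TaskVine is importable
import ndcctools.taskvine as vine
print("TaskVine loaded from:", vine.__file__)

If the import fails, double-check that cctools-env is the active environment.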

Step 7 — Run Your First TaskVine Workflow

Now that TaskVine is installed, let’s run your very first distributed workflow.
We’ll use the TaskVine Quickstart example from the official docs, with one small change:

  • Instead of choosing a fixed port, we set port=0, which lets TaskVine automatically pick an available port.

  • We give the manager a name (tv-quickstart-blog) so vine_factory can discover it without you typing hostnames or ports.


Create a directory on a shared filesystem (e.g., $HOME on Anvil or /work/... on Stampede3).
Inside it, create a file named quickstart.py and paste:

# quickstart.py

import ndcctools.taskvine as vine

# Create a new manager
m = vine.Manager(name="tv-quickstart-blog", port=0)
print(f"Listening on port {m.port}")

# Declare a common input file to be shared by multiple tasks.
f = m.declare_url("https://www.gutenberg.org/cache/epub/2600/pg2600.txt", cache="workflow")

# Submit several tasks using that file.
print("Submitting tasks...")
for keyword in ['needle', 'house', 'water']:
    task = vine.Task(f"grep {keyword} warandpeace.txt | wc")
    task.add_input(f, "warandpeace.txt")
    task.set_cores(1)
    m.submit(task)

# As they complete, display the results:
print("Waiting for tasks to complete...")
while not m.empty():
    task = m.wait(5)
    if task:
        print(f"Task {task.id} completed with result {task.output}")

print("All tasks done.")

Run the manager

Open one terminal (or one tmux pane) and activate your environment:

conda activate cctools-env
python quickstart.py

Leave this running. The manager will start, pick an open port, and publish itself to the TaskVine catalog.

Start workers using vine_factory

Instead of running workers manually, we use vine_factory to submit workers through Slurm.

Basic command:

vine_factory -Tslurm --manager-name tv-quickstart-blog


However, each cluster usually requires a few Slurm batch options, which you pass using --batch-options.

Stampede3 example

Stampede3 generally expects a partition and a wall-clock limit:

vine_factory -Tslurm --manager-name tv-quickstart-blog \
  --batch-options "-p icx -t 01:00:00"


More details on partitions and time limits are in the Stampede3 documentation linked above.

Anvil example

Anvil requires selecting a partition (queue). For example:

vine_factory -Tslurm --manager-name tv-quickstart-blog \
  --batch-options "-p wide"


More details on Anvil’s queues are in the Anvil documentation linked above.

What happens next

  • vine_factory submits Slurm jobs that run Vine workers on compute nodes.

  • When those workers start, they automatically connect to the manager using the name tv-quickstart-blog (through the TaskVine catalog).

  • The manager sends tasks to each worker as they connect.


In the first terminal, you will eventually see the results. Each task is computed on a node in the cluster. Depending on the queue and resource availability, it may take some time for workers to start and complete the tasks.

Debugging & Next Steps

If you’ve made it this far and your workflow ran successfully — congratulations!
You’ve just deployed a real distributed TaskVine workflow on an ACCESS HPC cluster.

But if things didn’t work on the first try, here are the most common issues and how to fix them.

Things That Can Go Wrong (and How to Fix Them)

1. Workers are running, but no tasks ever complete

You check squeue and see that the Vine worker jobs are active, but the manager terminal never shows progress.

This almost always means:
the compute node cannot connect back to the manager’s TCP port.

To confirm, run a simple TCP test as a Slurm job (outside TaskVine) that tries to connect from a compute node back to the login node. If that fails, workers will never reach the manager.

This problem depends on site network rules, so test it early.
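
For example, here is a minimal sketch of such a probe (the file name tcp_probe.py and the argument order are just illustrative); copy it to a shared filesystem and run it inside a short Slurm job, passing the manager’s hostname and the port printed by quickstart.py:

# tcp_probe.py -- check whether a compute node can reach the manager's host and port.
# Usage (inside a Slurm job): python tcp_probe.py <manager-hostname> <manager-port>
import socket
import sys

host, port = sys.argv[1], int(sys.argv[2])
try:
    with socket.create_connection((host, port), timeout=10):
        print(f"OK: reached {host}:{port}")
except OSError as e:
    print(f"FAILED: cannot reach {host}:{port} ({e})")
    sys.exit(1)

A FAILED result here means workers started on that partition will not be able to connect either, so ask the site’s support staff which ports or interfaces are reachable from compute nodes.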

2. Missing or incorrect Slurm batch options

Every cluster has slightly different expectations.
Check the site’s documentation for required options like:

  • partition (-p)

  • wall time (-t)

  • constraints

  • account or project flags

If workers never start or get rejected, this is usually the reason.

3. vine_factory should use a shared filesystem

The worker scripts need to access some files provided by vine_factory.
To keep things simple and reliable, you should either:

  • run vine_factory from a shared filesystem (one that is mounted on the compute nodes), or

  • set --scratch-dir to a directory on a shared filesystem.

Check your site’s documentation to see which partitions/directories are visible from compute nodes. For example, some clusters share $HOME, while others require you to use a /work or /project directory.

What’s Next

If your quickstart worked—great! You now have a full TaskVine workflow running on an ACCESS CI cluster.

Here are a few things you can try next:

  • Read the TaskVine manual to explore caching, retries, and performance tuning

  • Increase the number of workers and test scaling

  • Try larger or real research datasets

  • Experiment with additional ACCESS CI sites

TaskVine grows with your research, and you now have the foundation to scale up your workflows with confidence.

Tuesday, November 18, 2025

Scaling SADE (Safety Aware Drone Ecosystem): A Hybrid UAV Simulation System for High-Fidelity Research

Autonomous drones are moving into increasingly complex, real-world environments where safety, compliance, and reliability have to be built in from the start. That's the motivation behind SADE—the Safety Aware Drone Ecosystem. SADE brings together physics-accurate simulation and photorealistic rendering, paired with PX4-based autonomy, to give researchers and developers a realistic, repeatable space to design, test, and validate UAV behavior before it ever leaves the lab.

Under the hood, SADE combines Gazebo for flight dynamics and physics with Unreal Engine 5 for immersive visuals, augmented by Cesium for globe-scale terrain that closely mirrors real geography. PX4 anchors flight control, and Mavlink backed by a "Mavlink Router" keeps telemetry and commands flowing reliably across multiple endpoints. This setup makes it possible to stream camera feeds from the simulated environment back into the autopilot loop, exercise computer vision pipelines under challenging conditions, and verify perception-driven behaviors in a safe, controlled feedback cycle.

The architecture is designed for scale and clarity. A React front end, integrated with Cesium-JS, lets users define zones, geometry, and mission parameters visually. Those configurations are saved by the backend and published to an MQTT broker, which the workflow runner listens to. From there, the runner launches isolated Docker Compose networks to run concurrent simulations without interference. In these stacks, PX4, Gazebo, pose-sending utilities, and shepherd tools (suas-engine) coordinate flight behavior, while an Unreal Engine container—accelerated via NVIDIA's container toolkit—renders the world. A Pixel Streaming signaling server brings UE5 to the browser, and the Mavlink Router consolidates telemetry to a single ground control station for consistent ops.


One of SADE's most useful ideas is the "SADE Zone." These are policy-aware regions that translate rules directly into drone behavior—think altitude caps, no-capture constraints for sensitive areas, or mandatory permission requests before entering and when exiting. Instead of treating compliance as something to bolt on later, SADE embeds policy into the simulation itself so drones "self-behave" according to the rules of the environment. That blend of physical realism and operational policy is essential for regulated and safety-critical use cases.

From a user standpoint, the SADE UI is where everything comes together. It gives you a flexible way to draw zones, set rules, and turn those constraints into autonomous responses. By capturing precise coordinates and dimensions, the UI drives compliance automatically, freeing teams to focus on algorithm design, perception, and mission planning. This approach is especially powerful for rare or risky scenarios that are hard to reproduce in the field but vital for building robust autonomy. SADE is built to handle multiple users and runs at once, intelligently distributing CPU and GPU resources so experiments remain responsive at scale. Looking ahead, the roadmap emphasizes richer workflows, larger multi-vehicle scenarios, and deeper policy enforcement to narrow the gap between simulation and real operations.

If you're working on UAV autonomy, simulation tooling, or regulated operations and want to explore policy-aware zones in photorealistic environments, check out the demo: https://sade.crc.nd.edu/

Tuesday, November 11, 2025

Wrangling Massive Task Graphs with Dynamic Hierarchical Composition

On Thursday, October 30, research engineer Ben Tovar presented our recent work on accelerating the execution of High Energy Physics (HEP) workflows at the PyHEP 2025 Workshop, a hybrid workshop held at CERN. The presentation centered on an execution schema called Dynamic Data Reduction (DDR) that runs on top of TaskVine.

Current HEP analysis tools, like Coffea, provide users with an easy way to express the overall workflow and leverage local vectorization on column-oriented data. However, this often requires expressing the entire computation graph statically from the start. This introduces several issues, such as graph-generation overhead that may take several hours longer than the actual computation, and the creation of computation units that do not fit the resources available.

With DDR, we take advantage of the structure inherent in many HEP applications: when processing multiple collision events, the accumulation (reduction) step is typically both associative and commutative. This means it is unnecessary to pre-determine which processed events are reduced together, and the reduction can instead exploit factors such as data locality. Further, the number of events processed together can respond dynamically to the resources available, and datasets can be processed independently.
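
As a toy illustration (not DDR’s actual API), an associative and commutative reduction produces the same answer no matter how the partial results are grouped, which is what allows the grouping to be decided at run time:

# Toy example: summing "event" values in arbitrary, unevenly sized chunks
# gives the same result as a single reduction, so the grouping can be chosen freely.
from functools import reduce
import operator

events = list(range(1, 10_001))                  # stand-in for per-event quantities

total = reduce(operator.add, events)             # one big reduction

chunks = [events[:100], events[100:5000], events[5000:]]
partials = [reduce(operator.add, c) for c in chunks]
assert reduce(operator.add, partials) == total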

In the DDR application stack, TaskVine acts as the execution platform that distributes the computation to the cluster.


As an example, we ran Cortado, a HEP application that processes 419 datasets, 19,631 files, and 14 TB of data (roughly 12 billion events), in about 5.5 hours using over 1,600 cores at any one time. During the run, some of these cores had to be replaced because of resource eviction.

For more information, please visit the DDR PyPI page at https://pypi.org/project/dynamic-data-reduction/

Tuesday, November 4, 2025

TaskVine Insights - Storage Management: Disk Load Shifting

The University of Notre Dame operates an HTCondor cluster with roughly 20,000 cores for scientific computing. The system consists of heterogeneous machines accessible to all users and is managed by the Center for Research Computing (CRC). As newer, more advanced nodes are added, the cluster continues to expand its computational capacity, enabling increasingly sophisticated research. TaskVine primarily runs on this HTCondor environment, connecting hundreds of workers and executing thousands of tasks while optimizing data locality, disk usage, and scheduling efficiency to maximize throughput. The screenshot below shows the cluster’s activity at 12:10 p.m. on November 4, from this website.


Leveraging node-local storage (NLS) to cache data locally and reduce the I/O bottleneck on the centralized parallel filesystem (PFS) is a core feature of TaskVine. However, in a heterogeneous cluster where nodes differ in hardware and architecture, NLS can be both a strength and a liability. While it provides high I/O bandwidth for intermediate data, heterogeneity often leads to imbalance: some workers may run out of disk space while others remain half empty. This imbalance can cause worker crashes, unnecessary recomputation, and wasted cluster resources.

Disk load skew can arise from several factors. One major cause is cluster heterogeneity. As newer and more powerful compute nodes are added to an HPC cluster, they naturally execute tasks faster and produce more data, leading to uneven disk consumption.

To evaluate how heterogeneity contributes to this imbalance, we executed 20,000 identical tasks involving both computation and disk I/O on 100 workers, each with 16 cores. The figure below shows the number of completed tasks per worker over time. Some workers consistently outperformed others: by the end of the workflow, the fastest worker completed around 350 tasks, while the slowest managed only about 100, which is nearly a fourfold difference. As these faster nodes process tasks more quickly, they accumulate intermediate data at a higher rate and thus face an increased risk of local disk exhaustion.

On the other hand, the imbalance can also arise from the algorithm that determines task placement and data storage. To minimize the overhead of transferring data between nodes, a common approach is to schedule a task where its input data already resides. Once the task completes, its child tasks are often dispatched to the same node to exploit data locality. Over time, this strategy can lead to a subset of nodes accumulating a large amount of intermediate data.

To examine this effect, we ran a DAG-structured high-energy physics (HEP) workflow, DV5, consisting of approximately 250,000 tasks on 100 workers, each with 16 cores, and monitored the storage consumption on each node over time. We performed two runs: one prioritized assigning tasks to workers that already had the input files, while the other favored workers with the most available disk space, shown in the following two figures, respectively.


The results were somewhat surprising: despite the theoretical expectation that the latter policy should yield a more balanced distribution, both runs showed similarly skewed disk usage patterns. One possible explanation is that task scheduling occurs very rapidly and at high concurrency, so when a task becomes eligible, only a few workers are available. As a result, tasks naturally tend to execute on workers that just completed their parent tasks, maintaining locality even without explicit intent.

So, we must acknowledge that disk load imbalance arises naturally, from both cluster heterogeneity and locality-driven scheduling, and therefore must be handled carefully. TaskVine addresses this issue using two complementary techniques: Redundant Replica Removal and Disk Load Shifting. The following sections detail each of these approaches.

1. Redundant Replica Removal

Because TaskVine inputs can migrate between workers to satisfy locality, it is quite common for several nodes to end up caching the same temporary file; the effect is most pronounced on the faster machines, which finish more tasks per unit time and accumulate many more staged inputs than their slower peers. To keep this extra baggage from overwhelming NLS, the TaskVine manager intervenes every time a task completes. It walks the task’s input list, and for each temporary file that now has more replicas than the configured target it inspects the workers currently holding that file. 

Nothing is deleted until every replica reports the READY state, guaranteeing that an in-flight transfer is not disrupted, and the manager double-checks that none of the workers are still executing tasks that depend on the file. Only after these safety checks does it queue up the redundant replicas for removal, restoring balance without jeopardizing correctness. The number of redundant replicas is calculated as the difference between the current number of replicas and the user-specified target, ensuring that at least one replica always remains available for future use. Workers holding these excess replicas are ranked by their free cache space, prioritizing cleanup on those with more available storage to maintain balanced disk utilization.
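
As a rough sketch of the selection logic described above (the real implementation is C code inside the TaskVine manager; all names here are hypothetical):

# Hypothetical sketch: decide which excess replicas of a temporary file to remove.
def replicas_to_remove(holders, target, free_space, still_needed):
    """holders: workers with a READY replica of the file;
    target: user-specified number of replicas to keep;
    free_space: dict mapping worker -> free cache space;
    still_needed: workers running tasks that depend on the file."""
    excess = len(holders) - target
    if excess <= 0:
        return []
    eligible = [w for w in holders if w not in still_needed]
    # Clean up first on workers with the most available cache space.
    eligible.sort(key=lambda w: free_space[w], reverse=True)
    return eligible[:excess]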

2. Disk Load Shifting

In TaskVine, this technique is triggered when a worker reports that a temporary file has been cached. When this happens, the manager’s cache-update handler inspects the newly staged replica and, if shifting is enabled, scans the worker pool to find a lighter host. Only workers that are active, have transfer capability, and do not already hold the file are considered. Candidates that would end up heavier than the source after receiving the file are skipped, leaving the lightest eligible worker to take the replica so that each transfer moves free space toward balance instead of swapping hot spots.

The migration reuses TaskVine’s existing peer-transfer pipeline. The destination streams the file directly from the source, the replica catalog tracks its state from CREATING to READY, and both workers update their transfer counters for admission control. Once the new replica is confirmed, the original worker releases its redundant copy to the cleanup routine, reclaiming disk space that just served its purpose. The work involved is modest, requiring only a single hash-table scan and one network transfer per staged replica, but the payoff is immediate: fast workers stay ahead of their disk usage, slower nodes lend idle capacity, and heterogeneous clusters keep their node-local storage evenly balanced without reducing throughput.
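
And a similarly hypothetical sketch of how a shift target might be chosen, following the rules described above (again, the actual logic lives in the manager’s C code):

# Hypothetical sketch: pick the worker that should receive a shifted replica.
def pick_shift_target(source, workers, filename, filesize, disk_used, cached):
    """source: the worker that just cached the file; disk_used: dict worker -> bytes used;
    cached: dict worker -> set of filenames already held."""
    candidates = []
    for w in workers:
        if w is source or not w.is_active or not w.can_transfer:
            continue                                 # only active workers able to transfer
        if filename in cached[w]:
            continue                                 # skip workers that already hold the file
        if disk_used[w] + filesize >= disk_used[source]:
            continue                                 # never make the target heavier than the source
        candidates.append(w)
    # Send the replica to the lightest eligible worker, if any.
    return min(candidates, key=lambda w: disk_used[w], default=None)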

The following figure compares the NLS usage across all workers over time in the DV5 workflow, before and after enabling the two techniques.

After enabling Redundant Replica Removal and Disk Load Shifting, the NLS usage among workers became much more balanced. As shown in the bottom figure, storage consumption across nodes stayed within a narrow range under 10 GB, compared to over 20 GB and a significant skew before optimization. This indicates that the two techniques effectively prevented disk hotspots and improved overall resource utilization. In terms of overhead, the pre-optimization run completed in 206.85 seconds, while the optimized run took 311.92 seconds, indicating that the additional data transfers introduced a noticeable slowdown.

Both techniques are implemented on the TaskVine manager side in C, but from the user’s perspective they are simple to enable. After creating a manager object through the Python interface, for example

m = vine.Manager(port=[9123, 9130])

you can activate them individually with

m.tune("clean-redundant-replicas", 1)
m.tune("shift-disk-load", 1)

While these modes are effective, they are not always recommended, since the additional data transfers and computations may introduce overhead and reduce overall throughput. However, if your workflow runs on disk-constrained nodes or workers are being evicted due to insufficient storage and you cannot request more disk space, enabling these options can significantly improve stability and performance.