Friday, February 14, 2014

Visualizing 10,000 Cores in Condor

Our Condor pool at the University of Notre Dame has been slowly growing, in no small part due to our collaboration with the Center for Research Computing, where it is now scavenging unused cycles from HPC clusters at the CRC.  When the dedicated batch system leaves a node unused, Condor is started on that node and keeps going until the dedicated system wants the node back.  Depending on the time of year, that leaves anywhere between 4K and 10K nodes available in the Condor pool.

We have tried a number of approaches at visualizing this complex system over the years.  Our latest tool, the Condor Matrix Display started as a summer project by Nick Jaeger, a student from the University of Wisconsin at Eau Claire.  The display shows a colored bar for each slot in the pool, where the width is proportional to the number of cores.

With a quick glance, you can see how many users are busy and whether they are running "thin" (1 core) or "fat" (many core) jobs.  Sorting by the machine name gives you sense of how each sub-cluster in the pool is used:


While sorting by users gives you a sense of what users are dominating the pool:


The display is always a nice way of viewing the relatively new feature of "dynamic slot" in Condor.  A large multi-core machine is now represented as a single slot with multiple resources.  For example, this bit of the display shows a cluster of 8-core machines where some of the machines are unclaimed (green), some are running 4-core jobs (blue), and some are running 1-core jobs (green):