Adaptive Computing Professional Services engagement
The production torque.cfg file contains "QSUBHOST <hostname>.ecompute".
Generally we want to incentivize short walltime requests and resource-request accuracy, not just so Moab can be more efficient, but also so admins can more readily assess which of a (hopefully small) handful of settings to change in order to keep job flow going as the overall running and pending workload profile changes. Our first pass at this is to have multiple queues, defined as buckets of core, RAM, and walltime allocation, and to change the relative resource allocation or reservation granted to jobs in the different buckets.
Our largest business goal is doing this in a way that says "we're flexibly permitting both many-core 'HPC' jobs and largemem/large-data 'HTC' jobs to flow, in a way that we can reasonably steer as needs dictate". Others are:
- "short general-queue jobs can harvest idle buyin node cycles"
- protecting jobs from network, queueing system, or filesystem failures (e.g. with NHC)
- helping users in their goal of resource request accuracy
- remaining able to grant arbitrary exceptions to otherwise global policies to handle special cases ("PI has publication deadline and has 20 100-core jobs to run this week", e.g.)
Topics other than the "config points" below:
- how the Prof Svcs engagement is going so far, with respect to the amount and timeliness of info exchange, and also with respect to making sure our expectations are realistic given whatever PS time remains.
- review Moab monitor mode and expectations for what we do with it next week
Here are specific config points UGA has brought up:
- jobs requesting one of a set of mutually exclusive features (e.g. Intel vs AMD, QDR vs EDR) should always and only run on nodes possessing those features. Not an issue if those node features are assigned properly; then users (or queues/SRs/QOSes) just request them.
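For example (node names, core counts, and feature names below are hypothetical), features live in the Torque nodes file as node properties, and jobs then request them:

```
# $TORQUE_HOME/server_priv/nodes
n001 np=28 intel edr
n002 np=64 amd qdr
```

A user (or a class/SR/QOS default) then requests matching nodes with something like `qsub -l nodes=2:ppn=28:intel:edr job.sh`.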
- We'd like multi-node jobs to have either (a) a global preference for running on nodes all within the same internodal-fabric switch; or (b) a way for a user to specify that a given job be treated that way. This is done by using switch and fabric names, only, as nodesets, and using the various NODESET parameters and NODESETPLUS DELAY.
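A sketch of what that might look like in moab.cfg, assuming node features named after switches (switch names and the delay value here are made up, and the per-job syntax should be confirmed with Adaptive):

```
# Group nodes by the switch feature they carry; prefer, but do not
# require, placing a multi-node job entirely within one switch.
NODESETPOLICY      ONEOF
NODESETATTRIBUTE   FEATURE
NODESETLIST        switch01,switch02,switch03
NODESETISOPTIONAL  TRUE
# Let a job wait briefly for a single-switch fit before spanning:
NODESETPLUS        DELAY
NODESETDELAY       5:00

# Per-job opt-in (variant (b)) is reportedly an RM extension, e.g.:
#   qsub -l nodeset=ONEOF:FEATURE:switch01 job.sh
```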
- which Adaptive commands need visibility to anything stored in /var/spool/torque on the PBS server or on compute nodes?
- "HPC vs HTC" distinction
- easy way for users to fetch, or have pushed to them, info about resource request accuracy (used vs requested)? showstats does some of this.
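For reference, a couple of existing views of used-vs-requested (the jobid below is a placeholder):

```
showstats -u        # per-user stats, incl. dedicated vs. utilized proc-hours
tracejob <jobid>    # Torque server logs: Resource_List vs. resources_used
```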
- common practices for tmp-type dirs and cgroups' ephemeral per-job /tmp? N/A.
- any easy-ROI candidates for Moab "generic events"? Will compare to production config.
- common uses for prologs/epilogs? Going to use NHC. Can we e.g. customize qsub rejection messages? Use job submit filters.
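As a sketch of the submit-filter idea (filename and policy here are hypothetical; Torque runs whatever torque.cfg's SUBMITFILTER points at, pipes the job script through it, and a nonzero exit makes qsub abort the submission, showing our stderr text to the user):

```shell
# Hypothetical submit filter: echo the script back and reject
# (nonzero exit) any script that never requests a walltime.
# Note: this checks only #PBS directives embedded in the script; a
# real filter would also need to handle qsub command-line arguments.
filter_job() {
    saw_walltime=no
    while IFS= read -r line; do
        case "$line" in
            "#PBS "*"walltime="*) saw_walltime=yes ;;
        esac
        printf '%s\n' "$line"
    done
    if [ "$saw_walltime" = no ]; then
        echo "qsub rejected: please request a walltime, e.g. -l walltime=01:00:00" >&2
        return 1
    fi
    return 0
}

# In the real filter file, the last line would simply be:
# filter_job
```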
- what are candidate use cases for Processor Equivalents? Does it make sense for UGA to use these for scheduling (priorities) or reporting?
- general implementation of condo model (SRs, QOSes, ACLs, etc.), permitting short-enough "unprivileged" jobs to use idle PI-owned HW. Done, knowing backfill, RESERVATIONDEPTH, and priority factors will have to be tuned.
- can node owners qsub such that "use my node(s) if you can; otherwise use whatever I have access to"?
- can node owners qsub such that "this multi-node job can use some nodes in "my" queue and some generally available nodes"? Not when there are distinct classes involved.
- how to route "short" jobs (< 1 hr runtime, say) such that when submitted to the general queue, such jobs might land on buyin nodes (where the submitting user otherwise would not have privilege)? We won't be doing preemption or job suspension.
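One way to sketch this (group, node, and time values hypothetical): give the owner's standing reservation an ACL that also admits sufficiently short outside jobs, relying on reservation affinity rather than preemption. Whether MAXTIME behaves exactly this way in our version is something to confirm with Adaptive:

```
# moab.cfg: n00[1-8] belong to group smithlab; jobs from anyone
# requesting <= 1 hour of walltime may also use the reservation.
SRCFG[smithlab]  HOSTLIST=n00[1-8]
SRCFG[smithlab]  PERIOD=INFINITY
SRCFG[smithlab]  GROUPLIST=smithlab
SRCFG[smithlab]  MAXTIME=1:00:00
```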
- what CLASSCFG functionality (PRIORITY, limits like MAXJOB) does a Moab remapped class have that a Torque routing queue doesn't? Remapped classes are configurable much as execution classes are.
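A minimal remapping sketch for moab.cfg (class names and limit values are hypothetical placeholders):

```
# Submissions to "batch" are remapped to the first listed class
# whose constraints fit the job; each target class carries its own
# CLASSCFG limits and priority, unlike a Torque routing queue.
REMAPCLASS       batch
REMAPCLASSLIST   short,long
CLASSCFG[short]  MAX.WCLIMIT=1:00:00   PRIORITY=1000
CLASSCFG[long]   MAX.WCLIMIT=120:00:00 PRIORITY=100 MAXJOB=500
```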
- is there a way to have a compute node express differential preferences for jobs pending in different queues?
- the "aquarinode" shared-access-to-buyin-nodes model
- what's involved in migrating from "Owner PIs get privileged access to their own HW" to "Owner PIs get privileged access to an amount of resources, whichever HW might provide them to a given job"?
- What to look out for when putting a node in multiple queues (we might want our GPU nodes avail for CPU-only jobs, too, e.g.)?
- wrapper scripts, 3rd-party tools ("PBS tools" from Ohio, one-off scripts like 'tracknodes' or 'reaver', e.g.) that can help us be better/faster admins? Showing usernames w/ job priorities example. N/A.
- we want to be able to say "make this node become idle sometime between times T1 and T2, and reboot when idle". What's a simple, robust way to do that? Administrative reservation, and use unrelated method to cause reboots when idle.
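The reservation half of that might look like the following (node name and times are hypothetical); the reboot half would be a cron/NHC check that reboots once no "jobs =" line appears for the node in pbsnodes output:

```
# Block new jobs from starting on n013 during the T1-T2 window:
mrsvctl -c -h n013 -s 22:00_06/20 -d 8:00:00
```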
- "cgroups, memory, and multinode jobs". Let's say a job requests 32 cores and pmem:1gb. If 8 allocated cores are within the same node, will each core have access to a distinct 1GB of pmem, or will there be a node-wide 8-GB pmem allocation, accessible to all 8 threads on the node?
- MPI stacks linked against versioned /path/to/torque/libs
Haven't mentioned these yet:
- what should we do, and monitor, with fairshare? Relation between it and the existence of multiple queues?
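A possible moab.cfg starting point to discuss (weights, depth, and decay below are placeholders to tune, not recommendations):

```
FSPOLICY      DEDICATEDPS   # charge by dedicated proc-seconds
FSDEPTH       7             # keep 7 windows of history
FSINTERVAL    24:00:00      # one-day windows
FSDECAY       0.8           # older windows count progressively less
FSUSERWEIGHT  100
FSGROUPWEIGHT 10
```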
- what should we do, and monitor, with backfill?
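For backfill, the knobs we expect to be watching (the depth value is a placeholder):

```
BACKFILLPOLICY    FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
# How many high-priority jobs get reservations that backfill must
# respect; interacts with the condo SRs above:
RESERVATIONDEPTH  4
```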
- how much attention to pay to the "qsub -L" NUMA-aware stuff? N/A.