Difference between revisions of "Georgia Advanced Computing Resource Center"

From Research Computing Center Wiki
 
==Projects==
 
 
===[[Aug 22 outage]]===
 
"W" means "don't know When this task will be done"
 
 
Soon: /etc/motd on pcluster, zcluster
 
 
========================================================================
 
Wednesday:
 
 
Morning: GD message to users
 
 
2 PM: CC VM snapshots
 
 
3 PM:
 
  PB disable logins pcluster: restrict_sshd
 
  PB disable logins zcluster (except for jkissing, students): restrict_sshd on zcluster
 
  PB kick users off pcluster
 
  PB kick users off zcluster but not jkissing or students
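The "kick users off ... but not jkissing or students" step can be previewed before anyone is logged out. The following is a minimal sketch, not the procedure actually used: the function name `users_to_kick` and the sample `who` output are hypothetical, and it treats `students` as a literal login name (if `students` is a group, its members would have to be listed instead). It only prints names and does not terminate any session.

```shell
#!/bin/sh
# users_to_kick: read `who`-style lines on stdin and print each logged-in
# user once, skipping every name given as an argument (the exempt list).
users_to_kick() {
    awk -v keep="$*" '
        BEGIN { n = split(keep, k, " "); for (i = 1; i <= n; i++) ok[k[i]] = 1 }
        !($1 in ok) && !seen[$1]++ { print $1 }
    '
}

# Canned sample standing in for real `who` output (hypothetical users/hosts).
sample_who() {
    cat <<'EOF'
alice    pts/0        2012-08-22 09:01 (host1.example.edu)
jkissing pts/1        2012-08-22 10:15 (host2.example.edu)
bob      pts/2        2012-08-22 11:30 (host3.example.edu)
alice    pts/3        2012-08-22 12:02 (host1.example.edu)
EOF
}

# Exempt names are passed as arguments; everyone else is listed.
sample_who | users_to_kick jkissing students
```

On a live head node the same function could be fed with `who | users_to_kick jkissing students` to review the list first.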
 
 
  PB drain all nodes or queues, pcluster: llctl -g drain
 
  PB disable all queues except somedevq, zcluster
 
 
  PB kill all jobs pcluster: llcancel $(llq | grep -w R | awk '{print $1}') 2>&1 | tee llcancel.log
 
  PB kill all jobs zcluster: qdel $( qstat -s r | egrep '^1' | awk '{print $1}') 2>&1 | tee job.qdel.log
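The job-ID extraction in the zcluster command above can be checked offline before it is handed to `qdel`. This is a hedged sketch under stated assumptions: the helper name `running_job_ids` and the sample listing (jobs, users, queue names) are hypothetical, and it assumes the job ID is the first whitespace-separated field of each `qstat -s r` line. It matches any all-digit first field, so it would also pick up IDs that are right-aligned with leading spaces or that do not begin with `1`, which a `'^1'` pattern might skip.

```shell
#!/bin/sh
# running_job_ids: read `qstat -s r`-style text on stdin and print the
# numeric job IDs, one per line, with repeated IDs collapsed.
running_job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# Canned sample in the usual qstat column layout (all values hypothetical).
sample_qstat() {
    cat <<'EOF'
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID
-----------------------------------------------------------------------------------------------------------
1204511 0.50500 namd_run   alice        r     08/21/2012 09:14:02 someq@node-12-3        8
 987654 0.50500 blast_a    bob          r     08/20/2012 17:40:11 someq@node-8-1         1
1204511 0.50500 namd_run   alice        r     08/21/2012 09:14:02 someq@node-12-4        8
EOF
}

# Header and separator lines are ignored because their first field is not numeric.
sample_qstat | running_job_ids
```

Reviewing the printed IDs first, and only then passing them to `qdel`, keeps the destructive step separate from the parsing step.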
 
 
  JS move some rack15 nodes to rack 16?
 
 
  do GE jobs testing MPI and storage I/O throughput
 
    CC to use fsr15 for iozone
 
    ST to use fsr12 for MPI/NAMD
 
    PB to use fsr7 for dumb I/O
 
 
  PB stop execd on all nodes, zcluster: qconf -ke all 2>&1 | tee qconf.ke.log
 
  shut down racks 8,9,10,11
 
 
4 PM:
 
  CC reconfigs NICs/LAGs on storage units
 
  PB modify PanFS blksize on remaining nodes
 
  CC shut down VMs (wait for Jessie off zcluster)
 
  shut down 3070s (umount first)
 
  PB shut down Panasas (umount first)
 
  CC connect ESX IPMI cat5
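The "umount first" steps above imply confirming that nothing is still mounted before the storage is shut down. A minimal sketch of such a check follows; the function name `mounts_of_type` and the sample mount table are hypothetical, and the filesystem type string `panfs` is an assumption about how the Panasas mounts appear in `/proc/mounts`.

```shell
#!/bin/sh
# mounts_of_type: read /proc/mounts-style lines on stdin and print the
# mount point (field 2) of every entry whose type (field 3) equals $1.
mounts_of_type() {
    awk -v t="$1" '$3 == t { print $2 }'
}

# Canned sample standing in for /proc/mounts (hypothetical devices and paths).
sample_mounts() {
    cat <<'EOF'
/dev/sda1 / ext3 rw 0 0
panfs://10.0.0.1:global /panfs panfs rw,noatime 0 0
fileserver:/export/home /home nfs rw,vers=3 0 0
EOF
}

# Any line printed here is a mount that still needs to be unmounted.
sample_mounts | mounts_of_type panfs
```

On a node, `mounts_of_type panfs < /proc/mounts` printing nothing would suggest that the unmount step is complete for that filesystem type.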
 
 
 
by 5PM:
 
we tell NEG "go ahead"
 
CC recable storage unit cat5, and work with Brian M on switch port configs
 
 
probably between 9 PM and midnight, NEG finishes
 
 
================================================================================
 
 
"Midnight" (when NEG is done):
 
power up 3070s
 
power up Panasas
 
CC power up ESX servers and VMs
 
PB start final rsyncs
 
power up racks 8,9,10,11
 
give yea or nay to NEG
 
 
================================================================================

"Morning" (starting by 8AM):
 
 
PB Panasas OS upgrade
 
PB switch and test /db, /usr/local mounts on the zcluster
 
 
PB/ST, CC do GE jobs testing MPI and storage I/O throughput
 
 
PB enable Panasas jumbo frames and reboot Panasas
 
 
VMware updates
 
 
    W: PB GE upgrade
 
    W: PB yum updates of nodes
 
    W: PB yum update of zhead
 
    W: PB update FW on Dells
 
    W: CC thumper upgrades
 
    W: CC rccstor upgrades
 
    W: PB reinstall rack11, if time
 
    W: PB upgrade PGI compiler
 
 
PB switch and test NFS mounts on pcluster
 
 
PB reenable queues on zcluster
 
PB resume queues on pcluster
 
 
enable logins pcluster
 
enable logins zcluster
 
 
email users that outage is over
 
contact Lab Storage users about their mounts
 
  
 
==Clusters==
 

Revision as of 13:58, 12 September 2012