Georgia Advanced Computing Resource Center
Revision as of 15:59, 21 August 2012
Aug 22 outage
"W" means "don't know When this task will be done"
Soon: /etc/motd on pcluster, zcluster
Sometime Tuesday: message to users
Wednesday:
2 PM:
VM snapshots
3 PM:
disable logins pcluster
disable logins zcluster (except for jkissing, students)
kick users off pcluster
kick users off zcluster?
drain all nodes or queues, pcluster
disable all queues except somedevq, zcluster
kill all jobs pcluster
kill all jobs zcluster
JS move some rack15 nodes to rack 16?
do GE jobs testing MPI and storage I/O throughput
CC to use fsr15 for iozone
ST to use fsr12 for MPI/NAMD
PB to use fsr7 for dumb I/O
stop execd on all nodes, zcluster
shut down racks 8,9,10,11
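The zcluster queue step in the 3 PM block ("disable all queues except somedevq") could be scripted roughly as below. This is a minimal dry-run sketch, not the procedure actually used. It assumes zcluster runs Grid Engine (the plan mentions GE and execd) and that queues are disabled with `qmod -d <queue>`. The queue names other than somedevq are hypothetical; in practice the list would likely come from `qconf -sql`.

```python
import shlex

# Queue named in the plan as the one that stays enabled.
KEEP_ENABLED = {"somedevq"}


def build_disable_commands(queues, keep=KEEP_ENABLED):
    """Return one 'qmod -d <queue>' argv list per queue not in `keep`."""
    return [["qmod", "-d", q] for q in queues if q not in keep]


if __name__ == "__main__":
    # Hypothetical queue names, for illustration only.
    example_queues = ["batchq", "longq", "somedevq"]
    for argv in build_disable_commands(example_queues):
        # Dry run: print the commands instead of executing them.
        print(shlex.join(argv))
```

Printing the commands first lets someone review them before anything is run. The matching re-enable step after the outage would presumably use `qmod -e` on the same list.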
4 PM:
CC reconfigs NICs/LAGs on storage units
PB modify PanFS blksize on remaining nodes
CC shut down VMs (wait for Jessie off zcluster)
shut down 3070s
shut down Panasas
connect ESX IPMI cat5
by 5PM:
we tell NEG "go ahead"
CC recable storage unit cat5, and work with Brian M on switch port configs
probably between 9 PM and midnight, NEG finishes
**** AFTER ******
"Midnight" (when NEG is done):
power up 3070s
power up Panasas
power up ESX servers and VMs
start final rsyncs
power up racks 8,9,10,11
"Morning" (starting by 8AM):
Panasas OS upgrade
switch and test /db, /usr/local mounts on the zcluster
do GE jobs testing MPI and storage I/O throughput
enable Panasas jumbo frames and reboot Panasas
VMware updates
W: GE upgrade
W: yum updates of nodes
W: yum update of zhead
W: update FW on Dells
W: reinstall rack11?
W: thumper upgrades
W: rccstor upgrades
switch and test NFS mounts on pcluster
W: upgrade PGI compiler
reenable queues on zcluster
resume queues on pcluster
enable logins pcluster
enable logins zcluster
email users that outage is over
contact Lab Storage users about their mounts
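The "switch and test ... mounts" steps in the morning list could start with a quick check like the one below, run on a node after the switch. It is a minimal sketch under stated assumptions: the paths /db and /usr/local are the ones named in the plan for zcluster, and the helper name `check_mounts` is hypothetical. It only reports whether each path is a mount point; it does not confirm which server the mount comes from.

```python
import os


def check_mounts(paths):
    """Map each path to True if it is currently a mount point."""
    return {p: os.path.ismount(p) for p in paths}


if __name__ == "__main__":
    # Paths taken from the zcluster step in the plan.
    for path, mounted in check_mounts(["/db", "/usr/local"]).items():
        print(f"{path}: {'mounted' if mounted else 'NOT mounted'}")
```

To verify that a mount points at the new storage unit and not the old one, the mount source would also need to be compared, for example by reading /proc/mounts on Linux.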