Georgia Advanced Computing Resource Center
Revision as of 14:47, 21 August 2012
Aug 22 outage

"W" means "don't know When this task will be done"

Soon: /etc/motd on pcluster, zcluster
Sometime Tuesday: message to users
2 PM: VM snapshots
3 PM:
  disable logins pcluster
  disable logins zcluster (except for jkissing, students)
  kick users off pcluster
  kick users off zcluster?
  drain all nodes or queues, pcluster
  disable all queues except somedevq, zcluster
  kill all jobs pcluster
  kill all jobs zcluster
  do GE jobs testing MPI and storage I/O throughput
    CC to use fsr15 for iozone
    ST to use fsr12 for MPI/NAMD
    PB to use fsr7 for dumb I/O
  stop execd on all nodes, zcluster
  W: shut down racks 8,9,10,11
4 PM:
  modify PanFS blksize on remaining nodes
  shut down VMs
  shut down 3070s
  shut down Panasas

morning W: enable Panasas jumbo frames
morning W: VMware updates
connect ESX IPMI cat5
W: "final rsyncs" of /db, /usr/local
W: Curtis reconfig storage unit NICs/LAGs
W: PanFS 16K blksize on remaining nodes and zcluster.rcc

by 5PM:
  we tell NEG "go ahead"
  probably between 9 and 12PM, NEG at work.

**** AFTER ******

"Midnight" (when NEG is done):
  power up 3070s
  power up Panasas
  power up ESX servers and VMs
  start final rsyncs
  power up racks 8,9,10,11
"Morning" (starting by 8AM)
  panasas OS upgrade
  switch and test /db, /usr/local mounts on the zcluster
  do GE jobs testing MPI and storage I/O throughput

  W: GE upgrade
  W: yum updates of nodes
  W: yum update of zhead
  W: update FW on Dells
  W: move some rack15 nodes to rack 16?
  W: reinstall rack11?
  W: thumper upgrades
  W: rccstor upgrades
  switch and test NFS mounts on pcluster
  W: upgrade PGI compiler
  reenable queues on zcluster
  resume queues on pcluster
  enable logins pcluster
  enable logins zcluster

  email users that outage is over
  contact Lab Storage users about their mounts
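
The "disable all queues except somedevq" and "reenable queues" steps for the zcluster could be scripted roughly as below. This is only a dry-run sketch, assuming the zcluster queues are Grid Engine queues managed with `qmod -d` / `qmod -e`. The queue names other than somedevq are hypothetical; on a live system the list would come from `qconf -sql`. The helper name `queue_commands` is made up for illustration.

```python
import shlex


def queue_commands(queues, keep=(), enable=False):
    """Build one qmod command per queue, skipping any queue named in `keep`.

    enable=False -> `qmod -d <queue>` (disable, before the outage)
    enable=True  -> `qmod -e <queue>` (re-enable, after the outage)
    """
    flag = "-e" if enable else "-d"
    return [["qmod", flag, q] for q in queues if q not in keep]


def render(cmd):
    """Quote a command list so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# Hypothetical queue list; only somedevq appears in the checklist.
queues = ["all.q", "somedevq", "bigmem.q"]

# Before the outage: disable everything except somedevq.
before = queue_commands(queues, keep=("somedevq",))

# After the outage: re-enable the queues that were disabled.
after = queue_commands(queues, keep=("somedevq",), enable=True)

for cmd in before + after:
    print(render(cmd))  # dry run: print the commands, do not execute them
```

Printing the commands instead of running them lets the list be reviewed, or pasted into the outage notes, before anything is changed on the cluster.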