Georgia Advanced Computing Resource Center: Difference between revisions
Revision as of 15:05, 21 August 2012
Aug 22 outage

"W" means "don't know When this task will be done"

Soon: /etc/motd on pcluster, zcluster
========================================================================
Wednesday:
  Morning: GD message to users
  2 PM: CC VM snapshots
  3 PM:
    PB disable logins pcluster
    PB disable logins zcluster (except for jkissing, students)
    PB kick users off pcluster
    PB kick users off zcluster?
    PB drain all nodes or queues, pcluster
    PB disable all queues except somedevq, zcluster
    PB kill all jobs pcluster
    PB kill all jobs zcluster
    JS move some rack15 nodes to rack 16?
    do GE jobs testing MPI and storage I/O throughput
      CC to use fsr15 for iozone
      ST to use fsr12 for MPI/NAMD
      PB to use fsr7 for dumb I/O
    PB stop execd on all nodes, zcluster
    shut down racks 8,9,10,11
  4 PM:
    CC reconfigs NICs/LAGs on storage units
    PB modify PanFS blksize on remaining nodes
    CC shut down VMs (wait for Jessie off zcluster)
    shut down 3070s (umount first)
    PB shut down Panasas (umount first)
    CC connect ESX IPMI cat5
  by 5PM: we tell NEG "go ahead"
    CC recable storage unit cat5, and work with Brian M on switch port configs
    probably between 9 and 12PM, NEG finishes
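For the 3 PM step "disable all queues except somedevq, zcluster", here is a minimal sketch of how the per-queue commands could be generated. It assumes the zcluster scheduler is a Grid Engine variant where `qconf -sql` lists cluster queues one per line and `qmod -d <queue>` disables a queue. The helper names and the sample queue names other than somedevq are hypothetical, and the script only prints commands rather than running them.

```python
def parse_queue_list(qconf_output):
    """Parse `qconf -sql` style output: one cluster queue name per line."""
    return [line.strip() for line in qconf_output.splitlines() if line.strip()]


def disable_commands(all_queues, keep=("somedevq",)):
    """Return one `qmod -d` argument list per queue that is not in `keep`."""
    keep_set = set(keep)
    return [["qmod", "-d", queue] for queue in all_queues if queue not in keep_set]


if __name__ == "__main__":
    # Hypothetical sample; on the head node this text would come from `qconf -sql`.
    sample = "all.q\nsomedevq\nbigmem.q\n"
    for argv in disable_commands(parse_queue_list(sample)):
        # Print only, so the list can be reviewed before anything is disabled.
        print(" ".join(argv))
```

Printing first keeps the step reviewable. The same list with `qmod -e` in place of `qmod -d` would cover the later "reenable queues on zcluster" step, under the same assumptions.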
================================================================================
"Midnight" (when NEG is done):
    power up 3070s
    power up Panasas
    CC power up ESX servers and VMs
    PB start final rsyncs
    power up racks 8,9,10,11
"Morning" (starting by 8AM)=======================================================
    PB panasas OS upgrade
    PB switch and test /db, /usr/local mounts on the zcluster
    PB/ST, CC do GE jobs testing MPI and storage I/O throughput
    PB enable Panasas jumbo frames and reboot Panasas
    VMWare updates
    W: PB GE upgrade
    W: PB yum updates of nodes
    W: PB yum update of zhead
    W: PB update FW on Dells
    W: CC thumper upgrades
    W: CC rccstor upgrades
    W: PB reinstall rack11, if time
    W: PB upgrade PGI compiler
    PB switch and test NFS mounts on pcluster
    PB reenable queues on zcluster
    PB resume queues on pcluster
    enable logins pcluster
    enable logins zcluster
    email users that outage is over
    contact Lab Storage users about their mounts
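The notes do not say what the storage I/O throughput jobs actually run, apart from iozone for one tester. As a rough illustration only, a basic sequential-write check could look like the sketch below. The function name, file sizes, and target directory are assumptions, and it is not a substitute for iozone.

```python
import os
import tempfile
import time


def write_throughput(directory, total_bytes=64 * 1024 * 1024, block_size=1024 * 1024):
    """Write total_bytes sequentially into a temp file in `directory`; return MB/s."""
    block = b"\0" * block_size
    written = 0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            while written < total_bytes:
                chunk = block[: min(block_size, total_bytes - written)]
                handle.write(chunk)
                written += len(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # include the flush to storage in the timing
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)  # always clean up the test file
    return written / (1024 * 1024) / elapsed


if __name__ == "__main__":
    # Hypothetical target: point this at the mount under test instead of the temp dir.
    print(f"{write_throughput(tempfile.gettempdir()):.1f} MB/s")
```

Running it once before and once after the jumbo-frame and blksize changes, with the same sizes and from the same node, would give a crude before/after comparison.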