Georgia Advanced Computing Resource Center
Revision as of 15:55, 21 August 2012
Aug 22 outage
"W" means "don't know When this task will be done"
Soon: /etc/motd on pcluster, zcluster
Sometime Tuesday: message to users
Wednesday:
2 PM: VM snapshots
3 PM:
  disable logins pcluster
  disable logins zcluster (except for jkissing, students)
  kick users off pcluster
  kick users off zcluster?
  drain all nodes or queues, pcluster
  disable all queues except somedevq, zcluster
  kill all jobs pcluster
  kill all jobs zcluster
  do GE jobs testing MPI and storage I/O throughput
    CC to use fsr15 for iozone
    ST to use fsr12 for MPI/NAMD
    PB to use fsr7 for dumb I/O
  stop execd on all nodes, zcluster
  shut down racks 8,9,10,11
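
The zcluster queue step in the 3 PM block (disable every queue except somedevq) could be scripted roughly as follows. This is a minimal sketch, assuming the zcluster runs Grid Engine with its standard `qconf -sql` (list queue names) and `qmod -d` (disable a queue) commands. The function name `disable_commands` and the sample queue names other than somedevq are hypothetical. The code only builds command lines and does not run anything.

```python
import shlex

# Queue named in the plan as the one that stays enabled.
KEEP_ENABLED = {"somedevq"}


def disable_commands(queue_names, keep=KEEP_ENABLED):
    """Return one 'qmod -d <queue>' command line per queue not in `keep`."""
    commands = []
    for name in queue_names:
        name = name.strip()
        if not name or name in keep:
            continue  # skip blank entries and queues that must stay up
        commands.append("qmod -d " + shlex.quote(name))
    return commands


if __name__ == "__main__":
    # Hypothetical queue list; on a real head node it would come from `qconf -sql`.
    for line in disable_commands(["somedevq", "longq", "shortq"]):
        print(line)
```

Printing the commands first gives a chance to review the list before anything is disabled.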
4 PM:
  CC reconfigs NICs/LAGs on storage units
  PB modify PanFS blksize on remaining nodes
  PB shut down VMs
  shut down 3070s
  shut down Panasas
  connect ESX IPMI cat5
    W: "final rsyncs" of /db, /usr/local
    W: Curtis reconfig storage unit NICs/LAGs
    W: PanFS 16K blksize on remaining nodes and zcluster.rcc
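
For the "final rsyncs" of /db and /usr/local listed above, the command lines could be generated along these lines. This is only a sketch: the destination host name `newstore` is hypothetical (the plan does not name the target), and the choice of `-aH --delete` is an assumption about what a final sync should do. The function returns an argument list and defaults to rsync's `--dry-run` mode so nothing is copied or deleted by accident.

```python
def final_rsync_command(path, dest_host, dry_run=True):
    """Build an rsync argument list mirroring `path` to the same path on `dest_host`."""
    # Trailing slash: sync the directory's contents, not the directory itself.
    src = path.rstrip("/") + "/"
    # -a archive mode, -H preserve hard links, --delete drop files gone from the source
    # (assumed options; adjust to the site's actual policy).
    args = ["rsync", "-aH", "--delete"]
    if dry_run:
        args.append("--dry-run")
    args += [src, f"{dest_host}:{src}"]
    return args


if __name__ == "__main__":
    # "newstore" is a placeholder destination, not a real host from the plan.
    for path in ("/db", "/usr/local"):
        print(" ".join(final_rsync_command(path, "newstore")))
```

The returned list can be passed to `subprocess.run` once the dry-run output looks right.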
by 5PM:
we tell NEG "go ahead"
CC recable storage unit cat5, and work with Brian M on switch port configs
NEG probably finishes between 9 PM and midnight
****  AFTER ******
"Midnight" (when NEG is done):
power up 3070s
power up Panasas
power up ESX servers and VMs
start final rsyncs
power up racks 8,9,10,11
"Morning" (starting by 8AM):
panasas OS upgrade
switch and test /db, /usr/local mounts on the zcluster
do GE jobs testing MPI and storage I/O throughput
enable Panasas jumbo frames and reboot Panasas
VMware updates
   W: GE upgrade
   W: yum updates of nodes
   W: yum update of zhead
   W: update FW on Dells
   W: move some rack15 nodes to rack 16?
   W: reinstall rack11?
   W: thumper upgrades
   W: rccstor upgrades 
   W: VMware updates (morning)
switch and test NFS mounts on pcluster
   W: upgrade PGI compiler
reenable queues on zcluster
resume queues on pcluster
enable logins pcluster
enable logins zcluster
email users that outage is over
contact Lab Storage users about their mounts
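
The plan calls for storage I/O throughput tests before and after the outage, including a "dumb I/O" check. One very rough way to do such a check is sketched below, assuming a simple sequential write is an acceptable stand-in. The function name `write_throughput_mb_s` and its default sizes are hypothetical, and this is not a substitute for the iozone runs mentioned in the plan.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=64, block_kb=1024):
    """Write about total_mb of zeros to a temp file in `directory`; return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        start = time.perf_counter()
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # wait for the data to reach storage before stopping the clock
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    # Point this at the mount under test; the system temp dir is only a placeholder.
    print(f"{write_throughput_mb_s(tempfile.gettempdir()):.1f} MB/s")
```

Running it against the same mount before and after the changes gives a crude comparison. Note that zero-filled data can flatter filesystems that compress or deduplicate.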