Difference between revisions of "Georgia Advanced Computing Resource Center"

From Research Computing Center Wiki
Revision as of 15:59, 21 August 2012

Aug 22 outage

"W" means "don't know When this task will be done"

Soon: /etc/motd on pcluster, zcluster

Sometime Tuesday: message to users

Wednesday:
2 PM: VM snapshots

3 PM:
  disable logins pcluster
  disable logins zcluster (except for jkissing, students)

  kick users off pcluster
  kick users off zcluster?

  drain all nodes or queues, pcluster
  disable all queues except somedevq, zcluster

  kill all jobs pcluster
  kill all jobs zcluster
  JS move some rack15 nodes to rack 16?
  do GE jobs testing MPI and storage I/O throughput
    CC to use fsr15 for iozone
    ST to use fsr12 for MPI/NAMD
    PB to use fsr7 for dumb I/O

  stop execd on all nodes, zcluster
  shut down racks 8,9,10,11
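
The zcluster queue and job steps in the 3 PM block could be scripted roughly as follows. This is a dry-run sketch only: it assumes zcluster runs Grid Engine (the "GE" mentioned above) with `qconf`, `qmod` and `qdel` available to a GE manager account, and it only builds the command lines rather than running them. The function names and the example queue names (other than somedevq) are hypothetical.

```python
import shlex

KEEP_QUEUES = {"somedevq"}  # queue named in the checklist; stays enabled


def disable_commands(all_queues, keep=KEEP_QUEUES):
    """Return one `qmod -d` command (as an argv list) per queue to disable."""
    return [["qmod", "-d", q] for q in sorted(all_queues) if q not in keep]


def kill_all_jobs_command():
    """Return the argv that deletes every user's jobs (needs GE manager rights)."""
    return ["qdel", "-u", "*"]


def render(argv):
    """Quote an argv list so it can be pasted into a shell."""
    return " ".join(shlex.quote(a) for a in argv)


if __name__ == "__main__":
    # Hypothetical queue names; on a real cluster take them from `qconf -sql`.
    queues = ["all.q", "somedevq", "bigmemq"]
    for cmd in disable_commands(queues) + [kill_all_jobs_command()]:
        print(render(cmd))
```

Printing the commands first gives a chance to review them before the outage window; the same list can be replayed with `qmod -e` in place of `qmod -d` when the queues are reenabled.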

4 PM:
  CC reconfigs NICs/LAGs on storage units
  PB modify PanFS blksize on remaining nodes
  CC shut down VMs (wait for Jessie off zcluster)
  shut down 3070s
  shut down Panasas
  connect ESX IPMI cat5


By 5 PM:
  we tell NEG "go ahead"
  CC recable storage unit cat5, and work with Brian M on switch port configs

Probably between 9 PM and midnight, NEG finishes

****  AFTER ******
"Midnight" (when NEG is done):
power up 3070s
power up Panasas
power up ESX servers and VMs
start final rsyncs
power up racks 8,9,10,11

"Morning" (starting by 8 AM):

Panasas OS upgrade
switch and test /db, /usr/local mounts on the zcluster

do GE jobs testing MPI and storage I/O throughput

enable Panasas jumbo frames and reboot Panasas
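
After the Panasas reboot, one way to confirm that jumbo frames work end to end is a ping that is not allowed to fragment and whose payload fills the MTU. The sketch below assumes an MTU of 9000 (the checklist does not state one), IPv4, and the Linux iputils `ping` flags `-M do`, `-s` and `-c`; the host name is a placeholder.

```python
IPV4_HEADER = 20   # bytes, without IP options
ICMP_HEADER = 8    # bytes, echo request header


def ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def jumbo_ping_command(host, mtu=9000, count=3):
    """argv for a don't-fragment ping (Linux iputils flags) that fills the MTU."""
    return ["ping", "-M", "do", "-c", str(count), "-s", str(ping_payload(mtu)), host]


if __name__ == "__main__":
    # "panasas-example" is a placeholder host name, not one from the checklist.
    print(" ".join(jumbo_ping_command("panasas-example")))
```

If any switch port or NIC on the path is still at a 1500-byte MTU, a ping of this size fails instead of being silently fragmented, which is the point of the test.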

VMware updates

   W: GE upgrade
   W: yum updates of nodes
   W: yum update of zhead
   W: update FW on Dells
   W: reinstall rack11?
   W: thumper upgrades
   W: rccstor upgrades 

switch and test NFS mounts on pcluster

   W: upgrade PGI compiler

reenable queues on zcluster
resume queues on pcluster

enable logins pcluster
enable logins zcluster

email users that outage is over
contact Lab Storage users about their mounts
