Difference between revisions of "Georgia Advanced Computing Resource Center"

From Research Computing Center Wiki
Revision as of 16:05, 21 August 2012

Aug 22 outage

"W" means "don't know When this task will be done"

Soon: /etc/motd on pcluster, zcluster

========================================================================
Wednesday:
Morning: GD message to users
2 PM: CC VM snapshots

3 PM:
  PB disable logins pcluster
  PB disable logins zcluster (except for jkissing, students)

  PB kick users off pcluster
  PB kick users off zcluster?

  PB drain all nodes or queues, pcluster
  PB disable all queues except somedevq, zcluster

  PB kill all jobs pcluster
  PB kill all jobs zcluster
  JS move some rack15 nodes to rack 16?
  do GE jobs testing MPI and storage I/O throughput
    CC to use fsr15 for iozone
    ST to use fsr12 for MPI/NAMD
    PB to use fsr7 for dumb I/O

  PB stop execd on all nodes, zcluster
  shut down racks 8,9,10,11
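
The zcluster queue, job, and execd steps in the 3 PM block could be sketched as the dry-run helper below. It only builds the Grid Engine command lines (qmod, qdel, qconf) and prints them; it does not run anything. The queue names other than somedevq are hypothetical, and the helper name zcluster_drain_commands is an illustration, not an existing script.

```python
import shlex

# Queue the checklist leaves enabled on the zcluster.
KEEP_ENABLED = {"somedevq"}


def zcluster_drain_commands(queues, keep=KEEP_ENABLED):
    """Return the Grid Engine commands, in checklist order, as argv lists."""
    # Disable every queue except the ones we keep.
    cmds = [["qmod", "-d", q] for q in queues if q not in keep]
    # Kill all jobs of all users (requires GE manager privileges).
    cmds.append(["qdel", "-u", "*"])
    # Stop the execution daemon on all execution hosts.
    cmds.append(["qconf", "-ke", "all"])
    return cmds


if __name__ == "__main__":
    # Hypothetical queue list; a real one could be taken from `qconf -sql`.
    for argv in zcluster_drain_commands(["somedevq", "longq", "shortq"]):
        print(shlex.join(argv))  # dry run: print, do not execute
```

Printing first and executing only after review keeps the exception for somedevq easy to verify before anything is disabled.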

4 PM:
 CC reconfigure NICs/LAGs on storage units
 PB modify PanFS blksize on remaining nodes
 CC shut down VMs (wait for Jessie off zcluster)
 shut down 3070s (umount first)
 PB shut down Panasas (umount first)
 CC connect ESX IPMI cat5
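
A minimal sketch for the "umount first" notes above: before the Panasas is shut down, each node could be checked for remaining PanFS mounts. The function parses text in /proc/mounts format and assumes the filesystem type is reported as "panfs"; the sample device and mount point are hypothetical.

```python
def panfs_mount_points(mount_table):
    """Return mount points whose filesystem type is 'panfs'.

    mount_table is text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    points = []
    for line in mount_table.splitlines():
        fields = line.split()
        # Field 3 is the filesystem type; 'panfs' is an assumption here.
        if len(fields) >= 3 and fields[2] == "panfs":
            points.append(fields[1])
    return points


if __name__ == "__main__":
    # Hypothetical sample; on a node, read /proc/mounts instead.
    sample = (
        "panfs://10.0.0.1/home /panfs/home panfs rw 0 0\n"
        "/dev/sda1 / ext3 rw 0 0\n"
    )
    for mount_point in panfs_mount_points(sample):
        print("still mounted:", mount_point)
```

An empty result on every node would indicate that the storage can be powered off without leaving clients with stale mounts.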


by 5PM:
we tell NEG "go ahead"
CC recable storage unit cat5, and work with Brian M on switch port configs

probably between 9 PM and midnight, NEG finishes
================================================================================ 

"Midnight" (when NEG is done):
power up 3070s
power up Panasas
CC power up ESX servers and VMs
PB start final rsyncs
power up racks 8,9,10,11
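
For the "final rsyncs" step, a sketch of how one final pass might be built is shown below. The checklist does not name the source or destination trees, so the paths are hypothetical, and the choice of -aH with --delete (making the destination an exact mirror) is an assumption. The helper defaults to rsync's -n dry-run mode.

```python
import shlex


def final_rsync_argv(src, dst, dry_run=True):
    """Build an rsync command that mirrors the contents of src into dst."""
    # -a: archive mode, -H: preserve hard links,
    # --delete: remove files in dst that no longer exist in src.
    argv = ["rsync", "-aH", "--delete"]
    if dry_run:
        argv.append("-n")  # show what would change without copying
    # A trailing slash on src copies its contents, not the directory itself.
    argv += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return argv


if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    print(shlex.join(final_rsync_argv("/old/usr/local", "/new/usr/local")))
```

Running the dry-run form first shows how much the final pass still has to transfer before the mounts are switched.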

================================================================================
"Morning" (starting by 8AM):

PB Panasas OS upgrade
PB switch and test /db, /usr/local mounts on the zcluster

PB/ST, CC do GE jobs testing MPI and storage I/O throughput

PB enable Panasas jumbo frames and reboot Panasas
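
Once jumbo frames are enabled, the path can be checked with a ping that forbids fragmentation. The sketch below works out the largest ICMP payload that fits in one frame and builds the matching Linux iputils ping command. The MTU of 9000 is an assumption (the checklist does not state the value), and the host name is hypothetical.

```python
import shlex

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def jumbo_ping_argv(host, mtu=9000):
    """Build a ping that fails if the path cannot carry mtu-sized packets."""
    # -M do: set Don't Fragment, -c 3: three probes, -s: payload size.
    return ["ping", "-M", "do", "-c", "3", "-s", str(max_ping_payload(mtu)), host]


if __name__ == "__main__":
    # Hypothetical storage host name.
    print(shlex.join(jumbo_ping_argv("panasas-realm")))
```

With an MTU of 9000 the payload is 9000 - 20 - 8 = 8972 bytes; if any switch port or NIC on the path is still at 1500, the probes fail instead of being silently fragmented.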

VMware updates

   W: PB GE upgrade
   W: PB yum updates of nodes
   W: PB yum update of zhead
   W: PB update FW on Dells
   W: CC thumper upgrades
   W: CC rccstor upgrades 
   W: PB reinstall rack11, if time
   W: PB upgrade PGI compiler

PB switch and test NFS mounts on pcluster

PB reenable queues on zcluster
PB resume queues on pcluster

enable logins pcluster
enable logins zcluster

email users that outage is over
contact Lab Storage users about their mounts
