Georgia Advanced Computing Resource Center: Difference between revisions

From Research Computing Center Wiki

Revision as of 15:47, 21 August 2012

Aug 22 outage

"W" means "don't know When this task will be done"

Soon: /etc/motd on pcluster, zcluster

Sometime Tuesday: message to users

2 PM: VM snapshots

3 PM:
  disable logins pcluster
  disable logins zcluster (except for jkissing, students)

  kick users off pcluster
  kick users off zcluster?

  drain all nodes or queues, pcluster
  disable all queues except somedevq, zcluster

  kill all jobs pcluster
  kill all jobs zcluster

  do GE jobs testing MPI and storage I/O throughput
    CC to use fsr15 for iozone
    ST to use fsr12 for MPI/NAMD
    PB to use fsr7 for dumb I/O

  stop execd on all nodes, zcluster

      W: shut down racks 8,9,10,11
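
The "disable all queues except somedevq" step could be prepared as a dry run before the outage. The following is a minimal sketch, not the procedure actually used: it assumes Grid Engine's `qmod -d <queue>` is the way queues get disabled, and every queue name other than `somedevq` is a hypothetical placeholder. It only prints the commands; nothing is executed.

```python
from typing import Iterable, List


def disable_commands(queues: Iterable[str], keep: Iterable[str]) -> List[List[str]]:
    """Return one `qmod -d` command per queue that is not listed in `keep`."""
    keep_set = set(keep)
    return [["qmod", "-d", q] for q in queues if q not in keep_set]


# Hypothetical queue list; on a live system it could be taken from `qconf -sql`.
QUEUES = ["somedevq", "exampleq1", "exampleq2"]

# Dry run: print each command for review instead of running it.
for cmd in disable_commands(QUEUES, keep=["somedevq"]):
    print(" ".join(cmd))
```

Printing first lets someone review the list against the plan; the same list can be handed to `subprocess.run` once it looks right.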

4 PM:
  modify PanFS blksize on remaining nodes
  shut down VMs
  shut down 3070s
  shut down Panasas
      morning  W: enable Panasas jumbo frames
      morning  W: VMware updates
  connect ESX IPMI cat5
      W: "final rsyncs" of /db, /usr/local
      W: Curtis reconfig storage unit NICs/LAGs
      W: PanFS 16K blksize on remaining nodes and zcluster.rcc


by 5PM:
we tell NEG "go ahead"


probably between 9 PM and midnight: NEG at work.

****  AFTER ******
"Midnight" (when NEG is done):
power up 3070s
power up Panasas
power up ESX servers and VMs
start final rsyncs
power up racks 8,9,10,11

"Morning" (starting by 8AM)
panasas OS upgrade
switch and test /db, /usr/local mounts on the zcluster
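
A first pass at "test the mounts" could be a check that each path is a mount point at all. This is a minimal sketch using only the Python standard library. It does not verify that the mount comes from the new storage or that the contents are correct, so it complements a manual check rather than replacing it.

```python
import os
from typing import Dict, Iterable


def check_mounts(paths: Iterable[str]) -> Dict[str, bool]:
    """Map each path to True if it is currently a mount point, else False."""
    return {p: os.path.ismount(p) for p in paths}


if __name__ == "__main__":
    # The two paths named in the plan above.
    for path, mounted in check_mounts(["/db", "/usr/local"]).items():
        print(f"{path}: {'mounted' if mounted else 'NOT a mount point'}")
```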

do GE jobs testing MPI and storage I/O throughput
   W: GE upgrade
   W: yum updates of nodes
   W: yum update of zhead
   W: update FW on Dells
   W: move some rack15 nodes to rack 16?
   W: reinstall rack11?
   W: thumper upgrades
   W: rccstor upgrades 
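
For the storage I/O throughput testing listed in this plan, a crude sequential-write timing can serve as a sanity check next to iozone. The sketch below is an assumption about what such a "dumb I/O" test might look like, not the test actually used. The function name and default sizes are hypothetical. It writes zero-filled blocks, which some storage systems may compress or deduplicate, so the number is only a rough indication.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory: str, total_mb: int = 64, block_kb: int = 1024) -> float:
    """Write total_mb of zeros to a temp file in `directory`; return MB/s."""
    block = b"\0" * (block_kb * 1024)
    n_blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time to push data to storage
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the test file
    written_mb = n_blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    # Point this at a directory on the file system under test.
    rate = write_throughput_mb_s(tempfile.gettempdir(), total_mb=8)
    print(f"sequential write: {rate:.1f} MB/s")
```

Running it once before the outage and once after gives a comparable pair of numbers, provided the directory, size, and block size stay the same.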

switch and test NFS mounts on pcluster

   W: upgrade PGI compiler

reenable queues on zcluster
resume queues on pcluster

enable logins pcluster
enable logins zcluster
email users that outage is over
contact Lab Storage users about their mounts

Clusters

Overview

rCluster

zCluster

ToDo List

sCluster

scluster todo List

VMWare

Virtual Machines

Storage

Overview

NAS

SAN

Networking

Overview

VLANs

IP Networks

Physical Hosts