Georgia Advanced Computing Resource Center

Revision as of 16:01, 21 August 2012

Aug 22 outage

"W" means "don't know When this task will be done"

Soon: /etc/motd on pcluster, zcluster
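A notice along these lines could go in the motd files (wording here is just a sketch; the facts are from the schedule below):

   NOTICE: pcluster and zcluster go down Wed Aug 22 at 3 PM for the
   power/network outage. Logins will be disabled and remaining jobs
   killed. We will email when the clusters are back up.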

========================================================================
Wednesday:

Morning: GD message to users

2 PM: CC VM snapshots

3 PM:
  PB disable logins pcluster: restrict_sshd
  PB disable logins zcluster (except for jkissing, students): restrict_sshd on zcluster 
  PB kick users off pcluster
  PB kick users off zcluster but not jkissing or students
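  On zcluster the kick-off could be scripted like this (a sketch; restrict_sshd is the local lockout script, and the student* account pattern is an assumption):
    for u in $(who | awk '{print $1}' | sort -u); do
      case "$u" in
        root|jkissing|student*) continue ;;   # leave exempt logins alone (student* pattern assumed)
      esac
      pkill -KILL -u "$u"                     # kill all of that user's processes
    done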

  PB drain all nodes or queues, pcluster: llctl -g drain schedd allclasses
  PB disable all queues except somedevq, zcluster
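  Disabling everything but somedevq could look like this (qconf -sql lists the cluster queues; qmod -d disables one):
    for q in $(qconf -sql); do
      [ "$q" = "somedevq" ] && continue
      qmod -d "$q"                            # no new jobs start in this queue
    done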

  PB kill all jobs pcluster: llcancel $(llq | grep -w R | awk '{print $1}') 2>&1 | tee llcancel.log
  PB kill all jobs zcluster: qdel $( qstat -s r | egrep '^ *[0-9]' | awk '{print $1}') 2>&1 | tee job.qdel.log

  JS move some rack 15 nodes to rack 16?

  do GE jobs testing MPI and storage I/O throughput
    CC to use fsr15 for iozone
    ST to use fsr12 for MPI/NAMD
    PB to use fsr7 for dumb I/O
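    For the iozone piece, a run along these lines would do (the path is assumed):
      iozone -a -g 4G -f /fsr15/iozone.tmp 2>&1 | tee iozone.fsr15.log   # auto mode, files up to 4 GB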

  PB stop execd on all nodes, zcluster: qconf -ke all 2>&1 | tee qconf.ke.log
  shut down racks 8,9,10,11
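  Assuming those racks' nodes still answer pdsh, the shutdown could be one fan-out (the hostlist below is made up):
    pdsh -w 'r[8-11]n[01-48]' shutdown -h now   # placeholder node names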

4 PM:
 CC reconfigs NICs/LAGs on storage units
 PB modify PanFS blksize on remaining nodes
 CC shut down VMs (wait for Jessie off zcluster)
 shut down 3070s (umount first)
 PB shut down Panasas (umount first)
 CC connect ESX IPMI cat5
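 Before the filer/Panasas shutdowns above, drop the client mounts from whatever is still up, e.g. (mount types assumed):
   pdsh -a 'umount -a -t nfs,panfs' 2>&1 | tee umount.log   # unmount all NFS and PanFS mounts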


by 5PM:
we tell NEG "go ahead"
CC recable storage unit cat5, and work with Brian M on switch port configs

probably between 9 PM and midnight, NEG finishes
================================================================================ 

"Midnight" (when NEG is done):
power up 3070s
power up Panasas
CC power up ESX servers and VMs
PB start final rsyncs
power up racks 8,9,10,11
give yea or nay to NEG
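The final rsyncs are the catch-up passes onto the new storage; something like the following, with the paths below as placeholders:
  rsync -aHx --delete /old/export/ /new/export/ 2>&1 | tee rsync.final.log   # placeholder paths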

"Morning" (starting by 8AM)=======================================================

PB Panasas OS upgrade
PB switch and test /db, /usr/local mounts on the zcluster
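A quick test of the switched mounts (assumes fstab already points at the new server):
  umount /db /usr/local
  mount /db && mount /usr/local
  df -h /db /usr/local                        # confirm the new sources are mounted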

PB/ST, CC do GE jobs testing MPI and storage I/O throughput

PB enable Panasas jumbo frames and reboot Panasas
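If the clients are to match, their interfaces need the larger MTU too; a sketch (interface name and target address are placeholders):
   ifconfig eth2 mtu 9000
   ping -M do -s 8972 <panasas-address>       # 9000-byte frames end to end, no fragmentation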

VMWare updates

   W: PB GE upgrade
   W: PB yum updates of nodes
   W: PB yum update of zhead
   W: PB update FW on Dells
   W: CC thumper upgrades
   W: CC rccstor upgrades 
   W: PB reinstall rack11, if time
   W: PB upgrade PGI compiler
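   For the node yum updates, one fan-out pass would cover it (assumes pdsh is configured for all nodes):
     pdsh -a 'yum -y update' 2>&1 | tee yum.nodes.log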

PB switch and test NFS mounts on pcluster

PB reenable queues on zcluster
PB resume queues on pcluster
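Likely re-enable commands, mirroring Wednesday's drain/disable steps (sketches, not verified):
  qmod -e '*'                                 # zcluster: enable all GE queues
  llctl -g resume schedd allclasses           # pcluster: resume LoadLeveler scheduling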

enable logins pcluster
enable logins zcluster

email users that outage is over
contact Lab Storage users about their mounts
