Difference between revisions of "Georgia Advanced Computing Resource Center"

From Research Computing Center Wiki

Revision as of 11:49, 21 August 2012

Aug 22 outage

Soon: /etc/motd on pcluster, zcluster

Sometime Tuesday: message to users

2 PM: VM snapshots

3 PM:
  disable logins pcluster
  disable logins zcluster (except for jkissing, students)

  kick users off pcluster
  kick users off zcluster?

  drain all nodes or queues, pcluster
  disable all queues except somedevq, zcluster

  kill all jobs pcluster
  kill all jobs zcluster

  do GE jobs testing MPI and storage I/O throughput

  stop execd on all nodes, zcluster

      W: shut down racks 8,9,10,11
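
The zcluster queue step above (disable everything except somedevq, then re-enable later) could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used: it assumes Grid Engine's `qmod -d` / `qmod -e` for disabling and enabling queues, and the queue names other than somedevq are hypothetical placeholders (a real list would come from `qconf -sql`). It only prints the commands and runs nothing.

```python
from typing import Iterable, List


def qmod_commands(queues: Iterable[str],
                  keep: Iterable[str] = (),
                  enable: bool = False) -> List[List[str]]:
    """Build one qmod command per queue, skipping any queue named in `keep`.

    enable=False yields `qmod -d <queue>` (disable);
    enable=True yields `qmod -e <queue>` (enable).
    """
    flag = "-e" if enable else "-d"
    kept = set(keep)
    return [["qmod", flag, q] for q in queues if q not in kept]


if __name__ == "__main__":
    # Hypothetical queue names; only somedevq appears in the checklist.
    example_queues = ["allq", "somedevq", "longq"]
    for cmd in qmod_commands(example_queues, keep=["somedevq"]):
        print(" ".join(cmd))  # dry run: print only, nothing is executed
```

Calling the same helper with `enable=True` and no `keep` list gives the commands for the later "reenable queues on zcluster" step.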

4 PM:
  shut down VMs (but keep one host up to run the storage shutdowns from)

      W: shut down 3070s
      W: shut down Panasas

      W: connect ESX IPMI cat5
      W: "final rsyncs" of /db, /usr/local


**** AFTER ****

CC configs NICs on storage for VLAN 20, maybe

power up 3070s
power up Panasas
power up ESX servers and VMs


power up racks 8,9,10,11

switch and test /db, /usr/local mounts on the zcluster
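
A quick way to test the mounts after the switch is to confirm that each path is a mount point at all. The sketch below is only an illustration under that assumption: it uses Python's `os.path.ismount` and the two paths named above. It does not verify which server is behind the mount, so that part still needs a manual check (for example with `mount` or `df`).

```python
import os
from typing import Iterable, List


def missing_mounts(paths: Iterable[str]) -> List[str]:
    """Return the paths that are NOT currently mount points."""
    return [p for p in paths if not os.path.ismount(p)]


if __name__ == "__main__":
    # Paths taken from the checklist item above.
    for path in missing_mounts(["/db", "/usr/local"]):
        print(f"not mounted: {path}")
```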

do GE jobs testing MPI and storage I/O throughput
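
For the storage half of that test, a job could time a sequential write to the filesystem in question. The following is a rough sketch, not the test the checklist refers to: the function name, file size, and block size are assumptions chosen for illustration, and a single-client sequential write says little about parallel throughput.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory: str, size_mb: int = 64, block_mb: int = 1) -> float:
    """Write about size_mb MiB to a temp file in `directory`; return MiB/s."""
    block = b"\0" * (block_mb * 1024 * 1024)
    blocks = max(1, size_mb // block_mb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # include the time to reach stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    return (blocks * block_mb) / elapsed


if __name__ == "__main__":
    # The target directory is an assumption; point it at the storage under test.
    print(f"{write_throughput_mb_s(tempfile.gettempdir()):.1f} MiB/s")
```

Running one such job per node at the same time, submitted through GE, would give a coarse view of aggregate throughput before and after the outage.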

   W: GE upgrade
   W: yum updates of nodes
   W: yum update of zhead
   W: update FW on Dells

   W: reinstall rack11?

switch and test NFS mounts on pcluster

   W: upgrade PGI compiler

re-enable queues on zcluster
resume queues on pcluster

enable logins pcluster
enable logins zcluster

Clusters

Overview

rCluster

zCluster

ToDo List

sCluster

scluster todo List

VMWare

Virtual Machines

Storage

Overview

NAS

SAN

Networking

Overview

VLANs

IP Networks

Physical Hosts