Georgia Advanced Computing Resource Center: Difference between revisions

@@ Line 1: / Line 1: @@
 ==Projects==
 ===[[Aug 22 outage]]===
- "W" means "don't know When this task will be done"
- Soon: /etc/motd on pcluster, zcluster
- ========================================================================
- Wednesday:
- Morning: GD message to users
-PM: CC VM snapshots
-PM:
-   PB disable logins pcluster: restrict_sshd
-   PB disable logins zcluster (except for jkissing, students): restrict_sshd on zcluster
-   PB kick users off pcluster
-   PB kick users off zcluster but not jkissing or students
-   PB drain all nodes or queues, pcluster: llctl -g drain
-   PB disable all queues except somedevq, zcluster
-   PB kill all jobs pcluster: llcancel $(llq | grep -w R | awk '{print $1}') 2>&1 | tee llcancel.log
-   PB kill all jobs zcluster: qdel $( qstat -s r | egrep '^1' | awk '{print $1}') 2>&1 | tee job.qdel.log
-   JS move some rack15 nodes to rack 16?
-   do GE jobs testing MPI and storage I/O throughput
-     CC to use fsr15 for iozone
-     ST to use fsr12 for MPI/NAMD
-     PB to use fsr7 for dumb I/O
-   PB stop execd on all nodes, zcluster: qconf -ke all 2>&1 | tee qconf.ke.log
-   shut down racks 8,9,10,11
-PM:
-  CC reconfigs NICs/LAGs on storage units
-  PB modify PanFS blksize on remaining nodes
-  CC shut down VMs (wait for Jessie off zcluster)
-  shut down 3070s (umount first)
-  PB shut down Panasas (umount first)
-  CC connect ESX IPMI cat5
- by 5PM:
- we tell NEG "go ahead"
- CC recable storage unit cat5, and work with Brian M on switch port configs
- probably between 9PM and 12AM (midnight), NEG finishes
- ================================================================================
- "Midnight" (when NEG is done):
- power up 3070s
- power up Panasas
- CC power up ESX servers and VMs
- PB start final rsyncs
- power up racks 8,9,10,11
- give yea or nay to NEG
- "Morning" (starting by 8AM)=======================================================
- PB panasas OS upgrade
- PB switch and test /db, /usr/local mounts on the zcluster
- PB/ST, CC do GE jobs testing MPI and storage I/O throughput
- PB enable Panasas jumbo frames and reboot Panasas
- VMWare updates
-    W: PB GE upgrade
-    W: PB yum updates of nodes
-    W: PB yum update of zhead
-    W: PB update FW on Dells
-    W: CC thumper upgrades
-    W: CC rccstor upgrades
-    W: PB reinstall rack11, if time
-    W: PB upgrade PGI compiler
- PB switch and test NFS mounts on pcluster
- PB reenable queues on zcluster
- PB resume queues on pcluster
- enable logins pcluster
- enable logins zcluster
- email users that outage is over
- contact Lab Storage users about their mounts
 ==Clusters==
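
The zcluster kill-all-jobs step in the removed checklist selects job IDs with `egrep '^1'`, which only matches IDs that begin with the digit 1 at the start of the line. A minimal sketch of a more general filter follows. It assumes the job ID is the first whitespace-separated column of `qstat -s r` output; `running_job_ids` is a hypothetical helper name, not a Grid Engine command.

```shell
#!/bin/sh
# Sketch only: running_job_ids is a hypothetical helper, not part of Grid Engine.
# It reads qstat-style output on stdin and prints the first column of every
# row whose first field is purely numeric, which skips header and separator
# rows and tolerates leading whitespace before the job ID.
running_job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Assumed usage on a submit host: review the list before handing it to qdel.
#   qstat -s r | running_job_ids
```

Printing the list first and passing it to `qdel` only after review keeps the step auditable, in the same spirit as the `tee` logs in the checklist.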

Revision as of 13:58, 12 September 2012

Contents

Projects

Aug 22 outage

Clusters

Overview

rCluster

zCluster

ToDo List

sCluster

scluster todo List

VMWare

Virtual Machines

Storage

Overview

NAS

SAN

Networking

Overview

VLANs

IP Networks

Physical Hosts
