Xcluster Go-Live


Go-live requirements


Name the cluster!!

  • Win a free hot-apple pie


(Scyld) Development Environment

Software

Intel Compiler - Guy is checking on getting it.

Apps to install on xcluster:

  • MPI, multi-core, big-memory, serial?
  • Popular apps (regardless of type; how do we determine popularity?):
    • time stamps on app directories?
    • access time of executables? (see the sketch below)
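
One rough way to gauge popularity, assuming the app tree lives under /usr/local/apps and the filesystem is not mounted noatime: list executables by last access time. A sketch only; the path and directory depth are assumptions:

    # List executables under /usr/local/apps (assumed layout) by most
    # recent access time; meaningless if the filesystem is mounted noatime.
    find /usr/local/apps -maxdepth 3 -type f -perm /111 \
        -printf '%A@ %AY-%Am-%Ad %p\n' 2>/dev/null |
        sort -rn | cut -d' ' -f2- | head -n 50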

New queueing system:

  • how/what to configure (e.g. fairshare, queues, "complexes")

Use modules or not?

Scyld

  • Location of /usr/local/apps, libraries, and languages, e.g., Perl
  • A bunch more stuff needed here


Cloud Manager

Get Shan-Ho to show us POD


Use the account creation process as a means to identify inactive users:

  • Mail PIs with a list of current users

Lab Group Registration:

  • Lab VM login nodes

User Accounts:

  • New requests from PIs
  • CAS/LDAP authentication?
  • Affiliate accounts?

Storage

  • HPC IceBreaker (3 x 48TB chains)
    • one chain for /home, /db, and /usr/local?
    • two chains for scratch?
  • Archive IceBreaker (2 x 320TB chains)
    • /oflow, plus backups of /home and /usr/local
  • Lustre ClusterStor
    • scratch for zcluster via 10GbE
    • scratch for xcluster via 10GbE, maybe IB later

Other

  • Interactive cluster (using rack of zcluster nodes)?
  • New copy nodes (e.g., use some Hadoop nodes)
  • Start draining zcluster GPU nodes for move to xcluster

PB notes

  • note: don't want perlwrapper this time
  • note: /usr/share/Modules/init/.modulespath is what links env-modules to the Scyld /opt/scyld/modulefiles tree (see the .modulespath sketch after this list)
  • research: will the scratch IceBreakers run over Ethernet also, or only IB, for storage?
  • research: try out POD
  • research: what is difference between building on Intel vs AMD?
  • decision: which compiler to use by default? We're leaning toward gcc.
  • research/decision: will we do "qlogin"? an interactive queue? what kind of enforceable resource limits?
  • decision: paths for /usr/local apps?
  • RESOLVED: scyldev can see zhead license server
  • DONE: install PGI
  • decision: RPMs vs not for /usr/local
  • todo: install Lmod (ready to build)
    • want it to heed /opt/scyld/modulefiles as well as the other location (see the Lmod sketch after this list)
  • todo: install CUDA
  • research: can we put Intel compilers on scyldev?
  • decision: enable rsh on compute nodes?
  • research: figure out the node naming scheme. See if we can get the syntax for our customary hostnames into the beowulf config file.
  • decision: is 1024 a sufficient max per-user process limit?
  • decision: is 1024 a sufficient max per-user open-files limit? (see the limits sketch after this list)
  • research: need new IP allocation scheme
  • research: do any zcluster nodes have distinct kernel command line args?
  • todo: if we want users to ssh to nodes for jobs, we need e.g. /etc/profile.d/ssh-key.sh (see the ssh-key.sh sketch after this list)
  • todo: grep for "nodenumber" in the PDF files to get a list of the per-node config files (see the grep sketch after this list)
  • todo: build the Lustre client. Penguin case 62419 covers this (we don't have the Scyld kernel source).
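
For the .modulespath note above: env-modules seeds MODULEPATH from that file, one directory per line. A minimal sketch of making sure both the Scyld tree and a local tree are listed; /usr/local/modulefiles is only a placeholder for whatever local location we pick:

    # Append a local modulefiles tree (placeholder path) to the env-modules
    # search-path file if it is not already listed; Scyld's own entry,
    # /opt/scyld/modulefiles, should already be present.
    grep -qx '/usr/local/modulefiles' /usr/share/Modules/init/.modulespath ||
        echo '/usr/local/modulefiles' >> /usr/share/Modules/init/.modulespath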
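For the Lmod item: Lmod takes its search path from $MODULEPATH, so a profile.d snippet is one way to make it heed both trees. A sketch only; the file name and the /usr/local/modulefiles path are placeholders:

    # /etc/profile.d/modulepath.sh (hypothetical name)
    # Add the Scyld modulefiles tree and a local tree (placeholder) to
    # MODULEPATH if they are not already there.
    for d in /opt/scyld/modulefiles /usr/local/modulefiles; do
        case ":$MODULEPATH:" in
            *":$d:"*) ;;                                   # already present
            *) MODULEPATH="${MODULEPATH:+$MODULEPATH:}$d" ;;
        esac
    done
    export MODULEPATH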
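For the two 1024-limit decisions: whatever values we settle on would typically be enforced through a pam_limits drop-in. A sketch with the numbers under discussion; the file name is a placeholder, not a recommendation:

    # Write the per-user caps under discussion (1024 processes, 1024 open
    # files) to a pam_limits drop-in; file name is hypothetical.
    conf=/etc/security/limits.d/90-xcluster.conf
    {
        echo '*  soft  nproc   1024'
        echo '*  hard  nproc   1024'
        echo '*  soft  nofile  1024'
        echo '*  hard  nofile  1024'
    } > "$conf"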
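For the ssh-key.sh item: a minimal sketch of what /etc/profile.d/ssh-key.sh could do, assuming home directories are shared across the nodes, so users can ssh to compute nodes without a password:

    # /etc/profile.d/ssh-key.sh -- sketch only; assumes a shared /home.
    # Create a passphrase-less key on first login and authorize it for
    # intra-cluster ssh.
    if [ ! -f "$HOME/.ssh/id_rsa" ]; then
        mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
        ssh-keygen -q -t rsa -N '' -f "$HOME/.ssh/id_rsa"
        cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
        chmod 600 "$HOME/.ssh/authorized_keys"
    fi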
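For the "nodenumber" item: something along these lines (pdftotext is from poppler-utils; the docs directory is a placeholder) would pull the per-node config file references out of the vendor PDFs:

    # Search the vendor PDFs for "nodenumber"; the docs directory is a
    # placeholder.
    for f in /path/to/penguin-docs/*.pdf; do
        pdftotext "$f" - | grep -in -H --label="$f" 'nodenumber'
    done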

Documentation

  • cloud, login nodes, etc.
  • queuing system

Zcluster Nodes (available for re-purpose)

Note: Items in bold are likely candidates.

  • rack 6:
    • (27) Dell 1950, 8-core, 16GB
  • rack 7:
    • (16) Dell R610, 8-core, 16GB
    • (1) Dell R810 Big Memory, 32-core, 512GB
    • (2) SuperMicro, 12-core, 256GB
  • racks 8-11:
    • (123) Dell 1950, 8-core, 16GB
    • (5) SuperMicro, 12-core, 256GB
  • rack 12:
    • (10) Arch, 12-core, 48GB
    • (2) Dell R610 Tesla 1070, 8-core, 48GB
    • (2) Dell R610, 8-core, 48GB (old 192GB boxes)
    • (3) Dell R610, 8-core, 192GB
    • (1) SuperMicro Tesla 2075, 4-core, 24GB (Taha?)
    • (1) Dell R900, 16-core, 128GB
  • rack 13:
    • (27) Dell R410, 8-core, 16GB
  • rack 14:
    • (3) Dell PE C6145, 32-core, 64GB
    • (1) Dell R810 Big Memory, 32-core, 512GB
    • (2) Dell R815 Interactive nodes, 48-core, 128GB
    • (3) SuperMicro, 12-core, 256GB
  • rack 15:
    • (26) Arch, 12-core, 48GB
  • rack 16:
    • (10) Arch, 12-core, 48GB
  • rack 17:
    • (9) Arch, 24-core, 128GB (Hadoop)
    • (3) Arch, 24-core, 128GB (multi-core)
  • rack 18:
    • (5) Penguin Kepler GPU, 12-core, 96GB

Raj's Requests

I have a few new feature requests for the new cluster:

1) We definitely need a module system (as implemented by TACC) for software management and environment configuration. This will greatly help clean up our environment variable space and make it easy for users to set up their environment to suit the needs of their software.

2) Job queues with different run-time limits (which will affect job priorities), etc.

Thanks, -Raj