Xcluster Go-Live

Go-live requirements

  • KB:June12_xcluster_meeting

Name the cluster!!

  • Win a free hot apple pie


(Scyld) Development Environment

  • KB:Installing scyldev (OS and Scyld)
  • KB:Scyldev compute nodes
  • Paul's VM plus physical nodes
  • Intel vs. AMD software builds?

Software

Intel Compiler: Guy is checking on getting it.

Apps to install on xcluster:

  • MPI, multi-core, big-memory, serial?
  • Popular apps (regardless of type; how to determine?):
    • time stamps on app dirs?
    • access time of executables? (see the sketch after this list)
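
A minimal sketch of the access-time idea above, assuming per-app directories under /usr/local/apps (mentioned under Scyld below, though the layout is still a decision point) and that atime is not disabled on that filesystem:

  #!/usr/bin/env python
  # Rough popularity check: report the newest access time of any executable
  # under each app directory.  Assumes per-app directories under
  # /usr/local/apps and that the filesystem is not mounted noatime.
  import os
  import stat
  import time

  APPS_ROOT = "/usr/local/apps"   # hypothetical location, still undecided

  def latest_exec_atime(app_dir):
      """Newest atime (seconds since epoch) of any executable file."""
      newest = 0
      for dirpath, _dirs, files in os.walk(app_dir):
          for name in files:
              path = os.path.join(dirpath, name)
              try:
                  st = os.stat(path)
              except OSError:
                  continue
              if st.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
                  newest = max(newest, st.st_atime)
      return newest

  apps = []
  for entry in sorted(os.listdir(APPS_ROOT)):
      app_dir = os.path.join(APPS_ROOT, entry)
      if os.path.isdir(app_dir):
          apps.append((latest_exec_atime(app_dir), entry))
  for atime, name in sorted(apps, reverse=True):
      stamp = time.strftime("%Y-%m-%d", time.localtime(atime)) if atime else "never"
      print("%-30s last executable access: %s" % (name, stamp))

Directory mtimes (the "time stamps on app dirs" idea) could be swapped in for atime if atime turns out to be unreliable.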

New queueing system:

  • how/what to configure (e.g. fairshare, queues, "complexes")

Use environment modules or not?

Scyld

  • Location of /usr/local/apps, libraries, and languages e.g., Perl
  • A bunch more stuff needed here


Cloud Manager

Get Shan-Ho to show us POD


Use the account creation process as a way to identify inactive users (see the sketch below):

  • Mail PIs a list of current users
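
Until accounting data is wired in, a minimal heuristic sketch for building that list: flag users whose home directory has not been modified recently. The UID cutoff and 180-day threshold are placeholders; lastlog or job-accounting data would be better evidence:

  #!/usr/bin/env python
  # Heuristic: list accounts whose home directory looks untouched, as a
  # starting point for the "mail PIs" pass.  Thresholds are placeholders.
  import os
  import pwd
  import time

  MIN_UID = 1000      # assumed lower bound for regular user accounts
  STALE_DAYS = 180    # placeholder inactivity threshold
  now = time.time()

  for user in pwd.getpwall():
      if user.pw_uid < MIN_UID:
          continue
      try:
          mtime = os.stat(user.pw_dir).st_mtime
      except OSError:
          print("%-12s  (no home directory)" % user.pw_name)
          continue
      idle_days = int((now - mtime) / 86400)
      if idle_days > STALE_DAYS:
          print("%-12s  home directory untouched for %d days" % (user.pw_name, idle_days))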

Lab Group Registration:

  • Lab VM login nodes

User Accounts:

  • New requests from PIs
  • CAS/LDAP authentication?
  • affiliate accounts?

Storage

  • HPC IceBreaker (3 x 48TB chains)
    • one chain for /home, /db, and /usr/local?
    • two chains for scratch?
  • Archive IceBreaker (2 x 320TB chains)
    • /oflow and /home, /usr/local backups
  • Lustre ClusterStor
    • scratch for zcluster via 10GbE
    • scratch for xcluster via 10GbE, maybe IB later

Other

  • Interactive cluster (using rack of zcluster nodes)?
  • New copy nodes (e.g., use some hadoop nodes)
  • Start draining zcluster GPU nodes for move to xcluster

PB notes

  • note: don't want perlwrapper this time
  • note: /usr/share/Modules/init/.modulespath is what links env-modules to the Scyld /opt/scyld/modulefiles tree (see the MODULEPATH sketch after this list)
  • research: will scratch icebreakers run over Ethernet also, or only IB, for storage?
  • research: try out POD
  • research: what is the difference between building on Intel vs. AMD?
  • decision: which compiler to use by default? Current thinking is gcc.
  • research/decision: will we do "qlogin"? an interactive queue? what kind of enforceable resource limits?
    • http://www.nics.tennessee.edu/~troy/pbstools/ for a qlogin
    • dunno about resource limits yet
  • decision: paths for /usr/local apps?
  • RESOLVED: scyldev can see zhead license server
  • DONE: install PGI
  • decision: package /usr/local apps as RPMs or not?
  • todo: install lmod (ready to build)
    • want it to heed /opt/scyld/modulefiles as well as our other module locations
  • todo: install CUDA
  • research: can we put Intel compilers on scyldev?
  • decision: enable rsh on compute nodes?
  • research: figure out the node naming scheme; see whether the beowulf config file syntax can accommodate our customary hostnames
  • decision: is 1024 a sufficient max per-user process limit?
  • decision: is 1024 a sufficient max per-user open-files limit? (see the limit-check sketch after this list)
  • research: need new IP allocation scheme
  • research: do any zcluster nodes have distinct kernel command line args?
  • todo: if we want users to ssh to nodes for jobs, we need something like /etc/profile.d/ssh-key.sh (see the sketch after this list)
  • todo: grep for "nodenumber" in the PDF files to get a list of per-node config files
  • todo: build the Lustre client. Penguin case 62419 covers this (we don't have the Scyld kernel source).
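
MODULEPATH sketch (for the .modulespath note and the lmod todo above): print where the module command will look for modulefiles and check that the Scyld tree is included. This assumes the usual env-modules .modulespath format of one directory per line with '#' comments:

  #!/usr/bin/env python
  # Show the module search paths and check that /opt/scyld/modulefiles is
  # among them.  Reads .modulespath one path per line ('#' starts a
  # comment) plus anything already exported in $MODULEPATH.
  import os

  MODULESPATH_FILE = "/usr/share/Modules/init/.modulespath"
  SCYLD_TREE = "/opt/scyld/modulefiles"

  paths = []
  if os.path.isfile(MODULESPATH_FILE):
      with open(MODULESPATH_FILE) as fh:
          for line in fh:
              line = line.split("#", 1)[0].strip()
              if line:
                  paths.append(line)
  paths.extend(p for p in os.environ.get("MODULEPATH", "").split(":") if p)

  print("module search paths:")
  for p in paths:
      print("  " + p)
  if SCYLD_TREE in paths:
      print("OK: Scyld modulefiles are on the path")
  else:
      print("WARNING: %s is not on the path" % SCYLD_TREE)

An lmod install would need the same directories on MODULEPATH from its init scripts.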
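
Limit-check sketch (for the two 1024 decisions above): report the per-user process and open-file limits actually in effect; run it as an ordinary user on a node. The limits themselves would normally come from /etc/security/limits.conf or limits.d on the compute nodes:

  #!/usr/bin/env python
  # Report the per-user limits the 1024-or-not decisions are about.
  # Run as an ordinary user on a compute node to see what is in effect.
  import resource

  def show(label, which):
      soft, hard = resource.getrlimit(which)
      fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
      print("%-20s soft=%-10s hard=%s" % (label, fmt(soft), fmt(hard)))

  show("max user processes", resource.RLIMIT_NPROC)
  show("open files", resource.RLIMIT_NOFILE)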
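
For the /etc/profile.d/ssh-key.sh todo, a sketch of what such a login hook typically does, written in Python here only for illustration (the real hook would be a small shell script): create a passwordless key on first login and authorize it so users can ssh to the nodes running their jobs. The key type and paths are assumptions:

  #!/usr/bin/env python
  # First-login hook sketch: make sure the user has a passwordless key that
  # is authorized for intra-cluster ssh.  Key type and paths are assumptions.
  import os
  import stat
  import subprocess

  ssh_dir = os.path.expanduser("~/.ssh")
  key = os.path.join(ssh_dir, "id_rsa")
  auth = os.path.join(ssh_dir, "authorized_keys")

  if not os.path.isdir(ssh_dir):
      os.makedirs(ssh_dir)
      os.chmod(ssh_dir, stat.S_IRWXU)                    # 0700
  if not os.path.exists(key):
      subprocess.check_call(["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", key])

  pub = open(key + ".pub").read()
  if not os.path.exists(auth) or pub not in open(auth).read():
      with open(auth, "a") as fh:
          fh.write(pub)
      os.chmod(auth, stat.S_IRUSR | stat.S_IWUSR)        # 0600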

Documentation

  • cloud, login nodes, etc.
  • queuing system

Zcluster Nodes (available for re-purposing)

Note: likely candidates for re-purposing are marked "(likely candidate)" below; see the tally sketch after the list.

  • rack 6:
    • (27) Dell 1950, 8-core, 16GB
  • rack 7:
    • (16) Dell R610, 8-core, 16GB
    • (1) Dell R810 Big Memory, 32-core, 512GB (likely candidate)
    • (2) SuperMicro, 12-core, 256GB
  • rack 8-11:
    • (123) Dell 1950, 8-core, 16GB
    • (5) SuperMicro, 12-core, 256GB
  • rack 12:
    • (10) Arch, 12-core, 48GB
    • (2) Dell R610 Tesla 1070, 8-core, 48GB
    • (2) Dell R610, 8-core, 48GB (old 192GB boxes)
    • (3) Dell R610, 8-core, 192GB
    • (1) SuperMicro Tesla 2075, 4-core, 24GB (Taha?)
    • (1) Dell R900, 16-core, 128GB
  • rack 13:
    • (27) Dell R410, 8-core, 16GB
  • rack 14:
    • (3) Dell PE C6145, 32-core, 64GB
    • (1) Dell R810 Big Memory, 32-core, 512GB (likely candidate)
    • (2) Dell R815 Interactive nodes, 48-core, 128GB (likely candidate)
    • (3) SuperMicro, 12-core, 256GB
  • rack 15:
    • (26) Arch, 12-core, 48GB
  • rack 16:
    • (10) Arch, 12-core, 48GB
  • rack 17:
    • (9) Arch, 24-core, 128GB (hadoop) (likely candidate)
    • (3) Arch, 24-core, 128GB (multi-core) (likely candidate)
  • rack 18:
    • (5) Penguin Kepler GPU, 12-core, 96GB (likely candidate)
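
A quick tally of the nodes marked "(likely candidate)" above; the counts, cores, and memory come straight from the rack list, so adjust the table if the inventory changes:

  #!/usr/bin/env python
  # Sum up the likely re-purpose candidates from the rack list above.
  candidates = [
      # (count, cores_per_node, GB_per_node, description)
      (2, 32, 512, "Dell R810 Big Memory (racks 7 and 14)"),
      (2, 48, 128, "Dell R815 Interactive nodes (rack 14)"),
      (9, 24, 128, "Arch hadoop nodes (rack 17)"),
      (3, 24, 128, "Arch multi-core nodes (rack 17)"),
      (5, 12,  96, "Penguin Kepler GPU nodes (rack 18)"),
  ]

  nodes = sum(n for n, c, m, d in candidates)
  cores = sum(n * c for n, c, m, d in candidates)
  mem = sum(n * m for n, c, m, d in candidates)
  print("likely candidates: %d nodes, %d cores, %d GB RAM" % (nodes, cores, mem))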

Raj's Requests

I have a few feature requests for the new cluster:

1) We definitely need a module system (as implemented by TACC) for software management and environment configuration. This will greatly help clean up our environment variable space and make it easy for users to set up their environment to suit the needs of the software.

2) Job queues with different run-time limits (which will affect job priorities), etc.

Thanks, -Raj