Xcluster Go-Live

Go-live requirements

  • KB:June12_xcluster_meeting

Name the cluster!!

  • Win a free hot apple pie


(Scyld) Development Environment

  • KB:Installing scyldev (OS and Scyld)
  • KB:Scyldev compute nodes
  • Paul's VM plus physical nodes
  • Intel vs. AMD software builds?

Software

Intel Compiler: Guy is checking on getting it.

Apps to install on xcluster:

  • MPI, multi-core, big-memory, serial?
  • Popular apps (regardless of type; how to determine?):
    • time stamps on app dirs?
    • access time of executables? (see the sketch after this list)
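
A minimal sketch of the access-time idea above, assuming per-app directories under /usr/local/apps (mentioned under Scyld below, though the layout is still a decision point) and that atime is not disabled on that filesystem:

  #!/usr/bin/env python
  # Rough popularity check: report the newest access time of any executable
  # under each app directory.  Assumes per-app directories under
  # /usr/local/apps and that the filesystem is not mounted noatime.
  import os
  import stat
  import time

  APPS_ROOT = "/usr/local/apps"   # hypothetical location, still undecided

  def latest_exec_atime(app_dir):
      """Newest atime (seconds since epoch) of any executable file."""
      newest = 0
      for dirpath, _dirs, files in os.walk(app_dir):
          for name in files:
              path = os.path.join(dirpath, name)
              try:
                  st = os.stat(path)
              except OSError:
                  continue
              if st.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
                  newest = max(newest, st.st_atime)
      return newest

  apps = []
  for entry in sorted(os.listdir(APPS_ROOT)):
      app_dir = os.path.join(APPS_ROOT, entry)
      if os.path.isdir(app_dir):
          apps.append((latest_exec_atime(app_dir), entry))
  for atime, name in sorted(apps, reverse=True):
      stamp = time.strftime("%Y-%m-%d", time.localtime(atime)) if atime else "never"
      print("%-30s last executable access: %s" % (name, stamp))

Directory mtimes (the "time stamps on app dirs" idea) could be swapped in for atime if atime turns out to be unreliable.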

New queueing system:

  • how/what to configure (e.g. fairshare, queues, "complexes")

Use environment modules or not?

Scyld

  • Location of /usr/local/apps, libraries, and languages e.g., Perl
  • A bunch more stuff needed here


Cloud Manager

Get Shan-Ho to show us POD


Use the account creation process as a way to identify inactive users (see the sketch below):

  • Mail PIs a list of current users
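
Until accounting data is wired in, a minimal heuristic sketch for building that list: flag users whose home directory has not been modified recently. The UID cutoff and 180-day threshold are placeholders; lastlog or job-accounting data would be better evidence:

  #!/usr/bin/env python
  # Heuristic: list accounts whose home directory looks untouched, as a
  # starting point for the "mail PIs" pass.  Thresholds are placeholders.
  import os
  import pwd
  import time

  MIN_UID = 1000      # assumed lower bound for regular user accounts
  STALE_DAYS = 180    # placeholder inactivity threshold
  now = time.time()

  for user in pwd.getpwall():
      if user.pw_uid < MIN_UID:
          continue
      try:
          mtime = os.stat(user.pw_dir).st_mtime
      except OSError:
          print("%-12s  (no home directory)" % user.pw_name)
          continue
      idle_days = int((now - mtime) / 86400)
      if idle_days > STALE_DAYS:
          print("%-12s  home directory untouched for %d days" % (user.pw_name, idle_days))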

Lab Group Registration:

  • Lab VM login nodes

User Accounts:

  • New requests from PIs
  • CAS/LDAP authentication?
  • affiliate accounts?

Storage

  • HPC IceBreaker (3 x 48TB chains)
    • one chain for /home, /db, and /usr/local?
    • two chains for scratch?
  • Archive IceBreaker (2 x 320TB chains)
    • /oflow and /home, /usr/local backups
  • Lustre ClusterStor
    • scratch for zcluster via 10GbE
    • scratch for xcluster via 10GbE, maybe IB later

Other

  • Interactive cluster (using rack of zcluster nodes)?
  • New copy nodes (e.g., use some hadoop nodes)
  • Start draining zcluster GPU nodes for move to xcluster

PB notes

  • note: don't want perlwrapper this time
  • note: /usr/share/Modules/init/.modulespath is what links env-modules to the Scyld /opt/scyld/modulefiles tree (see the MODULEPATH sketch after this list)
  • research: will scratch icebreakers run over Ethernet also, or only IB, for storage?
  • research: try out POD
  • research: what is the difference between building on Intel vs. AMD?
  • decision: which compiler to use by default? Current thinking is gcc.
  • research/decision: will we do "qlogin"? an interactive queue? what kind of enforceable resource limits?
    • http://www.nics.tennessee.edu/~troy/pbstools/ for a qlogin
    • dunno about resource limits yet
  • decision: paths for /usr/local apps?
  • RESOLVED: scyldev can see zhead license server
  • DONE: install PGI
  • decision: package /usr/local apps as RPMs or not?
  • todo: install lmod (ready to build)
    • want it to heed /opt/scyld/modulefiles as well as our other module locations
  • todo: install CUDA
  • research: can we put Intel compilers on scyldev?
  • decision: enable rsh on compute nodes?
  • research: figure out the node naming scheme; see whether the beowulf config file syntax can accommodate our customary hostnames
  • decision: is 1024 a sufficient max per-user process limit?
  • decision: is 1024 a sufficient max per-user open-files limit? (see the limit-check sketch after this list)
  • research: need new IP allocation scheme
  • research: do any zcluster nodes have distinct kernel command line args?
  • todo: if we want users to ssh to nodes for jobs, we need something like /etc/profile.d/ssh-key.sh (see the sketch after this list)
  • todo: grep for "nodenumber" in the PDF files to get a list of per-node config files
  • todo: build the Lustre client. Penguin case 62419 covers this (we don't have the Scyld kernel source).
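
MODULEPATH sketch (for the .modulespath note and the lmod todo above): print where the module command will look for modulefiles and check that the Scyld tree is included. This assumes the usual env-modules .modulespath format of one directory per line with '#' comments:

  #!/usr/bin/env python
  # Show the module search paths and check that /opt/scyld/modulefiles is
  # among them.  Reads .modulespath one path per line ('#' starts a
  # comment) plus anything already exported in $MODULEPATH.
  import os

  MODULESPATH_FILE = "/usr/share/Modules/init/.modulespath"
  SCYLD_TREE = "/opt/scyld/modulefiles"

  paths = []
  if os.path.isfile(MODULESPATH_FILE):
      with open(MODULESPATH_FILE) as fh:
          for line in fh:
              line = line.split("#", 1)[0].strip()
              if line:
                  paths.append(line)
  paths.extend(p for p in os.environ.get("MODULEPATH", "").split(":") if p)

  print("module search paths:")
  for p in paths:
      print("  " + p)
  if SCYLD_TREE in paths:
      print("OK: Scyld modulefiles are on the path")
  else:
      print("WARNING: %s is not on the path" % SCYLD_TREE)

An lmod install would need the same directories on MODULEPATH from its init scripts.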
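
Limit-check sketch (for the two 1024 decisions above): report the per-user process and open-file limits actually in effect; run it as an ordinary user on a node. The limits themselves would normally come from /etc/security/limits.conf or limits.d on the compute nodes:

  #!/usr/bin/env python
  # Report the per-user limits the 1024-or-not decisions are about.
  # Run as an ordinary user on a compute node to see what is in effect.
  import resource

  def show(label, which):
      soft, hard = resource.getrlimit(which)
      fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
      print("%-20s soft=%-10s hard=%s" % (label, fmt(soft), fmt(hard)))

  show("max user processes", resource.RLIMIT_NPROC)
  show("open files", resource.RLIMIT_NOFILE)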
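
For the /etc/profile.d/ssh-key.sh todo, a sketch of what such a login hook typically does, written in Python here only for illustration (the real hook would be a small shell script): create a passwordless key on first login and authorize it so users can ssh to the nodes running their jobs. The key type and paths are assumptions:

  #!/usr/bin/env python
  # First-login hook sketch: make sure the user has a passwordless key that
  # is authorized for intra-cluster ssh.  Key type and paths are assumptions.
  import os
  import stat
  import subprocess

  ssh_dir = os.path.expanduser("~/.ssh")
  key = os.path.join(ssh_dir, "id_rsa")
  auth = os.path.join(ssh_dir, "authorized_keys")

  if not os.path.isdir(ssh_dir):
      os.makedirs(ssh_dir)
      os.chmod(ssh_dir, stat.S_IRWXU)                    # 0700
  if not os.path.exists(key):
      subprocess.check_call(["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", key])

  pub = open(key + ".pub").read()
  if not os.path.exists(auth) or pub not in open(auth).read():
      with open(auth, "a") as fh:
          fh.write(pub)
      os.chmod(auth, stat.S_IRUSR | stat.S_IWUSR)        # 0600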

Documentation

  • cloud, login nodes, etc.
  • queuing system

Zcluster Nodes (available for re-purposing)

Note: likely candidates for re-purposing are marked "(likely candidate)" below; see the tally sketch after the list.

  • rack 6:
    • (27) Dell 1950, 8-core, 16GB
  • rack 7:
    • (16) Dell R610, 8-core, 16GB
    • (1) Dell R810 Big Memory, 32-core, 512GB (likely candidate)
    • (2) SuperMicro, 12-core, 256GB
  • rack 8-11:
    • (123) Dell 1950, 8-core, 16GB
    • (5) SuperMicro, 12-core, 256GB
  • rack 12:
    • (10) Arch, 12-core, 48GB
    • (2) Dell R610 Tesla 1070, 8-core, 48GB
    • (2) Dell R610, 8-core, 48GB (old 192GB boxes)
    • (3) Dell R610, 8-core, 192GB
    • (1) SuperMicro Tesla 2075, 4-core, 24GB (Taha?)
    • (1) Dell R900, 16-core, 128GB
  • rack 13:
    • (27) Dell R410, 8-core, 16GB
  • rack 14:
    • (3) Dell PE C6145, 32-core, 64GB
    • (1) Dell R810 Big Memory, 32-core, 512GB (likely candidate)
    • (2) Dell R815 Interactive nodes, 48-core, 128GB (likely candidate)
    • (3) SuperMicro, 12-core, 256GB
  • rack 15:
    • (26) Arch, 12-core, 48GB
  • rack 16:
    • (10) Arch, 12-core, 48GB
  • rack 17:
    • (9) Arch, 24-core, 128GB (hadoop) (likely candidate)
    • (3) Arch, 24-core, 128GB (multi-core) (likely candidate)
  • rack 18:
    • (5) Penguin Kepler GPU, 12-core, 96GB (likely candidate)
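
A quick tally of the nodes marked "(likely candidate)" above; the counts, cores, and memory come straight from the rack list, so adjust the table if the inventory changes:

  #!/usr/bin/env python
  # Sum up the likely re-purpose candidates from the rack list above.
  candidates = [
      # (count, cores_per_node, GB_per_node, description)
      (2, 32, 512, "Dell R810 Big Memory (racks 7 and 14)"),
      (2, 48, 128, "Dell R815 Interactive nodes (rack 14)"),
      (9, 24, 128, "Arch hadoop nodes (rack 17)"),
      (3, 24, 128, "Arch multi-core nodes (rack 17)"),
      (5, 12,  96, "Penguin Kepler GPU nodes (rack 18)"),
  ]

  nodes = sum(n for n, c, m, d in candidates)
  cores = sum(n * c for n, c, m, d in candidates)
  mem = sum(n * m for n, c, m, d in candidates)
  print("likely candidates: %d nodes, %d cores, %d GB RAM" % (nodes, cores, mem))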

Raj's Requests

I have a few feature requests for the new cluster:

1) We definitely need a module system (as implemented by TACC) for software management and environment configuration. This will greatly help clean up our environment variable space and make it easy for users to set up their environment to suit the needs of the software.

2) Job queues with different run-time limits (which will affect job priorities), etc.

Thanks, -Raj