Xcluster Go-Live


Go-live requirements


Name the cluster!!

  • Win a free hot-apple pie


(Scyld) Development Environment

Software

Intel Compiler - Guy is checking on getting it.

Apps to install on xcluster:

  • MPI, multi-core, big-memory, serial?
  • Popular apps (regardless of type; how do we determine popularity?):
    • time stamps on app directories?
    • access time of executables? (see the sketch below)
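
One rough way to gauge popularity, assuming the app tree lives under /usr/local/apps and the filesystem is not mounted noatime: list executables by last access time. A sketch only; the path and directory depth are assumptions:

    # List executables under /usr/local/apps (assumed layout) by most
    # recent access time; meaningless if the filesystem is mounted noatime.
    find /usr/local/apps -maxdepth 3 -type f -perm /111 \
        -printf '%A@ %AY-%Am-%Ad %p\n' 2>/dev/null |
        sort -rn | cut -d' ' -f2- | head -n 50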

New queueing system:

  • how/what to configure (e.g. fairshare, queues, "complexes")

Use modules or not?

Scyld

  • Location of /usr/local/apps, libraries, and languages, e.g., Perl
  • A bunch more stuff needed here


Cloud Manager

Get Shan-Ho to show us POD


Use the account creation process as a means to identify inactive users:

  • Mail PIs with a list of current users

Lab Group Registration:

  • Lab VM login nodes

User Accounts:

  • New requests from PIs
  • CAS/LDAP authentication?
  • Affiliate accounts?

Storage

  • HPC IceBreaker (3 x 48TB chains)
    • one chain for /home, /db, and /usr/local?
    • two chains for scratch?
  • Archive IceBreaker (2 x 320TB chains)
    • /oflow, plus backups of /home and /usr/local
  • Lustre ClusterStor
    • scratch for zcluster via 10GbE
    • scratch for xcluster via 10GbE, maybe IB later

Other

  • Interactive cluster (using rack of zcluster nodes)?
  • New copy nodes (e.g., use some Hadoop nodes)
  • Start draining zcluster GPU nodes for move to xcluster

PB notes

  • note: don't want perlwrapper this time
  • note: /usr/share/Modules/init/.modulespath is what links env-modules to the Scyld /opt/scyld/modulefiles tree (see the .modulespath sketch after this list)
  • research: will the scratch IceBreakers run over Ethernet also, or only IB, for storage?
  • research: try out POD
  • research: what is difference between building on Intel vs AMD?
  • decision: which compiler to use by default? We're leaning toward gcc.
  • research/decision: will we do "qlogin"? an interactive queue? what kind of enforceable resource limits?
  • decision: paths for /usr/local apps?
  • RESOLVED: scyldev can see zhead license server
  • DONE: install PGI
  • decision: RPMs vs not for /usr/local
  • todo: install Lmod (ready to build)
    • want it to heed /opt/scyld/modulefiles as well as the other location (see the Lmod sketch after this list)
  • todo: install CUDA
  • research: can we put Intel compilers on scyldev?
  • decision: enable rsh on compute nodes?
  • research: figure out the node naming scheme. See if we can get the syntax for our customary hostnames into the beowulf config file.
  • decision: is 1024 a sufficient max per-user process limit?
  • decision: is 1024 a sufficient max per-user open-files limit? (see the limits sketch after this list)
  • research: need new IP allocation scheme
  • research: do any zcluster nodes have distinct kernel command line args?
  • todo: if we want users to ssh to nodes for jobs, we need e.g. /etc/profile.d/ssh-key.sh (see the ssh-key.sh sketch after this list)
  • todo: grep for "nodenumber" in the PDF files to get a list of the per-node config files (see the grep sketch after this list)
  • todo: build the Lustre client. Penguin case 62419 covers this (we don't have the Scyld kernel source).
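
For the .modulespath note above: env-modules seeds MODULEPATH from that file, one directory per line. A minimal sketch of making sure both the Scyld tree and a local tree are listed; /usr/local/modulefiles is only a placeholder for whatever local location we pick:

    # Append a local modulefiles tree (placeholder path) to the env-modules
    # search-path file if it is not already listed; Scyld's own entry,
    # /opt/scyld/modulefiles, should already be present.
    grep -qx '/usr/local/modulefiles' /usr/share/Modules/init/.modulespath ||
        echo '/usr/local/modulefiles' >> /usr/share/Modules/init/.modulespath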
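For the Lmod item: Lmod takes its search path from $MODULEPATH, so a profile.d snippet is one way to make it heed both trees. A sketch only; the file name and the /usr/local/modulefiles path are placeholders:

    # /etc/profile.d/modulepath.sh (hypothetical name)
    # Add the Scyld modulefiles tree and a local tree (placeholder) to
    # MODULEPATH if they are not already there.
    for d in /opt/scyld/modulefiles /usr/local/modulefiles; do
        case ":$MODULEPATH:" in
            *":$d:"*) ;;                                   # already present
            *) MODULEPATH="${MODULEPATH:+$MODULEPATH:}$d" ;;
        esac
    done
    export MODULEPATH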
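For the two 1024-limit decisions: whatever values we settle on would typically be enforced through a pam_limits drop-in. A sketch with the numbers under discussion; the file name is a placeholder, not a recommendation:

    # Write the per-user caps under discussion (1024 processes, 1024 open
    # files) to a pam_limits drop-in; file name is hypothetical.
    conf=/etc/security/limits.d/90-xcluster.conf
    {
        echo '*  soft  nproc   1024'
        echo '*  hard  nproc   1024'
        echo '*  soft  nofile  1024'
        echo '*  hard  nofile  1024'
    } > "$conf"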
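For the ssh-key.sh item: a minimal sketch of what /etc/profile.d/ssh-key.sh could do, assuming home directories are shared across the nodes, so users can ssh to compute nodes without a password:

    # /etc/profile.d/ssh-key.sh -- sketch only; assumes a shared /home.
    # Create a passphrase-less key on first login and authorize it for
    # intra-cluster ssh.
    if [ ! -f "$HOME/.ssh/id_rsa" ]; then
        mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
        ssh-keygen -q -t rsa -N '' -f "$HOME/.ssh/id_rsa"
        cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
        chmod 600 "$HOME/.ssh/authorized_keys"
    fi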
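For the "nodenumber" item: something along these lines (pdftotext is from poppler-utils; the docs directory is a placeholder) would pull the per-node config file references out of the vendor PDFs:

    # Search the vendor PDFs for "nodenumber"; the docs directory is a
    # placeholder.
    for f in /path/to/penguin-docs/*.pdf; do
        pdftotext "$f" - | grep -in -H --label="$f" 'nodenumber'
    done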

Documentation

  • cloud, login nodes, etc.
  • queuing system

Zcluster Nodes (available for re-purpose)

Note: Items in bold are likely candidates.

  • rack 6:
    • (27) Dell 1950, 8-core, 16GB
  • rack 7:
    • (16) Dell R610, 8-core, 16GB
    • (1) Dell R810 Big Memory, 32-core, 512GB
    • (2) SuperMicro, 12-core, 256GB
  • racks 8-11:
    • (123) Dell 1950, 8-core, 16GB
    • (5) SuperMicro, 12-core, 256GB
  • rack 12:
    • (10) Arch, 12-core, 48GB
    • (2) Dell R610 Tesla 1070, 8-core, 48GB
    • (2) Dell R610, 8-core, 48GB (old 192GB boxes)
    • (3) Dell R610, 8-core, 192GB
    • (1) SuperMicro Tesla 2075, 4-core, 24GB (Taha?)
    • (1) Dell R900, 16-core, 128GB
  • rack 13:
    • (27) Dell R410, 8-core, 16GB
  • rack 14:
    • (3) Dell PE C6145, 32-core, 64GB
    • (1) Dell R810 Big Memory, 32-core, 512GB
    • (2) Dell R815 Interactive nodes, 48-core, 128GB
    • (3) SuperMicro, 12-core, 256GB
  • rack 15:
    • (26) Arch, 12-core, 48GB
  • rack 16:
    • (10) Arch, 12-core, 48GB
  • rack 17:
    • (9) Arch, 24-core, 128GB (Hadoop)
    • (3) Arch, 24-core, 128GB (multi-core)
  • rack 18:
    • (5) Penguin Kepler GPU, 12-core, 96GB

Raj's Requests

I have a few new feature requests for the new cluster:

1) We definitely need a module system (as implemented by TACC) for software management and environment configuration. This will greatly help clean up our environment variable space and make it easy for users to set up their environment to suit the needs of their software.

2) Job queues with different run-time limits (which will affect job priorities), etc.

Thanks, -Raj