Xcluster Go-Live


Name the cluster!!

  • Win a free hot-apple pie


(Scyld) Development Environment

Software

Intel Compiler - Guy is checking on getting it.

Apps to install on xcluster:

  • MPI, multi-core, big-memory, serial?
  • Popular apps (regardless of type; how to determine? see the sketch after this list):
    • time stamps on app dirs??
    • access time of executable??
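
One way to answer the "how to determine" question above: scan each app directory for executables and report the newest access time seen. A minimal sketch in Python, assuming apps sit one level down under /usr/local/apps and that the filesystem is not mounted noatime (otherwise access times mean nothing):

 #!/usr/bin/env python
 # Report, per app directory, the most recent access time of any executable
 # found underneath it, so rarely-used apps stand out.
 import os
 import stat
 import time

 APPS_ROOT = "/usr/local/apps"

 def latest_exec_atime(app_dir):
     """Return the newest access time of any executable file under app_dir."""
     newest = 0
     for dirpath, dirnames, filenames in os.walk(app_dir):
         for name in filenames:
             path = os.path.join(dirpath, name)
             try:
                 st = os.stat(path)
             except OSError:
                 continue  # broken symlink, permission problem, etc.
             if st.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
                 newest = max(newest, st.st_atime)
     return newest

 if __name__ == "__main__":
     results = []
     for entry in sorted(os.listdir(APPS_ROOT)):
         app_dir = os.path.join(APPS_ROOT, entry)
         if os.path.isdir(app_dir):
             results.append((latest_exec_atime(app_dir), entry))
     # Most recently used apps first; never-accessed entries sort last.
     for atime, app in sorted(results, reverse=True):
         when = time.strftime("%Y-%m-%d", time.localtime(atime)) if atime else "never"
         print("%-12s %s" % (when, app))

Anything that shows no recent access would be a candidate to leave off xcluster unless someone asks for it.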

New queueing system:

  • how/what to configure (e.g. fairshare, queues, "complexes")

Use environment modules or not?

Scyld

  • Location of /usr/local/apps, libraries, and languages (e.g., Perl)
  • A bunch more stuff needed here


Cloud Manager

Use the account creation process as a means to identify inactive users:

  • Mail PIs a list of their current users (see the sketch below)
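
A rough sketch of what that mailing could look like, assuming each lab maps onto a Unix group; the group names, PI addresses, and mail relay below are placeholders until the Cloud Manager data model is settled:

 #!/usr/bin/env python
 # For each lab group, mail the PI the list of accounts currently in that
 # group so inactive users can be flagged for removal.
 import grp
 import smtplib
 from email.mime.text import MIMEText

 LAB_GROUPS = ["smith_lab", "jones_lab"]  # placeholder lab group names
 SMTP_HOST = "localhost"                  # assumed local mail relay
 FROM_ADDR = "rcc-admin@example.edu"      # placeholder sender address

 def mail_pi(group_name):
     members = sorted(grp.getgrnam(group_name).gr_mem)
     body = ("Current accounts in group %s:\n\n%s\n\n"
             "Please reply with any accounts that are no longer active."
             % (group_name, "\n".join(members)))
     msg = MIMEText(body)
     msg["Subject"] = "Account review for %s" % group_name
     msg["From"] = FROM_ADDR
     msg["To"] = "%s-pi@example.edu" % group_name  # placeholder PI address
     s = smtplib.SMTP(SMTP_HOST)
     s.sendmail(FROM_ADDR, [msg["To"]], msg.as_string())
     s.quit()

 if __name__ == "__main__":
     for g in LAB_GROUPS:
         mail_pi(g)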

Lab Group Registration:

  • Lab VM login nodes

User Accounts:

  • New requests from PIs
  • CAS/LDAP authentication? (see the bind-test sketch after this list)
  • Affiliate accounts?
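
If the answer to the CAS/LDAP question is LDAP, the bind test itself is small. The sketch below uses the third-party ldap3 module; the server URL and DN pattern are made up, and CAS (a web SSO flow) would be handled separately on the login/VM side:

 # Illustrative LDAP simple-bind check; URL and DN pattern are placeholders.
 from ldap3 import Server, Connection

 LDAP_URL = "ldaps://ldap.example.edu"           # placeholder server
 USER_DN = "uid=%s,ou=people,dc=example,dc=edu"  # placeholder DN pattern

 def ldap_check(username, password):
     """Return True if the username/password pair binds successfully."""
     server = Server(LDAP_URL)
     conn = Connection(server, user=USER_DN % username, password=password)
     ok = conn.bind()
     conn.unbind()
     return ok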

Storage

  • HPC IceBreaker (3 x 48TB chains)
    • one chain for /home, /db, and /usr/local?
    • two chains for scratch?
  • Archive IceBreaker (2 x 320TB chains)
    • /oflow and /home, /usr/local backups
  • Lustre ClusterStor
    • scratch for zcluster via 10GbE
    • scratch for xcluster via 10GbE, maybe IB later

Other

  • Interactive cluster (using a rack of zcluster nodes)?
  • New copy nodes (e.g., use some hadoop nodes)


PB notes

  • note: don't want perlwrapper this time
  • research: will the scratch IceBreakers run over Ethernet as well, or only IB, for storage?
  • research: try out POD
  • research: what is the difference between building on Intel vs. AMD?
  • decision: which compiler to use by default? I think we're leaning toward gcc.
  • research/decision: will we do "qlogin"? an interactive queue? what kind of enforceable resource limits?
  • decision: paths for /usr/local apps?
  • RESOLVED: scyldev can see zhead license server
  • DONE: install PGI
  • decision: RPMs vs not for /usr/local
  • todo: install lmod (is ready to build)
    • want it to heed /opt/scyld/modulefiles as well as other location
  • todo: install CUDA
  • research: can we put Intel compilers on scyldev?
  • decision: enable rsh on compute nodes?
  • research: figure out the node naming scheme. See if we can get the syntax for our customary hostnames into the beowulf config file.
  • decision: is 1024 a sufficient max per-user process limit?
  • decision: is 1024 a sufficient max per-user open files limit? (see the limits check after this list)
  • research: need new IP allocation scheme
  • research: do any zcluster nodes have distinct kernel command line args?
  • todo: if we want users to ssh to nodes for jobs, we need e.g. /etc/profile.d/ssh-key.sh
  • todo: grep for "nodenumber" in the PDF files to get a list of per-node config files
  • todo: build the Lustre client. Penguin case 62419 is about this (we don't have the Scyld kernel source).
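
For the two 1024-limit decisions above, here is a quick way to see what a user process actually gets on a node. The limits themselves would presumably be set via /etc/security/limits.conf or the node image; this only reads them:

 #!/usr/bin/env python
 # Print the soft/hard per-user limits for processes and open files as
 # seen by a process on the node where it runs.
 import resource

 LIMITS = [
     ("max user processes", resource.RLIMIT_NPROC),
     ("max open files", resource.RLIMIT_NOFILE),
 ]

 for label, which in LIMITS:
     soft, hard = resource.getrlimit(which)
     print("%-20s soft=%s hard=%s" % (label, soft, hard))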

Documentation

  • cloud, login nodes, etc.
  • queuing system

Zcluster nodes available for repurposing

  • rack 6:
    • (27) Dell 1950, 8-core, 16GB
  • rack 7:
    • (16) Dell R610, 8-core, 16GB
    • (1) Dell R810 Big Memory, 32-core, 512GB
    • (2) SuperMicro, 12-core, 256GB
  • rack 8-11:
    • (123) Dell 1950, 8-core, 16GB
    • (5) SuperMicro, 12-core, 256GB
  • rack 12:
    • (10) Arch, 12-core, 48GB
    • (2) Dell R610 Tesla 1070, 8-core, 48GB
    • (2) Dell R610, 8-core, 48GB (old 192GB boxes)
    • (3) Dell R610, 8-core, 192GB
    • (1) SuperMicro Tesla 2075, 4-core, 24GB (Taha?)
    • (1) Dell R900, 16-core, 128GB (what is this?)
  • rack 13:
    • (27) Dell R410, 8-core, 16GB
  • rack 14:
    • (3) Dell PE C6145, 32-core, 64GB
    • (1) Dell R810 Big Memory, 32-core, 512GB
    • (2) Dell R815 Interactive nodes, 48-core, 128GB
    • (3) SuperMicro, 12-core, 256GB
  • rack 15:
    • (26) Arch, 12-core, 48GB
  • rack 16:
    • (10) Arch, 12-core, 48GB
  • rack 17:
    • (9) Arch, 24-core, 128GB, (hadoop)
    • (3) Arch, 24-core, 128GB, (multi-core)
  • rack 18:
    • (5) Penguin Kepler GPU, 12-core, 96GB