Ongoing issues: Difference between revisions

From Research Computing Center Wiki

Revision as of 16:04, 25 April 2012

  1. File Vanishing Problem:
    User zheng has seen a file vanish right after saving it in vi. I don't know which node he was on.
    pbrunk saw this happen with vi on zcluster.
    yhuang has seen it on zcluster as yhuang, and as root on rcluster's c9-1(!).
    It's possible that clock skew contributed to this, but not certain.
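    To find out whether clock skew is even present, something like the following could be tried. It is a rough sketch only: report_skew is a made-up helper, and it assumes pdsh prefixes each output line with "host: " and that the nodes have GNU date. The time pdsh takes to reach each node adds noise, so only offsets of several seconds would mean anything.

```sh
# report_skew REFERENCE_EPOCH MAX_SECONDS   (hypothetical helper)
# Reads "host: epoch" lines on stdin and prints each host whose clock
# differs from REFERENCE_EPOCH by more than MAX_SECONDS, with the offset.
report_skew() {
    awk -v ref="$1" -v max="$2" -F': ' '
        $2 ~ /^[0-9]+$/ {
            d = $2 - ref
            if (d < 0) d = -d          # absolute difference in seconds
            if (d > max) print $1, d
        }'
}

# Example (not run here):
#   pdsh -a date +%s | report_skew "$(date +%s)" 5
```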
  2. node config drift away from baseline:
    Nodes that are down or unresponsive can't execute commands sent with pdsh. So when such a node returns to service without having been reinstalled, it is missing whatever changes were applied to the other nodes while it was down.
    We can either write detection and/or remediation scripts (possibly ones that assess the node's condition at boot time or from cron, and make the node refuse to accept jobs if some conditions aren't met), or use a configuration management system such as Puppet, Chef, or cfengine.
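    One possible shape for the remediation-script option, as a rough sketch only: each change pushed with pdsh is also saved as a small script in a shared directory, and every node keeps a list of the scripts it has already run. At boot or from cron, the node runs whatever it missed. The function name, the directory layout, and the state file are all hypothetical; nothing like this exists on the cluster yet.

```sh
# apply_pending CHANGES_DIR STATE_FILE   (hypothetical helper)
# Runs, in name order, every *.sh script in CHANGES_DIR that is not yet
# listed in STATE_FILE, and records each script name once it succeeds.
apply_pending() {
    changes_dir=$1
    state_file=$2
    touch "$state_file" || return 1
    for script in "$changes_dir"/*.sh; do
        [ -f "$script" ] || continue            # the glob matched nothing
        name=$(basename "$script")
        # Skip changes this node has already applied.
        grep -qxF -e "$name" "$state_file" && continue
        if sh "$script"; then
            echo "$name" >> "$state_file"
        else
            echo "change $name failed; stopping" >&2
            return 1
        fi
    done
}

# Example (not run here; both paths are placeholders):
#   apply_pending /shared/node-changes /var/lib/node-changes.applied
```

    A nonzero return could be the signal for the node to refuse jobs until someone has looked at it.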
  3. GACRC_Repo in yum
    Some nodes have a GACRC_Repo defined in yum. This was from Curtis' worthy experiments in RPM'ing our customizations. It also means that one must use 'yum --disablerepo=GACRC_Repo' on those nodes whenever doing anything with yum, and one must not use the disablerepo option on nodes which don't have that repo defined.
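    Until the repo is either everywhere or nowhere, a small wrapper could pick the right option per node. This is a sketch: yum_repo_args is a made-up name, and it assumes the repo is defined in a .repo file under /etc/yum.repos.d rather than in /etc/yum.conf.

```sh
# yum_repo_args [REPO_DIR]   (hypothetical helper)
# Prints "--disablerepo=GACRC_Repo" if some .repo file in REPO_DIR
# (default /etc/yum.repos.d) has a [GACRC_Repo] section; else prints nothing.
yum_repo_args() {
    repo_dir=${1:-/etc/yum.repos.d}
    # -s keeps grep quiet when the glob matches no files.
    if grep -qs '^[[:space:]]*\[GACRC_Repo\]' "$repo_dir"/*.repo; then
        echo "--disablerepo=GACRC_Repo"
    fi
}

# Example (not run here):
#   yum $(yum_repo_args) update
```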
  4. watchdog for "dnotify" process
    dnotify is what detects the "request" files made by make_escratch, so we want it to be always running. I've not seen it crash, but for safety's sake I altered /etc/init.d/dnotify (part of an RPM I made) to accept the argument "start-if-dead". This starts dnotify if it has crashed (that is, if the lock file exists but no dnotify process is running). I have to make sure it works as intended (I think I already did) and put the cron job everywhere.
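    For reference, the "start-if-dead" test boils down to something like the sketch below. This is not the code from the RPM: needs_restart is a made-up name, the lock file path in the example is only the usual Red Hat convention, and the cron schedule is a placeholder.

```sh
# needs_restart LOCKFILE PROCESS_NAME   (hypothetical helper)
# Succeeds (returns 0) only when LOCKFILE exists but no process with
# exactly that name is running, i.e. the daemon died without cleaning up.
needs_restart() {
    [ -f "$1" ] || return 1                     # never started, or stopped cleanly
    pgrep -x "$2" >/dev/null 2>&1 && return 1   # still running
    return 0
}

# Example (not run here; the lock file path is an assumption):
#   needs_restart /var/lock/subsys/dnotify dnotify && /etc/init.d/dnotify start
#
# Possible /etc/cron.d entry (the schedule is a placeholder):
#   */5 * * * * root /etc/init.d/dnotify start-if-dead
```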
  5. Shell prompts and xterm titles
    We'd like the hostname and CWD of shells to be reflected in the provided-by-us prompt, and for that prompt to also issue the escape sequences to set the title bar of relevant X apps like xterm. Maybe this is already working as desired; we should check. I think shtsai said it's not.
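    If it turns out not to be working, the usual bash approach is to embed the xterm title escape sequence in PS1. A minimal sketch, assuming bash is the login shell and that the snippet would live in some site-wide startup file (the function name and the placement are assumptions):

```sh
# prompt_for_term TERM_NAME   (hypothetical helper)
# Prints a bash PS1 string showing user@host:cwd. For xterm-like terminals
# it also embeds the "ESC ] 0 ; text BEL" sequence that sets the window
# title; \[ and \] tell bash that part takes up no space on the line.
prompt_for_term() {
    case "$1" in
        xterm*|rxvt*)
            printf '%s' '\[\e]0;\u@\h:\w\a\]\u@\h:\w\$ '
            ;;
        *)
            printf '%s' '\u@\h:\w\$ '
            ;;
    esac
}

# Example, for an interactive bash startup file (not run here):
#   PS1=$(prompt_for_term "$TERM")
```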
  6. accounting and the flat file
    Right now zcluster copies the /etc/accounts flat file from where rcluster puts a copy for it. We want zcluster to generate its own file instead, since rcluster will be going away. This is a to-do item, but I don't think it's a problematic one.
  7. nodes' SSH keys regenerated on install, conflicting with stored keys on headnode/login node
    The cluster is designed to allow passwordless ssh for everyone from head/login node to any compute node. But when a node is reinstalled, it gets a new key, conflicting with the stored one, and so ssh'd commands to those nodes don't execute. We have to steal/modify/write a scanssh-type script that repopulates the stored keys with the current ones.
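    A sketch of what the repopulating step might look like, using ssh-keyscan to collect the current keys. refresh_known_hosts is a made-up name, the file paths in the example are assumptions, and hashed known_hosts entries would not be matched. Note that ssh-keyscan accepts whatever key a host presents, so this would only be reasonable right after a known reinstall on the cluster's private network.

```sh
# refresh_known_hosts KNOWN_HOSTS SCAN_OUTPUT   (hypothetical helper)
# SCAN_OUTPUT holds "host keytype key" lines as printed by ssh-keyscan.
# Every entry in KNOWN_HOSTS that names one of the scanned hosts is dropped,
# then the freshly scanned lines are appended.
refresh_known_hosts() {
    known=$1
    scan=$2
    [ -s "$scan" ] || return 0          # nothing scanned: leave the file alone
    touch "$known" || return 1
    tmp=$(mktemp) || return 1
    # Pass 1 (scan file): remember the scanned host names.
    # Pass 2 (known file): keep only lines that mention none of them.
    awk 'NR == FNR { if ($0 !~ /^#/ && NF >= 3) fresh[$1] = 1; next }
         {
             n = split($1, names, ",")
             for (i = 1; i <= n; i++) if (names[i] in fresh) next
             print
         }' "$scan" "$known" > "$tmp" || return 1
    grep -v '^#' "$scan" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$known" && rm -f "$tmp"
}

# Example (not run here; node names and paths are placeholders):
#   ssh-keyscan -t rsa c1-1 c1-2 > /tmp/scan.out 2>/dev/null
#   refresh_known_hosts /etc/ssh/ssh_known_hosts /tmp/scan.out
```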
  8. security hole in GE
    Not urgent: the vulnerability requires root-run "prolog" or "epilog" GE scripts, which we don't have yet. Discussion is at http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/3271