Ongoing issues

From Research Computing Center Wiki

Revision as of 16:43, 25 April 2012

  1. File Vanishing Problem:
    User zheng has seen a file vanish right after saving it in vi. I don't know which node he was on.
    pbrunk saw this happen with vi on zcluster.
    yhuang has seen it as yhuang on zcluster, and as root on rcluster's c9-1(!).
    Clock skew may have contributed to this, but that is not certain.
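    One way to test the clock-skew theory is to compare a node's clock with the timestamp the file system puts on a freshly created file. The following is only a sketch: the function name is made up, it assumes GNU stat, and it assumes the file's mtime is stamped by the file server rather than by the client.

```shell
# Hypothetical helper: estimate the skew between this node's clock and the
# clock that stamps new files in DIR. Prints (file mtime - local time) in
# whole seconds. Assumes GNU coreutils (stat -c %Y).
fs_clock_skew() {
    local dir=${1:-.} probe now mtime
    now=$(date +%s)                                    # local clock, seconds since epoch
    probe=$(mktemp "$dir/skewprobe.XXXXXX") || return 1
    mtime=$(stat -c %Y "$probe")                       # timestamp recorded for the new file
    rm -f "$probe"
    echo $(( mtime - now ))
}
```

    Running it against a shared directory on each node (for example through pdsh) would show whether any node disagrees with the file server by more than a second or two.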
  2. node config drift away from baseline:
    Nodes that are down or unresponsive can't execute commands sent with pdsh. So once those nodes are back in service, unless they've been reinstalled, they miss the changes that were applied to the other nodes while they were down.
    We can either write detection and/or remediation scripts (and/or ones that assess the node's condition at boot time, or from cron, and make the node refuse to accept jobs if some conditions aren't met), or use a configuration management system such as Puppet, Chef, or cfengine.
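    A detection script could be as small as comparing a baseline revision stamp against a stamp kept on the node. This is a sketch under assumptions: neither stamp file exists yet, and the function name and the idea of a single revision stamp are hypothetical.

```shell
# Hypothetical drift check: compare a baseline stamp file (e.g. on shared
# storage) with the stamp recording what was last applied to this node.
# Prints OK, DRIFT, or UNKNOWN and returns 0, 1, or 2 respectively.
check_baseline() {
    local baseline=$1 applied=$2
    if [ ! -r "$baseline" ] || [ ! -r "$applied" ]; then
        echo "UNKNOWN"          # cannot tell; treat as suspect
        return 2
    fi
    if cmp -s "$baseline" "$applied"; then
        echo "OK"
        return 0
    fi
    echo "DRIFT"                # node missed at least one change
    return 1
}
```

    Run at boot or from cron, a non-zero return would be the signal to keep the node from accepting jobs; how that is done depends on the scheduler and is not shown here.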
  3. GACRC_Repo in yum
    Some nodes have a GACRC_Repo defined in yum. This was from Curtis' worthy experiments in RPM'ing our customizations. It also means that one must use 'yum --disablerepo=GACRC_Repo' on those nodes whenever doing anything with yum, and one must not use the disablerepo option on nodes which don't have that repo defined.
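    Until the repo is either removed or defined everywhere, a small helper could decide per node whether the option is needed. This is a sketch: the function name is invented, and it assumes the repo is defined in a .repo file under /etc/yum.repos.d rather than in /etc/yum.conf.

```shell
# Hypothetical helper: print --disablerepo=GACRC_Repo only if some .repo
# file in the given directory (default /etc/yum.repos.d) has a
# [GACRC_Repo] section. Prints nothing otherwise.
gacrc_yum_opts() {
    local repo_dir=${1:-/etc/yum.repos.d}
    # -q: quiet, -s: stay silent if no .repo files exist
    if grep -qs '^\[GACRC_Repo]' "$repo_dir"/*.repo; then
        echo "--disablerepo=GACRC_Repo"
    fi
}

# Usage (left unquoted on purpose so an empty result disappears):
#   yum $(gacrc_yum_opts) update
```

    The same command line then works through pdsh on both kinds of node.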
  4. watchdog for "dnotify" process
    dnotify is what detects the "request" files made by make_escratch, so we want it to be always running. I've not seen it crash, but for safety's sake I altered /etc/init.d/dnotify (part of an RPM I made) to accept the argument "start-if-dead". This starts dnotify if it has crashed (that is, if the lock file exists but no dnotify process does). I still have to confirm that it works as intended (I think I already did) and put the cron job everywhere.
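    For reference, the "crashed" test described above can be sketched as follows. This is not the code in the RPM; the function name is invented, the lock file path in the comment is an assumed Red Hat style location, and it relies on pgrep from procps.

```shell
# Sketch of the start-if-dead condition: true (returns 0) when the lock
# file exists but no process with exactly that name is running.
is_dead() {
    local lockfile=$1 procname=$2
    [ -e "$lockfile" ] && ! pgrep -x "$procname" >/dev/null
}

# Possible use inside the init script (lock path is an assumption):
#   is_dead /var/lock/subsys/dnotify dnotify && start
#
# Possible cron entry (the five-minute interval is an arbitrary choice):
#   */5 * * * * root /etc/init.d/dnotify start-if-dead
```

    Note that a clean stop removes the lock file under this convention, so a deliberately stopped dnotify would not be restarted by the cron job.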
  5. Shell prompts and xterm titles
    We'd like the hostname and CWD of shells to be reflected in the provided-by-us prompt, and for that prompt to also issue the escape sequences to set the title bar of relevant X apps like xterm. Maybe this is already working as desired; we should check. I think shtsai said it's not.
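    If it turns out not to be working, something along these lines in a bash startup file (for example a script under /etc/profile.d) would cover both the prompt and the title bar. It is a sketch: the function name and the exact prompt layout are assumptions, and only xterm-like terminals are handled.

```shell
# Hypothetical prompt setup for bash: show user@host:cwd in the prompt and,
# on xterm-like terminals, mirror it into the window title.
set_cluster_prompt() {
    local base='\u@\h:\w\$ '
    case $TERM in
        xterm*|rxvt*)
            # ESC ] 0 ; text BEL sets the icon name and window title.
            # \[ ... \] tells bash the sequence takes no space on the line.
            PS1='\[\e]0;\u@\h: \w\a\]'"$base"
            ;;
        *)
            PS1=$base
            ;;
    esac
}
set_cluster_prompt
```

    Checking the TERM value avoids printing the escape sequence as garbage on terminals that don't understand it, such as the Linux console.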
  6. accounting and the flat file
    Right now zcluster copies the /etc/accounts flat file from where rcluster puts a copy for it. We want zcluster to generate its own file instead, since rcluster will be going away. This is a to-do item but not, I think, a problematic one.