Ongoing issues: Difference between revisions

From Research Computing Center Wiki
# File Vanishing Problem:
 
#: User zheng has seen a file vanish right after saving it in vi.  I don't know which node he was on.
#: pbrunk saw this happen with vi on zcluster.
#: yhuang has seen it as yhuang on zcluster, and as root on rcluster's c9-1(!).
#: Clock skew may have contributed to this, but that is not certain.
# node config drift away from baseline:
#: Nodes that are down or unresponsive can't execute pdsh'd commands.  So unless such a node is reinstalled, once it is back in service it is still missing whatever changes were applied to the other nodes while it was down.
#: We can either write detection and/or remediation scripts (possibly ones which assess the node's condition at bootup time, or from cron, and make the node refuse to accept jobs if some conditions aren't met), or use a configuration management system like Puppet, Chef, cfengine, etc.
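A minimal sketch of the detection idea above, assuming a baseline manifest that maps file paths to expected SHA-256 digests. The manifest format and the function names `file_digest`, `find_drift`, and `node_may_accept_jobs` are hypothetical and not part of any existing tooling here; how a node would actually be made to refuse jobs is left out.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    p = Path(path)
    if not p.is_file():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def find_drift(baseline):
    """Compare files on this node against a baseline manifest.

    baseline maps a file path to its expected SHA-256 hex digest
    (hypothetical format). Returns the sorted list of paths that are
    missing or whose contents differ from the baseline.
    """
    return sorted(
        path for path, expected in baseline.items()
        if file_digest(path) != expected
    )


def node_may_accept_jobs(baseline):
    """Treat the node as fit for jobs only if nothing has drifted."""
    return not find_drift(baseline)
```

Run at boot or from cron, a check like this could gate job acceptance. It only detects drift in file contents; packages, services, and mounts would need their own checks.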
# GACRC_Repo in yum
#: Some nodes have a GACRC_Repo defined in yum.  This came from Curtis' worthy experiments in RPM'ing our customizations.  It also means that one must use 'yum --disablerepo=GACRC_Repo' on those nodes whenever doing anything with yum, and one must not use the disablerepo option on nodes which don't have that repo defined.
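One way to avoid remembering which nodes have the repo is a small wrapper that adds the option only where it applies. The sketch below assumes the repo, where present, is defined as a `[GACRC_Repo]` section in a `*.repo` file under /etc/yum.repos.d (it could also live in /etc/yum.conf, which is not checked here). The function names `repo_is_defined` and `yum_command` are hypothetical.

```python
import configparser
from pathlib import Path

REPO_NAME = "GACRC_Repo"


def repo_is_defined(repo_name, repos_dir="/etc/yum.repos.d"):
    """Return True if any *.repo file in repos_dir has a [repo_name] section."""
    for repo_file in sorted(Path(repos_dir).glob("*.repo")):
        parser = configparser.ConfigParser(interpolation=None, strict=False)
        try:
            parser.read(repo_file)
        except configparser.Error:
            # Skip files that are not valid INI rather than failing outright.
            continue
        if parser.has_section(repo_name):
            return True
    return False


def yum_command(args, repos_dir="/etc/yum.repos.d"):
    """Build a yum argv, adding --disablerepo only where the repo exists."""
    cmd = ["yum"]
    if repo_is_defined(REPO_NAME, repos_dir):
        cmd.append("--disablerepo=" + REPO_NAME)
    return cmd + list(args)
```

The returned list can be handed to `subprocess.run` on the node itself, so the same invocation works on nodes with and without the repo.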
# watchdog for "dnotify" process
#: dnotify is what detects the "request" files made by make_escratch, so we want it to be always running.  I've not seen it crash, but for safety's sake I altered /etc/init.d/dnotify (part of an RPM I made) to accept the argument "start-if-dead".  This starts dnotify if it has crashed (that is, if the lock file exists but there is no dnotify process).  I still have to confirm it works as intended (I think I already did) and put the cron job on every node.
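The "start-if-dead" rule described above boils down to one condition: restart only when the lock file exists but no process is running. A sketch of that decision follows. The names `should_start` and `pid_alive` are hypothetical, and how the real init script locates the dnotify process is not stated here, so the liveness check is passed in as a callable.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists.

    Signal 0 performs the existence and permission check only;
    nothing is delivered to the process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def should_start(lock_path, is_running):
    """Restart only if the lock file exists but no process is running.

    A missing lock file is taken to mean the service was stopped on
    purpose, so it is left alone.
    """
    return os.path.exists(lock_path) and not is_running()
```

Keying on the lock file means a service that an admin stopped deliberately stays stopped, while a crashed one is brought back by the next cron run.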
# Shell prompts and xterm titles
#: We'd like the hostname and CWD of shells to be reflected in the provided-by-us prompt, and for that prompt to also issue the escape sequences to set the title bar of relevant X apps like xterm.  Maybe this is already working as desired; we should check.  I think shtsai said it's not.
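For reference when checking this, xterm-compatible terminals set their title bar on receiving the sequence ESC ] 0 ; text BEL. The sketch below builds that sequence and a bash PS1 value that embeds it. The visible prompt layout is only an assumption for illustration, not our current prompt, and the function names are hypothetical.

```python
def xterm_title_sequence(title):
    """Return the OSC 0 sequence that sets an xterm's icon name and title."""
    return "\033]0;" + title + "\007"


def bash_ps1():
    """Return a bash PS1 string showing user, host, and CWD.

    The title part is wrapped in \\[ ... \\] so bash does not count the
    non-printing characters when computing the prompt width.
    """
    title = r"\[\e]0;\u@\h: \w\a\]"  # sets the window title, prints nothing
    visible = r"\u@\h:\w\$ "         # assumed visible layout
    return title + visible
```

Printing `xterm_title_sequence("c9-1: /tmp")` in an xterm should change its title bar, which gives a quick manual test independent of the shell configuration.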
# accounting and the flat file
#: Right now zcluster copies the /etc/accounts flat file from where rcluster puts a copy for it.  Instead, we want zcluster to generate its own file, since rcluster will be going away.  This is a todo item, but not, I think, a problematic one.
# nodes' SSH keys regenerated on install, conflicting with stored keys on headnode/login node
#: The cluster is designed to allow passwordless ssh for everyone from head/login node to any compute node.  But when a node is reinstalled, it gets a new key, conflicting with the stored one, and so ssh'd commands to those nodes don't execute.  We have to steal/modify/write a scanssh-type script that repopulates the stored keys with the current ones.
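The repopulation step can be sketched as a merge: drop every stored entry for a host that was rescanned, keep everything else, then append the fresh keys. The sketch assumes plain (unhashed) known_hosts entries without markers, and scan output in the `host keytype key` form that OpenSSH's `ssh-keyscan` prints. Using ssh-keyscan is an assumption here, and the function name `refresh_known_hosts` is hypothetical.

```python
def refresh_known_hosts(existing_lines, scanned_lines):
    """Replace stored keys for rescanned hosts and keep all other entries.

    Both arguments are iterables of known_hosts-style lines. Returns the
    merged list of lines without trailing newlines.
    """
    def is_entry(line):
        stripped = line.strip()
        return bool(stripped) and not stripped.startswith("#")

    def hosts_of(line):
        # The first field is a comma-separated list of names/addresses.
        return set(line.split()[0].split(","))

    scanned = [line.strip() for line in scanned_lines if is_entry(line)]
    rescanned_hosts = set()
    for line in scanned:
        rescanned_hosts |= hosts_of(line)

    kept = [
        line.rstrip("\n") for line in existing_lines
        if not is_entry(line) or not (hosts_of(line) & rescanned_hosts)
    ]
    return kept + scanned
```

Accepting whatever key a node presents is only reasonable on a trusted cluster network, and only for nodes known to have just been reinstalled.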
# security hole in GE
#: If you use root-run "prologs" or "epilogs" (we don't, yet), or if you use sshd for your qlogin daemon (we do), a crafty user could elevate her privileges.  Discussion is at [http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/3271 http://comments.gmane.org/gmane.comp.clustering.opengridengine.user/3271]

Latest revision as of 17:14, 25 April 2012