System Administration

From Research Computing Center Wiki
Revision as of 17:21, 23 December 2013 by Curtis E. Combs Jr.

Introduction

System administration at GACRC requires experience with, and advanced knowledge of, managing the Linux operating system, as well as the ability to quickly and accurately troubleshoot, configure and plan the automated clustering of many nodes with little or no manual intervention. The cluster platform provides the fundamental environment in which users can efficiently run scientific applications in either a serial or, preferably, a parallel context. The following sections break the system administration role into its most important categories; covering every aspect of Linux usage and administration is beyond the scope of any single document.

Overview

At GACRC, a production cluster is a collection of server-class hardware nodes managed by a platform management solution, which allows users to submit their jobs to a resource scheduler. The platform management solution in use at GACRC is Rocks. The Rocks distribution provides infrastructure-level services, manages the installation of nodes and gives the administrator a command-line interface for initiating platform management events.

The resource scheduler is Univa Grid Engine. A resource scheduler manages and divides the cluster's total compute, memory and I/O resources according to a set policy. Users submit their jobs to the scheduler through scripts that run binaries, with arguments that describe the resources they are requesting. By default, the scheduler holds each job in a pending state until the requested resources become available.
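The pending behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not Univa Grid Engine code, and the names ToyScheduler and Job, the slots and mem_gb fields, and the strict first-in, first-out policy are all hypothetical choices made for illustration. A real scheduler policy is considerably richer than this.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    slots: int    # CPU slots requested (hypothetical field)
    mem_gb: int   # memory requested in GB (hypothetical field)


@dataclass
class ToyScheduler:
    """Toy model only: tracks free resources and a FIFO pending queue."""
    free_slots: int
    free_mem_gb: int
    pending: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def submit(self, job):
        # Every job enters the pending queue first, then dispatch is attempted.
        self.pending.append(job)
        self._dispatch()

    def finish(self, name):
        # Release the finished job's resources, then retry pending jobs.
        for job in self.running:
            if job.name == name:
                self.running.remove(job)
                self.free_slots += job.slots
                self.free_mem_gb += job.mem_gb
                break
        self._dispatch()

    def _dispatch(self):
        # Assumed policy: strict FIFO, stop at the first job that does not fit.
        while self.pending:
            job = self.pending[0]
            if job.slots > self.free_slots or job.mem_gb > self.free_mem_gb:
                break
            self.pending.popleft()
            self.free_slots -= job.slots
            self.free_mem_gb -= job.mem_gb
            self.running.append(job)


sched = ToyScheduler(free_slots=8, free_mem_gb=32)
sched.submit(Job("align", slots=6, mem_gb=16))     # fits, starts immediately
sched.submit(Job("assemble", slots=4, mem_gb=8))   # only 2 slots free, so it pends
print([j.name for j in sched.running], [j.name for j in sched.pending])
sched.finish("align")                              # frees 6 slots, "assemble" starts
print([j.name for j in sched.running], [j.name for j in sched.pending])
```

In this model the second job waits only because too few slots are free, and it starts as soon as the first job releases its resources. That is the same wait-until-available behaviour the scheduler applies by default, reduced to its simplest form.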

The sections below tie these high-level descriptions to practical management topics.

Cluster Overview

Rocks: Infrastructure and Node Information and Management

IPMI: Out-of-band Node management

UGE: Scheduler Information and Management

Other: User and Group Information and Management