sge_shepherd (8) - Linux Manuals

NAME

xxqs_name_sxx_shepherd - xxQS_NAMExx single job-controlling agent

SYNOPSIS

xxqs_name_sxx_shepherd

DESCRIPTION

xxqs_name_sxx_shepherd provides the parent process functionality for a single xxQS_NAMExx job. The parent functionality is necessary on UNIX systems to retrieve resource usage information after a job has finished. In addition, xxqs_name_sxx_shepherd forwards signals to the job, such as the signals for suspension, enabling, and termination, and the xxQS_NAMExx checkpointing signal.
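The parent-process role described above can be illustrated with a minimal Python sketch. It is not the shepherd's implementation: the function name run_and_measure and the choice to forward only SIGTERM are assumptions made for illustration. It shows why a parent is needed: only the process that waits for the child receives its resource usage record.

```python
import os
import signal


def run_and_measure(argv):
    """Fork a child running argv, wait for it, return (exit_code, rusage).

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the job command.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # reached only if exec failed

    # Parent: forward SIGTERM to the child while it runs (assumption:
    # a single signal stands in for the set the shepherd forwards).
    previous = signal.signal(
        signal.SIGTERM, lambda signum, frame: os.kill(pid, signum)
    )
    try:
        # wait4 returns the child's status together with its resource usage.
        _, status, usage = os.wait4(pid, 0)
    finally:
        signal.signal(signal.SIGTERM, previous)

    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        code = -os.WTERMSIG(status)  # negative value: killed by a signal
    return code, usage
```

The returned usage object carries fields such as ru_utime and ru_maxrss, which is the kind of per-job accounting data a parent can collect only after the child has terminated.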

The xxqs_name_sxx_shepherd receives information about the job to be started from xxqs_name_sxx_execd. During the execution of the job it starts up to 5 child processes. First, a prolog script is run if this feature is enabled by the prolog parameter in the cluster configuration. Next, a parallel environment startup procedure is run if the job is a parallel job. After that, the job itself is run, followed by a parallel environment shutdown procedure for parallel jobs, and finally an epilog script if requested by the epilog parameter in the cluster configuration. The prolog and epilog scripts, as well as the parallel environment startup and shutdown procedures, are to be provided by the xxQS_NAMExx administrator and are intended for site-specific actions to be taken before and after execution of the actual user job.
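The ordering of the up to five child processes can be sketched as follows. This is an illustration in Python, not the shepherd's code: the function name run_job_steps and its parameters are hypothetical, and stopping after a failed prolog or parallel start is an assumption inferred from the exit value rules, not something stated here.

```python
import subprocess


def run_job_steps(job_cmd, prolog=None, pe_start=None, pe_stop=None, epilog=None):
    """Run the configured steps in order; return a list of (method, status).

    Each command is an argument list; steps set to None are skipped.
    """
    steps = [
        ("prolog", prolog),
        ("parallel start", pe_start),
        ("job", job_cmd),
        ("parallel stop", pe_stop),
        ("epilog", epilog),
    ]
    results = []
    for method, cmd in steps:
        if cmd is None:
            continue  # feature not enabled for this job
        status = subprocess.run(cmd).returncode
        results.append((method, status))
        # Assumption: a failing pre-job step prevents the job from starting.
        if status != 0 and method in ("prolog", "parallel start"):
            break
    return results
```

A serial job with only a prolog configured would yield two entries, one for the prolog and one for the job; a parallel job with every step configured yields all five.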

After the job has finished and the epilog script is processed, xxqs_name_sxx_shepherd retrieves resource usage statistics about the job, places them in a job-specific subdirectory of the spool directory for reporting through xxqs_name_sxx_execd, and finishes.

xxqs_name_sxx_shepherd also places an exit status file in the spool directory. This exit status can be viewed with qacct -j JobId. It is not the exit status of xxqs_name_sxx_shepherd itself but of one of the methods executed by xxqs_name_sxx_shepherd, and its meaning depends on the method in which an error occurred (if any). The possible methods are: prolog, parallel start, job, parallel stop, epilog, suspend, restart, terminate, clean, migrate, and checkpoint.

The following exit values are returned:

0
All methods: Operation was executed successfully.
99
Job script, prolog and epilog: When FORBID_RESCHEDULE is not set in the configuration, the job gets re-queued. Otherwise see "Other".
100
Job script, prolog and epilog: When FORBID_APPERROR is not set in the configuration, the job gets re-queued. Otherwise see "Other".
Other
Job script: This is the exit status of the job itself. No action is taken upon this exit status because the meaning of this exit status is not known.
Prolog, epilog and parallel start: The queue is set to error state and the job is re-queued.
Parallel stop: The queue is set to error state, but the job is not re-queued. It is assumed that the job itself ran successfully and only the clean up script failed.
Suspend, restart, terminate, clean, and migrate: Always successful.
Checkpoint: Treated as success, except for kernel checkpointing, where it means the checkpoint was not successful and did not happen (but migration will still happen).
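
The exit value rules above can be condensed into a small lookup function. This is a sketch in Python for reading convenience only: the function name interpret_exit_status, its parameters, and the returned description strings are hypothetical, and the two boolean flags stand in for the FORBID_RESCHEDULE and FORBID_APPERROR configuration settings.

```python
def interpret_exit_status(method, status,
                          forbid_reschedule=False, forbid_apperror=False):
    """Return a short description of the action taken for an exit status."""
    if status == 0:
        return "success"  # applies to all methods

    if method in ("job", "prolog", "epilog"):
        if status == 99 and not forbid_reschedule:
            return "job re-queued"
        if status == 100 and not forbid_apperror:
            return "job re-queued"

    # Everything below corresponds to the "Other" entry.
    if method == "job":
        return "no action (exit status of the job itself)"
    if method in ("prolog", "epilog", "parallel start"):
        return "queue set to error state, job re-queued"
    if method == "parallel stop":
        return "queue set to error state, job not re-queued"
    if method in ("suspend", "restart", "terminate", "clean", "migrate"):
        return "success"
    if method == "checkpoint":
        return "success (kernel checkpointing: checkpoint did not happen)"
    raise ValueError("unknown method: %r" % (method,))
```

For example, a prolog that exits with 99 leads to the job being re-queued unless FORBID_RESCHEDULE is set, in which case the "Other" rule for prolog applies and the queue is also put into error state.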

For the meaning of the return codes of the shepherd itself (which are interpreted by see

RESTRICTIONS

xxqs_name_sxx_shepherd should not be invoked manually, but only by xxqs_name_sxx_execd.

ENVIRONMENT VARIABLES

xxQS_NAME_Sxx_ROOT
Specifies the location of the xxQS_NAMExx standard configuration files.
xxQS_NAME_Sxx_CELL
If set, specifies the default xxQS_NAMExx cell. To address a xxQS_NAMExx cell xxqs_name_sxx_execd uses (in the order of precedence):

The name of the cell specified in the environment variable xxQS_NAME_Sxx_CELL, if it is set.

The name of the default cell, i.e. default.

SGE_ENABLE_COREDUMP
If set, enable core dumps on Linux when the admin_user is not root. Linux normally disables core dumps when the daemon has changed uid or gid. Setting SGE_ENABLE_COREDUMP in xxqs_name_sxx_execd's environment defeats that to enable core dumps for debugging if they are otherwise allowed. This is typically not a big hazard with xxQS_NAME_Sxx, since most information is exposed in the spool area anyhow. Dumps will appear in the qmaster spool directory, which need not be world-readable.
On Solaris, may be used to enable such dumps.
SGE_CGROUP_DIR
If Linux cgroups handling is enabled, this variable names a directory under the cgroup mount point in which to create job-specific directories. The default is sge.SGE_CELL so, for instance, the cpuset cgroup for a job might be /sys/fs/cgroup/cpuset/sge.default/123.
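
The cell lookup order and the default cgroup directory name can be sketched together. This Python fragment is illustrative only: the function names resolve_cell and job_cgroup_dir are hypothetical, treating an empty SGE_CELL as unset is an assumption, and the path layout (mount point, controller, directory, job id) is inferred from the single example path given above.

```python
import os
import posixpath


def resolve_cell(environ=None):
    """Return the cell name: SGE_CELL if set, otherwise "default"."""
    env = os.environ if environ is None else environ
    # Assumption: an empty value is treated the same as an unset variable.
    return env.get("SGE_CELL") or "default"


def job_cgroup_dir(job_id, environ=None, controller="cpuset",
                   mount_point="/sys/fs/cgroup"):
    """Build the job-specific cgroup directory for one controller."""
    env = os.environ if environ is None else environ
    # SGE_CGROUP_DIR overrides the default name sge.<cell>.
    directory = env.get("SGE_CGROUP_DIR") or "sge." + resolve_cell(env)
    return posixpath.join(mount_point, controller, directory, str(job_id))
```

With neither variable set, job_cgroup_dir(123, environ={}) yields /sys/fs/cgroup/cpuset/sge.default/123, matching the example in the SGE_CGROUP_DIR description.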

FILES

sgepasswd contains a list of user names and their corresponding encrypted passwords. If available, the password file will be used by sge_shepherd. To change the contents of this file please use the sgepasswd command. It is not advised to change that file manually.
<execd_spool>/job_dir/<job_id>     job specific directory
<xxqs_name_sxx_root>/<cell>/common/sgepasswd
                                   Password information used on Microsoft Windows hosts.

COPYRIGHT

See for a full statement of rights and permissions.

SEE ALSO