queue_conf(5)

NAME
queue_conf - xxQS_NAMExx queue configuration file format
DESCRIPTION
This manual page describes the format of the template file for the cluster queue configuration. Via the -aq and -mq options of the qconf(1) command, you can add cluster queues and modify the configuration of any queue in the cluster. Any of these change operations can be rejected as a result of a failed integrity verification.
The queue configuration parameters take as values strings, integer decimal numbers, booleans, or time and memory specifiers (see time_specifier and memory_specifier in sge_types(1)), as well as comma-separated lists.
FORMAT
The list of parameters below specifies the queue configuration file content.
For each parameter except qname and hostlist, it is possible to specify host-dependent values instead of a single value. This "enhanced queue configuration specifier syntax" takes the form

parameter default_value,[host_id=value],[host_id=value],...
An entry without brackets is always required as the default setting for all queue instances which don't override it. Tuples with a hostgroup_name as host_id (see host_identifier in sge_types(1)) override the default setting. Tuples with a host_name as host_id override both the default and the host group setting. As an example, PEs with different allocation rules may be specified according to the core count of different node types:
pe_list NONE,[@dual=all-mpi mpi-4],[@quad=all-mpi mpi-8]
The queue configuration is rejected if a default setting is absent.
Ambiguous configurations (those with more than one attribute setting for a particular host) cause the relevant queue instances to go into a "configuration ambiguous" state and not accept jobs. This is reported as state "c" by qstat(1) and may be diagnosed with qstat -explain c. Configurations containing override values for hosts not in the execution host list are accepted as "detached", as indicated by the -sds option of qconf(1).
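As a hypothetical illustration (the host group names @io and @compute and the host node01 are invented), the following slots setting is ambiguous for node01 if that host is a member of both host groups:

slots 4,[@io=2],[@compute=8]

node01 then matches two conflicting override tuples, so its queue instance enters the "c" state until one of the overlapping entries is removed or the host groups are made disjoint.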
qname
The name of the cluster queue in the format for queue_name in sge_types(1). As template default, "template" is used.
hostlist
A list of host identifiers in the format for host_identifier in sge_types(1). For each host, xxQS_NAMExx maintains a queue instance for running jobs on that particular host. Large numbers of hosts can easily be managed by using host groups rather than single host names. Both white space and "," can be used as list separators. (Template default: NONE, i.e. no hosts support the queue.)
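For example (the host group and host names are invented), a queue can span a host group plus individual machines:

hostlist @linux64 node07 node08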
seq_no
In conjunction with the host's load situation at a given time, this parameter specifies this queue's position in the scheduling order within the suitable queues for a job to be dispatched according to the queue_sort_method (see sched_conf(5)).
Regardless of the queue_sort_method setting, qstat(1) reports queue information in the order defined by the value of seq_no. Set this parameter to a monotonically increasing sequence. (Type: number; template default: 0.)
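As an illustration with invented queue names, giving the preferred queue the lower number makes qstat listings, and scheduling with queue_sort_method seqno, consider short.q before long.q:

In short.q:  seq_no 10
In long.q:   seq_no 20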
load_thresholds
load_thresholds is a list of load thresholds. When one of the thresholds is exceeded, no further jobs will be scheduled to the queues, and the overload condition puts the relevant queue instance into the "alarm" state. Arbitrary load values defined in the "host" and "global" complexes (see complex(5) for details) can be used.
The syntax is that of a comma-separated list, with each list element consisting of the complex_name (see complex(5)) of a load value, an equal sign and the threshold value intended to trigger the overload situation (e.g. load_avg=1.75,users_logged_in=5).
Note: Load values as well as consumable resources may be scaled differently for different hosts if specified in the corresponding execution host definitions (refer to host_conf(5) for more information). Load thresholds are compared against the scaled load and consumable values. Boolean complexes can be used to set an alarm state with the value false, typically from a load sensor which checks a host's "health", e.g. load_avg=1.75,health=false.
suspend_thresholds
A list of load thresholds with the same semantics as the load_thresholds parameter (see above), except that exceeding one of these thresholds initiates suspension of one of multiple jobs in the queue. See the nsuspend parameter below for details on the number of jobs which are suspended. There is an important relationship between the suspend_threshold and the scheduler_interval. If you have, for example, a suspend threshold on np_load_avg, and the load exceeds the threshold, this does not have an immediate effect. Jobs continue running until the next scheduling run, in which the scheduler detects that the threshold has been exceeded and sends an order to qmaster to suspend a job. The same applies to unsuspending.
nsuspend
The number of jobs which are suspended/enabled per time interval if at least one of the load thresholds in the suspend_thresholds list is exceeded or if no suspend_threshold is violated anymore, respectively. nsuspend jobs are suspended in each time interval until no suspend_thresholds are exceeded anymore or all jobs in the queue are suspended. Jobs are enabled in the corresponding way if the suspend_thresholds are no longer exceeded. The time interval in which the suspensions of the jobs occur is defined in suspend_interval below.
suspend_interval
The time interval in which further nsuspend jobs are suspended if one of the suspend_thresholds (see above for both) is exceeded by the current load on the host on which the queue is located. The time interval is also used when enabling the jobs. The syntax is that of a time_specifier in sge_types(1).
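A sketch of how the three parameters interact (the threshold and timings are arbitrary): with the settings below, each minute in which np_load_avg stays above 1.5 suspends one further job, and each minute below the threshold resumes one:

suspend_thresholds np_load_avg=1.5
nsuspend           1
suspend_interval   00:01:00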
priority
The priority parameter specifies the nice(2) value at which jobs in this queue will be run. It is of type "number" and the default is zero (which means no nice value is set explicitly). Negative values (down to -20) correspond to a higher scheduling priority; positive values (up to +20) correspond to a lower scheduling priority.
Note that the value of priority has no effect if xxQS_NAMExx adjusts priorities dynamically to implement ticket-based entitlement policy goals. Dynamic priority adjustment is switched off by default because reprioritize is set to false.
min_cpu_interval
The time between two automatic checkpoints in the case of transparently checkpointing jobs. The maximum of the time requested by the user via the -c option of qsub(1) and the time defined by the queue configuration is used as the checkpoint interval. Since checkpoint files may be quite large, and thus writing them to the file system may become expensive, users and administrators are advised to choose sufficiently large time intervals. min_cpu_interval is of type "time" and the default is 5 minutes (which usually is suitable for test purposes only). The syntax is that of a time_specifier in sge_types(1).
qtype
The type of queue. Currently BATCH, INTERACTIVE, a comma-separated combination of both, or NONE.
Jobs submitted with option -now y can only be scheduled on interactive queues, and -now n targets batch queues. -now y is the default for qsh, qrsh, and qlogin, while -now n is the default for qsub. Nevertheless, the option can be applied to all commands, with either argument, to direct jobs to specific queue types.
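For instance (job.sh is a placeholder script):

# force immediate dispatch to an INTERACTIVE queue; the job is rejected if none is free
qsub -now y job.sh

# let a qrsh session queue up and run through a BATCH queue
qrsh -now n ./job.sh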
The formerly supported types PARALLEL and CHECKPOINTING are not allowed anymore. A queue instance is implicitly of type parallel/checkpointing if there is a parallel environment or a checkpointing interface specified for this queue instance in pe_list/ckpt_list. Formerly possible settings, e.g.

qtype PARALLEL

could be changed to

qtype NONE
pe_list pe_name
pe_list
The list of administrator-defined parallel environment names (see sge_pe(5)) to be associated with the queue. The default is NONE.
ckpt_list
The list of administrator-defined checkpointing interface names (see ckpt_name in checkpoint(5)) to be associated with the queue. The default is NONE.
rerun
Defines a default behavior for jobs which are aborted by system crashes or manual "violent" (via kill(1)) shutdown of the complete xxQS_NAMExx system (including the shepherd of the jobs and their process hierarchy) on the queue host. As soon as the execution daemon is restarted and detects that a job has been aborted for such reasons, it can be restarted if the job is restartable. A job may not be restartable, for example, if it updates databases (first reads, then writes to the same record of a database/file), because aborting the job may have left the database in an inconsistent state. If the owner of a job wants to overrule the default behavior for the jobs in the queue, the -r option of qsub(1) can be used.
slots
The maximum number of slots that may be scheduled concurrently in instances of the queue. Type is number; valid values are 0 to 9999999.
If there are multiple queues defined on a host and they are not mutually suspendable, the slots value on the host (the slots consumable in the execution host's complex_values) should be set to the processor count of the host if you want to avoid potential over-subscription due to scheduling to more than one queue at a time.
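For example (host group names invented), the bracketed override syntax described above lets the slot count track the core count of each node type:

slots 2,[@quadcore=4],[@octocore=8]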
tmpdir
The tmpdir parameter specifies the absolute path to the base of the temporary directory filesystem. When the execution daemon launches a job, it creates a uniquely-named directory in this filesystem for the purpose of holding scratch files during job execution. At job completion, this directory and its contents are removed automatically. The environment variables TMPDIR and TMP are set to the path of each job's scratch directory. (Type string; default: /tmp.)
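A job script can rely on this per-job scratch directory; a minimal sketch (input.dat and my_program are placeholders):

#!/bin/sh
# stage data into the scratch directory provided by the queue
cp input.dat "$TMPDIR"/
cd "$TMPDIR"
./my_program input.dat
# anything left in $TMPDIR is removed automatically at job end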
shell
If either posix_compliant or script_from_stdin is specified as the shell_start_mode parameter (see shell_start_mode below), the shell parameter specifies the executable path of the command interpreter (e.g. sh(1) or csh(1)) to be used to process the job scripts executed in the queue. The definition of shell can be overruled by the job owner via the -S option of qsub(1).
shell_start_mode
This parameter defines the mechanisms which are used to actually invoke the job scripts on the execution hosts. The following values are recognized:
- unix_behavior: If a user starts a job shell script under UNIX interactively by invoking it just with the script name, the operating system's executable loader uses the information provided in a comment such as `#!/bin/csh' in the first line of the script to detect which command interpreter to start to interpret the script. This mechanism is used by xxQS_NAMExx when starting jobs if unix_behavior is defined as shell_start_mode.
- posix_compliant: POSIX does not consider first script line comments such as `#!/bin/csh' significant. The POSIX standard for batch queuing systems (P1003.2d) therefore requires a compliant queuing system to ignore such lines and to use user-specified or configured default command interpreters instead. Thus, if shell_start_mode is set to posix_compliant, xxQS_NAMExx will either use the command interpreter indicated by the -S option of qsub(1) or the shell parameter of the queue to be used (see above).
Setting the shell_start_mode parameter either to posix_compliant or unix_behavior requires you to set the umask in use for the execution daemon such that every user has read access to the active_jobs directory in the spool directory of the corresponding execution daemon. In case you have prolog and epilog scripts configured, they also need to be readable by any user who may execute jobs.
If this violates your site's security policies, you may want to set shell_start_mode to script_from_stdin. This will force xxQS_NAMExx to open the job script, as well as the prolog and epilog scripts, for reading into STDIN as root (if the execution daemon was started as root) before changing to the job owner's user account. The script is then fed into the STDIN stream of the command interpreter indicated by the -S option of qsub(1) or the shell parameter of the queue to be used (see above).
Thus setting shell_start_mode to script_from_stdin also implies posix_compliant behavior. Note, however, that feeding scripts into the STDIN stream of a command interpreter may cause trouble if commands like rsh(1) are invoked inside a job script, as they also process the STDIN stream of the command interpreter. These problems can usually be resolved by redirecting the STDIN channel of those commands to come from /dev/null (e.g. rsh host date < /dev/null). Note also that any command-line options associated with the job are passed to the executing shell. The shell will only forward them to the job if they are not recognized as valid shell options.
prolog
This queue configuration entry overrides cluster-global or execution-host-specific prolog definitions (see sge_conf(5)).
epilog
This queue configuration entry overrides cluster-global or execution-host-specific epilog definitions (see sge_conf(5)).
starter_method
The specified executable path will be used as a job starter facility responsible for starting batch jobs instead of the built-in starter (which typically passes the job to a shell). The starter method is passed the command to run as its arguments. This is typically the name of a copy of the batch script file, followed by any arguments supplied at job submission. However, depending on how the job was submitted, it might be a binary (with arguments), or a more general shell command line, e.g. supplied to qrsh. The following environment variables are used to pass information to the job starter concerning the shell environment which was configured or requested to start the job.
- SGE_STARTER_SHELL_PATH: The name of the requested shell to start the job.
- SGE_STARTER_SHELL_START_MODE: The configured shell_start_mode.
- SGE_STARTER_USE_LOGIN_SHELL: Set to "true" if the shell is supposed to be used as a login shell (see login_shells in sge_conf(5)).
Ignoring those, a trivial starter method could be
#!/bin/sh
# set the environment somehow
exec "$@"

It is, at best, tricky to write a proper substitute for the built-in method as a shell script, taking account of the variables above. It is probably best to do so in a non-macro-expanded scripting language (or a compiled language, as appropriate).
The starter_method will not be invoked for qsh, qlogin, or qrsh acting as rlogin.
suspend_method
resume_method
terminate_method
These parameters can be used for overriding the default method used by xxQS_NAMExx for suspension, release of a suspension, and termination of a job. By default, the signals SIGSTOP, SIGCONT and SIGKILL are delivered to the job to perform these actions. However, for some applications this is not appropriate.
If no executable path is given, xxQS_NAMExx takes the specified parameter entries as the signal to be delivered instead of the default signal. A signal must be either a positive number or a signal name with the SIG prefix, as printed by kill -l (e.g. SIGTERM).
If an executable path is given (it must be an absolute path, starting with a "/"), then this command, together with its arguments, is started by xxQS_NAMExx to perform the appropriate action. The following special variables are expanded at runtime, and can be used (besides any other strings which have to be interpreted by the procedures) to compose a command line:
- $host: The name of the host on which the procedure is started.
- $ja_task_id: The array job task index (0 if not an array job).
- $job_owner: The user name of the job owner.
- $job_id: xxQS_NAMExx's unique job identification number.
- $job_name: The name of the job.
- $queue: The name of the queue.
- $job_pid: The pid of the job.
- $sge_cell: The SGE_CELL environment variable (useful for locating files).
- $sge_root: The SGE_ROOT environment variable (useful for locating files).
Note that a method is only executed on the master node of a parallel job, so it may be necessary to propagate the necessary action to slave nodes explicitly. (However, MPI implementations may, for instance, respond to SIGTSTP sent to the master process by stopping all the distributed processes.) If an executable is used for a method, it is started in the same environment as the job concerned.
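As a hedged sketch (the script path is invented; the variables are those listed above), a custom suspend method could signal the job's whole process group instead of a single process:

suspend_method /usr/local/sbin/suspend_job.sh $job_pid $job_id

where /usr/local/sbin/suspend_job.sh could look like:

#!/bin/sh
# $1 = pid of the job, $2 = job id (used for logging only)
logger "suspending job $2 (pid group $1)"
# signal the whole process group so child processes stop as well
kill -s TSTP -- "-$1"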
notify
The time to wait between delivery of SIGUSR1/SIGUSR2 notification signals and suspend/kill signals if the job was submitted with the -notify option of qsub(1).
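For example, with the queue setting below, a job submitted with qsub -notify receives SIGUSR2 one minute before the kill signal (job.sh is a placeholder):

notify 00:01:00

qsub -notify job.sh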
owner_list
The owner_list comprises comma-separated login(1) user names (see user_name in sge_types(1)) of those users who are authorized to disable and suspend this queue through qmod(1). (xxQS_NAMExx operators and managers can do this by default.) It is customary to set this field for queues on interactive workstations where the computing resources are shared between interactive sessions and xxQS_NAMExx jobs, allowing the workstation owner to have priority access. Owners can be managers, operators, or users. Owner privileges are necessary to use qidle. (Default: NONE.)
user_lists
The user_lists parameter contains a comma-separated list of xxQS_NAMExx user access list names as described in access_list(5). Each user contained in at least one of the given access lists has access to the queue. If the user_lists parameter is set to NONE (the default), any user has access who is not explicitly excluded via the xuser_lists parameter described below. If a user is contained both in an access list in xuser_lists and in user_lists, the user is denied access to the queue.
xuser_lists
The xuser_lists parameter contains a comma-separated list of xxQS_NAMExx user access list names as described in access_list(5). Each user contained in at least one of the given access lists is not allowed to access the queue. If the xuser_lists parameter is set to NONE (the default), any user has access. If a user is contained both in an access list in xuser_lists and in user_lists, the user is denied access to the queue.
projects
The projects parameter contains a comma-separated list of xxQS_NAMExx projects (see project(5)) that have access to the queue. Any project not in this list is denied access to the queue. If set to NONE (the default), any project has access that is not specifically excluded via the xprojects parameter described below. If a project is in both the projects and xprojects parameters, the project is denied access to the queue.
xprojects
The xprojects parameter contains a comma-separated list of xxQS_NAMExx projects (see project(5)) that are denied access to the queue. If set to NONE (the default), no projects are denied access other than those denied access based on the projects parameter described above. If a project is in both the projects and xprojects parameters, the project is denied access to the queue.
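A hedged illustration (the access list names staff and guests are invented and would have to be created first, see access_list(5)):

user_lists  staff
xuser_lists guests

Only members of staff may then use the queue, and a user who is in both lists is rejected, since exclusion always wins.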
subordinate_list
There are two different types of subordination:
1. Queuewise subordination
A list of xxQS_NAMExx queue names in the format for queue_name in sge_types(1). Subordinate relationships are in effect only between queue instances residing on the same host. The relationship does not apply and is ignored when jobs are running in queue instances on other hosts. Subordinated queue instances residing on the same host are suspended when a specified count of jobs is running in this queue instance. The list specification is the same as that of the load_thresholds parameter above, e.g. low_pri_q=5,small_q. The numbers denote the job slots of the queue that have to be filled in the superordinated queue to trigger the suspension of the subordinated queue. If no value is assigned, a suspension is triggered if all slots of the queue are filled.
On nodes which host more than one queue, you might wish to accord better service to certain classes of jobs (e.g., queues that are dedicated to parallel processing might need priority over low priority production queues). The default is NONE.
2. Slotwise preemption
Slotwise preemption provides a means to ensure that high priority jobs get the resources they need, while at the same time low priority jobs on the same host are not unnecessarily preempted, maximizing host utilization. Slotwise preemption is designed to provide different preemption actions, but with the current implementation only suspension is provided. This means a subordination relationship is defined between queues, similar to queue-wise subordination, but when the suspend threshold is exceeded, the whole subordinated queue is not suspended; only single tasks running in single slots are.
As with queue-wise subordination, the subordination relationships are in effect only between queue instances residing at the same host. The relationship does not apply and is ignored when jobs and tasks are running in queue instances on other hosts.
The syntax is:

slots=threshold(queue_list)

where
- threshold = a positive integer number
- queue_list = queue_def[,queue_list]
- queue_def = queue[:seq_no][:action]
- queue = a xxQS_NAMExx queue name in the format for queue_name in sge_types(1)
- seq_no = sequence number among all subordinated queues of the same depth in the tree. The higher the sequence number, the lower is the priority of the queue. Default is 0, which is the highest priority.
- action = the action to be taken if the threshold is exceeded. Supported are:
  "sr": Suspend the task with the shortest run time.
  "lr": Suspend the task with the longest run time.
  Default is "sr".
Some examples of possible configurations and their functionalities:
a) The simplest configuration

subordinate_list slots=2(B.q)

which means the queue "B.q" is subordinated to the current queue (let's call it "A.q"), the suspend threshold for all tasks running in "A.q" and "B.q" on the current host is two, the sequence number of "B.q" is "0" and the action is "suspend task with shortest run time first". This subordination relationship looks like this:
A.q
 |
B.q
This could be a typical configuration for a host with a dual core CPU. This subordination configuration ensures that tasks that are scheduled to "A.q" always get a CPU core for themselves, while jobs in "B.q" are not preempted as long as there are no jobs running in "A.q".
If there is no task running in "A.q", two tasks are running in "B.q" and a new task is scheduled to "A.q", the sum of tasks running in "A.q" and "B.q" is three. Three is greater than two, so this triggers the defined action. This causes the task with the shortest run time in the subordinated queue "B.q" to be suspended. After suspension, there is one task running in "A.q", one task running in "B.q", and one task suspended in "B.q".
b) A simple tree
subordinate_list slots=2(B.q:1, C.q:2)
This defines a small tree that looks like this:
   A.q
   /  \
 B.q   C.q
A use case for this configuration could be a host with a dual core CPU and queue "B.q" and "C.q" for jobs with different requirements, e.g. "B.q" for interactive jobs, "C.q" for batch jobs. Again, the tasks in "A.q" always get a CPU core, while tasks in "B.q" and "C.q" are suspended only if the threshold of running tasks is exceeded. Here the sequence number among the queues of the same depth comes into play. Tasks scheduled to "B.q" can't directly trigger the suspension of tasks in "C.q", but if there is a task to be suspended, first "C.q" will be searched for a suitable task.
If there is one task running in "A.q", one in "C.q" and a new task is scheduled to "B.q", the threshold of "2" in "A.q", "B.q" and "C.q" is exceeded. This triggers the suspension of one task in either "B.q" or "C.q". The sequence number gives "B.q" a higher priority than "C.q", therefore the task in "C.q" is suspended. After suspension, there is one task running in "A.q", one task running in "B.q" and one task suspended in "C.q".
c) More than two levels
Configuration of A.q: subordinate_list slots=2(B.q)
Configuration of B.q: subordinate_list slots=2(C.q)
The resulting tree looks like this:
A.q
 |
B.q
 |
C.q
These are three queues with high, medium and low priority. If a task is scheduled to "C.q", first the subtree consisting of "B.q" and "C.q" is checked and the number of tasks running there is counted. If the threshold defined in "B.q" is exceeded, the job in "C.q" is suspended. Then the whole tree is checked; if the number of tasks running in "A.q", "B.q" and "C.q" exceeds the threshold defined in "A.q", the task in "C.q" is suspended. This means the effective threshold of any subtree is never higher than the threshold of the root node of the tree. If in this example a task is scheduled to "A.q", the number of tasks running in "A.q", "B.q" and "C.q" is immediately checked against the threshold defined in "A.q".
d) Any tree
      A.q
     /   \
   B.q   C.q
   /     /  \
 D.q   E.q   F.q
               \
               G.q
The computation of the tasks that are to be (un)suspended always starts at the queue instance that is modified, i.e. the instance a task is scheduled to, a task ends at, whose configuration is modified, or for which a manual or other automatic (un)suspend is issued. If that instance is a leaf node, like "D.q", "E.q" and "G.q" in this example, the computation starts at its parent queue instance instead (like "B.q", "C.q" or "F.q" in this example). From there, first all running tasks in the whole subtree of this queue instance are counted. If the sum exceeds the threshold configured in the subordinate_list, a task to be suspended is sought within this subtree. Then the algorithm proceeds to the parent of this queue instance, counts all running tasks in the whole subtree below the parent, and checks whether the number exceeds the threshold configured in the parent's subordinate_list. If so, it searches for a task to suspend in the whole subtree below the parent, and so on, until it has done this computation for the root node of the tree.
complex_values
complex_values defines quotas for resource attributes managed via this queue. The syntax is the same as for load_thresholds (see above). The quotas are related to the resource consumption of all jobs in a queue in the case of consumable resources (see complex(5) for details on consumable resources), or they are interpreted on a per-queue-slot (see slots above) basis in the case of non-consumable resources. Consumable resource attributes are commonly used to manage free memory, free disk space or available floating software licenses, while non-consumable attributes usually define distinctive characteristics, like the type of hardware installed.
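A hedged example (it assumes the administrator has already defined softlic as a consumable INT attribute and gpu_arch as a non-consumable STRING attribute in the complex configuration):

complex_values softlic=10,gpu_arch=ampere

At most 10 softlic tokens are then handed out across all jobs running in the queue, while gpu_arch is compared per queue slot against each job's -l gpu_arch=... request.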
For consumable resource attributes an available resource amount is determined by subtracting the current resource consumption of all running jobs in the queue from the quota in the complex_values list. Jobs can only be dispatched to a queue if no resource requests exceed any corresponding resource availability obtained by this scheme. The quota definition in the complex_values list is automatically replaced by the current load value reported for this attribute if load is monitored for this resource and if the reported load value is more stringent than the quota. This effectively avoids oversubscription of resources.
Note: Load values replacing the quota specifications may have become more stringent because they have been scaled (see host_conf(5)) and/or load adjusted (see sched_conf(5)). The -F option of qstat(1) and the load display in the queue control dialog (activated by clicking on a queue icon while the "Shift" key is pressed) provide detailed information on the actual availability of consumable resources and on the origin of the values currently taken into account.
Note also: The resource consumption of running jobs (used for the availability calculation) as well as the resource requests of the jobs waiting to be dispatched may be derived either from explicit user requests during job submission (see the -l option to qsub(1)) or from a "default" value configured for an attribute by the administrator (see complex(5)). The -r option to qstat(1) can be used for retrieving full detail on the actual resource requests of all jobs in the system.
For non-consumable resources, xxQS_NAMExx simply compares the job's attribute requests with the corresponding specification in complex_values, taking the relation operator of the complex attribute definition into account (see complex(5)). If the result of the comparison is "true", the queue is suitable for the job with respect to the particular attribute. For parallel jobs, each queue slot to be occupied by a parallel task is meant to provide the same resource attribute value.
Note: Only numeric complex attributes can be defined as consumable resources, hence non-numeric attributes are always handled on a per queue slot basis.
calendar
Specifies the calendar to be valid for this queue, or contains NONE (the default). A calendar defines the availability of a queue depending on time of day, week and year. Please refer to calendar_conf(5) for details on the xxQS_NAMExx calendar facility.
initial_state
Defines an initial state for the queue, either when adding the queue to the system for the first time or on start-up of the execution daemon on the host on which the queue resides. Possible values are:
- default: The queue is enabled when adding the queue, or is reset to the previous status when the execution daemon comes up (this corresponds to the behavior in earlier xxQS_NAMExx releases not supporting initial_state).
- enabled: The queue is enabled in either case. This is equivalent to a manual and explicit 'qmod -e' command (see qmod(1)).
- disabled: The queue is disabled in either case. This is equivalent to a manual and explicit 'qmod -d' command (see qmod(1)).
RESOURCE LIMITS
The first two resource limit parameters, s_rt and h_rt, are implemented by xxQS_NAMExx. They define the "real time" (also called "elapsed" or "wall clock" time) passed since the start of the job. If h_rt is exceeded by a job running in the queue, it is aborted via the SIGKILL signal (see kill(1)). If s_rt is exceeded, the job is first "warned" via the SIGUSR1 signal (which can be caught by the job) and finally aborted after the notification time defined in the queue configuration parameter notify (see above) has passed. In cases when s_rt is used in combination with job notification, it might be necessary to configure a signal other than SIGUSR1 using the NOTIFY_KILL and NOTIFY_SUSP execd_params (see sge_conf(5)) so that the jobs' signal-catching mechanism can differentiate the cases and react accordingly.
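For example (the limit values are arbitrary), a queue might warn jobs ten minutes before the wall clock limit:

s_rt 71:50:00
h_rt 72:00:00

and a job script can trap the warning signal to shut down cleanly; a minimal sketch:

#!/bin/sh
# save state and exit cleanly when the s_rt warning (SIGUSR1) arrives
trap 'echo "s_rt reached, saving state"; exit 0' USR1
# ... long-running work ...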
The resource limit parameters s_cpu and h_cpu are implemented by xxQS_NAMExx as a job limit. They impose a limit on the amount of combined CPU time consumed by all the processes in the job. If h_cpu is exceeded by a job running in the queue, it is aborted via a SIGKILL signal (see kill(1)). If s_cpu is exceeded, the job is sent a SIGXCPU signal which can be caught by the job. If you wish to allow a job to be "warned" so it can exit gracefully before it is killed, then you should set the s_cpu limit to a lower value than h_cpu. For parallel processes, the limit is applied per slot, which means that the limit is multiplied by the number of slots being used by the job before being applied.
The resource limit parameters s_vmem and h_vmem are implemented by xxQS_NAMExx as a job limit. They impose a limit on the amount of combined virtual memory consumed by all the processes in the job. If h_vmem is exceeded by a job running in the queue, it is aborted via a SIGKILL signal (see kill(1)). If s_vmem is exceeded, the job is sent a SIGXCPU signal which can be caught by the job. If you wish to allow a job to be "warned" so it can exit gracefully before it is killed, then you should set the s_vmem limit to a lower value than h_vmem. For parallel processes, the limit is applied per slot which means that the limit is multiplied by the number of slots being used by the job before being applied.
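For instance (values arbitrary), setting the soft limit slightly below the hard limit gives jobs a chance to react to the warning signal before SIGKILL arrives:

s_vmem 3900M
h_vmem 4G

Note that for a parallel job these are per-slot figures: a job occupying 4 slots may consume up to 16G in total before h_vmem triggers.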
The remaining parameters in the queue configuration template specify per-job soft and hard resource limits as implemented by the setrlimit(2) system call. See this manual page on your system for more information. By default, each limit field is set to infinity (which means RLIM_INFINITY as described in the setrlimit(2) manual page). The value type for the CPU-time limits s_cpu and h_cpu is time. The value type for the other limits is memory. Note: Not all systems support setrlimit(2).
Note also: s_vmem and h_vmem (virtual memory) are only available on systems supporting RLIMIT_VMEM (see setrlimit(2) on your operating system).
SECURITY
See sge_conf(5) for security considerations when specifying prolog and epilog with a user@ prefix.
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.