psij.executors.batch package

Submodules

psij.executors.batch.batch_scheduler_executor module

class BatchSchedulerExecutor(url=None, config=None)[source]

Bases: JobExecutor

A base class for batch scheduler executors.

This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.

The submit script is generated by generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. Concrete implementations of a batch scheduler executor can instantiate this class and delegate submit script generation to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an opened file to which the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:

{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}

The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.

The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.

The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.

The submit script must perform two essential actions:

1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.

2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.
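
The two actions above can be sketched as a small script writer. This is only an illustration under stated assumptions: write_submit_script and the SCHED_JOB_ID environment variable are hypothetical names (each scheduler exposes the native id in its own way), and the launch command is assumed to be a list of strings.

```python
import io
import shlex
from typing import Any, Dict, TextIO


def write_submit_script(context: Dict[str, Any], out: TextIO,
                        job_id_var: str = "SCHED_JOB_ID") -> None:
    """Write a minimal bash submit script to `out`.

    `job_id_var` is a hypothetical environment variable assumed to hold the
    native id at run time; real schedulers each use their own variable.
    """
    psij = context["psij"]
    script_dir = shlex.quote(str(psij["script_dir"]))
    # Assumption: the launch command is a list of strings.
    command = shlex.join(psij["launch_command"])
    prefix = f'{script_dir}/"${{{job_id_var}}}"'
    out.write("#!/bin/bash\n")
    # Action 1: send the output of the executable part to <native_id>.out.
    out.write(f"{command} > {prefix}.out 2>&1\n")
    # Action 2: store the exit code of the launch command in <native_id>.ec.
    out.write(f"echo $? > {prefix}.ec\n")
```

Writing to an io.StringIO instead of a real file is a convenient way to inspect the generated text.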

Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.

The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, the job is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
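
The handling of a job that has disappeared from the queue can be sketched roughly as follows. The function name, the plain-string states, and the "UNKNOWN" fallback for a missing exit code file are assumptions made for illustration; the actual executor works with JobStatus objects.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_unlisted_job(script_dir: Path,
                         native_id: str) -> Tuple[str, Optional[str]]:
    """Decide the final state of a job that no longer appears in the queue.

    Returns a (state, message) pair; the state names are plain strings used
    for illustration rather than PSI/J JobStatus objects.
    """
    ec_file = script_dir / f"{native_id}.ec"
    out_file = script_dir / f"{native_id}.out"
    if not ec_file.exists():
        # Assumption: without an exit code file the outcome is unknown.
        return "UNKNOWN", None
    exit_code = int(ec_file.read_text().strip())
    if exit_code == 0:
        return "COMPLETED", None
    # Non-zero exit code: try to recover an error message from the output file.
    message = out_file.read_text().strip() if out_file.exists() else None
    return "FAILED", message
```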

Initializes a BatchSchedulerExecutor.

Parameters
attach(job, native_id)[source]

Attaches a job to a native job.

Attempts to connect job to a native job with native_id such that the job correctly reflects updates to the status of the native job. If the native job was previously submitted using this executor (hence having an exit code file and a script output file), the executor will attempt to retrieve the exit code and errors from the job. Otherwise, it may be impossible for the executor to distinguish between a failed and successfully completed job.

Parameters
  • job (Job) – The PSI/J job to attach.

  • native_id (str) – The id of the batch scheduler job to attach to.

Return type

None

cancel(job)[source]

Cancels a job if it has not otherwise completed.

A command is constructed using get_cancel_command() and executed in order to cancel the job. Also see JobExecutor.cancel().

Parameters

job (Job) –

Return type

None

abstract generate_submit_script(job, context, submit_file)[source]

Called to generate a submit script for a job.

Concrete implementations of batch scheduler executors must override this method in order to generate a submit script for a job.

Parameters
  • job (Job) – The job to be submitted.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see the description of this class.

  • submit_file (TextIO) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

abstract get_cancel_command(native_id)[source]

Constructs a command to cancel a batch scheduler job.

Concrete implementations of batch scheduler executors must override this method.

Parameters

native_id (str) – The native id of the job being cancelled.

Returns

A list of strings representing the command and arguments to execute in order to cancel the job, such as ['qdel', native_id].

Return type

List[str]

abstract get_status_command(native_ids)[source]

Constructs a command to retrieve the status of a list of jobs.

Concrete implementations of batch scheduler executors must override this method. In order to prevent overloading the queueing system, concrete implementations are strongly encouraged to return a command that only queries for the status of the indicated jobs. The command returned by this method should produce an output that is understood by parse_status_output().

Parameters
  • native_ids (Collection[str]) – A collection of native ids corresponding to the jobs whose status is sought.

Returns

A list of strings representing the command and arguments to execute in order to get the status of the jobs.

Return type

List[str]
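
As a minimal sketch of a status command restricted to the requested jobs, assuming a made-up scheduler: the "xstat" command and its "--jobs" option are hypothetical placeholders, not the interface of any real queuing system.

```python
from typing import Collection, List


def build_status_command(native_ids: Collection[str]) -> List[str]:
    """Build a status command that queries only the given native ids."""
    # "xstat" and "--jobs" are hypothetical; substitute the real scheduler's
    # status command and its option for selecting jobs by id.
    return ["xstat", "--jobs", ",".join(sorted(native_ids))]
```

Sorting the ids keeps the command deterministic when the collection is a set.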

abstract get_submit_command(job, submit_file_path)[source]

Constructs a command to submit a job to a batch scheduler.

Concrete implementations of batch scheduler executors must override this method.

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Returns

A list of strings representing the command and arguments to execute in order to submit the job, such as ['qsub', str(submit_file_path)].

Return type

List[str]

abstract job_id_from_submit_output(out)[source]

Extracts a native job id from the output of the submit command.

Concrete implementations of batch scheduler executors must override this method. This method is only invoked if the submit command completes with a zero exit code, so implementations of this method do not need to determine whether the output reflects an error from the submit command.

Parameters

out (str) – The output from the submit command.

Returns

A string representing the native id of the newly submitted job.

Return type

str
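
A possible shape for such a parser is sketched below. It assumes, purely for illustration, that the submit command prints a line ending in a numeric job id; the function name and the sample output format are not taken from any particular scheduler.

```python
import re

# Assumption: the native id is the run of digits at the end of the output.
_JOB_ID = re.compile(r"(\d+)\s*$")


def extract_job_id(out: str) -> str:
    """Return the trailing numeric id found in the submit command output."""
    match = _JOB_ID.search(out.strip())
    if match is None:
        raise ValueError(f"no job id found in submit output: {out!r}")
    return match.group(1)
```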

list()[source]

Returns a list of jobs known to the underlying implementation.

See JobExecutor.list(). The returned list is a list of native_id strings representing jobs known to the underlying batch scheduler implementation, whether submitted through this executor or not. Implementations are encouraged to restrict the results to jobs accessible by the current user.

Return type

List[str]

abstract parse_status_output(exit_code, out)[source]

Parses the output of a job status command.

Concrete implementations of batch scheduler executors must override this method. The output is meant to have been produced by the command generated by get_status_command().

Parameters
  • exit_code (int) –

  • out (str) –

Returns

A dictionary mapping native job ids to JobStatus objects. The implementation of this method need not process the exit code file or the script output file since it is done by the base BatchSchedulerExecutor implementation.

Return type

Dict[str, JobStatus]
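
The following sketch shows one way a status parser could be structured. The two-column "<id> <code>" output format, the state codes, and the use of plain strings in place of JobStatus objects are all assumptions made so that the example stays self-contained.

```python
from typing import Dict

# Hypothetical mapping from scheduler state codes to state names.
_STATE_NAMES = {"Q": "QUEUED", "R": "ACTIVE", "F": "COMPLETED"}


def parse_status_lines(exit_code: int, out: str) -> Dict[str, str]:
    """Map native ids to state names from '<id> <code>' lines."""
    if exit_code != 0:
        raise RuntimeError(f"status command failed ({exit_code}): {out}")
    result: Dict[str, str] = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines and anything not in the assumed format
        native_id, code = fields
        result[native_id] = _STATE_NAMES.get(code, "UNKNOWN")
    return result
```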

abstract process_cancel_command_output(exit_code, out)[source]

Handle output from a failed cancel command.

The main purpose of this method is to help distinguish between the cancel command failing due to an invalid job state (such as the job having completed before the cancel command was invoked) and other types of errors. Since job state errors are ignored, there are two options:

1. Instruct the cancel command to not fail on invalid state errors and have this method always raise a psij.SubmitException, since it is only invoked on “other” errors.

2. Have the cancel command fail on both invalid state errors and other errors and interpret the output from the cancel command to distinguish between the two and raise the appropriate exception.

Parameters
  • exit_code (int) – The exit code from the cancel command.

  • out (str) – The output from the cancel command.

Raises
  • InvalidJobStateError – Raised if the job cancellation has failed because the job was in a completed or failed state at the time when the cancellation command was invoked.

  • SubmitException – Raised for all other reasons.

Return type

None
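
Option 2 might look roughly like the sketch below. The two exception classes are local stand-ins for the PSI/J exceptions so that the example runs on its own, and the "already finished" message is an assumed piece of cancel command output.

```python
class InvalidJobStateError(Exception):
    """Local stand-in for the exception documented in this module."""


class SubmitException(Exception):
    """Local stand-in for psij.SubmitException."""


def handle_cancel_failure(exit_code: int, out: str) -> None:
    """Classify a failed cancel command by inspecting its output."""
    # Assumption: the scheduler reports completed jobs with this phrase.
    if "already finished" in out.lower():
        raise InvalidJobStateError(out)
    raise SubmitException(
        f"cancel command failed with exit code {exit_code}: {out}")
```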

submit(job)[source]

See JobExecutor.submit().

Parameters

job (Job) –

Return type

None

class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: JobExecutorConfig

A base configuration class for BatchSchedulerExecutor implementations.

When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.

Initializes a base batch scheduler executor configuration.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
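
One possible reading of queue_polling_error_threshold is a counter of consecutive failed polls that resets on success. The class below is a hypothetical illustration of that reading, not the executor's actual bookkeeping.

```python
class PollErrorTracker:
    """Track consecutive queue poll failures against a threshold."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, succeeded: bool) -> bool:
        """Record one poll; return True when failures should be reported."""
        if succeeded:
            # A successful poll resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

With the default threshold of 2, a single failed poll followed by a successful one is never reported.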

exception InvalidJobStateError[source]

Bases: Exception

An exception that signals that a job cannot be cancelled due to it being already done.

check_status_exit_code(command, exit_code, out)[source]

Check if exit_code is nonzero and if so raise a RuntimeError.

Parameters
  • command (str) –

  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.cobalt module

Defines a JobExecutor for the Cobalt resource manager.

class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Cobalt executor.

Initializes a base batch scheduler executor configuration.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class CobaltJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Initializes a CobaltJobExecutor.

Parameters
generate_submit_script(job, context, submit_file)[source]

See generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (TextIO) –

Return type

None

get_cancel_command(native_id)[source]

See get_cancel_command().

Parameters

native_id (str) –

Return type

List[str]

get_status_command(native_ids)[source]

See get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See parse_status_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See process_cancel_command_output().

This should be unnecessary because qdel only seems to fail on non-integer job IDs.

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.escape_functions module

bash_escape(o)[source]

Escape object to bash string.

Renders and escapes an object to a string such that its value is preserved when substituted in a bash script between double quotes. Numeric values are simply rendered without any escaping. Path objects are converted to absolute path and escaped. All other objects are converted to string and escaped.

Parameters

o (object) – The object to escape.

Returns

An escaped representation of the object that can be substituted in bash scripts.

Return type

str
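
The behavior described above could be approximated as in the sketch below. It is an independent illustration, not the library's implementation: the function name is hypothetical, and the set of escaped characters (backslash, double quote, dollar sign, backtick) is an assumption based on what bash treats specially inside double quotes.

```python
import re
from pathlib import Path

# Characters that keep a special meaning between double quotes in bash.
_SPECIAL = re.compile(r'([\\"$`])')


def escape_for_double_quotes(o: object) -> str:
    """Render `o` so it can be placed between double quotes in bash."""
    if isinstance(o, (int, float)):
        return str(o)  # numeric values need no escaping
    if isinstance(o, Path):
        text = str(o.absolute())  # paths are made absolute first
    else:
        text = str(o)
    # Prefix each special character with a backslash.
    return _SPECIAL.sub(r"\\\1", text)
```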

psij.executors.batch.lsf module

Defines the LsfJobExecutor class and its config class.

class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the LSF executor.

Initializes a base batch scheduler executor configuration.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class LsfJobExecutor(url, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the ‘bsub’, ‘bjobs’, and ‘bkill’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Initializes an LsfJobExecutor.

Parameters
generate_submit_script(job, context, submit_file)[source]

See generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (TextIO) –

Return type

None

get_cancel_command(native_id)[source]

See get_cancel_command().

bkill will exit with an error set if the job does not exist or has already finished.

Parameters

native_id (str) –

Return type

List[str]

get_status_command(native_ids)[source]

See get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See parse_status_output().

Iterate through the RECORDS entry, grabbing JOBID and STAT entries, as well as any state-change reasons if present.

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See process_cancel_command_output().

Check if the error was raised only because a job already exited.

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.pbspro module

class PBSProExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the PBS executor.

This doesn’t have any fields in addition to BatchSchedulerExecutorConfig, but it is expected that some will appear during further development.

Initializes a base batch scheduler executor configuration.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class PBSProJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for PBS Pro.

PBS Pro is a resource manager on certain machines at Argonne National Lab, among others.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #PBS directives when submitting a job.

Initializes a PBSProJobExecutor.

Parameters
generate_submit_script(job, context, submit_file)[source]

See generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (TextIO) –

Return type

None

get_cancel_command(native_id)[source]

See get_cancel_command().

Parameters

native_id (str) –

Return type

List[str]

get_status_command(native_ids)[source]

See get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See parse_status_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See process_cancel_command_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.script_generator module

class SubmitScriptGenerator(config)[source]

Bases: ABC

A base class representing a submit script generator.

A submit script generator is used to render a Job (together with all its properties, including JobSpec, ResourceSpec, etc.) into a submit script specific to a certain batch scheduler.

Initializes this SubmitScriptGenerator with an executor configuration.

Parameters

config (JobExecutorConfig) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.

Return type

None

generate_submit_script(job, context, out)[source]

Generates a job submit script.

Concrete implementations of submit script generators must implement this method. Its purpose is to generate the content of the submit script. For an extensive explanation of the mechanism behind this process, see BatchSchedulerExecutor.

Parameters
  • job (Job) – The job for which the submit script is to be generated.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see BatchSchedulerExecutor.

  • out (TextIO) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)[source]

Bases: SubmitScriptGenerator

A submit script generator based on Mustache templates.

This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).

Initializes this script generator.

Parameters
  • config (JobExecutorConfig) – A configuration, which is passed to the base class.

  • template_path (Path) – The path to a Mustache template.

  • escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.

Return type

None

generate_submit_script(job, context, out)[source]

See generate_submit_script().

Renders a submit script using the template specified when this generator was constructed.

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • out (TextIO) –

Return type

None
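
The delegation pattern between an executor and a templated generator can be sketched without Pystache. In the example below, Python's string.Template stands in for Mustache, both classes are hypothetical, the launch command is assumed to be a list of strings, and value escaping is omitted for brevity.

```python
import io
import shlex
from string import Template
from typing import Any, Dict, TextIO


class StringTemplateGenerator:
    """Hypothetical stand-in for a templated script generator."""

    def __init__(self, template_text: str) -> None:
        self._template = Template(template_text)

    def generate_submit_script(self, job: object, context: Dict[str, Any],
                               out: TextIO) -> None:
        psij = context["psij"]
        # Fill the template placeholders from the context and write the result.
        out.write(self._template.substitute(
            launch_command=shlex.join(psij["launch_command"]),
            script_dir=psij["script_dir"],
        ))


class SketchExecutor:
    """Hypothetical executor that delegates script generation."""

    def __init__(self) -> None:
        self.generator = StringTemplateGenerator(
            "#!/bin/bash\ncd ${script_dir}\n${launch_command}\n")

    def generate_submit_script(self, job: object, context: Dict[str, Any],
                               submit_file: TextIO) -> None:
        # The generator method has a matching signature, so this is a
        # straight pass-through.
        self.generator.generate_submit_script(job, context, submit_file)
```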

psij.executors.batch.slurm module

class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Slurm executor.

Initializes a base batch scheduler executor configuration.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

class SlurmJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Initializes a SlurmJobExecutor.

Parameters
generate_submit_script(job, context, submit_file)[source]

See generate_submit_script().

Parameters
  • job (Job) –

  • context (Dict[str, object]) –

  • submit_file (TextIO) –

Return type

None

get_cancel_command(native_id)[source]

See get_cancel_command().

Parameters

native_id (str) –

Return type

List[str]

get_status_command(native_ids)[source]

See get_status_command().

Parameters

native_ids (Collection[str]) –

Return type

List[str]

get_submit_command(job, submit_file_path)[source]

See get_submit_command().

Parameters
  • job (Job) –

  • submit_file_path (Path) –

Return type

List[str]

job_id_from_submit_output(out)[source]

See job_id_from_submit_output().

Parameters

out (str) –

Return type

str

parse_status_output(exit_code, out)[source]

See parse_status_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

Dict[str, JobStatus]

process_cancel_command_output(exit_code, out)[source]

See process_cancel_command_output().

Parameters
  • exit_code (int) –

  • out (str) –

Return type

None

psij.executors.batch.template_function_library module

ALL: Dict[str, Callable[[...], Any]] = {'walltime_to_minutes': <function walltime_to_minutes>}

A dictionary of all template-accessible functions for the batch executor templating mechanism.

This dictionary maps function names to their implementations. All public functions in this module are present in this dictionary and their corresponding keys are the same as their names.

walltime_to_minutes(walltime)[source]

Converts a walltime object to a number of minutes.

The walltime can be a Python timedelta, an integer (interpreted directly as a number of minutes), or a string in one of the formats HH:MM:SS, HH:MM, or MM.

Parameters

walltime (Union[timedelta, int, str]) – The walltime to convert.

Return type

int
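
A conversion along these lines could be written as in the sketch below. The function name is hypothetical, and rounding leftover seconds up to the next whole minute is an assumption; the description above does not say how seconds are handled.

```python
import math
from datetime import timedelta
from typing import Union


def to_minutes(walltime: Union[timedelta, int, str]) -> int:
    """Convert a walltime value to a whole number of minutes."""
    if isinstance(walltime, timedelta):
        # Assumption: partial minutes are rounded up.
        return math.ceil(walltime.total_seconds() / 60)
    if isinstance(walltime, int):
        return walltime  # already a number of minutes
    parts = [int(p) for p in walltime.split(":")]
    if len(parts) == 1:  # MM
        return parts[0]
    if len(parts) == 2:  # HH:MM
        hours, minutes = parts
        return hours * 60 + minutes
    if len(parts) == 3:  # HH:MM:SS
        hours, minutes, seconds = parts
        return hours * 60 + minutes + math.ceil(seconds / 60)
    raise ValueError(f"unrecognized walltime format: {walltime!r}")
```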

Module contents

A package containing infrastructure for implementing batch scheduler executors.