The PSI/J API

The most important classes in this library are Job and JobExecutor, followed by Launcher.

The Job class and its modifiers

The Job-related classes listed in this section (Job, JobSpec, ResourceSpec, and JobAttributes) are independent of executor implementations. The authors strongly recommend that users program against these classes, rather than adding executor-specific configuration options, to the extent possible.

class Job(spec=None)[source]

This class represents a PSI/J job.

It encapsulates all of the information needed to run a job as well as the job’s state.

Constructs a Job object.

The object can optionally be initialized with the given JobSpec. After construction, the job will be in the NEW state.

Parameters

spec (Optional[JobSpec]) – an optional JobSpec

Return type

None

cancel()[source]

Cancels this job.

The job is canceled by calling cancel() on the job executor that was used to submit this job.

Raises

SubmitException – if the job has not yet been submitted.

Return type

None

property id: str

This job’s ID, read-only.

The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.

property native_id: Optional[str]

The ID of this job according to the underlying LRM, read-only.

The native ID may not be available until after the job is submitted to a JobExecutor, in which case the attribute is None.

set_job_status_callback(cb)[source]

Registers a status callback with this job.

The callback can either be a subclass of JobStatusCallback or a function accepting two arguments: a Job and a JobStatus and returning nothing.

The callback will be invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. To remove the callback, set it to None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus returning nothing.

Return type

None
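As a minimal sketch of the function form of a status callback, the following defines a callable with the two-parameter shape described above. The names `record_status` and `events` are hypothetical, and the registration call is shown only as a comment because it needs an installed PSI/J.

```python
from typing import Any, List, Tuple

# Hypothetical in-memory log of (job id, state) pairs.
events: List[Tuple[Any, Any]] = []


def record_status(job: Any, job_status: Any) -> None:
    """Callback with two parameters (job, job_status) that returns nothing."""
    events.append((job.id, job_status.state))


# With PSI/J installed, registration would look like:
#     job.set_job_status_callback(record_status)
# and the callback would be removed again with:
#     job.set_job_status_callback(None)
```

Keeping the callback short is a deliberate choice here: it only records the event and leaves any heavier processing to the caller.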

spec

The job specification for this job. A valid job requires a valid specification.

property status: JobStatus

Returns the current status of the job.

It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.

Returns

the current state of this job

wait(timeout=None, target_states=None)[source]

Waits for the job to reach certain states.

This method returns either when the job reaches one of the target_states or when an amount of time indicated by the timeout parameter, if specified, passes. Returns the JobStatus object that has one of the desired target_states or None if the timeout is reached. If none of the states in target_states can be reached (such as, for example, because the job has entered the FAILED state while target_states consists of COMPLETED), this method throws an UnreachableStateException.

Parameters
  • timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.

  • target_states (Optional[Sequence[JobState]]) – A set of states to wait for. If not specified, wait for any of the final states.

Returns

returns the JobStatus object that caused this call to complete or None if the timeout is specified and reached.

Return type

Optional[JobStatus]
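The behavior of wait() described above might be exercised as in the sketch below. It assumes that the psij-python package is installed, that the local executor is registered under the name "local", and that the exceptions listed in this document are importable from the psij package. The function name `run_and_wait` is hypothetical.

```python
from datetime import timedelta
from typing import Any, Optional


def run_and_wait(executable: str = "/bin/date", timeout_s: float = 30.0) -> Optional[Any]:
    """Submit one job and wait for COMPLETED, handling timeout and unreachable states."""
    import psij  # imported lazily; requires the psij-python package

    executor = psij.JobExecutor.get_instance("local")  # assumed executor name
    job = psij.Job(psij.JobSpec(executable=executable))
    executor.submit(job)
    try:
        status = job.wait(
            timeout=timedelta(seconds=timeout_s),
            target_states=[psij.JobState.COMPLETED],
        )
    except psij.UnreachableStateException as exc:
        # For example, the job entered FAILED, so COMPLETED can no longer be reached.
        print("target state unreachable; job is in", exc.status.state)
        return exc.status
    if status is None:
        print("timed out before the job reached a target state")
    return status
```

The import sits inside the function so that the module can be loaded on machines where PSI/J is not installed.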

class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]

A class containing details about job transitions to new states.

Constructs a JobStatus object.

Parameters
  • state (JobState) – The JobState of this status.

  • time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If None, the current time will be used.

  • message (Optional[str]) – An optional message associated with the transition.

  • exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.

  • metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.

Return type

None
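Because the time parameter is a floating-point value in the style of time.time(), it can be converted to a readable timestamp with the standard library alone. The helper name `format_status_time` below is hypothetical; the sketch only illustrates the conversion.

```python
from datetime import datetime, timezone


def format_status_time(epoch_seconds: float) -> str:
    """Render a time.time()-style timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


print(format_status_time(0.0))  # 1970-01-01T00:00:00+00:00
```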

property final: bool

Returns the final property of the underlying state.

Returns

True if the state is final and False otherwise.

class JobState(value)[source]

An enumeration holding the possible job states.

The possible states are: NEW, QUEUED, ACTIVE, COMPLETED, FAILED, and CANCELED.

ACTIVE = 2

This state represents an actively running job.

CANCELED = 5

Represents a job that was canceled by a call to cancel().

COMPLETED = 3

This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.

FAILED = 4

Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.

NEW = 0

This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.

QUEUED = 1

This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.

property final: bool

Returns True if this state is final.

A state is final when no other state transition can occur after that state has been reached.

Returns

True if this is a final state and False otherwise

is_greater_than(other)[source]

Defines a (strict) partial ordering on the states.

Not all states are comparable. State transitions cannot violate this ordering.

Parameters

other (JobState) – the other JobState to compare to

Returns

if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.

Return type

Optional[bool]
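The three-valued result (True, False, or None) can be illustrated with a standalone model of a strict partial order. This is a sketch, not the PSI/J implementation: the `State` enum, the predecessor table, and the choice to return False when a state is compared with itself are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, Optional, Set


class State(Enum):
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


# Assumed direct predecessors of each state (illustrative, not taken from PSI/J).
_PREDECESSORS: Dict[State, Set[State]] = {
    State.NEW: set(),
    State.QUEUED: {State.NEW},
    State.ACTIVE: {State.QUEUED},
    State.COMPLETED: {State.ACTIVE},
    State.FAILED: {State.NEW, State.QUEUED, State.ACTIVE},
    State.CANCELED: {State.NEW, State.QUEUED, State.ACTIVE},
}


def _comes_after(a: State, b: State) -> bool:
    """True if b is found by walking predecessor links backwards from a."""
    stack = list(_PREDECESSORS[a])
    seen: Set[State] = set()
    while stack:
        s = stack.pop()
        if s == b:
            return True
        if s not in seen:
            seen.add(s)
            stack.extend(_PREDECESSORS[s])
    return False


def is_greater_than(a: State, b: State) -> Optional[bool]:
    """True if a can come after b, False if it cannot, None if not comparable."""
    if _comes_after(a, b):
        return True
    if a == b or _comes_after(b, a):
        return False
    return None
```

Under these assumptions, ACTIVE is greater than NEW, NEW is not greater than ACTIVE, and COMPLETED and FAILED are not comparable because neither can follow the other.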

Job modifiers

A lot of configuration information can go into each resource manager job: its walltime, its partition/queue, the number and kind of nodes it needs, the quality of service it requires, and so on.

PSI/J splits those attributes into three groups: one for generic POSIX information, one for resource information, and one for resource manager scheduling policies.

class JobSpec(name=None, executable=None, arguments=None, directory=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None)[source]

A class to hold information about the characteristics of a Job.

Constructs a JobSpec object while allowing its properties to be initialized.

Parameters
  • name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.

  • executable (Optional[str]) – An executable, such as “/bin/date”.

  • arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.

  • directory (Optional[Path]) – The directory, on the compute side, in which the executable is to be run

  • inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by the environment parameter. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.

  • environment (Optional[Dict[str, str]]) – A mapping of environment variable names to their respective values.

  • stdin_path (Optional[Path]) – Path to a file whose contents will be sent to the job’s standard input.

  • stdout_path (Optional[Path]) – A path to a file in which to place the standard output stream of the job.

  • stderr_path (Optional[Path]) – A path to a file in which to place the standard error stream of the job.

  • resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.

  • attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.

  • pre_launch (Optional[Path]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

  • post_launch (Optional[Path]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

  • launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see the Launchers section.
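The parameters above can be combined as in the following sketch. The `effective_environment` function is a hypothetical model of how inherit_environment and environment interact; in particular, letting explicitly specified variables override inherited ones is an assumption, since the text does not say which wins. The `build_spec` function uses only parameters documented above, with hypothetical paths and values, and requires an installed PSI/J.

```python
from pathlib import Path
from typing import Any, Dict, Mapping, Optional


def effective_environment(
    inherit_environment: bool,
    environment: Optional[Mapping[str, str]],
    inherited: Mapping[str, str],
) -> Dict[str, str]:
    """Model of the variables a job would see (illustrative only)."""
    env = dict(inherited) if inherit_environment else {}
    env.update(environment or {})  # assumed: explicit values take precedence
    return env


def build_spec() -> Any:
    """Build a JobSpec from documented parameters; values are hypothetical."""
    import psij  # imported lazily; requires the psij-python package

    return psij.JobSpec(
        name="hello",
        executable="/bin/echo",
        arguments=["hello"],  # "hello" becomes argv[1] of /bin/echo
        directory=Path("/tmp"),
        inherit_environment=False,
        environment={"GREETING": "hi"},
        stdout_path=Path("/tmp/hello.out"),
        stderr_path=Path("/tmp/hello.err"),
    )
```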

property name: Optional[str]

Returns the name of the job.

property to_dict: Dict[str, Any]

Returns a dictionary representation of this object.

class ResourceSpec[source]

A base class for resource specifications.

The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.

abstract property version: int

Returns the version of this resource specification class.

class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, project_name=None, reservation_id=None, custom_attributes=None)[source]

A class containing ancillary job information that describes how a job is to be run.

Constructs a JobAttributes instance while allowing its various fields to be initialized.

Parameters
  • duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.

  • queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.

  • project_name (Optional[str]) – If a backend supports multiple projects for billing purposes, setting this attribute instructs the backend to bill the indicated project for the resources consumed by this job.

  • reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.

  • custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes.

Return type

None

get_custom_attribute(name)[source]

Retrieves the value of a custom attribute.

Parameters

name (str) –

Return type

Optional[object]

set_custom_attribute(name, value)[source]

Sets a custom attribute.

Return type

None
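A JobAttributes instance might be assembled as in the sketch below, assuming PSI/J is installed. The queue name, project name, and custom attribute name and value are hypothetical placeholders; which custom attributes are understood depends on the JobExecutor implementation.

```python
from datetime import timedelta
from typing import Any


def build_attributes() -> Any:
    """Create JobAttributes with a 30-minute walltime and one custom attribute."""
    import psij  # imported lazily; requires the psij-python package

    attrs = psij.JobAttributes(
        duration=timedelta(minutes=30),
        queue_name="debug",        # hypothetical queue name
        project_name="myproject",  # hypothetical project name
    )
    # Hypothetical attribute name and value; interpretation is executor-specific.
    attrs.set_custom_attribute("slurm.constraint", "knl")
    return attrs
```

The resulting object would then be passed as the attributes parameter of a JobSpec.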

Executors

Executors are concrete implementations of mechanisms that execute jobs. To get an instance of a specific executor, call JobExecutor.get_instance(name), with name being one of the installed executor names. Alternatively, directly instantiate the executor, e.g.

from psij.executors.flux import FluxJobExecutor

ex = FluxJobExecutor()

rather than

import psij

ex = psij.JobExecutor.get_instance('flux')

Executors can be installed from multiple sources, so the precise list of executors available to a specific installation of the PSI/J Python library can vary. In order to get a list of available executors, you can run, in a terminal:

$ python -m psij plugins

JobExecutor Base Class

The psij.JobExecutor class is abstract, but offers concrete static methods for registering, fetching, and listing subclasses of itself.

class JobExecutor(url=None, config=None)[source]

An abstract base class for all JobExecutor implementations.

Initializes this executor using an optional url and an optional configuration.

Parameters
  • url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.

  • config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional such that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.

The concrete executor implementations provided by this version of PSI/J Python are:

Cobalt

class CobaltJobExecutor(url=None, config=None)[source]

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Initializes a CobaltJobExecutor.

Flux

class FluxJobExecutor(url=None, config=None)[source]

A JobExecutor for the Flux scheduler.

The Flux resource manager framework is deployed and used on a per-user basis at many sites, and is slated to become the system-level resource manager at LLNL.

Uses Flux’s python library/bindings to submit, monitor, and manipulate jobs.

Initializes a FluxJobExecutor.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (Optional[JobExecutorConfig]) – The FluxJobExecutor does not have any configuration options.

Return type

None

LSF

class LsfJobExecutor(url, config=None)[source]

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the ‘bsub’, ‘bjobs’, and ‘bkill’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Initializes a LsfJobExecutor.

PBS

class PBSProJobExecutor(url=None, config=None)[source]

A JobExecutor for PBS Pro.

PBS Pro is a resource manager on certain machines at Argonne National Lab, among others.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #PBS directives when submitting a job.

Initializes a PBSProJobExecutor.

Slurm

class SlurmJobExecutor(url=None, config=None)[source]

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Initializes a SlurmJobExecutor.

Local

class LocalJobExecutor(url=None, config=None)[source]

A job executor that runs jobs locally using subprocess.Popen.

This job executor is intended to be used when there is no resource manager, only the operating system, or when there is a resource manager that should be ignored.

Limitations: in Linux, attached jobs always appear to complete with a zero exit code regardless of the actual exit code.

Initializes a LocalJobExecutor.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (JobExecutorConfig) – The LocalJobExecutor does not have any configuration options.

Return type

None

Radical Pilot

class RPJobExecutor(url=None, config=None)[source]

A job executor that runs jobs via radical.pilot.

The RADICAL Pilot system.

Initializes a RPJobExecutor.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (JobExecutorConfig) – The RPJobExecutor does not have any configuration options.

Return type

None

Launchers

Launchers are mechanisms to start the actual jobs on batch schedulers once a set of nodes has been allocated for the job. In essence, launchers are wrappers around the job executable which can provide additional features, such as setting up an MPI environment, starting a copy of the job executable on each allocated node, etc. To get a launcher instance, call Launcher.get_instance(name) with name being the name of a launcher. Like job executors, above, launchers are plugins and can come from various places. To obtain a list of launchers, you can run:

$ python -m psij plugins

Launcher base class

Like the executor, the Launcher base class is abstract, but offers concrete static methods for registering and fetching subclasses of itself.

class Launcher(config=None)[source]

An abstract base class for all launchers.

Base constructors for launchers.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.

Return type

None

The PSI/J Python library comes with a core set of launchers, which are:

aprun

class AprunLauncher(config=None)[source]

Launches a job using Cobalt’s aprun.

Initializes this launcher using an optional configuration.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

jsrun

class JsrunLauncher(config=None)[source]

Launches a job using LSF’s jsrun.

Initializes this launcher using an optional configuration.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

srun

class SrunLauncher(config=None)[source]

Launches a job using Slurm’s srun.

See the Slurm Workload Manager.

Initializes this launcher using an optional configuration.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

mpirun

class MPILauncher(config=None)[source]

Launches jobs using mpirun.

mpirun is a tool provided by MPI implementations, such as Open MPI.

Initializes this launcher using an optional configuration.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

single

class SingleLauncher(config=None)[source]

A launcher that launches a single copy of the executable. This is the default launcher.

Initializes this launcher using an optional configuration.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

multiple

class MultipleLauncher(script_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/psij-python/checkouts/0.1.0.post2/src/psij/launchers/scripts/multi_launch.sh'), config=None)[source]

A launcher that launches multiple identical copies of the executable.

The exit code of the job corresponds to the first non-zero exit code encountered in one of the executable copies or zero if all invocations of the executable succeed.
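The exit code rule stated above can be written as a small function. This is only a model of the rule, not the launcher script itself; the name `combined_exit_code` is hypothetical, and "first" is taken to mean the order in which the exit codes are passed in.

```python
from typing import Iterable


def combined_exit_code(exit_codes: Iterable[int]) -> int:
    """Return the first non-zero exit code, or zero if every copy succeeded."""
    for code in exit_codes:
        if code != 0:
            return code
    return 0


print(combined_exit_code([0, 3, 0, 7]))  # 3
print(combined_exit_code([0, 0, 0]))     # 0
```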

Initializes this launcher using an optional configuration.

get_additional_args(job)[source]

See Launcher.get_additional_args().

Parameters

job (Job) –

Return type

List[str]

Other Package Contents

exception InvalidJobException(message, exception=None)[source]

An exception describing a problem with a job specification.

Constructs an InvalidJobException while allowing properties to be initialized.

Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

exception SubmitException(message, exception=None, transient=False)[source]

An exception representing job submission issues.

This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.

Constructs a SubmitException and allows properties to be initialized.

Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

transient

Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent re-tries until correct credentials are used.
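Client code could act on the transient flag roughly as in the sketch below. The helper `retry_transient`, its parameters, and the linear backoff are assumptions for illustration, not part of PSI/J. The helper inspects a transient attribute on any raised exception, so with PSI/J installed the operation would typically be a call such as `lambda: executor.submit(job)`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_transient(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying only when the raised exception is marked transient."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            transient = bool(getattr(exc, "transient", False))
            if not transient or attempt == attempts:
                raise  # permanent failure, or no attempts left
            sleep(delay_s * attempt)  # simple linear backoff
    raise ValueError("attempts must be at least 1")
```

Non-transient failures, such as invalid credentials, are re-raised immediately, since retrying them would not be expected to help.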

exception UnreachableStateException(status)[source]

Indicates that a job state being waited for cannot be reached.

This exception is thrown when the wait() method is called with a set of states that cannot be reached by the job when the call is made.

Constructs an UnreachableStateException.

Parameters

status (JobStatus) – The JobStatus that the job was in when wait() was called and which prevents the desired states from being reached.

Return type

None

status

Returns the job status that has caused an implementation to determine that the desired states passed to the wait() method cannot be reached.
