psij package
Subpackages
- psij.executors package
  - Subpackages
    - psij.executors.batch package
      - Submodules
        - psij.executors.batch.batch_scheduler_executor module
        - psij.executors.batch.cobalt module
        - psij.executors.batch.escape_functions module
        - psij.executors.batch.lsf module
        - psij.executors.batch.pbspro module
        - psij.executors.batch.script_generator module
        - psij.executors.batch.slurm module
        - psij.executors.batch.template_function_library module
      - Module contents
  - Submodules
    - psij.executors.flux module
    - psij.executors.local module
    - psij.executors.rp module
  - Module contents
- psij.launchers package
Submodules
psij.descriptor module
- class Descriptor(name, version, cls)
Bases: object
This class is used to enable PSI/J to discover and register executors and/or launchers.
Executors and launchers wanting to register with PSI/J must place an instance of this class in a module-level list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__, respectively, in a module placed in the psij-descriptors package. In other words, to register an executor or launcher automatically, create a Python file inside a psij-descriptors package, such as:
    <project_root>/src/psij-descriptors/descriptors_for_project.py
The contents of descriptors_for_project.py could then be as follows:
    from distutils.version import StrictVersion

    from psij.descriptor import Descriptor

    __PSI_J_EXECUTORS__ = [
        Descriptor(name=<name>, version=StrictVersion(<version_str>), cls=<fqn_str>),
        ...
    ]

    __PSI_J_LAUNCHERS__ = [
        Descriptor(name=<name>, version=StrictVersion(<version_str>), cls=<fqn_str>),
        ...
    ]
where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.
Initializes a descriptor.
- Parameters
name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().
version (StrictVersion) – The version of the executor/launcher. Multiple versions can be registered under a single name.
cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.
- Return type
None
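The registration scheme above boils down to reading well-known module-level lists. The following is a minimal sketch of that lookup, not PSI/J's actual discovery code: Descriptor here is a hypothetical stand-in (with the version kept as a plain string instead of StrictVersion), collect_descriptors is an illustrative helper, and the descriptor module is simulated in memory rather than imported from a psij-descriptors package.

```python
from types import ModuleType
from typing import List, NamedTuple


class Descriptor(NamedTuple):
    """Hypothetical stand-in for psij.descriptor.Descriptor (version as a plain string)."""
    name: str
    version: str
    cls: str


def collect_descriptors(modules: List[ModuleType], list_name: str) -> List[Descriptor]:
    """Concatenate the module-level lists called `list_name` found in `modules`."""
    found: List[Descriptor] = []
    for module in modules:
        # A module that does not define the list simply contributes nothing.
        found.extend(getattr(module, list_name, []))
    return found


# Simulate descriptors_for_project.py in memory instead of importing a real file.
demo = ModuleType("descriptors_for_project")
demo.__PSI_J_EXECUTORS__ = [
    Descriptor(name="local", version="1.0.2",
               cls="psij.executors.local.LocalJobExecutor"),
]
demo.__PSI_J_LAUNCHERS__ = []

executors = collect_descriptors([demo], "__PSI_J_EXECUTORS__")
print([d.name for d in executors])
```

A real loader would additionally import the class named by cls and check it against the root path, as described for register_executor() and register_launcher().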
psij.exceptions module
- exception InvalidJobException(message, exception=None)
Bases: Exception
An exception describing a problem with a job specification.
Constructs an InvalidJobException while allowing properties to be initialized.
- Parameters
- Return type
None
- exception
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- exception SubmitException(message, exception=None, transient=False)
Bases: Exception
An exception representing job submission issues.
This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.
Constructs a SubmitException and allows properties to be initialized.
- Parameters
- Return type
None
- exception
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- transient
Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which suggests to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error, since it can be reasonably assumed that working DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination is not a transient failure. While a temporary defect in a service could cause such a failure, under normal operating conditions the error would persist across subsequent retries until correct credentials are used.
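The transient flag lends itself to a simple retry policy. The sketch below is one possible client-side approach, not part of PSI/J: submit_with_retry is a hypothetical helper that only inspects a `transient` attribute on the raised exception, and FakeSubmitException is a hypothetical stand-in used so the example runs without a real executor. With PSI/J one would pass something like `lambda: executor.submit(job)`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def submit_with_retry(submit: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call submit(), retrying only if the raised exception reports transient=True."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except Exception as e:
            # Non-transient failures (e.g. bad credentials) will not go away on retry.
            if not getattr(e, "transient", False) or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear back-off between attempts
    raise AssertionError("unreachable")


class FakeSubmitException(Exception):
    """Hypothetical stand-in exposing a `transient` attribute like SubmitException."""

    def __init__(self, message: str, transient: bool) -> None:
        super().__init__(message)
        self.transient = transient


calls = []


def flaky_submit() -> str:
    # Fails transiently twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise FakeSubmitException("DNS lookup failed", transient=True)
    return "submitted"


print(submit_with_retry(flaky_submit, attempts=3, delay=0.0))
```

Since success is never guaranteed, capping the number of attempts and re-raising the last exception keeps the failure visible to the caller.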
- exception UnreachableStateException(status)
Bases: Exception
Indicates that a job state being waited for cannot be reached.
This exception is thrown when the wait() method is called with a set of states that cannot be reached by the job when the call is made.
Constructs an UnreachableStateException.
psij.job module
- class FunctionJobStatusCallback(fn)
Bases: JobStatusCallback
A JobStatusCallback that wraps a function.
Initializes a FunctionJobStatusCallback.
- job_status_changed(job, job_status)
See JobStatusCallback.job_status_changed().
- class Job(spec=None)
Bases: object
This class represents a PSI/J job.
It encapsulates all of the information needed to run a job as well as the job’s state.
Constructs a Job object.
The object can optionally be initialized with the given JobSpec. After construction, the job will be in the NEW state.
- cancel()
Cancels this job.
The job is canceled by calling cancel() on the job executor that was used to submit this job.
- Raises
SubmitException – if the job has not yet been submitted.
- Return type
None
- property id: str
This job’s ID, read-only.
The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.
- property native_id: Optional[str]
The ID of this job according to the underlying LRM, read-only.
The native ID may not be available until after the job is submitted to a JobExecutor, in which case the attribute is None.
- set_job_status_callback(cb)
Registers a status callback with this job.
The callback can either be a subclass of JobStatusCallback or a function accepting two arguments, a Job and a JobStatus, and returning nothing.
The callback will be invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. To remove the callback, set it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, returning nothing.
- Return type
None
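Accepting either a callback object or a bare function is typically handled by wrapping the function, which is the role FunctionJobStatusCallback plays in this module. The sketch below shows that adapter idea in isolation; StatusCallback, FunctionCallback and as_callback are hypothetical names, and plain strings stand in for Job and JobStatus objects.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List, Optional, Union

# Signature of a plain-function callback: (job, job_status) -> None.
StatusFn = Callable[[Any, Any], None]


class StatusCallback(ABC):
    """Hypothetical stand-in for JobStatusCallback."""

    @abstractmethod
    def job_status_changed(self, job: Any, job_status: Any) -> None:
        """Called on every status change."""


class FunctionCallback(StatusCallback):
    """Adapts a plain function to the callback interface."""

    def __init__(self, fn: StatusFn) -> None:
        self.fn = fn

    def job_status_changed(self, job: Any, job_status: Any) -> None:
        self.fn(job, job_status)


def as_callback(cb: Union[StatusCallback, StatusFn, None]) -> Optional[StatusCallback]:
    """Normalize a set_job_status_callback argument; None means 'remove'."""
    if cb is None or isinstance(cb, StatusCallback):
        return cb
    return FunctionCallback(cb)


seen: List[tuple] = []
wrapped = as_callback(lambda job, status: seen.append((job, status)))
assert wrapped is not None
wrapped.job_status_changed("job-1", "ACTIVE")
print(seen)
```

Normalizing once at registration time lets the notification code call a single method regardless of what the user supplied.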
- spec
The job specification for this job. A valid job requires a valid specification.
- property status: JobStatus
Returns the current status of the job.
It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobState values. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1, provided both calls are placed from the same thread or a proper memory barrier is placed between them. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.
- Returns
the current status of this job
- wait(timeout=None, target_states=None)
Waits for the job to reach certain states.
This method returns either when the job reaches one of the target_states or, if timeout is specified, when that amount of time has passed. It returns the JobStatus object that has one of the desired target_states, or None if the timeout is reached. If none of the states in target_states can be reached (for example, because the job has entered the FAILED state while target_states consists of COMPLETED), this method throws an UnreachableStateException.
- Parameters
- Returns
the JobStatus object that caused this call to complete, or None if the timeout is specified and reached.
- Return type
Optional[JobStatus]
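The UnreachableStateException case amounts to asking whether any target state is still ahead of the job. Below is a minimal sketch of that check under an assumed state model: State mirrors JobState with (index, is_final) pairs, the index 5 for CANCELED is an assumption (the JobState listing gives no value for it), and can_still_reach is a hypothetical helper rather than PSI/J's code.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    """Hypothetical mirror of JobState as (index, is_final); CANCELED's index is assumed."""
    NEW = (0, False)
    QUEUED = (1, False)
    ACTIVE = (2, False)
    COMPLETED = (3, True)
    FAILED = (4, True)
    CANCELED = (5, True)

    @property
    def rank(self) -> int:
        return self.value[0]

    @property
    def final(self) -> bool:
        return self.value[1]


def can_still_reach(current: State, targets: Iterable[State]) -> bool:
    """True if the job is already in, or may still enter, one of the targets."""
    for target in targets:
        if target is current:
            return True
        # Nothing follows a final state; otherwise any later state remains possible.
        if not current.final and target.rank > current.rank:
            return True
    return False


print(can_still_reach(State.FAILED, [State.COMPLETED]))
```

A wait() implementation built on this idea would raise UnreachableStateException as soon as the check turns False, instead of blocking until the timeout.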
- class JobStatusCallback
Bases: ABC
An interface used to listen to job status change events.
- abstract job_status_changed(job, job_status)
This method is invoked when a status change occurs on a job.
Client code interested in receiving status notifications must implement this method. It is entirely possible that psij.Job.status, when read from the body of this method, returns something different from the status passed to this callback. This is because the status of the job can be updated during the execution of the body of this method and, in particular, before psij.Job.status is read.
Client code implementing this method must return quickly and should not perform lengthy processing. Furthermore, client code implementing this method should not throw exceptions.
psij.job_attributes module
- class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, project_name=None, reservation_id=None, custom_attributes=None)
Bases: object
A class containing ancillary job information that describes how a job is to be run.
Constructs a JobAttributes instance while allowing its various fields to be initialized.
- Parameters
duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.
queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.
project_name (Optional[str]) – If a backend supports multiple projects for billing purposes, setting this attribute instructs the backend to bill the indicated project for the resources consumed by this job.
reservation_id (Optional[str]) – Allows specifying an advance reservation ID. Advance reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.
custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes.
- Return type
None
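Batch schedulers commonly take the walltime as an HH:MM:SS string (Slurm's --time, for example, accepts hours:minutes:seconds), so a backend has to translate the duration timedelta at some point. The helper below is a hypothetical sketch of such a conversion; it is not taken from PSI/J's executors, and rounding fractional seconds up is a design choice made here so that the job is never granted less time than requested.

```python
import math
from datetime import timedelta


def format_walltime(duration: timedelta) -> str:
    """Render a duration as HH:MM:SS, rounding any fractional second up."""
    if duration < timedelta(0):
        raise ValueError("walltime cannot be negative")
    total = math.ceil(duration.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# JobAttributes defaults to datetime.timedelta(seconds=600).
print(format_walltime(timedelta(seconds=600)))  # 00:10:00
```

Hours are not wrapped into days, so a 25-hour duration renders as 25:00:00; schedulers that require a D-HH:MM:SS form would need a different formatter.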
psij.job_executor module
- class JobExecutor(url=None, config=None)
Bases: ABC
An abstract base class for all JobExecutor implementations.
Initializes this executor using an optional url and an optional configuration.
- Parameters
url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.
config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional so that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.
- abstract cancel(job)
Cancels a job that has been submitted to the underlying executor implementation.
A successful return of this method only indicates that the request for cancelation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancelation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(target_states=[JobState.CANCELED]) or even job.wait(), since the latter waits for all final states, including JobState.CANCELED. In fact, job.wait() is recommended, because it is entirely possible for the job to complete before the cancelation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job never enters the CANCELED state, and a wait that targets only CANCELED cannot succeed.
- Parameters
job (Job) – The job to be canceled.
- Raises
SubmitException – Thrown if the request cannot be sent to the underlying implementation.
- Return type
None
- static get_executor_names()
Returns a set of registered executor names.
Names returned by this method can be passed to
get_instance()as the name parameter.
- static get_instance(name, version_constraint=None, url=None, config=None)
Returns an instance of a JobExecutor.
- Parameters
name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(). If the value of the name parameter is not one of the valid values returned by get_executor_names(), ValueError is raised.
version_constraint (Optional[str]) – A version constraint for the executor in the form ‘(’ <op> <version>[, <op> <version>[, …]] ‘)’, such as “( > 0.0.2, != 0.0.4)”.
url (Optional[str]) – An optional URL to pass to the JobExecutor instance.
config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.
- Returns
A JobExecutor.
- Return type
JobExecutor
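To make the version_constraint format concrete, the sketch below checks a dotted version against a constraint such as “( > 0.0.2, != 0.0.4)”. It is an illustration of the documented syntax only, not the parser PSI/J uses: satisfies and _as_tuples are hypothetical helpers, and the sketch assumes purely numeric versions (no alpha/beta suffixes), padding the shorter version with zeros so that 1.0 equals 1.0.0.

```python
import operator
import re
from typing import Callable, Dict, Tuple

# Supported comparison operators; two-character ones come first in the regex.
_OPS: Dict[str, Callable[..., bool]] = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}
_CLAUSE = re.compile(r"^\s*(>=|<=|==|!=|>|<)\s*(\d+(?:\.\d+)*)\s*$")


def _as_tuples(a: str, b: str) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Parse two dotted versions and pad the shorter one with zeros."""
    x = [int(p) for p in a.split(".")]
    y = [int(p) for p in b.split(".")]
    width = max(len(x), len(y))
    x += [0] * (width - len(x))
    y += [0] * (width - len(y))
    return tuple(x), tuple(y)


def satisfies(version: str, constraint: str) -> bool:
    """Check `version` against a constraint such as "( > 0.0.2, != 0.0.4)"."""
    body = constraint.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError(f"constraint must be parenthesized: {constraint!r}")
    for clause in body[1:-1].split(","):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"cannot parse clause {clause!r}")
        op, bound = match.groups()
        have, want = _as_tuples(version, bound)
        # Every clause must hold (the commas act as a logical AND).
        if not _OPS[op](have, want):
            return False
    return True


print(satisfies("0.0.3", "( > 0.0.2, != 0.0.4)"))
```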
- static register_executor(desc, root)
Registers a JobExecutor class through a Descriptor.
The class can then be later instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the executor to be registered.
root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- set_job_status_callback(cb)
Registers a status callback with this executor.
The callback can either be a subclass of JobStatusCallback or a function accepting two arguments, a Job and a JobStatus, and returning nothing.
The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, returning nothing.
- Return type
None
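The executor-level callback and the per-job callback registered through Job.set_job_status_callback() are independent: a status change reaches both. The sketch below models that fan-out with a hypothetical StatusDispatcher class using strings for job IDs and states; it is not PSI/J's implementation, and invoking the per-job callback before the executor-wide one is an arbitrary choice the documentation does not specify.

```python
from typing import Callable, Dict, List, Optional

# A callback receives (job_id, new_state); plain strings keep the sketch self-contained.
Callback = Callable[[str, str], None]


class StatusDispatcher:
    """Hypothetical sketch of executor-wide plus per-job status callbacks."""

    def __init__(self) -> None:
        self.executor_cb: Optional[Callback] = None
        self.job_cbs: Dict[str, Callback] = {}

    def set_executor_callback(self, cb: Optional[Callback]) -> None:
        # Passing None removes the executor-wide callback.
        self.executor_cb = cb

    def set_job_callback(self, job_id: str, cb: Optional[Callback]) -> None:
        # Passing None removes the callback for this job only.
        if cb is None:
            self.job_cbs.pop(job_id, None)
        else:
            self.job_cbs[job_id] = cb

    def notify(self, job_id: str, state: str) -> None:
        """Deliver one status change to every interested callback."""
        job_cb = self.job_cbs.get(job_id)
        if job_cb is not None:
            job_cb(job_id, state)
        # The executor-wide callback fires for every job, with or without a per-job one.
        if self.executor_cb is not None:
            self.executor_cb(job_id, state)


events: List[str] = []
dispatcher = StatusDispatcher()
dispatcher.set_executor_callback(lambda j, s: events.append(f"executor:{j}:{s}"))
dispatcher.set_job_callback("job-1", lambda j, s: events.append(f"job:{j}:{s}"))
dispatcher.notify("job-1", "ACTIVE")
dispatcher.notify("job-2", "QUEUED")
print(events)
```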
- abstract submit(job)
Submits a Job to the underlying implementation.
Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.
- Raises
InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.
SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.
- Parameters
job (Job) – The job to be submitted.
- Return type
None
- property version: distutils.version.Version
Returns the version of this executor.
psij.job_executor_config module
- class JobExecutorConfig(launcher_log_file=None, work_directory=None)
Bases: object
An abstract configuration class for JobExecutor instances.
Initializes a configuration object.
- Parameters
launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- Return type
None
- DEFAULT: JobExecutorConfig
- DEFAULT_WORK_DIRECTORY = PosixPath('/home/docs/.psij/work')
The path shown is .psij/work under the home directory as resolved on the machine that generated this documentation.
- property launcher_log_file: Optional[Path]
Configure the executor’s launcher log file.
- Parameters
launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
- property work_directory: Path
Configure the executor’s work directory.
- Parameters
work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
psij.job_launcher module
This module contains the core classes of the launchers infrastructure.
- class Launcher(config=None)
Bases: ABC
An abstract base class for all launchers.
Base constructor for launchers.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.
- Return type
None
- DEFAULT_LAUNCHER_NAME = 'single'
- static get_instance(name, version_constraint=None, config=None)
Returns an instance of a launcher, optionally configured using a given configuration.
The returned instance may or may not be a singleton object.
- abstract get_launch_command(job)
Constructs a command to launch a job given a job specification.
- abstract get_launcher_failure_message(output)
Extracts the launcher error message from the output of this launcher’s invocation.
This method assumes that is_launcher_failure() returns True for the given output.
- abstract is_launcher_failure(output)
Determines whether the launcher invocation output contains a launcher failure or not.
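The three abstract methods form a small contract: build a command, decide whether the output signals a launcher failure, and only then extract the message. The sketch below illustrates that contract with hypothetical classes; to stay self-contained it passes an executable and argument list instead of a Job, and the "LAUNCHER_ERROR: " marker line is an invented convention, not the one PSI/J's launcher scripts use.

```python
from abc import ABC, abstractmethod
from typing import List


class LauncherSketch(ABC):
    """Hypothetical mirror of the three abstract methods of Launcher."""

    @abstractmethod
    def get_launch_command(self, executable: str, arguments: List[str]) -> List[str]:
        """Build the command line that starts the job (simplified: no Job object)."""

    @abstractmethod
    def is_launcher_failure(self, output: str) -> bool:
        """Report whether the launcher itself (not the job) failed."""

    @abstractmethod
    def get_launcher_failure_message(self, output: str) -> str:
        """Extract the error; only called when is_launcher_failure(output) is True."""


class DirectLauncherSketch(LauncherSketch):
    """Runs the executable directly and flags failures via a marker line."""

    # Hypothetical marker; real launchers define their own failure conventions.
    MARKER = "LAUNCHER_ERROR: "

    def get_launch_command(self, executable: str, arguments: List[str]) -> List[str]:
        return [executable, *arguments]

    def is_launcher_failure(self, output: str) -> bool:
        return any(line.startswith(self.MARKER) for line in output.splitlines())

    def get_launcher_failure_message(self, output: str) -> str:
        return "\n".join(line[len(self.MARKER):] for line in output.splitlines()
                         if line.startswith(self.MARKER))


launcher = DirectLauncherSketch()
print(launcher.get_launch_command("/bin/date", ["-u"]))
```

Keeping detection and extraction separate lets the caller cheaply check for failure first and parse the message only when needed.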
- static register_launcher(desc, root)
Registers a launcher class.
The registered class can then be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the launcher to register.
root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.
- Return type
None
psij.job_spec module
- class JobSpec(name=None, executable=None, arguments=None, directory=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None)
Bases: object
A class to hold information about the characteristics of a Job.
Constructs a JobSpec object while allowing its properties to be initialized.
- Parameters
name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.
executable (Optional[str]) – An executable, such as “/bin/date”.
arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.
directory (Optional[Path]) – The directory, on the compute side, in which the executable is to be run.
inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by the environment parameter. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.
environment (Optional[Dict[str, str]]) – A mapping of environment variable names to their respective values.
stdin_path (Optional[Path]) – Path to a file whose contents will be sent to the job’s standard input.
stdout_path (Optional[Path]) – A path to a file in which to place the standard output stream of the job.
stderr_path (Optional[Path]) – A path to a file in which to place the standard error stream of the job.
resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.
attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.
pre_launch (Optional[Path]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
post_launch (Optional[Path]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see the launchers section.
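Two of the parameters above have easily misread semantics: arguments start at argv[1], and inherit_environment decides whether environment is merged into the inherited environment or used on its own. The function below is a hypothetical sketch of how a local backend might turn these fields into an argv list and an environment mapping; build_invocation and its parent_env parameter are illustrative names, not PSI/J API.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple


def build_invocation(executable: str,
                     arguments: Optional[List[str]] = None,
                     inherit_environment: bool = True,
                     environment: Optional[Mapping[str, str]] = None,
                     parent_env: Optional[Mapping[str, str]] = None
                     ) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a job, following the JobSpec semantics described above."""
    # argv[0] is the executable itself, so the user's arguments begin at argv[1].
    argv = [executable, *(arguments or [])]
    # Start from the parent environment only when inheritance is requested.
    base = os.environ if parent_env is None else parent_env
    env: Dict[str, str] = dict(base) if inherit_environment else {}
    # Explicit variables are added on top (and override inherited ones).
    env.update(environment or {})
    return argv, env


argv, env = build_invocation("/bin/date", ["-u"], inherit_environment=False,
                             environment={"TZ": "UTC"})
print(argv, env)
```

The resulting pair could be handed to something like subprocess.run(argv, env=env) by a local executor.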
psij.job_state module
- class JobState(value)
An enumeration holding the possible job states.
The possible states are: NEW, QUEUED, ACTIVE, COMPLETED, FAILED, and CANCELED.
- ACTIVE = 2
This state represents an actively running job.
- COMPLETED = 3
This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.
- FAILED = 4
Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.
- NEW = 0
This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.
- QUEUED = 1
This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.
- property final: bool
Returns True if this state is final.
A state is final when no other state transition can occur after that state has been reached.
- Returns
True if this is a final state and False otherwise
- is_greater_than(other)
Defines a (strict) partial ordering on the states.
Not all states are comparable. State transitions cannot violate this ordering.
- Parameters
other (JobState) – the other JobState to compare to
- Returns
if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.
- Return type
Optional[bool]
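The three-valued result (True, False, or None for incomparable states) can be modeled directly. The enum below is a hypothetical mirror of JobState, not PSI/J's class: each member carries an assumed (index, is_final) pair, CANCELED's index 5 is an assumption because the listing above gives it no value, and the rule that two distinct final states are incomparable is one reasonable reading of the documented partial order.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical mirror of JobState as (index, is_final); CANCELED's index is assumed."""
    NEW = (0, False)
    QUEUED = (1, False)
    ACTIVE = (2, False)
    COMPLETED = (3, True)
    FAILED = (4, True)
    CANCELED = (5, True)

    @property
    def rank(self) -> int:
        return self.value[0]

    @property
    def final(self) -> bool:
        return self.value[1]

    def is_greater_than(self, other: "State") -> Optional[bool]:
        """True if this state can come after `other`, None if the two are incomparable."""
        if self is other:
            return False  # a strict order never places a state after itself
        if self.final and other.final:
            return None  # e.g. COMPLETED vs FAILED: neither can follow the other
        return self.rank > other.rank


print(State.COMPLETED.is_greater_than(State.FAILED))
```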
- class JobStateOrder
Bases: object
A class that can be used to reconstruct missing states.
- static prev(state)
Returns the state previous to the given state.
The “previous” state is a state that must have occurred immediately prior to this state given the state transition diagram if such a state is unique. Not all states have a previous state. For example, the FAILED state does not have a previous state, since it can be reached from multiple states.
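Because the job is guaranteed to pass through all intermediate states, a backend that observes a jump (for example from NEW straight to COMPLETED) can fill in the skipped states by following predecessors backwards. The sketch below illustrates this idea; the PREV table is an assumed reading of the state diagram (FAILED and CANCELED have no unique predecessor, as stated above), and missing_states is a hypothetical helper, not JobStateOrder itself.

```python
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    """Hypothetical mirror of JobState; CANCELED's value is assumed."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


# Assumed table: the unique predecessor of each state, or None if there is none.
PREV: Dict[State, Optional[State]] = {
    State.NEW: None,
    State.QUEUED: State.NEW,
    State.ACTIVE: State.QUEUED,
    State.COMPLETED: State.ACTIVE,
    State.FAILED: None,
    State.CANCELED: None,
}


def missing_states(last_seen: State, new_state: State) -> List[State]:
    """States that must have occurred between last_seen and new_state, oldest first."""
    chain: List[State] = []
    prev = PREV[new_state]
    # Walk backwards until the chain ends or reaches what was already observed.
    while prev is not None and prev.value > last_seen.value:
        chain.append(prev)
        prev = PREV[prev]
    return list(reversed(chain))


print(missing_states(State.NEW, State.COMPLETED))
```

For FAILED or CANCELED nothing can be reconstructed, since those states have no unique predecessor.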
psij.job_status module
- class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)
Bases: object
A class containing details about job transitions to new states.
Constructs a JobStatus object.
- Parameters
time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If None, the current time will be used.
message (Optional[str]) – An optional message associated with the transition.
exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.
metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.
- Return type
None
psij.launcher module
This module contains the core classes of the launchers infrastructure.
- class Launcher(config=None)
Bases: ABC
An abstract base class for all launchers.
Base constructor for launchers.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.
- Return type
None
- DEFAULT_LAUNCHER_NAME = 'single'
- static get_instance(name, version_constraint=None, config=None)
Returns an instance of a launcher, optionally configured using a given configuration.
The returned instance may or may not be a singleton object.
- abstract get_launch_command(job)
Constructs a command to launch a job given a job specification.
- abstract get_launcher_failure_message(output)
Extracts the launcher error message from the output of this launcher’s invocation.
This method assumes that is_launcher_failure() returns True for the given output.
- abstract is_launcher_failure(output)
Determines whether the launcher invocation output contains a launcher failure or not.
- static register_launcher(desc, root)
Registers a launcher class.
The registered class can then be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the launcher to register.
root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.
- Return type
None
psij.resource_spec module
- class ResourceSpec
Bases: ABC
A base class for resource specifications.
The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.
- class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=True)
Bases: ResourceSpec
This class implements V1 of the PSI/J resource specification.
Constructs a ResourceSpecV1 object and optionally initializes its properties.
Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.
- Parameters
node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.
process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.
processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.
cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.
exclusive_node_use (bool) –
- Return type
None
- property computed_node_count: int
Returns or calculates a node count.
If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.
- Returns
An integer value with the specified or calculated node count.
- property computed_process_count: int
Returns or calculates a process count.
If the process_count property is specified, this method returns it, otherwise it returns 1.
- Returns
An integer value with either the value of process_count or one if the former is not specified.
- property computed_processes_per_node: int
Returns or calculates the number of processes per node.
If the processes_per_node property is specified, this method returns it, otherwise calculates it based on process_count and node_count if possible, or defaults to 1.
- Returns
An integer value with either the value of processes_per_node or one if the former cannot be determined.
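The constraint process_count = node_count * processes_per_node means any two of the three values determine the third. Below is a minimal sketch of that inference, written as a hypothetical free function rather than ResourceSpecV1's own code. Requiring exact divisibility, and leaving values as None when fewer than two are given, are assumptions made here for clarity; the class above may round or apply the defaults described for the computed_* properties instead.

```python
from typing import Optional, Tuple

Counts = Tuple[Optional[int], Optional[int], Optional[int]]


def infer_counts(node_count: Optional[int] = None,
                 process_count: Optional[int] = None,
                 processes_per_node: Optional[int] = None) -> Counts:
    """Return (node_count, process_count, processes_per_node), filling in the third value."""
    n, p, ppn = node_count, process_count, processes_per_node
    if n is not None and p is not None and ppn is not None:
        # All three given: they must satisfy the constraint exactly.
        if p != n * ppn:
            raise ValueError(f"process_count ({p}) != node_count ({n}) * processes_per_node ({ppn})")
    elif n is not None and ppn is not None:
        p = n * ppn
    elif p is not None and ppn is not None:
        if p % ppn:
            raise ValueError("process_count is not a multiple of processes_per_node")
        n = p // ppn
    elif p is not None and n is not None:
        if p % n:
            raise ValueError("process_count is not a multiple of node_count")
        ppn = p // n
    # With fewer than two values there is nothing to infer; leave the rest to the backend.
    return n, p, ppn


print(infer_counts(node_count=2, processes_per_node=4))
```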
psij.serialize module
- class Export
Bases: object
A class for exporting psij data types.
Initializes an export object.
- Return type
None
psij.utils module
psij.version module
Set module version.
Versions have the form <Major>.<Minor>.<Maintenance>[alpha/beta/...]. Alphas are numbered like this: 1.0.0-a0.
Module contents
The package containing the jobs module of this PSI implementation.
- class Export
Bases: object
A class for exporting psij data types.
Initializes an export object.
- Return type
None
- class Import
Bases: object
A class for importing psij data types.
- exception InvalidJobException(message, exception=None)
Bases: Exception
An exception describing a problem with a job specification.
Constructs an InvalidJobException while allowing properties to be initialized.
- Parameters
- Return type
None
- exception
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- class Job(spec=None)
Bases: object
This class represents a PSI/J job.
It encapsulates all of the information needed to run a job as well as the job’s state.
Constructs a Job object.
The object can optionally be initialized with the given JobSpec. After construction, the job will be in the NEW state.
- cancel()
Cancels this job.
The job is canceled by calling cancel() on the job executor that was used to submit this job.
- Raises
SubmitException – if the job has not yet been submitted.
- Return type
None
- property id: str
This job’s ID, read-only.
The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.
- property native_id: Optional[str]
The ID of this job according to the underlying LRM, read-only.
The native ID may not be available until after the job is submitted to a JobExecutor, in which case the attribute is None.
- set_job_status_callback(cb)
Registers a status callback with this job.
The callback can either be a subclass of JobStatusCallback or a function accepting two arguments, a Job and a JobStatus, and returning nothing.
The callback will be invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. To remove the callback, set it to None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, returning nothing.
- Return type
None
- spec¶
The job specification for this job. A valid job requires a valid specification.
- property status: JobStatus¶
Returns the current status of the job.
It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobState values. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1, provided both calls are placed from the same thread or a proper memory barrier is placed between them. Furthermore, the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.
- Returns
the current state of this job
- wait(timeout=None, target_states=None)[source]¶
Waits for the job to reach certain states.
This method returns either when the job reaches one of the target_states or when the amount of time indicated by the timeout parameter, if specified, passes. It returns the JobStatus object that has one of the desired target_states, or None if the timeout is reached. If none of the states in target_states can be reached (for example, because the job has entered the FAILED state while target_states consists of COMPLETED), this method throws an UnreachableStateException.
- Parameters
timeout – An optional maximum amount of time to wait.
target_states – The states to wait for. If not specified, the call waits for the job to reach any final state.
- Returns
The JobStatus object that caused this call to complete, or None if the timeout is specified and reached.
- Return type
Optional[JobStatus]
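The wait semantics above (return on a target state, return None on timeout, raise once the targets are unreachable) can be sketched with a condition variable. This is a minimal, self-contained illustration, not PSI/J's implementation: State, UnreachableState and WaitableJob are hypothetical stand-ins, timeout is assumed to be in seconds, wait() returns a state rather than a JobStatus, and unreachability is only detected once the job sits in a final state that is not a target.

```python
import threading
import time
from enum import Enum
from typing import Iterable, Optional


class State(Enum):
    """Hypothetical stand-in for JobState (CANCELED's value is assumed)."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5

    @property
    def final(self) -> bool:
        return self in (State.COMPLETED, State.FAILED, State.CANCELED)


class UnreachableState(Exception):
    """Hypothetical stand-in for UnreachableStateException."""


class WaitableJob:
    """Sketch of wait(timeout, target_states); not PSI/J code."""

    def __init__(self) -> None:
        self._state = State.NEW
        self._cv = threading.Condition()

    def set_state(self, state: State) -> None:
        # Called by whatever tracks the job (e.g. an executor thread).
        with self._cv:
            self._state = state
            self._cv.notify_all()

    def wait(self, timeout: Optional[float] = None,
             target_states: Optional[Iterable[State]] = None) -> Optional[State]:
        # With no explicit targets, wait for any final state.
        if target_states is None:
            targets = {s for s in State if s.final}
        else:
            targets = set(target_states)
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cv:
            while True:
                if self._state in targets:
                    return self._state
                if self._state.final:
                    # A final state is never left, so the targets are now unreachable.
                    raise UnreachableState(self._state)
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return None  # timed out
                self._cv.wait(remaining)


if __name__ == "__main__":
    job = WaitableJob()
    # Simulate the backend completing the job shortly after we start waiting.
    threading.Timer(0.05, job.set_state, args=(State.COMPLETED,)).start()
    print(job.wait(timeout=5))
```

Using a single Condition means every state change wakes all waiters, which then re-check their own targets and deadlines.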
- class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, project_name=None, reservation_id=None, custom_attributes=None)[source]¶
Bases:
object
A class containing ancillary job information that describes how a job is to be run.
Constructs a JobAttributes instance while allowing its various fields to be initialized.
- Parameters
duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.
queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.
project_name (Optional[str]) – If a backend supports multiple projects for billing purposes, setting this attribute instructs the backend to bill the indicated project for the resources consumed by this job.
reservation_id (Optional[str]) – Allows specifying an advance reservation ID. Advance reservations pre-allocate a set of resources (such as compute nodes) for a certain duration so that jobs can be run immediately, without waiting in the queue for resources to become available.
custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes.
- Return type
None
- class JobExecutor(url=None, config=None)[source]¶
Bases:
ABC
An abstract base class for all JobExecutor implementations.
Initializes this executor using an optional url and an optional configuration.
- Parameters
url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.
config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional so that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.
- abstract cancel(job)[source]¶
Cancels a job that has been submitted to the underlying executor implementation.
A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(target_states=[JobState.CANCELED]) or even job.wait(), since the latter waits for any final state, including JobState.CANCELED. In fact, job.wait() is recommended, because the job may complete before the cancellation request reaches the underlying implementation and before the client code receives the completion notification. In such a case, the job never enters the CANCELED state, and job.wait(target_states=[JobState.CANCELED]) would hang indefinitely.
- Parameters
job (Job) – The job to be canceled.
- Raises
SubmitException – Thrown if the request cannot be sent to the underlying implementation.
- Return type
None
- static get_executor_names()[source]¶
Returns a set of registered executor names.
Names returned by this method can be passed to get_instance() as the name parameter.
- static get_instance(name, version_constraint=None, url=None, config=None)[source]¶
Returns an instance of a JobExecutor.
- Parameters
name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(); otherwise, ValueError is raised.
version_constraint (Optional[str]) – A version constraint for the executor in the form ‘(’ <op> <version>[, <op> <version>[, …]] ‘)’, such as “( > 0.0.2, != 0.0.4)”.
url (Optional[str]) – An optional URL to pass to the JobExecutor instance.
config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.
- Returns
A JobExecutor.
- Return type
JobExecutor
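The version_constraint grammar above can be illustrated with a small parser. This is a sketch under assumptions, not PSI/J's parser: the operator set (>, >=, <, <=, ==, !=), the purely numeric dotted versions, and the helper names parse_constraint and satisfies are all hypothetical.

```python
import operator
import re
from typing import Callable, List, Tuple

Version = Tuple[int, ...]

# Assumed operator set; the documented example only uses ">" and "!=".
_OPS: dict = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}
_CLAUSE = re.compile(r"\s*(>=|<=|==|!=|>|<)\s*(\d+(?:\.\d+)*)\s*")


def _parse_version(text: str) -> Version:
    # Versions are compared as integer tuples, a simplification:
    # "1.0" and "1.0.0" therefore compare as different versions.
    return tuple(int(part) for part in text.split("."))


def parse_constraint(text: str) -> List[Tuple[Callable[[Version, Version], bool], Version]]:
    body = text.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError(f"constraint must be parenthesized: {text!r}")
    clauses = []
    for clause in body[1:-1].split(","):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"malformed clause: {clause!r}")
        clauses.append((_OPS[match.group(1)], _parse_version(match.group(2))))
    return clauses


def satisfies(version: str, constraint: str) -> bool:
    """True if version meets every clause of the constraint."""
    v = _parse_version(version)
    return all(op(v, bound) for op, bound in parse_constraint(constraint))


print(satisfies("0.0.3", "( > 0.0.2, != 0.0.4)"))  # True
```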
- static register_executor(desc, root)[source]¶
Registers a JobExecutor class through a Descriptor. The class can later be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the executor to be registered.
root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- set_job_status_callback(cb)[source]¶
Registers a status callback with this executor.
The callback can either be an instance of a JobStatusCallback subclass or a function accepting two arguments, a Job and a JobStatus, and returning nothing. The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, pass None.
- Parameters
cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job and job_status of type JobStatus, returning nothing.
- Return type
None
- abstract submit(job)[source]¶
Submits a Job to the underlying implementation.
Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.
- Raises
InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.
SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.
- Parameters
job (Job) – The job to be submitted.
- Return type
None
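To make the fail-fast advice above concrete, here is a sketch of early validation that an implementation might run before contacting a backend. Spec, InvalidJob, validate and the KNOWN_LAUNCHERS set are hypothetical stand-ins; the launcher names are borrowed from the examples in the JobSpec launcher parameter. The point is the split: problems that would recur on every resubmission raise an InvalidJobException-style error immediately, while problems sending the request are left to SubmitException.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class InvalidJob(Exception):
    """Hypothetical stand-in for InvalidJobException."""


@dataclass
class Spec:
    """Hypothetical, trimmed-down stand-in for JobSpec."""
    executable: Optional[str] = None
    arguments: List[str] = field(default_factory=list)
    launcher: Optional[str] = None


# Assumed registry of launcher names, for illustration only.
KNOWN_LAUNCHERS = {"single", "mpirun", "srun"}


def validate(spec: Spec) -> None:
    """Reject specifications that would fail identically on every resubmission."""
    if not spec.executable:
        raise InvalidJob("the job specification has no executable")
    if not all(isinstance(arg, str) for arg in spec.arguments):
        raise InvalidJob("all arguments must be strings")
    if spec.launcher is not None and spec.launcher not in KNOWN_LAUNCHERS:
        raise InvalidJob(f"unknown launcher: {spec.launcher!r}")


validate(Spec(executable="/bin/date", arguments=["-u"]))  # passes silently
```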
- property version: distutils.version.Version¶
Returns the version of this executor.
- class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]¶
Bases:
object
An abstract configuration class for JobExecutor instances.
Initializes a configuration object.
- Parameters
launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- Return type
None
- DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>¶
- DEFAULT_WORK_DIRECTORY = PosixPath('/home/docs/.psij/work')¶
- property launcher_log_file: Optional[Path]¶
Configure the executor’s launcher log file.
- Parameters
launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
- property work_directory: Path¶
Configure the executor’s work directory.
- Parameters
work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.
- class JobSpec(name=None, executable=None, arguments=None, directory=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None)[source]¶
Bases:
object
A class to hold information about the characteristics of a Job.
Constructs a JobSpec object while allowing its properties to be initialized.
- Parameters
name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.
executable (Optional[str]) – An executable, such as “/bin/date”.
arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.
directory (Optional[Path]) – The directory, on the compute side, in which the executable is to be run.
inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment, and the only environment variables accessible to the job are the ones specified by the environment property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.
environment (Optional[Dict[str, str]]) – A mapping of environment variable names to their respective values.
stdin_path (Optional[Path]) – Path to a file whose contents will be sent to the job’s standard input.
stdout_path (Optional[Path]) – A path to a file in which to place the standard output stream of the job.
stderr_path (Optional[Path]) – A path to a file in which to place the standard error stream of the job.
resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.
attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.
pre_launch (Optional[Path]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
post_launch (Optional[Path]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see the launchers section.
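The executable/arguments and environment rules of JobSpec can be mirrored for a local process with the standard library. This is a sketch of the documented semantics, not how any PSI/J executor builds its command line; build_argv and build_env are hypothetical helpers, and letting explicitly specified variables override inherited ones is an assumption.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def build_argv(executable: str, arguments: Optional[List[str]] = None) -> List[str]:
    # The executable fills argv[0], so arguments[0] is what the program sees as argv[1].
    return [executable] + list(arguments or [])


def build_env(environment: Optional[Dict[str, str]] = None,
              inherit_environment: bool = True) -> Dict[str, str]:
    # inherit_environment=False starts from an empty environment.
    # Explicitly specified variables override inherited ones (an assumption).
    env = dict(os.environ) if inherit_environment else {}
    env.update(environment or {})
    return env


if __name__ == "__main__":
    argv = build_argv(sys.executable, ["-c", "import os; print(os.environ['GREETING'])"])
    env = build_env({"GREETING": "hello"})
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    print(result.stdout.strip())
```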
- class JobState(value)[source]¶
An enumeration holding the possible job states.
The possible states are: NEW, QUEUED, ACTIVE, COMPLETED, FAILED, and CANCELED.
- ACTIVE = 2¶
This state represents an actively running job.
- COMPLETED = 3¶
This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.
- FAILED = 4¶
Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.
- NEW = 0¶
This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.
- QUEUED = 1¶
This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.
- property final: bool¶
Returns True if this state is final.
A state is final when no other state transition can occur after that state has been reached.
- Returns
True if this is a final state and False otherwise
- is_greater_than(other)[source]¶
Defines a (strict) partial ordering on the states.
Not all states are comparable. State transitions cannot violate this ordering.
- Parameters
other (JobState) – the other JobState to compare to
- Returns
if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.
- Return type
Optional[bool]
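The partial ordering described by is_greater_than() can be sketched as follows. The shape of the ordering is an assumption consistent with the text above (NEW < QUEUED < ACTIVE < every final state, with distinct final states mutually incomparable); State is a hypothetical stand-in, and CANCELED's numeric value is also assumed, since it is not listed here. is_valid_transition shows how such an ordering can enforce the monotonic status guarantee described for Job.status.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical stand-in for JobState."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5  # value assumed

    @property
    def final(self) -> bool:
        return self in (State.COMPLETED, State.FAILED, State.CANCELED)

    def _rank(self) -> int:
        # All final states share the top rank.
        return 3 if self.final else self.value

    def is_greater_than(self, other: "State") -> Optional[bool]:
        if self.final and other.final and self is not other:
            return None  # two different final states are not comparable
        return self._rank() > other._rank()


def is_valid_transition(current: State, new: State) -> bool:
    # A status update must move strictly up the partial order.
    return new.is_greater_than(current) is True
```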
- class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]¶
Bases:
object
A class containing details about job transitions to new states.
Constructs a JobStatus object.
- Parameters
time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If None, the current time will be used.
message (Optional[str]) – An optional message associated with the transition.
exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.
metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.
- Return type
None
- class JobStatusCallback[source]¶
Bases:
ABC
An interface used to listen to job status change events.
- abstract job_status_changed(job, job_status)[source]¶
This method is invoked when a status change occurs on a job.
Client code interested in receiving status notifications must implement this method. It is entirely possible that psij.Job.status, when referenced from the body of this method, returns something different from the status passed to this callback. This is because the status of the job can be updated during the execution of the body of this method and, in particular, before the potential dereference to psij.Job.status is made.
Client code implementing this method must return quickly and should not be used for lengthy processing. Furthermore, client code implementing this method should not throw exceptions.
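One way to honor the "return quickly, never raise" rule is to hand each notification off to a queue and process it elsewhere. In this sketch, StatusCallback, QueueingCallback and drain are hypothetical; with PSI/J itself, the subclass would instead derive from JobStatusCallback and be passed to set_job_status_callback() on a Job or JobExecutor. The callback records the job_status argument rather than re-reading job.status, for the reason given above.

```python
import queue
from abc import ABC, abstractmethod
from typing import Any, List, Tuple


class StatusCallback(ABC):
    """Hypothetical stand-in with the same shape as JobStatusCallback."""

    @abstractmethod
    def job_status_changed(self, job: Any, job_status: Any) -> None:
        ...


class QueueingCallback(StatusCallback):
    """Defers all real work: the notifying thread only pays for a queue put."""

    def __init__(self) -> None:
        self.events: "queue.Queue[Tuple[Any, Any]]" = queue.Queue()

    def job_status_changed(self, job: Any, job_status: Any) -> None:
        try:
            # Store the status we were given; job.status may already be newer.
            self.events.put_nowait((job, job_status))
        except Exception:
            # Never let an exception escape into the caller's notification path.
            pass


def drain(cb: QueueingCallback) -> List[Tuple[Any, Any]]:
    """Collect pending notifications, e.g. from a worker thread or main loop."""
    items = []
    while True:
        try:
            items.append(cb.events.get_nowait())
        except queue.Empty:
            return items
```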
- class Launcher(config=None)[source]¶
Bases:
ABC
An abstract base class for all launchers.
Base constructor for launchers.
- Parameters
config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, JobExecutorConfig.DEFAULT is used.
- Return type
None
- DEFAULT_LAUNCHER_NAME = 'single'¶
- static get_instance(name, version_constraint=None, config=None)[source]¶
Returns an instance of a launcher optionally configured using a certain configuration.
The returned instance may or may not be a singleton object.
- abstract get_launch_command(job)[source]¶
Constructs a command to launch a job given a job specification.
- abstract get_launcher_failure_message(output)[source]¶
Extracts the launcher error message from the output of this launcher’s invocation.
It is understood that the output is such that is_launcher_failure() returns True on it.
- abstract is_launcher_failure(output)[source]¶
Determines whether the launcher invocation output contains a launcher failure or not.
- static register_launcher(desc, root)[source]¶
Registers a launcher class.
The registered class can then be instantiated using get_instance().
- Parameters
desc (Descriptor) – A Descriptor with information about the launcher to register.
root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.
- Return type
None
- class ResourceSpec[source]¶
Bases:
ABC
A base class for resource specifications.
The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.
- class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=True)[source]¶
Bases:
ResourceSpec
This class implements V1 of the PSI/J resource specification.
Constructs a ResourceSpecV1 object and optionally initializes its properties.
Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.
- Parameters
node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.
process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.
processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.
cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.
exclusive_node_use (bool) –
- Return type
None
- property computed_node_count: int¶
Returns or calculates a node count.
If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.
- Returns
An integer value with the specified or calculated node count.
- property computed_process_count: int¶
Returns or calculates a process count.
If the process_count property is specified, this method returns it, otherwise it returns 1.
- Returns
An integer value with either the value of process_count or one if the former is not specified.
- property computed_processes_per_node: int¶
Returns or calculates the number of processes per node.
If the processes_per_node property is specified, this method returns it; otherwise, it calculates it based on process_count and node_count if possible, or defaults to 1.
- Returns
An integer value with either the value of processes_per_node or one if the former cannot be determined.
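The constraint process_count = node_count * processes_per_node, and the inference of a missing value from the other two, can be sketched as a standalone function. infer_counts is hypothetical and enforces the constraint exactly, so counts that do not divide evenly are rejected; ResourceSpecV1 and executors may handle such cases differently, for example by delegating inference to the underlying LRM as noted above.

```python
from typing import Optional, Tuple

Counts = Tuple[Optional[int], Optional[int], Optional[int]]


def infer_counts(node_count: Optional[int] = None,
                 process_count: Optional[int] = None,
                 processes_per_node: Optional[int] = None) -> Counts:
    """Return (node_count, process_count, processes_per_node), filling in
    whichever single value can be derived from the other two."""
    n, p, ppn = node_count, process_count, processes_per_node
    if n is not None and p is not None and ppn is not None:
        # All three given: they must agree with the constraint.
        if p != n * ppn:
            raise ValueError(f"process_count ({p}) != node_count ({n}) * processes_per_node ({ppn})")
    elif n is not None and ppn is not None:
        p = n * ppn
    elif p is not None and ppn is not None:
        if p % ppn:
            raise ValueError("process_count is not a multiple of processes_per_node")
        n = p // ppn
    elif p is not None and n is not None:
        if p % n:
            raise ValueError("process_count is not a multiple of node_count")
        ppn = p // n
    # With fewer than two values given, nothing can be inferred.
    return n, p, ppn
```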
- exception SubmitException(message, exception=None, transient=False)[source]¶
Bases:
Exception
An exception representing job submission issues.
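This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.
Constructs a SubmitException and allows properties to be initialized.
- Parameters
message – A descriptive message suitable for presentation to an end-user; see message.
exception – An optional underlying exception, useful for debugging; see exception.
transient – Whether the condition that caused the failure is transient; see transient.
- Return type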
None
- exception¶
Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.
- message¶
Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.
- transient¶
Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error, since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
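Client code can act on the transient flag with a bounded retry loop. This is a sketch, not part of PSI/J: SubmitError is a hypothetical stand-in mirroring the message, exception and transient fields described above, and the attempt count and exponential backoff are arbitrary choices. Against a real executor, the except clause would catch SubmitException instead, for example when wrapping lambda: executor.submit(job).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SubmitError(Exception):
    """Hypothetical stand-in mirroring SubmitException's fields."""

    def __init__(self, message: str, exception: Optional[BaseException] = None,
                 transient: bool = False) -> None:
        super().__init__(message)
        self.message = message
        self.exception = exception
        self.transient = transient


def submit_with_retry(submit: Callable[[], T], attempts: int = 4,
                      base_delay: float = 0.5) -> T:
    """Call submit(), retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return submit()
        except SubmitError as e:
            # Non-transient failures (e.g. bad credentials) will not improve on retry.
            if not e.transient or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Re-raising immediately on a non-transient error keeps the retry loop from hammering a service with credentials that will keep failing.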
- exception UnreachableStateException(status)[source]¶
Bases:
Exception
Indicates that a job state being waited for cannot be reached.
This exception is thrown when the wait() method is called with a set of states that cannot be reached by the job when the call is made.
Constructs an UnreachableStateException.