EXIT CODE LIST OF S.o.G.E

Table 7.1  Job-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
-------------    ------------------    -----------
Job script       0                     Success
                 99                    Requeue
                 Rest                  Success: exit code in accounting file
prolog/epilog    0                     Success
                 99                    Requeue
                 Rest                  Queue error state, job requeued
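The rules in Table 7.1 can be read as a simple lookup. The Python sketch below is only an illustration: the names `TABLE_7_1`, `REST` and `consequence` are invented here and are not part of Grid Engine, and it assumes that the "Success" rows of the table correspond to exit code 0.

```python
# Illustrative lookup for Table 7.1 (hypothetical names, not a Grid Engine API).
# Assumption: the "Success" rows of the table stand for exit code 0.
TABLE_7_1 = {
    "job script": {0: "Success", 99: "Requeue"},
    "prolog/epilog": {0: "Success", 99: "Requeue"},
}

# Consequence for every other exit code (the "Rest" rows).
REST = {
    "job script": "Success: exit code in accounting file",
    "prolog/epilog": "Queue error state, job requeued",
}


def consequence(script: str, exit_code: int) -> str:
    """Return the Table 7.1 consequence for a script type and exit code."""
    return TABLE_7_1[script].get(exit_code, REST[script])
```

For example, `consequence("prolog/epilog", 1)` falls through to the "Rest" row, while `consequence("job script", 99)` gives "Requeue".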

The following table lists the consequences of error codes or exit codes of jobs related to parallel environment (PE) configuration.

Table 7.2 Parallel-Environment-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
-------------    ------------------    -----------
pe_start         0                     Success
                 Rest                  Queue set to error state, job requeued
pe_stop          0                     Success
                 Rest                  Queue set to error state, job not requeued

The following table lists the consequences of error codes or exit codes of jobs related to queue configuration. These codes are valid only if corresponding methods were overwritten.

Table 7.3 Queue-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
-------------    ------------------    -----------
Job starter      0                     Success
                 Rest                  Success, no other special meaning
Suspend          0                     Success
                 Rest                  Success, no other special meaning
Resume           0                     Success
                 Rest                  Success, no other special meaning
Terminate        0                     Success
                 Rest                  Success, no other special meaning

The following table lists the consequences of error or exit codes of jobs related to checkpointing.

Table 7.4 Checkpointing-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
-------------    ------------------    -----------
Checkpoint       0                     Success
                 Rest                  Success. For kernel checkpoint, however, this
                                       means that the checkpoint was not successful.
Migrate          0                     Success
                 Rest                  Success. For kernel checkpoint, however, this
                                       means that the checkpoint was not successful.
                                       Migration will occur.
Restart          0                     Success
                 Rest                  Success, no other special meaning
Clean            0                     Success
                 Rest                  Success, no other special meaning

For jobs that run successfully, the qacct -j command output shows a value of 0 in the failed field, and the output shows the exit status of the job in the exit_status field. However, the shepherd might not be able to run a job successfully. For example, the epilog script might fail, or the shepherd might not be able to start the job. In such cases, the failed field displays one of the code values listed in the following table.

Table 7.5 qacct -j failed Field Codes

Code  Description                     acctvalid  Meaning for Job
----  ------------------------------  ---------  ---------------
0     No failure                                 Job ran, exited normally
1     Presumably before job                      Job could not be started
3     Before writing config                      Job could not be started
4     Before writing PID                         Job could not be started
5     On reading config file                     Job could not be started
6     Setting processor set                      Job could not be started
7     Before prolog                              Job could not be started
8     In prolog                                  Job could not be started
9     Before pestart                             Job could not be started
10    In pestart                                 Job could not be started
11    Before job                                 Job could not be started
12    Before pestop                              Job ran, failed before calling PE stop procedure
13    In pestop                                  Job ran, PE stop procedure failed
14    Before epilog                              Job ran, failed before calling epilog script
15    In epilog                                  Job ran, failed in epilog script
16    Releasing processor set                    Job ran, processor set could not be released
24    Migrating (checkpointing jobs)             Job ran, job will be migrated
25    Rescheduling                               Job ran, job will be rescheduled
26    Opening output file                        Job could not be started, stderr/stdout file could not be opened
27    Searching requested shell                  Job could not be started, shell not found
28    Changing to working directory              Job could not be started, error changing to start directory
100   Assumedly after job                        Job ran, job killed by a signal

Note on code 25 (at cnsr.ch): typically seen when the node is faulty before the job starts.
---> Job rescheduled, 'queue@node' flag set to [E]rror ==> (maintenance required)

Exit code 127: this code deserves a special note. Normally, exit 127 comes from an access error. When a job ends with 127, the main problem is usually the x (execute) flag on a file. Many beginners do not really understand the Unix rwx flags, so first check whether the script called in the job file has the x flag. (A POSIX shell itself returns 127 when a command cannot be found and 126 when it is found but is not executable, so check the path as well.)

If you work with confidential, critical, or licensed data, file and directory permissions are very important:

- if you are the owner, for example: rwx --- ---
- if you are not the owner: rwx r-x --- or rwx r-s --- or, risky, rws r-x ---

The first rwx is the owner's permission, the second is for group members, and the third is for everybody (including guest, nobody, etc.). The s flag runs the file with a change of user or group; take care with data and file permissions when using this flag.

Some of our users share scripts within the same group with permissions rwx r-- ---. You probably see the problem: no x for the group ==> exit 127.

Exit codes bigger than 128: exit codes lower than 128 are SGE reserved. When a job is killed and exits with a code such as 146, subtract 128 to retrieve the original code, which is the number of the signal that killed the job. Example: exit code 146 ==> 146 - 128 = 18 ... 18 is the real exit code. [L.Tomas]
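Both checks described above, the group x flag and the 128 offset, are easy to script. The following Python sketch is a minimal illustration using only the standard library; the function names `group_can_execute` and `decode_exit_status` are made up for this example.

```python
import stat


def group_can_execute(mode: int) -> bool:
    """True if the group execute (x) bit is set in a st_mode value."""
    return bool(mode & stat.S_IXGRP)


def decode_exit_status(status: int):
    """Split an exit status into (kind, value) using the 128 offset.

    Values above 128 are treated as 128 + signal number, as in the
    146 - 128 = 18 example; anything else is returned unchanged.
    """
    if status > 128:
        return ("signal", status - 128)
    return ("exit", status)


# rwx r-- --- is octal 0o740: no x for the group, so group members cannot run it.
# rwx r-x --- is octal 0o750: group members can run it.
```

To test a real file, pass the mode from `os.stat(path).st_mode`, where `path` is whatever script the job calls.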


The Code column lists the value of the failed field. The Description column lists the text that appears in the qacct -j output. If acctvalid is set to t, the job accounting values are valid. If acctvalid is set to f, the resource usage values of the accounting record are not valid. The Meaning for Job column indicates whether the job ran or not.
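As a rough illustration of how the failed and exit_status fields might be read from saved `qacct -j` output, here is a minimal Python sketch. The sample text is invented for this example and real output contains many more fields; the names `parse_qacct`, `describe` and `FAILED_MEANING` are hypothetical. The lookup only covers the codes that are printed with explicit numbers in Table 7.5, plus 0 for a successful run.

```python
import re

# Subset of Table 7.5: failed code -> meaning for the job.
FAILED_MEANING = {
    0: "Job ran, exited normally",
    10: "Job could not be started",
    11: "Job could not be started",
    12: "Job ran, failed before calling PE stop procedure",
    13: "Job ran, PE stop procedure failed",
    14: "Job ran, failed before calling epilog script",
    15: "Job ran, failed in epilog script",
    16: "Job ran, processor set could not be released",
    24: "Job ran, job will be migrated",
    25: "Job ran, job will be rescheduled",
    26: "Job could not be started, stderr/stdout file could not be opened",
    27: "Job could not be started, shell not found",
    28: "Job could not be started, error changing to start directory",
    100: "Job ran, job killed by a signal",
}

# Invented sample; the layout is an assumption about how the fields are printed.
SAMPLE = """\
jobnumber    4242
failed       25 : rescheduling
exit_status  0
"""


def parse_qacct(text: str) -> dict:
    """Extract the numeric failed and exit_status values from the text."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in ("failed", "exit_status"):
            # The number may be followed by a description after a colon.
            match = re.match(r"\d+", parts[1])
            if match:
                fields[parts[0]] = int(match.group())
    return fields


def describe(text: str) -> str:
    """Return the Table 7.5 meaning for the failed code found in the text."""
    failed = parse_qacct(text).get("failed")
    return FAILED_MEANING.get(failed, "code not covered by this sketch")
```

With the sample above, `describe(SAMPLE)` looks up code 25 and reports that the job will be rescheduled.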


NOTE [LT]: If you run an integrity test of the compute node, for example in the prolog, consider exiting with code 25 when the test fails. As a result:
- the job will be rescheduled on another node;
- the [E]rror flag is set on the faulty node (queue).
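A prolog along the lines of this note could look like the Python sketch below. It is only an outline under stated assumptions: the integrity test shown (scratch directory writable with enough free space) is a made-up example, and the names `node_is_healthy` and `prolog_exit_code` are hypothetical. Only the exit value 25 comes from the note itself.

```python
import os
import shutil

RESCHEDULE_EXIT_CODE = 25  # value suggested in the note above


def node_is_healthy(scratch_dir: str, min_free_bytes: int) -> bool:
    """Hypothetical integrity test: scratch directory is writable and has space."""
    if not os.path.isdir(scratch_dir) or not os.access(scratch_dir, os.W_OK):
        return False
    return shutil.disk_usage(scratch_dir).free >= min_free_bytes


def prolog_exit_code(scratch_dir: str, min_free_bytes: int) -> int:
    """Return 0 when the node passes the test, 25 otherwise."""
    if node_is_healthy(scratch_dir, min_free_bytes):
        return 0
    return RESCHEDULE_EXIT_CODE


# A prolog script would end with something like:
#   sys.exit(prolog_exit_code("/scratch", 10 * 1024**3))
# where "/scratch" and the 10 GiB threshold are placeholders for site values.
```

Keeping the test in a function that returns the exit code, rather than calling `sys.exit` directly, makes it possible to try the check by hand on a node before wiring it into the queue configuration.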