
Subsections

2.7.1 Checking on the progress of your jobs
2.7.2 Removing the job from the queue
2.7.3 While your job is running ...
2.7.4 Changing the priority of jobs
2.7.5 Why won't my job run?

2.7 Managing a Condor Job

This section provides a brief summary of what can be done once your jobs begin execution. It introduces the basic mechanisms for monitoring a job, but the commands discussed offer considerably more functionality than is shown here. You are encouraged to consult the man pages of these commands (located in Chapter 5) for more information.

Once your jobs have been submitted, Condor attempts to find resources to run them. Condor announces the existence of your requests to the Condor manager by registering you as a ``submitter.'' A list of all current submitters can be obtained by running condor_status with the -submitters option, which yields output similar to the following:

%  condor_status -submitters

Name                 Machine     Running  IdleJobs  MaxJobsRunning

ashoks@jules.ncsa.ui jules.ncsa       74       54       200
breach@cs.wisc.edu   bianca.cs.       11        0       500
breach@cs.wisc.edu   neufchatel       23        0       500
jbasney@cs.wisc.edu  froth.cs.w        0        1       500
wright@raven.cs.wisc raven.cs.w        1       48       200

                           RunningJobs             IdleJobs

wright@raven.cs.wisc                 1                   48
ashoks@jules.ncsa.ui                74                   54
 jbasney@cs.wisc.edu                 0                    1
  breach@cs.wisc.edu                34                    0

               Total               109                  103

2.7.1 Checking on the progress of your jobs

At any time, you can check on your jobs with the condor_q tool, which displays the status of all queued jobs along with other information. To see the current state of your jobs, type
%  condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
 125.0   jbasney         4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote      
 127.0   raman           4/11 15:35   0+00:00:00 R  0   1.4  hello             
 128.0   raman           4/11 15:35   0+00:02:33 I  0   1.4  hello             

3 jobs; 1 unexpanded, 1 idle, 1 running, 0 malformed
The ST column (for status) shows the status of each job in the queue. ``U'' stands for unexpanded: the job has never checkpointed, and when it starts running it will start from the beginning. ``R'' means the job is currently running. Finally, ``I'' stands for idle: the job has run before and has checkpointed, so when it starts running again it will resume where it left off, but it is not running right now because it is waiting for a machine to become available.

Note: The CPU time reported for a job is the time that has been committed to the job. The CPU time is therefore not updated until the job checkpoints, at which point the job has made guaranteed forward progress. Depending upon how the site administrator has configured the pool, several hours may pass between checkpoints, so do not be alarmed if the reported CPU time does not change for long periods. Also note that this is actual CPU time as reported by the operating system, not wall-clock time.

Another useful way to track the progress of jobs is through the user log mechanism. If you have specified a log command in your submit file, the progress of the job may be followed by viewing the log file. Events such as execution commencement, checkpointing, eviction, and termination are recorded in the file, along with the time at which each event occurred.
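
As a minimal sketch, a submit file that enables the user log might look like the following (the file names here are hypothetical):

executable = hello
output     = hello.out
error      = hello.err
log        = hello.log
queue

Once the job has been submitted, the events recorded in hello.log can be watched as they are appended, for example with a command such as tail -f hello.log.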

2.7.2 Removing the job from the queue

A job can be removed from the queue at any time with the condor_rm command. If the job being removed is currently running, it is killed without a checkpoint and its queue entry is removed. For example:
%  condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
 125.0   jbasney         4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote      
 132.0   raman           4/11 16:57   0+00:00:00 R  0   1.4  hello             

2 jobs; 1 unexpanded, 0 idle, 1 running, 0 malformed

%  condor_rm 132.0
Job 132.0 removed.

%  condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
 125.0   jbasney         4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote      

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed

2.7.3 While your job is running ...

When your job begins to run, Condor starts a condor_shadow process on the submit machine. The shadow process is the mechanism by which a remotely executing job can access the environment from which it was submitted, such as its input and output files.

It is normal for a machine that has submitted hundreds of jobs to have hundreds of shadows running on it. Since the text segments of all these processes are the same, the load on the submit machine is usually not significant. If, however, you notice degraded performance, the number of jobs allowed to run simultaneously can be limited through the MAX_JOBS_RUNNING configuration parameter. Talk to your system administrator about making the necessary configuration change.
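
As an illustrative sketch (the value shown is arbitrary), the administrator would place a line such as the following in the Condor configuration file on the submit machine:

MAX_JOBS_RUNNING = 100

Assuming the condor_reconfig tool is available at your site, running it on the submit machine afterwards causes the new value to take effect.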

You can also find all the machines that are running your jobs with the condor_status command. For example, to find all the machines running jobs submitted by ``breach@cs.wisc.edu,'' type:

%  condor_status -constraint 'RemoteUser == "breach@cs.wisc.edu"'

Name       Arch     OpSys        State      Activity   LoadAv Mem  ActvtyTime

alfred.cs. INTEL    SOLARIS251   Claimed    Busy       0.980  64    0+07:10:02
biron.cs.w INTEL    SOLARIS251   Claimed    Busy       1.000  128   0+01:10:00
cambridge. INTEL    SOLARIS251   Claimed    Busy       0.988  64    0+00:15:00
falcons.cs INTEL    SOLARIS251   Claimed    Busy       0.996  32    0+02:05:03
happy.cs.w INTEL    SOLARIS251   Claimed    Busy       0.988  128   0+03:05:00
istat03.st INTEL    SOLARIS251   Claimed    Busy       0.883  64    0+06:45:01
istat04.st INTEL    SOLARIS251   Claimed    Busy       0.988  64    0+00:10:00
istat09.st INTEL    SOLARIS251   Claimed    Busy       0.301  64    0+03:45:00
...
To find all the machines that are running any job at all, type:
%  condor_status -run

Name       Arch     OpSys        LoadAv RemoteUser           ClientMachine  

adriana.cs INTEL    SOLARIS251   0.980  hepcon@cs.wisc.edu   chevre.cs.wisc.
alfred.cs. INTEL    SOLARIS251   0.980  breach@cs.wisc.edu   neufchatel.cs.w
amul.cs.wi SUN4u    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
anfrom.cs. SUN4x    SOLARIS251   1.023  ashoks@jules.ncsa.ui jules.ncsa.uiuc
anthrax.cs INTEL    SOLARIS251   0.285  hepcon@cs.wisc.edu   chevre.cs.wisc.
astro.cs.w INTEL    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
aura.cs.wi SUN4u    SOLARIS251   0.996  nice-user.condor@cs. chevre.cs.wisc.
balder.cs. INTEL    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
bamba.cs.w INTEL    SOLARIS251   1.574  dmarino@cs.wisc.edu  riola.cs.wisc.e
bardolph.c INTEL    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
...

  
2.7.4 Changing the priority of jobs

In addition to the priorities assigned to each user, Condor also allows each user to assign a priority to each submitted job. These job priorities are local to each queue and range from -20 to +20, with higher values meaning better priority.

The default priority of a job is 0, but it can be changed with the condor_prio command. For example, to change the priority of a job to -15:

%  condor_q raman

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
 126.0   raman           4/11 15:06   0+00:00:00 U  0   0.3  hello             

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed

%  condor_prio -p -15 126.0

%  condor_q raman

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
 126.0   raman           4/11 15:06   0+00:00:00 U  -15 0.3  hello             

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed

It is important to note that these job priorities are completely different from the user priorities assigned by Condor. Job priorities do not impact user priorities; they are only a mechanism for identifying the relative importance of the jobs you have submitted to that specific queue.

2.7.5 Why won't my job run?

Users sometimes find that their jobs do not run. There are several possible reasons why a specific job does not run, including unsatisfied job or machine constraints, bias due to preferences, insufficient priority, and the preemption ``throttle'' implemented by the condor_negotiator to prevent thrashing. Many of these causes can be diagnosed with the -analyze option of condor_q. For example, the following job submitted by user ``jbasney'' did not run for several days.
% condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
 125.0   jbasney         4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote      

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed

Running condor_q's analyzer provided the following information:

%  condor_q 125.0 -analyze

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
---
125.000:  Run analysis summary.  Of 323 resource offers,
          323 do not satisfy the request's constraints
            0 resource offer constraints are not satisfied by this request
            0 are serving equal or higher priority customers
            0 are serving more preferred customers
            0 cannot preempt because preemption has been held
            0 are available to service your request

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = Arch == "INTEL" && OpSys == "SOLARIS251" && 0 && 
  Disk >= ExecutableSize && VirtualMemory >= ImageSize

We see that user ``jbasney'' has inadvertently written a Requirements expression that can never be satisfied: the ... && 0 && ... clause always evaluates to false.
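
One way to inspect the Requirements expression of a queued job directly is to print the job's full ClassAd with condor_q's -l (long) option and look for the Requirements attribute. For the job in this example, that might look like:

%  condor_q -l 125.0 | grep Requirements
Requirements = Arch == "INTEL" && OpSys == "SOLARIS251" && 0 && ...

Comparing this expression against the attributes of the machines in the pool (shown by condor_status -l) will usually reveal the offending clause.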

While the analyzer can diagnose most common problems, there are some situations that it cannot reliably detect, because the information it uses is instantaneous and local. Thus, the analyzer may report that resources are available to service the request, yet the job still does not run. In most of these situations the delay is transient, and the job will run during the next negotiation cycle.

If the problem persists and the analyzer is unable to detect the situation, it may be that the job begins to run but immediately terminates due to some problem. Viewing the job's error and log files (specified in the submit command file) and Condor's SHADOW_LOG file may assist in tracking down the problem. If the cause is still unclear, please contact your system administrator.
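
The location of the SHADOW_LOG file is determined by the local configuration. Assuming the condor_config_val tool is installed at your site, the path can be looked up as follows (the path shown is only an example):

%  condor_config_val SHADOW_LOG
/home/condor/log/ShadowLog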

