Once your jobs have been submitted, Condor will attempt to find resources to run them. The existence of your requests is communicated to the Condor manager by registering you as a ``submitter.'' A list of all the current submitters may be obtained through condor_status with the -submitters option, which yields output similar to the following:
% condor_status -submitters

Name                 Machine    Running IdleJobs MaxJobsRunning

ashoks@jules.ncsa.ui jules.ncsa      74       54            200
breach@cs.wisc.edu   bianca.cs.      11        0            500
breach@cs.wisc.edu   neufchatel      23        0            500
jbasney@cs.wisc.edu  froth.cs.w       0        1            500
wright@raven.cs.wisc raven.cs.w       1       48            200

                     RunningJobs IdleJobs

wright@raven.cs.wisc           1       48
ashoks@jules.ncsa.ui          74       54
 jbasney@cs.wisc.edu           0        1
  breach@cs.wisc.edu          34        0

               Total         109      103
You can monitor the status of your jobs in the queue with condor_q:

% condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED    CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote
 127.0   raman         4/11 15:35   0+00:00:00 R  0   1.4  hello
 128.0   raman         4/11 15:35   0+00:02:33 I  0   1.4  hello

3 jobs; 1 unexpanded, 1 idle, 1 running, 0 malformed
The ST column (for status) shows the status of current jobs in the queue. ``U'' stands for unexpanded, which means that the job has never checkpointed and, when it starts running, it will start from the beginning. ``R'' means the job is currently running. Finally, ``I'' stands for idle, which means the job has run before and has checkpointed; when it starts running again it will resume where it left off, but it is not running right now because it is waiting for a machine to become available.
Note: The CPU time reported for a job is the time that has been committed to the job. Thus, the CPU time is not updated until the job checkpoints, at which point the job has made guaranteed forward progress. Depending upon how the site administrator has configured the pool, several hours may pass between checkpoints, so do not worry if the reported CPU time does not change by the hour. Also note that this is actual CPU time as reported by the operating system, not wall-clock time.
Another useful method of tracking the progress of jobs is through the user log mechanism. If you have specified a log command in your submit file, the progress of the job may be followed by viewing the log file. Various events such as execution commencement, checkpoint, eviction and termination are logged in the file along with the time at which the event occurred.
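For example, if the submit description file contains the line ``log = hello.log,'' the log accumulates one numbered event per occurrence. The event codes below are Condor's standard user-log codes; the job IDs, hosts, and timestamps are illustrative only:

000 (0125.000.000)  4/10 15:35:04 Job submitted from host: <128.105.73.44:33847>
...
001 (0125.000.000)  4/10 16:02:17 Job executing on host: <128.105.165.131:32779>
...
003 (0125.000.000)  4/10 17:10:41 Job was checkpointed.
...
005 (0125.000.000)  4/10 18:12:09 Job terminated.
...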
A job may be removed from the queue at any time with the condor_rm command, using the job ID reported by condor_q. For example:

% condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED    CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote
 132.0   raman         4/11 16:57   0+00:00:00 R  0   1.4  hello

2 jobs; 1 unexpanded, 0 idle, 1 running, 0 malformed

% condor_rm 132.0
Job 132.0 removed.

% condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED    CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed
It is normal for a machine that has submitted hundreds of jobs to have hundreds of shadows running on it. Since the text segments of all these processes are the same, the load on the submit machine is usually not significant. If, however, you notice degraded performance, you can limit the number of jobs that run simultaneously through the MAX_JOBS_RUNNING configuration parameter. Please talk to your system administrator about making the necessary configuration change.
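As an illustration, the administrator could add a line such as the following to the Condor configuration file on the submit machine; the value 100 is arbitrary and shown only as an example:

MAX_JOBS_RUNNING = 100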
You can also find all the machines that are running your jobs through the condor_status command. For example, to find all the machines running jobs submitted by ``breach@cs.wisc.edu,'' type:
% condor_status -constraint 'RemoteUser == "breach@cs.wisc.edu"'

Name       Arch     OpSys      State     Activity LoadAv Mem  ActvtyTime

alfred.cs. INTEL    SOLARIS251 Claimed   Busy     0.980   64  0+07:10:02
biron.cs.w INTEL    SOLARIS251 Claimed   Busy     1.000  128  0+01:10:00
cambridge. INTEL    SOLARIS251 Claimed   Busy     0.988   64  0+00:15:00
falcons.cs INTEL    SOLARIS251 Claimed   Busy     0.996   32  0+02:05:03
happy.cs.w INTEL    SOLARIS251 Claimed   Busy     0.988  128  0+03:05:00
istat03.st INTEL    SOLARIS251 Claimed   Busy     0.883   64  0+06:45:01
istat04.st INTEL    SOLARIS251 Claimed   Busy     0.988   64  0+00:10:00
istat09.st INTEL    SOLARIS251 Claimed   Busy     0.301   64  0+03:45:00
...

To find all the machines that are running any job at all, type:
% condor_status -run

Name       Arch     OpSys      LoadAv RemoteUser           ClientMachine

adriana.cs INTEL    SOLARIS251 0.980  hepcon@cs.wisc.edu   chevre.cs.wisc.
alfred.cs. INTEL    SOLARIS251 0.980  breach@cs.wisc.edu   neufchatel.cs.w
amul.cs.wi SUN4u    SOLARIS251 1.000  nice-user.condor@cs. chevre.cs.wisc.
anfrom.cs. SUN4x    SOLARIS251 1.023  ashoks@jules.ncsa.ui jules.ncsa.uiuc
anthrax.cs INTEL    SOLARIS251 0.285  hepcon@cs.wisc.edu   chevre.cs.wisc.
astro.cs.w INTEL    SOLARIS251 1.000  nice-user.condor@cs. chevre.cs.wisc.
aura.cs.wi SUN4u    SOLARIS251 0.996  nice-user.condor@cs. chevre.cs.wisc.
balder.cs. INTEL    SOLARIS251 1.000  nice-user.condor@cs. chevre.cs.wisc.
bamba.cs.w INTEL    SOLARIS251 1.574  dmarino@cs.wisc.edu  riola.cs.wisc.e
bardolph.c INTEL    SOLARIS251 1.000  nice-user.condor@cs. chevre.cs.wisc.
...
In addition to the priorities assigned to each user, Condor also provides each user with the capability of assigning priorities to each submitted job. These job priorities are local to each queue and range from -20 to +20, with higher values meaning better priority.
The default priority of a job is 0, but it can be changed with the condor_prio command. For example, to change the priority of a job to -15:
% condor_q raman

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED    CPU_USAGE ST PRI SIZE CMD
 126.0   raman         4/11 15:06   0+00:00:00 U  0   0.3  hello

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed

% condor_prio -p -15 126.0

% condor_q raman

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED    CPU_USAGE ST PRI SIZE CMD
 126.0   raman         4/11 15:06   0+00:00:00 U  -15 0.3  hello

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed
It is important to note that these job priorities are completely different from the user priorities assigned by Condor. Job priorities do not impact user priorities; they are simply a mechanism for a user to rank the relative importance of the user's own jobs within a specific queue.
If a job remains in the queue but never runs, the -analyze option of condor_q can help determine why. Consider the following job:

% condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED    CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 U  -10 1.2  hello.remote

1 jobs; 1 unexpanded, 0 idle, 0 running, 0 malformed
Running condor_q's analyzer provided the following information:
% condor_q 125.0 -analyze

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
---
125.000:  Run analysis summary.  Of 323 resource offers,
        323 do not satisfy the request's constraints
          0 resource offer constraints are not satisfied by this request
          0 are serving equal or higher priority customers
          0 are serving more preferred customers
          0 cannot preempt because preemption has been held
          0 are available to service your request

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = Arch == "INTEL" && OpSys == "SOLARIS251" && 0 &&
               Disk >= ExecutableSize && VirtualMemory >= ImageSize
We see that user ``jbasney'' has inadvertently specified a Requirements expression that can never be satisfied: the ... && 0 && ... clause always evaluates to false, so no machine can ever match the request.
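The fix is to remove the stray clause from the requirements line in the submit description file. A corrected line might look like the following sketch; the original submit file is not shown, so the exact intent is assumed here, and condor_submit appends the Disk and VirtualMemory clauses seen in the analyzer output automatically:

requirements = Arch == "INTEL" && OpSys == "SOLARIS251"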
While the analyzer can diagnose most common problems, there are some situations that it cannot reliably detect due to the instantaneous and local nature of the information it uses to detect the problem. Thus, it may be that the analyzer reports that resources are available to service the request, but the job still does not run. In most of these situations, the delay is transient, and the job will run during the next negotiation cycle.
If the problem persists and the analyzer is unable to detect the situation, it may be that the job begins to run but immediately terminates due to some problem. Viewing the job's error and log files (specified in the submit command file) and Condor's SHADOW_LOG file may assist in tracking down the problem. If the cause is still unclear, please contact your system administrator.
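For instance, the location of the shadow log can be looked up with the condor_config_val command; the path shown below is illustrative and varies from installation to installation:

% condor_config_val SHADOW_LOG
/home/condor/log/ShadowLog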