  
3.3 Installing Contrib Modules

This section describes how to install various contrib modules in the Condor system. Some of these modules are separate, optional pieces not included in the main distribution of Condor, such as the checkpoint server or DAGMan. Others are integral parts of Condor taken from the development series that provide features users might want to install, such as the new SMP-aware condor_startd or the CondorView collector. Both of these come automatically with Condor version 6.1 and later. However, if you don't want to switch over to using only the development binaries, you can install these separate modules and continue running most of the stable release at your site.

  
3.3.1 Installing The SMP-Startd Contrib Module

The ``SMP-Startd Contrib module'' is simply a selection of the files needed to run the version 6.1 condor_startd in your existing 6.0 pool. For documentation on the new startd or the supporting files, see the version 6.1 manual.

See section 3.2 on page [*] for complete details on how to install Condor. In particular, you should read the first few sections that discuss release directories, pool layout, and so on.

To install the SMP-startd from the separate contrib module, you must first download the appropriate binary modules for each of your platforms. Once you uncompress and untar the module, you will have a directory with an smp_startd.tar file, a README, and so on. The smp_startd.tar acts much like the release.tar file for a main release. It contains all the binaries and supporting files you would install in your release directory:

        sbin/condor_startd
        sbin/condor_starter
        sbin/condor_preen
        bin/condor_status
        etc/examples/condor_config.local.smp

condor_preen and condor_status are both fully backwards compatible, so you can use the new versions for your entire pool without changing any of your config files. Each has simply been enhanced to handle the SMP startd. See the version 6.1 man pages on each for details. The condor_starter is also backwards compatible, so you probably want to install it pool-wide as well.

The SMP startd is backwards compatible only in the sense that it still runs and works just fine on single-CPU machines. However, it uses different expressions to control its policy, so in this (more important) sense it is not backwards compatible. For this reason, you must have some separate config file settings in effect on machines running the new version. Therefore, you must decide whether to convert all your machines to the new version or only your SMP machines. If you convert just the SMP machines, you can put the new settings in the local config file for each SMP machine. If you convert all your machines, you will want to put the new settings into your global config file.

  
3.3.1.1 Installing Pool-Wide

Since you are installing new daemon binaries for all hosts in your pool, it's generally a good idea to make sure no jobs are running and all the Condor daemons are shut off before you begin. Please see section 3.8 on page [*] for details on how to do this.

You may want to keep your old binaries around, just to be safe. Simply move the existing condor_startd, condor_starter, and condor_preen out of the way (for example, to ``condor_startd.old'') in the sbin directory, and move condor_status out of the way in bin.
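For example (a sketch only; substitute your own release directory for the placeholder path):

        % cd /full/path/to/your/release/directory
        # save the old binaries in case you need to revert
        % mv sbin/condor_startd sbin/condor_startd.old
        % mv sbin/condor_starter sbin/condor_starter.old
        % mv sbin/condor_preen sbin/condor_preen.old
        % mv bin/condor_status bin/condor_status.old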

You can simply untar the smp_startd.tar file into your release directory, and it will install the new versions (and overwrite your existing binaries if you haven't moved them out of the way). Once the new binaries are in place, all you need to do is add the new settings for the SMP startd to your global config file.
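For example (the path to the downloaded smp_startd.tar is a placeholder):

        % cd /full/path/to/your/release/directory
        # untarring here installs sbin/condor_startd, sbin/condor_starter,
        # sbin/condor_preen, bin/condor_status, and the example config file
        % tar xf /full/path/to/smp_startd.tar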

Once the binaries and config settings are in place, you can restart your pool, as described in section 3.8.1 on page [*] on ``Restarting Your Condor Pool''.

  
3.3.1.2 Installing Only on SMP Machines

If you only want to run the new startd on your SMP machines, you should untar the smp_startd.tar file into some temporary location. Copy the sbin/condor_startd file into <release_dir>/sbin/condor_startd.smp. You can simply overwrite <release_dir>/sbin/condor_preen and <release_dir>/bin/condor_status with the new versions. In case you have any currently running condor_starter processes, you should move the existing binary to condor_starter.old with ``mv'' (rather than overwriting it in place) so that you don't get starters that crash with SIGILL or SIGBUS. Once you have moved the existing starter out of the way, you can install the new version from your scratch directory.
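A sketch of these steps follows (the scratch directory and paths are placeholders; <release_dir> stands for your actual release directory):

        # unpack into a scratch directory, not the release directory
        % mkdir /tmp/smp_startd_scratch
        % cd /tmp/smp_startd_scratch
        % tar xf /full/path/to/smp_startd.tar
        # install the startd under a new name; overwrite preen and status
        % cp sbin/condor_startd <release_dir>/sbin/condor_startd.smp
        % cp sbin/condor_preen <release_dir>/sbin/condor_preen
        % cp bin/condor_status <release_dir>/bin/condor_status
        # move the old starter aside before installing the new one
        % mv <release_dir>/sbin/condor_starter <release_dir>/sbin/condor_starter.old
        % cp sbin/condor_starter <release_dir>/sbin/condor_starter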

Once you've got all the new binaries installed, all you need to do is edit the local config file for each SMP host in your pool to add the SMP-specific settings described below. In addition, you will need to add the line:

        STARTD = $(SBIN)/condor_startd.smp
to let the condor_master know you want the new version spawned on that host.

Once the binaries are all in place and the configuration settings are done, you can send a condor_reconfig command to your SMP hosts (from any machine listed in the HOSTALLOW_ADMINISTRATOR setting in your config files). The condor_master should notice the new binaries on the SMP machines and spawn the new startd.
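For example, to reconfigure two hypothetical SMP hosts (the hostnames are placeholders):

        % condor_reconfig smp-host1.your.domain smp-host2.your.domain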

  
3.3.1.3 Notes on SMP Startd configuration

All documentation for the new startd can be found in the version 6.1 manual. The etc/examples/condor_config.local.smp file shows all the new config file settings you must define or change with the new version. Mainly, these are the new policy expressions. Look in the version 6.1 manual, in the ``Configuring The Startd Policy'' section, for complete details on how to configure the policy for the 6.1 startd. In particular, you probably want to read the section titled ``Differences from the Version 6.0 Policy Settings'' to see how the new policy expressions differ from previous versions. These changes are not SMP-specific; they simply make writing more complicated policies much easier. Given the wide range of SMP machines, from dual-CPU desktop workstations up to giant 128-node supercomputers, more flexibility in writing complicated policies is a big help.

In addition to the new policy expressions, there are a few settings that control how the SMP startd's view of the machine state affects each of the virtual machines it is representing. See the section ``Configuring The Startd for SMP Machines'' for full details on configuring these other settings of the SMP startd.

Finally, on SMP machines, each running node has its own condor_starter, and each starter maintains its own log file with a different name. Therefore, you want to list which files condor_preen should remove from the log directory, instead of having to list the files you want to keep. To do this, you specify an INVALID_LOG_FILES setting instead of a VALID_LOG_FILES setting. In both install cases, since you are using the new condor_preen in your whole pool, you should add the following to your global config file:

        INVALID_LOG_FILES = core
since core files are the only unwanted things that might show up in your log directory.

  
3.3.2 Installing CondorView Contrib Modules

To install CondorView for your pool, you really need two things:

1. The CondorView server, which collects historical information.
2. The CondorView client, a Java applet used to view this data.

Since these are completely separate modules, each is handled in its own section.

  
3.3.3 Installing the CondorView Server Module

The CondorView server is just an enhanced version of the condor_collector which can log information to disk, providing a persistent, historical database of your pool state. This includes machine state, as well as the state of jobs submitted by users, and so on. This enhanced condor_collector comes from the version 6.1 development series, but it can be installed in a 6.0 pool. The historical information logging can be turned on or off, so you can install the CondorView collector without using up disk space for historical information if you don't want it.

To install the CondorView server, you must download the appropriate binary module for whatever platform you are going to run your CondorView server on. This does not have to be the same platform as your existing central manager (see below). Once you uncompress and untar the module, you will have a directory with a view_server.tar file, a README, and so on. The view_server.tar acts much like the release.tar file for a main release of Condor. It contains all the binaries and supporting files you would install in your release directory:

        sbin/condor_collector
        sbin/condor_stats
        etc/examples/condor_config.local.view_server

You have two options to choose from when deciding how to install this enhanced condor_collector in your pool:

1. Replace your existing condor_collector and use the new version both for historical information and for the regular role the collector plays in your pool.
2. Install the new condor_collector on a separate host from your main condor_collector and configure your machines to send updates to both collectors.

Because the enhanced collector is development code, replacing your existing collector with it carries the risk that a bug could cause problems for your entire pool. On the other hand, if you install the enhanced version on a separate host and there are problems, only CondorView will be affected, not your entire pool. However, installing the CondorView collector on a separate host generates more network traffic (from all the duplicate updates sent from each machine in your pool to both collectors). In addition, the installation procedure for running both collectors is more complicated. You will just have to decide for yourself which solution you feel more comfortable with.

Before we discuss the details of one type of installation or the other, we explain the steps you must take in either case.

  
3.3.3.1 Setting up the CondorView Server Module

Before you install the CondorView collector (as described in the following sections), you must add a few settings to the local config file of that machine to enable historical data collection. These settings are described in detail in the Condor Version 6.1 Administrator's Manual, in the section ``condor_collector Config File Entries''. However, a short explanation of the ones you must customize is provided below. These entries are also explained in the etc/examples/condor_config.local.view_server file, included in the contrib module. You should insert that file into the local config file for your CondorView collector host and customize it as appropriate for your site.
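For example, from the directory where you unpacked the contrib module (the path to your local config file is a placeholder):

        # append the example settings, then edit them for your site
        % cat etc/examples/condor_config.local.view_server >> \
              /full/path/to/condor_config.local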

POOL_HISTORY_DIR
This is the directory where historical data will be stored. There is a configurable limit to the maximum space required for all the files created by the CondorView server (POOL_HISTORY_MAX_STORAGE). This directory must be writable by whatever user the CondorView collector is running as (usually "condor").

NOTE: This should be a separate directory, not the same as either the Spool or Log directories you have already set up for Condor. There are a few problems with putting these files into either of those directories.

KEEP_POOL_HISTORY
This is a boolean that determines if the CondorView collector should store the historical information. It is false by default, which is why you must specify it as true in your local config file.

Once these settings are in place in the local config file for your CondorView server host, you must create the directory you specified in POOL_HISTORY_DIR and make it writable by the user your CondorView collector runs as. This is the same user that owns the CollectorLog file in your Log directory (usually ``condor'').
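Putting this together, a minimal sketch of the relevant settings and directory setup might look like the following (the directory path and the ``condor'' user are assumptions for your site):

        POOL_HISTORY_DIR = /full/path/to/viewhist
        KEEP_POOL_HISTORY = True

and then:

        # create the directory and give it to the user the collector runs as
        % mkdir /full/path/to/viewhist
        % chown condor /full/path/to/viewhist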

Once those steps are completed, you are ready to install the new binaries and you will begin collecting historical information. Then, you should install the CondorView client contrib module which contains the tools used to query and display this information.

  
3.3.3.2 CondorView Collector as Your Only Collector

To install the new CondorView collector as your main collector, you simply have to replace your existing binary with the new one, found in the view_server.tar file. All you need to do is move your existing condor_collector binary out of the way with the ``mv'' command. For example:

        % cd /full/path/to/your/release/directory
        % cd sbin
        % mv condor_collector condor_collector.old
Then untar the view_server.tar file into your release directory. This installs a new condor_collector binary; condor_stats, a tool that can be used to query this collector for historical information; and an example config file. Within 5 minutes, the condor_master will notice the new timestamp on your condor_collector binary, shut down your existing collector, and spawn the new version. You will see messages about this in the log file for your condor_master (usually MasterLog in your log directory). Once the new collector is running, it is safe to remove your old binary, though you may want to keep it around in case you have problems with the new version and want to revert.
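Concretely, the untar step from this example might look like the following (the path to the view_server.tar file is a placeholder):

        # run from the release directory; this overwrites sbin/condor_collector
        % cd /full/path/to/your/release/directory
        % tar xf /full/path/to/view_server.tar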

  
3.3.3.3 CondorView Collector in Addition to Your Main Collector

To install the CondorView collector in addition to your regular collector requires a little extra work. First, untar the view_server.tar file into some temporary location (not your main release directory). Copy the sbin/condor_collector file from there into your main release directory's sbin with a new name (such as condor_collector.view_server). You will also want to copy the condor_stats program into your release directory's sbin.
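A sketch of these steps (the scratch directory and paths are placeholders):

        # unpack into a scratch directory, not the release directory
        % mkdir /tmp/view_server_scratch
        % cd /tmp/view_server_scratch
        % tar xf /full/path/to/view_server.tar
        # install the collector under a new name, plus condor_stats
        % cp sbin/condor_collector \
             /full/path/to/your/release/directory/sbin/condor_collector.view_server
        % cp sbin/condor_stats /full/path/to/your/release/directory/sbin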

Next, you must configure whatever host is going to run your separate CondorView server to spawn this new collector in addition to whatever other daemons it is running. You do this by adding ``COLLECTOR'' to the DAEMON_LIST on this machine and defining what ``COLLECTOR'' means. For example:

        DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR
        COLLECTOR = $(SBIN)/condor_collector.view_server
For this change to take effect, you must actually restart the condor_master on this host (which you can do with the condor_restart command, run from a machine with ``ADMINISTRATOR'' access to your pool). See section 3.7 on page [*] for full details of IP/host-based security in Condor.

Finally, you must tell all the machines in your pool to start sending updates to both collectors. You do this by specifying the following setting in your global config file:

        CONDOR_VIEW_HOST = full.hostname
where ``full.hostname'' is the full hostname of the machine where you are running your CondorView collector.

Once this setting is in place, you must send a condor_reconfig to your entire pool. The easiest way to do this is:

        % condor_reconfig `condor_status -master`
Again, this command must be run from a trusted ``administrator'' machine for it to work.

  
3.3.4 Installing the CondorView Client Contrib Module

This section has not yet been written.

  
3.3.5 Installing a Checkpoint Server

The Checkpoint Server is a daemon that can be installed on a server to handle all of the checkpoints that a Condor pool will create. This machine should have a large amount of disk space available, and should have a fast connection to your machines.

NOTE: It is a good idea to pick a very stable machine for your checkpoint server. If the checkpoint server crashes, the Condor system will continue to operate, though poorly. While the Condor system will recover from a checkpoint server crash as best it can, there are two problems that can (and will) occur:

1. If the checkpoint server is not functioning, jobs that need to checkpoint cannot do so. The jobs will keep trying to contact the checkpoint server, backing off exponentially in the time they wait between attempts. Normally, jobs only have a limited time to checkpoint before they are kicked off the machine. So, if the server is down for a long period of time, chances are you will lose a lot of work from jobs being killed without writing a checkpoint.

2. When a job tries to start and its checkpoint file cannot be retrieved from the checkpoint server, it will either have to be restarted from scratch or sit there waiting for the server to come back on-line. You can control this behavior with the MAX_DISCARDED_RUN_TIME parameter in your config file (see section 3.4.10 on page [*] for details, and the example setting below). Basically, this represents the maximum amount of CPU time you are willing to discard by starting a job over from scratch if the checkpoint server is not responding to requests.
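For example, to discard at most one hour of accumulated run time before starting such a job over from scratch, you might set the following (an illustrative value only, assuming the parameter is given in seconds; see section 3.4.10 for the authoritative description):

        MAX_DISCARDED_RUN_TIME = 3600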

  
3.3.5.1 Preparing to Install a Checkpoint Server

Because of the problems that exist if your pool is configured to use a checkpoint server and that server is down, it is advisable to shut your pool down before doing any maintenance on your checkpoint server. See section 3.8 on page [*] for details on how to do that.

If you are installing a checkpoint server for the first time, make sure there are no jobs in your pool before you start. If there are jobs in your queues with checkpoint files in the local spool directories of your submit machines, those jobs will never run if your submit machines are configured to use a checkpoint server and the checkpoint files cannot be found on the server. You can either remove the jobs from your queues or let them complete before you begin the installation of the checkpoint server.

  
3.3.5.2 Installing the Checkpoint Server Module

To install a checkpoint server, download the appropriate binary contrib module for the platform your server will run on. When you uncompress and untar that file, you'll have a directory that contains a README, ckpt_server.tar, and so on. The ckpt_server.tar acts much like the release.tar file from a main release. This archive contains these files:

        sbin/condor_ckpt_server
        sbin/condor_cleanckpts
        etc/examples/condor_config.local.ckpt.server
These are all new files, not found in the main release, so you can safely untar the archive directly into your existing release directory. condor_ckpt_server is the checkpoint server binary. condor_cleanckpts is a script that can be run periodically to remove stale checkpoint files from your server. Normally, the checkpoint server cleans up old files by itself; however, in certain error situations, stale files that are no longer needed can be left behind. So, you may want to put a cron job in place that calls condor_cleanckpts every week or so, just to be safe. The example config file is described below.
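As for the suggested cron job, a crontab entry along the following lines (the release directory path and schedule are only examples for your site) would run the cleanup script once a week:

        # hypothetical crontab entry for the user the checkpoint server runs as:
        # run condor_cleanckpts every Sunday at 3:00 am
        0 3 * * 0 /full/path/to/your/release/directory/sbin/condor_cleanckpts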

Once you have unpacked the contrib module, there are a few more steps you must complete. Each is discussed in its own section:

1. Configure the checkpoint server.
2. Spawn the checkpoint server.
3. Configure your pool to use the checkpoint server.

  
3.3.5.3 Configuring a Checkpoint Server

There are a few settings you must place in the local config file of your checkpoint server. The file etc/examples/condor_config.local.ckpt.server contains all such settings, and you can just insert it into the local configuration file of your checkpoint server machine.

There is one setting that you must customize: CKPT_SERVER_DIR, which defines where your checkpoint files will be stored. This should be on a very fast local file system (preferably a RAID). The speed of this file system has a direct impact on how quickly checkpoint files can be stored and retrieved by the remote machines.

The other optional settings are:

DAEMON_LIST
(Described in section 3.4.7). If you want the checkpoint server managed by the condor_master, the DAEMON_LIST entry must contain both MASTER and CKPT_SERVER. Add STARTD if you want to allow jobs to run on your checkpoint server machine; similarly, add SCHEDD if you would like to submit jobs from it.

The rest of these settings are the checkpoint-server specific versions of the Condor logging entries, described in section 3.4.3 on page [*].

CKPT_SERVER_LOG
CKPT_SERVER_LOG is the file where the checkpoint server writes its log.

MAX_CKPT_SERVER_LOG
Use this setting to configure the maximum size the checkpoint server log may grow to before it is rotated.

CKPT_SERVER_DEBUG
The amount of information you would like printed in your logfile. Currently, the only debug level supported is D_ALWAYS.
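Putting these settings together, a minimal sketch of a checkpoint server's local config file might look like the following (the paths and the log size are illustrative values, not recommendations; the etc/examples/condor_config.local.ckpt.server file remains the authoritative starting point):

        # daemons the condor_master should spawn on this machine
        DAEMON_LIST = MASTER, CKPT_SERVER
        # where checkpoint files are stored (a fast local file system)
        CKPT_SERVER_DIR = /full/path/to/ckpt_server_dir
        # checkpoint-server specific logging settings
        CKPT_SERVER_LOG = $(LOG)/CkptServerLog
        MAX_CKPT_SERVER_LOG = 1000000
        CKPT_SERVER_DEBUG = D_ALWAYS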

  
3.3.5.4 Spawning a Checkpoint Server

To spawn a checkpoint server once it is configured to run on a given machine, all you have to do is restart Condor on that host to enable the condor_master to notice the new configuration. You can do this by sending a condor_restart command from any machine with ``administrator'' access to your pool. See section 3.7 on page [*] for full details about IP/host-based security in Condor.
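For example, from a machine with administrator access (the hostname is a placeholder for your checkpoint server machine):

        % condor_restart checkpoint-server-hostname.your.domain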

  
3.3.5.5 Configuring your Pool to Use the Checkpoint Server

Once the checkpoint server is installed and running, you just have to change a few settings in your global config file to let your pool know about your new server:

USE_CKPT_SERVER
This parameter should be set to ``True''.

CKPT_SERVER_HOST
This parameter should be set to the full hostname of the machine that is now running your checkpoint server.
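For example (the hostname is a placeholder for your actual checkpoint server machine):

        USE_CKPT_SERVER = True
        CKPT_SERVER_HOST = checkpoint-server-hostname.your.domain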

Once these settings are in place, you simply have to send a condor_reconfig to all machines in your pool so the changes take effect. This is described in section 3.8.2 on page [*].

  
3.3.6 Installing the PVM Contrib Module

For complete documentation on using PVM in Condor, see the section entitled ``Parallel Applications in Condor: Condor-PVM'' in the version 6.1 manual. This manual can be found at http://www.cs.wisc.edu/condor/manual/v6.1.

To install the PVM contrib module, all you have to do is download the appropriate binary module for whatever platform(s) you plan to use for Condor-PVM. Once you have downloaded each module, uncompressed, and untarred it, you will be left with a directory that contains a pvm.tar, a README, and so on. The pvm.tar acts much like the release.tar file for a main release. It contains all the binaries and supporting files you would install in your release directory:

        sbin/condor_pvmd
        sbin/condor_pvmgs
        sbin/condor_shadow.pvm
        sbin/condor_starter.pvm

Since these files do not exist in a main release, you can safely untar the pvm.tar directly into your release directory, and you're done installing the PVM contrib module. Again, see the 6.1 manual for instructions on how to use PVM in Condor.
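For example (the path to the pvm.tar file is a placeholder):

        % cd /full/path/to/your/release/directory
        % tar xf /full/path/to/pvm.tar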

