Next: 3.9 Setting up Condor Up: 3. Administrators' Manual Previous: 3.7 Setting Up IP/Host-Based


3.8 Managing your Condor Pool

Condor provides a number of administrative tools to help you manage your pool. The following sections describe various tasks you might wish to perform on your pool and explain how to do them most efficiently.

All of the commands described in this section must be run from a machine listed in the HOST_ALLOW_ADMINISTRATOR setting in your config files, so that the IP/host-based security allows the administrator commands to be serviced. See section 3.7 for full details about IP/host-based security in Condor.

3.8.1 Shutting Down and Restarting your Condor Pool

There are a couple of situations where you might want to shut down and restart your entire Condor pool. In particular, when you want to install new binaries, it is generally best to make sure no jobs are running, shut down Condor, and then install the new daemons.

3.8.1.1 Shutting Down your Condor Pool

The best way to shut down your pool is to take advantage of the remote administration capabilities of the condor_master. The first step is to save the IP address and port of the condor_master daemon on each of your machines to a file, so that even after you shut down your condor_collector, you can still send administrator commands to your machines. You do this with the following command:

        % condor_status -master -format "%s\n" MasterIpAddr > addresses
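Because the rest of the procedure depends on this file, it can be worth checking it before you turn off the condor_collector. The following is only a minimal sketch and is not part of Condor: the function name check_addresses is hypothetical, and the pattern assumes that each master address is printed in the form <IP:port>.

```shell
# check_addresses FILE EXPECTED_COUNT
# Hypothetical helper (not a Condor tool): succeeds only if FILE holds exactly
# EXPECTED_COUNT lines and every line looks like <IP:port>.
# The <IP:port> pattern is an assumption about how master addresses are printed.
check_addresses() {
    file="$1"
    expected="$2"
    [ -s "$file" ] || { echo "$file is missing or empty" >&2; return 1; }

    # Count all lines, then the lines that do not match the assumed pattern.
    total=$(wc -l < "$file" | tr -d ' ')
    bad=$(grep -c -v -E '^<[0-9]+(\.[0-9]+){3}:[0-9]+>$' "$file" || true)

    if [ "$total" -ne "$expected" ] || [ "$bad" -ne 0 ]; then
        echo "expected $expected addresses, found $total ($bad malformed)" >&2
        return 1
    fi
    return 0
}
```

For a pool of 40 machines, for example, you might run check_addresses addresses 40 and continue only if it succeeds.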

The next step is to shut down any currently running jobs and give them a chance to checkpoint. Depending on the size of your pool, your network infrastructure, and the image size of the standard jobs running in your pool, you may want to make this a slow process, vacating only one host at a time. You can either shut down hosts that have jobs submitted (in which case all the jobs from that host will try to checkpoint simultaneously), or you can shut down individual hosts that are running jobs. To shut down a host, simply send:

        % condor_off hostname

where "hostname" is the name of the host you want to shut down. This will only work so long as your condor_collector is still running. Once you have shut down Condor on your central manager, you will have to rely on the addresses file you just created.
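Vacating one host at a time can be scripted with a simple loop. The sketch below is illustrative only: the function name vacate_slowly and the CONDOR_OFF override variable are hypothetical, and it assumes that condor_off returns a non-zero exit status on failure. A fixed pause does not guarantee that every job has finished checkpointing, so choose the delay to suit your network and job sizes.

```shell
# vacate_slowly PAUSE_SECONDS HOST...
# Hypothetical helper: sends condor_off to one host at a time and waits
# PAUSE_SECONDS between hosts so that checkpoints do not all hit the network
# at once. Set CONDOR_OFF to override the command, e.g. "echo" for a dry run.
vacate_slowly() {
    pause="$1"
    shift
    for host in "$@"; do
        echo "shutting down $host"
        # Assumes condor_off exits non-zero when the command could not be sent.
        if ! "${CONDOR_OFF:-condor_off}" "$host"; then
            echo "condor_off failed for $host" >&2
        fi
        sleep "$pause"
    done
}
```

For example, vacate_slowly 300 hostA hostB hostC would wait five minutes between hosts, where the host names are placeholders for machines in your own pool.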

If all the running jobs are checkpointed and stopped, or if you're not worried about the network load caused by shutting down everything at once, it is safe to turn off all daemons on all machines in your pool. You can do this with one command, so long as you run it from a blessed administrator machine:

        % condor_off `cat addresses`

where addresses is the file where you saved your master addresses. condor_off will shut down all the daemons, but leave the condor_master running, so that you can send a condor_on in the future.

Once all of the Condor daemons (except the condor_master) on each host are turned off, you're done. It is now safe to install new binaries, move your checkpoint server to another host, or perform any other task that requires the pool to be shut down.

NOTE: If you are planning to install a new condor_master binary, be sure to read the following section for special considerations with this somewhat delicate task.

3.8.1.2 Installing a New condor_master

If you are going to install a new condor_master binary, there are a few other steps you should take. When the condor_master restarts, it listens on a new port, so your addresses file will be stale. Moreover, when the master restarts, it doesn't know that you sent it a condor_off in its past life, and will just start up all the daemons it's configured to spawn unless you explicitly tell it otherwise.

If you just want your pool to completely restart itself whenever the master notices its new binary, neither of these issues is of any concern and you can skip this (and the next) section. Just be sure the new master binary is the last thing you install. Once you put the new binary in place, the pool will restart itself over the next 5 minutes, because each master checks for a new binary once every 5 minutes by default.

However, if you want to have absolute control over when the rest of the daemons restart, you must take a few steps.

1. Put the following setting in your global config file:

        START_DAEMONS = False

This will make sure that when the master restarts itself, it doesn't also start up the rest of its daemons.

2. Install your new condor_master binary.

3. Start up Condor on your central manager machine. You will have to do this manually by logging into the machine and sending commands locally. First, send a condor_restart to make sure you've got the new master, then send a condor_on to start up the other daemons (including, most importantly, the condor_collector).

4. Wait 5 minutes, so that all the masters have a chance to notice the new binary, restart themselves, and send an update with their new address. Make sure that:

        % condor_status -master

lists all the machines in your pool.

5. Remove the START_DAEMONS = False setting from your global config file.

6. Recreate your addresses file as described above:

        % condor_status -master -format "%s\n" MasterIpAddr > addresses

Once the new master is in place, and you're ready to start up your pool again, you can restart your whole pool by simply following the steps in the next section.
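The waiting in step 4 can be automated by polling until the expected number of masters have reported. This is only a sketch under stated assumptions: the function names list_master_addrs and wait_for_masters are hypothetical, and the check simply counts the addresses returned by the condor_status command used earlier. It cannot tell whether a given master is already running the new binary, so treat it as a convenience rather than a guarantee.

```shell
# list_master_addrs: prints one master address per line, using the same
# condor_status invocation shown in this section.
list_master_addrs() {
    condor_status -master -format "%s\n" MasterIpAddr
}

# wait_for_masters EXPECTED [TRIES] [INTERVAL_SECONDS]
# Hypothetical helper: polls until at least EXPECTED masters have reported,
# or gives up after TRIES attempts (defaults: 10 tries, 60 seconds apart).
wait_for_masters() {
    expected="$1"
    tries="${2:-10}"
    interval="${3:-60}"
    count=0
    while [ "$tries" -gt 0 ]; do
        # Count the address lines currently known to the collector.
        count=$(list_master_addrs 2>/dev/null | wc -l | tr -d ' ')
        if [ "$count" -ge "$expected" ]; then
            echo "$count masters reporting"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    echo "only $count of $expected masters reporting" >&2
    return 1
}
```

For a pool of 40 machines, wait_for_masters 40 would poll once a minute for up to ten minutes before giving up.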

3.8.1.3 Restarting your Condor Pool

Once you have finished whatever tasks you needed to perform and you're ready to restart your pool, you simply have to send a condor_on to the condor_master daemon on each host. You can do this with one command, so long as you run it from a blessed administrator machine:

        % condor_on `cat addresses`

That's it. All your daemons should now be restarted, and your pool will be back on its way.

3.8.2 Reconfiguring Your Condor Pool

If you change a global config file setting and want to have all your machines start to use the new setting, you must send a condor_reconfig command to each host. The easiest way to do this is:

        % condor_reconfig `condor_status -master`
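If you would rather know which hosts did not accept the command, you can send condor_reconfig to each host separately. The sketch below is illustrative only: the function name reconfig_hosts and the CONDOR_RECONFIG override variable are hypothetical, and it assumes that condor_reconfig returns a non-zero exit status when it fails.

```shell
# reconfig_hosts HOST...
# Hypothetical helper: sends condor_reconfig to each host separately and
# prints the names of hosts where the command failed, one per line.
# Returns non-zero if any host failed. Set CONDOR_RECONFIG to override
# the command that is run.
reconfig_hosts() {
    failed=0
    for host in "$@"; do
        # Assumes a non-zero exit status means the reconfig was not delivered.
        if ! "${CONDOR_RECONFIG:-condor_reconfig}" "$host" >/dev/null 2>&1; then
            echo "$host"
            failed=1
        fi
    done
    return "$failed"
}
```

The printed host names can be saved and passed to the function again once the affected machines are reachable.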


condor-admin@cs.wisc.edu