It is possible to use rsh on the first node of a job to run arbitrary commands on the other nodes of the job. You may need to do this because some commercial software, or your own precompiled software, requires rsh to start parallel jobs. However, if you’re using parallel software that is already installed on the system, you will not need to use rsh. This article describes how to use mpirun and rsh to launch a parallel job with precompiled software that requires rsh to launch processes on all the nodes of a job.
Whether you’re running interactively or in batch mode, the process is the same. First you need to load the rsh module. The module will set up your environment so that the correct rsh command is used. We have three different modules for rsh: ‘rsh/mpiexec’, ‘rsh/pbsdsh’, and ‘rsh/qlogin’. Each uses a different backend to emulate the real rsh command. All of them interface with PBS to actually run the command on the remote host, and all accept the standard rsh options. However, they each handle STDIN/STDOUT differently, so if one doesn’t work for you, try another. The default is ‘rsh/qlogin’, which uses qlogin for the backend and handles STDIN/STDOUT best for most applications.
For some software, you may also need to set the environment variable RSH_COMMAND to ‘rsh’, though this should be the default for most.
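As a minimal sketch of the setup described above, the following loads one of the non-default backends and sets RSH_COMMAND. The module names are the ones listed in this article; choosing ‘rsh/pbsdsh’ here is only an illustration of trying another backend, and the guard around the module command is an addition so the snippet can also be run on a machine without environment modules.

```shell
# Guard added for illustration only: it lets this sketch run where the
# module command is not defined. Inside a job you can call module directly.
if command -v module >/dev/null 2>&1; then
    # Pick one backend. rsh/pbsdsh is shown as an alternative to try when
    # the default (rsh/qlogin) mishandles STDIN/STDOUT for your program.
    module load rsh/pbsdsh
fi

# Only needed if your software reads RSH_COMMAND and does not already
# default to rsh.
export RSH_COMMAND=rsh
```

If the alternative backend does not help, the same pattern applies with ‘rsh/mpiexec’ in place of ‘rsh/pbsdsh’.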
Next you need to get the names of the other nodes in your job. To do this, use the PBS_NODEFILE variable. This variable contains the path of a file which has a list of all the hostnames in the job. A hostname will appear once for each allocated core in the job. This file will be your ‘machinefile’ for mpirun.
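To see what the node file looks like, a short sketch such as the following can help. Outside a PBS job the PBS_NODEFILE variable is not set, so the snippet falls back to a small sample file; the hostnames node01 and node02 and the variable UNIQUE_NODES are made up for illustration.

```shell
# Assumption for illustration: when not inside a job, create a sample
# node file with hypothetical hostnames (2 nodes, 2 cores each).
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf '%s\n' node01 node01 node02 node02 > "$PBS_NODEFILE"
fi

# The raw file: one line per allocated core, so hostnames repeat.
cat "$PBS_NODEFILE"

# One line per node, with the per-core duplicates collapsed.
UNIQUE_NODES=$(sort -u "$PBS_NODEFILE")
echo "$UNIQUE_NODES"

# Number of cores allocated on each node.
sort "$PBS_NODEFILE" | uniq -c
```

mpirun wants the raw file with its repeated hostnames, since the repeats tell it how many processes to place on each node; the deduplicated list is only useful when you want to do something once per node.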
Finally, you need the total number of cores in the job. Again, we use the PBS_NODEFILE to get this information. All you need to do is count the number of lines in this file to get the number of cores in the job. See the example below on how to do this.
So, for example, here are the commands needed to launch an MPI program called “pfoo” on all the cores of a job. You would either type these commands manually for an interactive job, or put them in your PBS script. This example assumes that the mpirun command that comes with “pfoo” is already in your PATH. You may need to load a separate module for pfoo.
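# Load the default rsh module (rsh/qlogin)
module load rsh
# Count the lines in the node file: one line per allocated core
NP=$(wc -l "$PBS_NODEFILE" | awk '{print $1}')
# Start pfoo on every core, using the node file as the machinefile
mpirun -machinefile "$PBS_NODEFILE" -np "$NP" pfoo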
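The same node file can also drive rsh directly when you want to run a command once per node rather than once per core. The sketch below is one possible way to do that; run_on_each_node is a hypothetical helper name, not something provided by the rsh modules, and it assumes that an rsh module is already loaded so that rsh resolves to the PBS-aware wrapper.

```shell
# run_on_each_node NODEFILE COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND once on every distinct host in NODEFILE.
run_on_each_node() {
    nodefile=$1
    shift
    # sort -u collapses the per-core duplicates to one entry per node.
    for host in $(sort -u "$nodefile"); do
        # Standard rsh usage: rsh HOST COMMAND [ARGS...]
        rsh "$host" "$@"
    done
}

# Inside a job, with an rsh module loaded, you might call it like this:
#   run_on_each_node "$PBS_NODEFILE" hostname
```

The loop uses for rather than a while-read loop so that rsh cannot consume the list of hostnames from standard input.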
To log in to a running job, or to manually run arbitrary programs on the compute nodes from within a job, see qlogin.