Prerequisites
Understand how to select nodes for a job, create a PBS script, and run a batch or interactive job.
PyTorch is preinstalled. PyTorch must be run from within a batch or interactive job.
PyTorch Examples
PyTorch requires a node with a CUDA-capable GPU, and only a limited number of GPU nodes are available. Below are two examples. Copy each script into a working directory. To run an example, enter the command “sh mpi_test.sh” or “sh rpc_test.sh”.
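Because a CUDA GPU is required, it can help to confirm from inside a job that one is visible before running the examples. The following is a minimal sketch, not part of the original scripts; the helper name describe_gpus is hypothetical, and it assumes the pytorch module has already been loaded.

```python
def describe_gpus():
    """Return one description string per visible CUDA device (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not importable, e.g. the module has not been loaded
        return []
    if not torch.cuda.is_available():
        return []
    return [
        "cuda:{} {}".format(i, torch.cuda.get_device_name(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    gpus = describe_gpus()
    if gpus:
        print("\n".join(gpus))
    else:
        print("No CUDA GPU visible; request a GPU node for the job.")
```

An empty list means either that PyTorch could not be imported or that the node has no usable GPU.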
MPI Example:
This example illustrates PyTorch running in parallel on one or more nodes with MPI. Copy each script into a working directory.
PBS script “mpi_test.sh”:
#PBS -N pytorch_mpi_test
#PBS -l nodes=2:ppn=20:cuda
cd $PBS_O_WORKDIR
module load pytorch/cuda/mpi/2.5.1
mpiexec python3 mpi_test.py
PyTorch script “mpi_test.py”:
import os
import torch
import torch.distributed as dist

# Environment variables set by Open MPI's mpiexec for each process
LOCAL_RANK = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
WORLD_SIZE = int(os.environ['OMPI_COMM_WORLD_SIZE'])
WORLD_RANK = int(os.environ['OMPI_COMM_WORLD_RANK'])

def run(backend):
    tensor = torch.zeros(1)
    # Need to put tensor on a GPU device for nccl backend
    if backend == 'nccl':
        device = torch.device("cuda:{}".format(LOCAL_RANK))
        tensor = tensor.to(device)
    if WORLD_RANK == 0:
        # Rank 0 sends the tensor to every other rank
        for rank_recv in range(1, WORLD_SIZE):
            dist.send(tensor=tensor, dst=rank_recv)
            print('worker_{} sent data to Rank {}\n'.format(0, rank_recv))
    else:
        # Every other rank receives the tensor from rank 0
        dist.recv(tensor=tensor, src=0)
        print('worker_{} has received data from rank {}\n'.format(WORLD_RANK, 0))

def init_processes(backend):
    # The mpi backend takes rank and world size from the MPI runtime
    if backend == 'mpi':
        dist.init_process_group(backend)
    else:
        dist.init_process_group(backend, rank=WORLD_RANK, world_size=WORLD_SIZE)
    run(backend)

if __name__ == "__main__":
    backend = 'mpi'
    init_processes(backend)
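The script above reads its rank and world size from variables that Open MPI's mpiexec sets for each process, so it fails with a KeyError if started without mpiexec. As a sketch only, the lookup could be wrapped in a helper that falls back to a single-process layout. The helper name read_mpi_env and the fallback values are assumptions, not part of the original script, and the mpi backend itself still needs mpiexec to communicate.

```python
import os


def read_mpi_env(environ=os.environ):
    """Return (local_rank, world_size, world_rank) from Open MPI variables.

    Falls back to a single-process layout (assumed defaults) when the
    variables are absent, e.g. when the script is started without mpiexec.
    """
    local_rank = int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    world_size = int(environ.get("OMPI_COMM_WORLD_SIZE", 1))
    world_rank = int(environ.get("OMPI_COMM_WORLD_RANK", 0))
    return local_rank, world_size, world_rank


if __name__ == "__main__":
    print(read_mpi_env())
```

Passing a plain dictionary instead of os.environ makes the helper easy to check without launching a job.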
RPC Example:
In this example, one “rpc_test.py” runs on each node. Copy each script into a working directory.
PBS script “rpc_test.sh”:
#PBS -l nodes=2:ppn=20:cuda
#PBS -N rpc_test
cd $PBS_O_WORKDIR
module load pytorch/cuda/mpi/2.5.1
export MASTER_ADDR=$HOSTNAME
export MASTER_PORT=8394
# In this example, one "rpc_test.py" runs on each node.
mpiexec -npernode 1 python3 rpc_test.py
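The PBS script exports MASTER_ADDR and MASTER_PORT, which PyTorch's default environment-variable rendezvous is understood to read when the RPC framework starts. As a sketch only, a script could check both variables up front and fail with a clear message. The helper name rendezvous_endpoint is hypothetical and not part of the original scripts.

```python
import os


def rendezvous_endpoint(environ=os.environ):
    """Return (address, port) from MASTER_ADDR and MASTER_PORT, or raise."""
    missing = [k for k in ("MASTER_ADDR", "MASTER_PORT") if not environ.get(k)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return environ["MASTER_ADDR"], int(environ["MASTER_PORT"])


if __name__ == "__main__":
    try:
        print(rendezvous_endpoint())
    except RuntimeError as err:
        print(err)
```

Calling this before starting RPC would surface a missing export as a readable error rather than a failure during initialization.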
PyTorch script “rpc_test.py”:
import torch
import torch.distributed.rpc as rpc
from torch import Tensor
import os

def remote_fn(x: Tensor, n: int) -> Tensor:
    # Runs on the remote worker and returns the scaled tensor
    return x * n

if __name__ == "__main__":
    # Rank and world size are set by Open MPI's mpiexec
    rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
    name = "worker" + str(rank)
    rpc.init_rpc(
        name=name,
        rank=rank,
        world_size=world_size
    )
    if rank == 0:
        # Rank 0 calls remote_fn on every other worker and prints each result
        workers = [(f"worker{n}", n) for n in range(1, world_size)]
        for worker, worker_rank in workers:
            result = rpc.rpc_sync(worker, remote_fn, args=(torch.tensor(5), worker_rank + 1))
            print(result)
        print("I AM ALL DONE")
    # Wait for all workers to finish before exiting
    rpc.shutdown()
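To see what rank 0 should print, the arithmetic can be reproduced without PyTorch: each worker with rank n receives the value 5 and the multiplier n + 1. The following sketch uses plain integers in place of tensors, and the helper name expected_results is hypothetical.

```python
def expected_results(world_size, value=5):
    """Values rank 0 should receive: value * (n + 1) for each worker rank n >= 1."""
    return [value * (n + 1) for n in range(1, world_size)]


if __name__ == "__main__":
    # Two nodes with one process each give a world size of 2, so one worker
    print(expected_results(2))  # prints [10]
```

With the two-node PBS script above, rank 0 should therefore print a single tensor holding the value 10.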