Leveraging PyTorch DistributedTensor (DTensor) for 2-D Parallelism

In this tutorial, we will learn about 2-D parallelism using PyTorch's DistributedTensor (DTensor).

Parallelism is the act of splitting a task into subtasks and executing them concurrently to improve overall performance. In the context of deep learning, parallelism can be used to speed up training by distributing computation across multiple devices like GPUs or even across multiple nodes in a cluster.

DistributedTensor (DTensor) is an abstraction provided by PyTorch's torch.distributed package that enables parallel computation using distributed training techniques. It allows us to split a large tensor into smaller chunks (shards) and distribute them across different devices or nodes for parallel processing.

In this tutorial, we will focus on 2-D parallelism, where we split a 2-D tensor into smaller chunks and distribute them across multiple devices. This is particularly useful when dealing with large 2-D data such as weight matrices, image batches, or sequence embeddings.
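Before touching any distributed APIs, here is a quick local illustration of the splitting idea: a 4×4 tensor chunked row-wise yields one 2×4 shard per rank in a two-process setup. No process group is needed for this snippet.

import torch

# Row-wise split of a 4x4 tensor: with 2 ranks, each rank would own one 2x4 shard.
tensor = torch.randn(4, 4)
chunks = tensor.chunk(2, dim=0)
print([c.shape for c in chunks])  # [torch.Size([2, 4]), torch.Size([2, 4])]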

Step 1: Setting up the environment
To get started, make sure you have PyTorch installed on your system. You can install it using pip:

pip install torch

Next, we need to set up a distributed process group using the torch.distributed module. This process group is responsible for coordinating communication between different processes in a distributed environment. Here’s how you can set up a process group:

import os
import torch
import torch.distributed as dist

# Initialize the process group. The rank and world size are normally supplied
# by the launcher (e.g. torchrun) through environment variables.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 2))
dist.init_process_group(backend='gloo', init_method='tcp://localhost:12345',
                        rank=rank, world_size=world_size)

In the above code, we initialize a process group with the ‘gloo’ backend, specifying the address used for rendezvous (tcp://localhost:12345) along with the rank and world size of the current process. Note that each process must pass its own unique rank; in practice these values come from the launcher (for example, torchrun sets the RANK and WORLD_SIZE environment variables for every process it spawns).
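As a quick sanity check after initialization, you can have every rank report its identity and run a small collective. This is a minimal sketch assuming the same two-process gloo setup as above; launch it once per rank (for example with torchrun --nproc_per_node=2).

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend='gloo', init_method='tcp://localhost:12345',
                        rank=int(os.environ.get("RANK", 0)),
                        world_size=int(os.environ.get("WORLD_SIZE", 2)))

# Every rank contributes a tensor holding its own rank; after all_reduce,
# each of the two ranks holds the sum 0 + 1 = 1.
t = torch.tensor([float(dist.get_rank())])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()} of {dist.get_world_size()}: all_reduce result = {t.item()}")

dist.destroy_process_group()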

Step 2: Creating a DistributedTensor
Next, we will create a 2D tensor and split it into smaller chunks using the torch.Tensor.chunk method. Each process then wraps its local chunk in a DistributedTensor (DTensor), so that PyTorch knows the chunk is one shard of a larger logical tensor. Here’s how you can create and distribute a DistributedTensor:

import os
import torch
import torch.distributed as dist
# The DTensor APIs live in torch.distributed.tensor in recent PyTorch releases
# (torch.distributed._tensor in older ones).
from torch.distributed._tensor import DeviceMesh, DTensor, Shard

# Initialize the process group
dist.init_process_group(backend='gloo', init_method='tcp://localhost:12345',
                        rank=int(os.environ.get("RANK", 0)),
                        world_size=int(os.environ.get("WORLD_SIZE", 2)))

# Build a 1-D device mesh over all ranks (CPU devices for the gloo backend)
mesh = DeviceMesh("cpu", torch.arange(dist.get_world_size()))

# Create a 2D tensor
tensor = torch.randn(4, 4)

# Split the tensor into one chunk per rank along dimension 0
chunks = tensor.chunk(dist.get_world_size(), dim=0)

# Create a DistributedTensor from the local chunk, sharded along dimension 0
dist_tensor = DTensor.from_local(chunks[dist.get_rank()], mesh, [Shard(0)])

In the above code, we build a DeviceMesh that spans both ranks, create a 4×4 tensor, and split it into one chunk per rank along the 0th dimension. We then construct a DistributedTensor from the chunk corresponding to the current process rank with DTensor.from_local, marking it as sharded along dimension 0 via the Shard(0) placement.
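An alternative, assuming every rank can materialize the full tensor, is to let distribute_tensor do the chunking and scattering for you; seeding the RNG keeps the global tensor identical on all ranks. A minimal sketch of this pattern:

import os
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group(backend='gloo', init_method='tcp://localhost:12345',
                        rank=int(os.environ.get("RANK", 0)),
                        world_size=int(os.environ.get("WORLD_SIZE", 2)))
mesh = DeviceMesh("cpu", torch.arange(dist.get_world_size()))

# Same global 4x4 tensor on every rank (fixed seed); distribute_tensor shards it row-wise.
torch.manual_seed(0)
global_tensor = torch.randn(4, 4)
dist_tensor = distribute_tensor(global_tensor, mesh, [Shard(0)])

print(dist_tensor.placements)        # (Shard(dim=0),)
print(dist_tensor.to_local().shape)  # torch.Size([2, 4]) on each rank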

Step 3: Performing operations on DistributedTensor
Once we have created a DistributedTensor, we can perform operations on it just like we would on a regular PyTorch tensor. The only difference is that the computation will be distributed across different processes. Here’s an example of performing matrix multiplication on a DistributedTensor:

import os
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, DTensor, Shard

# Initialize the process group
dist.init_process_group(backend='gloo', init_method='tcp://localhost:12345',
                        rank=int(os.environ.get("RANK", 0)),
                        world_size=int(os.environ.get("WORLD_SIZE", 2)))

# Build the device mesh over all ranks, as in Step 2
mesh = DeviceMesh("cpu", torch.arange(dist.get_world_size()))

# Create a 2D tensor
tensor = torch.randn(4, 4)

# Split the tensor into one chunk per rank along dimension 0
chunks = tensor.chunk(dist.get_world_size(), dim=0)

# Create a DistributedTensor from the local chunk, sharded along dimension 0
dist_tensor = DTensor.from_local(chunks[dist.get_rank()], mesh, [Shard(0)])

# Perform matrix multiplication; DTensor handles any communication between shards
result = torch.mm(dist_tensor, dist_tensor)

In the above code, we recreate the sharded DistributedTensor as in Step 2 and then perform matrix multiplication using the torch.mm function. The operation is dispatched through DTensor, which decides how the output is sharded and performs any communication between the shards on our behalf; the result is itself a DistributedTensor.
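If you need the full result on every rank (for example, to log or checkpoint it), you can redistribute the output to a replicated placement and then read the local copy. A minimal sketch, assuming result, mesh, and the process group from the block above:

from torch.distributed._tensor import Replicate

# Convert the sharded result into a replicated DTensor, then take the local copy,
# which now holds the complete 4x4 product on every rank.
full_result = result.redistribute(mesh, [Replicate()]).to_local()
print(full_result.shape)  # torch.Size([4, 4])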

Step 4: Cleaning up
After you have finished using the DistributedTensor, make sure to clean up the process group to release resources and gracefully exit the distributed environment. Here’s how you can clean up:

import torch
import torch.distributed as dist

# Finalize the process group
dist.destroy_process_group()

In this tutorial, we learned about 2-D parallelism using PyTorch's DistributedTensor (DTensor). We set up a distributed process group, created a DistributedTensor from a 2D tensor, performed operations on it, and cleaned up the process group. By utilizing 2-D parallelism, we can effectively distribute computation across multiple devices or nodes to speed up training and improve overall performance in deep learning applications.
