#Differentiation in PyTorch

In this chapter, we will explore one of the most fundamental concepts in deep learning: derivatives. Understanding how derivatives are computed and used is essential for training neural networks effectively.

#Why Are Derivatives Important?

In deep learning, the goal is often to minimize a loss function, which measures how far off the model's predictions are from the actual values. Derivatives (gradients, in the multivariable case) measure how changes in the model's parameters will affect the loss. Computing these gradients by propagating errors backward through the network is known as backpropagation, and the resulting gradients are used to update the parameters in the direction that reduces the loss.

PyTorch simplifies this process with its autograd package, which automatically computes gradients for any tensor with requires_grad=True.
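
For instance, here is a minimal sketch (the values are arbitrary) of what that tracking looks like: any operation applied to a tracked tensor records a `grad_fn` describing how it was produced.

```python
import torch

# A tensor that autograd will track
a = torch.tensor(3.0, requires_grad=True)

# Any operation on a tracked tensor records how it was computed
b = a * 2 + 1

print(a.requires_grad)  # True
print(b.grad_fn)        # an autograd node such as <AddBackward0 ...>
```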

#Key Concepts

  • Tensor: A multi-dimensional array used to store data.
  • Gradient: The derivative of a function with respect to its inputs, indicating how much the function's output will change with a small change in the input.
  • Autograd: PyTorch’s automatic differentiation engine that facilitates neural network training.

Let’s explore these concepts with hands-on examples.


#Basic Gradient Calculation in PyTorch

We’ll start with a simple mathematical function and see how PyTorch computes its gradient.

Consider the function:

\[ y = x^2 \]

The derivative of this function with respect to \( x \) is:

\[ \frac{dy}{dx} = 2x \]

Let’s implement this in PyTorch:

```python
import torch

# Define a tensor with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# Define the function y = x^2
y = x ** 2

# Compute the derivative of y with respect to x
y.backward()

# Print the gradient of x
print(x.grad)
```

#Explanation:

  • requires_grad=True: This tells PyTorch to track all operations on x so that we can compute the gradient later.
  • Function \( y = x^2 \): We define a simple function where \( y \) depends on \( x \).
  • y.backward(): This computes the derivative of \( y \) with respect to \( x \). Since \( y = x^2 \), the derivative is \( \frac{dy}{dx} = 2x \), which gives us 4 when \( x = 2 \).

  • x.grad: This stores the computed gradient, which is 4 in this case.
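
As a quick sanity check, the same derivative can be evaluated at another point. This sketch uses \( x = 3 \) (an arbitrary choice) and torch.autograd.grad, which returns the gradient directly instead of storing it in x.grad:

```python
import torch

# Evaluate dy/dx = 2x at a different point, x = 3
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# torch.autograd.grad returns a tuple of gradients, one per input
(grad,) = torch.autograd.grad(y, x)
print(grad)  # tensor(6.)
```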

#Partial Derivatives and Multivariable Functions

When dealing with functions of multiple variables, we compute partial derivatives. A partial derivative measures how the function changes as one variable changes, while keeping the other variables constant.

Consider the function:

\[ z = 3x_1^2 + 2x_2^3 \]

Here, \( z \) is a function of two variables, \( x_1 \) and \( x_2 \).

The partial derivatives are:

\[ \frac{\partial z}{\partial x_1} = 6x_1, \qquad \frac{\partial z}{\partial x_2} = 6x_2^2 \]

Let’s compute these partial derivatives using PyTorch:

```python
import torch

# Define tensors with requires_grad=True
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

# Define the function z = 3*x1^2 + 2*x2^3
z = 3 * x1**2 + 2 * x2**3

# Compute the derivatives of z with respect to x1 and x2
z.backward()

# Print the gradients
print(x1.grad)  # Gradient of z with respect to x1
print(x2.grad)  # Gradient of z with respect to x2
```

#Explanation:

  • Partial Derivatives: We compute how \( z \) changes with respect to each variable, \( x_1 \) and \( x_2 \), while treating the other variable as constant.
  • z.backward(): This computes the partial derivatives \( \frac{\partial z}{\partial x_1} \) and \( \frac{\partial z}{\partial x_2} \) and stores them in x1.grad and x2.grad (a sketch of an alternative that returns them directly follows this list).
  • Gradients: x1.grad will be 6 (since \( x_1 = 1 \)), and x2.grad will be 24 (since \( x_2 = 2 \)).
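
The same partial derivatives can also be obtained without calling backward(). The following sketch uses torch.autograd.grad to return them as a tuple:

```python
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
z = 3 * x1**2 + 2 * x2**3

# Returns the partial derivatives directly instead of filling .grad
dz_dx1, dz_dx2 = torch.autograd.grad(z, (x1, x2))
print(dz_dx1)  # tensor(6.)  = 6 * x1
print(dz_dx2)  # tensor(24.) = 6 * x2**2
```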

#Using Gradients in Optimization

In neural networks, gradients are used to update model parameters in order to minimize the loss function. This is done using an optimization algorithm like Stochastic Gradient Descent (SGD).

Consider a simple example where we want to minimize the following loss function:

\[ L(w) = (w - 2)^2 \]

The derivative of the loss function with respect to \( w \) is:

\[ \frac{dL}{dw} = 2(w - 2) \]
Here’s how we can compute this gradient and update the parameter using PyTorch:

```python
import torch
import torch.optim as optim

# Define a tensor with requires_grad=True
w = torch.tensor(1.0, requires_grad=True)

# Define the loss function
loss = (w - 2)**2

# Define an optimizer
optimizer = optim.SGD([w], lr=0.1)

# Perform one optimization step
optimizer.zero_grad()  # Zero the gradients
loss.backward()        # Compute the gradients
optimizer.step()       # Update the parameter w

# Print the updated value of w
print(w)
```

#Explanation:

  • Loss Function: The loss function \( L(w) = (w - 2)^2 \) measures how far the current value of \( w \) is from the target value (2 in this case).
  • Gradient Calculation: loss.backward() computes the gradient \( \frac{dL}{dw} = 2(w - 2) \), which is -2 when \( w = 1 \).
  • Optimizer: The SGD optimizer updates \( w \) by subtracting the gradient multiplied by the learning rate (0.1 in this case).
  • Updated Parameter: After the update, \( w \) moves from 1.0 to 1.2, closer to the value that minimizes the loss; repeating the step (see the sketch below) brings it closer still.
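
The example above performs a single update. As a sketch (the number of steps is an illustrative choice), repeating the step drives \( w \) toward the minimizer of the loss:

```python
import torch
import torch.optim as optim

w = torch.tensor(1.0, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

for step in range(20):
    optimizer.zero_grad()       # clear gradients from the previous step
    loss = (w - 2) ** 2         # recompute the loss with the current w
    loss.backward()             # compute dL/dw
    optimizer.step()            # update w

print(w)  # approaches 2.0, the value that minimizes the loss
```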


#Zeroing the Gradients

When performing multiple optimization steps, the gradients will accumulate by default. To prevent this, you should zero the gradients before each backward pass:

```python
w.grad.zero_()  # Reset the accumulated gradient on w to zero
```
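
The following sketch illustrates the accumulation: calling backward() twice without zeroing sums the two gradients into x.grad.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

(x ** 2).backward()
print(x.grad)  # tensor(4.)

(x ** 2).backward()
print(x.grad)  # tensor(8.) -- the second gradient was added to the first

x.grad.zero_()
print(x.grad)  # tensor(0.)
```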

#The detach() Function

The detach() function creates a new tensor that shares the same data but does not require gradients. This is useful when you want to perform operations that should not affect the gradient computation.

#Example:

```python
import torch

# Define a tensor with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# Define a function
y = x ** 2

# Detach the tensor from the computation graph
y_detached = y.detach()

# Perform operations on the detached tensor
z = y_detached + 5

# Print results
print(f'y: {y}')                    # Tensor with gradients
print(f'y_detached: {y_detached}')  # Tensor without gradients
print(f'z: {z}')                    # Resulting tensor after operation
```

#Explanation:

  • y.detach(): This creates a new tensor y_detached that shares the same data as y but does not track gradients.
  • Use Case: This is helpful when you need to perform certain operations on tensors without affecting the gradient computation.
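
For example, this sketch (with arbitrary values) shows that gradients flow through y but not through its detached copy:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

# The detached copy contributes no gradient; only the first term does
out = y + y.detach()
out.backward()

print(x.grad)  # tensor(4.), i.e. dy/dx = 2x from the non-detached path only
```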

#Conclusion

Understanding derivatives and how to handle gradients in PyTorch is fundamental for training and optimizing neural networks. PyTorch’s autograd package makes it easy to compute and use these gradients.