# Differentiation in PyTorch
In this chapter, we will explore one of the most fundamental concepts in deep learning: derivatives. Understanding how derivatives are computed and used is essential for training neural networks effectively.
## Why Are Derivatives Important?
In deep learning, the goal is often to minimize a loss function, which measures how far off the model's predictions are from the actual values. Derivatives, collected into gradients, measure how small changes in the model's parameters affect the loss. These gradients are computed via backpropagation and then used to update the parameters in the direction that reduces the loss.
PyTorch simplifies this process with its `autograd` package, which automatically computes gradients for any tensor created with `requires_grad=True`.
## Key Concepts
- Tensor: A multi-dimensional array used to store data.
- Gradient: The derivative of a function with respect to its inputs, indicating how much the function's output will change with a small change in the input.
- Autograd: PyTorch’s automatic differentiation engine that facilitates neural network training.
Let’s explore these concepts with hands-on examples.
## Basic Gradient Calculation in PyTorch
We’ll start with a simple mathematical function and see how PyTorch computes its gradient.
Consider the function:
$$y = x^2$$
The derivative of this function with respect to $x$ is:
$$\frac{dy}{dx} = 2x$$
Let’s implement this in PyTorch:
```python
import torch

# Define a tensor with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# Define the function y = x^2
y = x ** 2

# Compute the derivative of y with respect to x
y.backward()

# Print the gradient of x
print(x.grad)
```
### Explanation
- `requires_grad=True`: This tells PyTorch to track all operations on `x` so that we can compute the gradient later.
- Function $y = x^2$: We define a simple function where `y` depends on `x`.
- `y.backward()`: This computes the derivative of `y` with respect to `x`. Since $y = x^2$, the derivative is $\frac{dy}{dx} = 2x$, which gives 4 when $x = 2$.
- `x.grad`: This stores the computed gradient, which is 4 in this case.
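The same mechanics extend to tensors with more than one element. The sketch below (an added illustration, not part of the original example) sums the squares of a small vector so that `backward()` is called on a scalar, and the resulting gradient is `2 * x` element-wise.

```python
import torch

# Gradients work element-wise for multi-element tensors as well
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# backward() needs a scalar, so we sum the element-wise squares
y = (x ** 2).sum()
y.backward()

# dy/dx_i = 2 * x_i for each element
print(x.grad)  # tensor([2., 4., 6.])
```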
## Partial Derivatives and Multivariable Functions
When dealing with functions of multiple variables, we compute partial derivatives. A partial derivative measures how the function changes as one variable changes, while keeping the other variables constant.
Consider the function:
$$z = 3x_1^2 + 2x_2^3$$
Here, $z$ is a function of two variables, $x_1$ and $x_2$.
The partial derivatives are:
$$\frac{\partial z}{\partial x_1} = 6x_1, \qquad \frac{\partial z}{\partial x_2} = 6x_2^2$$
Let’s compute these partial derivatives using PyTorch:
```python
import torch

# Define tensors with requires_grad=True
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

# Define the function z = 3*x1^2 + 2*x2^3
z = 3 * x1**2 + 2 * x2**3

# Compute the derivatives of z with respect to x1 and x2
z.backward()

# Print the gradients
print(x1.grad)  # Gradient of z with respect to x1
print(x2.grad)  # Gradient of z with respect to x2
```
### Explanation
- Partial Derivatives: We compute how $z$ changes with respect to each variable, $x_1$ and $x_2$, while treating the other variable as constant.
- `z.backward()`: This computes the partial derivatives $\frac{\partial z}{\partial x_1} = 6x_1$ and $\frac{\partial z}{\partial x_2} = 6x_2^2$.
- Gradients: `x1.grad` will be 6 (since $x_1 = 1$), and `x2.grad` will be 24 (since $x_2 = 2$).
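As a side note, the same partial derivatives can also be obtained with `torch.autograd.grad`, which returns the gradients directly instead of accumulating them into the `.grad` attributes. A minimal sketch:

```python
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)
z = 3 * x1**2 + 2 * x2**3

# Returns a tuple of gradients, one per input, without touching .grad
dz_dx1, dz_dx2 = torch.autograd.grad(z, [x1, x2])
print(dz_dx1, dz_dx2)  # tensor(6.) tensor(24.)
```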
## Using Gradients in Optimization
In neural networks, gradients are used to update model parameters in order to minimize the loss function. This is done using an optimization algorithm like Stochastic Gradient Descent (SGD).
Consider a simple example where we want to minimize the following loss function:
$$\text{loss} = (w - 2)^2$$
The derivative of the loss function with respect to $w$ is:
$$\frac{d(\text{loss})}{dw} = 2(w - 2)$$
Here’s how we can compute this gradient and update the parameter $w$ using PyTorch:
```python
import torch
import torch.optim as optim

# Define a tensor with requires_grad=True
w = torch.tensor(1.0, requires_grad=True)

# Define the loss function
loss = (w - 2)**2

# Define an optimizer
optimizer = optim.SGD([w], lr=0.1)

# Perform one optimization step
optimizer.zero_grad()  # Zero the gradients
loss.backward()        # Compute the gradients
optimizer.step()       # Update the parameter w

# Print the updated value of w
print(w)
```
### Explanation
- Loss Function: The loss function $\text{loss} = (w - 2)^2$ measures how far the current value of $w$ is from the target value (2 in this case).
- Gradient Calculation: `loss.backward()` computes the gradient $\frac{d(\text{loss})}{dw} = 2(w - 2)$.
- Optimizer: The SGD optimizer updates $w$ by subtracting the gradient multiplied by the learning rate (0.1 in this case).
- Updated Parameter: Starting from $w = 1$, the gradient is $2(1 - 2) = -2$, so the update gives $w = 1 - 0.1 \times (-2) = 1.2$, which is closer to the value that minimizes the loss.
## Zeroing the Gradients
When performing multiple optimization steps, the gradients will accumulate by default. To prevent this, you should zero the gradients before each backward pass:
```python
w.grad.zero_()
```
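To see why this matters in practice, here is a minimal sketch that repeats the SGD example from the previous section for several steps, zeroing the gradients before each backward pass. Either `w.grad.zero_()` or `optimizer.zero_grad()` does the job; the sketch uses the optimizer form, and the step count is an illustrative choice, not from the original text.

```python
import torch
import torch.optim as optim

w = torch.tensor(1.0, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

for step in range(20):
    optimizer.zero_grad()    # zero the accumulated gradients
    loss = (w - 2) ** 2      # recompute the loss with the current w
    loss.backward()          # compute d(loss)/dw
    optimizer.step()         # update w

print(w)  # w approaches 2, the minimizer of the loss
```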
## The `detach()` Function
The `detach()` function creates a new tensor that shares the same data but does not require gradients. This is useful when you want to perform operations that should not affect the gradient computation.
### Example
```python
import torch

# Define a tensor with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)

# Define a function
y = x ** 2

# Detach the tensor from the computation graph
y_detached = y.detach()

# Perform operations on the detached tensor
z = y_detached + 5

# Print results
print(f'y: {y}')                    # Tensor with gradients
print(f'y_detached: {y_detached}')  # Tensor without gradients
print(f'z: {z}')                    # Resulting tensor after operation
```
### Explanation
- `y.detach()`: This creates a new tensor `y_detached` that shares the same data as `y` but does not track gradients.
- Use Case: This is helpful when you need to perform certain operations on tensors without affecting the gradient computation.
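As a further illustration (an added example, not from the original text), detaching part of a computation stops gradients from flowing through that part:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# The first term is tracked by autograd; the detached term is treated
# as a constant, so it contributes nothing to the gradient.
loss = y + 2 * y.detach()
loss.backward()

# Only the tracked term contributes: d(loss)/dx = 2x = 6 (not 6x = 18)
print(x.grad)  # tensor(6.)
```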
## Conclusion
Understanding derivatives and how to handle gradients in PyTorch is fundamental for training and optimizing neural networks. PyTorch’s autograd package makes it easy to compute and use these gradients.