Autodiff

Posted by : Sushanth Sunday, 9 January 2022

Backpropagation involves a lot of differentiation, and implementing backprop by hand is like programming in assembly language.

An autodiff library automates this work: it computes the derivatives of your code for you, so you do not have to derive and implement them by hand.

Automatic differentiation (autodiff)

It refers to a general way of taking a program that computes a value and automatically constructing a procedure for computing derivatives of that value.
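As a small illustration of this idea, the sketch below treats an ordinary Python function as the "program" and asks PyTorch for its derivative. The function f and the evaluation point 1.5 are arbitrary choices made for this example; the hand-derived formula is included only as a cross-check.

```python
import math
import torch

def f(x):
    # An ordinary program built from differentiable tensor operations.
    return x ** 2 * torch.sin(x)

x = torch.tensor(1.5, requires_grad=True)
value = f(x)

# torch.autograd.grad returns a tuple with one gradient per input.
(dfdx,) = torch.autograd.grad(value, x)

# Hand-derived derivative for comparison: 2x sin(x) + x^2 cos(x).
expected = 2 * 1.5 * math.sin(1.5) + 1.5 ** 2 * math.cos(1.5)
print(dfdx.item(), expected)
```

The two printed numbers should agree up to float32 rounding, even though no derivative formula was ever written into f.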

Backpropagation

Backpropagation is the special case of autodiff applied to neural nets, but in machine learning we often use backprop synonymously with autodiff.

An autodiff system will convert the program into a sequence of primitive operations which have specified routines for computing derivatives. In this representation, backprop can be done in a completely mechanical way.
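To make the "primitive operations with derivative routines" idea concrete, here is a minimal toy sketch in plain Python. The Var class and its fields are hypothetical names invented for this illustration and are not part of PyTorch; it supports only addition and multiplication of scalars.

```python
class Var:
    """A scalar value that remembers which primitives produced it."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent_var, local_derivative_of_self_wrt_parent).
        self.parents = parents

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the graph so every node is processed before its parents,
        # then apply the chain rule mechanically from output to inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


x = Var(1.0)
y = x + 5        # primitive: add
z = y * y + 1    # primitives: mul, add
z.backward()
print(z.value, x.grad)  # 37.0 12.0
```

Each primitive only knows its own local derivative; the backward pass needs no knowledge of the overall expression.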

Autograd:

Autograd is PyTorch's engine for calculating derivatives (vector-Jacobian products, to be more precise). It records every operation performed on a gradient-enabled tensor and builds a directed acyclic graph called the dynamic computational graph. The leaves of this graph are the input tensors and the roots are the output tensors. Gradients are calculated by tracing the graph from root to leaf and multiplying the gradients along the way using the chain rule.
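The recorded graph can be inspected directly. The following sketch uses a tiny made-up computation (the tensor names are chosen only for this example) to show which tensors are leaves and how backward nodes link to each other.

```python
import torch

x = torch.ones(2, requires_grad=True)   # leaf: created directly by the user
y = x * 3                               # intermediate: produced by an operation
out = y.sum()                           # root: the scalar we differentiate

print(x.is_leaf, x.grad_fn)             # a leaf has no grad_fn
print(y.is_leaf, type(y.grad_fn).__name__)
print(type(out.grad_fn).__name__)

# next_functions links a backward node to the backward nodes of its inputs.
print(out.grad_fn.next_functions)
```

Following next_functions from the root leads, step by step, back to the leaf tensors where gradients are accumulated.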

Autodiff in PyTorch:

import torch

# requires_grad=True tells autograd to track operations on x,
# so derivatives w.r.t. x can be computed later
x = torch.ones([3, 2], requires_grad=True)
print(x)
print(x.requires_grad)

y = x + 5  # y is a function of x, so it also has requires_grad=True
print(y)
print(y.requires_grad) 

z = y*y + 1
print(z)
print(z.requires_grad)

t = torch.sum(z) # grad_fn=<SumBackward0> is used for book keeping
print(t) 

Backward() function

Backward is the function which actually calculates the gradient by passing its argument (a 1x1 unit tensor by default) through the backward graph all the way up to every leaf node traceable from the calling root tensor. The calculated gradients are then stored in .grad of every leaf node. Remember, the backward graph is already built dynamically during the forward pass. The backward function only calculates the gradients using that existing graph and stores them in the leaf nodes.

# to compute gradient - do a backward pass
t.backward()

print(x.grad)  # derivative of t w.r.t x

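$$t = \sum_i z_i, \qquad z_i = y_i^2 + 1, \qquad y_i = x_i + 5$$

$$\frac{\partial t}{\partial x_i} = \frac{\partial z_i}{\partial x_i} = \frac{\partial z_i}{\partial y_i}\,\frac{\partial y_i}{\partial x_i} = 2 y_i \times 1$$

At $x_i = 1$, $y_i = 6$, so $\frac{\partial t}{\partial x_i} = 12$.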
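The derivation can be checked against autograd. This short sketch repeats the computation and compares x.grad with the analytic expression 2(x + 5); the variable name analytic is introduced only for this check.

```python
import torch

x = torch.ones([3, 2], requires_grad=True)
y = x + 5
t = torch.sum(y * y + 1)
t.backward()

# Analytic gradient from the derivation: dt/dx_i = 2 * y_i = 2 * (x_i + 5)
analytic = 2 * (x.detach() + 5)
print(torch.allclose(x.grad, analytic))
print(x.grad)
```

Every entry of x.grad is 12, matching the hand calculation.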

===================================

x = torch.ones([3, 2], requires_grad=True)
y = x + 5
r = 1/(1 + torch.exp(-y))  # r: sigmoid of y
print(r)
s = torch.sum(r) 
s.backward()
print(x.grad) # partial derivative of s w.r.t x
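As a sanity check on the sigmoid example, the sketch below compares autograd's result with the well-known closed form r(1 - r) for the sigmoid derivative. It uses torch.sigmoid, the built-in equivalent of the hand-written expression, and the tolerance is an arbitrary choice for float32.

```python
import torch

x = torch.ones([3, 2], requires_grad=True)
r = torch.sigmoid(x + 5)   # same function as 1 / (1 + exp(-y)) with y = x + 5
r.sum().backward()

# Closed form: d sigmoid(y)/dy = r * (1 - r), and dy/dx = 1
analytic = (r * (1 - r)).detach()
print(torch.allclose(x.grad, analytic, atol=1e-6))
```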
========================================
x = torch.ones([3, 2], requires_grad=True)
y = x + 5
r = 1/(1 + torch.exp(-y))

# r.backward() # Error: grad can be implicitly created only for scalar outputs

# for a non-scalar tensor, backward() needs a gradient argument of the same shape;
# that argument is the upstream gradient fed into the chain rule
a = torch.ones([3, 2])
r.backward(a)
print(x.grad)
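$$\frac{\partial s}{\partial x} = \frac{\partial s}{\partial r} \cdot \frac{\partial r}{\partial x}$$

In the code above, a represents $\partial s / \partial r$, so x.grad directly gives $\partial s / \partial x$.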
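One way to see that the argument of backward() plays the role of the upstream gradient is to scale it. In this sketch (the names grad_ones and grad_twos are made up for the example), doubling the argument doubles the resulting x.grad.

```python
import torch

x = torch.ones([3, 2], requires_grad=True)
r = torch.sigmoid(x + 5)

# All ones reproduces the gradient of r.sum() w.r.t. x.
a = torch.ones([3, 2])
r.backward(a, retain_graph=True)   # keep the graph so backward can run again
grad_ones = x.grad.clone()

x.grad.zero_()                     # clear the accumulated gradient
r.backward(2 * a)                  # upstream gradient of 2 for every element
grad_twos = x.grad.clone()

print(torch.allclose(grad_twos, 2 * grad_ones))
```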
=======================================
Disabling gradient calculation with torch.no_grad() is useful for inference, when you are sure that you will not call Tensor.backward(). It reduces memory consumption for computations that would otherwise have requires_grad=True.
In this mode, the result of every computation will have requires_grad=False, even when the inputs have requires_grad=True.
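A minimal sketch of this behaviour, using torch.no_grad() and throwaway tensors named only for the example:

```python
import torch

x = torch.ones(3, requires_grad=True)

y = x * 2
print(y.requires_grad)   # True: the operation was recorded

with torch.no_grad():
    z = x * 2
print(z.requires_grad)   # False: nothing was recorded inside the block
print(z.grad_fn)         # None: z has no link back to x
```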
learning_rate = 0.01

w = torch.tensor([1.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

print(w.item(), b.item())

for i in range(50):

    x = torch.randn([20,1])
    y = 3*x -2

    y_hat = w*x + b
    loss = torch.sum((y_hat-y)**2)

    loss.backward()

    # no_grad: without it, PyTorch would record the parameter updates below
    # as part of the computation graph. They are plain in-place updates of
    # w and b, not operations to differentiate through.
    # Also zero the gradients so the next iteration starts fresh.
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

        w.grad.zero_()
        b.grad.zero_()

    print(w.item(), b.item())
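The reason the loop zeroes the gradients is that backward() adds to .grad rather than overwriting it. A small sketch with a single made-up parameter shows the accumulation:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()     # gradient after one backward pass

(3 * w).sum().backward()   # no zeroing in between
second = w.grad.clone()    # the new gradient was added to the old one

print(first.item(), second.item())  # 3.0 6.0
```

Without the zero_() calls, each update step in the training loop would use the sum of all gradients computed so far.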
===============================================================
%%time

learning_rate = 0.001
N = 10000000
epochs = 200

w = torch.rand([N], requires_grad=True)
b = torch.ones([1],requires_grad=True)

print(torch.mean(w).item(), b.item())

for i in range(epochs):
    x = torch.randn([N])
    y = torch.dot(3*torch.ones([N]),x) - 2

    y_hat = torch.dot(w,x) + b
    loss = torch.sum((y_hat - y) ** 2)

    loss.backward()

    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    
        w.grad.zero_()
        b.grad.zero_()
================================================================
%%time
cuda0 = torch.device('cuda:0')  # assumes a CUDA-capable GPU is available
learning_rate = 0.001
N = 10000000
epochs = 200

w = torch.rand([N], requires_grad=True, device=cuda0)
b = torch.ones([1], requires_grad=True, device=cuda0)

# print(torch.mean(w).item(), b.item())

for i in range(epochs):
  
  x = torch.randn([N], device=cuda0)
  y = torch.dot(3*torch.ones([N], device=cuda0), x) - 2
  
  y_hat = torch.dot(w, x) + b
  loss = torch.sum((y_hat - y)**2)
  
  loss.backward()
  
  with torch.no_grad():
    w -= learning_rate * w.grad
    b -= learning_rate * b.grad
    
    w.grad.zero_()
    b.grad.zero_()
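The GPU version above fails on a machine without CUDA. One common pattern, sketched here with a much smaller tensor size chosen only for illustration, is to pick the device at run time and fall back to the CPU:

```python
import torch

# Use the first GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

w = torch.rand([1000], requires_grad=True, device=device)
x = torch.randn([1000], device=device)

loss = torch.dot(w, x) ** 2
loss.backward()

# The gradient lives on the same device as the parameter.
print(device.type, w.grad.device.type)
```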
==================================================================
import torch
import math

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# Create Tensors to hold input and outputs.
# By default, requires_grad=False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Create random Tensors for weights. For a third order polynomial, we need
# 4 weights: y = a + b x + c x^2 + d x^3
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
b = torch.randn((), device=device, dtype=dtype, requires_grad=True)
c = torch.randn((), device=device, dtype=dtype, requires_grad=True)
d = torch.randn((), device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y using operations on Tensors.
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ())
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call a.grad, b.grad, c.grad and d.grad will be Tensors holding
    # the gradient of the loss with respect to a, b, c, d respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    with torch.no_grad():
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        c -= learning_rate * c.grad
        d -= learning_rate * d.grad

        # Manually zero the gradients after updating weights
        a.grad = None
        b.grad = None
        c.grad = None
        d.grad = None

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')
