Automatic Differentiation
Suppose that $f$ is a function whose definition is too complicated for us to feasibly differentiate symbolically. Perhaps the most straightforward way to approximate the derivative of $f$ is to calculate the difference quotient

$$\frac{f(x+\epsilon) - f(x)}{\epsilon}$$

for a small value of $\epsilon$. However, this approach is very inaccurate because the subtraction step is ill-conditioned.
Exercise
Use difference quotients to approximate the derivative of $f(x) = x^2$ at $x = 2/3$, with $\epsilon = 2^k$ as $k$ ranges from $-60$ to $-20$. What is the least error over these values of $\epsilon$? How does that error compare to machine epsilon?
Solution. The block
```julia
diffquotient(f,x,ϵ) = (f(x+ϵ) - f(x))/ϵ
m = minimum([abs(diffquotient(x->x^2,2/3,2.0^k) - 4/3) for k = -60:-20])
```
returns the least error over the given range of $\epsilon$ values. This error is more than ten million times larger than we could hope for just from roundoff error:
```julia
m / (nextfloat(4/3) - 4/3)
```
The problem with difference quotients is that the relative error of the difference $f(x+\epsilon) - f(x)$ blows up as $\epsilon \to 0$ (due to catastrophic cancellation). Larger values of $\epsilon$ are inaccurate because of truncation error. Even at the optimal value of $\epsilon$, the precision is still poor.
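We can see both effects by computing the error across a wide range of $\epsilon$ values. Here is a quick sketch reusing the `diffquotient` function from the exercise above:

```julia
# Error of the difference quotient for f(x) = x^2 at x = 2/3, with ϵ = 2^k.
# Very small ϵ suffers from catastrophic cancellation; larger ϵ from truncation error.
errs = [abs(diffquotient(x -> x^2, 2/3, 2.0^k) - 4/3) for k in -60:-1]
findmin(errs)   # least error, and the index at which it occurs
```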
On the other hand, the problem of calculating the derivative of $f$ is well-conditioned as long as the condition number of $f'$ isn't too large. So the difference quotient algorithm is unstable, and we can hope for a stable alternative.
Another straightforward alternative to difference quotients is to calculate the derivative symbolically, the way introductory calculus students learn to do it. For example, the derivative of $x^2 + 8x$ is $2x + 8$. However, this approach quickly becomes untenable as the functions get sufficiently complicated, as is typically the case in modern machine learning applications.
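For instance, here is a minimal sketch of symbolic differentiation in Julia, assuming the Symbolics.jl package is installed:

```julia
using Symbolics

@variables x                      # declare a symbolic variable
D = Differential(x)               # the differentiation operator d/dx
expand_derivatives(D(x^2 + 8x))   # returns the symbolic expression 2x + 8
```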
Indeed, there is an approach to derivative computation which is precise, fast, and scalable: automatic differentiation. The idea is to replace `Float64`s with objects called dual numbers, which track values and derivatives at the same time.
One concrete way to realize dual numbers is to use the matrix

$$\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}$$

in place of $x$ in the program that computes $f(x)$. This requires that any internal calculations performed by $f$ are able to handle matrices as well as plain numbers. The matrix resulting from this calculation will be equal to

$$\begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix},$$

allowing us to read off the derivative $f'(x)$ as the top-right entry:
```julia
f(x) = x^2 + 8x
f([5 1; 0 5])
```
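Reading off the top-right entry gives the derivative; since $f'(x) = 2x + 8$, we expect $f'(5) = 18$:

```julia
f([5 1; 0 5])[1,2]   # 18, matching f'(5) = 2*5 + 8
```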
Exercise
In this exercise, we will explain why

$$f\left(\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}\right) = \begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix}$$

for any polynomial $f$.
- Check that the above equation holds for the identity function (the function which returns its input) and for the function which returns the multiplicative identity (in other words, 1 for real numbers, or the identity matrix for matrices).
- Check that if the above equation holds for two differentiable functions $f$ and $g$, then it holds for the sum $f + g$ and the product $fg$.
- Explain why the above equation holds for any polynomial function $f$.
Solution.
- If $f$ is the identity function, then both sides of the above equation reduce to $\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}$. If $f$ returns the multiplicative identity, then both sides reduce to the identity matrix.
- We have

$$(fg)\left(\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}\right) = \begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix}\begin{bmatrix} g(x) & g'(x) \\ 0 & g(x) \end{bmatrix} = \begin{bmatrix} f(x)g(x) & f(x)g'(x) + f'(x)g(x) \\ 0 & f(x)g(x) \end{bmatrix},$$

which in turn is equal to $\begin{bmatrix} (fg)(x) & (fg)'(x) \\ 0 & (fg)(x) \end{bmatrix}$ by the product rule. The result for $f + g$ works similarly, with linearity of the derivative in place of the product rule.
- The set of functions which satisfies the above equation includes the identity function and the constant function $1$, and it is closed under multiplication and addition. Therefore, this set of functions at least includes all polynomials (see the quick numerical check below).
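Here is the quick numerical check referred to above, using a sample polynomial chosen only for illustration:

```julia
# p(x) = x^3 - 2x + 1 has derivative p'(x) = 3x^2 - 2, so p'(1.5) = 4.75.
# one(x) plays the role of the multiplicative identity (1 for numbers, I for matrices).
p(x) = x^3 - 2x + one(x)
p([1.5 1; 0 1.5])[1,2]   # top-right entry: 4.75
```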
While the exercise above only addresses polynomial functions $f$, the relationship actually holds for many more functions, because many common functions may be described as limits of polynomials: if $A$ is a matrix, then

$$\exp(A) = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots$$
Since the identity is true for every truncation of the sum on the right-hand side, it's true for the exponential function as well.
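For example, the derivative of $\exp$ is $\exp$ itself, so the top-right entry of the matrix exponential of $\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}$ should equal $e^x$. A quick check at $x = 2$:

```julia
using LinearAlgebra   # provides exp for matrices
exp([2.0 1.0; 0.0 2.0])[1,2] ≈ exp(2.0)   # true
```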
Exercise
Use automatic differentiation to find the derivative of $f(t) = \mathrm{e}^{-t^2/4}\,(t^4 - 2t^3 - t^2 + 3t - 1)$ at the point $t = 2$. Compare your answer to the true value of $f'(2)$. Hint: You'll want to define $f$ using
```julia
using LinearAlgebra
f(t) = exp(-t^2/4) * (t^4 - 2t^3 - t^2 + 3t - I)
```
where `I` is an object in Julia which is defined to behave like the multiplicative identity (note that subtracting the identity matrix is the appropriate matrix analogue of subtracting $1$ from a real number).
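For instance, `I` adapts itself to the size of whatever matrix it meets:

```julia
using LinearAlgebra
[5 1; 0 5] - I   # [4 1; 0 4]
```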
Also, to help check your answer, here's the symbolic derivative of $f$:
```julia
df(t) = (-t^5 + 2*t^4 + 9*t^3 - 15*t^2 - 3*t + 6) * exp(-t^2/4)/2
```
Solution. We define $f$ as suggested and then calculate `f([2 1; 0 2])[1,2]`. The result is exactly the same as `df(2)` and differs from the higher-precision value `df(big(2))` only at the level of roundoff error. So we see that automatic differentiation gives a major improvement over the difference quotient approach in this instance.
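For concreteness, here is one way to carry out the comparison described above, reusing the `f` and `df` defined earlier:

```julia
d = f([2 1; 0 2])[1,2]   # derivative of f at t = 2 via the dual-number matrix
d == df(2)               # identical in Float64 arithmetic
abs(d - df(big(2)))      # discrepancy from the high-precision value is roundoff-sized
```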
In practice, you will usually want to use a library to perform automatic differentiation, because ensuring suitable dual-number awareness of all of the functions called by $f$ can be a daunting task. Julia has several packages for this purpose, including `ForwardDiff`. In Python you can use `autograd`, which works with the NumPy functions. (We note that these libraries don't actually use matrices to represent dual numbers; they introduce a custom type which has the same behavior.)
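For example, assuming the ForwardDiff package is installed, the derivative from the exercise above could be computed as follows (working with plain numbers, so we subtract 1 rather than `I`):

```julia
using ForwardDiff

g(t) = exp(-t^2/4) * (t^4 - 2t^3 - t^2 + 3t - 1)
ForwardDiff.derivative(g, 2.0)   # agrees with df(2) up to roundoff
```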