Sõnastik

Valige vasakul üks märksõnadest ...

Machine LearningNeural Networks

Lugemise aeg: ~45 min

There is yet another way to build complex models out of more basic ones: composition. We can use outputs of one model as inputs of the next model, and we train the full composition to map training inputs to training outputs.

Using linear functions as our building blocks is a dead end:

Example
An affine function from \mathbb{R}^t to \mathbb{R}^s is a function of the form \mathbf{x}\mapsto W\mathbf{x} + \mathbf{b}, where W is an s\times t matrix and \mathbf{b} \in \mathbb{R}^t. Show that a composition of affine functions is affine.

Solution. We have W_2(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = W_2W_1 \mathbf{x} + (W_1 \mathbf{b}_1 + \mathbf{b}_2), which is of the form (matrix times \mathbf{x} plus vector).

This example shows that composing affine functions does not yield any new functions. We will introduce nonlinearity by applying a fixed function K: \mathbb{R} \to \mathbb{R} componentwise after each affine map application. We call K the activation function. The modern default activation to use is the ReLU function K(x) = \max(0,x). We borrow some Julia syntax and write K. for the function which applies K pointwise:

\begin{align*} K.([x_1, \ldots, x_t]) = [K(x_1), \ldots, K(x_t)].\end{align*}

Given an affine map \mathbf{x}\mapsto W \mathbf{x} + \mathbf{b}, we call W the weight matrix and \mathbf{b} the bias vector.

Example
Suppose that A_1(\mathbf{x}) = \left[\begin{smallmatrix} 3 & -2 \\ 1 & 4 \end{smallmatrix}\right]\mathbf{x} + \left[\begin{smallmatrix} 1 \\ 1 \end{smallmatrix}\right] and A_2(\mathbf{x}) = \left[\begin{smallmatrix} -4 & 0 \\ 3 & 1 \end{smallmatrix}\right]\mathbf{x} + \left[\begin{smallmatrix} -2 \\ 2 \end{smallmatrix}\right]. Find (A_2 \circ K. \circ A_1)\left(\left[\begin{smallmatrix} -2 \\ -4 \end{smallmatrix}\right]\right), where K is the ReLU activation function.

Solution. We have A_1(\mathbf{x}) = \left[\begin{smallmatrix} 3 \\ -17 \end{smallmatrix}\right]. Applying K to each component yields \left[\begin{smallmatrix} 3 \\ 0 \end{smallmatrix}\right]. Finally, applying A_2 yields

\begin{align*} \left[\begin{smallmatrix} -4 & 0 \\ 3 & 1 \end{smallmatrix}\right] \left[\begin{smallmatrix} 3 \\ 0 \end{smallmatrix}\right] + \left[\begin{smallmatrix} -2 \\ 2 \end{smallmatrix}\right] = \boxed{\left[\begin{smallmatrix} -14 \\ 11 \end{smallmatrix}\right]}.\end{align*}

We will use a diagram to visualize our neural net as a composition of maps. We include the sequence of alternating affine and activation maps, and we also include one final map which associates a real-valued cost with each output vector. Given a training observation (\mathbf{x}_i, \mathbf{y}_i), let's consider the cost C_i(\mathbf{y}) = |\mathbf{y} - \mathbf{y}_i|^2 (which measures squared distance from the vector \mathbf{y} output by the neural net and the desired vector \mathbf{y}_i). Our goal will be to find values for the weights and biases which yield a small average value for C(N(\mathbf{x}_i), \mathbf{y}_i) as (\mathbf{x}_i, \mathbf{y}_i) ranges over the set of training observations.

Definition
A neural network function N: \mathbb{R}^p \to \mathbb{R}^q is a composition of affine transformations and componentwise applications of a function K: \mathbb{R} \to \mathbb{R}.

The architecture of a neural network is the sequence of dimensions of the domains and codomains of its affine maps.

In principle, we have fully specified a neural network learner: given a set of training observations and a choice of architecture, we can ask for the weights and biases which minimize the average cost over the training observations. However, this is not the end of the story, for two reasons: neural net cost minimization is not a convex problem, and finding a global minimum is typically not feasible. Furthermore, even if the global minimum can be found, it is typically overfit. Therefore, building an effective neural net requires finessing not only the setup (architecture, choice of activation function, choice of cost function) but also the details of the optimization algorithm.

Neural networks are typically trained using stochastic gradient descent. The idea is to determine for each training observation the desired nudges to each weight and bias to reduce the cost for that observation. Rather than performing this calculation for every observation in the training set (which is, ideally, enormous), we do it for a randomly selected subset of the training data.

The main challenge in implementing stochastic gradient descent is the bookkeeping associated with all the nodes in the neural net diagram. We will use matrix differentiation to simplify that process. Let's begin by defining a neural net data type and writing some basic methods for it.

Example
Create data types in Julia to represent affine maps and neural net functions. Write an architecture method for neural nets which returns the sequence of dimensions of the domains and codomains of its affine maps.

Solution. We supply a NeuralNet with the sequence of affine maps, the activation function, and also the activation function's derivative. We write call methods for the AffineMap and NeuralNet types so they can be applied as functions to appropriate inputs. (One of these methods refers to a function we will define in the next example.)

using LinearAlgebra, StatsBase, Test
struct AffineMap
    W::Matrix
    b::Vector
end
struct NeuralNet
    maps::Vector{AffineMap}
    K::Function # activation function
    K̇::Function # derivative of K
end
(A::AffineMap)(x) = A.W * x + A.b
(NN::NeuralNet)(x) = forwardprop(NN,x)[end]
architecture(NN::NeuralNet) = [[size(A.W,2) for A in NN.maps];
                                size(last(NN.maps).W,1)]

Now we can set up an example neural network to use for tests later.

W₁ = [3.0 -4; -2 4]
W₂ = [1.0 -1]
b₁ = [1.0, -4]
b₂ = [0.0];
K(x) = x > 0 ? x : 0
K̇(x) = x > 0 ? 1 : 0
NN = NeuralNet([AffineMap(W₁, b₁), AffineMap(W₂, b₂)], K, K̇)

Successively applying the maps in the neural network is called forward propagation.

Example
Write a Julia function forwardprop which calculates the sequence of vectors obtained by applying each successive map in the neural net (in other words, the vectors which are output from each map and input into the next one).

Solution. We store the sequence of values we obtain in an array called vectors. The very last map is the identity function rather than K., so it must be handled separately.

function forwardprop(NN::NeuralNet,x)
    activations = [x]
    for (j,A) in enumerate(NN.maps)
        push!(activations,A(activations[end]))
        K = j < length(NN.maps) ? NN.K : identity
        push!(activations,K.(activations[end]))
    end
    activations
end

activations = forwardprop(NN, [-2.0, -3])
@test activations == [[-2, -3], [7, -12], [7, 0], [7], [7]]

To adjust the weights and biases of the neural net in a favorable direction, we want to compute for every node \nu the derivative of the value in the cost node with respect to the value in node \nu. The node which is easiest to compute the derivative for is the prediction vector node, since the only map between that node and the cost is C_i.

More generally, given the derivative value for any particular node \nu, we can calculate the derivative for the node \nu_{\text{left}} immediately to its left, using the chain rule. A small change \mathrm{d}\mathbf{u} in the value \mathbf{u} at node \nu_{\text{left}} induces a small change \frac{\partial S}{\partial \mathbf{u}}\mathrm{d}\mathbf{u} in the value \mathbf{v} at node \nu (where S denotes the map between \nu and \nu_{\text{left}}). This change in turn induces a change \frac{\partial T}{\partial \mathbf{v}}\frac{\partial S}{\partial \mathbf{u}}\mathrm{d}\mathbf{u} in the cost value (where T denotes the map from \nu to the cost node). In other words, the net result of the small change \mathrm{d}\mathbf{u} is the product of the two derivative matrices and \mathrm{d}\mathbf{u}.

So we can work from right to left in the diagram and calculate the derivative values for all of the nodes. This is called backpropagation. We will just need to calculate the derivatives of the maps between adjacent nodes.

Example

  • Find the derivative of C_i(\mathbf{y}) = |\mathbf{y} - \mathbf{y}_i|^2 with respect to \mathbf{y}.
  • Find the derivative of K. (the map which applies K pointwise).
  • Find the derivative of \mathbf{u} \mapsto W \mathbf{u} + \mathbf{b}, where W is a matrix and \mathbf{b} is a vector.

Solution.

  • The derivative of C_i is

\begin{align*} \frac{\partial}{\partial \mathbf{y}}\left[(\mathbf{y} - \mathbf{y}_i)' (\mathbf{y} - \mathbf{y}_i)\right] = 2(\mathbf{v} - \mathbf{y}_i)',\end{align*}

by the product rule.

  • The derivative with respect to \mathbf{x} has (i,j) th entry \frac{\partial (K.(\mathbf{x})_{i})}{\partial x_j}, which is equal to 0 if i \neq j and \dot K(x_i) if i = j (where we're borrowing from physics the notation \dot K for the derivative of K). In other words, the derivative of K. with respect to \mathbf{x} is \operatorname{diag} \dot K.(\mathbf{x}). (Here \operatorname{diag} means "form a diagonal matrix with this vector's entries on the diagonal", the dot on top means "derivative", and the subscript dot means "apply componentwise".)
  • The derivative of W \mathbf{u} + \mathbf{b} with respect to \mathbf{u} is \frac{\partial (W \mathbf{u})}{\partial \mathbf{u}} + \frac{\partial \mathbf{b}}{\partial \mathbf{u}}= W + 0 = W.

Example
Write a Julia function backprop which calculates the for each node the derivative of the cost function with respect to the value at that node. In other words, calculate the derivative of the composition of maps between that node and the cost node at the end.

Solution. We can use all the derivatives we calculated in the previous example. We define functions which return the index of the j th green node and the j th purple node in the diagram, for convenience.

"Index of jth green node"
greennode(j) = 2j-1
"Index of jth purple node"
purplenode(j) = 2j

"""
Compute the gradient of each composition of maps from
an intermediate node to the final output.
"""
function backprop(NN::NeuralNet,activations,yᵢ)
    y = activations[end]
    grads = [2(y-yᵢ)']
    for j in length(NN.maps) : -1 : 1
        if j == length(NN.maps)
            push!(grads, grads[end])
        else
            push!(grads, grads[end] *
              Diagonal(NN.K̇.(activations[purplenode(j)])))
        end
        push!(grads, grads[end] * NN.maps[j].W)
    end
    reverse(grads)
end

gradients = backprop(NN, activations, [2.0])
@test gradients ==
          [[30, -40]', [10, 0]', [10, -10]', [10]', [10]']

With the derivative values computed for all of the nodes on the bottom row, we can calculate the derivative of the cost function with respect to the weights and biases. To find changes in the cost function with respect to changes in a weight matrix W_j, we need to introduce the idea of differentiating with respect to a matrix.

Exercise
Given f:\mathbb{R}^{m\times n} \to \mathbb{R}, we define

\begin{align*} \frac{\partial}{\partial W}f(W) = \begin{bmatrix} \frac{\partial f}{\partial w_{1,1}} & \cdots & \frac{\partial f}{\partial w_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial w_{m,1}} & \cdots & \frac{\partial f}{\partial w_{m,n}} \end{bmatrix},\end{align*}

where w_{i,j} is the entry in the i th row and j th column of W. Suppose that \mathbf{u}' is a 1 \times m row vector and \mathbf{v} is an n \times 1 column vector. Show that we have

\begin{align*} \frac{\partial}{\partial W}(\mathbf{u}'W \mathbf{v}) = \mathbf{u} \mathbf{v}'.\end{align*}

Solution. We have

\begin{align*}\mathbf{u}' A \mathbf{v} = w_{1,1}u_1v_1 + w_{1,2}u_1v_2 + \cdots + w_{m,n} u_m v_n. \end{align*}

Therefore, the derivative with respect to w_{i,j} is u_iv_j. This is also the (i,j) th entry of \mathbf{u} \mathbf{v}', so the purported equality holds.

If \mathbf{f} is a vector-valued function of a matrix W, then \frac{\partial}{\partial W}(\mathbf{f}(\mathbf{v})) is "matrix" whose (i,j,k) th entry is \frac{\partial f_i}{\partial w_{j,k}}. As suggested by the scare quotes, this is not really a matrix, since it has three varying indices instead of two. This is called a third-order tensor, but we will not develop this idea further, since the only property we will need is suggested by the notation and the result of the exercise above: differentiating W \mathbf{v} with respect to W and left-multiplying by a row vector \mathbf{u}' has the following net effect:

\begin{align*} \mathbf{u}'\frac{\partial}{\partial W}(W \mathbf{v}) = \frac{\partial}{\partial W}(\mathbf{u}W \mathbf{v}) = \mathbf{u} \mathbf{v}'.\end{align*}

Example
Write two functions weight_gradients and bias_gradients which compute the derivatives with respect to W_j and \mathbf{b}_j of C_i(N(\mathbf{x_i})).

Solution. In each case, we calculate the derivative of the map to the next node (either W_j \mapsto W_j \mathbf{x} + \mathbf{b}_j or \mathbf{b} \mapsto W_j \mathbf{x} + \mathbf{b}_j) and left-multiply by the derivative of the cost function with respect to the value at that node. The derivative of W_j \mathbf{x} + \mathbf{b}_j with respect to \mathbf{b}_j is the identity matrix, and by the exercise above, differentiating W \mathbf{v} with respect to W and left-multiplying by \mathbf{u} yields \mathbf{u}' \mathbf{v}':

function weight_gradients(NN::NeuralNet,activations,gradients)
    [gradients[purplenode(j)]' * activations[greennode(j)]' for j in 1:length(NN.maps)]
end

function bias_gradients(NN::NeuralNet,gradients)
    [gradients[purplenode(j)]' for j in 1:length(NN.maps)]
end
  
@test weight_gradients(NN, activations, gradients) ==
		[[-20 -30; 0 0], [70 0]]
        
@test bias_gradients(NN, gradients) == [[10, 0], [10]]

Note that we have to transpose grads[purplenode(j)] since it is a row vector to begin with.

We are now set up to train the neural network.

Example
Write a function train which performs stochastic gradient descent: for a randomly chosen subset of the training set, determine the average desired change in weights and biases to reduce the cost function. Update the weights and biases accordingly and perform a specified number of iterations.

Your function should take 7 arguments: (1) desired architecture, (2) the activation function K, (3) the derivative \dot K of the activation function K, (4) an array of training observations, (5) the batch size (the number of observations to use in each randomly chosen subset used in a single stochastic gradient descent iteration), (6) the learning rate \epsilon, and (7) the desired number of stochastic gradient descent iterations.

Solution. We write a function which calculates the average suggested changes in the weights and biases and call it from inside the train function.

function suggested_param_changes(NN::NeuralNet, observations,
                                 batch, ϵ)
    arch = architecture(NN)
    n_layers = length(arch)-1
    sum_Δweight = [zeros(arch[i+1],arch[i]) for i in 1:n_layers]
    sum_Δbias = [zeros(arch[i+1]) for i in 1:n_layers]
    for k in batch
        x, y = observations[k]
        activations = forwardprop(NN,x)
        gradients = backprop(NN,activations,y)
        ΔWs = -ϵ*weight_gradients(NN,activations,gradients)
        Δbs = -ϵ*bias_gradients(NN,gradients)
        for i in 1:n_layers
            sum_Δweight[i] += ΔWs[i]
            sum_Δbias[i] += Δbs[i]
        end
    end
    (sum_Δweight, sum_Δbias) ./ length(batch)
end

Now we can write train. We initialize the biases to 0.1, and the affine map entries are sampled independently from the standard normal distribution.

function train(arch, K, K̇, observations, batchsize,
               ϵ = 0.1, n_iterations = 1000)
    random_maps = [AffineMap(randn(arch[i+1],arch[i]),
                     fill(0.1,arch[i+1])) for i in 1:length(arch)-1]
    NN = NeuralNet(random_maps, K, K̇)
    for i in 1:n_iterations
        batch = sample(1:length(observations), batchsize)
        meanΔweight, meanΔbias =
            suggested_param_changes(NN, observations, batch, ϵ)
        NN = NeuralNet(
             [AffineMap(A.W + ΔW, A.b + Δb) for (A,ΔW,Δb) in
                 zip(NN.maps, meanΔweight, meanΔbias)], K, K̇
        )
    end
    NN
end

Example
Try training your model on some data which are sampled by taking \mathbf{X} uniform in the unit square and setting Y = 1-|\mathbf{X}|^2.

Solution. We sample our data:

using Random; Random.seed!(123)
xs = [rand(2) for i in 1:1000]
ys = [[1-x'*x] for x in xs]
observations = collect(zip(xs,ys))

Then we choose an architecture, a batch size, and a learning rate, and we train our model on the data:

arch = [2,5,1]
batchsize = 100
ϵ = 0.005
NN = train(arch, K, K̇, observations, batchsize, ϵ, 10_000)

Finally, we inspect the result.

cost(NN::NeuralNet,observations) = mean(norm(NN(x)-y)^2 for (x,y) in observations)
cost(NN,observations) # returns 0.00343
xgrid = 0:1/2^8:1
ygrid = 0:1/2^8:1
zs = [first(NN([x,y])) for x=xgrid,y=ygrid]
using Plots; pyplot()
surface(xgrid,ygrid,zs)

We can see that the resulting graph of N fits the points reasonably well. We can see that the graph is piecewise linear, with the creases between the linear pieces corresponding to points where of the affine maps in the net returns zero.

Bruno
Bruno Bruno