Sõnastik

Valige vasakul üks märksõnadest ...

Machine LearningDimension Reduction

Lugemise aeg: ~30 min

Real-world data sets typically have many more than 2 or 3 features, which rules out direct use of the visualizations we've used for the examples we've seen so far. However, we can often apply a map from \mathbb{R}^p to a low-dimensional space which preserves much of the structure of the original data set.

Let's begin with an example where both the original space and the reduced space are low-dimensional.

Example
Finding a map \phi from \mathbb{R}^2 to \mathbb{R} such that each point (x,y) shown can be reasonably accurately located if the value of \phi(x,y) is known. You may assume that the coordinates of the 100 points are stored in a 100 \times 2 matrix A.

Solution. Consider the line y = mx + b which, roughly speaking, runs along the primary axis of the ellipse-shaped point cloud. If we know how far along this line one of the points is, then we know pretty accurately where it's located.

More precisely, for each point, we let \phi(x,y) be the orthogonal projection of (x,y) onto the line y = mx + b. We would like to minimize the average squared error approximation \phi(x,y) \approx (x,y).

We use an idea from singular value decomposition: the line through the origin which minimizes the sum of squared distances is the first column of V in the singular value decomposition U \Sigma V' of A. The only problem is that in this example this line does not pass through the origin. To deal with this issue, we recenter the points by finding the mean of the coordinates of the points and subtracting it from the coordinates of each point, so that the transformed points are centered at the origin. We then perform SVD on this new matrix to identify the line which minimizes the sum of squared distances to the points; now these lines all cross the origin. Once the best line's unit vector \mathbf{v} is known, we can determine for each point how far along the line its projection is by taking its dot product with \mathbf{v}. In other words, we can project all of the points onto \mathbb{R}^1 by right-multiplying the matrix A by \mathbf{v}.

The idea we developed in this example is called principal component analysis. We subtract off the means to center the point cloud, apply the singular value decomposition, and consider the first k columns of the resulting V matrix (these columns are called the principal components of the feature matrix). The desired dimension reduction map from \mathbb{R}^p to \mathbb{R}^k is represented by the matrix V[:,1:k]' (that is, the matrix obtained by removing all but the first k columns of V and then transposing).

The three principal components of the given collection of points are shown in red, green, and gold, respectively. The red vector runs along the line whose sum of squared distances to the points is as small as possible, and the green vector minimizes the sum of squared distances to the points subject to the constraint of being orthogonal to the first vector.

Example
Apply principal component analysis to project the handwritten digit images in the MNIST dataset to the plane.

Solution. We begin by loading the dataset and reshaping the training data feature array.

using MLDatasets, Images, Plots
MNIST.download(i_accept_the_terms_of_use = true)
features, labels = MNIST.traindata(Float64)
A = reshape(features[:],28^2,60_000)'

Next, we define a custom function for displaying images. This function accepts a vector of length 28^2 = 784, it reshapes the entries into a square, and it returns an array of colors for display. If some of the components are negative, then negative and positive values are represented in different colors (red and blue, respectively). Otherwise, the image is displayed in grayscale.

function imshow(v)
    if any(v .< 0)
        (x -> x > 0 ? RGB(x,0,0) : RGB(0,0,-x)).(reshape(v./maximum(abs.(v)),(28,28))')
    else
        Gray.(reshape(v./maximum(abs.(v)),(28,28))')
    end
end

To perform principal component analysis, we take the column-wise mean with mean(A,dims=1) and subtract it from the matrix before performing the singular value decomposition.

using Statistics, LinearAlgebra
A = A[1:10_000,:] # make this computationally feasible on Binder
U, Σ, V = svd(A .- mean(A,dims=1))
imshow(V[:,1])

We can see the first principal component with imshow(V[:,1]), and similarly for the second principal component

The first two principal components.

Finally, we project onto each of the first two principal components and make a scatter plot of the results. We set the marker size (ms) and the marker stroke width (msw) so we can see the colors of the points more clearly.

n = 5000
scatter(A[1:n,:]*V[:,1],
        A[1:n,:]*V[:,2],
        group=labels[1:n],
        markersize=2,
        markerstrokewidth=0)

We can see that some digits cluster apart from the other digits (like 1), while others remain heavily overlapping.

t-SNE

In this section we discuss a popular dimensionality reduction technique which is often more effective than principal component analysis at discovering structure in the dataset. The idea is to choose a map which attempts to preserve pairwise similarity of the data points. The version we present is called t-SNE, which is short for t-distributed stochastic neighbor embedding.

Suppose that \mathbf{x}_1, \ldots, \mathbf{x}_n is a set of points in \mathbb{R}^p. We begin by fixing a parameter of the model, called the perplexity \rho. Given \sigma > 0, we define

\begin{align*} P_{i,j}(\sigma) = \frac{\operatorname{e}^{-|\mathbf{x}_i - \mathbf{x}_j|^2/(2\sigma^2)}}{\sum_{k\neq j} \operatorname{e}^{-|\mathbf{x}_k - \mathbf{x}_j|^2/(2\sigma^2)}}\end{align*}

for i\neq j, and P_{i,i}(\sigma) = 0 for all 1 \leq i \leq n. This quantity measures the similarity of distinct points \mathbf{x}_i and \mathbf{x}_j: if \mathbf{x}_i is closer to \mathbf{x}_j (compared to how close other points are to \mathbf{x}_j), then P_{i,j} is closer to 1. If \mathbf{x}_i is far from \mathbf{x}_j, then P_{i,j} is close to 0.

Example
Consider the points \mathbf{x}_1 = [0,0], \mathbf{x}_2 = [0,1], \mathbf{x}_3 = [1,1], and \mathbf{x}_4 = [4,0]. Find P_{2,1}(\sigma) for each value of \sigma in \left\{\frac{1}{4},1,2,100\right\}.

Solution. We define a function to compute P_{i,j}(\sigma).

x = [[0,0],[0,1],[1,1],[4,0]]
f(x,y,σ) = exp(-norm(x-y)^2/(2σ^2))
P(x,i,j,σ) = f(x[i],x[j],σ) / sum(f(x[k],x[j],σ) for k in 1:length(x) if k ≠ j)
[P(x,2,1,σ) for σ in [0.25, 1, 2, 100]]

We find that P_{2,1}(0.25) = 0.9997, P_{2,1}(1) = 0.6222, P_{2,1}(2) = 0.4912, and P_{2,1}(100) = 0.3334.

We can see from this calculation that \sigma serves as the unit with respect to which proximity is being measured. If \sigma is very large, then all of the points are effectively close to \mathbf{x}_1, so the values of P_{i,1}(\sigma) are approximately equal for i \in \{2,3,4\}. If \sigma is very small, then P_{i,1}(\sigma) is close to 1 for \mathbf{x}_1's nearest neighbor and is close to 0 for the other points.

For each j from 1 to n, we define \sigma_j to be the solution \sigma of the equation

\begin{align*}2^{-\sum_{i : i \neq j} P_{i,j}(\sigma) \log_2 P_{i,j}(\sigma)} = \rho.\end{align*}

The quantity \sum_i P_{i,j}(\sigma) \log_2 P_{i,j}(\sigma) on the left-hand side is called the Shannon entropy, and it measures how evenly distributed the values P_{1,j}(\sigma), P_{2,j}(\sigma), \ldots, P_{n,j}(\sigma) are. We fix the perplexity for each j to ensure that the function i\mapsto P_{i,j}(\sigma_j) avoids the extremes of too heavily concentrating its values on a small number of i's or spreading out its values too evenly across all of the i's.

We will denote by \mathbf{\tilde{x}}_1, \ldots, \mathbf{\tilde{x}}_n the images of the points \mathbf{x}_1, \ldots, \mathbf{x}_n under a map from the original feature space to a lower-dimensional Euclidean space (typically the plane or 3D space). We won't bother trying to define our dimension reduction map on the whole feature space; rather, we will let the domain of the map be just the set of training points. We begin by defining

\begin{align*} Q_{i,j} = \frac{(1 + |\mathbf{\tilde{x}}_i-\mathbf{\tilde{x}}_j|^2)^{-1}}{\sum_{k\neq j}(1 + |\mathbf{\tilde{x}}_k-\mathbf{\tilde{x}}_j|^2)^{-1}}.\end{align*}

These quantities measure the pairwise similarity of the points \mathbf{\tilde{x}}_1, \ldots, \mathbf{\tilde{x}}_n, analogously to P_{i,j}. Notice, that rather than using a Gaussian function to measure neighborliness, we're using the Cauchy density \frac{1}{1+x^2}. (As an aside, the Cauchy distribution is also known a the t distribution with one degree of freedom, and that's where we get the t in the name t-SNE).

We measure how well the similarities P_{i,j} match the similarities Q_{i,j} using the cost function

\begin{align*} C(\mathbf{\tilde{x}}_1, \ldots, \mathbf{\tilde{x}}_n) = \sum_{(i,j) : i \neq j} P_{i,j} \log_2 \frac{P_{i,j}}{Q_{i,j}}.\end{align*}

Finally, we use gradient descent to find values for the image points \mathbf{\tilde{x}}_1, \ldots, \mathbf{\tilde{x}}_n which minimize this cost function. Note that, since our dimension reduction map is only defined on the training points, we can think of the image coordinates as the parameters of the map and perform the gradient descent steps on the image coordinates. You can think of this visually as moving the locations of the points about freely so as to try to get the value of the cost function to go down. The following animation shows this process in action:

This animation comes from a blog post by Chris Olah, which contains many other animations and discussion of PCA, t-SNE, and related models.

Exercise
Why might it valuable to use the heavier tailed Cauchy function to compute the Q's?

Solution. If two points which are supposed to be close are very far apart, the gradient of the Gaussian measuring their neighborliness will be tiny (since the Gaussian has derivative extremely close to zero outside of a small zone around the origin). This effect is much less pronounced with the Cauchy function, so the gradient signal in the optimization process is stronger.

Lastly, we note that the version of t-SNE proposed in the original t-SNE paper actually uses symmetrized versions of the P's and Q's that they call p_{i,j} and q_{i,j}. (Symmetric means that p_{i,j} = p_{j,i} and q_{i,j} = q_{j,i}.) This version has some technical advantages and gives similar results to the asymmetric version discussed above.

Example
Use the Julia package TSne to plot a two-dimensional t-SNE embedding of the first 2000 images in the MNIST data set.

Solution. We call the tsne function on the first k = 2000 rows of the MNIST matrix A defined above.

using TSne, Random
Random.seed!(123)
n = 2000
Y = tsne(A[1:n,:])
scatter(Y[:,1],Y[:,2],
        group=labels[1:n],
        ms=2,msw=0)

Congraulations! You've finished the Data Gymnasia Machine Learning course.

Bruno
Bruno Bruno