Nearer neighbors

People are really into visualizing lots of data. No big surprise there. One algorithm that pops up a lot no matter what field you’re in is t-SNE (t-Distributed Stochastic Neighbor Embedding). For mapping to 2 and 3 dimensions (i.e. those you can visualize), t-SNE handles grouping better than just about any other algorithm I’ve seen. I’ve seen a lot of demonstrations of visualizations made by t-SNE, but I wanted a better idea of what was going on under the hood. I’ve heard a lot of people describe it as trying to preserve the pairwise distances between the original data and embedded data. But just glancing at the algorithm suggests that’s not really what’s happening at all.

So, just for myself, I made this animation comparing t-SNE to the random projection method. The code is here (you need to download the Python t-SNE implementation and the MNIST data), and I put descriptions of the algorithms further down. Random projections and t-SNE are about as different as you can get (among dimensionality reduction techniques, anyway). Random projections is linear, t-SNE is non-linear. Random projections sits back and lets the Johnson-Lindenstrauss Lemma do its thing, while t-SNE aggressively optimizes the KL divergence cost function. But I didn’t realize they’d be this different.

t-SNE vs. random projections: heatmaps showing the pairwise distances for the embeddings of the MNIST 2500 data using t-SNE and random projections.

So what’s happening here? The data is the MNIST 2500 data: 2500 normalized 16 pixel by 16 pixel images of handwritten digits 0 through 9. We’re looking at heatmaps of their pairwise distances (blue means very similar, red means very different). The images are sorted along the axes of the heatmaps, so the 0 images occupy the first rows/columns, the 1 images come next, and so on.
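For reference, here’s roughly how one of these heatmaps can be produced. This is a minimal sketch assuming numpy/matplotlib, and assuming the MNIST subset is stored as plain-text arrays named mnist2500_X.txt and mnist2500_labels.txt (rename to match whatever your download uses); the colormap is just a guess at the blue-to-red scale in the figures.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed file names for the 2500-image MNIST subset; adjust the paths
# to match your download.
X = np.loadtxt("mnist2500_X.txt")           # one flattened image per row
labels = np.loadtxt("mnist2500_labels.txt")

# Sort the images by digit so each digit occupies a contiguous block of
# rows/columns in the heatmap.
order = np.argsort(labels)
X = X[order]

# Pairwise Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq = np.sum(X ** 2, axis=1)
D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

plt.imshow(D, cmap="jet")                   # blue = similar, red = different
plt.colorbar(label="pairwise distance")
plt.title("Pairwise distances, MNIST 2500 (sorted by digit)")
plt.show()
```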

The patchwork pattern in the t-SNE figure reflects the fact that the embedded representations of the images of any given digit are generally close to each other. We also see some not-so-surprising relationships; for instance, the 9s are all close to each other (bottom right patch), but they’re also close to the 4s and 7s. The random projection embeddings and the original data show a similarly strong patch for the 1 images, but nothing close to the structure the t-SNE embeddings are showing. It makes sense that if any group of images would appear close to each other, it would be the 1s.

As you think more about it, it starts to make a lot more sense. Random projections is all about preserving the pairwise distances between the original data points, and it seems to be doing that pretty well. Once we get down to very low dimensions, the algorithm is too simple to do anything of use, and we get embeddings that are, well, random.

But t-SNE is doing something totally different. The algorithm is trying to model neighborhoods, not preserve global relationships. The thing we’re trying to preserve is the neighbor likelihood distribution, which pretty explicitly cuts out relationships to faraway points. t-SNE is geared towards visualization, and in particular, visualization of data that lives on a few different high-dimensional manifolds. If it really were all about preserving distances, then it should certainly be able to do that when embedding into relatively high dimensions, and the first few frames of the t-SNE animation would show pairwise distances similar to those of the original data, but they don’t.

Moral of the story: know what your dimensionality reduction algorithm is doing before using it or telling someone else about it.

t-SNE in brief

This is supposed to be a short post, so I’ll only give the high level. Long story short: this is a fairly complex iterative algorithm that computes a non-linear dimensionality reduction. Let our initial data be \(x_1, x_2, \ldots, x_n \in \mathbb{R}^d\), and let our target dimension be \(p\) (usually 2 or 3).

  1. Compute an initial embedding \(f:\mathbb{R}^d \to \mathbb{R}^{d'}\), \(d'<d\), to a somewhat-lower-dimensional space, usually just using PCA on the \(x_i\). Let \(f(x_i) = x'_i\)
  2. Define \(p_{ij}\), the probability that \(x'_i\) and \(x'_j\) are in the same neighborhood, based on pairwise distance distributions for every pair of points
  3. Randomly set an initial embedding \(y_i\in\mathbb{R}^p\) for every \(x'_i\)
  4. Use a heavy-tailed t-distribution to set \(q_{ij}\), the probability that \(y_i\) and \(y_j\) are neighbors in the \(p\)-dimensional space, in a similar manner to the \(p_{ij}\)
  5. Use gradient descent to update the \(y_{i}\) (and thus the \(q_{ij}\)) such that the Kullback-Leibler divergence between the two distributions is minimized. The divergence is $$KL(P\,\|\,Q) = \sum_{i\neq j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right)$$

The algorithm is typically used for embedding into 2- and 3-dimensional space, since recomputing the \(q_{ij}\) requires recomputing the pairwise distances, which is costly in high dimensions: if your \(n\) data points are in \(\mathbb{R}^k\), it takes \(O(kn^2)\) time to compute them. For more details, see Laurens van der Maaten and Geoffrey Hinton’s original paper (or the follow-ups). They give good details and examples.
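To make the steps above concrete, here is a heavily simplified numpy sketch. It skips the PCA preprocessing (step 1) and the per-point perplexity calibration, using one fixed Gaussian bandwidth instead, and it runs plain gradient descent without the momentum and early-exaggeration tricks in the real implementation, so treat it as an illustration of the structure of steps 2 through 5 rather than a substitute for the actual code.

```python
import numpy as np

def squared_dists(Z):
    """All pairwise squared Euclidean distances, via the dot-product trick."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def tsne_sketch(X, p=2, n_iter=500, lr=100.0, sigma=1.0, seed=0):
    """Simplified t-SNE: one fixed Gaussian bandwidth instead of the
    per-point perplexity search, plain gradient descent, no momentum or
    early exaggeration. For illustration of steps 2-5 only."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Step 2: high-dimensional neighbor probabilities p_ij, symmetrized
    # and normalized to sum to 1 over all pairs.
    P = np.exp(-squared_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = (P + P.T) / 2.0
    P = np.maximum(P / P.sum(), 1e-12)

    # Step 3: small random initial embedding in R^p.
    Y = 1e-4 * rng.standard_normal((n, p))

    for _ in range(n_iter):
        # Step 4: low-dimensional probabilities q_ij from the heavy-tailed
        # Student-t kernel w_ij = 1 / (1 + ||y_i - y_j||^2).
        W = 1.0 / (1.0 + squared_dists(Y))
        np.fill_diagonal(W, 0.0)
        Q = np.maximum(W / W.sum(), 1e-12)

        # Step 5: gradient of KL(P || Q) with respect to the y_i:
        # dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        grad = 4.0 * np.sum(((P - Q) * W)[:, :, None]
                            * (Y[:, None, :] - Y[None, :, :]), axis=1)
        Y -= lr * grad

    return Y
```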

Random projections in brief

Random projections is confusingly simple. If you have \(n\) data points in \(\mathbb{R}^d\) and you want to embed them in \(\mathbb{R}^p\) (\(p < d\)), then generate a bunch of random orthogonal matrices \(Q\in\mathbb{R}^{p\times d}\) until, for one of them, the pairwise distances of the embedded points \(Y_{n\times p} = X_{n\times d}Q^T_{d\times p}\) aren’t distorted too far from those of the original points \(X_{n\times d}\). This seems like a crazily naive way to go about dimensionality reduction, until you see the even crazier Johnson-Lindenstrauss Lemma.

The J-L Lemma states that for any set of \(n>4\) data points and any acceptable pairwise distance distortion \(\epsilon\in (0,1/2]\), there exists a linear map \(f\) into \(\mathbb{R}^{20\log n/\epsilon^2}\) such that for any pair of data points \(x_i\) and \(x_j\), $$(1-\epsilon)\|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1+\epsilon)\|x_i - x_j\|^2.$$ That is, the pairwise distances won’t be distorted too much. Not only that, but the projection can be found in randomized polynomial time. The craziest part is that this doesn’t depend on \(d\), the original dimension of your data, at all. So if I was willing to accept a distortion of \(\epsilon = 0.5\), then there’s an acceptable projection of my \(n\) points into \(k=80\log n\) dimensional space, whether my data was living in \(\mathbb{R}^{200}\) or \(\mathbb{R}^{200 \text{ billion}}\). Note that people often use the random projection method to embed points in lower dimensions than the lemma guarantees will work.
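And here is a minimal sketch of the generate-and-check procedure described above, assuming numpy. The distortion is measured on pairwise Euclidean distances, and the \(\sqrt{d/p}\) rescaling and retry cap are my own choices for the sketch, not part of any standard recipe.

```python
import numpy as np

def pairwise_dists(Z):
    """Upper-triangle pairwise Euclidean distances of the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))
    return D[np.triu_indices(len(Z), k=1)]

def random_projection(X, p, eps=0.5, max_tries=100, seed=0):
    """Draw random orthogonal maps R^d -> R^p until one keeps the relative
    pairwise-distance distortion within eps. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    original = pairwise_dists(X)

    for _ in range(max_tries):
        # Orthonormal basis of a random p-dimensional subspace of R^d
        # (Q has shape (d, p) with orthonormal columns).
        Q, _ = np.linalg.qr(rng.standard_normal((d, p)))
        Y = np.sqrt(d / p) * (X @ Q)   # rescale so lengths stay comparable
        distortion = np.max(np.abs(pairwise_dists(Y) - original) / original)
        if distortion <= eps:
            return Y
    raise RuntimeError("no acceptable projection found in max_tries draws")
```

For the MNIST 2500 data, the lemma’s target dimension at \(\epsilon = 0.5\) is \(20\log 2500/0.25 \approx 626\) (taking natural logs), far above the 2 or 3 dimensions we actually want to visualize, which is exactly the regime where the guarantee no longer applies.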