t-distributed Stochastic Neighbor Embedding

Main Concept

t-SNE visualizes high-dimensional data by embedding it into a low-dimensional space (typically 2D or 3D). The core idea is to convert the Euclidean distances between data points in the high-dimensional space into conditional probabilities that represent similarities. It then defines an analogous probability distribution over the points in the low-dimensional map and minimizes the mismatch between the two distributions. Because the objective emphasizes local neighborhood structure, points that are close together in the high-dimensional space tend to stay close together in the embedding. The output is a new set of coordinates for each data point, positioned so that the low-dimensional similarities reflect the high-dimensional ones.
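As a quick illustration of the input and output shapes, the snippet below embeds synthetic data with scikit-learn's `TSNE` class. The text above does not name a library, so the use of scikit-learn, the synthetic blobs, and the parameter values (`perplexity=15`, `init="pca"`) are assumptions made for this sketch only.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): three Gaussian blobs in 20 dimensions.
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(30, 20)) for c in centers])

# Each 20-dimensional row of X receives a new 2-dimensional coordinate.
embedding = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(X)

print(X.shape, "->", embedding.shape)  # (90, 20) -> (90, 2)
```

Note that scikit-learn requires the perplexity to be smaller than the number of samples.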

Theoretical Aspect

t-SNE defines two probability distributions: one in the high-dimensional space and one in the low-dimensional space.

High-dimensional probabilities: The conditional probability $p_{j|i}$ that point $\mathbf{x}_j$ is a neighbor of point $\mathbf{x}_i$ is defined as:
$p_{j|i} = \frac{\exp(-||\mathbf{x}_i - \mathbf{x}_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||\mathbf{x}_i - \mathbf{x}_k||^2 / 2\sigma_i^2)}$
(1)
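where $\sigma_i$ is the bandwidth of the Gaussian kernel centered on $\mathbf{x}_i$, set separately for each point so that the conditional distribution matches a user-chosen perplexity. The joint probability $p_{ij}$ is then defined as: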
$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$
(2)
where $n$ is the number of data points.
Low-dimensional probabilities: The probability $q_{ij}$ that point $\mathbf{y}_j$ is a neighbor of point $\mathbf{y}_i$ in the low-dimensional map is defined using a Student t-distribution with one degree of freedom (which has heavier tails than a Gaussian):
$q_{ij} = \frac{(1 + ||\mathbf{y}_i - \mathbf{y}_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||\mathbf{y}_k - \mathbf{y}_l||^2)^{-1}}$
(3)
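Equations (1) to (3) can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the bandwidths `sigma` are passed in directly instead of being derived from a perplexity target.

```python
import numpy as np

def squared_distances(Z):
    # Pairwise squared Euclidean distances, shape (n, n).
    sq_norms = np.sum(Z**2, axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def joint_p(X, sigma):
    # Eq. (1): row-wise Gaussian conditionals p_{j|i}; sigma has shape (n,).
    n = X.shape[0]
    logits = -squared_distances(X) / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # exclude k = i from the sum
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)
    # Eq. (2): symmetrise and divide by 2n so that all p_ij sum to 1.
    return (cond + cond.T) / (2.0 * n)

def joint_q(Y):
    # Eq. (3): Student-t kernel, normalised over all pairs k != l.
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()
```

Both functions return symmetric matrices with a zero diagonal whose entries sum to 1, which is what the KL objective expects.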

The objective is to minimize the Kullback-Leibler (KL) divergence between these two distributions:

$KL(P||Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$

(4)
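A small sketch of equation (4), assuming `P` and `Q` are given as NumPy arrays of joint probabilities. The function name and the `eps` guard against division by zero are choices made for this example. Terms with $p_{ij} = 0$ are skipped because they contribute nothing to the sum.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    # Sum p_ij * log(p_ij / q_ij) over entries with p_ij > 0.
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Toy joint distributions over 3 points (zero diagonal, entries sum to 1).
P = np.array([[0.00, 0.30, 0.05],
              [0.30, 0.00, 0.15],
              [0.05, 0.15, 0.00]])
Q = np.full((3, 3), 1.0 / 6.0)
np.fill_diagonal(Q, 0.0)

kl_same = kl_divergence(P, P)  # identical distributions give zero
kl_diff = kl_divergence(P, Q)  # any mismatch gives a positive value
```

Because each term is weighted by $p_{ij}$, the cost is dominated by pairs that are similar in the high-dimensional space, which is why the objective favors local structure.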

The variable being optimized is $\mathbf{Y}$, the matrix of low-dimensional embeddings. The high-dimensional probabilities $p_{ij}$ are computed once from the data and stay fixed.

Solution Methodology

The optimization is typically performed using gradient descent. The gradient of the KL divergence with respect to the low-dimensional embeddings $\mathbf{y}_i$ is:

$\frac{\partial KL}{\partial \mathbf{y}_i} = 4 \sum_j (p_{ij} - q_{ij})( \mathbf{y}_i - \mathbf{y}_j)(1 + ||\mathbf{y}_i - \mathbf{y}_j||^2)^{-1}$

(5)
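The gradient in equation (5) can be checked numerically. The sketch below implements the closed form and compares it with central finite differences of the KL divergence. All function names are hypothetical, and `P` is assumed to be symmetric with a zero diagonal and entries summing to 1.

```python
import numpy as np

def q_and_weights(Y):
    diff = Y[:, None, :] - Y[None, :, :]       # (n, n, d): y_i - y_j
    W = 1.0 / (1.0 + np.sum(diff**2, axis=2))  # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W, diff

def kl(P, Y):
    Q, _, _ = q_and_weights(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def kl_gradient(P, Y):
    # Eq. (5): 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^{-1}
    Q, W, diff = q_and_weights(Y)
    coeff = (P - Q) * W
    return 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)

def numerical_gradient(P, Y, h=1e-5):
    # Central finite differences, one coordinate at a time.
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Y_plus, Y_minus = Y.copy(), Y.copy()
        Y_plus[idx] += h
        Y_minus[idx] -= h
        G[idx] = (kl(P, Y_plus) - kl(P, Y_minus)) / (2.0 * h)
    return G
```

Each term in the sum acts like a spring between $\mathbf{y}_i$ and $\mathbf{y}_j$: it attracts when $p_{ij} > q_{ij}$ and repels when $p_{ij} < q_{ij}$.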

The algorithm proceeds as follows:

1. Pairwise Affinities: Compute the pairwise affinities $p_{ij}$ in the high-dimensional space.
2. Initialization: Initialize the low-dimensional embeddings $\mathbf{Y}$ (e.g., randomly or using PCA).
3. Optimization: Minimize the KL divergence using gradient descent with momentum. This involves iteratively updating each embedding $\mathbf{y}_i$ by stepping against the gradient:
$\mathbf{y}_i^{(t+1)} = \mathbf{y}_i^{(t)} - \eta \frac{\partial KL}{\partial \mathbf{y}_i} + \alpha (\mathbf{y}_i^{(t)} - \mathbf{y}_i^{(t-1)})$
(6)
where $\eta$ is the learning rate and $\alpha$ is the momentum coefficient.
4. Iteration: Repeat step 3 until convergence.
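The steps above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not a production implementation: a single shared bandwidth `sigma` replaces the per-point $\sigma_i$, a fixed iteration count stands in for a convergence test, and the default values of `eta`, `alpha`, and `n_iter` are untuned choices for small toy inputs. All function names are hypothetical.

```python
import numpy as np

def joint_p(X, sigma=1.0):
    # Step 1: Gaussian affinities with one shared bandwidth (a simplification).
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    logits = -np.sum(diff**2, axis=2) / (2.0 * sigma**2)
    np.fill_diagonal(logits, -np.inf)            # exclude self-pairs
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)
    return (cond + cond.T) / (2.0 * n)

def kl_and_grad(P, Y):
    # KL divergence and its gradient for the current embedding Y.
    diff = Y[:, None, :] - Y[None, :, :]
    W = 1.0 / (1.0 + np.sum(diff**2, axis=2))
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    grad = 4.0 * np.sum(((P - Q) * W)[:, :, None] * diff, axis=1)
    return kl, grad

def tsne_sketch(X, dim=2, sigma=1.0, eta=1.0, alpha=0.5, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    P = joint_p(X, sigma)                          # step 1
    Y = 1e-2 * rng.normal(size=(X.shape[0], dim))  # step 2: small random init
    Y_prev = Y.copy()
    history = []
    for _ in range(n_iter):                        # steps 3 and 4
        kl, grad = kl_and_grad(P, Y)
        history.append(kl)
        # Step against the gradient, plus a momentum term.
        Y_next = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_next
    return Y, history
```

`history` records the KL divergence at each iteration, so plotting it is a simple way to see whether the chosen learning rate and momentum are making progress.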

The method thus combines Gaussian kernels for the high-dimensional similarities, a Student t-distribution for the low-dimensional similarities, and gradient descent with momentum for the optimization.

Global Optimality

The optimization problem in t-SNE is non-convex, meaning that gradient descent is not guaranteed to find a global minimum. The resulting embedding can be sensitive to the initialization and the choice of hyperparameters (e.g., perplexity, learning rate, momentum). Different initializations can lead to different local minima. The perplexity parameter, which can be read as a smooth measure of the effective number of neighbors considered for each point, significantly influences the results.
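To make the role of perplexity concrete, the sketch below uses the standard definition $\mathrm{Perp}(P_i) = 2^{H(P_i)}$, where $H(P_i)$ is the Shannon entropy (in bits) of the conditional distribution $p_{j|i}$, and searches for the bandwidth $\sigma_i$ that reaches a target value. The function names, the search bounds, and the use of geometric bisection are assumptions of this example.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    # p_{j|i} for one point i; sq_dists_i holds the squared distances
    # from x_i to all OTHER points.
    logits = -sq_dists_i / (2.0 * sigma**2)
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    nz = p[p > 0]
    entropy_bits = -np.sum(nz * np.log2(nz))
    return 2.0 ** entropy_bits

def sigma_for_perplexity(sq_dists_i, target, lo=1e-4, hi=1e4, iters=60):
    # Perplexity grows with sigma, so bisection on sigma finds the target.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # bisect on a logarithmic scale
        if perplexity(conditional_probs(sq_dists_i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)
```

A uniform distribution over $k$ neighbors has perplexity exactly $k$, which is why the parameter is usually read as an effective neighbor count. The target must lie between 1 and the number of other points.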

Conclusion

t-SNE is a powerful technique for visualizing high-dimensional data, particularly for revealing cluster structures.

Figure: t-SNE applied to MNIST handwritten digits.

t-SNE focuses on preserving local neighborhoods, making it effective for exploring the local geometry of data. However, it is primarily a visualization technique rather than a general-purpose dimensionality reduction step for downstream tasks. It does not preserve global distances, so cluster sizes and the distances between clusters should be interpreted with caution. A meaningful use case is visualizing gene expression data, where t-SNE can reveal clusters of genes with similar expression patterns across different samples.