Let $[0,1]^{d}$ be the unit hypercube in $d$ dimensions, with a sphere of radius $\frac{1}{2}$ inscribed at its center. The distance between the center and the corners of the hypercube is $$\begin{aligned} \frac{\sqrt{d}}{2}, \end{aligned}$$ which grows without bound as $d$ increases, while the radius of the inscribed sphere stays fixed.
Why does this matter? As the dimension increases, the probability of a point lying far from the center, i.e., of an apparent outlier appearing in the data, increases considerably; this is reflected in the vanishing fraction of the hypercube's volume that falls inside the inscribed sphere as $d$ approaches infinity. It is thus important in cases of high dimensionality that features have a small standard deviation and are normally distributed around a central mean so as to limit outliers.
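As a quick numerical check of this intuition, the sketch below (not from the original notes; names are illustrative) uses the standard $d$-ball volume formula $V_{d}(r) = \pi^{d/2} r^{d} / \Gamma(d/2 + 1)$ to compare the inscribed sphere's share of the cube's volume with the center-to-corner distance $\sqrt{d}/2$:

```python
import math

def inscribed_sphere_fraction(d: int) -> float:
    """Fraction of the unit hypercube's volume inside its inscribed sphere of radius 1/2."""
    r = 0.5
    sphere_volume = math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)
    return sphere_volume  # the cube's volume is 1, so the fraction equals the sphere volume

for d in (2, 3, 10, 100):
    print(f"d={d:>3}: corner distance = {math.sqrt(d) / 2:6.3f}, "
          f"sphere fraction = {inscribed_sphere_fraction(d):.2e}")
```

Even at $d = 10$ the inscribed sphere holds well under 1% of the cube's volume, so nearly all of the space sits in the corners, far from the center.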
- Nearly all of high-dimensional space in a hypercube is distant from the center and close to the border.
- High-dimensional datasets are at risk of being sparse (a Monte Carlo check of the figures below follows this list). The average distance between two random points:
  - in a unit square is roughly 0.52.
  - in a unit 3-d cube is roughly 0.66.
  - in a unit 1,000,000-d hypercube is roughly 408.25.
- Distances from a random point to its nearest and farthest neighbor are similar.
- Distance-based classification generalizes poorly unless the number of samples grows exponentially with $d$.
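The averages above can be checked with a minimal Monte Carlo sketch (sample counts and names here are my own; for the million-dimensional case the distance concentrates around $\sqrt{d/6}$, so that value is printed directly):

```python
import numpy as np

def mean_pairwise_distance(d: int, n_pairs: int = 100_000, seed: int = 0) -> float:
    """Estimate the average Euclidean distance between two uniform random points in [0, 1]^d."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, d))
    b = rng.random((n_pairs, d))
    return float(np.linalg.norm(a - b, axis=1).mean())

print(f"unit square      : ~{mean_pairwise_distance(2):.2f}")   # ~0.52
print(f"unit 3-d cube    : ~{mean_pairwise_distance(3):.2f}")   # ~0.66
print(f"1,000,000-d cube : ~{np.sqrt(1_000_000 / 6):.2f}")      # ~408.25
```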
Biological networks provide a powerful means of representing relationships between biological entities and/or their functional components. A common example of this approach is the protein-protein interaction network, in which proteins are linked together and each edge represents some biological relation between them. The properties that make biological networks so powerful are provided below.
- Highly interconnected with modular structure.
- Weakly to strongly scale-free (the fraction of nodes with degree $k$ follows a power law $k^{-\alpha}$); a small sketch of such a degree distribution follows this list.
- Subsets of genes, proteins, or regulatory elements tend to form highly correlated modules.
- Functional genomics datasets tend to (not always!) occupy a low-dimensional subspace of the feature space (e.g., genes, proteins, regulatory elements).
- Ideal for dimensionality reduction approaches to both visualize and analyze functional genomics data.
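To make the scale-free property concrete, the following sketch (illustrative, not from the original notes) builds a preferential-attachment graph with networkx, a standard model of scale-free networks, and tabulates the fraction of nodes at each degree; plotted on a log-log scale this fraction decays roughly like $k^{-\alpha}$:

```python
from collections import Counter

import networkx as nx

# Barabási-Albert (preferential attachment) graph: a standard model of a scale-free network.
G = nx.barabasi_albert_graph(n=5000, m=3, seed=0)

degree_counts = Counter(dict(G.degree()).values())

# Fraction of nodes with each of the ten smallest degrees.
for k in sorted(degree_counts)[:10]:
    fraction = degree_counts[k] / G.number_of_nodes()
    print(f"degree {k:>3}: fraction of nodes = {fraction:.4f}")
```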
Given a number of samples, we want to reduce the dimensionality of the feature set. We find directions that constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
Assume we have $n$ samples, each described by $d$ features, arranged as the rows of a mean-centered data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$; the principal components are then the orthonormal directions along which the projected data have maximal variance.
We can also consider PCA in the context of Singular Value Decomposition, as the principal components are simply the eigenvectors of the covariance matrix of the sample data. Moreover, the covariance matrix of $\mathbf{X}$ can be written as $$\begin{aligned} \mathbf{C} = \frac{1}{n-1}\mathbf{X}^{T}\mathbf{X} = \frac{1}{n-1}\mathbf{V}\mathbf{\Sigma}^{2}\mathbf{V}^{T} \end{aligned}$$ where $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$ is the SVD of $\mathbf{X}$, so the columns of $\mathbf{V}$ (the right singular vectors) are the principal components and the variances they explain are $\sigma_{i}^{2}/(n-1)$.
PCA is a widely applied dimensionality reduction algorithm and is extremely valuable for reducing the complexity of datasets, allowing data that could not previously be plotted to be visualized in two or three dimensions. However, reducing the dimensionality of the data comes with the obvious tradeoff of losing information that was captured when all features were included.
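A minimal numpy sketch of the SVD view of PCA described above (the data here are randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 samples, 10 features (illustrative data)
Xc = X - X.mean(axis=0)                  # mean-center each feature

# SVD of the centered data: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                          # rows are the principal components (eigenvectors of the covariance)
explained_variance = S**2 / (Xc.shape[0] - 1)

# Change of basis onto the first two principal components
X_reduced = Xc @ components[:2].T
print(X_reduced.shape)                   # (100, 2)
```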
The goal of Non-Negative Matrix Factorization (NMF) is to factorize a given matrix $\mathbf{V}$ into a product of two matrices, $\mathbf{W}$ (representing features) and $\mathbf{H}$ (storing the weights/coefficients for each feature). It is important to note that the given matrix $\mathbf{V}$ must contain only non-negative elements. We can find a $\mathbf{W}$ and $\mathbf{H}$ whose product approximates our initial data matrix $\mathbf{V}$ using the following steps, sketched in code after the list:
- Initialize $\mathbf{H}$ and $\mathbf{W}$ with non-negative values.
- Update $H_{ij}$ and $W_{ij}$:
  - $H_{ij} \leftarrow H_{ij} \frac{(\mathbf{W}^{T} \mathbf{V})_{ij}}{(\mathbf{W}^{T} \mathbf{W}\mathbf{H})_{ij}}$
  - $W_{ij} \leftarrow W_{ij} \frac{(\mathbf{V} \mathbf{H}^{T})_{ij}}{(\mathbf{W} \mathbf{H}\mathbf{H}^{T})_{ij}}$
- Stop when $H_{ij}$ and $W_{ij}$ don't change within a specified tolerance.
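A minimal numpy sketch of these multiplicative updates (a small epsilon guards the denominators against division by zero; the matrix sizes and names are illustrative):

```python
import numpy as np

def nmf(V: np.ndarray, k: int, n_iter: int = 500, tol: float = 1e-4, seed: int = 0):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m) via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W, H = rng.random((n, k)), rng.random((k, m))
    eps = 1e-10
    for _ in range(n_iter):
        H_new = H * (W.T @ V) / (W.T @ W @ H + eps)
        W_new = W * (V @ H_new.T) / (W @ H_new @ H_new.T + eps)
        converged = np.max(np.abs(H_new - H)) < tol and np.max(np.abs(W_new - W)) < tol
        W, H = W_new, H_new
        if converged:
            break
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 10)))  # non-negative example data
W, H = nmf(V, k=3)
print(np.linalg.norm(V - W @ H))  # reconstruction error
```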
By computing a feature matrix and the corresponding coefficient matrix for a given data matrix, Non-Negative Matrix Factorization allows one to easily explore the features most relevant to the dataset. This is most valuable when the dimensionality of the dataset is large and extracting the most relevant features becomes difficult.
t-SNE is a non-linear dimensionality reduction approach that attempts to map a distribution of pairwise distances among $n$ high-dimensional samples to a distribution of pairwise distances of the same $n$ samples in a low dimension.
While the name may not sound as such, t-SNE is quite an intuitive approach. The algorithm works by first determining the pairwise distances amongst samples in the original, high-dimensional vector space. Once computed, these pairwise distances are used to map the same samples to a lower-dimensional vector space in which the distances between corresponding samples are, as far as possible, maintained.
In the high-dimensional space, the similarity of sample $\mathbf{x}_{j}$ to sample $\mathbf{x}_{i}$ is expressed as a conditional probability $$\begin{aligned} p(j|i) = \frac{e^{-\Vert \mathbf{x}_{i} - \mathbf{x}_{j} \Vert^2 / 2\sigma_{i}^{2}}}{\sum_{k \ne i} e^{-\Vert \mathbf{x}_{i} - \mathbf{x}_{k} \Vert^2 / 2\sigma_{i}^{2}}} \end{aligned}$$ where $\sigma_{i}$ is a per-sample bandwidth chosen to match a user-specified perplexity. Next, define the symmetrized joint probability $$\begin{aligned} p_{ij} = \frac{p(j|i) + p(i|j)}{2n} \end{aligned}$$
In the lower-dimensional space, the similarity between the corresponding mapped points $\mathbf{y}_{i}$ and $\mathbf{y}_{j}$ is given by a Student-t kernel with one degree of freedom:
$$\begin{aligned}
q_{ij} = \frac{(1 + \Vert \mathbf{y}_{i} - \mathbf{y}_{j} \Vert^2)^{-1}}{\sum_{k,l;k \ne l} (1 + \Vert \mathbf{y}_{k} - \mathbf{y}_{l} \Vert^2)^{-1}}
\end{aligned}$$
With both distributions defined, the low-dimensional points $\mathbf{y}_{i}$ are found by minimizing the Kullback-Leibler divergence $$\begin{aligned} KL(P \Vert Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \end{aligned}$$ via gradient descent. In doing so, we are minimizing the difference between the distributions of pairwise similarities in the high- and low-dimensional spaces.
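A minimal numpy sketch of this objective, computing the $q_{ij}$ matrix from a candidate embedding and evaluating the KL divergence against a given $P$ (both inputs are placeholders for illustration):

```python
import numpy as np

def q_matrix(Y: np.ndarray) -> np.ndarray:
    """Student-t pairwise similarities q_ij for a low-dimensional embedding Y (n x 2)."""
    sq_dists = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)           # the i = j terms are excluded from the sum
    return inv / inv.sum()

def kl_divergence(P: np.ndarray, Q: np.ndarray) -> float:
    """KL(P || Q) over all i != j -- the quantity t-SNE minimizes by gradient descent."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

In practice one would rely on an off-the-shelf implementation such as sklearn.manifold.TSNE rather than optimizing this by hand.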
Unlike the previously mentioned dimensionality reduction algorithms, t-SNE is able to escape overcrowded clusters: the heavy-tailed Student-t kernel used in the lower-dimensional space keeps moderately dissimilar samples well separated while the overall distribution of pairwise similarities amongst samples is preserved.
Unlike many of the previously mentioned algorithms, UMAP is based on a newly emerging approach to dimensionality reduction known as "manifold learning." The general idea of manifold learning is that the data in a given matrix actually lie on a lower-dimensional manifold that has been projected into a higher-dimensional space. UMAP is one such example of this approach: by leveraging what are known as "simplicial complexes," the topological structure of the input matrix can be constructed. We can analogize this approach using graph structures; the simplicial complexes, illustrated in the figure below, are used to first derive a higher-dimensional graphical representation of the input matrix.
Once the higher-dimensional simplicial complex is formed and accurately spans the input matrix, the complex is slowly optimized to a lower dimension, ensuring that the "connectedness" of the input points is maintained. More on this graphical basis for UMAP is discussed in the following section.
Let $\mathbf{x}_{i_1},\ldots,\mathbf{x}_{i_k}$ be the $k$ nearest neighbors of each sample $\mathbf{x}_{i}$ under a metric $d$, and let $\rho_{i} = \min_{1 \le j \le k} d(\mathbf{x}_{i},\mathbf{x}_{i_j})$ be the distance from $\mathbf{x}_{i}$ to its nearest neighbor. For each $i$, choose the bandwidth $\sigma_{i}$ such that
$$\begin{aligned} \sum_{j=1}^{k} e^{-\max(0,\, d(\mathbf{x}_{i},\mathbf{x}_{i_j}) - \rho_{i})/\sigma_{i}} = \log_{2}(k) \end{aligned}$$
With $\rho_{i}$ and $\sigma_{i}$ determined, define a weighted directed graph whose edges and weights are
$$\begin{aligned} E &= \{(\mathbf{x}_{i},\mathbf{x}_{i_j}) \mid 1 \le j \le k,\ 1 \le i \le n\} \\ w_{h}(\mathbf{x}_{i},\mathbf{x}_{i_j}) &= e^{-\max(0,\, d(\mathbf{x}_{i},\mathbf{x}_{i_j}) - \rho_{i})/\sigma_{i}} \end{aligned}$$
Combine the edges of this directed graph into a single undirected weighted graph $G$ with edge weights
$$\begin{aligned} w(\mathbf{x}_{i},\mathbf{x}_{j}) = w_{h}(\mathbf{x}_{i},\mathbf{x}_{j}) + w_{h}(\mathbf{x}_{j},\mathbf{x}_{i}) - w_{h}(\mathbf{x}_{i},\mathbf{x}_{j})\, w_{h}(\mathbf{x}_{j},\mathbf{x}_{i}), \end{aligned}$$
where $w_{h}(\mathbf{x}_{i},\mathbf{x}_{j})$ is taken to be $0$ whenever $(\mathbf{x}_{i},\mathbf{x}_{j}) \notin E$. UMAP then optimizes a low-dimensional layout of the points so that these combined edge weights are preserved as closely as possible.
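A minimal numpy/scikit-learn sketch of this graph construction (the binary search for $\sigma_{i}$, the dense weight matrix, and all names are my own simplifications, not UMAP's actual implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def umap_graph(X: np.ndarray, k: int = 15) -> np.ndarray:
    """Build the symmetrized weighted k-NN graph described above as a dense n x n matrix."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, idx = nn.kneighbors(X)
    dists, idx = dists[:, 1:], idx[:, 1:]        # drop each point's zero-distance self-neighbor

    W = np.zeros((n, n))
    for i in range(n):
        rho = dists[i, 0]                        # distance to the nearest neighbor
        lo, hi, target = 1e-10, 1e3, np.log2(k)  # binary search for sigma_i
        for _ in range(64):
            sigma = (lo + hi) / 2
            total = np.exp(-np.maximum(0.0, dists[i] - rho) / sigma).sum()
            lo, hi = (sigma, hi) if total < target else (lo, sigma)
        W[i, idx[i]] = np.exp(-np.maximum(0.0, dists[i] - rho) / sigma)

    # Combine directed weights into an undirected graph: w = a + b - a * b
    return W + W.T - W * W.T
```

In practice one would use the umap-learn package's umap.UMAP class, which also performs the subsequent low-dimensional layout optimization.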
UMAP has been praised for its ability to maintain the global structure of the input. Additionally, its speed has been noted as an advantage, especially compared to t-SNE, where pairwise distances amongst all samples must be computed. Altogether, UMAP shows great prowess in employing the principles of manifold learning for dimensionality reduction and general topological data analysis.