
An Overview on Matrix Factorization and Dimensionality Reduction


Matrix factorization is a cornerstone of linear algebra, with far-reaching applications in data analysis, machine learning, signal processing, and beyond. By decomposing a matrix into simpler, interpretable components, these techniques not only streamline numerical computations but also uncover intrinsic structures within high-dimensional data.

A key application of matrix factorization is dimensionality reduction, which simplifies complex datasets while preserving essential patterns, enabling efficient storage, faster computations, and improved interpretability.

From recommender systems (where techniques like Singular Value Decomposition power collaborative filtering) to image compression (where low-rank approximations reduce storage needs) and bioinformatics (where factorization reveals latent biological signals), matrix factorization plays a pivotal role.

At the heart of many of these methods lie eigenvalues and eigenvectors, which encode fundamental properties of linear transformations and enable spectral decompositions like Principal Component Analysis (PCA).

In this blog post, we explore the foundations of matrix factorization and try to shed some light on widely used techniques such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF) and more.

Introduction to Matrix Factorization

Matrix factorization refers to the decomposition of a matrix into a product of two or more matrices. Formally, given a matrix $\mathbf{A}$, we seek matrices $\mathbf{B}$ and $\mathbf{C}$, such that

$$\mathbf{A} = \mathbf{B}\mathbf{C}$$

This decomposition is not unique and depends on the properties of $\mathbf{A}$ and the constraints imposed on $\mathbf{B}$ and $\mathbf{C}$. The goal is to simplify complex matrix operations, reduce data dimensionality, and uncover latent patterns hidden in the (data) matrix.

Matrix factorization is fundamental in linear algebra, enabling efficient solutions to linear systems, matrix inversion, and determinant calculation. Beyond pure mathematics, it is a powerful tool in machine learning where large, sparse and often incomplete datasets are common.

Different factorization techniques serve different purposes: some preserve orthogonality, others enforce non-negativity constraints, and some focus on statistical independence or uncovering latent factors.

The choice of factorization method depends on the nature of the data and the desired properties of the decomposition.

Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors are intrinsic properties of square matrices that describe the directions along which a linear transformation acts by pure scaling. For a square matrix $\mathbf{A}$, an eigenvector $\mathbf{v}$ and its corresponding eigenvalue $\lambda$ satisfy

$$\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$$

This equation states that applying the linear transformation $\mathbf{A}$ to $\mathbf{v}$ scales $\mathbf{v}$ by $\lambda$ without changing its direction. Eigenvalues can be real or complex, and eigenvectors are always non-zero.

A key application of eigenvalues and eigenvectors is eigendecomposition, a matrix factorization technique that expresses $\mathbf{A}$ in terms of its eigenvectors and eigenvalues.

If $\mathbf{A}$ has $n$ linearly independent eigenvectors, it can be decomposed as

$$\mathbf{A} = \mathbf{P} \mathbf{\Lambda} \mathbf{P}^{-1}$$

where $\mathbf{P}$ is a matrix whose columns are the eigenvectors of $\mathbf{A}$ and $\mathbf{\Lambda}$ is a diagonal matrix containing the corresponding eigenvalues.

Eigenvalues and eigenvectors are central to data analysis techniques such as Principal Component Analysis (PCA). They provide insights into the behaviour of linear systems, the structure of matrices and the geometry of data distributions. In factorization techniques like Singular Value Decomposition (SVD), eigenvalues and eigenvectors are used to decompose matrices into orthogonal and diagonal matrices.

How to Compute Eigenvalues, Eigenvectors, and Eigendecomposition

For a given square matrix $\mathbf{A}$ of size $n \times n$, the first step is to find its eigenvalues. This involves solving the characteristic equation

$$\det(\mathbf{A} - \lambda \mathbf{I}) = 0$$

where $\lambda$ is an eigenvalue, $\mathbf{I}$ is the identity matrix of the same size as $\mathbf{A}$, and $\det$ denotes the determinant.

The characteristic equation $\det(\mathbf{A} - \lambda \mathbf{I}) = 0$ is a polynomial equation in $\lambda$ of degree $n$. Solving it yields the eigenvalues $\lambda_1$, $\lambda_2$, ..., $\lambda_n$.

For each eigenvalue $\lambda_i$, we then find the corresponding eigenvector(s) by solving the equation

$$(\mathbf{A} - \lambda_i \mathbf{I})\mathbf{v}_i = 0$$

which is a homogeneous system of linear equations. The solution space of this system gives us the eigenspace corresponding to $\lambda_i$.

The eigendecomposition of a matrix $\mathbf{A}$ can then be constructed via

$$\mathbf{A} = \mathbf{P} \mathbf{\Lambda} \mathbf{P}^{-1}$$

where $\mathbf{P}$ is formed by placing the eigenvectors as columns, and $\mathbf{\Lambda}$ is a diagonal matrix with eigenvalues on its diagonal.
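
As a minimal sketch of this procedure (assuming NumPy and an arbitrary symmetric example matrix), `np.linalg.eig` returns the eigenvalues and a matrix of eigenvectors, from which the eigendecomposition can be reassembled directly:

```python
import numpy as np

# Arbitrary symmetric example matrix (real eigenvalues, independent eigenvectors)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, P = np.linalg.eig(A)
Lambda = np.diag(eigenvalues)

# Reassemble A = P Λ P^{-1}
A_reconstructed = P @ Lambda @ np.linalg.inv(P)
print(np.allclose(A, A_reconstructed))  # True
```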

Singular Value Decomposition

Singular Value Decomposition (SVD) is one of the most widely used matrix factorization methods. It decomposes any $m \times n$ matrix $\mathbf{A}$ into three matrices such that

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

where $\mathbf{U}$ is an $m \times m$ orthogonal matrix (left singular vectors), $\mathbf{\Sigma}$ is an $m \times n$ diagonal matrix with non-negative singular values $\sigma_i$ on the diagonal, and $\mathbf{V}^T$ is the transpose of an $n \times n$ orthogonal matrix (right singular vectors).

Unlike eigendecomposition, which is limited to square matrices, SVD is a universal decomposition that applies to any $m \times n$ matrix, whether square or rectangular.

The singular values $\sigma_i$ are defined as the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$ (or $\mathbf{A}\mathbf{A}^T$). The columns of $\mathbf{V}$ are eigenvectors of $\mathbf{A}^T\mathbf{A}$, while the columns of $\mathbf{U}$ are eigenvectors of $\mathbf{A}\mathbf{A}^T$.

SVD can be visualized as a sequence of three transformations: a rotation or reflection $\mathbf{V}^T$, a scaling along coordinate axes by the singular values $\mathbf{\Sigma}$, and another rotation or reflection $\mathbf{U}$.

SVD Algorithm

First, we compute $\mathbf{A}^T \mathbf{A}$ and $\mathbf{A}\mathbf{A}^T$, which are both square and symmetric matrices, meaning they have real eigenvalues and orthogonal eigenvectors.

Second, we compute the eigenvectors of $\mathbf{A}^T \mathbf{A}$ to get $\mathbf{V}$ and those of $\mathbf{A} \mathbf{A}^T$ to get $\mathbf{U}$.

The singular values $\sigma_i$ are the square roots of the eigenvalues of either $\mathbf{A}^T \mathbf{A}$ or $\mathbf{A} \mathbf{A}^T$.

To form $\mathbf{\Sigma}$, we place the singular values in descending order on its diagonal.

The columns of $\mathbf{U}$ and $\mathbf{V}$ must be paired consistently: for each non-zero singular value, $\mathbf{u}_i = \frac{1}{\sigma_i}\mathbf{A}\mathbf{v}_i$, which fixes the relative signs of the left and right singular vectors.

Finally, we can form the SVD by combining $\mathbf{U}$, $\mathbf{\Sigma}$ and $\mathbf{V}^T$ such that

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$
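
In practice one rarely assembles the SVD from the eigendecompositions by hand. A short sketch using NumPy's `np.linalg.svd` (with an arbitrary rectangular example matrix) reproduces the decomposition and confirms the relationship between the singular values and the eigenvalues of $\mathbf{A}^T \mathbf{A}$:

```python
import numpy as np

# Arbitrary rectangular example matrix (3 x 2)
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])

# U: 3x3 orthogonal, s: singular values as a 1-D array, Vt: 2x2 orthogonal (transposed)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the 3x2 Sigma with the singular values on its diagonal
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))  # True

# Singular values are the square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # sorted in descending order
print(np.allclose(s, np.sqrt(eigvals)))      # True
```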

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that uses eigendecomposition to find the principal components of data. These components are orthogonal directions that capture the maximum variance in the data, allowing us to represent high-dimensional data in lower dimensions while preserving as much information as possible.

Given a data matrix $\mathbf{X}$ where each row represents an observation and each column a feature, PCA finds a set of orthogonal vectors (principal components) that best describe the variance in the data.

PCA starts with the computation of the covariance matrix

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$$

where $\mathbf{X}$ is assumed to be centered (mean-subtracted).

The principal components are the eigenvectors of the covariance matrix $\mathbf{C}$, and the corresponding eigenvalues represent the amount of variance captured by each component.

The eigendecomposition of the covariance matrix is

$$\mathbf{C} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T $$

where $\mathbf{Q}$ contains the eigenvectors (principal components) as columns, and $\mathbf{\Lambda}$ is a diagonal matrix containing the eigenvalues in descending order.

PCA Algorithm

  1. Center the data: Subtract the mean from each feature: $\mathbf{X}_{\text{centered}} = \mathbf{X} - \bar{\mathbf{X}}$
  2. Compute the covariance matrix: $\mathbf{C} = \frac{1}{n-1}\mathbf{X}_{\text{centered}}^T\mathbf{X}_{\text{centered}}$
  3. Find eigenvalues and eigenvectors: Solve $\mathbf{C}\mathbf{q}_i = \lambda_i\mathbf{q}_i$ (as outlined above)
  4. Sort by eigenvalue: Arrange eigenvectors by decreasing eigenvalue
  5. Select components: Choose the top $k$ eigenvectors for dimensionality reduction
  6. Transform data: Project data onto the new basis: $\mathbf{Y} = \mathbf{X}_{\text{centered}}\mathbf{Q}_k$
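
The six steps above translate almost one-to-one into NumPy. The following sketch (on randomly generated example data) reduces a 3-dimensional dataset to $k = 2$ components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 observations, 3 features

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Compute the covariance matrix
C = (X_centered.T @ X_centered) / (X.shape[0] - 1)

# 3. Eigenvalues and eigenvectors (eigh is suited to symmetric matrices)
eigenvalues, Q = np.linalg.eigh(C)

# 4. Sort by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]

# 5. Select the top k components
k = 2
Q_k = Q[:, :k]

# 6. Project the data onto the new basis
Y = X_centered @ Q_k
print(Y.shape)  # (100, 2)
```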

Non-Negative Matrix Factorization

Non-negative Matrix Factorization (NMF) is a variant of matrix factorization that imposes non-negativity constraints on the factor matrices.

Given a non-negative $n \times m$ matrix $\mathbf{V}$, NMF seeks non-negative matrices $\mathbf{W}$ and $\mathbf{H}$ such that

$$\mathbf{V} \approx \mathbf{W} \mathbf{H}$$

where $\mathbf{W}$ is $n \times k$ and $\mathbf{H}$ is $k \times m$, with $k \le \min(n,m)$.

NMF decomposes data into additive combinations of non-negative basis vectors, revealing the underlying parts or features that compose the data.

NMF Algorithm

We begin by initializing $\mathbf{W}$ and $\mathbf{H}$ randomly with non-negative values.

Most commonly, the objective is to minimize the difference between $\mathbf{V}$ and $\mathbf{W}\mathbf{H}$ as

$$\min_{\mathbf{W} \ge 0, \mathbf{H} \ge 0} ||\mathbf{V} - \mathbf{W} \mathbf{H}||_F^2$$

where $|| \cdot ||_F$ denotes the Frobenius norm.

Subsequently, we iteratively update $\mathbf{W}$ and $\mathbf{H}$ using the multiplicative update rules below. These are derived from gradient descent with a particular choice of step size and ensure that the non-negativity constraints are maintained.

$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{W}^T \mathbf{V}}{\mathbf{W}^T\mathbf{W}\mathbf{H}} $$

$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\mathbf{V} \mathbf{H}^T}{\mathbf{W}\mathbf{H}\mathbf{H}^T}$$

Here, $\odot$ denotes element-wise multiplication, and the division is element-wise as well.

Repeat this update process until the cost function converges or a maximum number of iterations is reached.
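
A minimal NumPy sketch of these multiplicative updates might look as follows (the rank `k`, the iteration count and the small constant `eps` guarding against division by zero are arbitrary choices for illustration):

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-10):
    """Approximate a non-negative matrix V (n x m) by W (n x k) @ H (k x m)
    using the multiplicative update rules for the Frobenius-norm objective."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k))          # random non-negative initialization
    H = rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # element-wise update of H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # element-wise update of W
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 10)))
W, H = nmf(V, k=3)
print(np.linalg.norm(V - W @ H))    # Frobenius reconstruction error
```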

Independent Component Analysis

Independent Component Analysis (ICA) is a computational technique for separating a multivariate signal into additive, statistically independent components.

Unlike PCA, which finds orthogonal components that maximize variance, ICA seeks components that are statistically independent, meaning that the value of one component does not provide any information about the values of the other components.

ICA assumes that the observed data $\mathbf{X}$ is a linear mixture of independent source signals $\mathbf{S}$ such that

$$\mathbf{X} = \mathbf{A} \mathbf{S}$$

where $\mathbf{A}$ is the unknown mixing matrix combining the independent source signals $\mathbf{S}$ into $\mathbf{X}$.

The goal is to find the unmixing matrix $\mathbf{W}$ such that

$$\mathbf{S} = \mathbf{W}\mathbf{X}$$

A key assumption is that the source signals are non-Gaussian (at most one source may be Gaussian); ICA exploits this non-Gaussianity to identify the independent components, which distinguishes it from second-order methods such as PCA.
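
As a sketch of blind source separation, scikit-learn's `FastICA` can recover two synthetic source signals from their mixtures (the signals and the mixing matrix below are made up for illustration; the sources are only recovered up to permutation, sign and scale):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, non-Gaussian source signals
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),              # sinusoid
                     np.sign(np.sin(3 * t))])    # square wave

# Mix them with a (in practice unknown) mixing matrix A
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T                                      # observed mixtures

# Estimate the unmixing matrix and recover the sources
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)
print(S_estimated.shape)                         # (2000, 2)
```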

Factor Analysis

Factor Analysis is a statistical method that models observed variables as linear combinations of unobserved latent factors. Unlike PCA, Factor Analysis is a latent variable model that seeks to explain the correlations among observed variables through underlying factors.

Factor Analysis can be mathematically defined as:

$$\mathbf{x} = \boldsymbol{\mu} + \mathbf{\Lambda}\mathbf{f} + \boldsymbol{\epsilon}$$

where $\mathbf{x}$ is the vector of observed variables, $\boldsymbol{\mu}$ is the mean vector of $\mathbf{x}$, $\mathbf{\Lambda}$ is the factor loading matrix, $\mathbf{f}$ is the vector of common (latent) factors and $\boldsymbol{\epsilon}$ is the vector of specific factors (noise).

Factor Analysis makes several assumptions that are important to discuss.

  • Common factors $\mathbf{f}$
    • $\mathbb{E}[\mathbf{f}] = \mathbf{0}$ ensures the factors are centered, simplifying interpretation and model estimation by removing location effects.
    • $\text{Cov}(\mathbf{f}) = \mathbf{I}$ standardizes the factors for identifiability, preventing arbitrary scaling and ensuring each factor contributes equally to the model.
  • Specific factors $\boldsymbol{\epsilon}$
    • $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ ensures that deviations from the common factors are purely random noise, not systematic bias.
    • $\text{Cov}(\boldsymbol{\epsilon}) = \mathbf{\Psi}$ (diagonal) assumes unique variances per variable but no shared structure, ensuring that residual variability is independent across observed variables.
  • Independence of $\mathbf{f}$ and $\boldsymbol{\epsilon}$
    • $\text{Cov}(\mathbf{f}, \boldsymbol{\epsilon}) = \mathbf{0}$ prevents overlap between shared and unique variability, ensuring that common factors explain only systematic variation, while specific factors capture only noise.

The covariance matrix of the observed variables decomposes as:

$$\boldsymbol{\Sigma} = \mathbf{\Lambda}\mathbf{\Lambda}^T + \mathbf{\Psi}$$

where $\mathbf{\Lambda}\mathbf{\Lambda}^T$ represents the common variance, capturing correlations due to shared latent factors, and $\mathbf{\Psi}$ represents the unique variance, containing variances specific to each observed variable.

The factor loadings $\mathbf{\Lambda}$ indicate how strongly each observed variable relates to the latent factors. The diagonal elements of $\mathbf{\Psi}$ represent the variance in each observed variable not explained by the common factors.
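
A short sketch with scikit-learn's `FactorAnalysis` (on synthetic data generated from two latent factors, purely for illustration) shows how the estimated loadings and unique variances map onto $\mathbf{\Lambda}$ and $\mathbf{\Psi}$:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: 200 observations of 6 variables driven by 2 latent factors
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 2))                    # latent factors f
L = rng.normal(size=(2, 6))                      # true loadings (as Lambda^T)
X = F @ L + 0.1 * rng.normal(size=(200, 6))      # add specific noise

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

print(fa.components_.shape)   # (2, 6): estimated factor loadings (rows of Lambda^T)
print(fa.noise_variance_)     # diagonal of Psi, the unique variances
```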

Conclusion

Matrix factorization is a powerful framework that underpins many modern data analysis and machine learning techniques. While all these methods decompose a matrix into simpler components, each serves a distinct purpose and operates under different assumptions:

  • Singular Value Decomposition (SVD) provides a general-purpose, orthogonal decomposition that works for any matrix, making it ideal for dimensionality reduction, noise reduction, and solving linear systems.
  • Principal Component Analysis (PCA) leverages eigendecomposition to find directions of maximum variance, making it a go-to method for unsupervised dimensionality reduction.
  • Non-Negative Matrix Factorization (NMF) imposes non-negativity constraints, making it particularly useful for interpretable parts-based representations, such as in image processing and topic modeling.
  • Independent Component Analysis (ICA) focuses on statistical independence rather than orthogonality, making it suitable for blind source separation tasks like audio signal processing.
  • Factor Analysis models observed data as a combination of latent variables, providing a probabilistic framework for understanding underlying structures in data.

Each technique has its strengths and limitations, and the choice depends on the problem at hand. SVD and PCA excel in capturing linear relationships, while NMF and ICA provide more interpretable or statistically independent components. Factor Analysis, meanwhile, offers a probabilistic perspective on latent structures.
