A note on Singular Value Decomposition and Principal Components

Let \mathbf{X} denote an n \times k data matrix, e.g., observations on k variables for n individuals. In what follows, we assume \mathbf{X} has been centered at zero, i.e., each element of the original uncentered data matrix has had the mean of its column subtracted from it to form the corresponding element of \mathbf{X}. If \mathbf{X}^{\prime} denotes the transpose of \mathbf{X}, then the k \times k matrix \mathbf{X}^{\prime} \mathbf{X} is (n-1)-times the covariance matrix of the data. Since \mathbf{X}^{\prime} \mathbf{X} is symmetric and positive semidefinite, it has real nonnegative eigenvalues which we will denote by s_j^2 for j = 1, \ldots, k, and corresponding orthonormal eigenvectors \mathbf{v}_j, j = 1, \ldots, k. It is customary to order these so that s_1^2 \ge s_2^2 \ge \cdots \ge s_k^2, and we can then write the eigenvalue equations in matrix form as

\mathbf{X}^{\prime} \mathbf{X} \mathbf{V} = \mathbf{V} \Sigma^2 \qquad \qquad \qquad \qquad \qquad (1)

where \mathbf{V} is a k \times k matrix whose columns are the normalised eigenvectors of \mathbf{X}^{\prime} \mathbf{X}, and \Sigma^2 is a k \times k diagonal matrix with the eigenvalues in descending order of size along the main diagonal.

In Principal Components Analysis (PCA), the eigenvectors of \mathbf{X}^{\prime} \mathbf{X} are referred to as the principal directions of the centered data matrix \mathbf{X}. The principal components of the data matrix \mathbf{X} are the projections \mathbf{X} \mathbf{v}_j of the data set onto its principal directions. The n \times k product matrix \mathbf{X} \mathbf{V} is thus used for data dimensionality reduction. The idea is that the first two or three columns of the n \times k principal components matrix \mathbf{X} \mathbf{V} might capture most of the variation between the n individuals in the data set, this variation being measured by the corresponding eigenvalues in the diagonal matrix \Sigma^2. We can then produce a two- or three-dimensional scatter plot of the data, in which each individual (each row of \mathbf{X} \mathbf{V}) appears as a point whose coordinates are the entries in the first two or three columns of that row. The remaining columns of \mathbf{X} \mathbf{V} can be ignored since they do not vary much between the individuals, so the large data set has been reduced to a convenient two- or three-dimensional scatter plot displaying most of the variation in the data set.
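To make this concrete, here is a minimal NumPy sketch of the procedure just described, using a small synthetic data matrix; the random data, the sizes n = 100 and k = 5, and the variable names (X, V, scores) are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # synthetic data: n = 100 individuals, k = 5 variables
X = X - X.mean(axis=0)                   # center each column at zero, as assumed above

# Eigendecomposition of X'X; eigh returns the eigenvalues in ascending order,
# so reverse both outputs to get the descending order used in equation (1).
eigvals, V = np.linalg.eigh(X.T @ X)
eigvals, V = eigvals[::-1], V[:, ::-1]

scores = X @ V                           # the n x k principal components matrix XV
plot_coords = scores[:, :2]              # first two columns give a 2-D scatter plot of the data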

The spectral decomposition of the matrix \mathbf{X}^{\prime} \mathbf{X} is thus

\mathbf{X}^{\prime} \mathbf{X} = \mathbf{V} \Sigma^2 \mathbf{V}^{\prime} \qquad \qquad \qquad \qquad \qquad (2)

which can be written equivalently as

(\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V}) = \Sigma^2 \qquad \qquad \qquad \qquad \qquad (3)

But note that (\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V}) is (n-1)-times the covariance matrix of \mathbf{X} \mathbf{V}, since the columns of \mathbf{X} \mathbf{V} also have zero mean. This shows that the eigenvalues of \mathbf{X}^{\prime} \mathbf{X} along the diagonal of \Sigma^2 are, up to the factor (n-1), the variances of the columns of the principal components matrix \mathbf{X} \mathbf{V}, and remember these are arranged in descending order of size. Furthermore, the principal directions \mathbf{V} represented by the eigenvectors of \mathbf{X}^{\prime} \mathbf{X} are the directions in which the data are maximally dispersed, i.e., the total dispersion cannot be increased by choosing any other matrix \mathbf{M} instead of \mathbf{V} which satisfies \mathbf{M}' \mathbf{M} = \mathbf{I}. To clarify this, note that

\text{tr}[(\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V})] = \text{tr}(\mathbf{V}' \mathbf{X}'\mathbf{X}\mathbf{V}) = \text{tr}(\mathbf{X}'\mathbf{X}\mathbf{V}\mathbf{V}') \leq \sum_{i=1}^k \lambda_i(\mathbf{X}'\mathbf{X}) \lambda_i(\mathbf{V}\mathbf{V}')

where the last inequality is known as von Neumann’s trace inequality, and the notation \lambda_i(\mathbf{M}), i = 1, \ldots, k, represents the eigenvalues of a matrix \mathbf{M} arranged in order of decreasing size. We see that the total dispersion of the principal components matrix \mathbf{X} \mathbf{V} attains the upper bound in von Neumann’s inequality when \mathbf{V} is the orthonormal matrix of eigenvectors of \mathbf{X}'\mathbf{X}, so that \mathbf{V} \mathbf{V}' = \mathbf{I} and

\text{tr}[(\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V})] = \sum_{i=1}^k \lambda_i(\mathbf{X}'\mathbf{X})

The same argument applied to a k \times r matrix \mathbf{M} with orthonormal columns (r < k) shows that \text{tr}[(\mathbf{X} \mathbf{M})' (\mathbf{X} \mathbf{M})] \leq \sum_{i=1}^r \lambda_i(\mathbf{X}'\mathbf{X}), with equality when the columns of \mathbf{M} are the first r eigenvectors of \mathbf{X}'\mathbf{X}; this is the sense in which the first few principal directions capture as much of the variation as possible.
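These relationships are easy to check numerically. The following sketch (again with an illustrative synthetic data matrix) verifies equation (3), the link between the eigenvalues and the column variances of \mathbf{X} \mathbf{V}, and the fact that the trace attains the von Neumann bound.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                      # centered n x k data matrix

eigvals, V = np.linalg.eigh(X.T @ X)
eigvals, V = eigvals[::-1], V[:, ::-1]      # descending order, as in equation (1)

scores = X @ V                              # principal components matrix XV

# Equation (3): (XV)'(XV) is diagonal, with the eigenvalues of X'X on the diagonal.
assert np.allclose(scores.T @ scores, np.diag(eigvals))

# The column variances of XV are the eigenvalues divided by (n - 1).
n = X.shape[0]
assert np.allclose(scores.var(axis=0, ddof=1), eigvals / (n - 1))

# The total dispersion attains the von Neumann upper bound, the sum of the eigenvalues of X'X.
assert np.isclose(np.trace(scores.T @ scores), eigvals.sum())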

There is a more general decomposition of the data matrix \mathbf{X}, called Singular Value Decomposition (SVD), which is closely related to PCA. The SVD approach is based on factoring \mathbf{X} into three parts

\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^{\prime} \qquad \qquad \qquad \qquad \qquad (4)

where \mathbf{U} is an orthogonal n \times n matrix containing the eigenvectors of the n \times n matrix \mathbf{X} \mathbf{X}^{\prime}, \Sigma is an n \times k rectangular diagonal matrix whose k nonzero entries are called the singular values of \mathbf{X} and are the square roots of the eigenvalues of \mathbf{X}^{\prime} \mathbf{X}, and \mathbf{V} is the k \times k orthogonal matrix of eigenvectors of \mathbf{X}^{\prime} \mathbf{X}. Note that \mathbf{X} \mathbf{X}^{\prime} is an n \times n matrix but with the same nonzero eigenvalues as \mathbf{X}^{\prime} \mathbf{X}, so its rank is at most k (and equals k when \mathbf{X} has full column rank). Thus, only k columns of \mathbf{U} correspond to nonzero eigenvalues of \mathbf{X} \mathbf{X}^{\prime}, and the remaining n - k columns have to be ‘padded’ with unit vectors that are pairwise orthogonal to each other and to the first k columns (e.g., by extending the first k columns to an orthonormal basis of the whole n-dimensional space using the Gram–Schmidt process); these padding columns are automatically eigenvectors of \mathbf{X} \mathbf{X}^{\prime} with eigenvalue zero. The matrix \Sigma will typically have rows of zeros below the k \times k diagonal upper part.
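The shapes described above can be checked directly with NumPy’s built-in SVD routine. The sketch below, using an arbitrary small synthetic matrix, rebuilds the n \times k matrix \Sigma from the returned singular values and verifies equation (4) together with the link between the singular values and the eigenvalues of \mathbf{X}^{\prime} \mathbf{X}.

import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
X = rng.normal(size=(n, k))
X = X - X.mean(axis=0)                       # centered n x k data matrix

# Full SVD: U is n x n, s holds the k singular values, Vt is the k x k matrix V'.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the n x k 'diagonal' matrix Sigma of equation (4): its upper k x k
# block is diag(s) and the remaining rows are zero.
Sigma = np.zeros((n, k))
Sigma[:k, :k] = np.diag(s)

assert np.allclose(X, U @ Sigma @ Vt)        # X = U Sigma V'

# The squared singular values are the eigenvalues of X'X, in descending order.
assert np.allclose(s**2, np.linalg.eigvalsh(X.T @ X)[::-1])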

The decomposition in (4) is used widely in data analysis and image compression. For example, in the case of image compression, \mathbf{X} might be an extremely large matrix containing information about each pixel in an image. The factorisation on the right-hand side of (4) can then be used to reduce the dimensionality of the data in a manner reminiscent of PCA. It may be possible to retain only a small number of the largest singular values in \Sigma, together with the corresponding leading columns of \mathbf{U} and \mathbf{V}, while still retaining most of the key information required to render the image faithfully. Also note that in the case when \mathbf{X} is a real square matrix, the right-hand side of (4) can be interpreted in geometric terms as decomposing the transformation represented by the matrix \mathbf{X} into a rotation or reflection, followed by a scaling, followed by another rotation or reflection.
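A truncated version of (4), keeping only the r largest singular values and the corresponding leading columns of \mathbf{U} and \mathbf{V}, gives a rank-r approximation of the image. The sketch below uses a random array as a stand-in for a greyscale image and an arbitrary choice r = 20; a real image, whose singular values typically decay rapidly, would be approximated far more faithfully at the same rank.

import numpy as np

rng = np.random.default_rng(3)
image = rng.normal(size=(200, 150))          # stand-in for a 200 x 150 greyscale image

# Thin SVD: U is 200 x 150, s holds the 150 singular values, Vt is 150 x 150.
U, s, Vt = np.linalg.svd(image, full_matrices=False)

r = 20                                       # number of singular values to keep
approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Storage falls from 200*150 numbers to roughly r*(200 + 150 + 1).
relative_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
print(f"rank-{r} approximation, relative Frobenius error: {relative_error:.3f}")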

It is useful to understand the link between SVD and PCA because computational maths systems like Python and MATLAB have in-built functions for SVD which can be used to speed up the calculations required for PCA. Given a centered data matrix \mathbf{X}, the SVD in (4) can be carried out quickly and easily using these in-built functions. The output matrices can then be used to construct the principal components matrix \mathbf{X} \mathbf{V} = \mathbf{U} \Sigma and the eigenvalue matrix \Sigma^2 required for PCA, without ever forming \mathbf{X}^{\prime} \mathbf{X} explicitly.
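For example, in Python the whole PCA calculation can be driven from numpy.linalg.svd; the sketch below is one way to do this under the assumptions and notation above (the synthetic data and variable names are illustrative), and it also checks the identity \mathbf{X} \mathbf{V} = \mathbf{U} \Sigma.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                   # centered data matrix, as assumed throughout

# A thin SVD is enough for PCA: U is n x k, s holds the k singular values, Vt is k x k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

V = Vt.T                                 # principal directions (eigenvectors of X'X)
eigvals = s**2                           # diagonal of Sigma^2, i.e. the eigenvalues of X'X
scores = X @ V                           # principal components matrix XV

# Since XV = U Sigma V'V = U Sigma, the scores can also be read off the SVD directly.
assert np.allclose(scores, U * s)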

Published by Dr Christian P. H. Salas

Mathematics Lecturer