A note on Singular Value Decomposition and Principal Components

Let \mathbf{X} denote an n \times k data matrix, e.g., observations on k variables for n individuals. In what follows, we assume \mathbf{X} has been centered at zero, i.e., each element of the original uncentered data matrix has had the mean of its column subtracted from it to form the corresponding element of \mathbf{X}. If \mathbf{X}^{\prime} denotes the transpose of \mathbf{X}, then the k \times k matrix \mathbf{X}^{\prime} \mathbf{X} is (n-1)-times the covariance matrix of the data. Since \mathbf{X}^{\prime} \mathbf{X} is symmetric and positive semidefinite, it has real nonnegative eigenvalues which we will denote by s_j^2 for j = 1, \ldots, k, and corresponding orthonormal eigenvectors \mathbf{v}_j, j = 1, \ldots, k. It is customary to order these so that s_1^2 \ge s_2^2 \ge \cdots \ge s_k^2, and we can then write the eigenvalue equations in matrix form as

\mathbf{X}^{\prime} \mathbf{X} \mathbf{V} = \mathbf{V} \Sigma^2 \qquad \qquad \qquad \qquad \qquad (1)

where \mathbf{V} is a k \times k matrix whose columns are the normalised eigenvectors of \mathbf{X}^{\prime} \mathbf{X}, and \Sigma^2 is a k \times k diagonal matrix with the eigenvalues in descending order of size along the main diagonal.

In Principal Components Analysis (PCA), the eigenvectors of \mathbf{X}^{\prime} \mathbf{X} are referred to as the principal directions of the centered data matrix \mathbf{X}. The principal components of the data matrix \mathbf{X} are the projections \mathbf{X} \mathbf{v}_j of the data set onto its principal directions. The n \times k product matrix \mathbf{X} \mathbf{V} is thus used for data dimensionality reduction. The idea is that the first two or three columns of the n \times k principal components matrix \mathbf{X} \mathbf{V} might capture most of the variation between the n individuals in the data set, this variation being measured by the corresponding eigenvalues in the diagonal matrix \Sigma^2. We can then produce a two- or three-dimensional scatter plot of the data, in which each individual (each row of \mathbf{X} \mathbf{V}) appears as a point whose coordinates are the entries in the first two or three columns of that row. The remaining columns of \mathbf{X} \mathbf{V} can be ignored since they do not vary much between the individuals, so the large data set has been reduced to a convenient two- or three-dimensional scatter plot displaying most of the variation in the data set.
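To make this concrete, here is a minimal NumPy sketch of the procedure just described, using a small synthetic data matrix; the random data, the sizes n = 100 and k = 5, and the variable names (X, V, scores) are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # synthetic data: n = 100 individuals, k = 5 variables
X = X - X.mean(axis=0)                   # center each column at zero, as assumed above

# Eigendecomposition of X'X; eigh returns the eigenvalues in ascending order,
# so reverse both outputs to get the descending order used in equation (1).
eigvals, V = np.linalg.eigh(X.T @ X)
eigvals, V = eigvals[::-1], V[:, ::-1]

scores = X @ V                           # the n x k principal components matrix XV
plot_coords = scores[:, :2]              # first two columns give a 2-D scatter plot of the data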

The spectral decomposition of the matrix \mathbf{X}^{\prime} \mathbf{X} is thus

\mathbf{X}^{\prime} \mathbf{X} = \mathbf{V} \Sigma^2 \mathbf{V}^{\prime} \qquad \qquad \qquad \qquad \qquad (2)

which can be written equivalently as

(\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V}) = \Sigma^2 \qquad \qquad \qquad \qquad \qquad (3)

But note that (\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V}) is (n-1)-times the covariance matrix of \mathbf{X} \mathbf{V}, since the columns of \mathbf{X} \mathbf{V} also have zero mean. This shows that the eigenvalues of \mathbf{X}^{\prime} \mathbf{X} along the diagonal of \Sigma^2 are, up to the factor (n-1), the variances of the columns of the principal components matrix \mathbf{X} \mathbf{V}, and remember these are arranged in descending order of size. Furthermore, the principal directions \mathbf{V} represented by the eigenvectors of \mathbf{X}^{\prime} \mathbf{X} are the directions in which the data are maximally dispersed, i.e., the total dispersion cannot be increased by choosing any other matrix \mathbf{M} instead of \mathbf{V} which satisfies \mathbf{M}' \mathbf{M} = \mathbf{I}. To clarify this, note that

\text{tr}[(\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V})] = \text{tr}(\mathbf{V}' \mathbf{X}'\mathbf{X}\mathbf{V}) = \text{tr}(\mathbf{X}'\mathbf{X}\mathbf{V}\mathbf{V}') \leq \sum_{i=1}^k \lambda_i(\mathbf{X}'\mathbf{X}) \lambda_i(\mathbf{V}\mathbf{V}')

where the last inequality is known as von Neumann’s trace inequality, and the notation \lambda_i(\mathbf{M}), i = 1, \ldots, k, represents the eigenvalues of a matrix \mathbf{M} arranged in order of decreasing size. We see that the total dispersion of the principal components matrix \mathbf{X} \mathbf{V} attains the upper bound in von Neumann’s inequality when \mathbf{V} is the orthonormal matrix of eigenvectors of \mathbf{X}'\mathbf{X}, so that \mathbf{V} \mathbf{V}' = \mathbf{I} and

\text{tr}[(\mathbf{X} \mathbf{V})' (\mathbf{X} \mathbf{V})] = \sum_{i=1}^k \lambda_i(\mathbf{X}'\mathbf{X})

The same argument applied to a k \times r matrix \mathbf{M} with orthonormal columns (r < k) shows that \text{tr}[(\mathbf{X} \mathbf{M})' (\mathbf{X} \mathbf{M})] \leq \sum_{i=1}^r \lambda_i(\mathbf{X}'\mathbf{X}), with equality when the columns of \mathbf{M} are the first r eigenvectors of \mathbf{X}'\mathbf{X}; this is the sense in which the first few principal directions capture as much of the variation as possible.
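These relationships are easy to check numerically. The following sketch (again with an illustrative synthetic data matrix) verifies equation (3), the link between the eigenvalues and the column variances of \mathbf{X} \mathbf{V}, and the fact that the trace attains the von Neumann bound.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                      # centered n x k data matrix

eigvals, V = np.linalg.eigh(X.T @ X)
eigvals, V = eigvals[::-1], V[:, ::-1]      # descending order, as in equation (1)

scores = X @ V                              # principal components matrix XV

# Equation (3): (XV)'(XV) is diagonal, with the eigenvalues of X'X on the diagonal.
assert np.allclose(scores.T @ scores, np.diag(eigvals))

# The column variances of XV are the eigenvalues divided by (n - 1).
n = X.shape[0]
assert np.allclose(scores.var(axis=0, ddof=1), eigvals / (n - 1))

# The total dispersion attains the von Neumann upper bound, the sum of the eigenvalues of X'X.
assert np.isclose(np.trace(scores.T @ scores), eigvals.sum())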

There is a more general decomposition of the data matrix \mathbf{X}, called Singular Value Decomposition (SVD), which is closely related to PCA. The SVD approach is based on factoring \mathbf{X} into three parts

\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^{\prime} \qquad \qquad \qquad \qquad \qquad (4)

where \mathbf{U} is an orthogonal n \times n matrix containing the eigenvectors of the n \times n matrix \mathbf{X} \mathbf{X}^{\prime}, \Sigma is an n \times k rectangular diagonal matrix whose k nonzero entries are called the singular values of \mathbf{X} and are the square roots of the eigenvalues of \mathbf{X}^{\prime} \mathbf{X}, and \mathbf{V} is the k \times k orthogonal matrix of eigenvectors of \mathbf{X}^{\prime} \mathbf{X}. Note that \mathbf{X} \mathbf{X}^{\prime} is an n \times n matrix but with the same nonzero eigenvalues as \mathbf{X}^{\prime} \mathbf{X}, so its rank is at most k (and equals k when \mathbf{X} has full column rank). Thus, only k columns of \mathbf{U} correspond to nonzero eigenvalues of \mathbf{X} \mathbf{X}^{\prime}, and the remaining n - k columns have to be ‘padded’ with unit vectors that are pairwise orthogonal to each other and to the first k columns (e.g., by extending the first k columns to an orthonormal basis of the whole n-dimensional space using the Gram–Schmidt process); these padding columns are automatically eigenvectors of \mathbf{X} \mathbf{X}^{\prime} with eigenvalue zero. The matrix \Sigma will typically have rows of zeros below the k \times k diagonal upper part.
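The shapes described above can be checked directly with NumPy’s built-in SVD routine. The sketch below, using an arbitrary small synthetic matrix, rebuilds the n \times k matrix \Sigma from the returned singular values and verifies equation (4) together with the link between the singular values and the eigenvalues of \mathbf{X}^{\prime} \mathbf{X}.

import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
X = rng.normal(size=(n, k))
X = X - X.mean(axis=0)                       # centered n x k data matrix

# Full SVD: U is n x n, s holds the k singular values, Vt is the k x k matrix V'.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the n x k 'diagonal' matrix Sigma of equation (4): its upper k x k
# block is diag(s) and the remaining rows are zero.
Sigma = np.zeros((n, k))
Sigma[:k, :k] = np.diag(s)

assert np.allclose(X, U @ Sigma @ Vt)        # X = U Sigma V'

# The squared singular values are the eigenvalues of X'X, in descending order.
assert np.allclose(s**2, np.linalg.eigvalsh(X.T @ X)[::-1])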

The decomposition in (4) is used widely in data analysis and image compression. For example, in the case of image compression, \mathbf{X} might be an extremely large matrix containing information about each pixel in an image. The factorisation on the right-hand side of (4) can then be used to reduce the dimensionality of the data in a manner reminiscent of PCA. It may be possible to retain only a small number of the largest singular values in \Sigma, together with the corresponding leading columns of \mathbf{U} and \mathbf{V}, while still retaining most of the key information required to render the image faithfully. Also note that in the case when \mathbf{X} is a real square matrix, the right-hand side of (4) can be interpreted in geometric terms as decomposing the transformation represented by the matrix \mathbf{X} into a rotation or reflection, followed by a scaling, followed by another rotation or reflection.
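A truncated version of (4), keeping only the r largest singular values and the corresponding leading columns of \mathbf{U} and \mathbf{V}, gives a rank-r approximation of the image. The sketch below uses a random array as a stand-in for a greyscale image and an arbitrary choice r = 20; a real image, whose singular values typically decay rapidly, would be approximated far more faithfully at the same rank.

import numpy as np

rng = np.random.default_rng(3)
image = rng.normal(size=(200, 150))          # stand-in for a 200 x 150 greyscale image

# Thin SVD: U is 200 x 150, s holds the 150 singular values, Vt is 150 x 150.
U, s, Vt = np.linalg.svd(image, full_matrices=False)

r = 20                                       # number of singular values to keep
approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Storage falls from 200*150 numbers to roughly r*(200 + 150 + 1).
relative_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
print(f"rank-{r} approximation, relative Frobenius error: {relative_error:.3f}")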

It is useful to understand the link between SVD and PCA because computational maths systems like Python and MATLAB have in-built functions for SVD which can be used to speed up the calculations required for PCA. Given a centered data matrix \mathbf{X}, the SVD in (4) can be carried out quickly and easily using these in-built functions. The output matrices can then be used to construct the principal components matrix \mathbf{X} \mathbf{V} = \mathbf{U} \Sigma and the eigenvalue matrix \Sigma^2 required for PCA, without ever forming \mathbf{X}^{\prime} \mathbf{X} explicitly.
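For example, in Python the whole PCA calculation can be driven from numpy.linalg.svd; the sketch below is one way to do this under the assumptions and notation above (the synthetic data and variable names are illustrative), and it also checks the identity \mathbf{X} \mathbf{V} = \mathbf{U} \Sigma.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                   # centered data matrix, as assumed throughout

# A thin SVD is enough for PCA: U is n x k, s holds the k singular values, Vt is k x k.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

V = Vt.T                                 # principal directions (eigenvectors of X'X)
eigvals = s**2                           # diagonal of Sigma^2, i.e. the eigenvalues of X'X
scores = X @ V                           # principal components matrix XV

# Since XV = U Sigma V'V = U Sigma, the scores can also be read off the SVD directly.
assert np.allclose(scores, U * s)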

Published by Dr Christian P. H. Salas

Mathematics Lecturer