
Singular Value Decomposition and LoRA

In this post we:

  • Motivate and gain some geometric understanding of the Singular Value Decomposition.
  • Give a formal proof that the SVD exists which follows the geometric intuition.
  • Gain some geometric intuition for X^\top X and how to connect the SVD with the eigenstuff of X^\top X.
  • See how to use the SVD to do linear regression (aka projecting a vector y onto \textrm{Im}(X)).
  • See how the SVD naturally gives best low rank approximations of linear transformations.
  • Understand the subspace similarity metric introduced in the LoRA paper through the geometric lens we have developed here.

Geometric Motivation for SVD

Consider the matrix

X = \begin{bmatrix} 1 & 1 \\ 1 & 3 \\ 1 & -1 \end{bmatrix}

Then the image of X is the plane spanned by the columns of X.

The image of the unit circle is an ellipse.

You can access an interactive version here.

The singular value decomposition of X is motivated by a compelling geometric problem: we want to find the axes of this ellipse.

For now let’s just treat the SVD as a black box, and see how each part of the SVD corresponds to a part of the solution to this problem.

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])
svdstuff = np.linalg.svd(X)   # returns (U, array of singular values, V^T)

U = svdstuff[0]

# Build the 3x2 matrix Sigma: singular values on the diagonal, padded with a row of zeros
D = np.concatenate([np.diag(svdstuff[1]), np.zeros((1, 2))])

V = np.transpose(svdstuff[2])
print('U = \n', U)
print('\n D = \n', D)
print('\n V = \n', V)

print('\n These should be equal \n',  X, '\n = \n ', np.dot(U, np.dot(D, np.transpose(V))))

U = 
 [[-3.65148372e-01  4.47213595e-01 -8.16496581e-01]
 [-9.12870929e-01 -2.57661936e-16  4.08248290e-01]
 [ 1.82574186e-01  8.94427191e-01  4.08248290e-01]]

 D = 
 [[3.46410162 0.        ]
 [0.         1.41421356]
 [0.         0.        ]]

 V = 
 [[-0.31622777  0.9486833 ]
 [-0.9486833  -0.31622777]]

 These should be equal 
 [[ 1  1]
 [ 1  3]
 [ 1 -1]] 
 = 
 [[ 1.  1.]
 [ 1.  3.]
 [ 1. -1.]]

So we have

X = U \Sigma V^\top

with

U \approx \begin{bmatrix} -0.365 & 0.447 & -0.816 \\ -0.912 & 0 & 0.408 \\ 0.182 & 0.894 & 0.408 \end{bmatrix} \hphantom{dsds} \Sigma \approx \begin{bmatrix} 3.464 & 0 \\ 0 & 1.414 \\ 0 & 0 \end{bmatrix} \hphantom{dsds} V \approx \begin{bmatrix} -0.316 & 0.949 \\ -0.949 & -0.316 \end{bmatrix}

The columns of U, which we call the left singular vectors, are

\vec{u}_1 = \begin{bmatrix} -0.365 \\ -0.912 \\ 0.182 \end{bmatrix} \hphantom{dsds} \vec{u}_2 = \begin{bmatrix} 0.447 \\ 0 \\ 0.894 \end{bmatrix} \hphantom{dsds} \vec{u}_3 = \begin{bmatrix} -0.816 \\ 0.408 \\ 0.408 \end{bmatrix}

The diagonal entries of \Sigma are \sigma_1 = 3.464 and \sigma_2 = 1.414.

Let’s visualize what we have so far before moving on to V:

Again, you can access an interactive version here.

We can see that u1 is the unit vector pointing in the direction of the major axis of the ellipse, u2 is the unit vector pointing in the direction of the minor axis of the ellipse, and u3 is just chosen to complete an orthonormal basis of the codomain.

\sigma_1\vec{u}_1 and \sigma_2\vec{u}_2 are the vectors pointing from the center of the ellipse to its vertices. \sigma_1 and \sigma_2 are their lengths.

The columns of V, which we call the right singular vectors, are

\vec{v}_1 = \begin{bmatrix} -0.316 \\ -0.949 \end{bmatrix} \hphantom{dsds} \vec{v}_2 = \begin{bmatrix} 0.949 \\ -0.316 \end{bmatrix}

Let’s plot these in the domain:

These have been chosen so that

X\vec{v}_1 = \sigma_1\vec{u}_1 \hphantom{dsds} X\vec{v}_2 = \sigma_2\vec{u}_2

In other words, the right singular vectors (the \vec{v}_j) are the inverse images of the vertices of the ellipse.

You may not have noticed it but a small miracle has occurred. The right singular vectors are orthogonal to each other! This is really the key insight which makes SVD possible. We will understand why this happens soon.
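Before we do, here is a quick numpy check of what we just observed (recomputing the SVD of our example matrix):

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])
U, s, Vt = np.linalg.svd(X)        # s = [sigma_1, sigma_2], Vt = V^T
V = Vt.T

# X v_j should equal sigma_j u_j for j = 1, 2
for j in range(2):
    print(np.allclose(X @ V[:, j], s[j] * U[:, j]))   # True, True

# The "small miracle": the right singular vectors are orthogonal
print(np.isclose(V[:, 0] @ V[:, 1], 0))               # True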

To summarize, and slightly generalize:

  • We started with an n \times p matrix X.
  • The unit sphere in \mathbb{R}^p is transformed into an ellipsoid living in \textrm{Im}(X) in the codomain \mathbb{R}^n.
  • Let r = \textrm{Rank}(X). We found the unit vectors \vec{u}_1, \vec{u}_2, \ldots, \vec{u}_r pointing in the directions of the axes of this ellipsoid. We let \sigma_1, \sigma_2, \ldots, \sigma_r be the distances from the origin to the corresponding vertices.
  • \vec{v}_1, \vec{v}_2, \ldots, \vec{v}_r are chosen to be the inverse images of \sigma_1\vec{u}_1, \sigma_2\vec{u}_2, \ldots, \sigma_r\vec{u}_r. It is a small miracle that these end up orthogonal as well.
  • If n > r, then we also completed the \vec{u}_j to form a basis of \mathbb{R}^n: \vec{u}_{r+1}, \vec{u}_{r+2}, \ldots, \vec{u}_n are chosen to be orthonormal and span \textrm{Im}(X)^\perp.
  • Similarly, if p > r then the null space of X will be \textrm{Span}(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_r)^\perp, and we choose the remaining \vec{v}_j to complete an orthonormal basis of \mathbb{R}^p. You could also view this as the image ellipsoid having "degenerate axes" of length 0, and extend the \sigma_j to be 0 in this case.
  • At the end of this process we have orthonormal bases of both the domain and codomain with:
X\vec{v}_j = \sigma_j\vec{u}_j

Let’s attempt to carry out this construction in general.

To find the major axis we will maximize S(\beta) = \vert X\beta\vert^2 subject to the constraint that g(\beta) = \vert\beta\vert^2 = 1. Note that the maximum must exist because we are maximizing the continuous function S on the unit sphere in \mathbb{R}^p, which is compact. So we know that the maximum value is achieved.

To do this we will use the method of Lagrange Multipliers: at the maximizing βmax we have

\nabla S\vert_{\beta_{\max}} = \lambda \nabla g\vert_{\beta_{\max}}

You can find a full explanation of how to compute these gradients here: just set y=0 in that derivation.

We have

\nabla S\vert_{\beta} = 2X^\top X\beta \hphantom{dsds} \nabla g\vert_{\beta} = 2\beta

so we have that

X^\top X\beta_{\max} = \lambda\beta_{\max}

Notice that this also implies (take the inner product of both sides with \beta_{\max}) that \vert X\beta_{\max}\vert^2 = \lambda, so \lambda \geq 0.

The key to SVD is the following lemma, which explains the small miracle we observed in our concrete example:

Small Miracle Lemma:

Let X, \lambda, \beta_{\max} be as above. Let \beta_\perp be any vector orthogonal to \beta_{\max}. Then X\beta_\perp is also orthogonal to X\beta_{\max}.

Proof:

\langle X\beta_\perp, X\beta_{\max}\rangle = \langle \beta_\perp, X^\top X\beta_{\max}\rangle = \langle \beta_\perp, \lambda\beta_{\max}\rangle = \lambda\langle \beta_\perp, \beta_{\max}\rangle = 0

This lemma has a nice geometric interpretation: \beta_{\max} is a radial vector of the unit sphere, and \beta_\perp is perpendicular to it. For a sphere, the tangent space is the space of vectors perpendicular to the radial vector. In the codomain we are looking at an ellipsoid. In general a radial vector will not be perpendicular to the tangent space of the ellipsoid, but we are saying that it is perpendicular when that radial vector points along the major axis of the ellipsoid.

Adjust the value of the slider s in this geogebra link. You can see that the radial vector of an ellipse is only perpendicular to the tangent line at the axes of the ellipse.

This should be geometrically reasonable: at points where the radial vector and tangent line are not perpendicular, the radial vector will get a little bigger if moved one direction, and a little smaller if moved in the other direction. At the max, moving in either direction can only decrease the length, which means that the tangent line must be perpendicular to the radial vector.

We can now build bases for the domain and codomain of an n \times p matrix X as follows:

  • Choose \vec{v}_1 to be a vector maximizing \vert X\vec{v}_1\vert^2 subject to the constraint \vert\vec{v}_1\vert^2 = 1. We have \vert X\vec{v}_1\vert^2 = \lambda_1. Set \sigma_1 = \sqrt{\lambda_1}, so that \vert X\vec{v}_1\vert = \sigma_1.
  • Set \vec{u}_1 to be the unit vector pointing in the same direction as X\vec{v}_1. By our computation, X\vec{v}_1 = \sigma_1\vec{u}_1.
  • Now restrict the domain of X to \vec{v}_1^\perp. By the lemma, the image of X\vert_{\vec{v}_1^\perp} is perpendicular to X\vec{v}_1 (equivalently, to \vec{u}_1). So we can iterate: choose \vec{v}_2 to be a vector from \vec{v}_1^\perp maximizing \vert X\vec{v}_2\vert^2 subject to the constraint that \vert\vec{v}_2\vert^2 = 1. We have \vert X\vec{v}_2\vert^2 = \lambda_2. Set \sigma_2 = \sqrt{\lambda_2}, so that \vert X\vec{v}_2\vert = \sigma_2.
  • Set \vec{u}_2 to be the unit vector pointing in the same direction as X\vec{v}_2, so that X\vec{v}_2 = \sigma_2\vec{u}_2. By the lemma \vec{u}_2 \perp \vec{u}_1.
  • Let r = \textrm{Rank}(X). Continue to iteratively construct orthonormal vectors \vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_r and \vec{u}_1, \vec{u}_2, \vec{u}_3, \ldots, \vec{u}_r until the \vec{u}_j span the image of X.
  • At this point \textrm{Span}(\vec{v}_1, \ldots, \vec{v}_r)^\perp will be the null space of X. We can find an orthonormal basis \vec{v}_{r+1}, \vec{v}_{r+2}, \vec{v}_{r+3}, \ldots, \vec{v}_p of the null space. Similarly, we can find an orthonormal basis \vec{u}_{r+1}, \vec{u}_{r+2}, \ldots, \vec{u}_n of \textrm{Im}(X)^\perp. We can set \sigma_j = 0 for j > \textrm{Rank}(X) so that we continue to have X\vec{v}_j = \sigma_j\vec{u}_j in this case.
  • Now \{\vec{v}_j\} is an orthonormal basis of \mathbb{R}^p, \{\vec{u}_j\} is an orthonormal basis of \mathbb{R}^n, and we have
X\vec{v}_j = \sigma_j\vec{u}_j \textrm{ for } 1 \leq j \leq p

If we let U be the orthogonal matrix whose columns are uj, V be the orthogonal matrix whose columns are vj, and Σ be the n×p matrix whose diagonal is Σjj=σj, then we can rephrase this as the matrix factorization:

X = U \Sigma V^\top
  • Note: We call uj the left singular vectors, and vj the right singular vectors.

We have just proven the Singular Value Decomposition theorem. Let’s state it again formally:

Singular Value Decomposition Theorem: Let X be any n \times p matrix. Then we can find a factorization X = U \Sigma V^\top where

  • V is an orthogonal p×p matrix (its columns form an orthonormal basis of Rp).
  • U is an orthogonal n×n matrix (its columns form an orthonormal basis of Rn).
  • \Sigma is an n \times p matrix which has zero entries everywhere but the diagonal, where \Sigma_{jj} = \sigma_j \geq 0.

Note: there are several slight variants of the SVD. One we will use in this post is the “compact SVD”. Here we only take the r=Rank(X) left and right singular vectors with non-zero singular values. So we get the decomposition

X = U_r \Sigma_r V_r^\top

where V_r is a p \times r matrix, \Sigma_r is an r \times r diagonal matrix with positive values along the diagonal, and U_r is an n \times r matrix.
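As an aside, numpy will hand you something very close to the compact SVD if you ask for the "thin" factorization. Here is a small sketch; the 1e-10 rank cutoff is just an arbitrary numerical threshold I chose:

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])

# full_matrices=False gives the "thin" SVD: U is n x min(n,p), Vt is min(n,p) x p.
U_thin, s, Vt_thin = np.linalg.svd(X, full_matrices=False)

# To get the compact SVD we additionally drop any (numerically) zero singular values.
r = np.sum(s > 1e-10)                      # numerical rank
U_r, S_r, V_r = U_thin[:, :r], np.diag(s[:r]), Vt_thin[:r].T

print(np.allclose(X, U_r @ S_r @ V_r.T))   # True: X = U_r Sigma_r V_r^T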

Understanding how our “basis building” intuition corresponds to the matrix factorization.

I think it is worth breaking down the expression

X = U \Sigma V^\top

to really understand what each part is doing. I will “reconstruct” this formula using the basis vectors we found in the proof.

Let β be any vector in Rp.

Since the vj form an orthonormal basis of Rp, it is easy to express β in this basis: we just have

\beta = (\vec{v}_1 \cdot \beta)\vec{v}_1 + (\vec{v}_2 \cdot \beta)\vec{v}_2 + \cdots + (\vec{v}_p \cdot \beta)\vec{v}_p

We can compute the vector of coefficients using V^\top:

\begin{bmatrix} \vec{v}_1 \cdot \beta \\ \vec{v}_2 \cdot \beta \\ \vdots \\ \vec{v}_p \cdot \beta \end{bmatrix} = \begin{bmatrix} \vec{v}_1^\top \beta \\ \vec{v}_2^\top \beta \\ \vdots \\ \vec{v}_p^\top \beta \end{bmatrix} = \begin{bmatrix} \vec{v}_1^\top \\ \vec{v}_2^\top \\ \vdots \\ \vec{v}_p^\top \end{bmatrix} \beta = V^\top\beta

If we apply X to β expressed in this basis we get

\begin{align*} X(\beta) &= X\left((\vec{v}_1 \cdot \beta)\vec{v}_1 + (\vec{v}_2 \cdot \beta)\vec{v}_2 + \cdots + (\vec{v}_p \cdot \beta)\vec{v}_p\right)\\ &= (\vec{v}_1 \cdot \beta)X(\vec{v}_1) + (\vec{v}_2 \cdot \beta)X(\vec{v}_2) + \cdots + (\vec{v}_p \cdot \beta)X(\vec{v}_p)\\ &= \sigma_1(\vec{v}_1 \cdot \beta)\vec{u}_1 + \sigma_2(\vec{v}_2 \cdot \beta)\vec{u}_2 + \cdots + \sigma_p(\vec{v}_p \cdot \beta)\vec{u}_p + 0\vec{u}_{p+1} + \cdots + 0\vec{u}_n \end{align*}

We already computed that the "vector of coefficients of \beta when expressed in the basis of right singular vectors" was V^\top\beta.

We can see that we need to multiply these coefficients by \sigma_j. The coefficients are now \Sigma V^\top\beta (where \Sigma is also taking care of "padding out" any extra 0 coefficients which may be needed).

We then want to use these coefficients to make a linear combination of the \vec{u}_j. We do that just by writing U\Sigma V^\top\beta.

So we can see that for every β we have

X\beta = U\Sigma V^\top\beta

which is why we have the equality

X = U\Sigma V^\top

In summary:

  • V^\top\beta records the coefficients of \beta when written in the basis of \vec{v}_j.
  • \Sigma V^\top\beta then scales these coefficients by the singular values, and pads the list with zeros at the end to match the dimension of the output space.
  • U\Sigma V^\top\beta uses the scaled coefficients to form a linear combination of the \vec{u}_j.
  • This is exactly what we want, since our understanding is that \vec{v}_j \mapsto \sigma_j\vec{u}_j!
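Here is a quick numpy sketch tracing these three steps on our example matrix; the particular choice of \beta is arbitrary:

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])
U, s, Vt = np.linalg.svd(X)

beta = np.array([2.0, -1.0])

coeffs = Vt @ beta                       # coefficients of beta in the basis of the v_j
Sigma = np.zeros((3, 2))
Sigma[:2, :2] = np.diag(s)
scaled = Sigma @ coeffs                  # scale by sigma_j and pad with a zero
result = U @ scaled                      # linear combination of the u_j

print(np.allclose(result, X @ beta))     # True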

Intuition for X^\top X and why the eigenstuff of X^\top X relates to the Singular Value Decomposition.

You may have noticed that in the lead-up to the small miracle lemma, we ended up with

X^\top X\beta_{\max} = \lambda\beta_{\max}

Another way to phrase this is that \beta_{\max} is an eigenvector for X^\top X with eigenvalue \lambda. Since we obtained our singular value decomposition by repeatedly applying this lemma, it turns out that all of the right singular vectors are eigenvectors of X^\top X. We can actually compute that pretty easily:

\begin{align*} X^\top X\vec{v}_j &= X^\top(\sigma_j\vec{u}_j)\\ &= (U\Sigma V^\top)^\top(\sigma_j\vec{u}_j)\\ &= V\Sigma^\top U^\top(\sigma_j\vec{u}_j)\\ &= \sigma_j V\Sigma^\top\vec{e}_j \textrm{ since $U$ is orthogonal, the only component which survives is the $j$th one}\\ &= \sigma_j^2 V\vec{e}_j\\ &= \sigma_j^2\vec{v}_j \end{align*}

Note: I am using ej to represent the vector in Rn with component 1 in the jth coordinate and 0 otherwise.
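Here is a quick numerical sanity check of this relationship, comparing the eigenstuff of X^\top X (via numpy's eigh) with the SVD of X:

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])
U, s, Vt = np.linalg.svd(X)

# eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
print(np.allclose(np.sort(eigvals)[::-1], s**2))   # the eigenvalues are sigma_j^2

# Each right singular vector is an eigenvector of X^T X
for j in range(2):
    v = Vt[j]
    print(np.allclose(X.T @ X @ v, s[j]**2 * v))   # True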

This computation doesn’t give us a lot of intuition about X^\top X or why its eigenstuff should relate to the SVD stuff of X though. What does X^\top X\beta actually mean? Where is the connection to the ellipsoid?

The key to understanding X^\top X is the following fact. For any two vectors \beta_1, \beta_2 \in \mathbb{R}^p, we have

\langle X^\top X\beta_1, \beta_2\rangle = \langle X\beta_1, X\beta_2\rangle

This is important enough to turn into a slogan:

To compute the inner product of the images of two vectors \beta_1 and \beta_2 under the transformation X, you can compute the inner product of \beta_2 with X^\top X\beta_1.

So X^\top X is a tool for understanding inner product stuff in \textrm{Im}(X) by representing it as inner product stuff happening in the domain.

Let’s make that concrete with an example.

Let’s choose our old friend

X = \begin{bmatrix} 1 & 1 \\ 1 & 3 \\ 1 & -1 \end{bmatrix}

This is transforming \mathbb{R}^2 into a plane in \mathbb{R}^3. It does so in a way which distorts both angles and distances. The angle and distance between two vectors before the transformation will not be the same as their angle and distance after. X^\top X is a tool which helps us understand how these angles and distances are distorted! We are thinking of X^\top X as a bilinear form \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} mapping (\beta_1, \beta_2) \mapsto \beta_2^\top X^\top X\beta_1 = \langle X\beta_1, X\beta_2\rangle.
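Here is a minimal numpy check of this slogan, using two arbitrary vectors \beta_1 and \beta_2:

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])

rng = np.random.default_rng(0)
b1, b2 = rng.normal(size=2), rng.normal(size=2)

lhs = np.dot(X @ b1, X @ b2)        # inner product of the images, computed in R^3
rhs = b2 @ (X.T @ X) @ b1           # the same quantity, computed entirely in R^2
print(np.isclose(lhs, rhs))         # True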

Please play with this geogebra app:

Here we can see two vectors β1 and β2 in the domain. I am looking at their image in the codomain (which is 3 dimensional), but looking “straight down” in the direction of u3. Essentially I just picked up Im(X) and plopped it down flat on the page so that u1 was in the positive x direction, and u2 was in the positive y direction.

We can see that both the angle between the inputs and the lengths of the inputs have changed.

Thinking of X^\top X\beta_1 as living in the same space as \beta_1 leads to some confusion. Even though these have the same dimensions, it might really be better to think of X^\top X\beta_1 as a covector than as a vector. It is a partially applied version of the bilinear form we discussed above. In other words, you should be thinking of X^\top X\beta_1 as a map \mathbb{R}^p \to \mathbb{R} defined by \beta_2 \mapsto \beta_2^\top X^\top X\beta_1 = \langle X\beta_1, X\beta_2\rangle. You might recognize that from this perspective, X^\top X\beta_1 is a curried form of X^\top X viewed as a bilinear form.

Nonetheless, we can view X^\top X\beta as a vector, and the nicest way to visualize it is probably by thinking of it as (a multiple of) the gradient of the map \beta \mapsto \vert X\beta\vert^2. We can then leverage our intuition about gradient vectors pointing in the direction of greatest increase, being perpendicular to level sets, and having magnitude representing the "steepness" of ascent.
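If you want to check the gradient claim numerically, a central finite-difference sketch like this should do it (the base point \beta is an arbitrary choice):

import numpy as np

X = np.array([[1, 1], [1, 3], [1, -1]])
f = lambda b: np.linalg.norm(X @ b) ** 2

beta = np.array([0.7, -0.2])
eps = 1e-6

# central finite-difference approximation of the gradient of f at beta
grad_fd = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(grad_fd, 2 * X.T @ X @ beta))   # True (up to finite-difference error)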

Play around with this geogebra app:

The standard Lagrange multiplier intuition shows us why the \beta which maximizes \vert X\beta\vert^2 must be an eigenvector for X^\top X.

I think that our understanding of X^\top X as a bilinear form is the most important takeaway from this section though. We will see how this is crucial to understanding how SVD gives us optimal low rank approximations of a matrix.

Using SVD to project onto a subspace

Remember that we defined r=Rank(X) and Ur to be the first r columns of U.

The columns of Ur form an orthonormal basis of the image of X. So it is easy to use it to project a vector y onto the image of X:

\hat{y} = \textrm{Proj}_{\textrm{Im}(X)}(y) = \sum_{j=1}^r (y \cdot \vec{u}_j)\vec{u}_j

This can be rewritten as a matrix product as

\hat{y} = U_r U_r^\top y

To find the value of \beta with \hat{y} = X\beta, we can use

\begin{align*} X\beta &= U_r U_r^\top y\\ U_r\Sigma_r V_r^\top\beta &= U_r U_r^\top y\\ \Sigma_r V_r^\top\beta &= U_r^\top y\\ V_r^\top\beta &= \Sigma_r^{-1}U_r^\top y\\ \beta &= V_r\Sigma_r^{-1}U_r^\top y \end{align*}

This should make intuitive sense as well:

  • When we orthogonally project y onto the image of X, U_r^\top y gives the coefficients of the \vec{u}_j spanning the image. Say
\hat{y} = c_1\vec{u}_1 + c_2\vec{u}_2 + \cdots + c_r\vec{u}_r

Then

U_r^\top y = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_r \end{bmatrix}
  • So \Sigma_r^{-1}U_r^\top y is scaling each of these coefficients by \frac{1}{\sigma_j}:
\Sigma_r^{-1}U_r^\top y = \begin{bmatrix} \frac{c_1}{\sigma_1} \\ \frac{c_2}{\sigma_2} \\ \vdots \\ \frac{c_r}{\sigma_r} \end{bmatrix}
  • Finally \beta = V_r\Sigma_r^{-1}U_r^\top y takes the linear combination of the right singular vectors \vec{v}_j using these coefficients.

\beta = \frac{c_1}{\sigma_1}\vec{v}_1 + \frac{c_2}{\sigma_2}\vec{v}_2 + \cdots + \frac{c_r}{\sigma_r}\vec{v}_r

  • Applying X to \beta will return \hat{y} since
\begin{align*} X(\beta) &= X\left(\frac{c_1}{\sigma_1}\vec{v}_1 + \frac{c_2}{\sigma_2}\vec{v}_2 + \cdots + \frac{c_r}{\sigma_r}\vec{v}_r\right)\\ &= \frac{c_1}{\sigma_1}\sigma_1\vec{u}_1 + \frac{c_2}{\sigma_2}\sigma_2\vec{u}_2 + \cdots + \frac{c_r}{\sigma_r}\sigma_r\vec{u}_r\\ &= c_1\vec{u}_1 + c_2\vec{u}_2 + \cdots + c_r\vec{u}_r\\ &= \hat{y} \end{align*}

We can use this for linear regression!

Let’s check using some synthetic data:

import numpy as np
import pandas as pd

ones = np.ones((100,1))
x1 = np.random.random((100,1))
x2 = np.random.random((100,1))
X = np.stack([ones, x1, x2], axis = 1).reshape((100,3))

y = 2*ones - x1 + 3*x2 + np.random.normal(0,0.1,(100,1)) 

from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept = False).fit(X, y) # X has a column of ones, so I set fit_intercept = False.
y_hat_sklearn = reg.predict(X)

svd_stuff = np.linalg.svd(X)
    # Note np.linalg.svd(X) is a tuple (U, list of singular values, V^\top)
U = svd_stuff[0]
U_r = U[:,:3] # The image is 3 dimensional, so the first 3 left singular vectors span the image.
y_hat_svd = np.dot(np.dot(U_r, np.transpose(U_r)), y) # projecting onto the image using the first 3 left singular vectors

print((y_hat_sklearn - y_hat_svd).max()) # This should be close to 0!

4.440892098500626e-16


S_r = np.diag(svd_stuff[1])
S_r_inv = np.linalg.inv(S_r) # Note:  this is easy to calculate.  Just 1 divided by each singular value.
V_r = np.transpose(svd_stuff[2])[:,:3]
beta_svd = np.dot(np.dot(V_r, np.dot(S_r_inv, np.transpose(U_r))), y)
print('These should agree: \n', beta_svd, '\n\n',  reg.coef_)

These should agree: 
 [[ 2.01228328]
 [-1.01411842]
 [ 2.98989642]] 

 [[ 2.01228328 -1.01411842  2.98989642]]

Using SVD to find low rank approximations of a matrix

Note: I learned this story about variational characterizations originally from Qiaochu Yuan’s blog, which is also an excellent take on all of this SVD content. They are more terse and less geometric in their exposition.

To understand how SVD gives low rank approximations, we should first understand the following two variational characterizations of the singular vectors and values:

Theorem: (First Variational Characterization of singular values)

Let X be an n×p matrix. Let j{1,2,3,...,p}.

We can characterize σj as follows:

\sigma_j = \max_{\substack{ V \subset \mathbb{R}^p \\ \textrm{dim}(V) = j}} \min_{\substack{ \vec{v} \in V \\ \vert\vec{v}\vert = 1}} \vert X \vec{v}\vert

In other words, to find \sigma_j we look at all subspaces of dimension j of the domain, and minimize \vert X \vec{v}\vert over the unit sphere in that subspace. Different subspaces give us different minimum values. We choose the subspace V which maximizes this minimum value. In that case, the minimum value is \sigma_j.

If you have been following along, this should not be a surprising characterization!

Here is the proof.

We already know that for V_{\textrm{svd}} = \textrm{Span}(\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_j) we have

\min_{\substack{ \vec{v} \in V_{\textrm{svd}} \\ \vert\vec{v}\vert = 1}} \vert X \vec{v}\vert = \sigma_j

This is true just by construction of the right singular vectors.

We need to show that any other choice of V gives us a result which is at most \sigma_j.

Let V' be another subspace of dimension j.

Then \textrm{Span}(\vec{v}_j, \vec{v}_{j+1}, \ldots, \vec{v}_{p}) is a p-j+1 dimensional subspace, so (since j + (p-j+1) > p) the intersection of V' with this subspace must be at least 1 dimensional. Take a unit vector \vec{v} \in V' \cap \textrm{Span}(\vec{v}_j, \vec{v}_{j+1}, \ldots, \vec{v}_{p}). Let

\vec{v} = \sum_{i=j}^{i=p} c_i \vec{v}_i \textrm{ with } \sum_{i=j}^{i=p} \vert c_i\vert^2 = 1

Then

\begin{align*} \vert X \vec{v}\vert^2 &= \left\vert \sum_{j}^p X(c_i \vec{v}_i)\right\vert^2\\ &= \left\vert \sum_{j}^p c_i \sigma_i \vec{u}_i \right\vert^2\\ &= \sum_{j}^p (\vert c_i\vert^2 \vert\sigma_i\vert^2) \textrm{ Pythagorean Theorem: $\vec{u}_i$ are orthogonal!}\\ &\leq \sum_{j}^p (\vert c_i\vert^2 \vert\sigma_j\vert^2) \textrm{ since $\sigma_j \geq \sigma_{j+1} \geq ...$}\\ &= \vert\sigma_j\vert^2 \sum_{j}^p (\vert c_i\vert^2 )\\ &= \vert\sigma_j\vert^2 \end{align*}

as desired!
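Here is a quick numerical illustration of this characterization on a random matrix (the dimensions and the choice j = 2 are arbitrary): the span of the first j right singular vectors achieves \sigma_j, while a random j-dimensional subspace does no better.

import numpy as np

rng = np.random.default_rng(0)
n, p, j = 6, 4, 2
X = rng.normal(size=(n, p))
U, s, Vt = np.linalg.svd(X)

def min_on_subspace(Q):
    # Q has orthonormal columns spanning V; for v = Qc with |c| = 1 we have
    # |Xv| = |XQc|, so the minimum is the smallest singular value of XQ.
    return np.linalg.svd(X @ Q, compute_uv=False)[-1]

# The span of the first j right singular vectors achieves sigma_j ...
print(np.isclose(min_on_subspace(Vt[:j].T), s[j - 1]))    # True

# ... while a random j-dimensional subspace does no better.
Q_rand = np.linalg.qr(rng.normal(size=(p, j)))[0]
print(min_on_subspace(Q_rand) <= s[j - 1] + 1e-12)        # True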

Theorem: (Second Variational Characterization of singular values)

Let X be an n \times p matrix. Let j \in \{1,2,3,...,p\}.

We can characterize \sigma_{j+1} as follows:

\sigma_{j+1} = \min_{\substack{ V \subset \mathbb{R}^p \\ \textrm{dim}(V) = p-j}} \max_{\substack{ \vec{v} \in V \\ \vert\vec{v}\vert = 1}} \vert X \vec{v}\vert

Here is the (very similar!) proof:

We already know that for V_{\textrm{svd}} = \textrm{Span}(\vec{v}_{j+1}, \vec{v}_{j+2}, \vec{v_{j+3}}, ... \vec{v_p}) we have

\max_{\substack{ \vec{v} \in V_{\textrm{svd}} \\ \vert\vec{v}\vert = 1}} \vert X \vec{v}\vert = \sigma_{j+1}

This is true just by construction of the right singular vectors.

We need to show that any other choice of V gives us a result which is at least \sigma_{j+1}.

Let V' be another subspace of dimension p-j.

Then \textrm{Span}(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_j, \vec{v}_{j+1}) is a j+1 dimensional subspace, so (since (p-j) + (j+1) > p) the intersection of V' with this subspace must be at least 1 dimensional. Take a unit vector \vec{v} \in V' \cap \textrm{Span}(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{j+1}). Let

\vec{v} = \sum_{i=1}^{i=j+1} c_i \vec{v}_i \textrm{ with } \sum_{i=1}^{i=j+1} \vert c_i\vert^2 = 1

Then

\begin{align*} \vert X \vec{v}\vert^2 &= \left\vert \sum_{1}^{j+1} X(c_i \vec{v}_i)\right\vert^2\\ &= \left\vert \sum_{1}^{j+1} c_i \sigma_i \vec{u}_i \right\vert^2\\ &= \sum_{1}^{j+1} (\vert c_i\vert^2 \vert\sigma_i\vert^2) \textrm{ Pythagorean Theorem: $\vec{u}_i$ are orthogonal!}\\ &\geq \sum_{1}^{j+1} (\vert c_i\vert^2 \vert\sigma_{j+1}\vert^2) \textrm{ since $\sigma_1 \geq \sigma_{2} \geq ... \geq \sigma_{j+1}$}\\ &= \vert\sigma_{j+1}\vert^2 \sum_{1}^{j+1} (\vert c_i\vert^2 )\\ &= \vert\sigma_{j+1}\vert^2 \end{align*}

as desired!

We are now ready for:

Theorem (Low rank approximation)

Let X be an n \times p matrix. Let X = U \Sigma V^\top be an SVD decomposition. Let \Sigma_j be the n \times p matrix with diagonal values

\Sigma_{k,k} = \begin{cases}\sigma_k \textrm{ if $k \leq j$} \\ 0 \textrm{ else}\end{cases}

Let X_j = U \Sigma_j V^\top. Clearly \textrm{Rank}(X_j) \leq j (with equality if \sigma_j \neq 0).

Our claim is that X_j is a matrix of rank at most j which is closest to X in operator norm, and that this minimal distance is \sigma_{j+1}. The operator norm of a matrix M is the largest singular value of M (aka the maximum value of \vert M\vec{v}\vert on the unit sphere).

Proof:

Let M_j be any other matrix of rank at most j. Then \textrm{Null}(M_j) has dimension at least p-j. By the second variational characterization, there must be a unit vector \vec{v} \in \textrm{Null}(M_j) with \vert X \vec{v}\vert \geq \sigma_{j+1}. But \vert X \vec{v}\vert = \vert X\vec{v} - M_j\vec{v}\vert since M_j\vec{v} = 0. Hence the operator norm of X - M_j is at least \sigma_{j+1}.

On the other hand X - X_j = U (\Sigma - \Sigma_j) V^\top. This is an SVD decomposition of X - X_j (up to reordering of the columns to put the singular values in descending order). So we can see that the largest singular value of this matrix is \sigma_{j+1}, which is its operator norm.
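Here is a quick numpy check of the theorem on a random matrix (the dimensions and the choice j = 2 are arbitrary): the operator norm of X - X_j comes out to exactly \sigma_{j+1}.

import numpy as np

rng = np.random.default_rng(1)
n, p, j = 6, 4, 2
X = rng.normal(size=(n, p))
U, s, Vt = np.linalg.svd(X)

# Best rank-j approximation: keep only the first j singular values
X_j = U[:, :j] @ np.diag(s[:j]) @ Vt[:j]

# Operator norm of the error = largest singular value of X - X_j
err_norm = np.linalg.norm(X - X_j, ord=2)
print(np.isclose(err_norm, s[j]))   # True: the error has operator norm sigma_{j+1}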

Principal angles between subspaces

You are probably familiar with the fact that the angle between two vectors \vec{u} and \vec{v} can be defined as

\theta(u,v) = \angle(u, v) = \arccos\left( \frac{\vert \langle \vec{u}, \vec{v}\rangle \vert}{\vert\vec{u} \vert \vert \vec{v}\vert}\right)

You can probably also work out ad-hoc definitions for the angle between a 1-dimensional subspace and a 2-dimensional subspace, or the angle between two 2-dimensional subspaces.

In higher dimensions things get a bit tougher. If you have an i-dimensional and a j-dimensional subspace of \mathbb{R}^d, a single “angle” is no longer a suitable description of the relationship between these subspaces.

In this section we motivate the definition of the principal angles between two subspaces, and prove that these angles can be computed using the SVD of a certain matrix. This will give a beautiful geometric interpretation for the subspace similarity measurement introduced in the LoRA paper (which we will cover in the next, final, section).

We quote Wikipedia’s variational definition (adjusting the notation to fit our use case) since it is the closest in spirit to the kind of work we have been doing, and (imho) is also the most geometrically motivated definition:

Given two subspaces \mathcal{U},\mathcal{W} of \mathbb{R}^d with \dim(\mathcal{U})=i \leq \dim(\mathcal{W}):=j, there exists a sequence of i angles 0 \le \theta_1 \le \theta_2 \le \cdots \le \theta_i \le \pi/2 called the principal angles, the first one defined as:

\theta_1:=\min \left\{ \arccos \left( \frac{ \vert \langle u,w\rangle\vert}{\vert u \vert \vert w \vert }\right) \,: \, u\in \mathcal{U}, w\in \mathcal{W} \right\}=\angle(u_1,w_1),

The vectors u_1 and w_1 are the corresponding ‘‘principal vectors.’’

The other principal angles and vectors are then defined recursively via:

\theta_k:=\min \left\{ \arccos \left( \frac{ \vert \langle u,w\rangle\vert }{\vert u \vert \vert w \vert }\right) \,: \, u\in \mathcal{U},~w\in \mathcal{W},~u\perp u_m,~w \perp w_m \quad \forall m\in \{1,\ldots,k-1\} \right\}.

This means that the principal angles (\theta_1,\ldots, \theta_i) form a set of minimized angles between the two subspaces, and the principal vectors in each subspace are orthogonal to each other.

Let’s see how this is connected with the SVD.

Let U be a d \times i matrix whose columns are orthonormal and span \mathcal{U}. Similarly, let W be a d \times j matrix whose columns are orthonormal and span \mathcal{W}.

Consider the matrix P = W W^\top U. Let \vec{\beta} be a unit vector in \mathbb{R}^i. Then U\vec{\beta} is a unit vector in \mathcal{U} since

\begin{align*} \vert U\vec{\beta}\vert^2 &= \langle U\vec{\beta}, U \vec{\beta}\rangle \\ &= \langle \vec{\beta}, U^\top U \vec{\beta}\rangle \\ &= \langle \vec{\beta}, \mathbb{I}_i \vec{\beta}\rangle \\ &= \langle \vec{\beta}, \vec{\beta}\rangle \\ &= \vert \vec{\beta}\vert^2\\ &= 1 \end{align*}

P \vec{\beta} computes the orthogonal projection of the unit vector \vec{u} = U \vec{\beta} onto the subspace \mathcal{W}. Why? Let the columns of W be called \vec{w}_k. Then

\begin{align*} P\vec{\beta} &= W W^\top U \vec{\beta}\\ &= W W^\top \vec{u}\\ &= \sum_{k=1}^j \vec{w}_k (\vec{w}_k^\top \vec{u})\\ &= \sum_{k=1}^j \langle \vec{u}, \vec{w}_k \rangle \vec{w}_k\\ \end{align*}

This is the sum of the projections of \vec{u} into the direction of each of the orthonormal \vec{w}_k.
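Here is a small numpy sketch of this projection computation. I build the orthonormal bases from random matrices via QR; the dimensions are arbitrary:

import numpy as np

rng = np.random.default_rng(2)
d, i, j = 5, 2, 3
U = np.linalg.qr(rng.normal(size=(d, i)))[0]   # orthonormal basis of a 2-dim subspace
W = np.linalg.qr(rng.normal(size=(d, j)))[0]   # orthonormal basis of a 3-dim subspace

P = W @ W.T @ U

beta = rng.normal(size=i)
beta /= np.linalg.norm(beta)                   # unit vector in R^i

u = U @ beta                                   # the corresponding unit vector in the subspace
proj = sum((u @ W[:, k]) * W[:, k] for k in range(j))   # sum of projections onto the w_k

print(np.isclose(np.linalg.norm(u), 1.0))      # U beta is a unit vector
print(np.allclose(P @ beta, proj))             # P beta is the projection of u onto the subspace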

We will now see the connection between the singular value decomposition of this projection matrix and the principal angles.

Since we have already used U, I will write the singular value decomposition as:

P = T \Sigma V^\top

A few observations:

  • We saw that U is length preserving, and since W W^\top is an orthogonal projection it satisfies \vert W W^\top \vec{u}\vert \leq \vert \vec{u}\vert (this is a consequence of the Pythagorean theorem!). So P also satisfies \vert P\vec{\beta} \vert \leq \vert \vec{\beta}\vert.
  • As a consequence of this, the singular values of P are all in the interval [0,1].

Let’s look again at the definition of the first principal angle:

\theta_1:=\min \left\{ \arccos \left( \frac{ \vert \langle u,w\rangle\vert}{\vert u \vert \vert w \vert }\right) \,: \, u\in \mathcal{U}, w\in \mathcal{W} \right\}

Since the \arccos function is decreasing, we can rewrite this as

\cos(\theta_1):=\max \left\{ \frac{ \vert \langle u,w\rangle\vert}{\vert u \vert \vert w \vert } \,: \, u\in \mathcal{U}, w\in \mathcal{W} \right\}

By bilinearity we can scale \vec{u} and \vec{w} by any non-zero constant without changing \frac{ \vert \langle u,w\rangle\vert}{\vert u \vert \vert w \vert }. So we can instead maximize over the vectors of unit length:

\cos(\theta_1):=\max_{\substack{ \vert \vec{u}\vert = 1 \\ \vert\vec{w}\vert = 1 \\ }} \left\{ \vert \langle u,w \rangle \vert : \, u \in \mathcal{U}, w\in \mathcal{W} \right\}

There is a unique unit vector \beta \in \mathbb{R}^i with U\beta = \vec{u}. Since w \in \mathcal{W}, we also have w = W W^\top w. Thus \langle u,w \rangle = \langle U \beta, W W^\top w \rangle = \langle W W^\top U \beta, w \rangle = \langle P \beta, w \rangle.

Since w and P \beta are both in \mathcal{W}, their dot product will be maximized when they point in the same direction (except in the special case P \beta = 0). So the maximum value of \langle P \beta, w \rangle over unit vectors w is \vert P \beta \vert, and we are left to maximize this over the unit sphere in \mathbb{R}^i:

\cos(\theta_1):=\max_{\vert \vec{\beta}\vert = 1} \left\{ \vert P \beta \vert : \, \beta \in \mathbb{R}^i \right\}

This was our definition of the first singular value of P, so we have:

Lemma: \cos(\theta_1) is the first singular value of the projection matrix P = WW^\top U.

An identical analysis (just follow our inductive construction of the SVD process) gives us the theorem:

Theorem: \cos(\theta_k) = \sigma_k where \sigma_k are the singular values of the projection matrix P = WW^\top U.

Whew!

Let’s try to visualize this to get a bit more geometric intuition.

Consider the two planes \mathcal{U} and \mathcal{W} spanned by the columns of the matrices

U = \begin{bmatrix} 1 & 1\\ 2 & 5\\ -1 & 6 \end{bmatrix} \hphantom{dsds} W = \begin{bmatrix} 1 & 3 \\ -1 & 2\\ 2 & 0 \end{bmatrix}

Interactive version .

We visualize the unit circle in \mathcal{U} and its orthogonal projection onto \mathcal{W}, which is an ellipse:

Interactive version.

Since these planes intersect, the longest orthogonal projection of a unit vector from \mathcal{U} onto \mathcal{W} will have length 1: these vectors actually coincide! So the first left singular vector of W W^\top U would be a unit vector which lies in this intersection, the first singular value would be 1, and the first principal angle would be 0.

The second left singular vector would be a unit vector pointed in the direction of the minor axis of the ellipse in \mathcal{W}. The second singular value is the distance from the origin to that vertex. The second principal angle would be the inverse cosine of that distance. Make sure you understand why this is “exactly what you want” geometrically: \sigma_2 \vec{u}_2 is the base of a right triangle whose hypotenuse is a vector on the unit circle in \mathcal{U}!
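Here is a sketch of this computation in numpy, using the two planes above and cross-checking against scipy.linalg.subspace_angles (I orthonormalize the given spanning matrices with QR first, and sort both lists just to line up the orderings):

import numpy as np
from scipy.linalg import subspace_angles

U = np.array([[1, 1], [2, 5], [-1, 6]], dtype=float)
W = np.array([[1, 3], [-1, 2], [2, 0]], dtype=float)

# Orthonormal bases for the two planes
U_o = np.linalg.qr(U)[0]
W_o = np.linalg.qr(W)[0]

# Singular values of the projection W_o W_o^T U_o (equivalently, of W_o^T U_o)
sigmas = np.linalg.svd(W_o.T @ U_o, compute_uv=False)
print(sigmas)    # the first value is 1 (up to rounding): the planes intersect in a line

# Cross-check against scipy's principal angles
angles = np.sort(subspace_angles(U_o, W_o))
print(np.allclose(np.sort(np.cos(angles))[::-1], sigmas))   # True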

The subspace similarity metric in the LoRA paper.

We are now finally ready to understand the similarity metric introduced in the LoRA paper. I quote:

In this paper we use the measure \phi(A,B,i,j) = \psi (U_A^i, U_B^j) = \frac{\vert(U_A^i)^\top U_B\vert^2_F}{\min \{i,j\}} to measure the subspace similarity between two column orthogonal matrices U_A^i \in \mathbb{R}^{d \times i} and U_B^j \in \mathbb{R}^{d \times j}, obtained by taking columns of the left singular matrices of A and B.

Note that there is a typo in the paper. I am quite sure that they meant to write \frac{\vert(U_A^i)^\top U_B^{j}\vert^2_F}{\min \{i,j\}} [note the “j” in the “exponent” of U_B^j].

Let’s break this down!

Let C be a d \times p matrix with SVD C = U_C \Sigma_C V_C^\top.

In their notation, U_C^k would be the matrix obtained by only retaining the first k columns of U_C. Let C_k be the best rank at most k approximation of C with respect to the operator norm. By the results in the last section, we know that the column space of U_C^k is the image of C_k!

So comparing the column spaces of U_A^i and U_B^j is a very reasonable thing to do. We wouldn’t want to use all of U_A and U_B because the column space of both of those is all of \mathbb{R}^d! We could use U_A^{\textrm{Rank}(A)} (whose column space is \textrm{Im}(A)) and U_B^{\textrm{Rank}(B)} (whose column space is \textrm{Im}(B)), but the reality is that (as we have seen) it is the left singular vectors with the largest singular values which carry “most of the information” about a matrix. The small singular values correspond to noise. This is the intuitive content of our low rank approximation result. So it makes sense that instead of taking all of the non-zero singular values, we might only want to retain the top “so many” of them. This is what U_A^i and U_B^j are doing: they are retaining the i (respectively j) most important left singular vectors.

Now (U_A^i)^\top U_B^{j} looks an awful lot like the projection matrix U_A^i (U_A^i)^\top U_B^{j} we met in the last section on principal angles! In fact, these two matrices will have the same singular values (since, as we have computed, \vec{v} \mapsto M \vec{v} is a distance preserving map when the columns of M are orthonormal, as is the case for U_A^i).

They are using the Frobenius norm of this matrix. We really just need to know the definition:

\vert A \vert_\text{F} = \sqrt{\sum_{i}^m\sum_{j}^n \vert a_{ij}\vert ^2} = \sqrt{\operatorname{trace}\left(A^* A\right)} = \sqrt{\sum_{i=1}^{\min\{m, n\}} \sigma_i^2(A)}

A couple nice things about this norm:

  • It is much easier to compute than the operator norm. We don’t actually need to find the singular values of A, as the first two formulas show.
  • Even so, it carries information about all of the singular values of A.

Note: it is a nice exercise to prove the equivalence of these three ways of computing the norm. If you have been working through everything so far it shouldn’t be out of your reach to prove this.
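If you would rather see it numerically first, here is a quick check that the three formulas agree on a random matrix:

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 3))

entrywise = np.sqrt(np.sum(np.abs(A) ** 2))
via_trace = np.sqrt(np.trace(A.T @ A))
via_svd = np.sqrt(np.sum(np.linalg.svd(A, compute_uv=False) ** 2))

print(np.allclose([entrywise, via_trace], via_svd))      # all three agree
print(np.isclose(entrywise, np.linalg.norm(A, 'fro')))   # and match numpy's built-in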

Given the results of the last section, we can rewrite the similarity measure using the more “geometric” language of principal angles:

\phi(A,B,i,j) = \frac{1}{\min\{i, j\}}\sum_{k=1}^{\min\{i, j\}} \cos^2(\theta_k)

where \theta_k are the principal angles between the column spaces of U_A^i and U_B^j.

In words, we are taking the “mean of the squared cosines of the principal angles”, where the number of angles we average over is the smaller of the two dimensions of the subspaces.
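Here is a small numpy sketch computing \phi both ways on random matrices (the dimensions d, i, j and the random A, B are arbitrary choices, just for illustration):

import numpy as np

rng = np.random.default_rng(4)
d, i, j = 10, 3, 5
A = rng.normal(size=(d, 8))
B = rng.normal(size=(d, 8))

U_A_i = np.linalg.svd(A)[0][:, :i]     # top i left singular vectors of A
U_B_j = np.linalg.svd(B)[0][:, :j]     # top j left singular vectors of B

M = U_A_i.T @ U_B_j

# phi via the Frobenius norm, as in the paper
phi_frobenius = np.linalg.norm(M, 'fro') ** 2 / min(i, j)

# phi via the principal angles: the singular values of M are cos(theta_k)
cosines = np.linalg.svd(M, compute_uv=False)
phi_angles = np.sum(cosines ** 2) / min(i, j)

print(np.isclose(phi_frobenius, phi_angles))   # True
print(0.0 <= phi_frobenius <= 1.0)             # bounded between 0 and 1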

Armed with this intuitive perspective several things become obvious:

  • The maximum value of \phi(A,B,i,j) is 1, and this is only achieved when the smaller subspace is contained in the larger one.
  • The minimum value of \phi(A,B,i,j) is 0, and this is only achieved when the smaller subspace is in the orthogonal complement of the larger one.

I hope that this similarity measure makes a lot more sense to you now!
