The future of the Lake Arrowhead matrix
Cleve Moler just wrote a fantastic blog post about the Lake Arrowhead graph. The Lake Arrowhead graph, or the Golub Graph as I like call it, was a coauthor graph centered around Gene collected at the 1993 Householder meeting in Lake Arrowhead, California. We'll study this network using a few one-liner Matlab link-prediction techniques. These are all matrix equations that Gene would have enjoyed, so this is a fitting tribute. We'll see which predicts the most correct links in the top 25 (with at most 5 predictions for any author ...)
There are few issues with the Matlab code that would ruffle the feathers of any numerical analysis. Namely, I've used the inv function to greatly simplify the codes. Hopefully we can agree it's okay for this simple example.
I've decided to crowdsource the actual ranking! Please vote for which list you think looks "most right"!
Contents
- Download the network
- Load the network and look at it.
- Remove self-loops
- Katz scores
- Katz scores with a tight and nearly diagonally dominant, matrix
- Katz scores with a matrix that's very DD
- PageRank with alpha = 0.85
- PageRank with alpha = 0.5
- PageRank with alpha = 0.99
- The matrix exponential
- The matrix exponential with the normalized Laplacian
- Commute time
- Appendix
Download the network
Visit Gilbert's original website and download the tar.Z file, unzip it, proceed into the householder directory, and turn the "drawit.m" file into a function.
Load the network and look at it.
The drawit function is included in the tar file with the graph. We do need to edit it, so convert it into a function, otherwise you won't get nice figures below
load housegraph; n = size(A,1); drawit axis off;
![](linkpredict_01.png)
Remove self-loops
When we are predicting the future, allowing self-loops hardly seems fair! Although, in truth, they have a fairly predictable affect on each of the methods we study. It's just easier to remove them.
B = A - diag(diag(A));
Katz scores
Leo Katz wrote a fantastic paper in 1953 on a new way to determine a "status index" that measures how important someone is in a social network. More recently, the Katz score between a pair of nodes was shown to predict links that would form in the future. See this paper by Jon Kleinberg (http://www.cs.cornell.edu/home/kleinber/link-pred.pdf) and some recent work by Inderjit Dhillon (http://www.cs.utexas.edu/users/inderjit/Talks/2010-IPAM-Inderjit.pdf)
The Katz score between node i and j is the number of walks between the nodes, where the weight of a walk of length k is alpha^k. These scores are usually defined via the Neumann series:
K = alpha*A + alpha^2*A^2 + alpha^3*A^3 + ...
which expresses the score for all pairs of nodes simultaneously. The relationship with the Neumann series means that we can evaluate this infinite series in Matlab using a single matrix inverse:
K = inv(eye(n) - alpha*A)
Picking alpha is always the key question with this method. While the inverse exists as long as alpha isn't one of a very few special values related to the eigenvalues of A, most of these choices are rotten. Choosing alpha = 1/largest degree is common as it'll make the Neumann series converge. We evaluate this value as well as a smaller value of alpha that should favor short connections instead of longer connections.
That's a long-winded explaination for the next three lines of Matlab code to actually compute them!
Katz scores with a tight and nearly diagonally dominant, matrix
fprintf('\nKatz scores with DD matrix\n');
alpha = 1./max(sum(B,2));
showpreds(inv(eye(n) - alpha*B),A,25,name);
Katz scores with DD matrix Golub - Paige with value 0.005967 Golub - Demmel with value 0.005407 Golub - Schreiber with value 0.005146 Golub - Hammarling with value 0.004965 Schreiber - Heath with value 0.004887 VanLoan - VanDooren with value 0.004752 TChan - Demmel with value 0.004682 VanLoan - Heath with value 0.003874 Eisenstat - Plemmons with value 0.003845 Golub - Liu with value 0.003818 VanDooren - Heath with value 0.003796 VanDooren - Plemmons with value 0.003785 Golub - Funderlic with value 0.003717 Golub - Harrod with value 0.003699 Golub - Boley with value 0.003698 Moler - Demmel with value 0.003689 Golub - NTrefethen with value 0.003660 George - Gilbert with value 0.003645 Heath - Meyer with value 0.003609 Eisenstat - Ng with value 0.003602 Bjorck - VanDooren with value 0.003589 Kagstrom - Luk with value 0.003563 Gilbert - Eisenstat with value 0.003547 Park - VanDooren with value 0.003488 Schreiber - Bunch with value 0.003483
![](linkpredict_02.png)
Katz scores with a matrix that's very DD
fprintf('\nKatz scores with very DD matrix\n'); alpha = 1./max(sum(B,2)+100); % smaller alpha bound showpreds(inv(eye(n) - alpha*B),A,25,name);
Katz scores with very DD matrix Golub - Paige with value 0.000298 Golub - Demmel with value 0.000245 Golub - Schreiber with value 0.000242 Golub - Hammarling with value 0.000240 Schreiber - Heath with value 0.000240 VanLoan - VanDooren with value 0.000239 TChan - Demmel with value 0.000238 VanLoan - Heath with value 0.000182 Eisenstat - Plemmons with value 0.000182 Golub - Liu with value 0.000181 VanDooren - Heath with value 0.000181 VanDooren - Plemmons with value 0.000181 Moler - Demmel with value 0.000181 George - Gilbert with value 0.000180 Golub - Funderlic with value 0.000180 Golub - Harrod with value 0.000180 Golub - Boley with value 0.000180 Eisenstat - Ng with value 0.000180 Golub - NTrefethen with value 0.000180 Heath - Meyer with value 0.000179 Bjorck - VanDooren with value 0.000179 Kagstrom - Luk with value 0.000179 Gilbert - Eisenstat with value 0.000179 Park - VanDooren with value 0.000179 Schreiber - Bunch with value 0.000178
![](linkpredict_03.png)
PageRank with alpha = 0.85
PageRank is typically used for finding the most important links on a web-page. However, it can just as easy be used to determine which links are most likely to form in the future! We compute the PageRank value for each node, where it "teleports" back to itself as a link-prediction measure.
Alpha = 0.85 is the standard value of alpha in PageRank. Let's see what predictions we get with this method. The output is a non-symmetric matrix and so we'll symmetrize it by taking the best prediction for each user.
fprintf('\nPageRank with alpha = 0.85\n');
alpha = 0.85;
symmax = @(X) max(X,X');
showpreds(symmax(inv(eye(n) - alpha*B*diag(1./sum(B,2)))),A,25,name)
PageRank with alpha = 0.85 Crevelli - Ipsen with value 0.874633 Kincaid - Varga with value 0.866249 Wold - Kagstrom with value 0.699082 Golub - Modersitzki with value 0.661777 Golub - Davis with value 0.579365 Gutknecht - Starke with value 0.564489 Widlund - Boman with value 0.526378 He - Gragg with value 0.520894 Golub - Smith with value 0.512122 Varga - Hochbruck with value 0.508966 Schreiber - Boman with value 0.490197 Davis - Demmel with value 0.483887 TChan - MuntheKaas with value 0.478409 MuntheKaas - Demmel with value 0.451077 Golub - Ashby with value 0.447502 Modersitzki - Reichel with value 0.446283 Golub - Szyld with value 0.434651 Golub - Harrod with value 0.431189 Golub - Marek with value 0.420211 Marek - Widlund with value 0.412368 Golub - Benzi with value 0.406861 Golub - Boley with value 0.405866 Golub - Starke with value 0.403452 Varga - Szyld with value 0.401111 Golub - Hochbruck with value 0.395218
![](linkpredict_04.png)
PageRank with alpha = 0.5
fprintf('\nPageRank with alpha = 0.5\n');
alpha = 0.5;
symmax = @(X) max(X,X');
showpreds(symmax(inv(eye(n) - alpha*B*diag(1./sum(B,2)))),A,25,name)
PageRank with alpha = 0.5 Crevelli - Ipsen with value 0.159729 Kincaid - Varga with value 0.157497 Wold - Kagstrom with value 0.149887 Golub - Modersitzki with value 0.110927 Gutknecht - Starke with value 0.108330 Widlund - Boman with value 0.103562 Golub - Davis with value 0.101269 Schreiber - Boman with value 0.099594 Modersitzki - Reichel with value 0.098967 Davis - Demmel with value 0.098850 Varga - Hochbruck with value 0.098413 He - Gragg with value 0.094472 TChan - MuntheKaas with value 0.084800 MuntheKaas - Demmel with value 0.082447 Marek - Widlund with value 0.078260 MuntheKaas - Duff with value 0.076992 Varga - Szyld with value 0.076467 Golub - Smith with value 0.072827 Pothen - Demmel with value 0.066630 Cullum - Demmel with value 0.065399 Strakos - Demmel with value 0.065399 Kenney - Paige with value 0.063973 Chandrasekaran - Jessup with value 0.063891 Cullum - Hammarling with value 0.062903 Strakos - Hammarling with value 0.062903
![](linkpredict_05.png)
PageRank with alpha = 0.99
For a while, people were convinced that solving PageRank with alpha ~ 1 was the right thing to do as it kept the results as close to the graph as possible. For web-graphs, this is wrong as Vigna and his collaborators showed. However, for graphs that are strongly connected, it's still interesting. Let's see what we get!
fprintf('\nPageRank with alpha = 0.99\n');
alpha = 0.99;
symmax = @(X) max(X,X');
showpreds(symmax(inv(eye(n) - alpha*B*diag(1./sum(B,2)))),A,25,name)
PageRank with alpha = 0.99 Golub - Modersitzki with value 7.740538 Golub - Davis with value 7.551120 Golub - Smith with value 7.516129 Golub - Szyld with value 7.425185 Golub - Ashby with value 7.411030 Golub - Marek with value 7.408324 Golub - Starke with value 7.373870 Golub - Hochbruck with value 7.355581 Golub - Harrod with value 7.329255 Golub - Young with value 7.320046 Golub - Boley with value 7.316955 Golub - Nagy with value 7.284151 Golub - Pan with value 7.284151 Golub - Benzi with value 7.269946 Golub - Hansen with value 7.248056 Golub - Kincaid with value 7.246845 Golub - Funderlic with value 7.240728 Golub - Gu with value 7.235458 Golub - Paige with value 7.230160 Golub - Park with value 7.222309 Golub - NTrefethen with value 7.219603 Golub - Bjorstad with value 7.202310 Golub - VanHuffel with value 7.189794 Golub - Ruhe with value 7.182335 Golub - Liu with value 7.161991
![](linkpredict_06.png)
The matrix exponential
Benzi, Higham, and Estrada have written about using the matrix exponential for analyzing social networks. Let's see what we get here.
fprintf('\nMatrix exponential with adjacency\n');
showpreds(expm(B),A,25,name);
Matrix exponential with adjacency Golub - Demmel with value 200.667542 Golub - Schreiber with value 163.492101 Golub - Hammarling with value 134.885201 Golub - Paige with value 133.287838 Golub - Liu with value 107.695220 VanLoan - Heath with value 105.109816 VanDooren - Heath with value 104.281025 Golub - Moler with value 103.008898 VanDooren - Plemmons with value 96.089354 Golub - Laub with value 93.927783 Schreiber - Heath with value 93.858705 Golub - Gilbert with value 93.130294 Plemmons - Demmel with value 91.019041 VanLoan - Plemmons with value 89.104026 Golub - Barlow with value 88.970881 Golub - Ng with value 87.220015 Golub - Duff with value 87.169014 Eisenstat - Plemmons with value 87.001184 Golub - Funderlic with value 86.966110 Luk - Heath with value 85.078033 VanDooren - Demmel with value 81.426481 VanLoan - VanDooren with value 80.838813 Luk - Plemmons with value 79.727039 Golub - Harrod with value 78.072424 Kagstrom - Heath with value 77.667173
![](linkpredict_07.png)
The matrix exponential with the normalized Laplacian
Based on work with PageRank, and Fan Chung Graham's "Heat-Kernel PageRank" version, I've often wondered how well the matrix exponential with the normalized Laplacian does instead ...
fprintf('\nMatrix exponential with normalized Laplcian\n');
Dhalf = diag(sqrt(1./sum(B,2)));
L = eye(n) - Dhalf*A*Dhalf;
showpreds(expm(L),A,25,name);
Matrix exponential with normalized Laplcian Varga - Hochbruck with value 0.201888 Gutknecht - Starke with value 0.190164 Crevelli - Ipsen with value 0.187998 Kincaid - Varga with value 0.187151 Wold - Kagstrom with value 0.184371 BunseGerstner - Ammar with value 0.178577 Marek - Widlund with value 0.153282 Varga - Szyld with value 0.152184 He - Gragg with value 0.142192 Cullum - Strakos with value 0.138646 Widlund - Boman with value 0.131688 Chandrasekaran - Jessup with value 0.118900 Modersitzki - Reichel with value 0.117852 Smith - Szyld with value 0.117272 He - Byers with value 0.110082 Bunch - Hansen with value 0.107937 Bjorstad - Smith with value 0.102005 Fierro - OLeary with value 0.100629 Young - Starke with value 0.099905 Marek - Young with value 0.099319 OLeary - Smith with value 0.098833 Park - VanDooren with value 0.098414 Marek - Starke with value 0.098340 LeBorne - Fierro with value 0.097025 Bai - VanHuffel with value 0.093191
![](linkpredict_08.png)
Commute time
The commute time distance between two nodes of a graph is the expected time for a random walk started at one node to first the second node, and then return back to the first node. Pairs of nodes with no edges and small commute time distance are ones that are likely to form edges in the future. Let's see how this measure does!
fprintf('\nCommute time\n'); K = diag(sum(B,2)) - B; C = zeros(n,n); Ki = pinv(full(K)); for i=1:n, for j=1:n, C(i,j) = Ki(i,i) + Ki(j,j) - 2*Ki(i,j); end, end showpreds(-C,A,25,name); % small commute times are likely "new links"
Commute time Golub - Demmel with value -0.143612 Golub - Schreiber with value -0.165561 Golub - Hammarling with value -0.181163 Schreiber - Heath with value -0.217498 VanDooren - Demmel with value -0.223000 Golub - Moler with value -0.230225 Plemmons - Demmel with value -0.232642 VanDooren - Heath with value -0.241182 Golub - Paige with value -0.245504 VanDooren - Plemmons with value -0.252665 VanLoan - Heath with value -0.256789 Schreiber - Hammarling with value -0.257972 VanLoan - VanDooren with value -0.258443 Schreiber - VanDooren with value -0.259069 Hammarling - Heath with value -0.259965 Golub - Gragg with value -0.260218 Schreiber - Plemmons with value -0.262964 Moler - Demmel with value -0.265665 TChan - Demmel with value -0.276060 Moler - Heath with value -0.277328 VanLoan - Hammarling with value -0.280051 Hammarling - Plemmons with value -0.280855 VanLoan - Plemmons with value -0.288621 George - Schreiber with value -0.291199 Luk - Demmel with value -0.293697
![](linkpredict_09.png)
Appendix
You need two extra routines for this code:
type drawit
function drawit % DRAWIT Script to plot the graph of coauthors from Householder 93. load housegraph; % A is the adjacency matrix. % prcm is a permutation; any permutation could be substituted. % name is the vector of people's names. % xy is just xy coords of points on the unit circle (and the origin). Aperm = A(prcm,prcm); nameperm = name(prcm,:); nfolks = max(size(A)); clf gplot(Aperm,xy); axis off; axis square; x = xy(:,1) * 1.05; y = xy(:,2) * 1.05; x(1) = .08; % Put names(1) in the center of the circle. y(1) = .06; h = text(x,y,nameperm); for k = 2:nfolks % Shrink the font for the other names, and rotate them. set(h(k),'fontsize',10); set(h(k),'rotation',180/pi*atan2(y(k),x(k))); end;
type showpreds
function showpreds(P,G,k,names) n = size(P,1); P = P - spones(G).*P; % remove all the given links [i,j,v] = find(triu(P,1)); % remove any self-loops [~,p] = sort(v,'descend'); cnt = zeros(n,1); cur=0; Preds = zeros(k,2); for pi=p(:)' if cur >= k, break; end li=i(pi); lj=j(pi); if cnt(li) > 5 && cnt(lj) > 5, continue; end fprintf(' %20s - %-20s with value %f \n', ... deblank(names(li,:)), names(lj,:), v(pi)); cnt(li) = cnt(li)+1; cnt(lj)=cnt(lj)+1; cur = cur+1; Preds(cur,1) = li; Preds(cur,2) = lj; end % Display them Preds = sparse(Preds(:,1),Preds(:,2),1,n,n); load housegraph; drawit; hold on; gplot(Preds(prcm,prcm),xy,'r.-'); hold off; axis square;