Spectral Graph Partitioning and the Laplacian

Finding a Partition

In this segment, we'll plant a partition in a graph, and then use the second smallest eigenvector to find it.

As always, the first step is to generate our dataset. In this example, we'll be a little more ambitious and use a larger number of vertices.

>> n = 1000;

To plant an artificial partition in our dataset, we need to determine the vectices in each of the groups. To do this, we randomly generate a permutation of all the vertices and select one set as group 1 and the rest as group 2. The variable gs controls how big the size of the first group.

>> x = randperm(n); >> gs = 450; >> group1 = x(1:gs); >> group2 = x(gs+1:end);

Now, we need to decide on the probabilities of edges within each group and between the two groups. Because we are planting a partition, the probabilities of edges between the groups should be much lower than the probability of edges within each group. Suppose that group 1 is a little more tightly connected than group 2. (Please insert your own amusing names for an actual identification of group 1 and group 2, e.g. politicians and mathematicians.)

>> p_group1 = 0.5; >> p_group2 = 0.4; >> p_between = 0.1;

With these probabilities in hand, we can construct our graph. The last few operations symmetrize the matrix.

>> A(group1, group1) = rand(gs,gs) < p_group1; >> A(group2, group2) = rand(n-gs,n-gs) < p_group2; >> A(group1, group2) = rand(gs, n-gs) < p_between; >> A = triu(A,1); >> A = A + A';

Next, let's see if we can see the partition.

>> spy(A);

While some might claim to see a partition in this data, I think it is not particularly obvious.

Now, let's investigate what the second smallest eigenvector tells us about this graph.

>> L = laplacian(A); >> [V D] = eigs(L, 2, 'SA'); >> D(2,2) ans = 46.7158

The large second smallest eigenvalue states that we shouldn't expect to find any very good cuts in our dataset. To see what we found, let's plot the second smallest eigenvector V(:,2).

>> plot(V(:,2), '.-');

That isn't all that helpful. Maybe sorting the vector...

>> plot(sort(V(:,2)), '.-');

Now, that picture is much more informative! We see a large gap in the middle of the values. Interestingly enough, the number of points on the right gap is the same as gs, the size of our planted group. Let's see what happens when we permute the vertices of the graph to this ordering.

>> [ignore p] = sort(V(:,2)); >> spy(A(p,p));

Looks like we found our partitions!

Meaningful Partitions of Real Datasets

The previous example was rather compelling. However, it was also fake. We planted a partition there! How do we know such things occur in real data? In this two section, we'll see what happens when we try the same procedure on two matrices. Unfortunately, the data for these sets is private, so it's difficult to "follow along" with the Matlab. Nevertheless, the pictures should tell the story.

The dataset for this section is a bipartite graph from Yahoo! Search Marketing (formerly known as Overture). The two types of nodes are advertisers and keywords. Advertisers place bids on keywords.

The functions readSMAT and readList are my own functions for reading in datafiles from disk.

>> A = readSMAT('us.3k.2k.smat'); >> labels = readList('us.3k.2k.trms');

Because the data is bipartite, we need to form the full adjacency matrix for the graph.

>> [m n] = size(A); >> B = [sparse(m,m) A; A' sparse(n,n)];

As always, a first good step is to take a look at the matrix.

>> spy(B);

Except for the trivial structure of a bipartite graph, there isn't much else here, so let's use the second smallest eigenvector of the Laplacian matrix.

>> L = laplacian(B); >> [V D] = eigs(L, 2, 'SA'); >> D(2,2) ans = 0.6031

Here, the second smallest eigenvalue is significantly smaller than the first dataset. This means that we expect to find some fairly small cuts or rather tight clusters. In the previous test, we found that plotting the sorted eigenvector was most informative.

>> plot(sort(V(:,2)), '.-');

Widely generalizing from our previous results, the two large gaps in the sorted eigenvector state that we'd expect three large groups in our dataset.

>> [ignore p] = sort(V(:,2)); >> spy(B(p,p));

As expected, we see three large groups. For this graph, we can validate the identity of each group using the terms associated with part of the vertices.

Because the graph was bipartite, we need to find the terms that correspond to the vertices in the "tight" groups in the lower right. The first 3000 vertices in the graph correspond to the labels on the terms.

>> plabels = p(p <= 3000); >> labels(plabels(2500:2525)) ans = 'casino gambling online' 'online roulette' 'craps play' 'sports wagering' 'casino free online' 'play slot' 'betting' 'gambling sports' 'gambling online sports' 'bet bowl super' 'casino net' 'black jack online' 'casino free gambling online' 'game poker' 'betting online sports' 'gambling game' 'black jack play' 'online slot' 'black jack poker roulette' 'betting online shop' 'craps internet' 'book online sport wagering' 'casino free gambling' 'casino game online' 'roulette winning' 'machine poker video'

All the terms relate very strongly to gambling and betting. Thus, the partition in the data we found corresponds to a gambling cluster of terms.

We can also recursively apply this process (see the next section).

We can browse the same dataset that was ordered using a similiar (but distinct) procedure at this website:

http://research.yahoo.com/demo/clustergraph/top_level#img

Recursive Spectral Partitioning

In this section, we'll see yet another dataset and apply the idea not just once, but recursively to extract hierarchical structure in the dataset.

The dataset in this section is a similarity score between two musical artists formed by the ratings of 150,000 users. The data comes from Yahoo!'s LaunchCast service.

>> A = readSMAT('artistuser-md100.sim.smat'); >> labels = readList('artistuser-md100.labels'); >> spy(A);

Without any processing, it's hard to divine any structure in the dataset.

>> L = laplacian(A); >> [V D] = eigs(L, 2, 'SA'); >> D(2,2) ans = 0.0738 >> plot(sort(V(:,2)), '.-'); >> [ignore p] = sort(V(:,2)); >> spy(A(p,p));

Aha! That's better. It looks like there is a cohesive cluster of artists in the upper left hand corner of the adjacency matrix. Let's explore the artists in this area.

>> labels(p(25:50)) ans = 'The Williams Sisters' 'GMWA Women Of Worship' 'Bishop Paul S. Morton, Sr.' 'Tyrone Block' 'The Canton Spirituals' '1NC (One Nation Crew)' 'Michael Fletcher' 'The McClurkin Project' 'Lawrence Matthews' 'Woody Rock' 'Darwin Hobbs' 'Rev. James Moore' 'Anointed Pace Sisters' 'Colorado Mass Choir' 'Helen Baylor' 'Bishop Clarence E. McClendon' 'Full Gospel Baptist Fellowship Mass Choir' 'The Jackson Southernaires' 'Vanessa Williams [Gospel]' 'Calvin Bernard Rhone' 'Mighty Clouds Of Joy' 'New Direction' 'Deitrick Haddon' 'Rance Allen Group' 'Joann Rosario' 'The Soul Stirrers'

I hope it is pretty obvious that these artists are gospel artists.

While this shows a little bit about the data, it isn't all that much. Presumably, there is also more structure in the rest of the dataset, besides just one gospel group of artists.

To expose the remainder of the structure, we apply the second smallest eigenvector recursively. That is, use the second smallest eigenvector of the full graph to determine a good way to split the graph into two pieces (e.g. the gospel cluster and the rest of the data) and then repeat the process on each subgraph.

If we do this for the entire dataset using an optimized program written in C, we get the following adjacency matrix. (The file "au-out.rperm" is the final permutation produced by the recursive program.)

>> p = load('au-out.rperm')+1; >> spy(A(p,p))

In the new dataset, related artists should always be close by. Let's check this hypothesis for Metallica (mainstream rock), Britney Spears (teeny bob), and DJ Tiesto (techno).

>> strmatch('Metallica', labels(p)) ans = 5265 >> labels(p(5265-5:5265+5)) ans = 'Nazareth' 'Molly Hatchet' 'April Wine' 'Nine Inch Nails' 'Pantera' 'Metallica' 'Black Sabbath' 'Stone Temple Pilots' 'Ozzy Osbourne' 'Cypress Hill' 'Nirvana'

>> strmatch('Britney Spears', labels(p)) ans = 8806 >> labels(p(8806-5:8806+5)) ans = 'Paulina Rubio' 'Shakira' 'Anastacia' 'Anastacia' 'JC Chasez' 'Britney Spears' '*NSYNC' 'Celine Dion' 'Backstreet Boys' 'Ricky Martin' 'Lou Bega'

>> strmatch('DJ Tiesto', labels(p)) ans = 3918 >> labels(p(3918-5:3918+5)) ans = 'DJ Irene' 'Darude' 'Zombie Nation' 'Christopher Lawrence' 'Louie DeVito' 'DJ Tiesto' 'George Acosta' 'Sash!' 'Pete Tong' 'Peter Rauhofer' 'DJ Sammy'

Clearly, we can find similiar artists to our seed artists in these datasets.

Spectral Graph Partitioning and the Laplacian with Matlab

Finding a Partition

Meaningful Partitions of Real Datasets

Recursive Spectral Partitioning