Politics, unsupervised

Here’s a “quick” post on looking for structure in a data set. There are a lot of different approaches people are taking these days, so I figured I could do a tour of some diverse ones that give relatively interpretable results. Specifically, let’s see what three people would do: Mathilda, the applied mathematician; Griff, the network guy; and Deepak, the deep learning enthusiast.

To make it a little more interesting, let’s use some political data. Specifically, let’s take all of the 400+ House representatives as our observations and see how they voted on the ten major bills (our features) as defined by the New York Times (as of September 2015). The observations have some kind of labeling: Democrat or Republican. But I’m really not sure how good it is. Does it represent the structure of the data very well? Are we really a two-party system, or is there something else going on?

I don’t know the answers to these questions, so I’m going to write this as a procedural type of thing. I’ll show the code for how I’m trying things as I go. You can also find it here.

### Preliminaries

Getting the data

Thank you GovTrack. They record all of the votes FOR EVERY CONGRESS EVER as JSON and XML files. This Python script gets all of the major votes for the current House. We’ll load it into R as below. Our data frame has a row for each rep and a column for each bill, plus one for the party. We put a -1 if the representative voted no on the bill, a 0 if they didn’t vote, and a 1 if they voted yes. We cut out any representative who voted on fewer than half of the bills.

# Load the data frame
data <- read.csv('../data/votes.csv', header=TRUE, row.names=1)
data$Party <- as.factor(data$Party)
levels(data$Party) <- c('D', 'R')
# Keep representatives who voted on more than half of the bills
vote.h <- ( rowSums(abs(subset(data, select=-c(Party)))) > (ncol(data)-1)/2 )
data <- data[vote.h,]

To run the code below, you’ll need to load a couple styling functions and variables. You can find them in the code repo.

Big picture: small dimension

What are we looking for when we ask if the party labels describe the structure of the system well? I’m not asking if I can predict the labels from the data; just a few boosted trees did that perfectly in leave-one-out cross-validation. Here’s what I’m getting at. Suppose the labels described the data perfectly, and the actions of a representative were solely a function of their party. Then their vote on a bill would be completely determined by whether they’re a Democrat or a Republican, so recording 10 bill votes would be redundant: if we knew each bill’s partisan content, a representative’s vote on one bill would imply their votes on all the others. Put another way, we wouldn’t need 10 dimensions to describe a representative; we’d only need one.
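
Here’s a toy illustration of that one-dimensional extreme, with completely made-up representatives and bills (nothing in this snippet comes from the real data):

# Toy example: if votes were purely a function of party, the vote matrix would
# be an outer product of a party vector and a bill-direction vector
party.sign <- c(1, 1, -1, -1, -1)          # five hypothetical representatives
bill.dir   <- c(1, -1, 1, 1, -1, -1)       # six hypothetical bills
toy.votes  <- outer(party.sign, bill.dir)  # each vote = party x bill direction
qr(toy.votes)$rank                         # 1: a single dimension describes everyone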

Speaking of dimensions, let’s look at the bill-by-bill voting breakdown a bit. This will give us a sense of how the as-labeled partisanship looks. There are a lot of ways to visualize this; here, we’ll just look at the fraction of representatives within each party who voted in support of each bill.

# Libraries for reshaping and plotting
library(reshape2)
library(ggplot2)
# Find support ratios
v.d <- colMeans(subset(data, Party=='D', select=-Party) == 1)
v.r <- colMeans(subset(data, Party=='R', select=-Party) == 1)
bill.df <- melt(data.frame(id=colnames(data)[1:(ncol(data)-1)],
                           Democrat=v.d, Republican=v.r), id='id')
# Order by Democrat support and plot
bill.df$id <- factor(bill.df$id, levels=bill.df$id[order(v.d, decreasing=TRUE)])
ggplot(bill.df, aes(x=id, y=value, fill=variable)) + 
  geom_bar(width=0.75, stat='identity', position=position_dodge(width=0.75)) +
  scale_fill_manual(values=c(colors$D,colors$R), name='Party') + 
  plot.style(t='Bill support by party', x='House bill', y='Fraction supporting')
Bill support by party
Fraction of representatives within each party who voted in support of each bill. We often see disagreement within parties.

It looks like there’s often disagreement within parties on votes, so we may need to dig for something deeper than the plain Democrat/Republican split. Remember we’re looking for interpretable results, which we’ll take to mean linear combinations of vote support. Ok, let’s get started.

Approach 1: Applied math all the way

Everything is SVD

When you hand Mathilda a data set, she’ll do two things. First, she’ll assume someone’s already munged it for her. Second, she’ll look at the spectrum. In a sense, this is just doing principal component analysis, a concept with which I think more people are familiar. PCA is just singular value decomposition - every applied mathematician’s number one homie - in a pretty bad disguise. Take the SVD of our data, \(\mathbf{X} = \mathbf{U\Sigma V}^\intercal\). Since PCA amounts to the spectral decomposition \( \mathbf{X}^\intercal\mathbf{X} = \mathbf{W\Lambda W}^\intercal \) (the covariance matrix, up to scaling, once the data are mean-centered), we see that \(\mathbf{W}=\mathbf{V}\) and \(\mathbf{\Lambda}=\mathbf{\Sigma}^2\). That is, the loading vectors (so-called in PCA terminology) are just the right singular vectors (RSVs) of the data matrix, and the eigenvalues of the covariance matrix are just the squares of its singular values. Our case is not ideal for PCA since our features are not continuous. However, they are ordinal (they give a sense of how much the representative supports the bill), and we aren’t too concerned with the actual values on the principal components, just their structure. So we’re fine to use SVD here.
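
If you want to convince yourself of that correspondence on this data set, here’s a quick sanity check (the X.chk, pca, and udv.chk names below are throwaways, not part of the analysis proper):

# PCA loadings vs. right singular vectors of the mean-centered data
X.chk <- as.matrix(subset(data, select=-c(Party)))
pca <- prcomp(X.chk, center=TRUE, scale.=FALSE)
udv.chk <- svd(scale(X.chk, center=TRUE, scale=FALSE))
# Loadings match the RSVs up to column sign
max(abs(abs(pca$rotation) - abs(udv.chk$v)))
# Covariance eigenvalues are the squared singular values (up to the 1/(n-1) factor)
max(abs(pca$sdev^2 * (nrow(X.chk) - 1) - udv.chk$d^2))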

Spectral dimensionality reduction

So what does this decomposition tell us? A lot. Way too much to cover here. So let’s focus on this question of dimensionality reduction. The key concept here is the Eckart-Young theorem: fixing a desired rank \(r < \mathrm{rank}(\mathbf{X}) \), the rank \( r \) matrix which gives the lowest Frobenius residual from \(\mathbf{X}\) is \(\mathbf{U}\tilde{\mathbf{\Sigma}}\mathbf{V}^\intercal\), where \( \mathrm{diag}(\tilde{\mathbf{\Sigma}}) = (\sigma_1, \sigma_2, \ldots, \sigma_r, 0, \ldots, 0) \). That is, to get the best rank \(r\) approximation of \(\mathbf{X}\), we just take the SVD and chop off the singular values and vectors starting at the \((r+1)\)-th largest. Let’s do precisely this. We’ll compute the SVD after mean-centering the data matrix, then look at the singular values to decide the rank we want.

# Mean center and SVD
X <- as.matrix(data[,1:(ncol(data)-1)])
X.mc <- X - matrix(colMeans(X), ncol=ncol(X), nrow=nrow(X), byrow=TRUE)
udv <- svd(X.mc)
# Compute SV energy
energy <- cumsum(udv$d^2)/sum(udv$d^2)
# Plot singular values
sing.df <- data.frame(sigma=udv$d, opt=(1:length(udv$d) == min(which(energy > 0.75))))
ggplot(sing.df, aes(x=1:length(udv$d), y=sigma)) + geom_line(size=1, linetype='dotted') + 
  geom_point(size=4, aes(color=opt)) + scale_color_manual(values=c('black', 'red')) +
  plot.style(t='Spectrum of House voting data', x='', y='Singular value') +
  theme(legend.position='none')
Singular values of the House voting data
Singular values of the mean-centered House voting data. The red point shows the 75% energy cut-off.
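
As a quick sanity check on the Eckart-Young story, we can build the truncated reconstructions directly and confirm that the relative error of the best rank-\(r\) approximation is exactly the square root of the left-out energy (a minimal sketch reusing X.mc, udv, and energy from above):

# Relative Frobenius error of the best rank-r approximation vs. left-out energy
rank.err <- sapply(1:5, function(r) {
  X.r <- udv$u[, 1:r, drop=FALSE] %*% diag(udv$d[1:r], r) %*% t(udv$v[, 1:r, drop=FALSE])
  norm(X.mc - X.r, type='F') / norm(X.mc, type='F')
})
round(rbind(reconstruction=rank.err, left.out=sqrt(1 - energy[1:5])), 3)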

A general rule of thumb is to keep 75% or 90% of the energy (sum of the squares) of the singular values. Picking 75% is a pretty hefty reduction, but we can see that three singular values are required. Put another way, reducing the dimensionality of the observations below three will result in a poor representation of the original data. This is looking like a non-one-dimensional problem after all. Let’s poke into the singular vectors. In the PCA school of thought, these are the three directions of maximal variance in the data, and they are linear combinations of the original axes (i.e. the bills), so we can visualize the singular vectors by their weights on each bill.

rsv.df <- melt(data.frame(id=colnames(data)[1:(ncol(data)-1)], First=udv$v[,1],
                          Second=udv$v[,2], Third=udv$v[,3]), id='id')
# Sort bills in same order
rsv.df$id <- factor(rsv.df$id, levels=rsv.df$id[order(v.d, decreasing=TRUE)])
ggplot(rsv.df, aes(x=id, y=value, fill=variable)) + 
  geom_bar(width=0.75, stat='identity', position=position_dodge(width=0.75)) +
  scale_fill_manual(values=colors$other, name='Right singular vector') +
  plot.style(t='Right singular vector weights', x='House bill', y='Weight')
Right singular vector weights
Right singular vectors shown by weights on each bill. Bills are sorted as above by Democrat support.

Sorry for the funky colors; I wanted to avoid red and blue. The obvious trend is that the first right singular vector weights increase as Democrat-ness decreases. Weights for the second RSV appear to be low exactly when Republican-ness is very high. And the third one appears to be showing us something completely different. If we actually look at the content of the bills, the two bills for which the third RSV has highly negative weights both supported the Trans-Pacific Partnership. So the spectral approach suggests that the Democrat/Republican split is a strong one, but that it fails to capture the full structure of the data.
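
If you don’t want to squint at the chart to find those two bills, you can pull them straight out of the decomposition (reusing X and udv from above; the sign of a singular vector is just a convention of the solver):

# The two bills with the most negative weights in the third right singular vector
head(colnames(X)[order(udv$v[,3])], 2)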

Approach 2: Network it

Building a network

The first few things Griff will do when you give him data are make a graph, try to visualize it, and then investigate its structure. So let’s make a network of the representatives. We define our graph \(G=(V,E,w)\) where \(V\) is the set of representatives and \(w(u,v) = (\langle\mathbf{X}_u, \mathbf{X}_v\rangle + 10)/20\) (the scaled inner product of rows \(u\) and \(v\)). We add \((u,v)\) to \(E\) if \(w(u,v)>0\). What is this crazy weight function? Nothing too fancy. If you think about what the inner product between two reps is, you see that for any given bill, it adds 1 if they voted the same, -1 if they voted differently, and 0 if either didn’t vote. It’s a similarity metric, which we then scale to lie between 0 and 1 (over ten bills, the inner product ranges from -10 to 10). Now our original features are encoded in the edges of the network; each representative is described by his or her connections to others. To create the weighted adjacency matrix, we do

# Function to create weighted adjacency matrix
w.adj <- function(data) {
  # +1 similarity if agree, 0 if non-voter, -1 if disagree
  wk <- as.matrix(data) %*% t(as.matrix(data))
  # Normalize
  wk <- wk - min(wk)
  wk <- wk/max(wk)
  return(wk)
}
# Get weighted adjacency matrix
wk <- w.adj(subset(data, select=-c(Party)))

Sampling for visualization

Let’s try some visualizations. Our network has 431 nodes and almost 93,000 edges. That’s too big to visualize as a whole, so let’s sample it to create a smaller, representative subgraph that we can actually look at. What we want is a small set of prototypical vertices to stand in for areas of high similarity, so that we can see how these different regions interact with each other. I could use my prior knowledge of the House (e.g. pick John Boehner for the hardcore Republicans, Nancy Pelosi for the hardcore Democrats, Charlie Dent for the moderate Republicans, etc.), or we could do it algorithmically. One of the better-performing graph sampling algorithms is based on the forest-fire modeling approach, and we’ll do something like it for our similarity-weighted, almost-complete graph. We initialize by “burning” a random node, and then burn each of its neighbors with probability proportional to the edge weight. We then remove all of the edges incident to the burned node. The process repeats for each of the newly burned nodes and continues until the fire goes out. The sample is the set of unburned nodes. This isn’t the ideal setting for forest fire-like sampling, since we’re reducing the node set by so much; in practice, a really good sample only comes along every few trials, so don’t be afraid to run it a few times. It’s quick. Here’s the code.

# Forest fire-like sampling from weight matrix
# Want representative nodes so burn out edges 
# to highly weighted (highly similar) nodes
# with greater probability
fire.samp <- function(W) {
  # Predetermine if each edge will burn if reached
  # Set threshold by weight (weight near 1 means 
  # low burn thresh) and toss
  n <- dim(W)[1]
  toss <- matrix(runif(n^2), n)
  thresh <- 0.015*((W-min(W))/max(W))
  # Explicit symmetry
  toss[lower.tri(toss, diag = TRUE)] <- NA
  thresh[lower.tri(thresh, diag = TRUE)] <- NA
  # Decide if edges will burn; return symmetry
  B <- (toss < thresh)
  B[lower.tri(B, diag = FALSE)] <- 
    t(B)[lower.tri(B, diag = FALSE)]
  diag(B) <- FALSE
  # Random initialization, init queue
  burned <- c(sample(1:n, 1))
  b.q <- burned
  # Iterate while queue isn't empty
  while (length(b.q) > 0) {
    # Pop node
    f <- b.q[1]
    b.q <- b.q[-1]
    # Determine where fire spreads, add to burned and queue
    spread <- setdiff(which(B[f,]), union(burned, b.q))
    burned <- c(burned, spread)
    b.q <- c(b.q, spread)
    # Remove incident edges
    B[f,] <- FALSE
    B[,f] <- FALSE
  }
  return(burned)
}
# Insist on well-sized sample for viz
ff.samp <- c()
while(length(ff.samp) < 8 | length(ff.samp) > 12) {
  ff.samp <- setdiff(1:dim(wk)[1], fire.samp(wk))
}

Ok, let’s visualize it. No need to reinvent the wheel, so let’s use the igraph package. We’ll use the weighted adjacency submatrix given by the sample. For interpretability, we’ll color the nodes (representatives) by their party. If an edge goes from Republican to Republican, we’ll make it red. Democrat to Democrat is blue. And inter-party edges are purple. The edge widths will be controlled by the similarity measure (weight) between the nodes.

library(igraph)
library(plyr)  # for revalue
# Form graph from weighted adjacency matrix on the sampled nodes
G.sub <- graph_from_adjacency_matrix(wk[ff.samp, ff.samp], mode = "undirected",
                                     weighted = TRUE, diag = FALSE,
                                     add.colnames = NA, add.rownames = NA)
# Define node palette
v.pal <- revalue(data$Party[ff.samp], c('D'=colors$Da, 'R'=colors$Ra))
# Define edge palette by counting number of incident Democrat nodes
incident.u <- data$Party[ff.samp][ends(G.sub, E(G.sub))[,1]]
incident.v <- data$Party[ff.samp][ends(G.sub, E(G.sub))[,2]]
e.pal <- (incident.u == 'D') + (incident.v == 'D')
e.pal <- revalue(as.factor(e.pal), c('0'=colors$R, '1'=colors$DR, '2'=colors$D))
# Draw graph
plot.igraph(G.sub, xlim=c(-0.75,0.75), ylim=c(-0.75,0.75),
            vertex.size=25, vertex.label=substr(rownames(data)[ff.samp], 1, 5),
            vertex.label.family='sans', vertex.label.color='black', vertex.label.font=2,
            vertex.color=as.character(v.pal), edge.arrow.size=0,
            edge.width=10*E(G.sub)$weight, edge.color=as.character(e.pal))
Sampled Representative network
Sub-network on the sampled nodes. Blue nodes are Democrats, red are Republicans. Names have been truncated to five characters.

What do we notice? Well, as expected, the purple inter-party edges are generally thinner. However, some nodes certainly have stronger ties to their fellow party members than others do. For example, the Barr-Gowdy-Barletta triangle is much stronger than the Barr-Jordan-Barletta triangle.

Clustering and community finding

Let’s do a little learning on the graph. Clustering is another common way to reduce dimensionality, and there are a bunch of neat graph clustering algorithms. And we’re in luck: they’re based on using edge weights as similarity measures. We’re going to use a weighted extension of a simple and fairly famous algorithm introduced in Finding Community Structure in Very Large Networks (A. Clauset et al., 2004). Essentially, we initialize every node in its own cluster and then greedily perform the cluster merges that give the largest modularity gain, where modularity measures how much heavier the within-cluster edge weight is than we’d expect if edges were placed at random. The igraph package includes an implementation of the algorithm, so all we have to do is

# Form graph from full weighted adjacency matrix
G <- graph_from_adjacency_matrix(wk, mode = "undirected", weighted = TRUE, diag = FALSE,
                                 add.colnames = NA, add.rownames = NA)
# Cluster by greedily maximizing modularity
clus <- cluster_fast_greedy(G)
mem <- as.factor(clus$membership)
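
For reference (my gloss, not notation from the paper), the quantity being greedily increased here is the weighted modularity \( Q = \frac{1}{2m}\sum_{u,v}\left(w(u,v) - \frac{k_u k_v}{2m}\right)\delta(c_u, c_v) \), where \(k_u\) is the total edge weight incident to node \(u\), \(m\) is the total edge weight in the graph, and \(\delta(c_u, c_v)\) is 1 when \(u\) and \(v\) land in the same cluster. In words: how much more within-cluster weight we see than we’d expect if the same weight were scattered at random.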

We obtain a two-component clustering. That’s suspicious. Let’s see what happens if we call one cluster “Democrat” and the other “Republican”, and measure the fraction of representatives not assigned correctly.

# Name the clusters to match the party labels (if a run comes out with
# the clusters flipped, swap the levels or take one minus the error below)
levels(mem) <- levels(data$Party)
mean(mem != data$Party)

We get an error of 0.69%, which is three misassigned representatives out of 431. Who are those three? Republicans Bob Dold, John Katko, and Bruce Poliquin. Here are some fun facts about those three.

I’d probably guess their party wrong too. So from this graph clustering perspective, the party labels work pretty damn well.

Approach 3: Trendy deep learning

Hidden layers, hidden structure

Admittedly, I don’t know a ton about deep learning. I try to stay on the statistical and mathematical side of learning approaches. But it’s an important, rapidly developing field. And I do know that one thing Deepak would try is an autoencoder. To implement an autoencoder, we design a network trained to reproduce its input at its output, but we throw in a single hidden layer with fewer nodes than the input. In effect, this compresses the data before we try to recover it again. If the reconstruction error stays low for a given hidden layer size, then that number of nodes compresses the data well.

Let’s give it a go with the autoencoder R package. Without going into too much detail, there are several sparsity parameters to set in order to keep the hidden-unit activations sparse and the weights small (hence the full name “sparse autoencoder”). To save time for anyone replicating this, I did a large grid search over different parameter choices. It took quite a while, but if you want to give it a go, check out the autoencoder.grid.search function. To pick the size of the hidden layer, we can examine the minimal (over parameter choices) average reconstruction error as a function of the hidden layer size.
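
If you’d rather not run the full grid search, here’s a simplified stand-in for it: a sweep over hidden layer sizes with the sparsity parameters pinned to the values used for the final model below. (This assumes the object returned by autoencode() exposes mean.error.training.set, the average reconstruction error on the training set.)

# Sweep hidden layer sizes with fixed sparsity parameters
library(autoencoder)
X <- as.matrix(subset(data, select=-c(Party)))
recon.err <- sapply(1:10, function(h) {
  autoencode(X.train=X, nl=3, N.hidden=h, unit.type='logistic',
             lambda=0.0002, beta=0.1, rho=0.02, epsilon=0.002,
             optim.method='BFGS', max.iterations=1000,
             rescale.flag=TRUE, rescaling.offset=0.001)$mean.error.training.set
})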

Autoencoder reconstruction errors
Minimal average reconstruction errors for autoencoders with hidden layer sizes from 1 to 10. Note the kink at 4.

As with many things neural-networky, picking the hidden layer size is more art than science. There appears to be a kink at 4 hidden nodes, so let’s go with that. We’ll train our final autoencoder with

beta <- 0.1
lambda <- 0.0002
rho <- 0.02
epsilon <- 0.002
n.hidden <- 4
X <- as.matrix(subset(data, select=-c(Party)))
ae <- autoencode(X.train=X, nl=3, N.hidden=n.hidden, unit.type='logistic', 
                 lambda=lambda, beta=beta, rho=rho, epsilon=epsilon,
                 optim.method="BFGS", max.iterations=1000,
                 rescale.flag=TRUE, rescaling.offset=0.001)

Activation investigation

To get a sense of how the data was compressed, we’ll look at the activation weights. This is similar to the SVD analysis in some ways and different in others. For example, each hidden node’s activation is a function of a linear combination of the 10 bill features, which is reminiscent of the RSVs. However, there is no “importance” ordering on the nodes like the one the singular values implied. Also note that we won’t obtain the same weights every time we train the autoencoder, since the weights are randomly initialized before the BFGS optimization; still, the same general patterns emerge in each run.

# Get weight matrix for first layer
w <- t(ae$W[[1]])
# Plot weights
weight.df <- melt(data.frame(id=colnames(data)[1:(ncol(data)-1)], One=w[,1],
                             Two=w[,2], Three=w[,3], Four=w[,4]), id='id')
weight.df$id <- factor(weight.df$id, levels=weight.df$id[order(v.d, decreasing=TRUE)])
ggplot(weight.df, aes(x=id, y=value, fill=variable)) + 
  geom_bar(width=0.75, stat='identity', position=position_dodge(width=0.75)) +
  scale_fill_manual(values=colors$other, name='Hidden nodes') +
  plot.style(t='Activation weights for hidden layer', x='House bill', y='Weight')
Autoencoder activation weights
Activation weights for the bill features in each hidden node.

There’s a lot to see here. For instance, we notice the same Trans-Pacific Partnership pattern in the second hidden node as we did in the third RSV. Similarly, H106 and H109 were both about passing Homeland Security funding, and node one seems to match up with that paradigm. I’ll leave it as an exercise to see what else you can find. So, looking back to the party labels, what is the autoencoder telling us? If you want to compress the data well, you can’t just encode partisanship; you need to encode stances on specific issues. That’s certainly a different perspective from what we saw in graph clustering and even in the spectral decomposition.

One last note

I originally planned on doing this whole thing entirely in Azure ML Studio to show how seamlessly you can work with both Python and R, and to check out some of their useful utilities. But it didn’t work out. I can deal with not having a console or a real dev environment; I could either work locally or use something like DataJoy, and providing a real IDE is probably too much to ask of a free web service that already offers state-of-the-art ML utilities. However, ML Studio allows only limited output from Python and R, and I can’t really use Python effectively if all I can output is a single numbers-only matrix.