# BagNet

### 1. Introduction

Network science is big right now. Network inference, learning the structure of an unobserved edge set from data, is one of the most interesting subjects in the field. NetInf (along with its sibling NetRate) is one of my favorite network inference algorithms: it has an elegant theoretical backing and generally produces great results. NetInf uses data about cascades that propagated over the latent network to infer the edge set. However, performance drastically decreases when the number of observed cascades is far smaller than the network size. This is an important problem to address, since we can only observe a few cascades on many real-world networks. BagNet is a bagging-inspired algorithm that improves the performance of NetInf, in particular when the number of observed cascades is small. The following sections introduce NetInf and bagging, and then tie them together to see how BagNet works.

If you want to experiment for yourself, head over to the BagNet repo and look at the readme for instructions.

### 2. NetInf

The NetInf algorithm was developed in Inferring Networks of Diffusion and Influence (M. Gomez-Rodriguez, J. Leskovec, and A. Krause 2012). The algorithm is designed to efficiently reconstruct a "who-influences-whom" network from cascades of information diffusion. If that description sounds too abstract or overly specific, here are two real-world examples of how NetInf can come in handy:

• Blogging networks: As news stories break, large media sites (such as the NYT or BBC) post articles. These stories then diffuse around the blogosphere. We can record the time at which each blog we track picks up the story. Using this information for many stories, we want to reconstruct the blogosphere network (perhaps in order to strategically inject a story so that it diffuses widely). That is, viewing the large media sites and blogs as nodes, we want to infer the directed edges indicating which sites influence others.
• Disease transmission networks: Consider a cluster of villages in the developing world. When a disease is introduced in one village, it will be transmitted to the others and we can record the time at which each village sees its first case. In order to prevent the transmission of future diseases, we want to infer the means of transmission (trade routes, etc.). That is, viewing the villages as nodes, we want to infer the directed edges indicating which villages transmit diseases directly to others.

We refer to the spread of a single news story or the diffusion of a single disease as a cascade. If we see that a certain node $$v$$ is frequently activated/infected after another node $$u$$, then we would assume that $$u$$ influences/transmits to $$v$$. That's the idea behind NetInf. Given the complexity of the problem, it seems like we need to view many cascades before we can form a good reconstruction. In the first example, we certainly have many cascades: there are thousands of news stories that we can track every day. However, disease epidemics are fewer and farther between. This is the problem that BagNet addresses; more on that later.

Here's a taste of how NetInf actually works. The input data to NetInf is a list of tuples $$(v, t_v)_c$$ where $$v$$ is a node, $$t_v$$ is the time $$v$$ is activated, and $$c$$ is the cascade index for the recorded activation. The output is the edge set $$E$$ containing the inferred directed edges. We assert that during the course of any cascade, an infected node $$u$$ infects $$v$$ with probability $$\beta$$ if there is an edge from $$u$$ to $$v$$ (a mechanistic infection), and with small probability $$\varepsilon$$ if there is no edge from $$u$$ to $$v$$ (a random infection from external causes). The infection occurs after a sampled incubation time parameterized by the difference between the activation times $$t_u$$ and $$t_v$$ (with $$t_v = \infty$$ if $$v$$ is never infected during the cascade). We note that in most practical cases, there are many propagation trees and underlying networks (i.e. $$E$$) consistent with all of the observed cascades.
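To make the input format concrete, here's a minimal sketch of how the cascade data can be represented; the names and container choices here are mine, not NetInf's actual interface:

```python
import math

# Each cascade maps node -> activation time; nodes absent from the
# mapping were never infected (activation time treated as infinity).
cascades = [
    {"u": 0.0, "v": 1.7, "w": 3.2},   # cascade 0: u starts, then v, then w
    {"w": 0.0, "v": 2.5},             # cascade 1: w starts, only v follows
]

def activation_time(cascade, node):
    """Return the recorded activation time, or infinity if never infected."""
    return cascade.get(node, math.inf)
```

Storing only the activated nodes keeps each cascade sparse, which matters because most nodes never activate in any single cascade.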

NetInf specifically chooses an edge set which maximizes the likelihood of observing the cascades given the defined infection model. If the incubation time is given by a distribution $$P(t_u, t_v)$$, then we define the following edge transmission probabilities:

$$P'(u,v) = \begin{cases} \beta P(t_u, t_v) & \text{if } t_u < t_v \text{ and } (u,v) \text{ is a network edge} \cr \varepsilon P(t_u, t_v) & \text{if } t_u < t_v \text{ and } (u,v) \text{ is not a network edge} \cr 1 - \beta & \text{if } t_v = \infty \text{ and } (u,v) \text{ is a network edge} \cr 1 - \varepsilon & \text{if } t_v = \infty \text{ and } (u,v) \text{ is not a network edge} \cr 0 & \text{otherwise} \end{cases}$$
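As a sketch, here's what $$P'(u,v)$$ looks like in code, assuming an exponential incubation distribution $$P(t_u, t_v) = \alpha e^{-\alpha (t_v - t_u)}$$; the function name and parameter values are illustrative, not NetInf's implementation:

```python
import math

def edge_prob(t_u, t_v, is_edge, beta=0.5, eps=1e-3, alpha=1.0):
    """Transmission probability P'(u, v) from the case analysis above,
    under an (assumed) exponential incubation distribution."""
    if math.isinf(t_v):                  # v was never infected in this cascade
        return (1 - beta) if is_edge else (1 - eps)
    if t_u < t_v:                        # u could have infected v
        incubation = alpha * math.exp(-alpha * (t_v - t_u))
        return (beta if is_edge else eps) * incubation
    return 0.0                           # v activated before u: impossible
```

Note how the non-infection cases ($$t_v = \infty$$) carry real probability mass: an edge that repeatedly fails to transmit counts as evidence, too.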

Now, the probability for a cascade $$c$$ to occur in a tree pattern $$T = (V_T, E_T)$$ is given by

$$P(c|T) = \prod_{u\in V_T} \prod_{v\in V} P'(u,v)$$

We now want $$P(C|G)$$, the probability of a cascade set occurring over a given graph. We will eventually maximize this expression over the possible graphs via submodular optimization. To achieve submodularity in the objective function, NetInf makes the following simplification: consider only the most likely propagation tree for a given cascade. We now have

$$P(C|G) = \prod_{c\in C} \sum_{T\in\mathcal{T}_c(G)} P(c|T) \approx \prod_{c\in C} \max_{T\in\mathcal{T}_c(G)} P(c|T)$$
where $$\mathcal{T}_c(G)$$ is the set of propagation trees on $$G$$ consistent with the cascade $$c$$. The final log-likelihood optimization is given as
$$G^* = \arg\max_{G} \sum_{c\in C} \left( \max_{T\in\mathcal{T}_c(G)} \log P(c|T) - \max_{T\in\mathcal{T}_c(\hat{K})} \log P(c|T)\right)$$
where $$\hat{K}$$ is a complete graph on $$V$$ with only edges of probability $$\varepsilon$$ (think of $$\hat{K}$$ as a normalizer, the least likely network for the cascades to occur on). NetInf boils down to a monotone submodular optimization, and the algorithm greedily adds edges to the network to increase the log-likelihood of the observed cascade set as defined above.
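To illustrate the greedy procedure, here's a toy sketch. For a fixed edge set, each cascade's best tree is found by letting every non-root node pick its most likely earlier-activated parent; edges are then added one at a time by largest marginal gain in total log-likelihood. This is a simplified illustration (exponential incubation, no $$\hat{K}$$ normalizer, brute-force gain evaluation), not the real NetInf implementation:

```python
import math

def cascade_loglik(cascade, edges, beta=0.5, eps=1e-3, alpha=1.0):
    """Log-likelihood of the most likely propagation tree for one cascade:
    each non-root node picks its best earlier-activated parent."""
    nodes = sorted(cascade, key=cascade.get)
    total = 0.0
    for v in nodes[1:]:                    # skip the root (earliest node)
        best = -math.inf
        for u in nodes:
            if cascade[u] < cascade[v]:
                rate = beta if (u, v) in edges else eps
                p = rate * alpha * math.exp(-alpha * (cascade[v] - cascade[u]))
                best = max(best, math.log(p))
        total += best
    return total

def greedy_netinf(cascades, candidate_edges, k, **kw):
    """Greedily add the edge with the largest marginal gain in total
    log-likelihood until k edges are chosen (a toy NetInf sketch)."""
    chosen = set()
    for _ in range(k):
        def gain(e):
            return sum(cascade_loglik(c, chosen | {e}, **kw) for c in cascades)
        chosen.add(max(candidate_edges - chosen, key=gain))
    return chosen
```

The real algorithm exploits the structure of the objective (lazy evaluation, localized tree updates) to avoid recomputing every cascade's best tree for every candidate edge, but the greedy skeleton is the same.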

### 3. Bagging

Bagging was introduced by Leo Breiman as a generalized algorithm for constructing an ensemble learner from several weak learners. The goal is to improve the predictive accuracy of the ensemble by reducing the variance as compared to the individual weak learners. The name "bagging" is a portmanteau of bootstrap aggregating. Let's look at each of these two terms.

• Bootstrapping: Given a training dataset, we create bootstrap datasets by sampling the observations uniformly with replacement as many times as we like. In general, bootstrapping is used to assess statistical accuracy. A model fit to the original dataset can be refit to each of the bootstrap datasets, and then the behavior of the models can be examined as it changes over the samples. For instance, we can get an unbiased estimate of the variance of a regression coefficient. Bootstrapping is very simple, incredibly versatile, and has linear complexity in the number of resamples. It's used a lot, but should be used more.
• Aggregating: It turns out that bootstrapping can do more than tell us about the fit of our model: it can actually improve the fit. All we need is a model, bootstrap datasets, and an appropriate aggregation scheme. For instance, we could average the regression coefficients of linear models fit to each of the bootstraps, or take a majority vote from binary classification trees fit to each of the bootstraps to classify a test case. These procedures leave the model bias untouched while simultaneously dropping the variance.

Like bootstrapping, bagging is simple, versatile, and produces (generally) great results. For instance, the random forest algorithm is an extension of bagging for decision trees.
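To make the bootstrap-then-aggregate loop concrete, here's a toy bagging sketch: a deliberately weak one-dimensional threshold classifier is fit to each bootstrap sample, and predictions are aggregated by majority vote. All names and data here are illustrative:

```python
import random
import statistics

def bootstrap(data, rng):
    """Sample len(data) observations uniformly with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(data):
    """A weak learner: threshold at the midpoint of the class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    if not xs0 or not xs1:     # degenerate bootstrap: fall back to overall mean
        return statistics.mean(x for x, _ in data)
    return (statistics.mean(xs0) + statistics.mean(xs1)) / 2

def bagged_predict(x, thresholds):
    """Aggregate by majority vote over the bootstrapped stumps."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes > len(thresholds) / 2 else 0

rng = random.Random(0)
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.7, 1), (0.8, 1), (0.95, 1)]
thresholds = [fit_stump(bootstrap(data, rng)) for _ in range(25)]
```

Each stump by itself is jumpy, since its threshold depends heavily on which points landed in its bootstrap sample; the vote averages that variance away without changing what any individual stump can express.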

### 4. Putting BagNet together

Ok, back to BagNet. Remember earlier when I said that NetInf works best when we have a lot of observed cascades? Let's see that in action. For all of the following experiments, we're going to use cascades generated on artificial Kronecker graphs, which mimic real-world networks. The figure below shows ROC curves for NetInf's edge recovery as compared to the ground truth network on 512 nodes and 1024 edges. The different curves show the results for different numbers of observed cascades.

When we have sufficiently many observed cascades, NetInf demonstrates near-perfect sensitivity and pretty good specificity. However, as we can see, performance drastically decreases as we observe fewer cascades. This makes sense: when there are more cascades, we are more likely to see diffusion propagate across any particular edge more often. In this case, the assumptions made by NetInf to achieve submodularity (e.g. considering only the most likely propagation tree) exacerbate the problem. As noted, this use case occurs frequently in the real world, for instance, in a disease network setting.

BagNet addresses this few-observed-cascades problem (specifically $$|C| \ll |E|$$) by wrapping NetInf in a bagging-inspired procedure. We chose bagging for its generality in theory and application, as well as its suitability to this setting. After all, Breiman introduced bagging as an "improvement for unstable procedures".

We generate $$B$$ bootstrap datasets by sampling entire cascades. The weak learners are the results of NetInf for each of the bootstraps. If we set the aggregation threshold to be $$\rho$$, then an edge is included in the BagNet result if at least $$\rho B$$ of the NetInf results included the edge. The procedure is summarized in Algorithm 1.
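In code, the wrapper is a short loop. In this sketch, `infer_edges` stands in for a call to NetInf on one bootstrap sample; the function name and defaults are illustrative:

```python
import random
from collections import Counter

def bagnet(cascades, infer_edges, B=100, rho=0.5, rng=None):
    """Bagging wrapper sketch: resample whole cascades with replacement,
    run a network-inference routine on each bootstrap set, and keep the
    edges appearing in at least rho * B of the B inferred edge sets."""
    rng = rng or random.Random(0)
    votes = Counter()
    for _ in range(B):
        sample = [rng.choice(cascades) for _ in cascades]
        votes.update(infer_edges(sample))   # one "weak learner" edge set
    return {edge for edge, n in votes.items() if n >= rho * B}
```

Note that resampling happens at the level of whole cascades, never individual activations, so each bootstrap dataset is a valid NetInf input. The threshold $$\rho$$ trades off sensitivity against specificity, playing the same role as the vote cutoff in bagged classification.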

We evaluated BagNet on a 512 node, 1024 edge Kronecker graph. A total of 50 cascades were generated (with 94% edge coverage); these were resampled into 100 bootstrap datasets with 100 observations each. The figure below shows an ROC curve comparison between standard NetInf and BagNet.
While BagNet does not mimic the performance when many cascades are observed, it outperforms NetInf at every complexity level. Take home message: there's no replacement for good data, but there's always something - like bagging - to be done.

Want to experiment with BagNet yourself? Check out the BagNet repo!