statistics – Danielle Notice

Fighting in the Karate Club: Stochastic block models

danielle-notice — Mon, 07 Mar 2022 09:00:00 +0000

Reading Time: 5 minutes

Imagine you love karate. You love the principles of the martial arts, you love the physical activity, and well… you love fighting. So you deal with your violent desires in a responsible way – you join your university’s Karate Club. Everything is going great, until there is some conflict over the price of lessons between the club president and the part-time karate instructor. Before you know it, the whole club is divided on the matter and it has become a conflict of ideology rather than just fees. Ultimately, the club leaders fire the instructor, and all of his supporters leave with him and form their own karate club. This is obviously not the type of fighting you had in mind when you joined.

This was studied by Wayne Zachary, an anthropologist. This and many other situations can be represented as networks to describe the social, physical and other structures where interactions between pairs of units are observed. These include social networks, biological structures, and collections of websites, documents and words.

Stochastic block models (SBMs) are a class of random graph models which are widely studied and popular for statistical analysis of networks. These networks are modelled to discover and understand their underlying structure, which can be used to group similar elements or simulate how the network could grow. The goal is to infer unknown characteristics of the elements in the network from the observed measurements on pairwise properties.

In this blog, we discuss SBMs using the Karate club example.

Stochastic Block Models
Extensions of the model
1. Degree-corrected SBMs
2. Mixed-membership SBMs
3. Assortative SBMs

1. Stochastic Block Models (SBMs)

Consider an SBM for the karate club. There are $n$ members, each representing a node in the graph. The members can be divided into $K$ mutually exclusive groups, $(B_1, \ldots, B_K)$. These groups are unknown. What is known about this club are all the relationships between its members, which can be represented in the adjacency matrix $\mathbf{Y}$. This is an $n \times n$ matrix for the graph where, for a pair of nodes $(p,q)$, $\mathbf{Y}_{pq}$ is 1 if there is an edge (connection) between the nodes (members) $p$ and $q$ and 0 otherwise.

The SBM is defined by the stochastic block matrix, $\mathbf{C}$, a $K \times K$ matrix for the graph where $\mathbf{C}_{ij} \in [0,1]$ is the probability that there is an edge from a node in $B_i$ to a node in $B_j$. A key feature of SBMs is stochastic equivalence. This means that two persons in the same group have the same probability distribution of being connected to other persons, both those within and outside the group. It then follows that for node $p \in B_i$ and node $q \in B_j$,

$\mathbf{Y}_{pq} \sim \text{Bernoulli}(\mathbf{C}_{ij}).$
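To make the generative story concrete, here is a minimal sketch that samples a network from a simple SBM. It assumes an undirected graph without self-loops and a symmetric block matrix; the function name `sample_sbm` and the toy group sizes and probabilities are made up for illustration and are not the karate club data.

```python
import random

def sample_sbm(groups, C, seed=None):
    """Sample an undirected, loop-free adjacency matrix Y from a simple SBM.

    groups[p] is the block index of node p; C[i][j] is the edge probability
    between blocks i and j (assumed symmetric for an undirected graph).
    """
    rng = random.Random(seed)
    n = len(groups)
    Y = [[0] * n for _ in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            prob = C[groups[p]][groups[q]]
            # Y_pq ~ Bernoulli(C_ij), mirrored so that Y stays symmetric
            Y[p][q] = Y[q][p] = 1 if rng.random() < prob else 0
    return Y

# Hypothetical toy club: 6 members in 2 groups, dense within, sparse between
groups = [0, 0, 0, 1, 1, 1]
C = [[0.9, 0.1],
     [0.1, 0.9]]
Y = sample_sbm(groups, C, seed=1)
for row in Y:
    print(row)
```

In practice the problem runs the other way: $\mathbf{Y}$ is observed and the groups and $\mathbf{C}$ have to be inferred, but simulating forward is a quick way to see what kind of networks the model considers plausible.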

The Karate club network. Graph from SBM review paper listed below (1).

2. Modifications to the SBM

There are several limitations to the simple SBM described above:

Nodes within a group are expected to all have the same degree.
Nodes are restricted to belong to only one group.
The model is not guaranteed to produce assortative groups.

We will now look at a few modifications to the simple SBM that address each of these limitations.

Degree-corrected SBM

In the karate club (and many other networks), it is unlikely that every person in a particular group would have the same number of friends in the group. The degree-corrected SBM extends the simple SBM to account for this possibility of different node degrees.

In an undirected graph, the degree of node $p$ is the number of edges incident to $p$. Consider the network as an undirected multi-graph, in which two nodes can be joined by more than one edge. The stochastic block matrix $\mathbf{C}$ is redefined such that $\mathbf{C}_{ij} \geq 0$ is the expected number of edges between nodes in $B_i$ and $B_j$.

Also included in the model is an $n$-vector $\phi$, where each element $\phi_p$ controls the expected degree of node $p$. $\phi_p$ is the probability that an edge connected to the group containing $p$ is connected to $p$ itself. It then follows that for node $p \in B_i$ and node $q \in B_j$,

$\mathbf{Y}_{pq} \sim \text{Poisson}(\phi_p\phi_q\mathbf{C}_{ij}).$
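The same kind of forward simulation works for the degree-corrected model. The sketch below is only illustrative: `sample_poisson` and `sample_dcsbm` are hypothetical helper names, self-loops are skipped for simplicity, and the toy values of $\phi$ (which sum to 1 within each group) and $\mathbf{C}$ are invented.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw from Poisson(rate) using Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_dcsbm(groups, C, phi, seed=None):
    """Sample a symmetric multi-graph count matrix Y from a degree-corrected SBM."""
    rng = random.Random(seed)
    n = len(groups)
    Y = [[0] * n for _ in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            # Y_pq ~ Poisson(phi_p * phi_q * C_ij)
            rate = phi[p] * phi[q] * C[groups[p]][groups[q]]
            Y[p][q] = Y[q][p] = sample_poisson(rate, rng)
    return Y

# Hypothetical toy values: node 0 is the "popular" member of group 0
groups = [0, 0, 0, 1, 1, 1]
phi = [0.6, 0.3, 0.1, 0.5, 0.25, 0.25]
C = [[10.0, 2.0],
     [2.0, 8.0]]
Y = sample_dcsbm(groups, C, phi, seed=1)
for row in Y:
    print(row)
```

Because the rate is scaled by $\phi_p\phi_q$, two nodes in the same group can have very different expected degrees while still sharing the same block-level connection pattern.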

Divisions of the karate club network found using the (a) uncorrected and (b) corrected SBMs. The size of each node is proportional to its degree and the shading reflects inferred group membership. The dashed line indicates the split observed in real life.

Mixed-membership SBMs

In the karate club example, the second limitation is not all that relevant (although it could well have been possible to be part of both clubs after the split), but it’s clear how, for other networks, this can be a major limitation. Further, the strength of a node’s affiliation with each group can differ. The mixed-membership SBM (MMSBM) extends the simple SBM to accommodate these multi-faceted relationships using a mixed-membership approach.

In addition to the variables in the simple SBM, there is also an $n \times K$ matrix of membership probabilities $\Theta$, where each element $\Theta_{pi}$ represents the probability that node $p$ belongs to $B_i$. Each row is not restricted to have only 1 non-zero element, so each node can simultaneously belong to multiple groups.
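The post does not spell out how $\Theta$ enters the edge probabilities. One common formulation, stated here as an assumption rather than as the exact model behind this section, lets each node pick a group afresh for every interaction. For a pair $(p,q)$, with $\Theta_p$ denoting row $p$ of $\Theta$ and $z_{p \to q}$, $z_{q \to p}$ being hypothetical names for the groups picked:

$z_{p \to q} \sim \text{Categorical}(\Theta_p), \quad z_{q \to p} \sim \text{Categorical}(\Theta_q),$

$\mathbf{Y}_{pq} \sim \text{Bernoulli}(\mathbf{C}_{z_{p \to q} z_{q \to p}}).$

When every row of $\Theta$ has a single entry equal to 1, this reduces to the simple SBM.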

Assortative SBMs

Another consideration is if we want to model a network with a particular goal of grouping similar elements. For the karate club, there are probably members who are more casual about the situation (they’re just there to fight and go home, not to make friends), and so only have a couple of friends in the club. The simple SBM may conclude that all such persons belong to the same group even if there are hardly any connections in the group because they each have a comparatively small but similar number of friends. This is because the SBM is not designed to prioritize community detection. However, work has been done to extend the model to improve its usefulness in clustering tasks.

Assortativity is the property of nodes being partitioned into blocks in such a way that the edge density is high within a group and low between groups; finding such a partition is known as community detection. The assortative mixed-membership SBM (a-MMSB) is a special case of the MMSBM we looked at in the previous section. The model includes a parameter for “community strength”, $\beta_i \in (0, 1)$, for each group, which represents how closely the nodes in the group are linked.
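As a rough sketch of what this means for the block matrix (an assumption based on the usual presentation of the a-MMSB, with $\epsilon$ a hypothetical small constant not defined in this post), the community strengths sit on the diagonal and all between-group probabilities share one small value:

$\mathbf{C}_{ij} = \begin{cases} \beta_i, & \text{if}\ i=j \\ \epsilon, & \text{if}\ i \neq j \end{cases}$

With $\epsilon$ much smaller than every $\beta_i$, edges are far more likely within a group than between groups, which is the assortative structure described above.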

Learn More

A really good review of

Lead (probably not) in my water: zero-inflated models

danielle-notice — Mon, 31 Jan 2022 08:00:00 +0000

Reading Time: 5 minutes

It’s interesting that there are some problems that the younger generation, if they even know the problems exist, assume have been dealt with completely. That’s what I thought about lead piping. Yet the University of Edinburgh and Scottish Water have ongoing research related to this.

It became relatively common knowledge in the 1970s that lead is dangerous. However before its harmful health effects were discovered, lead was commonly used in water pipes. In Scotland, the water supplies do not naturally have high lead levels. Since the banning of lead pipes in 1969, Scottish Water has worked to remove lead pipes from the mains distribution system although some pipes carrying water to customers’ houses may still be made of lead and require replacement. Additionally, properties built before 1970 are at higher risk of containing internal lead piping or tanks and having contaminated tap water.

What is the problem?
What is count regression?
Why a zero-inflated model?
Other considerations

1. So just find and replace all the pipes…

In a world without limitations, Scottish Water would visit every household built before 1970 and test their tap water for lead-contamination. However that is not possible in reality. So instead it makes sense to model which areas (for example postcodes) have more houses with lead-contaminated water so that sampling efforts can be focused there. The goal is to identify the possible factors which increase the risk of more households in a postcode returning water samples with lead concentration greater than 1μg/L.

Data

To look at this problem, we consider data containing 308 observations, each representing a different postcode. The variables included are:

Location related variables: water operational area, region, postcode area and district; the coordinate location of the postcodes.
Scottish Water data: an indicator variable for if the postcode’s water is phosphate dosed and the dosage measured; an indicator variable for if old supply pipes have been replaced*.
Census data: the 2011 census households count; the Urban Rural classification code (2-fold and 8-fold).
Property Age related variables
Presence of lead: number of households sampled, number of samples with lead concentration > 1μg/L (lead presence count).

Pairwise correlation between numeric variables in the data

Map of Scotland showing locations that were sampled

2. What is count regression?

Simple regression has a response variable that can take any real value. The models we want to use need to account for the fact that the data are non-negative integers. There are some regression models specifically for count data.

Poisson Regression

Naturally, when doing any model testing, we start with the simplest model, which for count data is Poisson regression. The response variable conditional on the regressors is Poisson distributed, with its mean parameter connected to the regressors and parameters by the exponential mean function.

Despite the name of the model, it is not necessary for the response variable to be Poisson distributed (marginal vs conditional distribution). However, for valid statistical inference it is necessary for the conditional mean and variance to be equal (as is expected for the Poisson distribution).
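In symbols, one way to write this down (the notation $y_i$, $\mathbf{x}_i$, $\beta$ and $\mu_i$ is introduced here for illustration and is not taken from the original analysis) is, for postcode $i$ with lead presence count $y_i$ and regressors $\mathbf{x}_i$:

$y_i \mid \mathbf{x}_i \sim \text{Poisson}(\mu_i), \quad \mu_i = \exp(\mathbf{x}_i^\top \beta),$

$E[y_i \mid \mathbf{x}_i] = \text{Var}[y_i \mid \mathbf{x}_i] = \mu_i.$

The exponential keeps $\mu_i$ positive whatever the value of the linear predictor.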

Negative Binomial Regression

When the conditional variance exceeds the conditional mean, the data are said to be overdispersed. The negative binomial model is a standard approach to address overdispersion. It includes a dispersion parameter, $\alpha$, which is also estimated. Similar to the Poisson model, the data do not need to follow the negative binomial distribution as long as the mean and variance are correctly specified.
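A quick, informal way to see overdispersion is to compare the sample mean and variance of the counts. The sketch below ignores the regressors entirely (so it looks at the marginal rather than the conditional distribution) and assumes the common parameterisation in which the variance is $\mu + \alpha\mu^2$; the post does not say which parameterisation was used. The function name and the toy counts are made up and are not the Scottish Water data.

```python
from statistics import mean, variance

def dispersion_summary(counts):
    """Return (mean, sample variance, moment estimate of alpha) for a list of counts.

    Assumes the variance function mu + alpha * mu**2, so a rough moment
    estimate is alpha = (variance - mean) / mean**2, floored at zero.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    if m == 0:
        return m, v, 0.0
    alpha_hat = max(0.0, (v - m) / m**2)
    return m, v, alpha_hat

# Hypothetical toy counts with many zeroes and one large value
toy_counts = [0, 0, 0, 1, 2, 9]
m, v, alpha_hat = dispersion_summary(toy_counts)
print(f"mean={m}, variance={v}, alpha_hat={alpha_hat:.2f}")
```

For these toy counts the variance (12.4) is far above the mean (2), which is the pattern that would push us from a Poisson towards a negative binomial model. A proper analysis would estimate $\alpha$ jointly with the regression coefficients.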

3. Why a zero-inflated model?

Let’s take a step back and think about the problem for a moment. The reason many people don’t know lead piping is still an issue (aside from youthful ignorance) is that well… it’s been dealt with for the most part. So in reality many of the postcodes won’t have any households with lead contaminated samples.

This is a situation where count data may still have more zeroes than predicted by a parametric model, even when using a distribution like the negative binomial. In a zero-inflated count model, the processes generating the zero counts and the positive counts are not constrained to be the same.

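With $f_1$ denoting the probability mass function of the count model and $\pi$ the proportion of extra zeroes, the zero-inflated model specifies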

$Pr[y=j] = \begin{cases} \pi + (1-\pi)f_1 (0), & \text{if}\ j=0 \\ (1-\pi)f_1 (j), & \text{if}\ j>0 \end{cases}$

The model is a mixture of a count model and a probability mass function degenerate at zero. The proportion of zeroes added ($\pi$) may be determined by a binary outcome model or be set to a constant.
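The mixture above is easy to turn into code. The following is a minimal sketch with $\pi$ held constant; the function names are hypothetical, the negative binomial uses the parameterisation with mean $\mu$ and variance $\mu + \alpha\mu^2$ (an assumption), and the parameter values are invented for illustration.

```python
import math

def poisson_pmf(j, mu):
    """Poisson probability of count j with mean mu > 0, computed on the log scale."""
    return math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))

def negbin_pmf(j, mu, alpha):
    """Negative binomial probability of count j with mean mu and variance mu + alpha * mu**2."""
    r = 1.0 / alpha
    p = r / (r + mu)
    log_pmf = (math.lgamma(j + r) - math.lgamma(r) - math.lgamma(j + 1)
               + r * math.log(p) + j * math.log(1.0 - p))
    return math.exp(log_pmf)

def zero_inflated_pmf(j, pi, base_pmf):
    """Pr[y = j] for a zero-inflated model with extra-zero proportion pi and base pmf f_1."""
    base = base_pmf(j)
    if j == 0:
        return pi + (1 - pi) * base
    return (1 - pi) * base

# Hypothetical parameter values, chosen only to show the effect on Pr[y = 0]
pi, mu, alpha = 0.4, 1.5, 2.0
print("Poisson      Pr[y=0]:", poisson_pmf(0, mu))
print("ZI Poisson   Pr[y=0]:", zero_inflated_pmf(0, pi, lambda j: poisson_pmf(j, mu)))
print("ZI neg. bin. Pr[y=0]:", zero_inflated_pmf(0, pi, lambda j: negbin_pmf(j, mu, alpha)))
```

Passing the base pmf in as a function mirrors the formula: the zero-inflation layer is the same whether $f_1$ is Poisson or negative binomial.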

4. Other considerations

To account for some of the variation related to location, random effect terms were added to each of the models selected. For variables like these, random effects were more appropriate than fixed categorical variables.

If you read the first section closely, you’ll remember that included in the data was an indicator variable for if old pipes have been replaced. And you’d think, well there you go! However this actually wasn’t included in the modelling at all. That’s because it wasn’t clear if the sample was taken before or after the supply pipe was replaced. So we couldn’t really infer whether any contamination recorded was already dealt with or if it was caused by internal lead pipes.

5. Conclusion

So in the end, I chose a zero-inflated negative binomial model with a random effect. This model handled both the overdispersion and the zero-inflation in the data. It is also more generalisable than the zero-inflated Poisson model. For example, if more extensive sampling is done, the data may be more or less overdispersed than the current sample and a Poisson model would not account for that.

The model identified 2 factors that have a large impact on the risk of a postcode not being free of lead contamination: whether or not a postcode receives its water from a water treatment works (WTW) which conducts orthophosphate dosing, and whether a postcode is in urban or rural Scotland. By considering these 2 major factors, further sampling can be done in postcodes that are classified as urban and are not orthophosphate dosed.

Learn more

Scottish Water –
Cameron, A. Colin, and Pravin K. Trivedi. 2013. . 2nd ed. Econometric Society Monographs. Cambridge University Press.
R package –