Statistics – Matthew Darlington

STOR-i Masterclass: Professor Peter Frazier
Fri, 20 Mar 2020

This week we had the last masterclass of the year, given by Professor Peter Frazier from Cornell University. Due to the uncertainty caused by the coronavirus outbreak around the world, Peter was unfortunately not able to visit Lancaster in person. Making the most of the situation, we were still able to interact over the internet, and it made for an interesting half a week.

Peter’s areas of expertise are operations research and machine learning, and he gave us an introduction to Bayesian optimisation, specifically how we could implement it ourselves in the programming language Python.

The problem motivating Bayesian optimisation is as follows: suppose we have a function which we wish to find a maximum of, but whose form we do not know. This is commonly thought of as having a “black box” into which we pass some inputs and get an output, without getting to see what happens inside. Thus we cannot simply differentiate the function as we might be used to. Furthermore, our evaluations of the function may be noisy; a common assumption is that a \( \mathcal{N} (0,1) \) error is added to each evaluation.

Credit: Ramraj Chandradevan

The way Bayesian optimisation tackles this problem is to estimate the function using a Gaussian process. Essentially, we maintain one function for our current estimate of the mean of the hidden function and another for its variance (our uncertainty). At each time step we can make one more evaluation of the function, and with each extra data point the Gaussian process becomes a better and better approximation of the “hidden” function.
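
Here is a minimal sketch of what this looks like in Python (not the code from the masterclass): it computes the posterior mean and variance of a Gaussian process at a grid of points, given a handful of noisy evaluations of a hidden function. The squared-exponential kernel, its length scale and the toy hidden function are all assumptions made purely for illustration.

import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    # Squared-exponential covariance between two sets of one-dimensional points
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise_var=1.0):
    # Posterior mean and variance of a zero-mean GP at x_new, given noisy
    # observations y_obs = f(x_obs) + N(0, noise_var)
    K = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf_kernel(x_new, x_obs)
    mean = K_s @ np.linalg.solve(K, y_obs)
    cov = rbf_kernel(x_new, x_new) - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + x                 # the "hidden" function (illustration only)
x_obs = rng.uniform(0, 2, size=5)
y_obs = f(x_obs) + rng.normal(0, 1, size=5)     # evaluations with N(0, 1) noise

x_grid = np.linspace(0, 2, 101)
mean, var = gp_posterior(x_obs, y_obs, x_grid)
print(x_grid[np.argmax(mean)])                  # current guess at the maximiser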

The complicated part is choosing where to make the next evaluation. There are two concepts which need to be balanced: exploration and exploitation. Exploration relates to wanting to check over the whole range of the function, not leaving any area undiscovered. Exploitation means that if one area is looking better than the others, we want to focus our search there to find the global maximum. The methods we learnt deal with this by using functions that take both factors into account, called acquisition functions. We then simply choose the maximiser of the acquisition function as the next location to evaluate.
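
One widely used acquisition function is expected improvement, which rewards points whose posterior mean is high (exploitation) and points whose posterior uncertainty is large (exploration). The sketch below is purely illustrative — the posterior mean and standard deviation are made-up arrays standing in for the output of a fitted Gaussian process:

import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_so_far):
    # EI for maximisation: E[ max(f(x) - best_so_far, 0) ] under a normal posterior
    std = np.maximum(std, 1e-12)              # guard against zero variance
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * norm.cdf(z) + std * norm.pdf(z)

x_grid = np.linspace(0, 2, 101)
mean = np.sin(3 * x_grid)                     # pretend GP posterior mean
std = 0.5 * np.ones_like(x_grid)              # pretend GP posterior standard deviation
best_so_far = 0.8                             # best (noisy) value observed so far

ei = expected_improvement(mean, std, best_so_far)
x_next = x_grid[np.argmax(ei)]                # where to evaluate the black box next
print("Next evaluation at x =", x_next)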

If you would like to learn more about the topic, Peter has an excellent set of resources on his website.

STOR-i Masterclass: Professor Brendan Murphy
Fri, 21 Feb 2020

Last week we had the first of this year's STOR-i masterclasses, given by Professor Brendan Murphy from University College Dublin. He introduced us to model-based clustering and classification. I hope to give a brief insight into his interesting talks over the two days.

The goal of cluster analysis is to place objects into groups in a way that supports meaningful analysis. The idea is to form groups whose members share something in common with each other and differ from the members of other clusters.

Clustering as a concept has been around for millennia. Plato was the first to formalise the thinking with his “Theory of Forms”, and Aristotle classified animals into groups based on their characteristics in his “History of Animals”.

Much later, Linnaeus began to cluster plants into hierarchical groups in his works “Species Plantarum” and “Systema Naturae”. He used features such as whether the plants had flowers, and the number of stamens, to divide them up into 24 different classes.

Brendan then explained how more recent clustering algorithms could be coded up on a computer to distinguish between different vectors of numbers. The masterclass was finished off with a live demonstration of how we could cluster runners in the 24 Hour World Championship of running, which led into thinking about some open questions in the area of clustering and classification.

Putting the methods into practice

I took the methods learnt in the masterclass and tried to apply them myself to cluster the pixels of an image. If we think of a picture as a collection of pixels with red, green and blue coordinates plotted in a three-dimensional space, then we can cluster these points into \(k\) groups and use this to compress the size of the image. Working in Python, I used this to cluster a picture of some of the MRes cohort.
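
Here is a minimal sketch of the idea, using scikit-learn's KMeans rather than a hand-written algorithm; the file name photo.jpg is just a placeholder for whichever (RGB) image you want to compress:

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

k = 8                                           # number of colour clusters
img = np.asarray(Image.open("photo.jpg"))       # placeholder file name, assumed RGB
h, w, _ = img.shape

pixels = img.reshape(-1, 3).astype(float)       # one row per pixel: (R, G, B)
km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(pixels)

# Replace every pixel by the centre of its cluster, so only k colours remain
compressed = km.cluster_centers_[km.labels_].reshape(h, w, 3)
Image.fromarray(compressed.astype(np.uint8)).save("photo_compressed.png")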

If you would like to find out more about clustering, I would recommend looking at some of Brendan's work in the field, including the book:

Bouveyron, C., Celeux, G., Murphy, T. B. and Raftery, A. E. (2019). Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge University Press. doi:10.1017/9781108644181.

Random Graphs
Fri, 31 Jan 2020

A graph is a pair \( G = (V, E) \) where \(V\) is a set whose elements represent the vertices of the graph and \(E\) is a set of pairs of vertices that represent the edges of the graph. This is better explained with an example:

An example of a graph on 4 vertices

Here we have \(V = \{1,2,3,4\} \) and \( E = \{ \{1,3\} , \{1,4\}, \{2,3\} , \{2,4\} , \{3,4\} \} \).

With this basic framework we can generate a random graph; the simplest model for doing so is the Erdős–Rényi model. The method is simple: first decide on the number of vertices \( |V|= n \) and a value \( p \in [0,1] \). Then every possible edge is included with probability \(p\), independently of every other edge.

Thus the indicator of a single edge existing is a \( \text{Bernoulli}(p) \) random variable. There are \( {n \choose 2} \) possible edges and so \( \mathbf{E} [ |E| ] = {n \choose 2} p \).
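
For example, with \( n = 50 \) and \( p = 0.5 \) (the largest case in the simulation below) we expect \( {50 \choose 2} \times 0.5 = 1225 \times 0.5 = 612.5 \) edges on average.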

Another interesting property we can look at is the expected number of triangles in the graph. A triangle exists if, as you would expect, there are three vertices which are all joined to each other by edges. Calculating this is simpler than it might first sound…

There are \( t = {n \choose 3} \) possible triangles in our graph of \(n\) vertices. Now define random variables \( X_1, \dots, X_t \) such that \( X_j = 1 \) if triangle \(j\) exists and \(0\) otherwise. Note that, by construction, these random variables are identically distributed (although not independent, since two triangles can share an edge); for the expectation this is all we need, so it does not matter how we assign the labels to the triangles. Then we have:

$$ \begin{align} \mathbf{E} [\text{no. of triangles}] &= \mathbf{E} \left[ \sum_{j=1}^t X_j \right] &&\text{(by definition)} \\ &= \sum_{j=1}^t \mathbf{E} [X_j] &&\text{(by linearity of expectation)} \\ &= t \, \mathbf{E} [X_1] &&\text{(identically distributed)} \\ &= tp^3 &&\text{(a triangle needs its 3 edges, each present with probability } p \text{)} \end{align} $$
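
For example, with \( n = 50 \) and \( p = 0.5 \) this gives \( {50 \choose 3} \times 0.5^3 = 19600 \times 0.125 = 2450 \) expected triangles.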

This seems a very simple result, so to verify it I ran a simulation in R.

You can use my code to see for yourself if you would like:

rm(list = ls())
require(igraph)
set.seed(1)

#### Model Parameters
n.vec = 1:50
p = 0.5
reps = 10

#### Function to create an Erdős–Rényi graph
erdos <- function(N, p, plots){

  E = choose(N, 2)
  edges = rbinom(E, 1, prob = p)
  
  A = matrix(0, nrow = N, ncol = N)
  A[lower.tri(A, diag=FALSE)] = edges
  A = t(A) + A
  
  graph = graph_from_adjacency_matrix(A, mode = "undirected")
  
  if (plots == TRUE){
    plot(graph)
  }
  
  return(list(graph = graph, triangles = length(cliques(graph, min=3, max=3))))
}

#### Simulation
clique.vec = rep(0,length(n.vec))
for (i in 1:length(n.vec)){
  for (j in 1:reps){
    clique.vec[i] = clique.vec[i] + erdos(n.vec[i], p, FALSE)$triangles
  }
  clique.vec[i] = clique.vec[i] / reps
}

plot(n.vec, clique.vec - choose(n.vec,3)*p^3, lwd = 1,  type = "l", xlab = "n", ylab = "Difference")
MCMCMC
Fri, 17 Jan 2020

Over Christmas I was set the task of writing a report about something that interested me during my first term of the MRes. I chose to look into something called Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC). This sounds like a mouthful but once broken down isn’t too scary.

I will start off by giving a brief explanation of what Markov Chain Monte Carlo (MCMC) is. Imagine you have a probability density function \( \pi(x) \) and you need to take random samples from it, but the function is too complicated for you to do this by hand. What MCMC does is simulate taking random draws by constructing a random walk whose moves take this complicated function into account. The simplest way to do this is called Random Walk Metropolis (RWM).

We start off with some initial position \( x \). Then we propose a new position by adding a normal sample to it, \( y = x + \mathcal{N}(0, h) \), where \(h\) is a step size the user can tune. There is then a Metropolis–Hastings acceptance step where we try to determine whether \(y\) is a sensible sample for our distribution. To do this we let \( A = \min \left( 1, \frac{\pi(y)}{\pi(x)} \right) \) and accept \( y \) as our new position with probability \( A \); otherwise we stay at \( x \). This is then repeated to get as many samples as you wish.
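
Here is a minimal sketch of RWM in Python (not the code from my report); the target is a two-component normal mixture chosen purely for illustration, and \(h\) is treated as the variance of the proposal:

import numpy as np

rng = np.random.default_rng(0)

def pi(x):
    # Illustrative bimodal target: equal mixture of N(-5, 1) and N(5, 1) (unnormalised)
    return np.exp(-0.5 * (x + 5) ** 2) + np.exp(-0.5 * (x - 5) ** 2)

def rwm(pi, x0=0.0, h=1.0, n_steps=10000):
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        y = x + rng.normal(0, np.sqrt(h))      # propose y = x + N(0, h)
        if rng.uniform() < pi(y) / pi(x):      # accept with probability min(1, pi(y)/pi(x))
            x = y
        samples[t] = x                         # if rejected we stay at x
    return samples

samples = rwm(pi)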

Of course, the downside of such a simple method is that it only works well in certain cases. A multi-modal distribution is something that makes RWM perform very badly. By perform badly I mean that the walk tends to get stuck in one mode, so we get very dependent samples, which is not desirable if we would like to use them for any kind of analysis of the distribution.

This motivates the need for more complicated methods, and one that works well in this scenario is MCMCMC. We now take multiple probability density functions \( \pi_1(x), \pi_2(x), \dots, \pi_m(x) \) and run RWM on each of these separately. We pick \( \pi_1(x) = \pi(x) \), and each of the other distributions is a progressively more ‘smoothed out’ version of the function.

Examples of functions we could take if the top left was our target

The clever part, though, is that after each iteration we introduce a Metropolis–Hastings step that can swap the positions of two of the chains. This allows us to get much better movement with our random walk on \( \pi(x) \), as I will show with the results I obtained:

Results with MCMCMC on the left and RWM on the right

The top graphs show a histogram of the simulated results with the true distribution in red over the top. From this we can see that we are getting a much better representation with MCMCMC. The second graphs show what is called an ACF plot, which measures the correlation of the data with a lagged version of itself. We want this to decay to 0, since then we are getting what appear to be independent results, and MCMCMC gets much closer to this than RWM. The last graphs show the random walk for the first 300 steps. What we see is that MCMCMC moves between the two peaks whereas RWM gets stuck in the right peak.
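
To make the mechanics concrete, here is a minimal Python sketch of MCMCMC (not the code from my report). It assumes the ‘smoothed out’ chains come from tempering — raising \( \pi(x) \) to the power \( 1/T_j \) for increasing temperatures \( T_j \) — which is one common choice; after each sweep of RWM updates, a Metropolis–Hastings step proposes swapping the states of two neighbouring chains.

import numpy as np

rng = np.random.default_rng(0)

def pi(x):
    # Illustrative bimodal target: equal mixture of N(-5, 1) and N(5, 1) (unnormalised)
    return np.exp(-0.5 * (x + 5) ** 2) + np.exp(-0.5 * (x - 5) ** 2)

def mcmcmc(pi, temps=(1.0, 4.0, 16.0), h=1.0, n_steps=10000):
    m = len(temps)
    x = np.zeros(m)                              # current position of each chain
    samples = np.empty(n_steps)
    for t in range(n_steps):
        # One RWM update per chain, chain j targeting pi(x) ** (1 / T_j)
        for j in range(m):
            y = x[j] + rng.normal(0, np.sqrt(h))
            if rng.uniform() < (pi(y) / pi(x[j])) ** (1.0 / temps[j]):
                x[j] = y
        # Propose swapping the positions of two neighbouring chains
        j = rng.integers(m - 1)
        ratio = (pi(x[j + 1]) / pi(x[j])) ** (1.0 / temps[j] - 1.0 / temps[j + 1])
        if rng.uniform() < ratio:
            x[j], x[j + 1] = x[j + 1], x[j]
        samples[t] = x[0]                        # keep samples from the untempered chain
    return samples

samples = mcmcmc(pi)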

If you would like to read more about this you can read my Christmas report here:

And if you would like to discuss MCMCMC or any other blog on my website please feel free to contact me using the form below.

STOR-i Conference 2020
Fri, 03 Jan 2020

Last week I got to start off the new year at Lancaster by attending the annual STOR-i conference. I was able to sit through a day and a half of talks from academics from around the UK as well as Europe and America. My first blog will be about my favourite talk from this year’s conference, ‘Multi-armed bandit problems with history dependent rewards’ by Ciara Pike-Burke, a STOR-i alumna.

What is a Multi-armed bandit?

Suppose you have a row of \(n\) slot machines in front of you, each with a fixed reward structure. You are given £\(H\) to play the machines, each costs £\(1\) to play, and your objective is to take home the largest winnings possible! How to tackle this problem is not as simple as it might first look, and there have been many different approaches to coming up with a strategy that optimises the decision process.
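
To give a feel for what a strategy looks like, here is a minimal sketch of one classical approach, epsilon-greedy, on a made-up set of machines (my own illustration, not anything from the talk): most of the time we play the machine with the best average payout so far, and occasionally we play a random one to keep exploring.

import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.3, 0.5, 0.7])     # hidden expected payout of each machine
H = 1000                                    # number of £1 plays we can afford
eps = 0.1                                   # fraction of plays spent exploring at random

n_arms = len(true_means)
counts = np.zeros(n_arms)                   # times each machine has been played
totals = np.zeros(n_arms)                   # total payout from each machine
winnings = 0.0

for t in range(H):
    if counts.min() == 0 or rng.uniform() < eps:
        arm = rng.integers(n_arms)                 # explore: pick a machine at random
    else:
        arm = int(np.argmax(totals / counts))      # exploit: best average payout so far
    reward = rng.normal(true_means[arm], 1.0)      # noisy payout from the chosen machine
    counts[arm] += 1
    totals[arm] += reward
    winnings += reward

print("Total winnings:", round(winnings, 2))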

What is the point?

This may sound like a rather abstract problem, but in fact bandits have important applications, and one of the most widely used is in advertising. Say that instead of being sat at a row of slot machines you are an internet advertiser deciding which advert to display to a customer. The reward in this case could be whether the customer clicks, or how much they go on to spend via the advertised link.

What are history dependent rewards?

This approach to internet advertising does not apply to all kinds of products that someone might wish to advertise. The example given was sofas: I may go and buy a sofa from an advert and thus make it appear that I am interested in sofas. The traditional policies for bandits would then make it more likely that I see the same advert the next time I am shown one. The point is that I now have a sofa and do not want to see this advert in the immediate future. Hence the importance of having history dependent rewards.

This is an example of a periodic reward, as the desire for a sofa increases and decreases over time. There are other kinds of history dependent rewards, the simplest being strictly increasing or decreasing rewards, for example representing loyalty to a company. A more complex reward structure could be coupon rewards. Think of when you go to your favourite café and get stamps on your loyalty card, and after so many visits you get a free coffee. This is essentially the set-up, but now we do not know the number of stamps we need to collect or the prize we get at the end.

How can we solve this?

We can try to predict what the rewards from all the arms will look like over some small future time interval and use these predictions to plan our next best move. For example, if a customer has just clicked on an advert for a sofa, I might want to wait a while before showing that advert again and instead show adverts for coffee tables. In the presentation we were told how you could use something called a Gaussian process to predict the reward function. Given these predictions we can then optimise our next \(d\) steps in order to achieve maximum profit. Obviously if we could plan over every decision we would ever make then this would be optimal, but in practice a computer cannot do this, so we have to settle for a small fixed horizon.
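
As a toy illustration of the flavour of this (and emphatically not the method from the talk), suppose the recovery curves were simply known; a one-step greedy rule would then just play whichever arm currently has the highest history-dependent expected reward. The real approach instead learns these curves with a Gaussian process and plans several steps ahead.

import numpy as np

rng = np.random.default_rng(0)

def expected_reward(arm, gap):
    # Hypothetical recovery curves: an arm's expected reward grows back
    # the longer it has been since that arm was last played
    base = [0.4, 0.6, 0.8][arm]
    recovery_time = [5.0, 10.0, 20.0][arm]
    return base * (1 - np.exp(-gap / recovery_time))

n_arms, horizon = 3, 200
last_played = np.full(n_arms, -np.inf)      # every arm starts fully recovered
total = 0.0

for t in range(horizon):
    gaps = t - last_played                  # time since each arm was last played
    arm = int(np.argmax([expected_reward(a, gaps[a]) for a in range(n_arms)]))
    total += expected_reward(arm, gaps[arm]) + rng.normal(0, 0.1)
    last_played[arm] = t

print("Total reward:", round(total, 2))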

References:

Pike-Burke, C. and Grünewälder, S. (2019). Recovering Bandits. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
