Rebecca Hamm, MRes Student at STOR-i Centre for Doctoral Training

This Week on the STOR-i Programme: Bayesian Optimization
Sun, 18 Apr 2021

At STOR-i we are currently looking through our potential PhD projects. For each project we have been given a page summarizing the topic, with some papers listed at the bottom. As I was looking through these papers I came across one that particularly interested me, in an area I knew very little about. I decided a good way to form a deep understanding of this paper would be to write a blog post on it. This way you can learn something too.

The paper in question is called . You may remember that in a previous blog post I explored a Bayesian approach to a multi-armed bandit problem, and in another I looked at a heuristic approach to an optimization problem. Well, today (or whichever day you decide to read this) we are looking at a Bayesian approach to an optimization problem.

So let's outline the situation. We have a function f(x) whose maximum we would like to find. However, we do not know the structure of the function, and it is expensive to evaluate at any given point. Essentially we have a black box to which we give inputs (our x values) and receive outputs (f(x) values). Since evaluating the function is expensive, we can only look at a limited number of points, so we have to decide which points to evaluate in order to find the maximum.

So how do we do this? First we fit a Bayesian model to our function. We can then use this model to formulate something called an acquisition function. The point at which the acquisition function is highest is the point we evaluate next.

Gaussian process regression is used to fit the Bayesian model. We suppose that the f values at some set of x points are drawn at random from a prior probability distribution. The Gaussian process takes this prior to be a multivariate normal with a particular mean vector and covariance matrix. The mean vector is constructed by applying a mean function at each x. The covariance is constructed using a kernel, which is formulated so that two x's close together have a large positive correlation; this reflects the belief that nearby x's will have similar function values. The posterior mean is then a weighted average of the prior mean and an estimate made from the data, with weights depending on the kernel. The posterior variance is the prior covariance at that point minus a term corresponding to the variance removed by the observed values. The posterior distribution is again multivariate normal.

Taken from the paper

Illustrated above is an estimate of the function f(x) (solid line); the dashed lines show Bayesian credible intervals (similar to confidence intervals), with observed points in blue. Using this we can form an acquisition function such as:

Taken from the paper

This function tells us which point of the function to evaluate next: the point at which it attains its maximum. There will be a balance between choosing points where we believe the global optimum lies and points with large amounts of variance.

There are many different types of acquisition function. The most commonly used is known as the expected improvement. In this case we assume we can provide only one solution as the maximum of f(x), so if we had no evaluations left we would report the largest value evaluated so far. If we had just one more evaluation, our solution would remain the same if the new evaluation were no larger than the current largest, but if it were larger, it would become our new solution. The improvement is therefore the new evaluation minus the previous maximum if this difference is positive, and zero otherwise. When we choose our next evaluation we would like to choose the one that maximizes improvement. It is not quite that simple, as by the nature of the problem we do not know the value of an evaluation until we have chosen and evaluated that point. This is where the Bayesian model comes in: we can use it to compute the expected improvement at each point and then choose the point with the largest expected improvement. This favours points with a high posterior standard deviation as well as points with a large posterior mean, so there is a balance between choosing points with high promise and points with large amounts of uncertainty.
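To make this concrete, here is a rough Python sketch of the expected improvement calculation (the function name and the numbers in the example are my own, not taken from the paper): given the posterior mean and standard deviation at a candidate point and the best value observed so far, we can compute its expected improvement in closed form.

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, f_best):
    """Expected improvement of a candidate point whose value has a
    Gaussian posterior N(mu, sigma^2), given the best observation so far."""
    if sigma <= 0:
        # No uncertainty left: improvement is known exactly
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    # Standard normal CDF and PDF
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)
    return (mu - f_best) * Phi + sigma * phi
```

In a full Bayesian optimization loop you would compute this over a set of candidate x's (using the Gaussian process posterior mean and standard deviation at each) and evaluate f at the candidate with the largest expected improvement.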

This method can be applied to many problems, including robotics, experimental particle physics and materials design. This paper explains the application of Bayesian optimization to the development of pharmaceutical products. To learn more I'd advise reading the paper I used for this blog, as well as this one, which discusses constrained Bayesian optimization and its applications.

This Week on the STOR-i Programme: Memes
Sun, 04 Apr 2021

Before you get excited, it's not the type of memes you are thinking of. I do apologize; however, I struggled to find a link between online memes and Statistics and Operational Research, except for maybe this one:

If you are disappointed, my friend Robyn includes a tweet of the week at the end of each of her blog posts, where I'm sure you'll find plenty of memes. What we are actually going to talk about is the concept of memes in heuristics used for optimization. This week, as part of the MRes course, I have been writing a report on that very topic. I thought it was interesting, so I decided to share it with you.

Firstly, what even is optimization? A previous MRes (now PhD) student at STOR-i covered this in a similar blog on heuristics in optimization. But if you are too lazy to click that link, then basically we wish to minimize or maximize a function subject to some constraints. A very famous example of this is the travelling salesman problem. In this problem, as you may have guessed, we have a travelling salesman who wishes to visit each city exactly once while incurring the minimum cost or distance. So the problem may look a bit like this:

There are many exact methods that will produce the correct solution. However, as the problem grows (in the number of cities), the time these methods take increases so rapidly that eventually there is no way you could wait for an answer. In response to this, heuristic methods are used. One popular heuristic is the genetic algorithm. As you can probably tell from the name, it is based on genetics, particularly how genes are passed on.

The algorithm starts with a set of individuals, each with a genetic code (the order in which they visit the cities). Some of the individuals are then selected to be parents. Usually the best individuals are selected (those with the least cost/distance), but occasionally there is some randomization. The parents then "reproduce", meaning there is a crossover between two parents. This can happen in different ways for different problems; in particular, there are multiple ways to do this for the travelling salesman problem. Today we will discuss one called the Very Greedy Crossover. In this crossover an initial city is chosen randomly, then the next city is chosen from the cities adjacent to the current one in the two parents: among the unvisited adjacent cities, the one closest to the current city is chosen. This process is repeated until all cities have been chosen. For example:

These resulting "children" are then mutated with some probability. Again, there are many ways to do this; one is simply to swap the positions of two cities. The new individuals make up the next population, and the process is repeated until some condition is met: perhaps a solution is found that is considered good enough, or after a set number of iterations the best individual so far is returned.
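As a rough sketch (my own toy implementation, assuming cities are numbered and distances live in a matrix, which is not code from the paper), the Very Greedy Crossover and swap mutation described above might look like this in Python:

```python
import random

def tour_length(tour, dist):
    """Total length of a tour, returning to the start city at the end."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def adjacent(city, parent):
    """The two cities next to `city` in a parent's tour (cyclically)."""
    i = parent.index(city)
    return {parent[i - 1], parent[(i + 1) % len(parent)]}

def very_greedy_crossover(p1, p2, dist, start=None):
    """Build a child tour: repeatedly move to the nearest unvisited city
    among those adjacent to the current city in either parent."""
    current = start if start is not None else random.choice(p1)
    child = [current]
    unvisited = set(p1) - {current}
    while unvisited:
        candidates = (adjacent(current, p1) | adjacent(current, p2)) & unvisited
        if not candidates:            # fall back to the nearest unvisited city
            candidates = unvisited
        current = min(candidates, key=lambda c: dist[child[-1]][c])
        child.append(current)
        unvisited.remove(current)
    return child

def swap_mutation(tour, prob=0.1):
    """With some probability, swap the positions of two random cities."""
    tour = tour[:]
    if random.random() < prob:
        i, j = random.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour
```

A full genetic algorithm would wrap these in a loop: select parents, cross them over, mutate the children, and repeat until a stopping condition is met.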

I understand that by now you have either completely forgotten this blog post was supposed to be about memes, or you are understandably frustrated with me for not getting to the point. But don't worry, I'm doing it now. Dawkins coined the term "meme" to mean an analogue of the gene in the context of cultural evolution: ideas and behaviours that spread within our culture, or image-based jokes that spread through the internet. This idea has been adapted so that we can include memes within the genetic algorithm, creating what are known as memetic algorithms.

"How do we do this?" you may be asking. I hope you are, because I'm about to answer that very question. It's simple: after the mutation stage we add what is called a local search. These are algorithms that move from solution to solution hoping to find an optimal one. An example is Tabu search. Tabu search keeps a memory of where it has previously been, to avoid cycling through the same solutions and getting trapped in local optima. A local optimum is the best solution in its immediate area, which does not mean it is the best solution overall; searches get trapped in local optima because a solution can appear to be the best when other areas are not searched. For example, the neighbourhood of a solution may be all the solutions that can be found by removing a city and placing it in a different position. The search then moves to the best solution in the neighbourhood. Since the best neighbour is chosen each time, a local optimum will be found, but better solutions that require first choosing a worse option may be missed. The tabu list remembers where the search has been and blocks recent moves, forcing the search to look at other solutions.
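Here is a minimal Python sketch of Tabu search for the travelling salesman problem, using a swap-two-cities neighbourhood; this is my own illustration rather than the paper's implementation, and the tenure and iteration counts are arbitrary choices of mine.

```python
from itertools import combinations

def tabu_search(dist, start_tour, iters=200, tenure=10):
    """Minimal Tabu search for the TSP: at each step move to the best
    neighbouring tour, forbidding recently used swaps for `tenure` steps."""
    def length(t):
        return sum(dist[t[i]][t[(i + 1) % len(t)]] for i in range(len(t)))

    current = start_tour[:]
    best, best_len = current[:], length(current)
    tabu = {}   # swap move -> iteration at which it becomes allowed again
    for it in range(iters):
        move, neighbour, neighbour_len = None, None, float("inf")
        for i, j in combinations(range(len(current)), 2):
            cand = current[:]
            cand[i], cand[j] = cand[j], cand[i]
            cand_len = length(cand)
            # Allowed if not tabu, or if it beats the best so far (aspiration)
            allowed = tabu.get((i, j), 0) <= it or cand_len < best_len
            if allowed and cand_len < neighbour_len:
                move, neighbour, neighbour_len = (i, j), cand, cand_len
        if neighbour is None:
            break
        current = neighbour
        tabu[move] = it + tenure   # forbid reversing this swap for a while
        if neighbour_len < best_len:
            best, best_len = neighbour[:], neighbour_len
    return best, best_len
```

Note that the search is allowed to move to a worse neighbour when nothing better is available; that, together with the tabu list, is what lets it escape local optima.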

Memetic heuristics are seen to be more effective and produce better results; however, this comes at a price. As these algorithms are more complicated, the computational time is longer. Hence, if you are looking for a quick answer I suggest you stick with a simpler heuristic, but if you have the time you should definitely give a memetic algorithm a go. The particular one I have outlined in this blog, another approach that solves the travelling salesman problem using a memetic algorithm with particle swarm optimization, more on the genetic algorithm and tabu search, a good paper detailing many nature-based heuristics, and further reading on memetic algorithms can all be found in the links.

A Swing and a Miss!
Sun, 21 Mar 2021

This blog is different from my others in that it has not been inspired by anything I have recently studied on my MRes course, but by a book I have read. In fact, this book is partly what inspired me to do a PhD in the first place. It was during the first lockdown; I had just finished my degree, and although I had plenty of hobbies and interests I was unsure of what path to take, let alone what my next step was. I found myself spending most of my days chasing the sunny parts of my garden while reading this book. It was while reading about all these different mathematical problems and how their solutions were found that I realised that was what I wanted to do. That, or be a comedic writer, but as you may have noticed I don't quite have the knack for that. I guess by this point you are wondering what book could have inspired me so much. Well, here it is:

I appreciate that it may not be the most intellectual mathematics book on the market, but I promise you it is clever, and if like me you are a fan of both The Simpsons and mathematics, you should read it!

I'm not here to tell you about the whole book; I just want to talk about one area of mathematics it touches on. (I'm not actually here at all, you are reading this through a computer screen.) This topic appears not only in The Simpsons episode 'MoneyBart' but also in the film Moneyball (which I also recommend you watch).

It is, of course, statistics in baseball. The game of baseball is said to have been revolutionized by a man named Bill James. Bill James wasn't a baseball player, or even a coach or manager; in fact, many of his revolutionary ideas came about while he was working as a security guard at a pork and beans cannery.

So how did he revolutionize baseball? You've probably realized by now that he did it using statistics. Bill James noticed that the statistics being used to evaluate players weren't appropriate, or were being misunderstood. For example, fielders were evaluated on the number of errors they made. This seems fair, until you look at what counts as an error. A fast fielder who gets to the ball quickly but fumbles the catch would be charged with an error, while a slow fielder who never gets near the ball would not. Obviously you would prefer the fielder who gets close to the ball, so using this statistic to pick a team would be a mistake.

Bill James developed a statistic more relevant for evaluating a fielder's performance: Range Factor. The range factor is calculated by dividing the sum of assists and putouts by the number of innings or games played. In case you are not familiar, a putout is credited to a player when they record an out, which can happen in several ways, and an assist is credited to a player who touches the ball prior to a putout. Instead of counting errors, this statistic shows how often a player does something well, which is easier to observe, making it a better evaluation of a player.

Another analytic devised by Bill James is the Pythagorean expectation. It is used to estimate the number of games a team should have won, given the runs they scored and the runs they allowed, and also to make predictions and to identify which teams are over- or under-performing. The expected number of wins is the number of games played multiplied by the win ratio, where the win ratio is the square of runs scored divided by the sum of the squares of runs scored and runs allowed.

This analytic has not only been used in baseball but also basketball, ice hockey and American Football.
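As a quick illustration, both of Bill James's statistics are one-liners in Python (the numbers used below are made up for the example, not real team or player figures):

```python
def range_factor(putouts, assists, games):
    """Bill James's Range Factor: (putouts + assists) per game played."""
    return (putouts + assists) / games

def pythagorean_expectation(runs_scored, runs_allowed, games, exponent=2):
    """Expected wins: games * RS^k / (RS^k + RA^k), classically with k = 2."""
    win_ratio = runs_scored**exponent / (runs_scored**exponent + runs_allowed**exponent)
    return win_ratio * games

# A hypothetical team scoring 900 runs and allowing 700 over a 162-game season
expected_wins = pythagorean_expectation(900, 700, 162)
```

If the team's actual win total sits well above or below this figure, the Pythagorean expectation flags them as over- or under-performing.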

If you rushed off to watch Moneyball before finishing this blog, then you will know that in 2002 the Oakland A's used statistics to change how they played. This led to them winning 20 consecutive games, setting an American League record and changing the game of baseball. To learn more, read the book.

This Week on the STOR-i Programme: Rcpp
Sun, 07 Mar 2021

This week I thought I would write a slightly different blog post. Instead of discussing an area of Statistics or Operational Research I have recently learnt about, I will be discussing a useful tool for Statistics and Operational Research I have recently learnt about: the R package Rcpp.

What does Rcpp do?

Rcpp is a package which allows us to connect R and C++ code. In particular, I will discuss using it to develop functions in C++ which can then be called from R.

Why do we want to do this?

If you have ever used C++ and are like me, you will think it is quite painful to code in. Writing C++ can be a lot more time-consuming than writing R, and R is full of packages and functions which can make whatever maths you are doing a lot nicer. So if R is easier to write in, why develop our functions in C++? The answer is relatively simple: speed. R code is usually run by an interpreter, which executes one line at a time, whereas C++ is compiled, meaning all the code is processed at once. If you are doing a large task in R, interpreting one line at a time may take a while. Creating a compiled C++ function and calling it from R can speed up a program by a factor of 100 or more.

How do we do this?

There may be different things to consider depending on what you are doing with your function. But for now lets just talk about the basics. There are two ways to go about this, one is to source your C++ file and another is to create an R package of your C++. Either way we need to add a few lines to our C++ code.

At the beginning of your C++ code, include the Rcpp header with `#include <Rcpp.h>` (typically followed by `using namespace Rcpp;`).

The next line, `// [[Rcpp::export]]`, should be added just before each function that you wish to use from R. This is how Rcpp knows which functions to export.

Sourcing your C++

Sourcing C++ in R is simple. Open R in the directory where you have saved your C++ file, install and load the Rcpp package, then source your file with `Rcpp::sourceCpp("yourfile.cpp")`.

One issue is that you have to repeat this every time you start a new session, so if it's a function you will be using often it may be easier to make a package. This way you can also share it with other people much more easily.

Making a Package

First, in R, load Rcpp and build the skeleton of your package with `Rcpp.package.skeleton("mypackage")` (the package name here is just a placeholder).

Then, in your terminal, build and install the package with `R CMD build mypackage` followed by `R CMD INSTALL` on the resulting tarball.

You then have a package you can use in R.

Things to think about

Rcpp is good at converting common data structures between R and C++. However, if you have defined your own data structures you will run into some issues. One remedy is to have the inputs and outputs of your function be common data types and convert to your own data structure within the function.

Adding to your package

You may find that you want to edit or add to your package which you can do with the following simple steps:

Edit the file in your package or add a new file to your package folder.

Open R with your package as the working directory, load Rcpp and run `Rcpp::compileAttributes()` to regenerate the export wrappers.

Rebuild your package and you are good to go.

I hope you find this useful! There is plenty more to learn about the Rcpp package in its documentation and vignettes.

This Week on the STOR-i Programme: Fast Fashion and Multi-armed Bandits
Sun, 21 Feb 2021

This week, as part of the MRes course, we had to pick our next topic to write a report on. I was really stuck between two options but in the end had to choose one. I thought that if I'm not going to write a report on the other option, I could at least write a blog on it, and here we are. As you may have guessed from the title of this post, the option I didn't choose was multi-armed bandits. At the end of the talk on this area the lecturer, Kevin Glazebrook, mentioned some areas of particular study. One that caught my eye was fast fashion. Before deciding to do a Maths degree I wanted to be a fashion designer, or really any job in fashion. While I gave up that dream for my love of Maths, it is still an interest of mine, so I was very excited by the combination of the two areas.

Previously, clothing companies had to decide what products to sell in a season with very little information on where demand might lie. As you can imagine, this led to missed opportunities to sell popular goods, as well as excess supply of unwanted products. As technology has improved, especially manufacturing and transport, companies have been able to delay some of a season's production. This means they have more information on what is in demand that season and can produce and sell goods accordingly.

Now you may be thinking, that's very nice, but what's it got to do with maths? Well, I'll tell you. You may remember one of the focuses of this blog has been bandit problems. If you are picturing a slot machine, do not worry: in this case we are talking about problems known as multi-armed bandits. In these problems we have a series of time steps, and at each time step we have to make a decision (pull an arm). Before we pull an arm we are unsure whether it will help us achieve whatever it is we wish to achieve, but by pulling it we gain more information about it. The aim is to minimize the regret we incur from the arms we pull. To do this we have to balance two things: exploitation and exploration. We want to exploit any information we have from pulling arms previously, in order to pull arms which give us successful results, but we also want to explore all our options to ensure we have found the best arm.

So how does this relate to fast fashion? Suppose the company delays production so that it releases a new selection of goods at each of T time steps. At each time step t it has to choose which products to release in that selection. Picking a product to go in a selection is pulling an arm: by picking a product, the company observes how well it sells, and hence its demand, and can use this information at the next time step. To ensure we make the best decisions with the information available at each time step, we build a model.

So let's look at this model. We have a set of S different products to choose from. As there is limited space within a shop, we can only choose N of these products at each time step t. We assume that customers buy units of product s at an unknown constant rate ds; this demand rate is assumed constant, but it is only observed at times when the product is in the selection. To formulate the model we use some Bayesian statistics, which lets us incorporate prior beliefs about the parameters of our model. If you are not clued up on Bayesian statistics I suggest you take a peek at my earlier post. In this case our prior beliefs about ds are represented as a Gamma distribution with shape parameter ms and scale parameter as, both assumed positive and ms assumed to be an integer. Because the Gamma distribution is the conjugate prior here, the posterior is also a Gamma distribution, with shape parameter (ms+ns) and scale parameter (as+1), where ns is the number of units of product s sold in a selection period. So each time a product is selected, its posterior is updated by adding that period's sales ns to the shape parameter and 1 to the scale parameter. The intuition is that the shape parameter is the number of units sold over a number of periods equal to the scale parameter, so the expected sales of a product in one period is the shape parameter divided by the scale parameter. Decisions can then be made by choosing the options with the largest expected sales. This model balances exploration and exploitation: if a product sells well, ns, and hence the shape parameter and the expectation, will be larger, so that product is likely to be picked again. However, the more often a product is picked, the larger its scale parameter gets, which reduces the expectation, lowering the chances of frequently picked options and increasing the opportunities to explore other options.

If we simplify the problem so that at each time step we have to choose one of a pair of shorts, a skirt or a skort, the choices may go something like this:

The starting scale parameters are 16, 17 and 13 for the shorts, skirt and skort respectively, and the shape parameter is 1 for all three.
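Using those starting parameters, a sketch of the Gamma posterior update and selection rule might look like this in Python (my own toy code; the sales figure in the example is invented):

```python
def update(shape, scale, sales):
    """Gamma posterior update after one selection period:
    add the observed sales to the shape and 1 to the scale."""
    return shape + sales, scale + 1

def expected_demand(shape, scale):
    """Expected sales per period under the current Gamma posterior."""
    return shape / scale

def choose_products(params, n):
    """Pick the n products with the largest posterior expected demand.
    `params` maps product name -> (shape, scale)."""
    return sorted(params, key=lambda s: -expected_demand(*params[s]))[:n]

# Priors from the example: shape 1 for all, scales 16, 17 and 13
params = {"shorts": (1, 16), "skirt": (1, 17), "skort": (1, 13)}
picked = choose_products(params, 1)          # the skort has the highest expectation
params["skort"] = update(*params["skort"], sales=5)   # suppose it sells 5 units
```

Notice how the update captures the exploration/exploitation trade-off: the 5 sales raise the skort's shape parameter, but the extra period raises its scale parameter, pulling its expectation back towards the observed rate.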

To read the paper that formulated this model, and to learn more about how maths is used to learn about demand in fast fashion, follow the link. I hope this blog post gave you a little insight into maths being used in our everyday lives in a way you may not have previously thought about.

This Week on the STOR-i Programme: Network Modelling
Sun, 07 Feb 2021

This week it was time to choose which research topic, from the lectures I mentioned last week, we would like to write our first report on. The topic I have decided on is network modelling. While refreshing my memory on what was covered in the lecture, given by Chris Nemeth, I thought it would be a nice idea to give you some insight into the topic.

What is network modelling? It is statistical modelling using network data. But what is network data? I think the best way to explain is with examples: friendships on Facebook or other social media, voting similarity between politicians, connections in the brain, or even connections between people who have co-authored papers together.

Comparison of the brain with and without Alzheimer’s. From [Zajac et al., Brain Sci, 2017]
US Senate Voting Similarity. From [Moody & Mucha, “Portrait of Political Polarization”, 2013]

Before we go into the details let’s discuss the notation of a network. Here we have a basic network:

We can represent this network as a graph G=(V,E).

  • V is the set of vertices in the network. In this case we simply have V={1,2,3,4,5,6}. Depending on the network data, vertices may be named instead of just numbered.
  • E is the set of edges. Each edge, e=(u,v) represents a connection between two vertices u and v. For this graph we have E={(1,3),(1,4),(1,6),(2,3),(3,4),(5,6)}.

Different networks will require different information and so the graphs may look different:

Directed edges
Labelled vertices
Weighted edges
Labelled edges

Graphs can also be represented as adjacency matrices. The way I like to think of this matrix is that if there were horizontal and vertical lists of the vertices, then a slot of the matrix contains a 1 if there is an edge between the corresponding pair of vertices and a 0 if not. Here is what the adjacency matrix looks like for our example:

So, what can we do with this? By summing the rows we can calculate the degree of each vertex. We can also calculate the edge density, which is the number of edges (the sum of all the elements of the adjacency matrix divided by two) divided by the number of possible edges, as well as the average degree.

Using the adjacency matrix we can also look at the degree distribution, which gives the proportion of nodes with each degree. This is a simple probability mass function: the probability of a given degree is the number of nodes with that degree divided by the total number of nodes. Degree distributions show us how many neighbours nodes have and how this property varies over a network.

Distribution for our example
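For our example graph, the adjacency matrix and all the summaries above can be computed in a few lines of Python (my own sketch, using the edge set given earlier):

```python
from collections import Counter

edges = [(1, 3), (1, 4), (1, 6), (2, 3), (3, 4), (5, 6)]
n = 6

# Adjacency matrix: A[u-1][v-1] = 1 if {u, v} is an edge (undirected)
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = A[v - 1][u - 1] = 1

degrees = [sum(row) for row in A]                      # row sums give degrees
edge_density = (sum(map(sum, A)) / 2) / (n * (n - 1) / 2)
avg_degree = sum(degrees) / n
# Degree distribution: proportion of nodes with each degree
degree_dist = {d: c / n for d, c in Counter(degrees).items()}
```

For this graph the degrees come out as 3, 1, 3, 2, 1, 2, the edge density is 6/15 = 0.4, and the average degree is 2.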

So, now we have explained networks, it is time to look at the modelling. The particular model we will look at today is the Erdős–Rényi model. For a network, we have a graph G(n,p), where n is the number of nodes and p is the probability of there being an edge between any two nodes. The number of edges then follows a binomial distribution, with each edge added counted as a success. But how do we go about setting p? Real networks tend to be sparse: the number of friends you have on Facebook is a very small proportion of the total number of people on Facebook. (Not to say you don't have many friends; there are just a lot of people on Facebook.) As the size of the network increases, p should decrease. If Facebook only contained people you went to school with, there would be a high probability of you being friends with any one person, but if you expand it to your town, your country or the whole world, that probability decreases. So one appropriate way to set p is a constant divided by the number of nodes. Like any model, this one has its criticisms, a main one being that each pair of nodes has the same probability of having an edge, which is clearly unrealistic: you are more likely to be friends with someone if you are friends with all their friends. There are adaptations of the model that take this into account, which you can find, along with more detail on the Erdős–Rényi model itself, in the book linked here.
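Sampling from G(n, p) takes only a couple of lines. This is my own sketch, with n and the constant c chosen arbitrarily; setting p = c/n keeps the expected degree at roughly c however large the network grows, which is exactly the sparsity argument above.

```python
import random

def erdos_renyi(n, p, seed=None):
    """Sample a G(n, p) graph: each of the n(n-1)/2 possible edges
    is included independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

# Sparse regime: p = c / n gives expected degree of about c
n, c = 1000, 5
g = erdos_renyi(n, c / n, seed=42)
avg_degree = 2 * len(g) / n   # each edge contributes to two nodes' degrees
```

With these settings the sampled graph has roughly 2500 edges and an average degree close to 5, despite having nearly half a million possible edges.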

This week on the STOR-i Programme: Most likely pathways in the ocean
Sun, 24 Jan 2021

Currently, as part of the MRes at STOR-i, we are attending lectures on research topics given by multiple lecturers. One I found particularly interesting was given by Adam M. Sykulski on work he is currently doing with a PhD student at STOR-i, Michael O'Malley, on most likely pathways in the ocean. So, let's dive right in. (I'm sorry, don't worry, there are no more puns.)

You may be confused by what is meant by this, so I'm going to briefly explain. Imagine you are a particle in the ocean at point A and you wish to end up at point B. This research finds the path you are most likely to take.

So what path is this? Your first thought may be that you want the shortest path, so a straight line. Then maybe you remember the world is spherical, so that would be a geodesic. But you're also a particle in the ocean being pushed and pulled by the currents, so maybe you follow them?

The general idea. Current map was taken from the presentation slides.

But why do we even want to know which path is most likely? The motivation is biological research conducted by Genoscope in France on genetic samples of zooplankton at different points in the ocean, and wanting to know how connected those points are. It can also be used for studying the transport of things such as debris, plastic or oil. So we are trying to measure the connectivity between points in the ocean.

Before any statistical analysis can be done, some data is needed. This data is gathered by the Global Drifter Program: drifters (free-floating buoys) are released into the ocean and the paths they take are recorded, giving us data we can use.

But how do we use this data to find the most likely path between two points? It is very rare that a single drifter passes through two particular points in the ocean, so we cannot just look at which paths the drifters take most frequently. Even if we try to stitch together the paths of two drifters, we can't know that the result is the most likely path.

So O'Malley and Sykulski proposed a method to do this. They first decided on a way of splitting the ocean into bins, which is used to define the probability of a particle moving from one bin to another and to determine the most likely path. An obvious choice would be to split it into squares based on longitude and latitude. However, there are a few issues with this. Firstly, because the Earth is a sphere, the squares vary greatly in size. Also, diagonal neighbours share only a vertex and no edges, compared to other neighbours which share an edge and two vertices, so diagonal neighbours are disadvantaged. Instead, they used a hexagonal grid system named H3, developed by Uber. The advantages are that the bins are fairly similar in size and each neighbour shares an edge and two vertices.

Picture taken from presentation slides.

Markov transition matrices are then formed using the drifter data by looking at each drifter's location at each time step. For those not in the know, a Markov transition matrix is a matrix with elements pij, the probability of transitioning from state i to state j; in this instance each hexagon is a state. An important property of a Markov chain is the lack of memory between states: previous states don't affect the probability of future states. The time step O'Malley and Sykulski chose is five days, following the oceanographic literature. Dijkstra's algorithm is then used to find the most likely path. You may not be familiar with Dijkstra's algorithm, but all you need to know is that it is an algorithm for finding the shortest path.
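The trick connecting the two pieces is that the most likely path, the one maximizing the product of transition probabilities, is exactly the shortest path when each transition is given weight -log(p_ij). Here is a small Python sketch of that idea on a toy chain; the states and probabilities below are invented for illustration, nothing to do with the real ocean data.

```python
import heapq
from math import log

def most_likely_path(P, start, end):
    """Dijkstra's algorithm with edge weights -log(p_ij): minimising the
    total weight maximises the product of transition probabilities.
    `P` maps state -> {next state: transition probability}."""
    dist, prev = {start: 0.0}, {}
    heap, visited = [(0.0, start)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == end:
            break
        for v, p in P.get(u, {}).items():
            if p <= 0:
                continue
            nd = d - log(p)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # Walk the predecessor links back from the end state
    path, node = [end], end
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy chain: going via B (0.9 * 0.5 = 0.45) beats going via C (0.1 * 1.0 = 0.1)
P = {"A": {"B": 0.9, "C": 0.1}, "B": {"D": 0.5, "A": 0.5}, "C": {"D": 1.0}}
path = most_likely_path(P, "A", "D")
```

In the real method each state would be an H3 hexagon and the probabilities would come from the drifter transition counts, but the shortest-path machinery is the same.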

Here’s an example from the South Atlantic Ocean. Again taken from the presentation slides.

Now we have our most likely path, what do we do with it? This research is trying to quantify the connectivity of two points in the ocean. To see the full picture it is important to consider that currents in different parts of the ocean move at different speeds, so to measure connectivity we look at the expected time taken to travel the most likely path.

It is hard to say exactly what I found so interesting about this lecture. I think it is more than the fact that I've always had a fondness for the ocean; it was really the elegance of the method that grasped my focus, which I would say is quite impressive for the third hour of an early-morning three-hour lecture.

To learn more, definitely check out the paper this blog is based upon, and have a read about Markov chains, transition matrices and Dijkstra's algorithm.

First Post
Thu, 31 Dec 2020

Welcome to my blog! Here I will be keeping you updated on exciting topics and ideas I learn about while at STOR-i.
