Optimization – Rebecca Hamm, MRes Student at STOR-i Centre for Doctoral Training

This Week on the STOR-i Programme: Bayesian Optimization (18 Apr 2021)

At STOR-i we are currently looking through our potential PhD projects. For each project we have been given a page summarizing the topic, with some papers listed at the bottom. As I was looking through these papers I came across one which particularly interested me, in an area I knew very little about. I decided a good way to form a deep understanding of this paper would be to write a blog post on it. This way you can learn something too.

The paper in question is called . You may remember that in a previous blog post I explored a Bayesian approach to a multi-armed bandit problem, and in another I looked at a heuristic approach to an optimization problem. Well, today (or whichever day you decide to read this) we are looking at a Bayesian approach to an optimization problem.

So let's outline the situation. We have a function f(x) whose maximum we would like to find; however, we do not know the structure of the function, and it is expensive to evaluate at any given point. Basically, we have a black box to which we give inputs (our x values) and from which we receive outputs (f(x) values). Since evaluating the function is expensive, we can only look at a limited number of points, so we have to decide which points to evaluate in order to find the maximum.

So how do we do this? First we fit a Bayesian model to our function. We then use this model to formulate something called an acquisition function. The point at which the acquisition function is highest is the point we evaluate next.
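The loop described above can be sketched in Python. This is a minimal illustration rather than the paper's method: `fit_model` and `acquisition` are hypothetical placeholders for the Gaussian process fit and the acquisition function discussed in the rest of this post.

```python
def bayesian_optimize(f, candidates, n_evals, fit_model, acquisition):
    """Generic Bayesian optimization loop: fit a model to the observations,
    pick the unevaluated candidate that maximizes the acquisition function,
    evaluate it, and repeat."""
    xs, ys = [candidates[0]], [f(candidates[0])]  # one initial evaluation
    for _ in range(n_evals - 1):
        model = fit_model(xs, ys)
        pool = [x for x in candidates if x not in xs]
        x_next = max(pool, key=lambda x: acquisition(model, x, max(ys)))
        xs.append(x_next)
        ys.append(f(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]
```

Any surrogate model and acquisition rule can be plugged in; a purely exploratory acquisition (e.g. "distance to the nearest observed point") already gives a working, if naive, optimizer.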

Gaussian process regression is used to fit the Bayesian model. We suppose that the f values at some x points are drawn at random from a prior probability distribution. This prior is taken to be a multivariate normal with a particular mean vector and covariance matrix. The mean vector is constructed by applying a mean function at each x. The covariance matrix is constructed using a kernel, formulated so that two x values close together have a large positive correlation; this reflects the belief that closer x values will have similar function values. The posterior mean is then a weighted average of the prior mean and an estimate made from the data, with weights depending on the kernel. The posterior variance is the prior variance at that point minus a term corresponding to the variance removed by the observed values. The posterior distribution is again multivariate normal.
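As a rough sketch of how the posterior mean and variance fall out of the multivariate normal prior, here is a minimal Gaussian process regression in numpy, assuming a zero prior mean and a squared-exponential kernel (both choices are my simplifications, not taken from the paper):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential kernel: nearby x's get correlation near 1."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length_scale**2)

def gp_posterior(x_obs, f_obs, x_new, noise=1e-8):
    """Posterior mean and variance of a zero-mean GP at the points x_new."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_new, x_obs)
    K_ss = rbf_kernel(x_new, x_new)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ f_obs            # prior mean (0) pulled toward the data
    cov = K_ss - K_s @ K_inv @ K_s.T      # prior variance minus variance explained
    return mean, np.diag(cov)
```

At an observed point the posterior mean reproduces the observation and the variance collapses to (almost) zero; far from the data the variance reverts to the prior, exactly as the credible intervals in the figure below suggest.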

[Figure taken from the paper]

Illustrated above is an estimate of the function f(x) (solid line); the dashed lines show Bayesian credible intervals (similar to confidence intervals), with observed points in blue. Using this we can form an acquisition function such as:

[Figure taken from the paper]

This function tells us which point of the function to evaluate next: the point at which the acquisition function is maximized. There will be a balance between choosing points where we believe the global optimum lies and points with large amounts of variance.

There are many different types of acquisition function. The most commonly used is known as the expected improvement. In this case we assume we can only report one solution as the maximum of f(x), so if we had no more evaluations left we would report the largest value we have evaluated so far. If we did have just one more evaluation, our solution would remain the same if the new evaluation were no larger than the largest so far; if it were larger, it would become our new solution. The improvement is therefore the new evaluation minus the previous maximum if that difference is positive, and zero otherwise. When we choose our next evaluation we would like to choose the one which maximizes improvement. It is not quite that simple, as by the nature of the problem we do not know the value of an evaluation until we have chosen and evaluated that point. This is where the Bayesian model comes in: we can use it to obtain the expected improvement at each point, and then choose the point with the largest expected improvement. This favours points with high posterior standard deviation as well as points with large posterior means, so there is a balance between choosing points with high promise and points with large amounts of uncertainty.
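Under a normal posterior with mean mu and standard deviation sigma at a candidate point, the expected improvement has a well-known closed form that needs nothing more than the standard normal pdf and cdf. A minimal sketch:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI(x) = E[max(f(x) - f_best, 0)] when f(x) ~ N(mu, sigma^2)."""
    if sigma <= 0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # standard normal cdf
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    return (mu - f_best) * Phi + sigma * phi
```

Large posterior means and large posterior standard deviations both drive EI up, which is exactly the exploitation/exploration balance described above.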

This method can be applied to many problems, including robotics, experimental particle physics and material design. This paper explains the application of Bayesian optimization to the development of pharmaceutical products. To learn more, I'd advise reading the paper I used for this blog, as well as this one, which discusses constrained Bayesian optimization and its applications.

This Week on the STOR-i Programme: Memes (4 Apr 2021)

Before you get excited, it's not the type of memes you are thinking of. I do apologize; however, I struggled to find a link between online memes and Statistics and Operational Research, except for maybe this one:

If you are disappointed, my friend Robyn includes a tweet of the week at the end of each of her blog posts, and I'm sure there are plenty of memes there. What we are actually going to talk about is the concept of memes in heuristics used for optimization. This week, as part of the MRes course, I have been writing a report on that very topic. I thought it was interesting, so I decided to share it with you.

Firstly, what even is optimization? A previous MRes (now PhD) student at STOR-i covered this in a similar blog on heuristics in optimization. But if you are too lazy to click that link, then basically we wish to minimize or maximize a function subject to some constraints. A very famous example of this is the travelling salesman problem. In this problem, as you may have guessed, we have a travelling salesman who wishes to visit all of the cities exactly once while incurring the minimum cost or distance. So the problem may look a bit like this:

There are many exact methods that will produce the correct solution. However, as you increase the size of the problem (the number of cities), the time these methods take to compute grows so quickly that eventually there is no way you can wait for an answer. In response, heuristic methods are used. One popular heuristic method is the genetic algorithm. As you can probably tell from the name, it is based on genetics, in particular how genes are passed on.

The algorithm starts with a set of individuals, each with a genetic code (the order in which they visit the cities). Some of the individuals are then selected to be parents. Usually the best individuals are selected (those with the least cost/distance), but occasionally there is some randomization. The parents then "reproduce", meaning there is a crossover between two parents. This can happen in different ways for different problems; in particular, there are multiple ways to do this for the travelling salesman problem. Today we will discuss one called the Very Greedy Crossover. In this crossover an initial city is chosen randomly; the next city is then chosen from the cities adjacent to the current one in the two parents, taking the closest of those that has not yet been visited. This process is repeated until all cities have been chosen. For example:
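The crossover can be sketched in Python as below. The fallback when every adjacent city has already been visited (jump to the nearest unvisited city) is my own assumption for the sketch; the paper may handle ties and dead ends differently.

```python
import random

def very_greedy_crossover(parent1, parent2, dist, rng=None):
    """Build a child tour: from the current city, look at its neighbours
    in both parent tours and move to the closest unvisited one."""
    rng = rng or random.Random(0)
    n = len(parent1)

    def neighbours(tour, city):
        i = tour.index(city)
        return {tour[(i - 1) % n], tour[(i + 1) % n]}

    current = rng.choice(parent1)
    child, visited = [current], {current}
    while len(child) < n:
        candidates = (neighbours(parent1, current) | neighbours(parent2, current)) - visited
        if not candidates:  # assumption: fall back to the nearest unvisited city
            candidates = set(parent1) - visited
        current = min(candidates, key=lambda c: dist[child[-1]][c])
        child.append(current)
        visited.add(current)
    return child
```

Whatever order the parents encode, the child is always a valid tour, since each city is added exactly once.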

These resulting "children" are then mutated with some probability. Again, there are many ways to do this; one is simply to swap the places of two cities. The resulting individuals make up a new population, and the process is repeated until some condition is met. This may be that a solution considered good enough has been found, or simply that after a set number of iterations the best solution so far is returned.
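A possible swap mutation, together with the tour cost used to rank individuals, might look like the following (both function names are hypothetical, sketched purely for illustration):

```python
import random

def tour_length(tour, dist):
    """Total cost of visiting the cities in order and returning to the start."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def swap_mutation(tour, p=0.2, rng=None):
    """With probability p, swap the positions of two randomly chosen cities."""
    rng = rng or random.Random(1)
    child = list(tour)
    if rng.random() < p:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child
```

Because a mutation only permutes the order of the cities, the mutated individual is still a valid tour.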

I understand that by now you have either completely forgotten this blog post was supposed to be about memes, or you are understandably frustrated with me for not getting to the point. But don't worry, I'm doing it now. Dawkins coined the term "meme" to mean an analogue of the gene in the context of cultural evolution: ideas and behaviors that spread within our culture, or image-based jokes that spread through the internet. This idea has been adapted so we can include memes within the genetic algorithm, creating what are known as memetic algorithms.

"How do we do this?" you may be asking. I hope you are, because I'm about to answer that very question. It's simple: after the mutation stage we add what is called a local search. These are algorithms that move from solution to solution hoping to find an optimal one. An example is Tabu search. Tabu search keeps a memory of where it has previously been, to avoid cycling through the same solutions and getting trapped in local optima. A local optimum is the best solution in its own area, which does not mean it is the best solution overall; searches can get trapped in local optima because a solution appears to be the best when other areas are not searched. For example, the neighborhood of a solution may be all the solutions that can be found by removing a city and placing it in a different position. The search then moves to the best solution in the neighborhood. Since the best neighbor is chosen each time, a local optimum will be found, but other solutions that require first choosing a worse option in order to then reach a better one may be missed. The Tabu list remembers where the search has been and blocks previous moves, forcing it to look at other solutions.
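A minimal tabu search over the "swap two cities" neighborhood is sketched below. The choice of neighborhood, tabu rule and stopping condition are all simplifications of the many variants in the literature, not the specific scheme from any paper linked here.

```python
from collections import deque
from itertools import combinations

def tour_length(tour, dist):
    """Total cost of the tour, returning to the start city at the end."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def tabu_search(tour, dist, iters=100, tabu_size=10):
    """Repeatedly move to the best non-tabu neighbour (even if it is worse),
    remembering recent moves so the search cannot cycle back immediately."""
    current, best = list(tour), list(tour)
    best_cost = tour_length(best, dist)
    tabu = deque(maxlen=tabu_size)  # recently used swap moves are forbidden
    for _ in range(iters):
        candidates = []
        for i, j in combinations(range(len(current)), 2):
            if (i, j) in tabu:
                continue
            nb = list(current)
            nb[i], nb[j] = nb[j], nb[i]
            candidates.append((tour_length(nb, dist), (i, j), nb))
        if not candidates:
            break
        cost, move, current = min(candidates, key=lambda t: t[0])
        tabu.append(move)
        if cost < best_cost:  # keep the best tour seen anywhere in the walk
            best, best_cost = current, cost
    return best, best_cost
```

Allowing moves to worse neighbors while blocking recent moves is what lets the search climb out of a local optimum instead of oscillating around it.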

Memetic heuristics are seen to be more effective and to produce better results; however, this comes at a price. As these algorithms are more complicated, the computational time is longer. Hence, if you are looking for a quick answer I suggest you stick with a simpler heuristic, but if you have the time you should definitely give a memetic algorithm a go. The particular one I have outlined in this blog can be found . Another approach, using a memetic algorithm with particle swarm optimization to solve the travelling salesman problem, can be found . Find out more about the genetic algorithm and tabu search , and this is a good paper detailing many nature-based heuristics. If you want to learn more about memetic algorithms, read this .
