Statistics – Hamish Thorburn
PhD Student, STOR-i Centre for Doctoral Training

60 second stats – Monte Carlo Simulation
Sun, 19 Apr 2020

So, the plan for this post was to look at a particular paper. However, in doing that I realised that:

  1. In order to understand the paper and why it’s important, the reader (i.e. you) need to first understand what a Monte Carlo approximation is… which would be fine, except that;
  2. The only real way I could fit that into the post would be to embed a YouTube video describing it into the start of the post… which would be fine, except that;
  3. Pretty much all the YouTube videos on this are really boring.

Therefore, I thought the best thing to do would be to go back to basics, and try to give a quick explanation of Monte Carlo Estimation myself. And what better way to give a quick overview than another edition of 60 second stats?

So get out the stopwatch, and get ready. The 60 seconds begins…NOW!

What is Monte Carlo Estimation?

Monte Carlo Estimation is a technique used to estimate quantities. It is based around simulating a bunch of random numbers, and then using these to make estimates.

Why is it called “Monte Carlo”?

It is named after the Monte Carlo Casino in Monaco, which was frequented by the uncle of Stanislaw Ulam, one of the method's founders. It is also a reference to the inherent randomness of the method (as all casino games are based on chance).

Some beautiful architecture that just screams “computational statistics”.

Ok. So how does it work?

The overall idea is very simple. Say you have a random variable X, and you want to estimate some value related to X (e.g. its average). You can then simply simulate a large number of realisations (i.e. copies) of X, and take the average of these. Then (because of the Law of Large Numbers) we know that the average of our copies will be a "good" estimate of the quantity we're after.
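In code, the recipe really is that short. Here's a quick Python sketch (the exponential distribution is just an arbitrary stand-in for X – any random variable you can simulate would do; an Exponential with rate 2 has true mean 0.5, so we can sanity-check the estimate):

```python
import random

def monte_carlo_mean(simulate, n):
    """Estimate E[X] by averaging n simulated realisations of X."""
    return sum(simulate() for _ in range(n)) / n

random.seed(42)
# X ~ Exponential(rate=2), whose true mean is 1/2 = 0.5
estimate = monte_carlo_mean(lambda: random.expovariate(2.0), 100_000)
print(round(estimate, 3))  # lands close to 0.5
```

The same function works for any quantity you can phrase as an expectation – just change what `simulate` returns.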

Wait, what do you mean by “good”?

What we mean by “good” is that as you take more and more realisations of X to average, the average will get closer and closer to the true value, as shown in the gif below.

The gif clearly shows that the more copies, the closer to the true value we get.

Alright. So how do I do it myself?

It’s easy! All you need is:

  1. A method/program/algorithm to simulate copies of your random variable
  2. Enough computational power/storage to simulate and store many copies of this variable (the more the better)
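Given those two ingredients, a classic illustration (my own addition, not from the post) is estimating π: scatter random points in the unit square, and the fraction landing inside the quarter circle approaches π/4.

```python
import random

def estimate_pi(n_points):
    """Monte Carlo estimate of pi: the fraction of uniform random points
    in the unit square that fall inside the quarter unit circle is pi/4."""
    inside = sum(1 for _ in range(n_points)
                 if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4 * inside / n_points

random.seed(0)
pi_estimate = estimate_pi(200_000)
print(pi_estimate)  # close to 3.14159...
```

Notice the second ingredient in action: the more points you simulate, the closer (on average) the estimate gets.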

…and that’s time! Thanks for participating in another edition of 60 second stats. If you want to know more, there are many further aspects to this, such as known results about how quickly the estimate converges, or what to do if you can't easily simulate from your random variable.

60 second stats – Agent-Based Simulation
Tue, 25 Feb 2020

I thought I'd try something different today, and instead of the regular post, I thought I'd try a bite-sized summary of a topic. To that end, please enjoy the first installment of 60 second stats! I can't install a clock in the post, but if you're really keen, time yourself to see if I've done a good job. Today's topic will be the area of Agent-Based Simulation.

Are you ready?

….GO!

What is Agent-Based Simulation?

Agent-based simulation is any computer simulation in which “agents” interact with each other.

No, not like that.

The agents can be simulated cells in a biological system, or animals in nature, or even people. You then set your simulation running to see how the agents interact.

How do I make an agent-based simulation?

Basically, an agent-based simulation needs 3 elements:

  1. Agents
  2. Relationships/interactions between the agents
  3. An environment the agents exist in

These will often be determined by either the structure of the simulation, or by parameters inputted by the user at the start of the simulation.
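To make the three elements concrete, here's a tiny toy sketch (entirely my own invented example, not any particular published model): the environment is a line of positions, the agents carry an infection state, and the interaction is infected agents passing the infection to their neighbours.

```python
import random

class Agent:
    """An agent with a position in the environment and an infection state."""
    def __init__(self, position, infected=False):
        self.position = position
        self.infected = infected

def step(agents, infect_prob):
    """One round of interactions: each infected agent may infect its
    immediate neighbours on the line, each with probability infect_prob."""
    newly_infected = set()
    for agent in agents:
        if agent.infected:
            for neighbour in (agents[max(agent.position - 1, 0)],
                              agents[min(agent.position + 1, len(agents) - 1)]):
                if not neighbour.infected and random.random() < infect_prob:
                    newly_infected.add(neighbour.position)
    for pos in newly_infected:          # update all agents simultaneously
        agents[pos].infected = True

random.seed(1)
# Environment: a line of 20 agents; the agent in the middle starts infected.
agents = [Agent(i, infected=(i == 10)) for i in range(20)]
for _ in range(30):
    step(agents, infect_prob=0.3)       # infect_prob is a user-input parameter
n_infected = sum(a.infected for a in agents)
print(n_infected, "of", len(agents), "agents infected")
```

Here `infect_prob` plays the role of a parameter input by the user – exactly the kind of parameter the calibration problem below is about.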

Are there any examples of this?

Yes! To pick just one of the countless examples, agent-based simulation has been used to model:

  • Predator-prey relationships between animals

And many others.

What do I use an agent-based simulation for?

Often it’s used to either:

  1. Study how the agents interact under certain conditions, because studying these conditions in real-life is difficult/impossible
  2. Predict how real-life agents would act in new situations/environments they haven’t experienced before

That seems simple enough. Are there any problems with it?

The main difficulty is calibration. That is, selecting the right parameters so that the simulation is behaving similarly to the real-life system it is modelling. Otherwise, you can’t trust any of the results you obtain from it.

Which can be heartbreaking when you realise this

How do you calibrate it?

The most common way seems to be trial-and-error – just keep testing new parameters until the output looks realistic. However, there has been a bigger push to start using heuristics to try and automatically calibrate agent-based simulations.
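The trial-and-error idea can be sketched as a simple search: try candidate parameters, run the simulation, and keep whichever parameter makes the simulated output closest to what was observed in real life. (The "model" below is a stand-in of my own invention, not a real agent-based simulation.)

```python
import random

def run_simulation(rate, n=1000):
    """Stand-in for an agent-based model: returns a noisy summary
    statistic whose expected value equals the input parameter."""
    return sum(random.expovariate(1 / rate) for _ in range(n)) / n

def calibrate(observed, candidate_rates):
    """Trial-and-error calibration: try each candidate parameter and keep
    the one whose simulated output lands closest to the observation."""
    random.seed(0)
    return min(candidate_rates,
               key=lambda r: abs(run_simulation(r) - observed))

best = calibrate(observed=5.0, candidate_rates=[1, 3, 5, 7, 9])
print(best)  # → 5
```

Heuristic approaches automate exactly this loop, proposing new candidate parameters more cleverly than a fixed grid.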

That sounds familiar…

It should! It was the topic of a previous post of mine.

60 seconds is nearly up. Where do I look if I want to know more about this?

There are plenty of resources that give a really good overview of general agent-based simulation, as well as worked examples of automatic calibration, if you want to dig deeper.

Aaaand, that’s time. Hope you enjoyed that! If not, well, you only wasted one minute.

Model-based clustering
Sun, 23 Feb 2020

Today’s post is based on a Masterclass given to the STOR-i cohort by Brendan Murphy from University College Dublin.

Clustering

In data science, clustering is the process of grouping objects into groups, or clusters, such that members of the same cluster are more ‘similar’ to each other than they are to members of different clusters.

Many common clustering methods, such as k-means, are based on a metric known as the distance or dissimilarity between the points (a simple example of this distance is the straight Euclidean distance between the points). This is then used in a number of different ways to assign points to clusters – for example, in k-means clustering, each point is assigned to the cluster with the closest mean.
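That assignment step fits in a few lines of Python (a toy illustration of the closest-mean rule only, not a full k-means implementation):

```python
def assign_clusters(points, means):
    """One k-means assignment step: each point goes to the cluster
    whose mean is closest in squared Euclidean distance."""
    def sq_dist(p, m):
        return sum((a - b) ** 2 for a, b in zip(p, m))
    return [min(range(len(means)), key=lambda k: sq_dist(p, means[k]))
            for p in points]

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.8, 5.0)]
means = [(0.0, 0.0), (5.0, 5.0)]
labels = assign_clusters(points, means)
print(labels)  # → [0, 0, 1, 1]
```

Full k-means alternates this step with recomputing each cluster's mean – a structure that will look very familiar when we meet the EM algorithm below.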

While these methods are very popular, they do suffer from drawbacks. Without assuming a model generating these points, it is hard to claim with certainty that future observations will fall into the same clusters. In addition, some of these algorithms can’t properly deal with many frequently repeated observations.

Model-based clustering

Professor Murphy’s Masterclass instead presented a framework for clustering continuous data known as a Gaussian Mixture Model. This is a form of clustering which assumes that the data comes from a particular probability model.

The model is based on 3 general assumptions:

  1. We know the number of clusters before we start
  2. Each observation in the data has a certain probability of belonging to each cluster
  3. The observations within each cluster follow a normal distribution (with the appropriate dimension)

These assumptions leave us with two problems to solve when fitting the model:

  1. What are the means and covariances of each of the clusters?
  2. Which cluster does each observation belong to?

It is clear that these two problems are related. The mean and variance of a cluster will depend on which observations are assigned to it, and the cluster an observation should be assigned to will depend on that cluster's mean and variance. Fortunately, there is a way to solve both problems simultaneously using the Expectation-Maximisation (EM) Algorithm. The algorithm works by repeatedly performing an E-step, which assigns each observation to its most likely cluster, and an M-step, which then updates the cluster means and variances based on the assigned observations.
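Here's a minimal 1-dimensional sketch of that alternation in Python (using hard assignments, as described above; the textbook EM algorithm actually uses soft probabilities of membership, and a real analysis would use something like R's mclust – this is just to show the E-step/M-step loop):

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_cluster(data, means, variances, n_iter=20):
    """Alternate E- and M-steps for a two-cluster 1-D Gaussian mixture."""
    labels = []
    for _ in range(n_iter):
        # E-step: assign each observation to its most likely cluster
        labels = [0 if normal_pdf(x, means[0], variances[0])
                  >= normal_pdf(x, means[1], variances[1]) else 1
                  for x in data]
        # M-step: update each cluster's mean and variance from its members
        for k in (0, 1):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                means[k] = sum(members) / len(members)
                variances[k] = max(sum((x - means[k]) ** 2 for x in members)
                                   / len(members), 1e-6)
    return means, variances, labels

random.seed(3)
# Two well-separated groups of 1-D observations
data = ([random.gauss(0, 1) for _ in range(100)]
        + [random.gauss(8, 1) for _ in range(100)])
means, variances, labels = em_cluster(data, means=[-1.0, 1.0],
                                      variances=[1.0, 1.0])
print(sorted(round(m, 1) for m in means))  # close to [0, 8]
```

Even starting from poor initial means, the alternation pulls the cluster means towards the two true group centres.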

An example – Iris dataset

For a simple example, we will look at clustering different subspecies of iris flowers, based on the length and width of their petals and sepals.

Don’t worry, I didn’t know what sepals were either.

This is a famous data set consisting of 150 observations of 3 different subspecies of iris flowers – Setosa, Versicolor and Virginica. I applied the EM algorithm to the data (Professor Murphy kindly left us some code that does this, but the “mclust” package in R also does it). The gif below nicely shows how the algorithm classifies observations and updates the clusters.

As nice as the above picture looks, the big question is: how accurate is the classifier? That is, how well do our clusters match up with the true subspecies? As shown in the table below, we nailed it on the Setosa, and we did classify all the Versicolor together. Unfortunately, the algorithm couldn’t really distinguish between Versicolor and Virginica – 39 out of 50 Virginica observations were classified as Versicolor. However, it is also worth noting that this dataset is notoriously hard to classify, particularly between these two subspecies.

             Cluster 1 (red)   Cluster 2 (blue)   Cluster 3 (green)
Setosa             50                  0                  0
Versicolor          0                  0                 50
Virginica           0                 11                 39
Accuracy of our clusters against the true subspecies. It can be seen that the Setosa flowers were all perfectly clustered, but the majority of the Virginica were mis-classified as Versicolor.

Extensions and further work

We’ve barely scratched the surface of Gaussian Mixture Models. Professor Murphy has done extensive work on the different possible covariance structures between clusters (e.g. assuming all clusters have equal covariance matrices, or the same shape at different scales, or letting each be completely unrestricted). And that’s not even covering clustering with categorical variables, which can be done using Latent Class Analysis. Even taking Gaussian Mixture Models as presented here, you still need to assume the number of clusters. While this can be determined using model selection criteria such as BIC, there is still much work to be done in this area.

Factor analysis and Interpretability
Fri, 24 Jan 2020

So, we had the STOR-i conference last week. 2 days of interesting talks, free wine, and somewhat awkward photos.

Exhibit A

Of the presentations, I was intrigued by one given by Dolores Romero Morales, of the Copenhagen Business School, mentioning a paper entitled “Enhancing Interpretability in Factor Analysis by Means of Mathematical Optimization”. Sadly, she ran out of time to go into this during the talk, so I decided to do a post discussing this paper.

But firstly, a background in factor analysis.

Factor analysis is a statistical method that tries to reduce the number of variables needed in a linear regression model. In a standard linear regression model, you have a number of variables, which you can see/observe, and which you assume have a direct effect on the output.

Basic idea behind linear regression

In factor analysis, there are a number of latent variables called ‘Factors’. These are unseen variables which affect the observed variables you would use in your linear regression. The factors affect these variables in the manner of:

Variable = Effect of Factor 1 + Effect of Factor 2 + …+ Some Random Error

You then use the variables to determine their effect on the outcome.

The basic idea behind factor analysis.

An example is a study in which researchers are trying to model the income of participants. One can imagine that some unobservable qualities – such as intelligence and work ethic – would influence how much a person will earn. However, we can’t measure these – we can only measure things such as university results and years of work experience. The idea is that the (unobserved) factors of intelligence and work ethic will influence the (observed) variables of university results and years of work experience, for example:

University results = Effect of Intelligence + Effect of Work Ethic + Random error

Then, you use the university results and work experience to model income, such as:

Income = Effect of University Results + Effect of Work Experience + Some more random error
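To make the two-layer structure concrete, here's a toy simulation of the income example (all the factor loadings and noise levels are numbers I've made up purely for illustration):

```python
import random

random.seed(7)

def simulate_person():
    """Generate one observation from the toy factor model above."""
    # Unobserved factors
    intelligence = random.gauss(0, 1)
    work_ethic = random.gauss(0, 1)
    # Observed variables: linear effects of the factors plus random error
    university_results = (0.8 * intelligence + 0.5 * work_ethic
                          + random.gauss(0, 0.3))
    work_experience = (0.2 * intelligence + 0.9 * work_ethic
                       + random.gauss(0, 0.3))
    # Outcome modelled from the observed variables, plus more random error
    income = (1.5 * university_results + 2.0 * work_experience
              + random.gauss(0, 0.5))
    return university_results, work_experience, income

sample = [simulate_person() for _ in range(5)]
for row in sample:
    print(tuple(round(v, 2) for v in row))
```

The point of factor analysis is to go the other way: given only the observed columns, recover something like `intelligence` and `work_ethic`.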

The structure of the variables

Factor analysis can be used to:

  1. Examine if a hypothesised structure exists in the data – our toy example is an example of this
  2. Examine the data to see if there is any structure present.

The latter is known as exploratory factor analysis. The paper I’m going to look at today is an extension of this.

Enhancing Interpretability in Factor Analysis by Means of Mathematical Optimization

I’m not going to lie, I was very excited when I heard the title of this paper. Communication in mathematics and statistics is always a problem, and data science, through the sheer number of observations and variables involved, is particularly susceptible to it.

The paper is trying to solve the following problem. Let’s say, in our example above, that you didn’t have any idea what the two factors were. So you plug your data into your software package of choice to do some factor analysis.

Hint: You should choose R

The issue is, the model your software package fits isn’t guaranteed in any way to have factors that make intuitive sense. For example, in our example above, you could get a model which has 13 factors, of which 9 affect university results and 7 affect work experience, with no real pattern between them.

While there are many different ways of dealing with the interpretability of factors, the (extremely condensed) idea of the paper is that you can assign the different explanatory variables to clusters. Then, you can force the model to fit factors that match these clusters. Therefore, you can be sure that you will end up with factors that “make sense”.

Example – California Reservoir Levels

The authors of the paper trial their method on a dataset of California reservoir levels (from Taeb et al. 2017). They assign one variable to each cluster, choosing:

  1. Palmer Drought Severity Index
  2. Colorado River Discharge
  3. State-Wide Consumer Price Index
  4. Hydroelectric Power
  5. State-Wide Number of Agricultural Workers
  6. Sierra Nevada Snow Pack
  7. Temperature

as the seven variables, with one variable in each cluster. Therefore, when fitting the factors, we can tell that they will be related to one of these variables.

In fact, they find that the best fitting comes from having 2 factors – one for Hydroelectric power and one for temperature.

Issues with the paper

I’m torn here. On the whole, while I’m sympathetic to any attempt to make data science more accessible, I’m a bit confused by this paper (ironic, really). They assign each factor to a specific variable to make the model easier to interpret. But once this is done, it isn’t clear why this isn’t simply a linear regression. The whole idea is that factor analysis tries to find some hidden factors that govern the way the explanatory variables work. But for this approach, the factors are assigned to seen explanatory variables anyway, so it’s not clear what this process achieves. Furthermore, the code used in the paper isn’t available online, so it is hard to replicate what they’ve done and work out what they did for yourself.

However, I would be very interested to see how fitting models with these interpretability clusters compares to fitting standard exploratory factor analysis models. Having previously worked on explaining complex data analysis to non-technical people, the idea that you could explain that your explanatory variables are governed by a simple, understandable process is golden. However, until I give this more of a go myself, I’m not sure how much this will add to the process.
