Kes Ward – PhD student at STOR-i
/stor-i-student-sites/kim-ward

Academic Writing in Classic Style (Mon, 08 Jan 2024 17:18:12 +0000)
/stor-i-student-sites/kim-ward/2024/01/08/academic-writing-in-classic-style/

A PhD involves reading an awful lot of what other people have written. You develop opinions on writing. Some of these opinions are quite hard to explain. They operate on an “I know it when I see it” basis.

I have recently read Steven Pinker’s work on why academic writing stinks (and how to fix it). He is a strong advocate for the classic style.

The guiding metaphor of classic style is seeing the world. The writer can see something that the reader has not yet noticed, and he orients the reader’s gaze so that she can see it for herself. The purpose of writing is presentation, and its motive is disinterested truth. It succeeds when it aligns with the truth, the proof of success being clarity and simplicity.

An image of two people with the same Object Of Joint Attention. One of them (the writer) is attempting to explain it to the other (the reader) via a conversation.
Taken from Philosophical Disquisitions

Many people try to sound clever with their writing. This is because they’re trying to get their reader to believe them. One way to do that is to use words that the reader doesn’t understand, in order to make the reader feel that they don’t have the expertise the writer has in this area.

But on the whole I don’t believe that academics are bad people trying to deceive me. A far worse problem is that, in order to keep their writing as short as they can, writers use terms that their readers might have to look up. I have been defeated by the TLA (three-letter acronym) and the ETLA (extended three-letter acronym) many times. The person who wrote (or even said!) them wasn’t trying to bamboozle me at all. They were pressed for time and tried to get their meaning across by invoking something that I didn’t already know. They ended up taking even more time when I asked them to please explain what a “<whatever they said>” is.

Alas, an explanation of a concept that can only be understood by someone who already understands the concept is a rubbish explanation. I find that this is often a problem with Zen proverbs. Here’s a Zen proverb:

“Drink your tea slowly and reverently, as if it is the axis on which the earth revolves – slowly, evenly, without rushing toward the future.”

~Thich Nhat Hanh

This is using an example of tea-drinking to tell you to slow down and enjoy life. I, as somebody who is already a fan of slowing down and enjoying life, understand and agree with it. Somebody who does not want to slow down and enjoy life is never going to understand or be convinced by something like this. So its explanatory power is really extremely limited. It’s the kind of thing that can be passed around with sage head-nods by people who already agree with it.

Of course, Zen proverbs aren’t designed to explain things you don’t know. They’re designed to remind you of things you already know. You can think of them like exam revision for the exam of life, rather than actually teaching the course.

“Proverbs are merely revision for the examination of life.”

~Kes Ward

I am currently attempting to write a literature review. The process is somewhat painful and mind-wrenching, especially for someone like me who spent my undergraduate days solving maths problems rather than writing essays. I feel as if I missed out on the formal teaching of academic writing. Reading about classic style is helping me formalize a sense I’ve developed over time but couldn’t quite put into words. It feels almost like a Zen proverb.

Survivorship Bias and Vaccines (Mon, 16 Nov 2020 15:24:54 +0000)
/stor-i-student-sites/kim-ward/2020/11/16/survivorship-bias-and-vaccines/

Yet another reason why COVID is not the flu

With multiple vaccine Phase III trials coming back with promising interim results, I’d like to take a moment to talk about a bias in statistics called “survivorship bias”, and how it can skew our thinking.

Wikipedia’s definition of survivorship bias is as follows:
“Survivorship bias or survival bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to false conclusions in several different ways. It is a form of selection bias.”

Usually, survivorship bias causes us to be unreasonably optimistic. For instance, a meta-analysis of scientific studies in a field will often produce more evidence supporting a hypothesis than actually exists, because the studies that didn’t find any evidence for that hypothesis never got published (shout-out to everyone working to address this publication bias).

Sometimes survivorship bias can lead us to damaging conclusions. For example, look at this picture of which parts of planes got shot on WW2 air raids, and think about where you would put armour on the planes if you could edit their design.

Bullet holes on returning WW2 aircraft (source: Wikipedia article on survivorship bias)

The obvious (and wrong) answer is to put armour on the parts of the plane with the most holes so it will stop the most bullets. The correct answer, of course, is to put armour on the bits of the plane without bullet holes, because planes that got shot in those places didn’t make it back to be counted.

So how does this apply to COVID-19?

Here, the “people or things” we are interested in are infectious diseases. And the “selection process” is not getting relegated to the history books by the development of a successful vaccine. The battleground of public health is littered with these kinds of corpses. Mumps, rubella, tetanus, polio, whooping cough, smallpox… and yet there are also survivors, like HIV and influenza and measles.

Because HIV and influenza and measles are survivors, they’re the ones we hear about in the papers. And because of that, they’re the ones we unconsciously consider in the back of our minds when weighing up probabilities, including probabilities of success for vaccine development and deployment. COVID has been compared a lot to the flu, for example, but not much if at all to polio (despite the fact that the 2020 pandemic has revived iron lungs as ventilator substitutes in countries that don’t have enough of them).

But each of HIV, influenza, and measles has a reason why it hasn’t been consigned to the history books. They are, in summary (and with apologies for the oversimplifications):

  • HIV infects by getting eaten by the immune system and then slowly destroying it from the inside, so the standard vaccine idea of “make the immune system eat it quickly before it causes disease” is a non-starter, and most vaccine trials for HIV fail because of this (some even make things worse).
  • Influenza mutates really, really fast. It doesn’t have the “check your work for mistakes” proteins many other viruses do, having sacrificed them and the advantages they bring in favour of an infection strategy specifically based around mutating enough to infect the same people over and over. It’s so far been impossible to develop a vaccine that the flu can’t mutate its way around, though we have a good crack at it every year.
  • Measles is super duper infectious. Remember how everyone’s been talking about the R rate for COVID which is about 3 in a normal population (without social distancing and the like)? The R rate for measles is about 15, which is far higher than any of the other diseases we have vaccines for. This means that even though we have a vaccine, we need to have 95% coverage to stop measles causing outbreaks, and the presence of just a few people who can’t or won’t get vaccinated stops us from getting rid of it for good.
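As a quick aside, the rule of thumb behind these coverage figures is the classic herd-immunity threshold, 1 − 1/R. A back-of-envelope sketch (this assumes a perfectly effective vaccine and homogeneous mixing, which is why real-world targets are set a bit higher than the raw formula):

```python
# Simplest possible model: an outbreak shrinks once a fraction
# 1 - 1/R of the population is immune. (Assumes a perfectly
# effective vaccine and homogeneous mixing; real-world targets
# like 95% for measles are set higher to allow for both.)
def herd_immunity_threshold(r):
    return 1 - 1 / r

for disease, r in [("COVID-19", 3), ("measles", 15)]:
    print(f"{disease}: R = {r}, threshold = {herd_immunity_threshold(r):.0%}")
```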

Enter COVID, stage left, November 2019. COVID is an infectious and dangerous disease, so in estimating how it will behave, we unconsciously compare it to the other infectious and dangerous diseases most prominent in the back of our minds. This leads us to an overly pessimistic outlook because of survivorship bias, which is a nice change from the usual.

Truth is, the answer to the question of “why haven’t we eradicated covid yet?” is probably “we haven’t had the time”, rather than something more substantial preventing successful development and deployment of a vaccine. This is borne out by the interim trial results, which report that the standard way of vaccine making works far better for COVID than it does for, for instance, HIV. COVID has “check your work for mistakes” proteins, so it mutates much more slowly than the flu, and we’re not seeing anywhere near the amount of reinfection that would cause us to be seriously worried on this front. And with an R rate of 3, we only need about 75% vaccine coverage to wipe COVID out from society, so the antivax movement isn’t really big enough (yet) to present a COVID-related health problem.

Hopefully, as this pandemic gets mopped up, we’ll see less comparing COVID to the flu, and more comparing COVID to polio, in the graves of the history books where it belongs.

An Introduction to Robust Statistics (Tue, 30 Jun 2020 17:04:39 +0000)
/stor-i-student-sites/kim-ward/2020/06/30/an-introduction-to-robust-statistics/

Better MAD than mean.

Ah, the joys of the school maths classroom. I’m not sure how old I was when I first learned the “three Ms” – mean, median, mode – but I’m certain I was in primary school. Given some numbers, finding the mode was easy – just pick the number that appears the most times. Finding the median was easy – just line them up in order and pick the middle one. But the mean? Oh dear. The mean was mean. To find it you had to do a lot of Complicated Adult Maths like adding everything up and dividing and making sure you typed everything into your calculator just right: one badly-placed decimal point could spell your doom and make you do it all again. Even then, the answer you got was usually a gnarly fraction, and since it wasn’t actually one of the data points, could it really be said to be an “average” of them? Isn’t it kind of artificial somehow?

A millionaire walks into a bar. The ‘average’ wealth of all bar patrons has just tripled. What wonderful sharing of prosperity.

Nevertheless, when we think of an ‘average’ in our adult lives, we usually think of the mean. It’s the grown-up average. It “uses all of the information”, which… must be inherently good! After all, it couldn’t possibly be that some of the information we have is complete and utter junk, could it?

Spoiler: in the real world, far more often than you would think, some of the information is indeed complete and utter junk. We don’t live in a world of spherical cows in a vacuum and pretending we do doesn’t help us do good science.

“a system which has spherical symmetry, and whose state is changing because of chemical reactions and diffusion … cannot result in an organism such as a horse, which is not spherically symmetrical.” ~Alan Turing, The Chemical Basis of Morphogenesis

Trouble is, if you’ve got a lot of information, it’s pretty hard to work out which of it is good information and which of it is bad information. Sorting out the good from the bad like this is the central premise of my research in anomaly detection. Lots of people do it in lots of different ways, and it’s a bit of a complicated mess sometimes. But let’s put most of that aside for now, and go back to the childhood maths classroom.

Avoiding a Breakdown

Let’s say that the class is trying to count the number of daisies on the school field. To do this, each member of the class is given a hoop with an area of 1 square metre and told to throw it out to a random spot on the field and count how many daisies lie inside it when it lands. Find the average of the children’s counts, multiply by the area of the field, and there’s your estimate!

After some delighted hoop-throwing and flower-counting, you obtain the following data points from your study.

$$(31, 17, 14, 22, 185, 27, 236)$$

Rosie and Johnny both swear up and down that their hoops just landed that way in a big daisy clump and they definitely didn’t intentionally set them down there in a way that would bias the results, oh no. You’re slightly skeptical, but who are you to question the wisdom of six-year-olds? You run the calculations.

$$(31 + 17 + 14 + 22 + 185 + 27 + 236)/7 = 532/7 = 76$$

This upsets some of the other six-year-olds, who grumble that Rosie and Johnny ruined the experiment. Not only did they possibly set their hoops down unfairly, they also probably didn’t even count all of those daisies and just guessed! And since they’re guessing really big numbers, guessing them even a little bit wrongly can completely wipe out all the careful counting the rest of the children did.

I am reliably informed that the correct biological term for this is ‘quadrat sampling’ and it is usually done with squares. However, I do not care, because I am a mathematician and not a biologist, and my explanatory examples can be whatever I want them to be.

Those kids are absolutely correct to be upset. Anomalous values aren’t just bad because they’re anomalous, they’re bad because they disproportionately influence the dataset and overwhelm the sensitive non-anomalous information it contains.

In the field of robust statistics, there is a concept of a ‘breakdown point’ to determine how robust an estimator is. The breakdown point of an estimator is the proportion of arbitrarily incorrect observations that estimator can handle before giving an arbitrarily incorrect result. That is, how many junk data points (and if a data point is junk, it can be as junk as you want – say Rosie had reported 1000 daisies, or 10000, or a million and three) can be in your data before your estimator becomes junk (not just a little bit biased, but really junk).

The breakdown point of the mean is a big fat 0%. This is because one single junk data point can ruin the whole thing. The breakdown point of the median, however, is in this case 3/7 (approximately 43%), because any three junk data points – no matter whether they’re small junk or big junk or even negative-number junk – can’t influence the median enough to make it one of the junk points. For really big samples of data, the breakdown point of the median approaches 50%.

To refresh your primary school mathematics, we calculate the median by ordering the values and taking the middle as follows:

$$(14, 17, 22, \textbf{27}, 31, 185, 236)$$

The kids are overall much happier about this method (Connie, in particular, is ecstatic about how her value of 27 was ‘chosen’).
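The whole dispute fits in a few lines of Python (standard library only; the data is the class’s counts from above):

```python
import statistics

counts = [31, 17, 14, 22, 185, 27, 236]
print(statistics.mean(counts))    # 76: dragged upward by the two big counts
print(statistics.median(counts))  # 27: Connie's value, unbothered

# One arbitrarily junk value ruins the mean but leaves the median alone.
junk = [31, 17, 14, 22, 185, 27, 1_000_000]
print(statistics.mean(junk))      # enormous
print(statistics.median(junk))    # still 27
```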

Measures of Spread

After a measure of centrality like the mean (or the median), the most valuable one-number-summary to know about a dataset is a measure of spread. How far away are the data points from each other? How well does your centrality measure actually describe what an ‘average’ (randomly chosen) data point looks like?

In our non-robust world, the standard deviation is the go-to measure of spread. It’s the root mean square of the distances between all points and the mean. Obviously, since the mean is involved in the calculations, the standard deviation has a breakdown point of 0%. This is bad.

Drawing on our previous experience with the median fixing our problems, let’s examine the inter-quartile range (IQR) and see if it helps us. To recap, the IQR is the difference between the first and third quartiles: if the median can be thought of as 50% of the way ‘up’ the dataset, then the IQR is the difference between 25% of the way up and 75% of the way up.

For discrete datasets, how to pick actual numbers for the quartiles is a bit disputed, but for the sake of this post I’m taking each quartile to be the median of its half of the data (excluding the overall median), matching the bolded values below.

$$(14, \textbf{17}, 22, 27, 31, \textbf{185}, 236)$$

$$185 - 17 = 168$$

Oh no. We have a problem!

The breakdown point of the IQR for this dataset is only 1/7 (or, in the case of a larger dataset, approaching 25%). This is because, even though you could tolerate 50% of the data being anomalous provided the anomalies were spread evenly in both directions, we are concerned with the worst case – all of the anomalies off to the same side.
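For completeness, here is the same IQR calculation in Python (standard library only; one common quartile convention, taking each quartile as the median of its half of the data, which matches the bolded values above):

```python
import statistics

data = sorted([31, 17, 14, 22, 185, 27, 236])  # [14, 17, 22, 27, 31, 185, 236]
q1 = statistics.median(data[:3])   # median of the lower half: 17
q3 = statistics.median(data[-3:])  # median of the upper half: 185
print(q3 - q1)                     # 168 - inflated by the two anomalies
```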

Can we do any better? Robust statistics tells us that yes, we can.

The Median Absolute Deviation (MAD) is what you get when you apply median-thinking to the way of deriving the standard deviation. It’s best explained by a concrete example.

$$(14, 17, 22, \textbf{27}, 31, 185, 236)$$

Find the distance from each point to the median.

$$(13, 10, 5, 0, 4, 158, 209)$$

Reorder, and find the median of those distances.

$$(0, 4, 5, \textbf{10}, 13, 158, 209)$$

The MAD has a breakdown point of 50%, twice that of the IQR. This is because it doesn’t matter what direction the anomalies occur in – they’ll all be lumped up at the top end of the reordered data. It’s allowed us to calculate a measure of spread that isn’t massively affected by the anomalies in the dataset. Since we live in a world where many people only ‘get’ the standard deviation as a measure of spread, we rescale by a constant factor to make the MAD a (robust) estimate for the standard deviation that is consistent when the data is normally distributed. Turns out that constant is 1.4826. (And my careful choosing of examples to avoid having to deal with decimals was going so nicely, too…)

$$1.4826 \times \text{MAD} = 1.4826 \times 10 = 14.826$$
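The same arithmetic, as Python (standard library only, reproducing the numbers above):

```python
import statistics

data = [31, 17, 14, 22, 185, 27, 236]
med = statistics.median(data)              # 27
deviations = [abs(x - med) for x in data]  # distance of each point to the median
mad = statistics.median(deviations)        # 10
robust_sd = 1.4826 * mad                   # 14.826: a robust stand-in for the sd
print(mad, robust_sd)
```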

Can we ever do better than a breakdown point of 50%? Intuitively, no: if more than half of the data can be whatever we want it to be and not unduly influence our estimator, then we can flip our thinking as to which points are ‘anomalous’ and say that less than half the data does unduly influence our estimator. Contradiction.

50% is pretty good going, though. We can get useful results with almost half our dataset being junk. Many robust estimators for other statistical quantities can only wish they had a breakdown point this high.

Breakdown points are important. Don’t get mean, get MAD.

For more reading on robust statistics, as well as how it applies specifically to anomaly detection as a field, check out this 2018 survey by Rousseeuw and Hubert.

Tackling your Inner Critic (Thu, 11 Jun 2020 16:09:39 +0000)
/stor-i-student-sites/kim-ward/2020/06/11/tackling-your-inner-critic/

When I was a teenager, I wrote fanfiction. I also wrote a fair amount of original work, but most of it was fanfiction. The longest single piece of fanfiction I wrote was probably bigger, word-wise, than my PhD thesis will eventually end up being. (136,000 words, six months to write, if you’re wondering).

I mention this because I learned a lot from it – not least, how to use proper grammar and sentence structure – but especially because the world of creative writing was the first place I ever encountered the emotional ‘stretch’ associated with large-scale self-motivated projects. At first I was pretty terrible at things like keeping myself on track and maintaining a coherent plan and getting words on paper even when I wanted to curl up and die of anxiety and shame at how terrible my work was. But as I bounced from one story to another, the projects I undertook getting more coherent and complex each time, I knew I was on an upward trajectory heading… somewhere, I guess?

Then I got into university and the Cambridge Mathematical Tripos sacrificed my creative heart on the altar of problem sheets and examinations. But that’s a story for another time.

Returning to writing has been oddly nostalgic.

Now I stand on the brink of a new phase of my life, with an awful lot more maths know-how and relevant industrial experience and a respectable research topic direction, but many of the emotional struggles of my daily life are very familiar from my teenage fanfiction days. I recently attended a workshop organised for the STOR-i MRes cohort entitled “Building Resilience and Tackling your Inner Critic” and delivered by the amazing Tracy Stead. Most of what follows is my notes from that workshop, tidied up and interspersed with my own thoughts on the subject.

The leaky bucket model of motivation

Tracy introduced a metaphor that I particularly like: the idea of motivation and confidence as a bucket with holes in. Though the bucket is always draining, certain things drain the bucket faster, and certain things fill it up. It’s nearly impossible to do productive work when the bucket is close to empty. Therefore, we should make sure to keep the bucket topped up and avoid poking additional holes in it – that way, we’ll be more productive overall.

Here is a picture of a leaky bucket. I’d pretend that this would explain the point better, but really I just needed a picture to break up the wall of text.

According to the 2019-20 STOR-i MRes cohort, these things drain the bucket:

  • Conflicts with other people
  • Code doesn’t work, I don’t know why or how to fix it
  • When I forget what I’ve read or how to do something ‘basic’.
  • Working alone without people who I can talk to that understand what I’m doing.
  • Seeing a really good piece of academic work and thinking “I couldn’t possibly measure up to this, so why am I even trying?”

These things, however, top it up:

  • Talking to someone supportive
  • Getting good results, or quick wins
  • Taking a break and doing something else for a while
  • Breaking new ground, starting a new project

I note that the only thing that fills the bucket that is both under conscious control (we can’t have research breakthroughs on command) and isn’t ‘do something else’ involves talking to someone. Maybe STOR-i is onto something with its philosophy of “A PhD is not a solo activity”.

Tracy’s “official” answer to the question of things that fill the bucket (or build resilience) is organised into a five-point plan (and I love myself a good five-point plan):

  • Relationships – cohorts (“part of a team”), networks (“people find me valuable”), supervisors (“I can ask for help”), getting an outside perspective on my problems so they don’t seem so bad.
  • Optimism – focus on the future not the past, goals and milestones and visions, figuring out who I want to be, continuously striving for improvement.
  • Coping skills – build confidence, lead a healthy lifestyle, make your rituals and habits productive ones, minimise unwanted stressors in your life.
  • Competence – invest in skills, problem solving, experience, knowledge.
  • Emotional intelligence – if I notice, name, choose, and communicate my emotions, that will help me think more rationally about things.

Meetings with Supervisors

My MRes year, summed up in one (1) graphic.

I am a naturally anxious person, and especially so around other people. I absolutely dread meetings with my supervisors. (Sorry Idris & Paul if you’re reading this – it’s not you, I swear, you’re both lovely). My internal monologue in the lead-up to a meeting can get incredibly distorted and critical. This is Tracy’s titular inner critic, and is a problem definitely shared in various degrees by many of the MRes cohort. Here are some of the things our inner critics say about meetings with supervisors:

  • I might say something stupid.
  • I haven’t done enough this week and my supervisors will think I’m lazy.
  • The supervisors might think I’m a mistake and I’m not good enough for this project.
  • I’m not prepared for this meeting.
  • Why haven’t I come up with any good ideas?
  • I should know the answer to this.
  • What I’ve done didn’t work, so it’s useless and there’s no point sharing it.

Tracy talked about re-framing each of these distorted thoughts into positives to build our motivation (fill that bucket!) and help us move forward.

  • What I’ve done didn’t work, so it’s useless and there’s no point sharing it. What hasn’t worked is a stepping stone towards what will work, so it’s good to share it.
  • I should know the answer to this. The meeting is an opportunity to learn from an expert. The point of being here is to learn.
  • Why haven’t I come up with any good ideas? It’s going to be great to talk about this with someone who understands the field. It’ll probably help me come up with good ideas.
  • I’m not prepared for this meeting. This meeting is not a viva and my supervisors are not grading me on the quality of my preparation or performance.
  • The supervisors might think I’m a mistake and I’m not good enough for this project. This stuff is actually really hard and the only reason my supervisors find it ‘easy’ is because they’ve been here a lot longer than me. They know that and they’re not expecting perfection from me.
  • I haven’t done enough this week and my supervisors will think I’m lazy. I like my supervisors and it’s good to talk with them, especially during those times when I’m unmotivated and unproductive.
  • I might say something stupid. I am definitely going to say something stupid. My supervisors will hopefully correct me. This is a good thing.

A Song About Lit Reviews

I don’t just write stories. I also write and perform poetry and songs. (If you meet me in person and want to put me in a good mood, give me a ukulele and ask me to sing you the song about hiding the dead bodies – it’s one of my favourites).

I have recently started becoming acquainted with the literature in the field of anomaly detection. This has inspired the following song, to the tune of 99 bottles of beer:

99 papers on my to-read list

99 papers to read

Get through one, skim the references

127 papers on my to-read list.

Did I plagiarize the main concept of this song from a Reddit post about debugging? Why yes, yes I did.

“Projects” are constrained and time-bound. They have deadlines and well-defined stages and a (mostly) linear sense of progression. If you’re off-track or off-schedule, something is wrong. Contrast with “research” which is inherently expansive and messy, where going off-track is pretty much the norm and progression happens in jolts and lightbulbs interspersed with periods of what can at the time seem like aimless drifting through an ever-expanding mass of awful. (At least, that’s what the second-year PhDs tell me – they call it the “valley of shit”).

The Creative Diet

To get through this valley with sanity intact, Tracy asks us to pay attention to the environment around us, and design a ‘diet plan’ of daily activities to top up our motivation and creativity. Here’s what I came up with:

  • Exercise: in the morning and during breaks
  • Talk to people doing similar things, to enhance and refine your own perspective.
  • Spend time being bored: your mind wanders and you have new ideas.
  • Randomness and lack of routine. Breaking up habits.
  • Meditation, meditative activities (colouring books, cleaning)
  • Hold yourself accountable to self-imposed deadlines by telling other people about them.

Other Thoughts

I hereby declare I am part of the large percentage of PhD students with impostor syndrome.

Tracy covered a lot more than I’ve written about: impostor syndrome, task prioritisation, how to reflect and learn from experiences in a healthy way rather than dwelling on the negatives, etc. Here are two thoughts I had about writing that cropped up during the rest of the workshop that don’t relate much to anything I’ve said before:

  • Software. The brain was not actually designed to write documents in a linear fashion on a computer. We’re hampered so much by Microsoft Word and everything that looks like it (yes, this includes LaTeX) in ways that we don’t realise until we try something different. In my teenage years, I used Scrivener to organise my writing and thoughts in a heavily non-linear fashion. I’ve heard it’s pretty bad at rendering maths, though, so I’m on the lookout for better options for the PhD life.
  • Cycles of inspiration and criticism, drafting and re-drafting. You can’t edit an empty page. First drafts don’t need to be perfect. Creativity and perfectionism are antagonists, mood-wise. You need both, but you should separate them so they don’t screw each other up.
Coronavirus (Mon, 16 Mar 2020 09:33:13 +0000)
/stor-i-student-sites/kim-ward/2020/03/16/coronavirus/

If you’re tired of obsessively following non-scientific journalism on the COVID-19 situation and it’s causing you anxiety, why not try obsessively following scientific journalism and being in awe at pretty and informative graphs?

Head over to Nextstrain, an open-source project dedicated to genomic sequencing for real-time tracking of pathogen evolution, and teach yourself to interpret phylogenies!

Coupling From The Past (Mon, 24 Feb 2020 09:24:20 +0000)
/stor-i-student-sites/kim-ward/2020/02/24/coupling-from-the-past/

In Mathematics, you often don’t have a neat formula for the things you want to calculate. However, you can sometimes use an algorithm to get an answer that’s as close to the truth as you want – for example, finding the root of an equation by rearranging it in your calculator and then pressing the = sign a lot of times.

These algorithms usually have starting values, and picking the wrong starting value can affect how long the algorithm needs to run to get a “good enough” answer. But what if you didn’t need to pick a starting value or an amount of time to run the algorithm, and it just gave you the exact right answer? This report examines one way to do this, called Coupling From The Past, or CFTP for short.

Instead of running our algorithm into the future (where no matter how long we run it we’ll always be a tiny bit away from the true answer), CFTP asks us to pretend that our algorithm has been running for an infinite time in the past up until now. It then finds out where we would be if that was the case. Since the algorithm has been running for an infinitely long time, it turns out we must be exactly at the right answer. CFTP works by considering every single possible starting value, and then tries to make them “couple up” to end up at the same place as quickly as possible by fiddling about with the random numbers the algorithm uses. Then it doesn’t matter where the algorithm started from infinitely far in the past, because all the roads lead here!

Coupling From The Past makes every starting value end up at the same place.

CFTP is quite a tricky method. It only works for some algorithms and needs to have random numbers, and running something “from the past” is a lot harder to get your head around than running it the normal way. In particular, you need to be very careful where you’re getting those random numbers from, or the end answer won’t be right.

Luckily, we can tell the computer to give us the same random numbers we asked it for a while ago by setting a random ”seed”. Without this, CFTP wouldn’t be possible.
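To make the idea concrete, here’s a toy sketch (my own illustrative example, not from the report): a tiny random walk on the states {0, 1, 2, 3}, where we run every possible starting value forward from time −T and double T until they all couple. The fixed seed guarantees that time −t always sees the same random number, no matter how far back we extend.

```python
import random

def update(state, u):
    # One step of a toy random walk on {0, 1, 2, 3}: up if u > 0.5, else down.
    return min(state + 1, 3) if u > 0.5 else max(state - 1, 0)

def cftp(seed=42):
    T = 1
    while True:
        rng = random.Random(seed)
        # u[t - 1] is the random number used at time -t. Regenerating from
        # the same seed means past times keep their old numbers as T grows.
        u = [rng.random() for _ in range(T)]
        states = set(range(4))              # every possible starting value
        for t in range(T, 0, -1):           # step from time -T up to time -1
            states = {update(s, u[t - 1]) for s in states}
        if len(states) == 1:                # all roads lead here: coupled!
            return states.pop()
        T *= 2                              # not coupled: go further into the past

print(cftp())  # a single exact draw from this toy chain
```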

See the website of CFTP’s creator to get more into the mathematics of this algorithm.

External Methods for Explainable AI (Wed, 12 Feb 2020 11:09:02 +0000)
/stor-i-student-sites/kim-ward/2020/02/12/external-methods-for-explainable-ai/

At the STOR-i conference this year, there was a talk on “Optimising Explainable AI”. Broadly, it looked at ways to make an AI (by which I mean a classifier of some sort) more transparent, so that you can see why it makes the decisions it’s making.

The central concept was this: any AI will have a big long explanation as to why it did what it did (maybe in the form of a polynomial or large matrix of weights) that is in most cases too long for a human to understand. You need the explanation to be shorter (while still being an explanation). The way you do this is to impose some kind of “penalty” related to the size of the explanation, and hope that you can get a tradeoff between an AI that classifies well and an AI with short explanations. Example penalties include sparsity constraints in generalised linear models (such as Lasso regression) and restrictions on the allowable depth of decision trees.
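As an illustration of the sparsity-penalty idea, here is a rough toy implementation of Lasso-style coordinate descent in plain NumPy (my own sketch, not anything from the talk). The penalty `lam` zeroes out weak coefficients entirely, so the “explanation” – the list of non-zero weights – gets shorter:

```python
import numpy as np

def lasso(X, y, lam, sweeps=200):
    """Coordinate descent for 0.5/n * ||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    norms = (X ** 2).sum(axis=0) / n
    for _ in range(sweeps):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]      # residual ignoring feature j
            rho = X[:, j] @ r / n
            # Soft-thresholding: weak contributions become exactly zero,
            # which is what keeps the "explanation" short.
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / norms[j]
    return w

# Ten candidate features, but only two actually drive the response:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)
w = lasso(X, y, lam=0.5)   # most entries of w come out exactly zero
```

The tradeoff the talk described is visible here: a larger `lam` gives a shorter explanation but a worse fit.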

However, these are internal methods. They are things you consider when you are designing your AI. Suppose you weren’t allowed to do that – maybe you absolutely *must* get the best classification performance possible and can’t factor in things like explainability, or maybe the internal structure of your classifier is just not suited to making its explanations shorter via any method you’ve found so far. What do you do then?

How To Train Your Classifier

A trained classifier will take an input, for example the image below, and output one of a preset number of categories to which it belongs, for instance either “dog” or “cat”. In actuality, classifiers give a score: a number between 0 and 1 saying “how doggy vs catty is this picture”. All pictures above a threshold are classified as dogs and those below as cats – setting that threshold is the job of the person who trains the model, and is based on a tradeoff between the costs of misclassifying in either direction.
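Here is that tradeoff with some made-up scores (the numbers are invented for illustration, not from any real classifier). Raising the threshold misses more dogs but lets through fewer cats, and vice versa:

```python
# Hypothetical scores from a trained dog-vs-cat classifier (1 = dog).
dog_scores = [0.9, 0.8, 0.75, 0.6, 0.4]    # pictures that really are dogs
cat_scores = [0.7, 0.3, 0.2, 0.15, 0.1]    # pictures that really are cats

def errors(threshold):
    """Counts of (dogs misread as cats, cats misread as dogs)."""
    missed_dogs = sum(s < threshold for s in dog_scores)
    false_dogs = sum(s >= threshold for s in cat_scores)
    return missed_dogs, false_dogs

low, mid, high = errors(0.25), errors(0.5), errors(0.75)
# Moving the threshold up trades one kind of error for the other.
```

Whoever trains the model picks a point on this tradeoff based on which mistake is costlier.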

Markus, who is a very good boi. Source: my fellow STOR-i student Tessa

However, the score here is what’s important. The trained classifier can be thought of as a function f: {space of all possible images of the correct resolution} -> [0, 1], without caring about its internal workings at all. It’s even (usually) a continuous function, for some notion of “continuity”. And what do we do in OR when we find an interesting continuous function?

We optimise it, of course!

Now, the good thing here is we’re not really looking for global optima. We’re not particularly interested in constructing the “doggiest dog that ever did dog”[1], and thank goodness we aren’t, because that image space is extremely high-dimensional (number of dimensions = number of pixels). Instead, we can look at what our optimiser does when we give it something fun from our test set as a starting point.

“Here is a dog – please make it more of a cat”. “Here is a cat – please make it EVEN MORE of a cat”. “Here is a cat that you misclassified as a dog – please make it more cat”[2]. By looking at what the optimiser does in response to this, you can learn what features the classifier is considering “important” in relation to that particular image. Does it make the ears rounder when asked to go more doglike, for instance?
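A sketch of the “please make it more of a cat” idea, using a stand-in logistic “classifier” over a 16-pixel image rather than a real trained network (the weights, names and numbers are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
# A stand-in "classifier": a fixed logistic score over a 16-pixel image.
# In the post this would be a real trained dog-vs-cat network.
w = rng.normal(size=16)

def score(image):
    """Catiness of the image, between 0 ("dog") and 1 ("cat")."""
    return 1 / (1 + np.exp(-image @ w))

def make_more_cat(image, steps=50, lr=0.5):
    """Nudge the pixels uphill in cat-score by gradient ascent."""
    image = image.copy()
    for _ in range(steps):
        s = score(image)
        image += lr * s * (1 - s) * w   # gradient of the logistic score
    return image

dog = rng.normal(size=16)
catified = make_more_cat(dog)
# The pixels that moved the most are the ones this classifier treats
# as "important" for catiness near this particular image.
```

Comparing `catified - dog` is what reveals which features the classifier cares about – the computational version of “does it round the ears?”.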

Turns out you don’t even need an optimiser – just by using a high-dimensional gradient estimator, you can create a map of which pixels the classifier considers “locally important” in an image. There’s an old urban myth about a classifier for American vs Soviet tanks that was actually classifying the background light levels as cloudy vs sunny, due to the days on which the training set photos were taken, and therefore failed spectacularly in live use[3]. Such a problem is easy to spot if you can tell that the classifier is mainly looking at the background of an image rather than the features it is supposed to be identifying.
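The gradient-estimator idea can be sketched with simple finite differences (a hypothetical `saliency` helper of my own, not a named library function). In the demo, the “classifier” only looks at the first four pixels – our stand-in for the background – and the saliency map exposes that immediately:

```python
import numpy as np

def saliency(f, image, eps=1e-4):
    """Finite-difference estimate of which pixels the score f depends on."""
    base = f(image)
    grads = np.zeros_like(image)
    for i in range(image.size):
        bumped = image.copy()
        bumped[i] += eps
        grads[i] = (f(bumped) - base) / eps
    return np.abs(grads)     # large value = locally important pixel

# Demo: a "tank classifier" that secretly only reads the background
# (the first four pixels of an eight-pixel image).
background_only = lambda img: img[:4].sum()
s = saliency(background_only, np.zeros(8))
# s is large on the background pixels and (near) zero everywhere else.
```

A real image has far more pixels, so in practice you would use a cheaper gradient estimator than one bump per pixel, but the diagnostic is the same: if the hot pixels are all in the background, something is wrong.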

An EU tank (that is neither American nor Soviet)

By using an external method such as this to provide explainability to every decision an AI makes, we avoid the tradeoffs internal methods force us to make. External methods add rather than compromise.[4]

[1] Although I might be interested in the “cattiest cat that ever did cat”.

[2] Did I mention I like cats?

[3] Almost certainly a false myth. It’s probably stuck around because it’s such a useful explanatory device for elementary machine learning and because it lets us poke fun at the incompetence of military science & technology.

[4] Ribeiro et al. (2016), “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, is a brilliant resource for diving into this further.

]]>
Capturing Diminishing Returns in Agent-Based Models /stor-i-student-sites/kim-ward/2020/01/21/capturing-diminishing-returns-in-agent-based-models/ /stor-i-student-sites/kim-ward/2020/01/21/capturing-diminishing-returns-in-agent-based-models/#respond Tue, 21 Jan 2020 17:37:55 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/kim-ward/?p=52 Imagine you’re a Census field officer who knocks on non-respondents’ doors to encourage them to fill out the Census (it’s mandatory, and they could face a fine if they don’t!). You know that on average 40% of the doors you knock on will have someone answer the door. So you knock on every non-responding household’s door in your neighbourhood – some answer, and that’s great so you take them off your list and try the others again tomorrow. After a week of daily knocking, what’s the probability that a house you knock on will have someone answer you today? Is it still 40%?

Of course not. Some houses don’t have anyone there during the hours a field officer might be visiting them, and those houses will quickly become overrepresented in the sample of houses left on your list. This is an example of “diminishing returns” in action – the more effort you put into something, the less efficient each extra bit of effort gets – and is a feature of many real-world systems that one might want to simulate.
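This effect is easy to reproduce in a few lines. In this sketch I invent a population where half the households answer 70% of the time and half answer 10% of the time, which averages out to the quoted 40%:

```python
import random

def answer_rates(days=7, seed=1):
    """Knock on every pending door daily; remove households that answer."""
    random.seed(seed)
    # Two kinds of household averaging out to a 40% answer rate:
    # half are often home (70%) and half are rarely home (10%).
    houses = [0.7] * 500 + [0.1] * 500
    rates = []
    for _ in range(days):
        remaining, answers = [], 0
        for p in houses:
            if random.random() < p:
                answers += 1            # answered: off the list
            else:
                remaining.append(p)     # try again tomorrow
        rates.append(answers / len(houses))
        houses = remaining
    return rates

rates = answer_rates()
# Day one sits near 40%; by day seven the rarely-home households
# dominate the list and the rate has collapsed towards 10%.
```

No individual household ever changes its behaviour – the diminishing returns come purely from the easy houses leaving the list.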

A Census field officer visiting a caravan of Travellers in the Netherlands, 1925. Source: Wikipedia

Now imagine that you’re a virtual Census officer knocking on virtual doors. The person coding up the simulation only knows that on average 40% of the doors knocked on will answer the door, but doesn’t have any data beyond this about what happens for each day. The naive method would be to, after every knock, independently have the door be answered with probability 0.4 – however as shown above, this won’t capture the right real-world behaviour.

Why does this matter? Well, consider the two uses for the simulation – to make decisions far in advance of the live operation, and as a benchmark to track progress against during the live operation.

  • If we don’t account for diminishing returns by day, we end up looking like we’re doing a lot better than we are halfway through the live operation, as our simulation predicts that we still have a 0.4 probability of people answering the door even though we’ve already visited all of the easiest addresses. This could cause false confidence and lead to eventual undercount because we don’t take appropriate countermeasures for things going wrong.
  • If we don’t account for diminishing returns by category (e.g. time of day a Census officer is out and about) then our upfront decisions about scheduling could be thrown off by trying to optimise a simulation that doesn’t reflect the real world.

To expand on that last point: assume we had data on answering the door split by time of day – we know that on average 30% of people answer their doors in the daytime as opposed to 50% in the evenings after work. If we naively plug this into our simulation and then use it to find the optimal schedule, it will tell us to always visit in the evenings. However, in reality some houses are better approached in the daytime, and some mix of daytime and evening will be a better approach.

The Advantages of Agent-Based Modelling

Agent-Based Modelling (ABM) is a type of simulation where you model each individual agent (in this case, each Census officer and house) rather than just tracking totals for “doors knocked on” and “doors answered”.

This method opens the door (heh) to a solution by having properties associated with an agent (such as a house’s “probability of answering the door in the daytime” and “probability of answering the door in the evening”). Values for these properties can vary among agents. Now, when the virtual officer knocks on the door, instead of a universal 0.3 or 0.5 we instead call on that individual agent’s probability. Easy-to-contact houses are thus removed from the pool of remaining agents and the system starts demonstrating the emergent behaviour of diminishing returns present in the real-world system.
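A minimal sketch of that idea (the `House` class and all the probabilities here are my own invention, not anything from the Census systems):

```python
import random
from dataclasses import dataclass

@dataclass
class House:
    p_daytime: float
    p_evening: float
    answered: bool = False

    def knock(self, when):
        """One visit; returns True if someone answers the door."""
        p = self.p_daytime if when == "daytime" else self.p_evening
        if random.random() < p:
            self.answered = True
        return self.answered

random.seed(2)
# A heterogeneous population averaging roughly 30% daytime / 50% evening:
houses = [House(p_daytime=random.uniform(0.0, 0.6),
                p_evening=random.uniform(0.2, 0.8)) for _ in range(1000)]

rates = []
for day in range(5):
    pending = [h for h in houses if not h.answered]
    hits = sum(h.knock("evening") for h in pending)
    rates.append(hits / len(pending))
# Nobody coded "diminishing returns" anywhere, yet the daily answer
# rate falls as the easy-to-contact houses drop out of the pending list.
```

Because the per-agent probabilities vary, the emergent behaviour appears without being programmed in – exactly the property the naive universal-0.4 simulation lacks.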

Challenges still remain with this approach. From what distribution do you assign those properties to the agents, why that distribution, and how do you get the data needed to inform your choice? Daytime and evening contact rates of a house are probably correlated, but how correlated? And if you need a really well calibrated model and have the data to throw at it, is this really a better approach than just estimating a lookup table of contact rates by day? Nevertheless, it’s a powerful tool in the ABM kit.

This post was made drawing on my own experiences working on the collection operation for Census 2021.

]]>
Mathematical Fairness /stor-i-student-sites/kim-ward/2020/01/21/mathematical-fairness/ /stor-i-student-sites/kim-ward/2020/01/21/mathematical-fairness/#respond Tue, 21 Jan 2020 16:43:27 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/kim-ward/?p=47 A while ago, when I was in the “MOOCs are great!” phase of my life, I took an online course on negotiation. I never finished the practical part of it, but I remember the theory being surprisingly mathematical. It centred around constructing theoretically sound mathematical concepts of “fairness” and then applying them to a variety of different scenarios to construct an “optimally fair to all parties” solution – accompanied, of course, by an explanation of the various ways people attempt to obscure this to skew the negotiation situation in their favour.

A simple example of mathematical fairness is known as the “Principle of the Divided Cloth”, and it goes all the way back to the Talmud. If there is a roll of cloth that person A claims to own all of and person B claims to own half of, how should it be divided? Well, assuming the claims are both reasonable, one would be tempted to split the cloth in proportion to the amount that was claimed by each side, in a 2:1 ratio. However, the Principle of the Divided Cloth instead proposes a 3:1 split, following the logic that only half of the cloth is in dispute – nobody is saying that A doesn’t own the first half – so the disputed part should be split equally.
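The rule is short enough to write down directly (a sketch; `divided_cloth` is my own naming, and claims are expressed as fractions of the whole):

```python
def divided_cloth(claim_a, claim_b, total=1.0):
    """The Talmud's contested-garment rule for two claimants.

    Each side first keeps whatever the other side doesn't claim at all;
    only the genuinely disputed remainder is split equally.
    """
    conceded_to_a = max(total - claim_b, 0.0)   # B doesn't claim this part
    conceded_to_b = max(total - claim_a, 0.0)   # A doesn't claim this part
    disputed = total - conceded_to_a - conceded_to_b
    return conceded_to_a + disputed / 2, conceded_to_b + disputed / 2

divided_cloth(1.0, 0.5)   # -> (0.75, 0.25), the 3:1 split from the post
```

Note how the same function recovers the intuitive answers at the extremes: if both claim everything, the whole cloth is disputed and split 1:1.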

The Principle of the Divided Cloth, taken from a video in the Negotiation course

This principle can be applied to a wide variety of problems where two (or more) parties can come together to create something more valuable than they could working apart, and must decide how to divide that extra value.

Fairness in Optimisation

In an optimisation problem, you’re trying to locate the “best” solution. “Best” how and for whom, you ask? Good question. The usual answer is “best for whomever is paying me and however they define ‘best’”. However, in applying optimisation methods to complex problems when multiple parties have stakes in the outcome, you can find yourself unwillingly appointed to the position of arbitrator.

The example we’ve looked at in STOR-i relates to the OR-MASTER project about scheduling flight capacity on airport runways. The airport itself would like as much capacity as possible to be used (making it extra money), without going over capacity and causing delays as planes are held up for lack of runways. Each airline wants to be able to schedule its flights to take off and land whenever it wants, which may include during “peak periods” when the runways are too busy to accommodate everyone’s requests. The airlines as a collective want a solution whose mechanics they understand, so they don’t have to play convoluted guessing games to get the flight slots they want.

There are two conflicts at play here:

  • Each airline is competing with every other airline that uses that airport’s runways. We need to ensure the optimisation result is “fair” to every airline.
  • The airlines as a collective are in competition with the airport when it comes to what they want from this optimisation. We need to balance the airlines’ need for “fairness” against the airport’s need for “maximum capacity in use”.

The first problem is solved by constructing a mathematical definition of “fairness”, which in this case is approximately “You get out of life what you put into it”. Airlines requesting slots in peak periods are likely to have their slot requests moved around. The more peak slots they request, the more movement they will have to put up with. We can construct a “coefficient of fairness” for each airline that quantifies how fairly it has been treated by our optimiser, and then try to keep everyone’s fairness coefficient similar.

The second problem is harder to put in maths, because airlines and airports are not the same thing, so it’s hard to say what “equal treatment” means. Also, there’s nothing to say we should be treating them equally – that decision is better made by experts in the field of aviation than by us.

We solve this by constructing what’s called a “Pareto Frontier”, or a range of solutions that balance one objective (fairness) against another (capacity). Each point on the frontier is “optimal” in the sense that for the same capacity there is no fairer solution and for the same fairness there is no solution with more capacity.
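Computing such a frontier from a finite set of candidate solutions is a simple filter (a sketch assuming both objectives are scored so that bigger is better; the example numbers are invented):

```python
def pareto_frontier(points):
    """Keep the points not dominated in both objectives (maximise both).

    points: list of (fairness, capacity) pairs.
    A point is dominated if some other point is at least as good in
    both objectives and isn't the point itself.
    """
    frontier = []
    for f, c in points:
        dominated = any(f2 >= f and c2 >= c and (f2, c2) != (f, c)
                        for f2, c2 in points)
        if not dominated:
            frontier.append((f, c))
    return frontier

solutions = [(1, 5), (2, 4), (3, 1), (2, 2), (0, 6)]   # (fairness, capacity)
frontier = pareto_frontier(solutions)   # (2, 2) loses to (2, 4); the rest stay
```

In the real problem the candidate solutions come out of the optimiser rather than a hand-written list, but the dominance check is the same.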

Illustration of the Pareto Frontier, taken from Wikipedia

Then, we can give this solution set over to the representatives of the airlines and airports to negotiate over, washing our hands of the problem.

It turns out in practice we can get a lot of what are called “easy wins” – by sacrificing a small amount of “optimal capacity”, we can make the solution a lot fairer. This is visible in the diagram above as the frontier being flatter near the top. So by implementing any fairness metric at all in our optimisation, we can keep multiple parties happier.

]]>
Ethics in Mathematics /stor-i-student-sites/kim-ward/2020/01/09/ethics-in-mathematics/ /stor-i-student-sites/kim-ward/2020/01/09/ethics-in-mathematics/#respond Thu, 09 Jan 2020 19:36:12 +0000 http://www.lancaster.ac.uk/stor-i-student-sites/kim-ward/?p=44 Mathematical work can sometimes seem detached from the real world. The abstractness of statistics and datasets often masks the ways in which they can eventually be used – and misused. But for many of us, the maths that we do is not just for fun; we are producing real results that will be used by real people to change real things. As the creators and communicators of analysis, how much responsibility do we have to ensure our work does not cause harm through unintended consequences or misuse? Can we sit back and argue “We just do the numbers; it’s not our problem”?

Here is a link to a post I wrote last year on the Government Statistical Service website, about a talk on Ethics in Mathematics that I organised for them.

The talk itself can be found as a recording on the Cambridge University Ethics in Mathematics Project website, here:

]]>