A Few of my Favourite Things – Danielle Notice

Solving Sudoku with Metaheuristics: GVNS

danielle-notice — Mon, 25 Apr 2022 09:00:00 +0000

Reading Time: 5 minutes

The last time I travelled, I saw a little old lady in the airport with her crossword puzzle book. My grandmother travels with her word search book. Me? In my old age, I will travel with a Sudoku book. Before I hit old age, I will take advantage of the opportunity to combine my studies with my favourite game. As it turns out, heuristic algorithms are pretty popular for solving, creating and rating Sudoku puzzles. In this post, we will look at how a metaheuristic, general variable neighbourhood search, has been used to solve Sudoku puzzle instances.

What is Sudoku?
What are metaheuristics?
General Variable Neighbourhood Search
Solving Sudoku

1. What is Sudoku?

Sudoku is a Japanese puzzle which consists of an n² × n² grid divided into n² sub-grids each of size n × n. The word Sudoku is a combination of two Japanese words, Su (Number) and Doku (Single) and loosely translates to “solitary number”. n is the order of the puzzle, with n=3 being the most popular.

The objective of Sudoku is to fill each cell in a way that every row, column and sub-grid contains each integer between 1 and n² inclusive exactly once.
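To make the rule concrete, here is a minimal sketch of a checker for a completed grid. The helper name `is_valid_sudoku` and the shifted-pattern test grid are my own illustrations, not something from a particular Sudoku package.

```r
# Check whether a completed order-n Sudoku grid satisfies all three rules.
# `grid` is an n^2 x n^2 matrix; `is_valid_sudoku` is a hypothetical helper.
is_valid_sudoku <- function(grid, n = 3) {
  N <- n^2
  complete <- function(v) length(v) == N && setequal(v, 1:N)
  rows_ok <- all(apply(grid, 1, complete))
  cols_ok <- all(apply(grid, 2, complete))
  boxes_ok <- TRUE
  for (br in seq(1, N, by = n)) {
    for (bc in seq(1, N, by = n)) {
      box <- grid[br:(br + n - 1), bc:(bc + n - 1)]
      boxes_ok <- boxes_ok && complete(as.vector(box))
    }
  }
  rows_ok && cols_ok && boxes_ok
}

# A valid 9 x 9 grid built from a shifted base pattern (illustrative only)
pattern <- function(r, c) (3 * ((r - 1) %% 3) + (r - 1) %/% 3 + (c - 1)) %% 9 + 1
solved <- outer(1:9, 1:9, pattern)
is_valid_sudoku(solved)   # TRUE

# Swapping two cells in a row keeps the row valid but breaks two columns
broken <- solved
broken[1, 1:2] <- broken[1, 2:1]
is_valid_sudoku(broken)   # FALSE
```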

Sudoku is an example of a combinatorial optimisation (CO) problem, a class of problems whose solutions lie in a finite or countably infinite set. Constructing or completing a Sudoku puzzle from a partially filled grid are both NP-complete problems. This means that there is no known algorithm which can solve all possible Sudoku problem instances in polynomial time (and none exists unless P = NP). The solution space for an empty 9 × 9 Sudoku grid contains approximately 6.7 × 10²¹ possible combinations. However, the pre-filled cells serve as constraints and reduce the number of possible combinations.

2. What are metaheuristics?

When it comes to solving optimisation problems, there are 2 main types of approaches: exact methods and approximate methods. Exact methods are guaranteed to find an optimal solution for every finite problem instance of a CO problem.

Approximate methods such as heuristic algorithms can be used when there is no known exact method that can be used to solve the problem or when the known exact methods are too computationally expensive to be used practically. In the context of optimisation problems, a heuristic is a well-defined intelligent procedure – based on intuition, problem context and structure – designed to find an approximate solution to the problem. Unlike exact methods, the solutions found may not be optimal, but they are “acceptable” in some sense. The effectiveness of a heuristic depends on the quality of the approximations that it produces.

The performance of heuristics can be improved using metaheuristics, which are high-level, problem-independent strategies used to develop heuristic optimisation algorithms. They are designed to approximately solve a wide range of problems without needing to fundamentally change.

3. General Variable Neighbourhood Search

Variable neighbourhood search (VNS) algorithms, which were originally proposed by Mladenović and Hansen, are single-solution based metaheuristics. They successively explore a set of predefined neighbourhoods which are typically increasingly distant from the current candidate solution.

Illustration of the main idea of a basic VNS algorithm

VNS’ main cycle is composed of three phases: shaking, local search and move.

In the shake phase, a random solution s′ is selected from the kth neighbourhood of the current solution s. This s′ is then used as the initial solution for the local search algorithm being used, which produces a new candidate solution s′′. In the move phase, if s′′ is better than the current solution s, then it replaces s and the cycle is restarted with this new solution; otherwise, the cycle is restarted with the same solution but a different neighbourhood.
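The shake, local search and move cycle can be sketched on a toy problem. Everything below is an assumption made for illustration: the bumpy objective `f`, the search range of -50 to 50, and the choice that neighbourhood k contains the integers within 5k of the current solution.

```r
# Basic VNS on a toy problem: minimise f over the integers -50..50.
f <- function(x) (x - 17)^2 + 10 * (x %% 4)   # toy objective with several local minima

# Shake: pick a random point from the k-th neighbourhood (within 5 * k of s).
shake <- function(s, k) {
  candidates <- setdiff(max(-50, s - 5 * k):min(50, s + 5 * k), s)
  candidates[sample.int(length(candidates), 1)]
}

# Local search: keep stepping to the better adjacent integer until neither improves.
local_search <- function(s) {
  repeat {
    nb <- c(s - 1, s + 1)
    nb <- nb[nb >= -50 & nb <= 50]
    best <- nb[which.min(f(nb))]
    if (f(best) >= f(s)) return(s)
    s <- best
  }
}

vns <- function(s, k_max = 5, max_iter = 200) {
  for (iter in seq_len(max_iter)) {
    k <- 1
    while (k <= k_max) {
      s1 <- shake(s, k)          # s'
      s2 <- local_search(s1)     # s''
      if (f(s2) < f(s)) {        # move: accept and return to the first neighbourhood
        s <- s2
        k <- 1
      } else {                   # otherwise keep s and widen the neighbourhood
        k <- k + 1
      }
    }
  }
  s
}

set.seed(1)
start <- -40
result <- vns(start)
c(start_cost = f(start), final_cost = f(result))
```

Because a move is only accepted when it strictly improves the cost, the final cost can never be worse than the starting cost.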

Variable Neighbourhood Descent (VND) is a deterministic variant of the VNS algorithm. The VND main cycle uses a best improvement method, choosing s′′ as the local optimum in neighbourhood N_k.
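A minimal sketch of VND, assuming each neighbourhood is given as a function that returns a list of neighbours. The toy problem (sorting a vector by minimising inversions) and all helper names are mine, chosen only to keep the example small.

```r
# VND sketch: `neighbourhoods` is a list of functions, each returning a list of neighbours of s.
vnd <- function(s, f, neighbourhoods) {
  k <- 1
  while (k <= length(neighbourhoods)) {
    nbrs <- neighbourhoods[[k]](s)
    costs <- vapply(nbrs, f, numeric(1))
    best <- nbrs[[which.min(costs)]]   # best improvement within N_k
    if (f(best) < f(s)) {
      s <- best                        # improvement: go back to the first neighbourhood
      k <- 1
    } else {
      k <- k + 1                       # no improvement: try the next neighbourhood
    }
  }
  s
}

# Toy cost: number of out-of-order pairs in the vector
inversions <- function(v) sum(outer(v, v, ">") & upper.tri(diag(length(v))))

# N_1: swap two adjacent elements; N_2: swap any two elements
adjacent_swaps <- function(v) {
  lapply(seq_len(length(v) - 1), function(i) { v[c(i, i + 1)] <- v[c(i + 1, i)]; v })
}
any_swaps <- function(v) {
  combn(length(v), 2, function(p) { v[p] <- v[rev(p)]; v }, simplify = FALSE)
}

vnd(c(3, 1, 4, 2), inversions, list(adjacent_swaps, any_swaps))   # 1 2 3 4
```

There is no randomness here: the same start always gives the same result, which is what makes VND a natural local search step inside a stochastic outer loop.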

The General Variable Neighbourhood Search (GVNS) uses the VND as the local search procedure (line 7 of the algorithm).

4. Solving Sudoku with GVNS

We will now look at the different elements from the algorithm above and how this metaheuristic has been used to solve 9 × 9 Sudoku puzzles.

Solution representation: each sub-grid is numbered from 1 to 9, and each cell in a sub-grid is numbered from 1 to 9. So x_ij denotes the jth cell in sub-grid i (see grid above for example of labelled cell).
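As a small illustration of this indexing, the sketch below converts (sub-grid i, cell j) into an ordinary (row, column) position. It assumes that both sub-grids and cells are numbered row by row from the top-left, which the description above does not state explicitly; `cell_to_rc` is a hypothetical helper.

```r
# Map x_ij (cell j of sub-grid i) to its row and column in the full grid.
# Assumes row-by-row numbering from the top-left for sub-grids and for cells.
cell_to_rc <- function(i, j, n = 3) {
  c(row = ((i - 1) %/% n) * n + (j - 1) %/% n + 1,
    col = ((i - 1) %% n) * n + (j - 1) %% n + 1)
}

cell_to_rc(1, 1)   # row 1, col 1
cell_to_rc(5, 1)   # row 4, col 4 (top-left cell of the middle sub-grid)
cell_to_rc(2, 6)   # row 2, col 6
cell_to_rc(9, 9)   # row 9, col 9
```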

Solution Initialisation: To initialise the solution, for each cell, a random number is selected from a list that includes all the numbers that could be assigned to the cell without violating any of the constraints with respect to the fixed cells. This is done in such a way that the sub-grid rule is satisfied. To reduce the solution space, the authors fixed the cells that only have one possible value and repeated that until there were no more such cells.
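Here is a simplified sketch of the initialisation. It only enforces the sub-grid rule, by filling the empty cells of each sub-grid with a random ordering of its missing digits. It does not filter candidates against fixed cells in the same row or column, and it skips the step of pre-fixing single-candidate cells, so treat it as a rough stand-in for the procedure described above. The function name and the convention that 0 marks an empty cell are my assumptions.

```r
# puzzle: 9 x 9 matrix with 0 for empty cells (the 0 convention is an assumption)
init_solution <- function(puzzle, n = 3) {
  s <- puzzle
  for (br in seq(1, n^2, by = n)) {
    for (bc in seq(1, n^2, by = n)) {
      rows <- br:(br + n - 1)
      cols <- bc:(bc + n - 1)
      box <- s[rows, cols]
      missing <- setdiff(1:n^2, box[box != 0])              # digits not yet in this sub-grid
      box[box == 0] <- missing[sample.int(length(missing))] # random order of missing digits
      s[rows, cols] <- box
    }
  }
  s
}

# Illustrative puzzle: blank 50 random cells of a valid grid
set.seed(42)
solved <- outer(1:9, 1:9, function(r, c) (3 * ((r - 1) %% 3) + (r - 1) %/% 3 + (c - 1)) %% 9 + 1)
puzzle <- solved
puzzle[sample(81, 50)] <- 0
s0 <- init_solution(puzzle)
```

After this step every sub-grid contains each digit exactly once and the given cells are untouched, so only row and column violations remain to be repaired.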

Cost function f(s): evaluates the violation of the row and column constraints and counts how many values are repeated in each row and in each column (illustrated in figure below). The goal is to minimise the cost function. The optimal solution will have f(s)=0.

A candidate solution of the Sudoku puzzle with its fitness value. Repeated digits highlighted in first row and column.
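One plausible reading of this cost function, as a sketch: for each row and each column, count how many cells hold a digit that already appears earlier in that line, then add everything up. The exact counting convention used in the original work may differ, so `sudoku_cost` is an assumption.

```r
# Cost = total number of repeated values over all rows and all columns.
sudoku_cost <- function(s) {
  repeats <- function(v) length(v) - length(unique(v))
  sum(apply(s, 1, repeats)) + sum(apply(s, 2, repeats))
}

# A valid grid has cost 0
solved <- outer(1:9, 1:9, function(r, c) (3 * ((r - 1) %% 3) + (r - 1) %/% 3 + (c - 1)) %% 9 + 1)
sudoku_cost(solved)      # 0

# Swapping two cells within a row creates one repeat in each of two columns
candidate <- solved
candidate[1, 1:2] <- candidate[1, 2:1]
sudoku_cost(candidate)   # 2
```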

Neighbourhood structures

Only one neighbourhood structure, Invert, is defined for the shake phase. In this structure two cells in the sub-grid are selected and the order of the sub-sequence of cells between them is reversed.
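A sketch of the Invert move on one sub-grid stored as a vector of 9 values. How fixed cells are handled is not spelled out above, so this version assumes they stay in place and only the unfixed cells inside the chosen span are reversed. `invert_move` is a hypothetical helper.

```r
# Reverse the values of the unfixed cells between positions a and b (inclusive).
# `fixed` is a logical vector marking the given cells, which are left untouched.
invert_move <- function(box, fixed, a, b) {
  free <- which(!fixed)
  span <- free[free >= a & free <= b]
  box[span] <- rev(box[span])
  box
}

box <- c(5, 3, 9, 1, 7, 2, 8, 4, 6)
fixed <- c(TRUE, rep(FALSE, 8))
invert_move(box, fixed, 2, 5)   # 5 7 1 9 3 2 8 4 6

# In the shake phase the two end points would be drawn at random:
ends <- sort(sample(9, 2))
shaken <- invert_move(box, fixed, ends[1], ends[2])
```

Since the move only reorders values inside one sub-grid, the sub-grid rule stays satisfied.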

There are 3 neighbourhood structures defined which are used in the VND local search:

Insert – the value of a chosen cell in a sub-grid is inserted in front of another chosen cell.
Swap – the values of 2 unfixed cells in the same sub-grid are exchanged.
Centered Point Oriented Exchange – a cell between the second and sixth cell in a sub-grid is selected as the center point to find exchange pairs. Values for pairs of cells, each equidistant from the center, are swapped until at least one cell in a pair is fixed.
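The first two structures can be sketched on a single sub-grid stored as a vector. Both helpers are hypothetical. `insert_move` works on a plain sequence and ignores fixed cells for brevity, so in a solver it would presumably be applied to the unfixed cells only. The Centered Point Oriented Exchange is left out because its details are not fully specified here.

```r
# Swap: exchange the values of two unfixed cells in the same sub-grid.
swap_move <- function(box, fixed, a, b) {
  stopifnot(!fixed[a], !fixed[b])
  box[c(a, b)] <- box[c(b, a)]
  box
}

# Insert: remove the value at position `from` and put it directly in front of
# the value that was at position `to`.
insert_move <- function(box, from, to) {
  v <- box[-from]
  if (to > from) to <- to - 1          # positions shift left after the removal
  append(v, box[from], after = to - 1)
}

box <- c(5, 3, 9, 1, 7, 2, 8, 4, 6)
fixed <- c(TRUE, rep(FALSE, 8))

swap_move(box, fixed, 2, 3)   # 5 9 3 1 7 2 8 4 6
insert_move(box, 2, 6)        # 5 9 1 7 3 2 8 4 6  (3 now sits in front of 2)
insert_move(box, 6, 2)        # 5 2 3 9 1 7 8 4 6  (2 now sits in front of 3)
```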

Each of these structures applies to a single sub-grid, and in the local search, the neighbourhoods of each of the sub-grids are explored. Within the VND local search, a deep local search algorithm is used. This uses the best improvement strategy, which explores the whole of the current neighbourhood to find its best solution.


The Tidyverse: the best* -verse for data scientists

danielle-notice — Mon, 21 Mar 2022 14:00:00 +0000

Reading Time: 4 minutes

There are a couple popular universes out there like the MCU and its multiverse, and Zuckerberg’s Metaverse. My personal favourite however is actually a universe of R packages.

This post is by no means a tutorial for the tidyverse. Nor is it an introduction to these packages or style of coding using R. Instead, this is just a compilation of my favourite features of the packages that will hopefully convince you of its power and convert you to the tidy side.

What is in the tidyverse?
Tibbles!
Pipes & Purrr
A few (more) of my favourite things

1. What is in the tidyverse?

The tidyverse is a collection of R packages designed by Hadley Wickham for data science. It includes packages useful for loading, wrangling, modelling and visualising data, and a couple that make programming in R so much better. When you install and load the tidyverse, the following core packages will be loaded:

install.packages("tidyverse")
library(tidyverse)

readr – to import tabular data
tibble – the better* dataframe
dplyr – for data manipulation
tidyr – to make data tidy
ggplot2 – for data visualisation
purrr – for functional programming
stringr – for string manipulation
forcats – for factors

There are also a couple of other packages that are installed for working with specific types of vectors, importing other data types and for modelling.

2. Tibbles!

I must start this section by addressing the * I’ve included so far. The tidyverse developers themselves describe it as an opinionated collection of R packages. So when I say that tibbles are better than data frames, that’s just my opinion as someone who has drunk the Kool-Aid and loves it.

If you’ve ever used the data.frame or data.table, unless you’ve completely mastered using them, you may agree with me that it can be a bit confusing remembering how many commas are needed, whether to use square brackets or parentheses, if something is being done in place or if you need to make a copy. A tibble is “a modern reimagining of the data.frame”. The developers put it nicely when they said that tibbles are lazy and surly data.frames: they do less and complain more.
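Here is a small sketch of what “do less and complain more” looks like in practice, using two behaviours I rely on: partial matching with `$` and single-column subsetting. The toy columns are made up for the example.

```r
library(tibble)

df <- data.frame(number = 1:3, letter = c("a", "b", "c"))
tb <- tibble(number = 1:3, letter = c("a", "b", "c"))

# Partial matching: the data.frame quietly guesses, the tibble complains
df$num                       # returns the `number` column: 1 2 3
tb$num                       # NULL, with a warning about an unknown column

# Subsetting one column: the data.frame changes type, the tibble does not
class(df[, 1])               # "integer"
class(tb[, 1])               # "tbl_df" "tbl" "data.frame"
```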

3. Pipes & Purrr

When you’re trying to manipulate data, or doing analysis that isn’t super simple, it’s very likely that you’ll end up with nested functions. Here’s a simple example: you want to create a table of random numbers using different distributions and then add a column for a new distribution.

There are a couple of ways to approach this: you could create a new variable (or overwrite the variable) at each step. Or you could use pipes. Among other things, piping saves you having to rewrite variable names, avoids nested function calls and makes code look so much more elegant. The pipe operator %>% is included when you install the tidyverse.

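# Both bits of code do the same thing: build columns N and E, then add column G.
# (The values differ between the two because each call draws new random numbers.)
library(tidyverse)

# Without pipes: overwrite the variable at each step
no_pipe <- tibble(N = rnorm(10), E = rexp(10))
no_pipe <- mutate(no_pipe, G = rgamma(10, 1))

# With pipes: pass the result of each step straight into the next
with_pipe <-
  tibble(N = rnorm(10), E = rexp(10)) %>%
  mutate(G = rgamma(10, 1))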

If you want to take it to the next level, the package magrittr includes several other piping operators. My personal favourite is the assignment pipe %<>% which allows you to modify data in place.
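A quick sketch of the assignment pipe, reusing the table from the earlier example. Note that `%<>%` is not attached by `library(tidyverse)`, so magrittr has to be loaded explicitly; the variable name `draws` is just for illustration.

```r
library(magrittr)   # provides %<>%
library(dplyr)
library(tibble)

set.seed(1)
draws <- tibble(N = rnorm(10), E = rexp(10))

# Pipe `draws` into mutate() and assign the result back to `draws`
draws %<>% mutate(G = rgamma(10, 1))

names(draws)        # "N" "E" "G"
```

The line with `%<>%` is shorthand for `draws <- draws %>% mutate(G = rgamma(10, 1))`.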

4. A few more of my favourite things

As I said at the start, this is by no means meant to be a comprehensive introduction to the tidyverse. Now that I’ve introduced a few of the basics, here are a couple of other features (each of which could have its own post really) that make these packages so great:

Tibbles and what they can store

With tibbles, a column’s type does not have to be a core data type. As well as the basics (integer, numeric, string, factor, logical), cells can contain vectors, lists, tibbles or almost anything really. And you can move between complex and simple data types using functions in the package tidyr, which has useful functions to nest, unnest and pivot data to the desired shape without much hassle.
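As a sketch of what a list-column looks like, the example below nests the built-in `mtcars` data by number of cylinders and then flattens it again. The choice of dataset and grouping variable is mine.

```r
library(dplyr)
library(tidyr)
library(tibble)

# One row per value of cyl; the remaining columns are packed into a list-column of tibbles
nested <- mtcars %>%
  as_tibble() %>%
  group_by(cyl) %>%
  nest()

nested                # 3 rows: cyl and a list-column called `data`

# unnest() goes back to one row per car
unnested <- nested %>%
  unnest(data) %>%
  ungroup()

dim(unnested)         # 32 11
```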

Purrr, map and all its variants

You can make code a lot easier to read by using map functions to replace for loops. This is really helpful when you have nested tibbles that you want to perform a set of operations over.
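For example, here is a minimal sketch that fits one linear model per group of a nested tibble without writing a loop. The dataset (`mtcars`), the model (`mpg ~ wt`) and the column names are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

fits <- mtcars %>%
  as_tibble() %>%
  nest(data = -cyl) %>%                                  # one nested tibble per cyl value
  mutate(
    model = map(data, ~ lm(mpg ~ wt, data = .x)),        # a list-column of fitted models
    slope = map_dbl(model, ~ coef(.x)[["wt"]])           # one number per model
  )

fits %>% select(cyl, slope)
```

`map()` always returns a list, while variants such as `map_dbl()` return a plain vector of the stated type, which is why `slope` ends up as an ordinary numeric column.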

Tidyselect and dplyr

The group_by function from dplyr and the many helper functions included in the package tidyselect make summarising and manipulating groups of data super straightforward.
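A short sketch of the two working together on the built-in `iris` data: `group_by()` defines the groups, and the tidyselect helper `starts_with()` picks the columns to summarise inside `across()`.

```r
library(dplyr)

petal_summary <- iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Petal"), mean), n = n())

petal_summary   # one row per species: mean Petal.Length, mean Petal.Width and group size
```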

Grammar of Graphics

I’ll admit that when you just start using ggplot2 in R, it may seem really complicated, especially when compared to the base graphics package included in R. But once you get the hang of the basics, you can create some spectacular visualisations.

R Markdown

R Markdown is an amazing way to combine code, results and commentary and save them as accessible file types. It is really good as both a lab notebook to keep track of your work and thoughts, and as a means of communicating every step of the analysis process.

Learn More

This (as well as lots of practice) is where I learnt most of what I know about the tidyverse.
A really helpful resource for ggplot2

A new reality TV show idea: the Stable Marriage algorithm

danielle-notice — Mon, 14 Feb 2022 09:00:00 +0000

Reading Time: 3 minutes

As a hopeful romantic, a believer in the principle of marriage and a lover of dating reality TV, I was immediately intrigued by this problem and solution. So to celebrate Valentine’s Day I thought it would be fitting to look at the stable marriage problem.

1. The Premise

Consider two disjoint sets with the same number of elements (for example a group of n men and a group of n women). A matching is a one-to-one mapping from one set onto the other (a set of n monogamous marriages between the men and women). Each man has an order of preference for the women and each woman an order of preference for the men.

A matching is unstable if there exists a possible pairing of a man and a woman (not currently married to each other) who both prefer each other to their spouses. For example, Johnny is married to Bao but prefers Myrla and Myrla is married to Gil but prefers Johnny (IYKYK). While this would make for entertaining TV, the stable marriage problem is to find a matching that avoids this situation.
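The definition translates fairly directly into a check for such a pair. In this sketch people are numbered 1 to n, each row of a preference matrix lists the other group in order of preference, and `wife[m]` is the woman married to man m. These conventions and the helper name `is_stable` are my own.

```r
# men_pref[m, ]   : women in man m's order of preference (favourite first)
# women_pref[w, ] : men in woman w's order of preference
# wife[m]         : the woman currently married to man m
is_stable <- function(wife, men_pref, women_pref) {
  n <- length(wife)
  husband <- order(wife)                          # husband[w] = man married to woman w
  rank_of <- function(pref, who, whom) match(whom, pref[who, ])
  for (m in seq_len(n)) {
    for (w in seq_len(n)) {
      if (w == wife[m]) next
      m_prefers <- rank_of(men_pref, m, w) < rank_of(men_pref, m, wife[m])
      w_prefers <- rank_of(women_pref, w, m) < rank_of(women_pref, w, husband[w])
      if (m_prefers && w_prefers) return(FALSE)   # m and w would both rather be together
    }
  }
  TRUE
}

# Both men prefer woman 1 and both women prefer man 2
men_pref   <- rbind(c(1, 2), c(1, 2))
women_pref <- rbind(c(2, 1), c(2, 1))

is_stable(c(1, 2), men_pref, women_pref)   # FALSE: man 2 and woman 1 prefer each other
is_stable(c(2, 1), men_pref, women_pref)   # TRUE
```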

2. The Pitch

Firstly, it is always possible to find a stable matching in this situation. One possible way to find a solution is the Gale-Shapley algorithm:

First Round

Each man proposes to the woman he prefers the most.
Each woman (if she received any proposals) tentatively accepts her favourite proposal and rejects all the others.

Subsequent Rounds

Each unengaged man proposes to the next woman he prefers the most who has not yet rejected him, regardless of whether she is currently engaged (scandalous!)
Each unengaged woman tentatively accepts her favourite proposal and rejects all the others.
Each engaged woman considers any new proposals and leaves her current partner if she prefers one of the new proposals. She tentatively accepts that better proposal and rejects all the others.

The subsequent rounds are repeated until everyone is engaged.
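The rounds above can be sketched in a few lines. This version handles one proposal at a time instead of whole rounds, which is a common way to implement the algorithm and leads to the same matching when the men propose. The function name, the numbering of people from 1 to n and the example preferences are all assumptions for illustration.

```r
# men_pref[m, ] and women_pref[w, ] list the other group, favourite first.
# Returns wife[m], the woman matched to man m.
gale_shapley <- function(men_pref, women_pref) {
  n <- nrow(men_pref)
  # women_rank[w, m] = position of man m in woman w's list (1 = favourite)
  women_rank <- t(apply(women_pref, 1, order))
  next_choice <- rep(1, n)           # where each man is on his own list
  fiance <- rep(NA_integer_, n)      # fiance[w] = man tentatively engaged to woman w
  free_men <- seq_len(n)

  while (length(free_men) > 0) {
    m <- free_men[1]
    w <- men_pref[m, next_choice[m]] # best woman who has not yet rejected him
    next_choice[m] <- next_choice[m] + 1
    current <- fiance[w]
    if (is.na(current)) {
      fiance[w] <- m                 # she was unengaged: tentatively accept
      free_men <- free_men[-1]
    } else if (women_rank[w, m] < women_rank[w, current]) {
      fiance[w] <- m                 # she prefers m: her old partner is free again
      free_men <- c(free_men[-1], current)
    }
    # otherwise she rejects m, who stays free and tries his next choice
  }
  order(fiance)                      # invert: wife[m] for each man m
}

men_pref   <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
women_pref <- rbind(c(2, 1, 3), c(1, 3, 2), c(3, 1, 2))
gale_shapley(men_pref, women_pref)   # 2 1 3
```

In the example, man 1 ends up with woman 2, man 2 with woman 1 and man 3 with woman 3.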

Example of the Gale-Shapley algorithm

3. A Problem

Important for this algorithm is who makes the proposals – if the men propose, the overall outcome is better for them than for the women. If we score each marriage in the stable matching from both the male and female perspectives based on each person’s preferences and take the total score for each gender, we can see a clear difference in the distribution of the scores. The difference becomes more drastic as the set size is increased.

Distribution of scores for stable matchings when males make proposal using randomly generated preference tables (female scores red, male scores blue)
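To show how such a score can be computed, here is a small sketch. It assumes the score of a marriage, from one person’s side, is the position of their partner on their own list (1 = first choice), so lower totals are better. The scoring used for the figure may be defined differently. The two matchings compared are the two stable matchings of a tiny hand-made example.

```r
# Total partner rank per side: lower means that side got more of what it wanted.
total_scores <- function(wife, men_pref, women_pref) {
  n <- length(wife)
  husband <- order(wife)   # husband[w] = man married to woman w
  men <- sum(vapply(seq_len(n), function(m) match(wife[m], men_pref[m, ]), integer(1)))
  women <- sum(vapply(seq_len(n), function(w) match(husband[w], women_pref[w, ]), integer(1)))
  c(men = men, women = women)
}

# Each man's favourite ranks him last, and vice versa
men_pref   <- rbind(c(1, 2), c(2, 1))
women_pref <- rbind(c(2, 1), c(1, 2))

total_scores(c(1, 2), men_pref, women_pref)   # men propose:   men 2, women 4
total_scores(c(2, 1), men_pref, women_pref)   # women propose: men 4, women 2
```

Both matchings are stable, but whichever side proposes gets its first choices while the other side gets its last.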

4. In Practice

While I’ve introduced this problem as a pitch for a dramatic (even if biased) match-making show, Shapley and Roth won the 2012 Nobel Memorial Prize in Economic Sciences for their work on this problem and its applications, and some of the ideas have since been extended in further research.

Here are some interesting situations that this algorithm or some variation of it has been used for in practice:

Matching organ donors to transplant patients

Learn more

Gale, D., & Shapley, L. S. (1962). College Admissions and the Stability of Marriage. The American Mathematical Monthly, 69(1), 9–15.