Voltorb Flip: Overanalyzing a Minigame in Pokemon HeartGold and SoulSilver -- Part 2.5
Significance Testing
This post is Part 2.5 of a four-part series on Voltorb Flip, a minigame found in Pokemon HeartGold and SoulSilver. If you haven't read Part 2 of my series overanalyzing Voltorb Flip, it can be found here; Part 1 is here.
As I was writing Part 3, on my attempts to use Optical Character Recognition to automate playing Voltorb Flip, it occurred to me that I had neglected a critical part of the algorithm analysis from Part 2 — significance testing. We’ll get to the OCR directly afterwards, in the real Part 3.
Significance Testing
In my last post, I analyzed a Figure of Merit (FoM) based algorithm for guessing well in Voltorb Flip. The algorithm is pretty straightforward. I calculate the remaining reward R for each row and column, divide each by the number of Voltorbs in that line (plus 1), and multiply them together to create a figure of merit for each tile.
FoM = R / (1 + V)
R = S - 5 - y_R - 2 z_R, where S = P + V (P is the line's displayed point sum and V its Voltorb count), and y_R, z_R are the number of revealed 2 tiles and 3 tiles in that line
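As a concrete sketch (the function name and signature here are mine, not from Part 2), the per-line calculation looks like this in Python:

def line_fom(P, V, revealed_2s=0, revealed_3s=0):
    # P: the line's displayed point sum; V: its displayed Voltorb count
    # revealed_2s, revealed_3s: 2 and 3 tiles already flipped in this line
    S = P + V
    R = S - 5 - revealed_2s - 2 * revealed_3s  # remaining reward
    return R / (1 + V)

# a tile's figure of merit is the product of its row and column values
tile_fom = line_fom(7, 1) * line_fom(5, 2)

If you drop the 1/(1 + V) factors, the same product reduces to the simpler R product score described below.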
The solution algorithm flips the tile with the highest figure of merit. In a tie, it chooses the upper left one. The FoM algorithm cleared the board 70.83% of the time in 10,000 simulated games of Voltorb Flip.
I also simulated a simpler algorithm: instead of constructing the full figure of merit, we simply calculate R for each line, multiply out to get an R product for each tile, and choose the tile with the highest R product. This algorithm cleared the board 68.99% of the time. That's worse, but is it significantly worse?
Statistical Significance
Statistical significance is a much used and abused concept. The arbitrary p < 0.05 cutoff for a significant result has probably caused active harm to science as a discipline. That being said, it’s a super useful tool. So how does it work?
This is a Gaussian distribution, also called a normal distribution. It's referred to as "normal" because it is, in some sense, the normal distribution for any statistical phenomenon. That's a very informal statement of the Central Limit Theorem, which formally states that when independent random variables are summed together, the distribution of their (suitably normalized) sum approaches the normal distribution, even if the individual variables themselves are not normally distributed. It's actually spooky how well this works.
Necessary fact: adding two independent random variables is equivalent to convolving their probability distributions. Let's illustrate the universe's tendency to trend toward Gaussians. If you take completely random noise and keep convolving it with other completely random noise, a smooth bell curve emerges after only a few passes.
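Here's a minimal sketch of that experiment, assuming nothing more than numpy and matplotlib:

import numpy as np
import matplotlib.pyplot as plt

# start with completely random noise, then repeatedly convolve it
# with fresh random noise and watch a bell curve emerge
signal = np.random.rand(100)
for _ in range(5):
    signal = np.convolve(signal, np.random.rand(100))

plt.plot(signal / signal.max())  # normalize so the peak is at 1
plt.title('Random noise after five convolutions')
plt.show()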
If you were to assume that every statistical distribution is a Gaussian, you would be right more often than you were wrong, and pretty close in almost all cases.
What does this have to do with statistical significance? Let’s say we know that a statistical process has a Gaussian distribution with a mean value of μ and a standard deviation of σ.
If we then take a sample x from that process, we expect it to fall somewhere on this curve. 68.3% of the time, x will lie within plus or minus one standard deviation of the mean. 95.45% of the time, x will lie within two standard deviations, and so on.
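You can verify those coverage fractions directly with scipy:

from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)  # probability mass within k standard deviations of the mean
    print(f'within {k} sigma: {coverage:.4f}')
# prints 0.6827, 0.9545, 0.9973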
When doing significance testing of a sample x’ which may or may not be drawn from our known probability distribution, we assume what is called the “null hypothesis.” The null hypothesis states that our sample is drawn from our known probability distribution. Letting x’ = μ + z σ and solving for z, we find the “z-score” of our sample. The z-score tells us how far along the normal distribution our sample x’ lies.
Intuitively, the further x' falls from μ, the less likely it was drawn from our probability distribution. The p-value is the probability of observing a result at least as extreme as x' in our test distribution. More formally, it is the integral of the normal distribution from z to infinity, or twice that for a two-tailed test (where the sign of the difference doesn't matter). Looking at the figure, the fraction of the curve shaded green is the two-tailed p-value for events at least as extreme as x' = μ + 2σ, counting both tails.
A recipe for finding the p-value of a sample x' (a code sketch follows the list):
1. Define your null hypothesis distribution.
2. Find its μ and σ.
3. Find z = (x' - μ)/σ.
4. Compute p = 0.5 * erfc(z/√2), where erfc is the complementary error function, a Gaussian integral (1/√2 ≈ 0.707).
5. Multiply p by 2 if the direction of the difference doesn't matter.
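The whole recipe fits in a few lines of Python; p_value is a name I made up, and math.erfc supplies the Gaussian integral:

from math import erfc, sqrt

def p_value(x, mu, sigma, two_tailed=True):
    z = (x - mu) / sigma               # step 3: the z-score
    p = 0.5 * erfc(abs(z) / sqrt(2))   # step 4: one-tailed Gaussian integral
    return 2 * p if two_tailed else p  # step 5: double it if direction doesn't matter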
Voltorb Flip Algorithm Analysis
The FoM algorithm won 7,083/10,000 games. The simpler R score algorithm won 6,899/10,000 games. Let’s ask whether the R score algorithm is significantly worse than the FoM algorithm. In other words, is the sample x’ = 0.6899 drawn from the distribution of the FoM algorithm with mean μ = 0.7083?
In order to calculate this, we need the standard deviation of the null distribution. We could derive it analytically, or we could simulate playing 10,000 games of Voltorb Flip a thousand times and directly calculate the standard deviation of the win rate. In fact, we can simplify even further: instead of running 10,000 games of Voltorb Flip, for which we know the mean win rate is 70.83%, we can just simulate an unfair coin that lands heads 70.83% of the time.
Here’s my code to do this:
import numpy as np
from tqdm import tqdm
import scipy.stats

def flipBiasedCoin(P):
    # flip a coin with probability P of landing heads
    if np.random.rand() > P:
        return 0
    else:
        return 1

def experiment(P, trials):
    # flip a biased coin `trials` times and return the fraction of heads
    success = np.zeros(trials)
    for i in range(trials):
        success[i] = flipBiasedCoin(P)
    winrate = np.sum(success) / trials
    return winrate

PFoM = 0.7083  # success rate of the FoM algorithm
games = 10000  # this rate was calculated over 10,000 trials

# run 1,000 iterations of that test to construct a probability distribution
trials = 1000
experiments = np.zeros(trials)
for i in tqdm(range(trials)):
    experiments[i] = experiment(PFoM, games)

print('mean win rate: ' + str(np.mean(experiments)))
stdev = np.std(experiments)
print('standard deviation: ' + str(stdev))
The result: our mean success rate is (obviously) very close to 0.7083, and our standard deviation comes out to about 0.0044. Let's check this against the analytical result, which is found by taking the standard deviation of the binomial distribution with success probability μ:
σ = sqrt( N μ (1 - μ) ) = 45.45
We expect a standard deviation of about 45 and a half games for every 10,000 games of Voltorb Flip we play. Since this is the standard deviation in the number of games won, and we want the standard deviation in win rate, we divide by the number of games N = 10,000 and get roughly σ* = σ/N = 0.0045. Our code works! This type of Monte Carlo solution is especially useful for problems without easy analytic solutions.
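For reference, the analytic check is two lines of numpy:

import numpy as np

N, mu = 10000, 0.7083
sigma_games = np.sqrt(N * mu * (1 - mu))  # ≈ 45.45 games out of 10,000
sigma_rate = sigma_games / N              # ≈ 0.0045 in win rate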
We calculate our z-score, and use the scipy.stats module to find a (two-tailed) p-value. In this case, I’m fairly sure it’s fine to use the one-tailed p-value, but let’s be extra conservative and use the two-tailed case.
PR = 0.6899  # success rate of the R score algorithm in 10,000 trials
Z = (PR - PFoM) / stdev  # z-score
print('Z score: ' + str(Z))
p = 2 * scipy.stats.norm.sf(abs(Z))  # two-tailed p-value of the z-score
print('p value: ' + str(p))
The result: Z ≈ -4.2, giving p ≈ 2.9 * 10^-5. In other words, if the two algorithms were equally good, the probability of seeing a gap this large purely by chance would be 0.00288%. We reject the null hypothesis: the Figure of Merit Voltorb Flip algorithm is significantly better than the R score algorithm, with p = 0.0000288. This p-value would satisfy all but the particle physicists, whose standards for a discovery are famously stringent at p < 0.0000003.
There we are! Our first-pass Figure of Merit algorithm is superior to the simpler “remaining reward product” algorithm with very high probability. After this interlude into nasty statistics, it’s time to return to the fun of actually playing Voltorb Flip, by semi-automating it. That can be found in Part 3 of my series. Stay tuned!
An additional note suggested by my statistician brother: instead of using the integral of a Gaussian distribution to calculate the p-value, as I did in this post, it is more correct to use the binomial distribution for this class of problem. This can matter, particularly for very low probability events, because the tails of the binomial distribution are "fatter" than the tails of the Gaussian. For this analysis, I used a two-tailed Gaussian to estimate the p-value and got p = 2.9 * 10^-5, with each tail contributing ~1.5 * 10^-5. If instead we sum a single tail of the binomial distribution, we get approximately p = 3 * 10^-5. Two Gaussian tails are roughly equal to one binomial tail. Neither distribution gives a p-value anywhere close to the significance threshold, so we can firmly reject the null hypothesis using either method.
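If you want to try the binomial version yourself, scipy.stats.binom makes it a one-liner (using the numbers from this post):

import scipy.stats

# one-tailed binomial test: the probability of winning 6,899 or fewer
# of 10,000 games if the true win rate were 0.7083
p = scipy.stats.binom.cdf(6899, 10000, 0.7083)
print(p)  # ≈ 3e-5, in line with the note above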